Hello everyone, my name is Christina Yung, and I've been on the Pan-Cancer Analysis of Whole Genomes project for the last two years or so. This is a large-scale project that you can consider a use case for all the cloud computing basics you've learned over the last day and a half. It's a project where we utilized multiple cloud resources, not just one; we actually used 14 cloud resources to get this project done. I'm going to show you some of the obstacles and challenges we encountered and some of the lessons we learned, because those lessons are very valuable as our data analyses only get bigger and bigger. A lot of resources also came out of this project; whether you are a researcher or a bioinformatician, there's hopefully something you will find useful in what we consider to be legacy datasets and legacy analysis pipelines. At the end of this, you will run one of the workflows, BWA-MEM, which is an alignment workflow. It typically takes days to run, so you will run it on a very small test dataset just to get a flavor of it.

So what is the Pan-Cancer Analysis of Whole Genomes? The project started about two and a half years ago with the goal of collecting 2,000 whole genomes from cancer patients: for each donor, a tumor whole genome and a normal whole genome. You can consider it an extension of what TCGA has done for pan-cancer exomes. We made a call out to the ICGC members and got a very overwhelming response; we ended up collecting over 2,800 donors with whole genomes. This is different from the exomes because now you have data to look at non-coding regions, structural variations, and pathogen insertions. Of course, we are still interested in driver mutations, whether coding or non-coding, and also in driver pathways.

The project is organized with a steering committee of five members from different institutes: Peter Campbell from Sanger, Gad Getz from the Broad, Jan Korbel from EMBL, Lincoln Stein, who is the director of bioinformatics and biocomputing here, and Josh Stuart over at UC Santa Cruz. At the beginning, we also asked researchers to submit abstracts on what they would like to do with this dataset. We received 130 abstracts, and based on these abstracts, 16 working groups were created around different scientific themes. There is also a technical working group, which I'm involved in; we are responsible for aligning all the genomes, because one goal is uniform alignment and uniform variant calling. That eliminates any variation that is due to different processing pipelines.

Looking at the 16 working groups, you see mutation calling at the top; that's very important for this project. We also have groups specializing in regulatory regions and in transcriptomes (the transcriptome is another data type we collect in this project). We also have methylation data at the smallest scale, but it's still important to look at. And of course we look at driver pathways and mutational signatures, and now we can also look at the germline cancer genome across these patients. All in all, it's a very interesting set of scientific questions to probe, and these working groups are now working hard to make discoveries out of this dataset.
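Before diving into the data, a quick preview of that lab exercise: the heart of the BWA-MEM workflow boils down to one alignment command. Here is a minimal sketch of that core step, assuming bwa and samtools are installed and using placeholder file names; the real PCAWG workflow wraps this with lane merging, duplicate marking, and extensive QC.

```python
# Minimal sketch of the core alignment step the BWA-MEM workflow wraps.
# File names are placeholders for a small test dataset.
import shutil
import subprocess

def align_small_test(ref_fa: str, fastq1: str, fastq2: str,
                     out_bam: str, threads: int = 4) -> None:
    """Align a paired-end test sample with bwa mem and sort with samtools."""
    for tool in ("bwa", "samtools"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    bwa = subprocess.Popen(
        ["bwa", "mem", "-t", str(threads), ref_fa, fastq1, fastq2],
        stdout=subprocess.PIPE,
    )
    # Pipe SAM records straight into a coordinate-sorted BAM.
    subprocess.run(
        ["samtools", "sort", "-o", out_bam, "-"],
        stdin=bwa.stdout,
        check=True,
    )
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem failed")

align_small_test("ref.fa", "test_R1.fastq.gz", "test_R2.fastq.gz", "test.sorted.bam")
```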
Now, these 2,800 donors were really collected from multiple projects. As you can see, a large number of projects come from the United States; these are the TCGA projects that have whole genomes. Then we have other projects coming from Europe and from Asia as well. Based on this map we did our planning: how are we going to collect the data? There is so much data that we cannot have just one center centralizing all of it. We need to distribute the data, and this map helps us plan, because obviously you need a server or data center in Europe, in Asia, and also in North America.

This bar chart shows the primary tumor sites this project has data on; each little color is a different project. For example, for pancreas we actually have four different projects contributing: pancreatic endocrine tumors from the Italian group and the Australian group, and pancreatic adenocarcinoma from the Australian group and also from Canada. So it's really a mesh of projects contributing to the different primary sites.

The initial roadmap, as I said, was to collect 2,000 donors. That alone would give us 600 terabytes of unaligned BAMs, which is quite a lot of data to handle. We asked the data owners to submit the data as lane-level BAMs, and it was very important right from the beginning that we establish the metadata that has to go with the data; otherwise they are just anonymous BAMs that are very hard to track. So the owners fill in the metadata and upload the BAMs to a server called GNOS, a GeneTorrent server. Although GNOS had a lot of technical problems, like hung uploads (I know someone is laughing there because they had a bad experience: your upload hangs, it aborts, you have to contact the administrator to help you out), it does have one big advantage: it validates your metadata. If the metadata is not correct or doesn't match, it will not accept your submission. It also checks your checksums, so file sizes and file names are all as expected. Once a submission gets in, it's very useful to have all that metadata associated with the BAMs.

We started the alignment back in August 2014, and it took about four months to align the 2,000 donors. Afterwards we opened a second tranche for people who did not make it into the first tranche but wanted to be part of it; that's how we got an additional 800 donors.

After the alignment phase, we have three core variant calling pipelines: one from Sanger, one from DKFZ/EMBL, and a third from the Broad. Of course, at the time there were questions: why do these three pipelines get to be the core? And why do we need three; why can't we do it with just one? It seemed like the decision was political at the time. But look at the validation. On the right is our path to validation: out of the 2,000 donors, we picked 63. They were picked because they had sufficient DNA for validation, and also because the material transfer agreements allowed a sample from one country to go to another country for sequencing. With these 63 donors we did deep targeted sequencing. It actually took nine months, because it took time to strategically pick the variants for validation; you cannot validate every single one of them, so they had to be chosen carefully. And then a panel had to be designed.
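Going back to the GNOS submission step for a moment: the gatekeeping it performed (metadata present and matching, checksums and file sizes correct) can be sketched like this. The metadata fields here are illustrative assumptions, not the actual GNOS schema.

```python
# Sketch of submission-time validation in the spirit of what GNOS did:
# reject uploads whose metadata, file size, or checksum don't match.
import hashlib
import os

def md5sum(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def validate_submission(bam_path: str, meta: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the BAM is accepted."""
    errors = []
    # Illustrative required fields, not the real GNOS schema.
    required = ("donor_id", "specimen_type", "library_strategy", "md5", "size_bytes")
    for field in required:
        if field not in meta:
            errors.append(f"missing metadata field: {field}")
    if "size_bytes" in meta and os.path.getsize(bam_path) != meta["size_bytes"]:
        errors.append("file size does not match metadata")
    if "md5" in meta and md5sum(bam_path) != meta["md5"]:
        errors.append("md5 checksum mismatch")
    return errors
```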
It took another four months just to order the panel before we could actually sequence in the lab, do the analysis, and get the data back. Thankfully, after all that validation, we found that the three core pipelines do perform much better than the other callers that participated in the validation phase, so the three core pipelines were picked correctly. Even better, when we merge the calls from the three pipelines, we get better accuracy than any one pipeline alone. That's why we merged the calls from the three pipelines into a consensus set that gives us very high accuracy.

As I said, the alignment phase started from 600 terabytes of raw data, and it gave us another 600 terabytes of aligned data, so we had to host this data at multiple sites. One question we had at the time was: if we aligned a sample once at one center and a second time at another center, would we get the same results? We did check, and we do.

To track all this data, we use Elasticsearch indexing. The index is compiled every night, so we know where the data is and what state it is in, whether it has been aligned or not. Before we had Elasticsearch, we had situations where a sample was aligned multiple times, and you don't want that; it's just a waste of your compute.

To do this, we had seven data centers. Over at UCSC, the NCI cluster at the time was hosting CGHub, so all the TCGA data could only be hosted there. We also had a compute center at the University of Chicago, the protected data cloud. And in Europe we had three data centers: EBI, Barcelona, and DKFZ. George has a question about access to these sites: at the OpenStack sites, we basically have our own tenancy; we have accounts, and we can go in there, start VMs, and do any work that we need to do. At other places, like the University of Tokyo and Barcelona (Barcelona is actually an HPC environment), we don't even have accounts, so we could only give the software to the team over there and they ran it for us. They are our cloud shepherds, as we call them. Is this all Illumina data? Yes; it just so happened everyone was using Illumina, so we didn't have any issues with different kinds of data.

To do the alignment, we asked the data owners to upload the data to their local regions, so there was no overlap of the data, and each center was responsible only for aligning the data at its own center. Alignment requires an eight-core machine with 16 GB of RAM for about five days per sample, and we had nearly 6,000 samples to align; that's why it took that many resources and that much time. As I said, it was done within four months. The second tranche came in with 800 donors; that was okay, we knew how to do it all over again, so that was easy.

But when it came to variant calling, it was actually a lot harder. First of all, these pipelines were not simply ready to go. They had been running very well, but at their local institutes, at Sanger and the Broad. For example, the Broad has something called Firehose; this is essentially their orchestrator for running production analyses. They now had to take out the pieces, tear them out of Firehose, and package them to run in another environment like OpenStack. Whenever you try to move pieces of code like that, bugs are created; we're all human. So things were not smooth at all. It takes a long time to port code like that to a different environment.
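On that nightly tracking index: the idea is that a scheduler asks one central index whether a sample is already aligned before queueing it, which prevents the duplicate-alignment waste just described. A minimal sketch against Elasticsearch's REST API follows; the index name, document fields, and the ES 7+ response shape are assumptions.

```python
# Sketch of the nightly tracking-index idea: before scheduling work, ask a
# central Elasticsearch index whether this sample is already aligned.
# Index name and fields are illustrative; response parsing assumes ES 7+.
import requests

ES = "http://tracker.example.org:9200"  # hypothetical tracking cluster

def already_aligned(sample_id: str) -> bool:
    query = {"query": {"bool": {"must": [
        {"term": {"sample_id": sample_id}},
        {"term": {"state": "aligned"}},
    ]}}}
    r = requests.post(f"{ES}/pcawg-tracking/_search", json=query, timeout=30)
    r.raise_for_status()
    return r.json()["hits"]["total"]["value"] > 0

def mark_state(sample_id: str, state: str) -> None:
    # One document per sample keeps state transitions easy to query.
    r = requests.put(f"{ES}/pcawg-tracking/_doc/{sample_id}",
                     json={"sample_id": sample_id, "state": state}, timeout=30)
    r.raise_for_status()

if not already_aligned("SA12345"):
    mark_state("SA12345", "alignment_queued")
```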
These ports all happened at different times. Also, these pipelines are very specific to their own sequencing centers. For example, Sanger had a particular way of naming their samples: their tumors and normals are always named differently. But when they ran on data from another center that happens to use the same name for the tumor and the normal in the BAM header, they ran into issues. This was a challenge for every pipeline here, because each was facing data it had never seen before. The Broad also had pipelines that were completely new, not even published yet, so it was even more challenging to run those in a production environment, where they might have to pull back and change the parameters. The key thing is that more compute was required.

Here is the list of the algorithms, if you're a bioinformatician and are curious about what was done. Each of these pipelines calls SNVs (single nucleotide variants), indels, structural variants, copy number, and germline variants, and each has its own algorithms. But all of them start by downloading the data from GNOS, and eventually they upload the results back to GNOS. This is what George said before: you download the data when you need it, and when you're done, you upload the results back so you can kill your instance without having to save the data elsewhere.

These are the core hours that were needed per donor: 800 for Sanger, 800 for DKFZ/EMBL, and 2,300 for the Broad. These are average numbers based on our runs on Amazon, and I'll tell you that after running in so many environments, Amazon typically gives us the best performance. We don't know exactly why; maybe they just have the latest CPUs and more optimized I/O and network performance. So in any other environment, you should expect a slightly slower runtime.

We also had the challenge, back then, that we didn't really know what these pipelines needed in terms of cores or memory, so there was a bit of trial and error. Like George said, the lesson is to start small: you pick a few different kinds of samples, begin small, and try things out so you get a flavor of what to do next when you scale up.

When we scaled up, we went up to 14 compute resources. We were lucky in the sense that our members volunteered additional compute resources, such as iDASH over at UCSD and the Sanger Institute, and even here at OICR we added additional compute; OICR also has an OpenStack. Very importantly, there was a change in policy. For the longest time we could not put TCGA data in the cloud; then the policy changed, and we were allowed to process TCGA data and also ICGC data in commercial clouds. That's a big change, and that's why we started using Microsoft Azure and Amazon Web Services. That really allowed us to scale up, even though it costs a bit, and to get resources when we needed them.

So where are we today? We have actually finished all our analyses. As you can see, we had that burst of BWA alignment, and then the three core pipelines started at very different times, which is why we had this gradual flow of completions over two years. We did some calculations: if we were to do this all over again, and if we managed to start all three pipelines at the same time right after the alignment, we could possibly get this project done within four months, provided we could get all the compute we needed, say at Amazon.
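Those per-donor numbers make the four-month claim easy to sanity-check. A back-of-the-envelope calculation, using the core-hour figures above and an approximate donor count:

```python
# Back-of-the-envelope from the numbers above: core-hours per donor for the
# three pipelines, scaled to the full cohort, and the steady core count
# needed to finish in about four months. Donor count is approximate.
DONORS = 2800
CORE_HOURS_PER_DONOR = {"Sanger": 800, "DKFZ/EMBL": 800, "Broad": 2300}

total_core_hours = DONORS * sum(CORE_HOURS_PER_DONOR.values())
print(f"total: {total_core_hours / 1e6:.1f} M core-hours")  # ~10.9 M

months = 4
hours = months * 30 * 24
print(f"concurrent cores for a {months}-month run: {total_core_hours / hours:,.0f}")
```

That works out to roughly 11 million core-hours, or on the order of 3,800 cores running continuously for four months; hence the appeal of bursting to commercial clouds.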
So that's a very valuable lesson, because it would be much faster if we did this all over again. Now, just to complete the story: the core analysis is done, but that doesn't mean the project is done, because the data needs to be good. To get good quality data, we put the calls out there to all the researchers and asked them to take a look: where are the problems? What are the problematic donors? You don't expect every donor and every sample to be perfect. After about three months of looking at the data, they found that we had to exclude 6% of the donors. Sometimes it's the obvious reason that we don't have any clinical information. Other times it's because we discovered the tumor itself was contaminated with foreign DNA, or the normal was contaminated with the same donor's tumor DNA. There are also cases where we see cDNA or mouse contamination. This exercise requires looking at a lot of plots visually; based on visualization of structural variants, some artifacts could be discovered. Having 6% blacklisted is pretty normal, I think, for a project of this scale. We had another 2.6% of donors with low-level contamination. We think we managed to rescue them by filtering the contaminated calls, but the filtering is never complete, so we just flag them as ones people have to pay attention to when they do their analysis. Ultimately we ended up with 2,583 donors; some of them have multiple tumors, and half of them have matching RNA-seq data. So this is a good resource for anyone interested in large-scale analysis.

When I talk about filtering of artifacts, there are actually multiple kinds of artifacts that need to be filtered out. There's oxidation. There's PCR and strand bias. We also filtered non-robust mappings: we take each read and BLAT it against the reference, and if it maps to another region with a higher score than the region it was aligned to, that read is out. We also built a panel of normals from this cohort to get rid of some of the false positives, and any SNVs that overlap with germline calls or 1000 Genomes SNPs are filtered out. In a small number of cases we even had calls on chromosome Y in female donors; those were filtered out as well. Other groups have looked at the data and contributed annotations; for example, the mutational signatures group thinks certain things are artifacts. There's also the observation of an enrichment of SNVs near indels; that needs to be flagged, so that if someone finds something interesting in those regions, they're aware those could be artifacts. And we provide other estimates and ratings to help guide people in their analysis.

To finish the story: using the consensus strategy, we managed to get very high precision and sensitivity for our SNV calls. For indels, as you know, indels are always hard to call; if you overlap any two independent indel callers, the overlap is only about 50%. In our case we did manage to get 90% precision, with sensitivity somewhat lower at 60%.
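For concreteness, here is what that consensus strategy looks like in miniature: keep an SNV if at least two of the three core pipelines called it. The real PCAWG merging logic was considerably more sophisticated, with caller-specific filters and separate indel handling, so treat this as a sketch of the voting idea only.

```python
# Minimal sketch of consensus calling: keep an SNV if at least two of the
# three core pipelines report it. Majority voting captures the basic idea.
from collections import Counter

Variant = tuple[str, int, str, str]  # (chrom, pos, ref, alt)

def consensus_snvs(calls_by_pipeline: dict[str, set[Variant]],
                   min_callers: int = 2) -> set[Variant]:
    votes = Counter()
    for calls in calls_by_pipeline.values():
        votes.update(calls)  # each pipeline votes once per variant
    return {v for v, n in votes.items() if n >= min_callers}

sanger = {("chr1", 12345, "C", "T"), ("chr2", 999, "G", "A")}
dkfz   = {("chr1", 12345, "C", "T")}
broad  = {("chr1", 12345, "C", "T"), ("chr3", 42, "A", "G")}
print(consensus_snvs({"sanger": sanger, "dkfz": dkfz, "broad": broad}))
# -> {('chr1', 12345, 'C', 'T')}
```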
Any questions about the project at this point? Okay, then I'll talk about the lessons we have learned from this project. These are more, I would say, technical implementations that we should do differently next time.

During this project, the way it was done, we have 14 environments, but all we have is centralized metadata that we can look at to see whether a donor has gone through alignment or which variant calling pipelines have been run. If a pipeline has not been run, the project manager tells a specific compute center, or cloud shepherd, to run that pipeline on a specific sample. This is where the project manager has to communicate with the cloud shepherd. We used git to do that, actually: we put certain files in git so that the cloud shepherd can go there and know which donors to run through each pipeline. But the monitoring is a bit scattered. To know what's going on, the project manager actually has to ask a cloud shepherd: hey, I haven't seen any completed donors from your compute site over the last couple of days, what's going on? So it was a lot of human intervention. What I'll show you in a couple of slides is a better setup where we automate these things and make it a lot easier, especially when you have multiple clouds to manage.

Now, there is also the debate: we have academic clouds and we have Amazon, so why don't we just stick with Amazon, it's a lot easier? Or let's just go with academic clouds, because they don't cost us anything? Well, I think we need both. With academic clouds, like the Collaboratory for example, there's a low upfront startup cost, so that's a good environment to start small, test, and learn before you scale up. But keep in mind that each environment is going to require at least half a full-time employee as a cloud shepherd to monitor that site. The more sites you have, the more FTEs you're going to need, so there are salary costs as well, not just compute. The other case where we cannot use academic clouds is when one workflow, such as the Broad's, requires a lot of resources: it required 32 cores and 256 GB of RAM, and very few institutes can give us a large number of VMs with that kind of resources. That's when we actually have to go to Amazon and Microsoft Azure.

Okay, so what about running everything on commercial clouds? That's great, because you can start up a couple hundred VMs with only half an FTE monitoring the jobs, so you get a real burst of productivity there. However, there are some tricky things. Certain data cannot go on commercial clouds: specifically, the German cohorts told us their data cannot be processed on any commercial cloud, only on academic clouds. That gave us a challenge as well. (Is that a legal or a political restriction? I think it's a bit of both, because the commercial clouds are all owned by you Americans right now.) Also, if we ran everything on commercial clouds, some of the donors would cost us a lot: they have run for over two months, even three months at a time, and you just wait and don't know when it's going to finish. We were a little nervous about our costs at the time. But now we have learned that we can use BAM statistics to predict which donors are going to be long-running, and then we can save those for academic clouds and not pay for the long run times.
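On that last point, a sketch of the routing idea: use cheap BAM statistics to flag likely long runners and send them to academic clouds, where wall time is free. The linear model and threshold below are purely illustrative assumptions; in practice you would fit them to the runtimes of completed donors.

```python
# Sketch of the routing idea above: guess which donors will run long from
# cheap BAM statistics, and send those to academic clouds.
def predict_runtime_days(bam_size_gb: float, mean_coverage: float) -> float:
    # Hypothetical linear model; fit the coefficients to observed runtimes.
    return 0.05 * bam_size_gb + 0.1 * mean_coverage

def choose_cloud(bam_size_gb: float, mean_coverage: float,
                 cutoff_days: float = 14.0) -> str:
    est = predict_runtime_days(bam_size_gb, mean_coverage)
    return "academic" if est > cutoff_days else "commercial"

print(choose_cloud(bam_size_gb=250, mean_coverage=60))  # long runner -> academic
print(choose_cloud(bam_size_gb=80, mean_coverage=30))   # short -> commercial
```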
As I said, metadata is key. Our projects so far have focused only on whole genomes and RNA-seq, but more and more data types are going to come online, such as bisulfite-seq, and it's definitely important to establish the metadata right up front when you start a large project like this.

Now, during the project, as you can imagine, we experienced a number of outages at the data centers. Like George said, sometimes you try to download from EBI and EBI is not available, and then your instance is just sitting there waiting for that download to happen. So first, I think we need to replicate data to multiple data centers for redundancy. Next, we need a downloader that is smart enough that if one data source doesn't work, it goes to a second data source, then a third; and if all the data sources are unavailable, it shuts that instance down and notifies the cloud shepherd: I'm having trouble downloading from all these data centers. Maybe we can't run anything for a day or two; that has happened as well.

We also had unexpected crashes at some sites, maybe from a power outage. Whatever the reason, we need some way to handle these crashes more gracefully, because some cloud shepherds have spent a good day just cleaning up after a crash so the queue could be cleared and move on to the next job. So the queuing system needs to be simpler to clean up, I hope. And if we could have a queue that spans multiple clouds, then if one cloud doesn't work, jobs get scheduled to another cloud. That would be very nice to have.

As I said, we could be smarter than we were during this project. George has already shown you an example: based on the size of your BAM file, choose your instance type correctly right from the beginning. In this project, because of the way the queuing system was built, we always used the same instance size for every sample, and then a job might fail because it didn't have enough memory or enough disk space. That's something we could avoid by choosing the right VM from the beginning, so some kind of calculator would really help. And we can predict our long runners based on BAM stats; that's important.

Now, as I said, the cost of compute is typically a little lower than the cost of your FTEs, so we try not to get a cloud shepherd involved with every failed job. It would be nice if the queuing system were a little smarter: yes, start off with a good VM, but if the job fails, progressively retry on a larger VM with more resources, maybe try three times, and only if it still fails does the cloud shepherd get hands-on and look at what the problem is. And like George said, we need monitoring, a very good monitoring system that tells you about any problems, so you don't have to look at every single VM to find out whether it's running or not. Sometimes you have 100 VMs running, and it's nice to have a report that alerts you when a VM has been idle for a couple of hours already; we have had VMs sitting idle for days just because no one caught them. The last thing, which may not be as obvious, is to validate our results before submitting the data back to the server. A lot of times the data looked done, but then we had cases where, actually, maybe a chromosome was missing; little things that we have all experienced but forgot to validate before uploading the data back.
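Several items on that wish list (mirror failover, retries with backoff, shutting down and alerting the shepherd when nothing works) fit in one small sketch. The mirror URLs and the alert function are placeholders.

```python
# Sketch of the wished-for smart downloader: try each mirror with simple
# backoff, and if every source fails, give up and alert the cloud shepherd
# instead of leaving the VM waiting. Mirror URLs are placeholders.
import time
import urllib.request

MIRRORS = [
    "https://ebi.example.org/data",
    "https://osdc.example.org/data",
    "https://collab.example.org/data",
]

def notify_shepherd(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real email/pager alert

def download_with_failover(object_id: str, dest: str,
                           tries_per_mirror: int = 2) -> bool:
    for mirror in MIRRORS:
        for attempt in range(1, tries_per_mirror + 1):
            try:
                urllib.request.urlretrieve(f"{mirror}/{object_id}", dest)
                return True
            except OSError as exc:  # covers URLError/HTTPError
                print(f"{mirror} attempt {attempt} failed: {exc}")
                time.sleep(30 * attempt)  # simple backoff
    notify_shepherd(f"all mirrors failed for {object_id}; shutting this instance down")
    return False
```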
So this is what I would like to see as a project manager: one system, an orchestrator, that allows me to queue my jobs to multiple clouds, gives me monitoring across those clouds, and gives me alerts. At some point you can think of the user as maybe even a grad student doing a very large-scale project: they have their DACO and dbGaP approvals, they have a budget from a grant, and they just orchestrate the work wherever there is capacity. Of course at the end they get a bill back, but this is now algorithm pushing rather than pulling data onto their local machines: they push the compute algorithms into the cloud that already has the data local to it. All this data moving should be avoided, and it should be easy for one person to compute across multiple clouds.

So why are these lessons valuable? There are a couple of projects coming up that will involve large-scale analysis like this. There's already a pan-prostate initiative that will involve whole genomes, exomes, and many other data types, with data in the US, Canada, the UK, Germany, and Australia. Again, very distributed. Actually, I forgot to tell you: when the Australian group participated in the pan-cancer project, their bandwidth was absolutely terrible. They couldn't upload the data to any of the servers, so they ended up shipping us hard drives; we received 40 pounds of hard drives and had to upload the data for them. The good news is that there's now an AWS data center in Australia, so hopefully that solves the problem in the future.

Another large-scale project is ICGCmed. This is an extension of ICGC that will include clinical trials and a lot more clinical information. The scale is going to be almost 100 times bigger than pan-cancer, and the plan is to collect the raw data, do uniform alignment, and do uniform variant calling. There will be multiple data processing centers, again typically based on region, doing exactly what we have been doing for PCAWG. But this time the project is going to last ten years, and you're not going to have a project manager who'll be willing to sit there and assign workload on a daily basis. So all the things I've just said need to be really automated, and there needs to be an intelligent system to get the job done. Any questions about the lessons? (Eight years? Getting more aggressive.) All right, sorry; we are just putting in a grant. ICGCmed starts in 2018, so we have some time to put the infrastructure in place and get the project ready.

Okay. So after all this hard work by a lot of people, what resources are available publicly, either now or very soon? First, of course, data; people care a lot about this cancer data. Then methods: we put a lot of effort into perfecting these methods against validation data, so we'll share those too. And there are best practices that we have established in this project, and those need to be shared as well.

First, the data. Where can you find it? If you're not familiar with it already, the ICGC data portal is at dcc.icgc.org. This is a portal where you can find the data you're interested in. On the left there's a faceted search interface where you can pick the projects you're interested in; in this example, these are all liver projects, so liver got picked. It's just like shopping: you narrow down the dataset you're interested in.
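That faceted search can also be scripted instead of clicked, since the portal exposes a REST API alongside the web UI. The endpoint and filter syntax below are assumptions from memory and may not match the current API exactly; check the documentation at docs.icgc.org before relying on them.

```python
# Hedged sketch of scripting the portal's faceted search. The /api/v1
# endpoint and filter JSON shown here are assumptions, not verified API docs.
import json
import requests

PORTAL = "https://dcc.icgc.org/api/v1"
filters = {"donor": {"primarySite": {"is": ["Liver"]}}}  # assumed filter shape

resp = requests.get(
    f"{PORTAL}/donors",
    params={"filters": json.dumps(filters), "size": 10},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit.get("id"), hit.get("primarySite"))
```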
The portal shows the query you have built so far, and from there, if you click one of the file IDs, you go to a page that tells you a lot of details about that file. In this case it's a BAM file, and it is actually available in multiple repositories, which are listed here: you see the PCAWG repos, I think this one is Heidelberg, then Barcelona and London, and the data is in the Collaboratory as well. That gives you a choice of where to download the data from, which we'll get into next.

If you click on BAM statistics, you will see something like this. This is actually a pretty cool tool that Vincent Ferretti's team built to give you real-time statistics of the BAM; it actually streams the BAM. It will tell you the coverage, and you can even zoom in on different chromosomes to get a sense of the coverage in those regions. It gives you all kinds of read information as well, so it's kind of cool. A similar page for VCFs is being developed, where you can visualize your base changes, your variants, along the chromosome.

Going back to the file page: do you need access approval for these statistics? No, you don't, actually; this is just on the website, and the way the security is set up, it will not leak any specific germline information. You can try it; it went through a lot of security checks. The view is per individual, but if a germline base differs from the reference, it gets masked; that's why you won't see any germline information that is identifiable. If you break it, let us know. (I always like breaking things.)

Going back to the file page, there's another thing you can click on, which is the metadata page. It tells you when the file was generated, the kind of specimen, and so on; this is an example of the metadata that we keep track of. Now, all of this is open data, so you can look at it without logging in or having any kind of account. But there is data that requires authentication. If you already have an approved DACO application, you can log into the site and you will see something like a token manager. This is where you can get your token in order to download from the Collaboratory, because a lot of the data is already in the Collaboratory, or to download from AWS. Yesterday, when George asked you to try out the download, we didn't put in any token, because you don't have DACO authorization, and we also asked you to download data that was open access; that's why you did not need the token.

Coming back to the page with your search results: you can download the data very easily by clicking a button called Manifest, and everything you have searched for and selected comes down in a manifest. This is roughly what you will see at the start. (This is a GIF, so it's actually a little video that repeats itself.) When you first generate the manifest, it actually covers over 1.5 petabytes of data if you download all of it, because the files reside in multiple repositories; the same files live in multiple repositories because we have data redundancy. But there is a very smart feature: if you tick the box called Remove Duplicate Files, it will remove the duplicate files for you based on the priority of the repos you choose. That's why you can move these repos up or down.
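The Remove Duplicate Files logic is simple enough to sketch: for each file, keep the copy from the highest-priority repository in the user's chosen order. The manifest fields here are illustrative.

```python
# Sketch of the "Remove Duplicate Files" feature: the same file may live in
# several repositories, so keep one copy per file, chosen by repo priority.
REPO_PRIORITY = ["Collaboratory", "AWS", "EBI", "Barcelona"]  # user-chosen order

manifest = [
    {"file_id": "FI1001", "repo": "EBI"},
    {"file_id": "FI1001", "repo": "Collaboratory"},
    {"file_id": "FI2002", "repo": "AWS"},
    {"file_id": "FI2002", "repo": "Barcelona"},
]

def dedupe(entries: list[dict], priority: list[str]) -> list[dict]:
    rank = {repo: i for i, repo in enumerate(priority)}
    best: dict[str, dict] = {}
    for e in entries:
        cur = best.get(e["file_id"])
        if cur is None or rank[e["repo"]] < rank[cur["repo"]]:
            best[e["file_id"]] = e
    return list(best.values())

for e in dedupe(manifest, REPO_PRIORITY):
    print(e["file_id"], "->", e["repo"])
# FI1001 -> Collaboratory, FI2002 -> AWS
```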
So if there's one repo you're specifically interested in, say the Collaboratory, and you want to use it as your primary repo, move it to the top; then most of the files will come from the Collaboratory, and the files the Collaboratory doesn't have will be downloaded from the next repo. It's a very smart, very useful tool, and you don't have to worry about removing duplicates yourself.

The other thing you don't have to worry about is where your data is coming from. There's another new tool created by Vincent's team called icgc-get. It's a universal downloader: you could be downloading from Amazon, from the Collaboratory, from any of the GNOS servers, even from the GDC, the new Genomic Data Commons hosting all the NIH data. You just run the icgc-get command, tell it to download, and specify the manifest you created previously, and the tool does everything for you. Otherwise, you would have to install the different tools specific to each server: GTDownload, the ICGC Storage Client, and others. icgc-get has a configuration file where you put your tokens and keys in order to handle all of that.

So icgc-get is a universal tool, but the other tool, the one you tried yesterday, is the ICGC Storage Client. This one is slightly different because it's specific to the Collaboratory and AWS, and it has additional features that icgc-get doesn't have. Downloading using a manifest I've told you about already. Then there's something interesting called BAM slicing: instead of downloading the whole BAM and then using samtools to view a region, you can use this tool to fetch just a small region and look at it, either through samtools or, though I haven't tried it, IGV; it would be nice to be able to view small regions in IGV as well. So you don't have to pull a big file down to your computer.

Another nice feature is a FUSE mount. If any of you have worked with FUSE file systems: you can mount a remote repository onto your local computer and it just looks like a local drive, so you can run ls, du, all these commands. By giving it a manifest, you can mount all the files you're interested in onto your own machine and explore them, to check whether they are indeed what you're interested in. That's very powerful. Everything I've just told you is covered in more detail in the user guide at docs.icgc.org; it's very easy to follow the commands there to download the data.

Yes, Francis? Yes, so as part of this project we created mini-BAMs that extract the reads flanking the variants, so SNVs, indels, and structural variants; only those relevant reads are pulled out. The mini-BAMs will be available through the portal eventually. For the entire project, the raw plus aligned BAMs are 800 terabytes in total, and the mini-BAMs come down to only four terabytes, so half a percent of the original.
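Going back to BAM slicing for a second: both slicing and the mini-BAMs rest on the same operation, extracting only the reads in a region of interest. Locally, with an indexed BAM, the idea looks like this with pysam (file names and region are placeholders); the Storage Client does the equivalent server-side, so the full BAM never has to leave the repository.

```python
# Local illustration of BAM slicing with pysam: fetch only the reads in a
# small region rather than the whole file. The BAM needs a .bai index.
import pysam

def slice_bam(in_bam: str, out_bam: str, chrom: str, start: int, end: int) -> int:
    src = pysam.AlignmentFile(in_bam, "rb")
    out = pysam.AlignmentFile(out_bam, "wb", template=src)  # reuse the header
    n = 0
    for read in src.fetch(chrom, start, end):
        out.write(read)
        n += 1
    out.close()
    src.close()
    return n

# Placeholder file and a region around TP53 (hg19 coordinates, roughly).
n = slice_bam("donor.bam", "slice.bam", "chr17", 7_565_000, 7_590_000)
print(f"wrote {n} reads")
```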
Now, just the last couple of slides: I wanted to talk about the workflows we have. When we started the project we didn't use Docker; back in 2014 it wasn't quite as hot. Gradually we shifted to Docker, and we're now making all our workflows available as Docker images, so that things are reproducible and any other researchers can use the same algorithms. They're registered at dockstore.org. As you saw when you ran the commands yesterday, it's a lot easier to have just a couple of commands and describe the workflow in Common Workflow Language. We're going to run one of them in the lab: so far BWA-MEM is available through Dockstore, and the Sanger workflow and the DKFZ/EMBL workflow are all on Dockstore now. We are going to get the Broad pipeline in there as well; that one is a bit more complicated because there are a lot more components. The other thing we're working on is all the filtering methods that we used. It's easy to ignore the filtering methods, but as someone said, that's actually the secret sauce for getting really good data. A lot of it came from the Broad team, and for the longest time they kept it a secret; now they're sharing it with the community, which is a very good thing.

Lastly, there are some best practices that have been established by this project. Even just looking at your data to determine whether it's good or bad is an art, so we will put together a rogues' gallery of sequencing artifacts: plots, rainfall plots, IGV screenshots, anything that helped us identify problematic samples. We want to share those examples with the community. And like I said, the code for filtering out the artifacts will also be made available. (Is that politically correct? Okay.)

So the lecture part is actually over; we'll come back and do the lab after lunch. Any questions?

[Do you see center-specific effects?] So definitely, we have library effects. One we famously call the Sanger effect: Sanger knew that their library preparation enriches certain mutation contexts, so they filtered those out, while other centers didn't, and we actually had to do that kind of filtering afterwards. As we go, we see other effects that are specific to centers, but we haven't done a really systematic analysis by sequencing center; it's definitely on our list.

[What is the oxidation artifact?] During library preparation it's very possible for oxidation to occur, and then, I can't remember the exact mutation context, but it will be very dominant; if you look at a Lego plot it's very dominant. What removes that artifact is the OxoG algorithm developed by the Broad team; it's publicly available, and we applied it to remove the artifacts. Since then, the Broad team has also written an experimental protocol that adds certain reagents into the library prep to avoid that kind of oxidation.

[Is the data anonymized?] It's pretty much anonymized, in the sense that for the clinical data we asked for specific terms that are not identifiable. We didn't ask for birthdays, for example; we asked for age at diagnosis. And we don't ask for postal codes. We piggybacked on what ICGC has established, and that really helped. The metadata is indexed into Elasticsearch, pretty much in a JSON format, so it's possible to run reports and queries on it.
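On that oxidation question: the reason the artifact jumps out of a Lego plot is that it piles up in particular trinucleotide contexts of C>A / G>T transversions. A toy tally over a VCF, not the Broad's actual OxoG filter, might look like this; file names are placeholders and pysam is assumed.

```python
# Toy sketch of the kind of check behind a Lego plot: tally the trinucleotide
# context of C>A / G>T transversions, where the oxidative artifact shows up
# as a dominant spike. Not the Broad's OxoG algorithm itself.
import pysam

def context_counts(vcf_path: str, ref_fasta: str) -> dict[str, int]:
    ref = pysam.FastaFile(ref_fasta)
    counts: dict[str, int] = {}
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref_base, alt = line.rstrip().split("\t")[:5]
            if (ref_base, alt) not in {("C", "A"), ("G", "T")}:
                continue  # only the transversions where the artifact appears
            p = int(pos) - 1  # VCF is 1-based, fetch is 0-based
            tri = ref.fetch(chrom, p - 1, p + 2).upper()  # base plus neighbors
            key = f"{tri}:{ref_base}>{alt}"
            counts[key] = counts.get(key, 0) + 1
    return counts

top = sorted(context_counts("snvs.vcf", "ref.fa").items(), key=lambda kv: -kv[1])
for ctx, n in top[:5]:
    print(ctx, n)  # a dominant context here suggests an artifact
```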
[How did you get permission to put the data in the cloud?] First off, there was a change in policy by TCGA, by dbGaP, and also by ICGC, saying: we allow our members to put the data in the cloud. But for the ICGC portion we actually had to go back to each data owner and ask whether they would allow us to put their data on Amazon. It was actually a very, very lengthy process, because we made the request and they came back with a lot of questions: what are the implications, how safe is AWS, does it abide by these rules? Sometimes they asked for ISO certification. The lawyers had questions. They had to go through their REB again to make sure their patients had consented to this kind of environment for their data. It was a very lengthy process, and it really took people like Lincoln Stein, Bartha Knoppers, and Tom Hudson to negotiate with them. Eventually most of them came through. There were still projects that did not come through: China; a breast cancer project that, I believe, spans multiple countries in Europe, so it was a little harder to get consent; and the Germans. Those are the ones that did not consent to putting their data in the cloud.

One more thing to point out is that OICR is considered the custodian of the ICGC data, so we have a responsibility, and we carry cyber insurance. To get that cyber insurance, we had a threat risk assessment of our system, which involves a third party coming in to test the system and try to break it; penetration testing, as they call it. We passed with flying colors. All right. Thanks, everyone.