So before I head to the data portal, can everybody hear me? Yeah. So before I head to the ICGC data portal, as Francis mentioned earlier, I'll give you a few slides to introduce how the data is processed and how the data is submitted to us at the DCC. The screen's cut off a little bit, but here it shows the opening of the submission. Basically it goes by cycles: every year there are three to four submission sessions. The members of the ICGC submit their data to the DCC through a submission system. It starts with a few projects submitting data, which is then verified and validated on the server side, and errors are reported. They then correct the errors, and when the data is all good and of high quality, they sign off. Then the data processing, data annotation and ETL start, and I'll give you a little more detail about the annotation we do. Then we close the submission and release the data portal for the end users, for cancer researchers, to browse through the portal and search for data of interest. The data coming to us is analyzed, high-level data: simple somatic mutations, germline mutations, copy number mutations, structural variations, DNA methylation, gene expression, protein expression, microRNA, exon junction. Those data are analyzed from the raw data, primarily sequencing data. The raw sequence data is submitted to EGA, as Francis mentioned earlier, and for TCGA it is submitted to CGHub in the form of FASTQ or BAM files. This screenshot shows you the submission system open to the submitters. We have a status showing how many files are valid or invalid; errors are reported as they submit, and then they can correct the errors and sign off. And this slide is the most important slide, the one that connects to the demo I'm giving you. In order to add value to the data we have, basically we provide annotation.
So data submitters submit analyzed data, primarily mutations. We merge all the mutations based on the genomic coordinates and the mutation types, and then we calculate the frequency of each individual mutation across all the projects and all the donors. We also compute the consequences of mutations: if a mutation happens to affect proteins, for example through amino acid changes, we annotate those for all the transcripts, all the transcript forms of all the genes. We could have frameshifts, nonsense-mediated decay, all these different kinds. Then there's another group of software. Here we picked one software to predict mutation functional impact. We categorize mutations as high impact, low impact, and unknown using software that was actually developed by a UK group. We also add annotations about gene sets. Those are groups of genes annotated with GO terms or pathways, so genes participating in different pathways. There's also a list of known cancer genes based on literature, curated by the Sanger Institute, called the Cancer Gene Census. Right now it's over 500 genes. So we annotate those as well. And finally, when these annotations are completed, we run the ETL, integrate all the gene annotations and mutation data together and transform them into the index, I mean the highly integrated index, which then allows integrated search. So that's the introduction. We have a few slides showing you how the website is organized. I'm not going through these because it's all in your handout, so you will be able to use that as a reference. Now I'm gonna just jump into the demo. What I'm gonna do is go through the site, walk you through different pages, and click through different links. And then at the end, I'll try to demonstrate some useful use cases, if I have time to finish all of them. If not, you can exercise those on your own, because I cover most of the functionality there. So the homepage, dcc.icgc.org.
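As a rough illustration of the merge-and-count step described in the introduction, the sketch below groups mutation calls by genomic coordinates and mutation type and counts the distinct donors carrying each one. It is a minimal sketch, not the DCC's actual ETL code: the function name, the tuple layout, and the sample donor IDs and coordinates are all hypothetical, and including the base change in the merge key is an assumption.

```python
from collections import defaultdict


def mutation_frequencies(calls):
    """Merge mutation calls and count affected donors.

    `calls` is an iterable of (donor_id, chromosome, start, end,
    mutation_type, change) tuples. This layout is hypothetical.
    Returns {merge_key: (donor_count, fraction_of_all_donors)}.
    """
    donors_by_mutation = defaultdict(set)
    all_donors = set()
    for donor_id, chromosome, start, end, mutation_type, change in calls:
        # Calls with the same coordinates, type and change are one mutation.
        key = (chromosome, start, end, mutation_type, change)
        donors_by_mutation[key].add(donor_id)
        all_donors.add(donor_id)
    total = len(all_donors)
    return {key: (len(donors), len(donors) / total)
            for key, donors in donors_by_mutation.items()}


# Hypothetical calls: each mutation is carried by two of three donors.
calls = [
    ("donor_A", "17", 1000, 1000, "single base substitution", "C>T"),
    ("donor_B", "17", 1000, 1000, "single base substitution", "C>T"),
    ("donor_B", "10", 2000, 2001, "deletion", "AG>-"),
    ("donor_C", "10", 2000, 2001, "deletion", "AG>-"),
    ("donor_C", "10", 2000, 2001, "deletion", "AG>-"),  # duplicate call, counted once
]
frequencies = mutation_frequencies(calls)
```

Using a set of donor IDs per mutation means a donor is counted once even if the same call is submitted twice.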
So the homepage gives you the access... Okay, I need to adjust the window a little bit. Thank you, Michelle. So the homepage gives you an overview of the data types we have and... Okay, so he's gonna do a demo. Rather than clicking through to follow along, just pay attention, because I want you to be able to pull out from his demonstration the organization between gene, project, mutation pages, et cetera. We are gonna spend two and a half hours doing exercises, I think, in the afternoon. If you don't capture how it's organized, it'll be a bit muddled for you in the afternoon. So try and focus on how the portal is organized so that we can apply that in the afternoon, okay? That's a great suggestion. Thank you, Michelle. So as you can see here, it gives you an overview of the cancer projects, the donors we have, the mutations we have, and the mutated genes we have. And then there are three or four major functional areas in the portal. There's a big search box here. You can type in any keyword of interest, usually a gene, but it can be a pathway name, it can be anything you can think of. You get an interactive response telling you what data we have regarding those keywords. So I'm gonna step into the cancer project page. This one lists all the cancer projects participating in the ICGC consortium. The page layout is a little different than usual because of the resolution, I think. That's better. So basically this is an interface you will see very often in our data portal, with two major panels. The left one is what we call a faceted search. These are facets. They basically perform two functions. One is that you can click on them and filter down the results on the right panel. The right panel shows the results of your search. So it's an interactive interface. In this case I'm gonna click on, let's say, brain cancer. As you can see, the right side changes accordingly.
So what it's showing here is that we have the brain tumors, and we have three projects that study brain tumors. On this graph we show the top 20 mutated genes with high functional impact. Again, this is based on the annotation I described earlier. And then this plot shows you... let me reset here. You can reset the filter by clicking on this icon here; it just removes all the filtering. So this one is a mutation frequency summary. As you can see, each project is one column here, and each dot is one patient, one donor. Some projects spread out more and some are tighter in terms of the distribution of mutations per megabase in the genome. Not surprisingly, the higher mutation frequencies are in melanoma, skin cancer, because of sun exposure, and also lung cancer, because of smoking. At the low end are pediatric cancers, brain tumors. So those are high-level indications of the mutation frequency. In the detail page we have the listing of all the projects and the counts of data availability, so how many donors we have for each kind of mutation. Let's just click on one of the projects and follow through. So this is a brain tumor project from TCGA. This is what we call a project entity page. We have some other entity pages, as Michelle mentioned earlier: mutation page, gene page, project page, donor page. We'll go through those pages. For those entity pages we have multiple sections. The left side allows you to navigate through the different sections. The first one is a summary. It gives you some basic statistics of the project: what cancer type, how many donors we have with the different data types, and what kinds of experiments have been performed for these patients. And we have links to raw data repositories. As mentioned earlier, EGA and CGHub are the primary ones. So for one project we have mutated genes.
So this one shows you how the genes are affected in this project. You can also limit the mutations to high consequences only, meaning those that are more functionally damaging to the protein. Once you click on this high impact mutation filter, you get filtered counts as well. All of these are dynamic, interactive. As you can see here, this shows that the PTEN gene is the most affected gene in this project, meaning that it has the most mutated donors. In this case that's 75 out of 268 patients in this project. As you can see, 268 donors have simple somatic mutations. That's the total number of donors we have data for in this project. And out of those you have 75 donors with a high impact mutation on this gene. Although this table is about one particular project, it can also show information about other projects. Basically it tells you, besides this project, what happens to that gene in other projects. As you can see, GBM is ranked number two here, and this other project actually has a higher frequency of mutation on this gene compared to this project. A similar idea applies to the frequent mutations, so you have the mutations observed in this project listed. In this case the top one has 12 donors out of 268 total donors. For the mutations we show you the genomic location, what the base changes are, and what the amino acid changes are. In this case it shows a missense. The colors indicate high impact or low impact. Red is high impact. Each of the mutations is red here because I selected high impact. The last section is the donor section; these are the donors with high impact mutations found in this project. So I'm gonna click on one of the mutated genes. The second highest one is TP53. If you go to the gene page, again there are different sections. You have a summary, which tells you what gene this is, and external links to other databases.
And then cancer distribution. Once again you can select high impact only. Personally I found it useful to restrict the mutation type to high impact, because there are many passenger mutations that don't actually provide meaningful information. They're more of a noise. Once you click on that, all the tables and all the graphs are filtered. For genes we also have the annotation from pathways, so this is a list of all the pathways this gene is involved in and the GO terms that apply to this gene. And then cancer distribution again, you just see it the other way around. I showed you one project and all the genes mutated in this project, ranked by frequency. Now this is the other way around: you have a gene and you want to compare the frequency among different projects. Yes, how do you...? Oh, okay, okay, okay. Yeah, okay, I'll give you a little bit of detail. There's many, many software to do this. The one we chose is based on a hidden Markov model. I don't remember the full name, but it's developed by a UK group. The idea is that they use some known data from the literature to train the algorithm, knowing that some of the mutations are known to be cancer causing, meaning they are more consequential. SIFT? Oh, no, it's FATHMM. There are many, many; there's SIFT and there's MutationAssessor. We are actually trying to run more and then combine the results. Basically the annotation precalculates them and then assigns the category to the mutation. Most of these work on single-nucleotide mutations. They don't work on insertions or deletions, and they also don't work on mutations that do not affect proteins. For a frameshift, for example, we automatically assign high impact. So the algorithm is not perfect, but it's more of an added value there. If you're more interested, we can chat offline a little bit on that.
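The categorization rules mentioned in this answer can be sketched as a small function. This is only an illustration under stated assumptions: the consequence labels, the `predictor_score` argument, and the 0.5 threshold are hypothetical, and the convention that a higher score means more damaging is an assumption made for this sketch rather than the actual scale of FATHMM or any other predictor.

```python
def assign_impact(consequence, predictor_score=None, damaging_threshold=0.5):
    """Return 'High', 'Low' or 'Unknown' for one mutation consequence.

    `predictor_score` stands in for a precalculated functional-impact
    score; its scale and the threshold are hypothetical.
    """
    # Per the talk, frameshifts are assigned high impact automatically.
    if consequence == "frameshift":
        return "High"
    # Predictors of this kind mostly cover single-nucleotide changes that
    # alter the protein, so only missense calls use the score here.
    if consequence == "missense" and predictor_score is not None:
        return "High" if predictor_score >= damaging_threshold else "Low"
    # No score available, or the mutation does not affect the protein.
    return "Unknown"


# Hypothetical inputs covering the three categories.
examples = [
    assign_impact("frameshift"),
    assign_impact("missense", predictor_score=0.9),
    assign_impact("missense", predictor_score=0.1),
    assign_impact("intron"),
]
```

Because the categories are precalculated at annotation time, the portal only has to filter on the stored label rather than rerun a predictor for each search.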
So again, this lists the cancer projects with mutations in this gene. As you can see, some of the projects have a really high frequency of high-impact mutations. In this case, this ovarian cancer project has 81 out of 93 donors, 87%, with mutations on this gene. It's a very famous cancer gene. We also have a diagram showing the protein domains aligned with these mutations, so that you can see on which protein domain each mutation occurs. The frequency is highlighted here as well. Each of these lollipops is one mutation. This one is the highest one: 101 donors have this mutation. Colors indicate high impact. In this case most of them are high impact because this is a well-known cancer gene. Down here we have a genome browser embedded as well. It gives you a little more detailed context for the mutations of this gene, the exons and the translated regions. It also plots a histogram of the mutations. Down here the mutations with the highest frequency within this gene are listed. As you can see, this is the same one we saw in the plot with the highest frequency, 101 donors. So I'm gonna click on this mutation. You can see more detail of this mutation. Yes? So for sure, all the mutations, all the data, are based on one and the same genome reference. In this case hg19, which is the build earlier than the current one. We do not have a plan to move to the new one right away, but it should happen at some point. Then all the mutations, all the annotations, would be based on the new build. Does that answer your question? Okay. So the mutation page is the same way. We have organized our pages in a consistent way for easier navigation. You can see we give a mutation ID. This ID is stable across all the releases. Even if we change the genome build in the future, we will still have the same mutation ID. So this is our goal.
So you can cite this mutation ID in your paper, or wherever you link to us. Again, there's basic information about this mutation; it is categorized as high impact. And here's the consequence table. It shows you how the mutation affects all the transcripts, the splice forms. In this case there are almost 20 splice forms. It tells you about the amino acid change and the location of these changes. And once again, a similar idea as the gene page: we list all the projects in which this mutation was observed, sorted by frequency. These diagrams are similar to the gene page. Now, you see a lot of counts in this portal. Every count is actually clickable. If you want to know, say, what this 15 is, what this 80 is, you can click it. In this case I'm gonna just click on this 101 donors. So any number, actually, you can click. When you click it, it gives you another page. We call it the advanced search page, which is the most flexible, most advanced interface, allowing you to perform all kinds of integrated searches for data of interest. In this case I clicked on the 101 in the mutation page, so this shows you who those 101 donors are. This page is similar to the page I showed you earlier; it's just a little bit more than the project list page. You have the left panel, and now we have three tabs. The layout is changing a little bit. So you have donors, genes and mutations. Similarly, on the right side you have the donor, gene and mutation tabs with the results. The left panel again tells you that we have 101 donors, and this is a great tool that allows you to see how these donors break down into different types. For example, of these 101 donors, 23 are [unclear] patients, 18 are breast cancer patients, and you can see how this breaks down by project as well. Same idea for how many male, how many female. These are actually interactive.
You can actually apply this filter. For example, I'm interested in these donors, but only in the donors that have microRNA sequencing data. You click on that, and all the numbers go down. Then you say, I'm only interested in breast cancer. These conditions are combined together, and you narrow down the results. Now I have 18 donors. So this is the way it allows you to narrow down to your interest. And before you actually narrow down, it tells you how many there are, for example, alive. If I click on deceased, I get two there. So it keeps narrowing down and guides you through the underlying data, the content of your filter. One feature I want to demo is that when you find the donors of interest, you may want to download the data. There's a button here, download donor data. When you click on that, it's for the 18 donors of interest. You can download the data submitted to ICGC. This pop-up window allows you to choose which data types you're interested in. For example, I'm interested in the clinical data, so I'm going to select that, and then the clinical data for 18 donors will be downloaded. It's a process running on the server side. I will let it run there. In a minute or two the file will be provided, and you can follow the link to download it. Meanwhile, I'm going to try some more filters on the advanced search page. In this case I'm going to select brain cancer. For available data, let's say, microRNA sequencing data. And for genes, I want the genes to have Reactome pathways, so genes with a Reactome pathway only. When you select these, all the pages, all the results are refreshed, which basically means that when you constrain genes, the mutations will be only the ones that meet these criteria as well. So these are all interconnected. And for the mutations, I want to search for frameshift mutations.
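The way facet selections combine can be sketched in a few lines. This is a minimal in-memory illustration, not how the portal is implemented: the donor records, field names, and values are hypothetical. Conditions on different fields are combined with AND, and the counts shown beside each facet are recomputed over whatever donors remain.

```python
from collections import Counter


def _values(donor, field):
    """Return a donor's values for a field as a list (fields may be multi-valued)."""
    value = donor.get(field)
    if isinstance(value, (set, frozenset, list, tuple)):
        return list(value)
    return [value]


def apply_filters(donors, filters):
    """Keep donors matching every field in `filters` (AND across fields).

    `filters` maps a field name to the set of accepted values.
    """
    return [
        donor for donor in donors
        if all(any(v in allowed for v in _values(donor, field))
               for field, allowed in filters.items())
    ]


def facet_counts(donors, field):
    """Count donors per value of one field, as shown beside each facet."""
    counts = Counter()
    for donor in donors:
        counts.update(_values(donor, field))
    return counts


# Hypothetical donor records.
donors = [
    {"id": "D1", "primary_site": "Breast", "gender": "female",
     "available_data": {"SSM", "miRNA-Seq"}},
    {"id": "D2", "primary_site": "Breast", "gender": "female",
     "available_data": {"SSM"}},
    {"id": "D3", "primary_site": "Brain", "gender": "male",
     "available_data": {"SSM", "miRNA-Seq"}},
]

with_mirna = apply_filters(donors, {"available_data": {"miRNA-Seq"}})
site_counts = facet_counts(with_mirna, "primary_site")
breast_with_mirna = apply_filters(
    donors, {"available_data": {"miRNA-Seq"}, "primary_site": {"Breast"}}
)
```

Clicking a count, as described later in the talk, corresponds to adding one more entry to `filters` and rerunning the same query.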
And as you can see, the counts all go down when you apply additional filters. You can examine the results in the gene, donor and mutation tabs. One more feature I want to highlight is that when you identify interesting genes or mutations, you can open them in the genome browser. The one I showed you was embedded in the project page and gene page; this one is interactive, based on your search criteria. So these are the mutations you selected, and these are the genes you selected. You can go through these mutations and see the details in the context of the genome: where they are, what the base changes are, and the detailed information about them in the genome browser. You can flip through genes and mutations, and click on different genes as well. There's more functionality here; I'm not going into too much detail about that. But let's go back to the advanced search page. There's a link here that gives you a chance to go back to the advanced search page. So I have 15 minutes, okay. Let's check whether the data is ready. Okay, the data is ready. So basically you can click on the link, download it, and save the file. So that's the advanced search page and the genome browser, the interactive search there. Even within the advanced search page, all the counts are clickable. If you're asking, oh, what are these 15 donors for TP53 under the current search criteria, you can click it. When you click it, it stays on the advanced search page and shows you which 15 those are. Basically, it keeps all the current search criteria and just adds one more condition, which is the gene you clicked. So it's a very powerful search tool. Finally, say there's a donor I'm interested in. I can click on the donor ID and I get to the donor page. This one again has a summary and sections: available data, mutations for this donor, excuse me.
So specimens for this donor: in this case there's a tumor and there's a normal. For the tumor, we have a link to an external resource that shows the slides of the tissue, the pathology slides. That's another tool. So again, let me go back to the mutations: 45 mutations for this donor, and clicking on that number takes me back to the advanced search page. So whenever you see a number and you click it, it gets you to the advanced search page. And the advanced search page always lists all the search criteria you have. You can reset them, and you can share a link to them. I have not talked about that, but this share button is really just saying: your search criteria are in the URL, but you don't want to send this full URL, so you can shorten the URL and share it very easily. In your study material there are lots of shortened URLs, which means that you can click them, you can share them with your colleagues, and your search criteria will all be kept there. So for this one particular donor, I have one more case to demo, which is the raw data in external repositories. As Francis mentioned, the data sent to us is analyzed data, and the raw data is actually not hosted at ICGC. This is what the external repository link is for. So we click on that, and then we see one particular donor, and these are the data available through other repositories. In this case we have TCGA clinical data, and we have CGHub with all different kinds of raw data. In this case we are interested in microRNA sequencing data, and it's the same idea as before. We can filter down by clicking on the facets, and now you have two left; there are two BAMs. The raw data download mechanism is different. You need to download the manifest first, and the manifest contains the information you need in order to get the data, which actually uses another tool. You need a client-side tool; it's not an interactive download.
So this is a tool that we don't have the time to cover, but the manifest carries the information you can pass to that other tool. Using the client-side tool, you can download the raw data. So that's what I have for this part of the demo. There are three use cases I'll go through if I have time; if not all of them, I can get through one or two. So: find common mutated genes with high impact mutations in ovarian cancer and prostate cancer. These are the two. Let's go to the advanced search page. Now we have no condition applied, so let's say ovarian cancer. So find ovarian cancer. Those are 677 donors, and we want high impact mutations. Yeah, high impact mutation, so high. Now this is the donor tab. We want to find the mutated genes, the common genes. So these are the genes, 2,400 genes. And then we save this gene set. This is another functionality. So these are genes from ovarian cancer. I'll just use a short name. Actually, I saved some before the demo. All the sets are saved here, the ones I saved earlier, so I will just name this one a bit differently. So this is the name of this gene set. Let's move on to the next one, which is prostate. Let's just pick up another gene list. Go to prostate cancer here. Same idea with the high impact mutation. And then we go to genes. This time we have 3,000 genes. And save this gene set. So genes, prostate cancer. Save. When you save, it actually goes to this page. So basically we have covered the browsing of the data; we covered all the major parts. The one I'm focusing on now is this analysis part. As you can see, the gene sets I just saved are here. The use case is to find the common genes between these two. What are the common genes between these two sets? We call them sets, gene sets. To do that, right now we have three different analysis components, and there are plans to add more. You can get some idea of how this works.
So basically each of the analyses takes one or more lists or sets, which you built earlier somewhere else, as I just did with two sets. In our case we do a set operation, meaning that we want to do the intersection. So now I choose to do a set operation, and then I choose the two. I want to find out what the common gene set is, so I choose these two, and then I say run. As you can see, this shows a Venn diagram with set 1 and set 2, ovarian and prostate. This circle is S1, that's the total, 2,400, and that's the total for the other one, 3,100. And the common part is just this one, 852. From there you can save again. Basically you pick the pieces of interest. Not only the common ones: you can ask for genes existing only on this side and not the other side, or only in one particular set and not in common. So it's free, it's up to you how you want to manipulate that. You can download a union of everything. So it's quite flexible. And when you have three sets, it becomes more fun as well. In this case we want these 852 genes. Once you have that you can save again. So I can save: common genes, and this is ovarian and prostate. So I save. When I save, it actually goes back to the saved sets, which is here. So that is the first demonstration. Okay, this has saved the set; now share it with a colleague. So how do I share? To share it, okay, I actually clicked on that already. You click it, and you can see it in the advanced search page. You can see all these genes, exactly what these 852 are. Then you can click on share, and you get a URL, and you send this URL to the colleague. Then they can paste it in another browser; in my case I paste it in another tab. You'll see exactly the same gene list there. So that's the first use case. The second one.
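The set operations behind the Venn diagram in the first use case map directly onto Python's built-in `set` type. The sketch below is only an illustration: TP53 and PTEN are gene names mentioned in the talk, but their membership in these two sets and the `GENE_*` placeholders are hypothetical.

```python
# Hypothetical saved gene sets (memberships are illustrative only).
ovarian_genes = {"TP53", "PTEN", "GENE_A", "GENE_B"}
prostate_genes = {"TP53", "PTEN", "GENE_C"}

common = ovarian_genes & prostate_genes          # intersection: in both sets
ovarian_only = ovarian_genes - prostate_genes    # only in the first set
prostate_only = prostate_genes - ovarian_genes   # only in the second set
either = ovarian_genes | prostate_genes          # union of everything

# Inclusion-exclusion ties the Venn diagram regions together.
union_size = len(ovarian_genes) + len(prostate_genes) - len(common)
```

If the rounded totals quoted in the talk were exact (2,400 and 3,100 genes with 852 in common), the same identity would give a union of 2,400 + 3,100 - 852 = 4,648 genes.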
So a colleague shared some interesting mutations with you, same idea, by giving you a URL, and you're wondering: what are the genes affected by these mutations? Are any of these genes over-represented in any pathways? And how do I see the enriched pathways? So you paste this URL and you get all the mutations, 458 mutations. Then your question is about the genes. What are the genes affected by these mutations? In this case it tells you right away: that's 41. Now you want to see whether these genes are over-represented in pathways. You just launch the enrichment analysis here. Irina in the next section will tell you more about how this works and what over-representation means. In this case, I will launch it. I just choose, yeah, some default settings, and then I just say analyze. As you can see, these are the enrichment results. Briefly, and again Irina will tell you more about these details and what they mean: we have 51 genes in the input, and 21 of them have been annotated with Reactome pathways, and Reactome pathways have in total 8,000 genes. And the genes, overall genes with the... what is this number? So overall there are 1,000 gene sets in the universe. In this case, it's Reactome. So for each enriched pathway, basically the idea is this: you randomly pick genes and then ask whether there's any bias toward those genes. In this case we would expect to see a very low number of genes from these pathways, but in fact we get 18 of them. So basically that's really unlikely by chance. We perform certain statistics and then give you the adjusted p-values. In this case it's so low that it's almost zero. So you have all the numbers, and the hits are sorted by p-value significance.
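The talk does not name the statistic behind the enrichment result, so the following is a sketch of one common choice, not necessarily the portal's method: a hypergeometric over-representation test followed by Benjamini-Hochberg adjustment. The universe of 8,000 genes, the 21 annotated input genes, and the overlap of 18 come from the talk; the pathway size of 200 and the extra p-values are hypothetical.

```python
from math import comb


def hypergeom_pvalue(universe_size, pathway_size, input_size, overlap):
    """P(X >= overlap) when input_size genes are drawn at random,
    without replacement, from a universe containing pathway_size
    pathway genes."""
    upper = min(pathway_size, input_size)
    favourable = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, input_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe_size, input_size)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Talk figures: 8,000 genes in the universe, 21 annotated input genes,
# 18 of them in the pathway. The pathway size of 200 is hypothetical.
expected_overlap = 21 * 200 / 8000
p_value = hypergeom_pvalue(8000, 200, 21, 18)

# Hypothetical p-values for two other pathways, adjusted together.
adjusted = benjamini_hochberg([p_value, 0.04, 0.03])
```

Under these assumptions only about half a gene would be expected to overlap by chance, so an overlap of 18 yields a p-value that is effectively zero, which matches the speaker's description.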
So you can open up these pathways in the Reactome browser view to see where the... so basically some of these pathways are high-level pathways. That's why you have these sub-pathways inside. Some of them are individual pathways, as you can see. You can zoom in and see. These are proteins and these are protein complexes, and it highlights how many mutations there are in these genes, and yeah, you can browse through all these things. And you can come back to the enrichment results as well. So I don't think I have time for the third one. It's probably a task you can do in the lab session. Any questions?