Okay, everyone, good morning again. Today we're going to talk about somatic mutations in cancer. The learning objectives are listed here, and hopefully you had a chance to read them, but basically we're going to look at drivers versus passengers and oncogenic versus loss-of-function mutations. We'll talk a little bit about mutation classes, rates, and signatures, and we'll look at some examples of somatic mutations that have clinical relevance. Then we'll move on to statistical considerations for modeling allelic distributions from next-gen sequencing data, and we'll consider some analytic approaches to mutation characterization. We'll go over some of the tools that you'll be using today in the lab, and we'll talk in particular about functional annotation using dbSNP and COSMIC, and about purity and ploidy and their effect on the observed variant allele frequencies. And then finally we'll end with how one might interpret mutations.
So let's start again with this diagram from yesterday. This is where we discussed the idea that cancer is an evolutionary process that really conforms to the rules of Darwinian selection. I want to add some definitions today, specifically that cancers arise as a result of driver mutations. These are mutations that directly or indirectly confer a selective advantage, a growth advantage, on the cells that carry them. A selective advantage is basically the difference between the birth rate and the death rate in a cell population. In a normal tissue that is self-maintaining, that difference is pretty much zero. So any mutation that increases the birth rate of cells, that is, the division rate, or any mutation that decreases the death rate of cells will confer a selective advantage.
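To make the idea of selective advantage as birth rate minus death rate concrete, here is a minimal sketch. It assumes simple exponential growth, which is a toy model rather than anything shown on the slides, and the function name and rates are made up for illustration.

```python
import math


def clone_size(n0, birth_rate, death_rate, t):
    """Expected clone size after time t under an assumed exponential model.

    The net growth rate (birth_rate - death_rate) plays the role of the
    selective advantage: zero means the population just maintains itself.
    """
    return n0 * math.exp((birth_rate - death_rate) * t)


# Self-maintaining tissue: births and deaths balance, so the size stays flat.
normal = clone_size(1000, birth_rate=0.10, death_rate=0.10, t=50)

# Hypothetical driver mutation that slightly lowers the death rate.
driver = clone_size(1000, birth_rate=0.10, death_rate=0.08, t=50)

print(normal, driver)
```

Even a small gap between the two rates compounds over many divisions, which is why a clone carrying such a mutation can come to dominate.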
As part of this process of mutagenesis, there will also be other mutations that do not confer any selective advantage, or that perhaps confer a selective disadvantage. Those decrease the birth rate, for instance, and the cells carrying them will be out-competed. But you will also have a lot of passenger mutations. In cancer genomics, one of the key tasks is to separate the driver mutations from the passenger mutations, and this is a big challenge. How do we tell them apart? Passenger mutations also arise during every cell cycle. As cells divide they gain more and more mutations, and by the time a cell gains that key mutation that gives it a selective advantage, it already carries a thousand other mutations. So when we look at this green clone, it will have a thousand mutations, and we have to identify which one of those gave it the growth advantage. When we profile a tumor at diagnosis, it will be a mixture of these cells, some of which had the mutation that gave rise to the green clone and some of which had the mutations that gave rise to the orange clone. And there will be lots of passenger mutations, either in common or specific to each clone.
Okay, so tumors have this heterogeneity both in space and in time, because when we look at tumors at recurrence, we often see a different mutational distribution. And just a quick note on drivers. When people talk about drivers they just say "driver mutations", but really some mutations are necessary as initiating events and some are necessary as maintenance events. Even if those happen early on, they don't necessarily initiate a tumor, but they're required for maintenance of the cell population. And some are only drivers in the face of a specific selective pressure, like resection and then chemotherapy. So there will be drivers of progression that are not necessarily drivers of initiation.
Okay, so unlike yesterday, when we talked about copy number variations and alterations, which are events that encompass thousands of base pairs, today we're going to talk about the tiny things in the genome that matter. These are point mutations. Here's one example. Here we see the normal cell DNA; this is just a bit of the reference sequence from the P53 gene. This red base indicates a change from the reference, and this change, this T, is only in the tumor cell. The thing to note here is that we're looking for somatic events. This is a somatic event: it is found only in the tumor cell DNA and not in the normal cell DNA. There are also germline events. Mutations can happen in the germline, and then you get cancer predisposition syndromes. You could have a P53 mutation in the germline, and such a person would have Li-Fraumeni syndrome, so they would be predisposed to multiple types of tumors.
Okay, so mutations have a few main classes. The mutations that happen in coding regions of genes include missense mutations. These are changes that lead to a change in the amino acid sequence of the protein. Often they occur in the first or second position of a codon, which is a group of three bases. These three bases are recognized by the cellular machinery as a unit that is then matched with a particular amino acid. So if you change a bit of this unit, you'll get a match with a different amino acid. You can see in this example that the second base of this codon is changed to a T, and the amino acid, instead of being glutamic acid, is now valine, so this protein will have a substitution at this position. There are also silent, or synonymous, mutations. These often happen in the third position of a codon, which is called the wobble position. It's less important than the other two: many amino acids have codons that are the same over the first two bases but differ in the third.
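The coding mutation classes can be sketched as a lookup in the standard genetic code. This is a minimal illustration, not the annotation tool used in the lab; the function name `classify_snv` and the example codons are assumptions chosen for illustration, not taken from the slide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snv(codon, pos, alt):
    """Classify a single-base change at 0-based position `pos` of `codon`."""
    mutant = codon[:pos] + alt + codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if alt_aa == ref_aa:
        return "silent"      # same amino acid, e.g. a wobble-position change
    if alt_aa == "*":
        return "nonsense"    # premature stop codon (truncating)
    return "missense"        # different amino acid (also covers stop-loss here)


print(classify_snv("GAG", 1, "T"))  # second base changed: Glu -> Val
print(classify_snv("CAG", 2, "A"))  # third base changed: Gln -> Gln
print(classify_snv("CAG", 0, "T"))  # first base changed: Gln -> stop
```

A real annotator also needs the transcript, strand, and reading frame to find the codon in the first place; this sketch assumes the codon is already known.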
So even though this base is different, we don't see a change in the protein in this particular case. In general, when we see synonymous mutations in cancer, we don't think they're functional. There could be rare cases where a synonymous mutation is functional, for instance by changing a splice site or altering the binding of some additional protein that binds the mRNA, but when we look for functional mutations we tend to filter these out. So we focus only on missense mutations and on nonsense mutations, the next category. These are truncating mutations: single-base substitutions that introduce a premature stop codon. This is one of the classes of mutations that you typically see affecting tumor suppressor genes, because a great way to inactivate a protein is to just truncate it part of the way through. Here we see a change in the first position of this glutamine codon, and that changes it to a stop.
There are also short insertions and deletions, abbreviated as indels. Indels are trickier to detect from short-read sequencing data, but basically they comprise two categories. The first is frameshift indels. These are small deletions or insertions; we can see here the insertion of a T. If we have a deletion or insertion of one or two bases, or of any number of bases that is not a multiple of three, that changes the reading frame. You can see in this middle example that everything subsequent to this T now corresponds to different amino acids. So you frameshift the whole rest of the protein, often introducing a stop codon fairly early. The second category is in-frame indels. If you have an insertion or deletion of three bases, or of any multiple of three, then you just insert or delete whole codons. You don't shift the rest of the protein; you just take out a bit of it or add a bit to it. As I said, indels are more of a computational challenge to detect than point mutations.
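The multiple-of-three rule for indels is simple enough to write down directly. A minimal sketch, assuming VCF-style reference and alternate alleles that fall entirely within coding sequence; the function name `classify_indel` is hypothetical.

```python
def classify_indel(ref, alt):
    """Label an indel from its reference and alternate allele strings.

    A net length change that is a multiple of three adds or removes whole
    codons; anything else shifts the reading frame downstream.
    """
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("same length: a substitution, not an indel")
    kind = "insertion" if delta > 0 else "deletion"
    frame = "in-frame" if delta % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(classify_indel("A", "AT"))    # one base inserted
print(classify_indel("ACGT", "A"))  # three bases deleted
```

This ignores indels that span an exon boundary or a splice site, where the effect on the protein cannot be read off the length alone.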
Okay, so like copy number alterations, SNVs, or point mutations, have typical patterns of frequencies in tumor suppressor genes and oncogenes. This graph depicts the patterns in oncogenes, two examples here on top, and in tumor suppressor genes, two examples on the bottom. These bars are essentially the genes: here we have PIK3CA, IDH1, RB1, and VHL. Each of these glyphs corresponds to a mutation, with missense mutations in red and truncating mutations as black triangles. The pattern that we often observe in oncogenes is these stacks of missense mutations. These change an amino acid of the protein, so there will be a new version of the protein that has a gain of function or a switch of function. And it's really critical that it's a particular amino acid that confers that new functionality. These are the so-called hotspot mutations. They will often, for instance, cause a protein to be constitutively active. The second protein is IDH1, isocitrate dehydrogenase. Here mutations are basically only seen at this one position, so it's very obvious which part of the protein is necessary for oncogenic transformation: it's this particular codon, or amino acid.
The genes at the bottom are tumor suppressor genes, and they show the classic pattern of loss of function. A lot of these mutations in red are missense, and they essentially inactivate the protein, as opposed to activating it like we see for oncogenes. Often there's only one way to activate a protein, but there are many, many ways to disrupt one. There are lots of ways to destroy something. So you can see that these mutations have some clustering, but not really; they are spread throughout. And then on the bottom we see these black triangles, which are the truncating mutations. So these proteins are inactivated in many cancer samples, and this is the classic tumor suppressor pattern.
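The contrast between stacked hotspots and scattered or truncating mutations can be turned into a crude heuristic. This is only a sketch: the 20% thresholds, the function name `mutation_pattern`, and the example positions are illustrative assumptions, not a validated classifier.

```python
from collections import Counter


def mutation_pattern(mutations, threshold=0.2):
    """Guess a gene's pattern from (protein_position, kind) pairs.

    kind is "missense" or "truncating". The thresholds are arbitrary
    illustrative cut-offs.
    """
    total = len(mutations)
    if total == 0:
        return "no data"
    per_position = Counter(pos for pos, kind in mutations if kind == "missense")
    # Missense mutations that fall on a position hit more than once.
    recurrent_missense = sum(n for n in per_position.values() if n > 1)
    truncating = sum(1 for _, kind in mutations if kind == "truncating")
    if truncating / total > threshold:
        return "tumor-suppressor-like"
    if recurrent_missense / total > threshold:
        return "oncogene-like"
    return "unclear"


# Hypothetical gene with nearly all mutations stacked on one codon.
hotspot_gene = [(100, "missense")] * 9 + [(50, "missense")]
# Hypothetical gene with scattered missense and truncating mutations.
scattered_gene = [(10, "truncating"), (55, "truncating"), (200, "missense"),
                  (310, "missense"), (420, "truncating")]

print(mutation_pattern(hotspot_gene))
print(mutation_pattern(scattered_gene))
```

A gene with few recorded mutations will be classified unreliably by any ratio like this, so in practice such a call only makes sense on large cohorts.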
And someone yesterday was asking whether we often see mutations, or copy number alterations, or both. It really depends on the particular gene. For instance, P53 is more often mutated. PTCH1, another tumor suppressor, is often mutated on one copy and deleted on the other copy, so you get the two hits. So it really depends on the gene, and in a second we'll see how we might take a look at this for genes of interest.
Okay. So consortium projects like the TCGA have really aimed to sequence large cohorts of people, or tumors, to find in an unbiased way all the mutations in tumor suppressors and oncogenes. The idea is to get as much data as possible to start to see those patterns for genes other than the obvious ones that always come up, so we can start to understand cancer biology. The goal is to find the events that cause the phenotypic attributes of the hallmarks of cancer that we talked about briefly yesterday, events that could explain how these hallmarks are acquired and how these transformations happen.
So let's take a quick look at one of these cancer genome landscape papers that revealed some patterns of these mutations. This is a paper from 2013. Basically, it's a synthesis analysis of the mutational landscape of tumors from many different types of cancer, from many tissues of origin, in both pediatric and adult tumors, so across a range of ages. In general, it looks like there are about 120 to 140 genes that are recurrently mutated in human cancers, out of the 20,000 genes we have. In any particular tumor, there are between two and eight mutated driver genes. So it takes multiple hits to make a cancer. We almost never see tumors with just one mutation. It takes a few sequential hits to progressively change the phenotype of a normal cell into that of a malignant cell.
However, these are not the only mutations in a tumor, and that's depicted in this plot. This is the average number of coding mutations in a tumor that change the amino acid, or are truncating, or are in some way damaging. There are on average about 33 to 66 genes with protein-coding changes. These would be drivers plus passengers. There are some notable outliers, melanomas and lung cancers. These have mutations in the hundreds, and they also involve potent mutagens: melanoma is associated with UV radiation and lung cancer with smoking. They're only outmutated by tumors that have defects in DNA repair, specifically mismatch repair. These are a lot of the colorectal cancers. Also, if you have a germline mutation in one of the mismatch repair genes, then you can have, for instance, a mismatch-repair-deficient brain tumor. So a lot of the gliomas that fall in this category are ultra-hypermutated.
It turns out that more tumor suppressor genes are mutated than oncogenes. So mutations predominate in tumor suppressor genes rather than oncogenes, which is in a way unfortunate, because oncogenes are much easier to target therapeutically. If something is overexpressed, it's easy to knock it down. If a tumor suppressor is missing, it's very difficult to add it back, or to add that functionality back. So, again, this goes back to what we were talking about yesterday with the PARP inhibitors: we need to look at pathways, because tumor suppressors and oncogenes are really two sides of a pathway, and drug targeting really needs to be thought about from that perspective.
Okay, so the most frequently mutated gene in human cancer is P53. This is a gene that's involved in programmed cell death and DNA repair.
If there are significant abnormalities in the genome of a cell that's going through the cell cycle, P53 will prevent that cell from completing the cell cycle and target it for cell death. So it's kind of a built-in safety system to not let cells accumulate too many mutations. Mutations in this gene allow cells to survive despite massive genomic rearrangements or mutations and to proceed through the cell cycle. So this is sort of the culprit in cancer. I remember doing a study where we were sequencing recurrent medulloblastomas, and we were really eager to see which gene was recurrently mutated in our recurrent tumor cohort, and of course it was P53. So when we saw the results we were in a way a bit disappointed, because we didn't discover anything new, but in another way it makes perfect sense, because this is how cancer cells escape the built-in limitations and acquire resistance to therapy, for instance.
Does anyone know why elephants do not get cancer? Do you remember reading this in the paper? Yeah, they have like 60 copies of P53. So elephants don't get cancer. When they sequenced the genome of the elephant, they found that elephants have about 60 copies of P53. So their built-in system has a lot of redundancy.
Just one more question I had in mind. So you're talking about tumor suppressors and oncogenes, right? Evolutionarily speaking, is there any reason for oncogenes? As far as I understand, one is tumor suppression, and if something goes wrong, then those suppressors and oncogenes give rise to cancer, right? So is there any benefit in having them?
So the terms oncogene and tumor suppressor gene are really what we call those genes that drive cancer when something goes wrong with their normal function in the cell.
These genes have a normal function in the cell. Take an oncogene: normal PI3 kinase is involved in growth and cellular replication, and you need your cells to replicate. Tumor suppressor genes are the genes that balance genes like PI3 kinase, because you have feedback loops. You don't want to stop all cellular replication, otherwise the organism doesn't grow, but you need this balance between the two. When the balance is altered, you get a phenotype like cancer.
Have people looked at cBioPortal yet? Yeah, maybe less than a third of people. So this is a portal for a lot of the data that has come out of the TCGA. You can put in, I think, up to a hundred genes of interest. I put in one gene of interest, P53, and basically it gives you everything that has been found about this gene in the TCGA cohort. Here we see the incidence of mutations in P53 in all the cancers that were sequenced as part of TCGA. Of course it's too small to read, because this tail is very long, but you can appreciate that there's a lot of green. Here I've zoomed in on the end of this distribution. You can see the Y scale goes up to a hundred percent, so in certain types of tumors almost a hundred percent of cases have P53 mutations. There are some tumors at the lower end of the scale that do not have very frequent P53 mutations. I encourage you, if you are curious about any gene, to look it up in cBioPortal and click along the tabs, and you'll see, for instance, how mutations correlate with expression and what kinds of events affect that particular gene. In this case green means mutations, so this gene is often mutated, but you might see that PI3 kinase, for instance, is amplified, or mutated in that hotspot fashion. So it's a really great resource for exploring these types of alterations. So that's P53, the guardian of the genome.
What about the other 120 or so genes that are frequently mutated in human cancers? These are shown here. These are genes that, like P53, are mutated significantly more often than you would expect by chance. This figure depicts these 127 genes. They are the rows, with the gene names on the right, and the genes are grouped by the cellular processes in which they are involved. So we see here transcription factors and regulators, histone modifiers, genome integrity, receptor tyrosine kinases, cell cycle, and so on. There are a lot of processes in the cell that can be co-opted to take part in oncogenic transformation.
The way to read this plot is that the columns are the different kinds of cancer that were profiled and, again, the rows are the different genes. The number and the color in each square correspond to the proportion of cases of that kind of cancer that have a mutation in that gene. The darker the color, the higher the percentage of cases with a mutation in that gene. P53 is right here; it's very easy to pick out. It's basically a red bar almost all the way across, because nearly every cancer type has P53 mutations. There are other examples, like, I think that's VHL up there in kidney cancers. P53 is not highly recurrently mutated in kidney cancers, but VHL is one of the main ways to initiate that malignancy. And so we see these patterns of genes. PI3 kinase, it's too small for me to read. Can you guys see PI3 kinase? Is it this one? Yeah. So PI3 kinase is another one that is recurrently mutated in a number of different cancers. Some of these were not surprising. What was surprising was that there were some new categories of cellular processes that are involved in tumorigenesis.
Some you will have heard about before, like MAPK signaling, PI3 kinase, and Wnt signaling, and some are new pathways, like splicing, transcriptional regulation, metabolism, and histones. These indicate the potential for development of new therapies. And these new mutational themes would not really have been obvious without this sort of broad, unbiased survey of mutations in the genome.
Okay. So some mutations define cancers, as we saw, and there's a lot of clinical utility in knowing the status of mutations in particular genes in specific cancer types. For instance, P53 mutations define high-grade serous ovarian cancer. If you don't see P53 mutations in one of these cancers, you might suspect that it's a misdiagnosis. BCR-ABL translocations are diagnostic of CML. If a patient has a BCR-ABL translocation, you can give them Gleevec. There are lots of companion diagnostics now emerging for these therapeutics, the most common of which are the tests for EGFR mutations in lung cancer. These would correspond to a recommendation for anti-EGFR tyrosine kinase inhibitors. Another is BRAF V600E mutations in melanoma: if a patient has this mutation, they would be a good candidate for this particular drug, and so on. And there's a good resource for finding which targeted therapies and diagnostics are available on the market.
Of course, mutations also evolve and emerge as potent markers of drug resistance. For instance, when a patient is treated with anti-EGFR therapy, certain mutations in EGFR are really a mark, an indication, of resistance to that treatment. An anti-EGFR-resistant tumor will have developed a secondary mutation that allows it to become resistant. And it's the same with BRCA. BRCA is a tumor suppressor, so it often has indels, insertions that put it out of frame. When you treat that tumor, it will acquire a secondary insertion or deletion that puts the gene back into frame.
So the tumors evolve to counteract therapies. Okay, so we looked at this one briefly. This is IDH1 in glioblastoma. This was an important discovery. This is a metabolism gene, so it was really surprising when IDH1 was identified as highly recurrently mutated in gliomas. But the nature of this protein has opened up a new line of investigation into the role of metabolism in this disease. The mechanism is based essentially on this gain-of-function mutation. When this gene is mutant, it generates a rare metabolite which accumulates in these cells and competitively inhibits histone demethylases. The effect in gliomas is that tumors have increased levels of histone methylation across the genome, especially repressive marks, and this is a block to differentiation. So we see epigenetic changes as a result of a mutation in a metabolism gene, which is a very surprising turn of events. On its own it's not sufficient to initiate tumors, but coupled with additional events like P53 mutations, these mutations turn out to play a big role in tumorigenesis. And there's a lot of research activity now devoted to studying this process and how it can be co-opted for therapeutic benefit.
Here's another example. This is from Sohrab Shah's lab, and it came from studying a rare form of ovarian cancer. In this case, they performed RNA sequencing and noticed that at this particular position in the FOXL2 gene there was a recurrent mutation, present in essentially every case of this rare ovarian cancer. When they looked in an additional cohort, they found that every patient had this mutation. This is what's called a pathognomonic event: a mutation that essentially defines the etiology of that disease, like BCR-ABL for CML. Some of these really rare cancers can be hard to diagnose, and having this type of mutation can provide a really high-accuracy molecular diagnostic. We talked briefly about PI3 kinase.
This gene has a couple of hotspots. Lots of targeted inhibitors have been developed, but they don't work very well, so this is still an active area of investigation. I'm going to skip over this; it just depicts the reading frame of the mutation. But I wanted to talk about the genetic context in which PI3 kinase is found and the signaling pathway that it's part of. This gets back to the question of why we even have tumor suppressor genes or oncogenes. This is the normal cellular context of a gene like this.
How many people have seen KEGG pathways before? Okay, so more than half the people. Basically, this is a curated diagram that describes the relationships between genes, based on experimental evidence and on the literature. Things start here on the left and end up on the right. You can appreciate that PI3 kinase, which is right here, is pretty much at the top of this structure. Signals come in from the left and converge through PI3 kinase and downstream to AKT. AKT then triggers a cascade of downstream effectors that promote cell cycle progression and cell survival. So the impact of an activating mutation at the top of the pathway, in a gene like PI3 kinase or AKT, is huge compared to mutations in downstream effectors. If you mutate this gene down here, you'll have a very small effect because you're very close to the endpoint, whereas if you mutate something upstream, you affect the whole cascade.
You can also notice in this pathway diagram that we have PTEN here. This is the tumor suppressor we keep mentioning, and it's frequently mutated or deleted. That's actually powerful, because PTEN normally turns off this pathway, or decreases signaling through AKT. So the input and the brakes on the input are normally in balance.
And if you either increase signaling through PI3 kinase, or decrease the normal braking of the signal propagation by taking away PTEN, then you have way too much cellular proliferation.
Okay, so here's an example of another tumor suppressor gene, ARID1A. Sequencing revealed that this was a tumor suppressor because the researchers saw this particular pattern, which is the classic tumor suppressor pattern. So sequencing is one way to functionally classify genes into tumor suppressors or oncogenes.
Okay, so this table summarizes some of the important mutations that are currently being tested for, for which there are targeted agents, and for which clinicians could prescribe a therapy. We've talked about some of these. In the next slide we're going to see an example with the BRAF V600E mutation. This is a mutation that's carried by about half of patients with melanoma. It leads to constitutive activation of downstream signaling through the MAPK pathway. And 90% of these mutations are at that one particular amino acid, 600, changing it from a valine to a glutamic acid. Vemurafenib is the drug they were testing. This is the paper that described the clinical trial showing that vemurafenib, which is an inhibitor of this particular mutated form of BRAF, actually works really well. In the top panel we see the response of each patient in terms of tumor growth. Each patient is a bar, and tumor growth is on the Y scale. So we see either growth or shrinkage of the tumor in the presence of this particular drug, vemurafenib, versus the standard of care for melanoma at the time, which was dacarbazine, I think.
You can see that a small subset of patients benefited from the standard-of-care drug, but a lot of patients had a uniform decrease in their tumor mass as a result of treatment with vemurafenib. This was really exciting, and people thought that a drug like this might be very useful in other cancers that have the BRAF V600E mutation. This hope led to testing of vemurafenib in colon cancer. Ten percent of colon cancers have the same V600E mutation. What these plots show is that when you take xenografts of colon cancers that are mutant and you either don't treat them or treat them with vemurafenib, there is really no difference. So vemurafenib doesn't make a difference in these cancers. That's because in these colorectal cancer cells there is activation of EGFR, and in melanoma cells there is no activation of EGFR. EGFR is a parallel way to activate the same pathway that BRAF was activating. So the cellular context of the mutation really makes a difference. In melanoma cells, the pathway is activated in such a way that by using this inhibitor you're blocking all signaling through that pathway, and in colorectal cancer, even though you have the same mutation, that's not the case. So the presence of the mutation alone, without the cellular context, isn't necessarily sufficient to predict the response to therapy.
Another relatively recent discovery is the description of mutations in the regulatory regions of genes. This is in melanoma. These mutations occur in both sporadic and familial, or inherited, forms of melanoma. And really this was the first showcasing of an inherited regulatory mutation that is a driver. Before this, people were looking for protein-coding events: missense events, truncating events, hotspot activating events. And now it became obvious that you can actually have a driver mutation in the promoter.
The mutations are at these two positions, this one and this one. Essentially what they do is add a transcription factor binding site. When you have the mutation, certain transcription factors are able to bind here. This promoter drives expression of TERT, which is the telomerase reverse transcriptase gene, so it's involved in genome maintenance. Cancers often have activation of TERT and they'll have longer telomeres, and those cells are able to survive or become immortalized because of this. So this is another way to turn on TERT that no one really suspected before. These are back-to-back papers in an issue of Science. In sporadic melanoma these mutations are recurrent: they're found in 33% of primary melanomas and 85% of metastatic tumors. So they're actually really frequent. Regulatory regions of the genome, if they're mutated, could be a really important tumorigenic mechanism. That's a departure from the systematic characterization of just protein-coding events that we've been focusing on. So don't forget to pay attention to the regulatory regions, especially if you have whole-genome data. If you have whole-exome data, then you don't have to worry about any of this.
Okay. So what do these patterns of mutations across the genome tell us about the biology of mutations in cancer? We've looked at some single genes, but there are two properties of mutation patterns that we can ascertain through whole-genome or whole-exome analysis. One is the mutation rate, which we already talked about. There are some ultra-hypermutated tumors, and there are some tumors that have specific mutagenic inputs, like in lung cancer or skin cancer, and those have a lot more mutations. The other is mutational signatures, which we're going to talk about next, and which give us insight into the processes that influence the types of mutations we see in cancer genomes.
A great way to think about this is to consider these patterns of mutations. This figure is from a recent pan-cancer report, and it shows, from the center to the outside, the abundance of mutations, as the number of mutations per megabase. Each dot on this plot is an individual tumor, and each of these sectors defines the type of mutation. So these are C to A mutations, these are C to T mutations, and so on. You can see that some tumors don't have a lot of mutations and cluster close to the center, and some tumors have a lot of mutations and sit toward the outside of the plot. Each dot is colored by the type of tumor it is. You can see that these black dots are melanomas. They are characterized, first of all, by many mutations, so again they're some of the hypermutated ones, and also by the fact that they mostly cluster in this sector. So they mostly have C to T mutations. Do you guys know what that is? Any guesses? Yeah, it's a signature of UV mutagenesis. No other tumors would suffer from this, because they're internal, for instance. But these melanomas are generated through this mutational process.
What about these C to A mutations? Yeah, smoking. Smoking is the big culprit for lung cancers. Lung cancers that occur in patients who have smoked throughout their lifetime will have this pattern. Lung cancers also occur in patients without a smoking history, and those do not have this pattern. So smoking causes this pattern of mutations, which causes this type of lung cancer. These mutational patterns tell us something about the biology of the tumors, and they can be described as mutational signatures. How many people have heard of mutational signatures before? Mutational signatures are basically these six substitution patterns represented in a matrix. What I'm showing you here is just the first six signatures.
So basically each substitution is represented in the context of the base preceding the mutation and the base after the mutation. So there are 96 possible combinations. So the x-axis of each one of these plots is the 96 possible changes. And if we zoom into this particular C to T part of signature one, we see for instance that C to T mutations in the context of an ACG far outnumber those C to T mutations in the context of an ACT. And so we get the specific pattern. These signatures have been derived from thousands of sequenced tumors in 40 different types of cancers. And for each mutational signature, we know the cancer types in which that signature has been found. And there is, if possible, a proposed etiology for the mutational process underlying that signature. And so here are a couple of examples we've seen already. So the lung adenocarcinomas have this signature four. So these are individual tumors, and you can see what proportion of mutations corresponds to each signature. So signature four is responsible for a huge proportion of the mutations in each one of these particular cancers. And similarly for melanoma, signature seven, which is that UV-induced signature, is responsible for essentially all of the mutations in these tumors. And so this information is from the COSMIC database. Have you guys been to COSMIC? Yes, I see nodding. So COSMIC is great for a number of purposes, but one of them is to look at mutational signatures. So for instance, we can see for each one of these signatures what the cancer types were, what the proposed etiology is, and any additional mutational features. So later in the lab we're going to run a mutational signature algorithm on our data. So you'll have a chance to do this and have a hands-on experience with this.
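To make the count of 96 concrete, here is a small Python sketch that enumerates the channels (6 substitution classes times 4 preceding bases times 4 following bases) and assigns a mutation to its channel. The `A[C>T]G` label style and the function names are choices made for this example, not the output format of any particular signature tool.

```python
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def all_96_channels():
    """List every substitution class in every 5'/3' flanking context."""
    channels = []
    for ref, alts in (("C", "AGT"), ("T", "ACG")):
        for alt in alts:
            for five, three in product(BASES, repeat=2):
                channels.append(f"{five}[{ref}>{alt}]{three}")
    return channels


def channel(trinucleotide, alt):
    """Channel label for a mutation of the middle base of `trinucleotide`.

    `trinucleotide` is the reference base with one flanking base on each
    side, read on the reference strand; purine-centered contexts are
    flipped to the opposite strand so the mutated base is always C or T.
    """
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1:
        raise ValueError("expected a 3-base context and a 1-base alternate")
    if tri[1] in "AG":
        tri, alt = reverse_complement(tri), reverse_complement(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


if __name__ == "__main__":
    print(len(all_96_channels()))
    print(channel("ACG", "T"), channel("CGT", "A"))
```

A sample's mutational profile is then just the count of its mutations in each of these 96 channels, which is what the x-axis of the signature plots shows.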
And also, I encourage you to explore COSMIC as a resource for reading about these signatures. And so basically, if we look at all the signatures, there are 30 signatures currently. As we sequence more and more cancers, we may find more signatures, because there could be rare ways in which to induce mutations that are not yet evident given our level of sampling. And here we see the 40 different tumor types. So some signatures are essentially ubiquitous, and some signatures are specific to particular cancer types. Okay, so you'll have a chance to look hands-on at some signatures in the lab.
I'm going to skip this slide and move on to statistical considerations for modeling allelic distributions from next-gen sequencing data. And just to revisit the properties of cancer genomes that need to be accounted for when we do mutation calling. So we're interested in somatic mutations, which are not in the germline of an individual. So the things that matter are, again, purity of the sample, so the tumor-normal admixture. If your tumor DNA is contaminated with a lot of normal DNA, it's much harder to find mutations, because that dilutes the signal. We know that there is intratumoral heterogeneity because cancers evolve, and so by default there will be some level of heterogeneity. There is genomic instability. So there will be copy number changes, LOH, and so on that affect the observed frequency of a particular mutation in our samples. And I just want to mention that the ideal way to do an experiment where we look for mutations, especially somatic mutations, is to pair tumor and normal from the same patient. So I know some of you have exomes from tumors that have no matched normal. And in that case, it's actually really difficult to do a meaningful analysis of the mutations unless you're also willing to look at germline events.
And especially if you have known drivers in that disease, then you could look for their prevalence in your cohort, knowing that you probably won't be able to distinguish somatic from germline events. So here is a simplistic flowchart for an analytical approach. This is what an analysis might look like. We start with a cohort of interest, so whatever it is that you want to sequence. And of course, the first step is to align the reads to a genome. These alignments are then put into one or more of these mutation calling tools, as well as tools for detection of copy number, as we saw yesterday, and for detection of purity and ploidy, which is important, as I mentioned. And so now, out of these analyses, we will have variant allele frequencies for the mutations, copy number ratios, and purity. And then we want to take those mutations and annotate them with a tool like ANNOVAR that pulls annotations from databases like dbSNP and COSMIC. And so we'll have functional annotations for all these mutations, at which point we would like to correct the observed variant allele frequencies, because we know that purity and copy number may affect them. And instead of VAFs, we would like to work with CCFs, which are the cancer cell fractions. So in most cancer genomics papers that you see in the literature these days, you will see plots of CCF, not VAF. And so CCF basically tells you in what fraction of cancer cells your mutation is present. And then that easily translates to splitting those mutations into ones that are clonal, because they're in every cancer cell, and ones that are subclonal, because they're later events and are only in a subset of cells. And then at the end, once we have this nicely annotated, filtered, and VAF-corrected set of mutations, we can do interpretation and validations. So first up, alignments. Alignments, have you guys talked about alignments yesterday with Jared?
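One common way to write the VAF-to-CCF correction described above is sketched here in Python. This is a simplified model, not the method of any specific tool: it assumes you already have a purity estimate and the tumor's total copy number at the locus, that the normal is diploid, and that the mutation sits on `multiplicity` copies in each mutated cell (one by default). The function names and the 0.9 clonality cutoff are illustrative assumptions.

```python
def expected_vaf(ccf, purity, tumor_cn=2, normal_cn=2, multiplicity=1):
    """VAF expected for a mutation present in a fraction `ccf` of cancer cells."""
    mutant_copies = purity * multiplicity * ccf
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant_copies / total_copies


def cancer_cell_fraction(vaf, purity, tumor_cn=2, normal_cn=2, multiplicity=1):
    """Invert the model above: turn an observed VAF into a CCF estimate."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


def is_clonal(ccf, threshold=0.9):
    """Crude clonal/subclonal split; the 0.9 cutoff is an arbitrary example."""
    return ccf >= threshold


if __name__ == "__main__":
    # A heterozygous clonal mutation in a 50% pure, diploid tumor is
    # expected at a VAF of 0.25 rather than 0.5.
    print(expected_vaf(ccf=1.0, purity=0.5))
    print(cancer_cell_fraction(vaf=0.25, purity=0.5))
```

Sampling noise can push a point estimate above 1, which is one reason published methods put a distribution over CCF rather than reporting a single number.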
Yeah, so you're experts at aligning. So basically there are many tools that will take reads and find the best match in the reference genome. So we would take one of these tools, BWA perhaps, and align our reads to the genome. Here's the reference genome on the bottom. And then we would summarize the positions of interest as those that are different from the reference sequence. And furthermore, the positions of interest to us in a somatic mutation analysis in cancer would be the ones that are different from the reference and different from the normal, so unique to the tumor. And so the way we do this conceptually is that we would have the normal genome and then the tumor genome. And we would count the number of reads that cover each position, and how many of those reads correspond to the reference allele or an alternate allele. And so we make this simple matrix here on the bottom that corresponds to the genotypes. And so if we do this for the normal and the tumor, we can see in blue that there are these positions that are different from the reference. So this is a C, which in the reference is a G, but it's in every read of the normal. So this is a germline polymorphism. The normal is 100% C; this person just has C/C at this position. And similarly a G/G at this position, or actually a G/C, whereas the reference is a G. And so if we see the same distribution in the tumor, that means that there is no difference between the normal and the tumor. So in this case we have AB, AB. So these are the genotypes, just like we talked about yesterday. In this case we have BB to BB, so no difference. But in this case we go from being homozygous A to being heterozygous A/C. So we have a mutation. The base A is now changed to a C. So this C is different both from the reference and from the normal. So it's this type of somatic event that we want to identify and shortlist.
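The counting logic just described can be sketched in a few lines of Python. This is a deliberately naive illustration: the fixed allele-fraction thresholds and the function names are assumptions made for the example, and real somatic callers use probabilistic models instead of hard cutoffs.

```python
def naive_genotype(ref_count, alt_count, het_min=0.2, hom_min=0.8):
    """Label a position AA, AB or BB from reference/alternate read counts.

    A is the reference allele and B the alternate allele. Returns None
    when no reads cover the position.
    """
    depth = ref_count + alt_count
    if depth == 0:
        return None
    alt_fraction = alt_count / depth
    if alt_fraction >= hom_min:
        return "BB"
    if alt_fraction >= het_min:
        return "AB"
    return "AA"


def is_candidate_somatic(normal_counts, tumor_counts):
    """True when the normal looks homozygous reference and the tumor does not.

    Each argument is a (ref_count, alt_count) pair for the same position.
    """
    normal_gt = naive_genotype(*normal_counts)
    tumor_gt = naive_genotype(*tumor_counts)
    return normal_gt == "AA" and tumor_gt in ("AB", "BB")


if __name__ == "__main__":
    print(is_candidate_somatic((10, 0), (6, 4)))  # AA in normal, AB in tumor
    print(is_candidate_somatic((5, 5), (5, 5)))   # AB in both: germline
    print(is_candidate_somatic((0, 10), (0, 10)))  # BB in both: germline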
And so the problem of course is that there are many such events in the genome, and most of them will be the same between the normal and the tumor, because there are three million polymorphisms and only perhaps a few hundred somatic mutations. And so there are statistical ways to infer which positions are actually somatically mutated. So we would like to pick out cases like this, where in the normal it's homozygous and in the tumor it's heterozygous. And as we learned to do mutation calling better and better, and as this data became more prevalent and tools evolved, it became really important to work out what types of artifacts or biases are in the data that would influence mutation calls. And so one really important aspect of designing these tools was to do validations. So this is from the work from Sohrab Shah's lab, where they looked at 50 triple-negative breast cancer tumor-normal pairs and called 3000 mutations and then validated them and found that 2000 of them were not actually somatic. And so there's a large input of artifacts that are going to add noise to these results. And so figuring out what's an artifact and what's not an artifact was a lot of work, and now we have somewhat better tools, but this was some of the work that needed to be done in order to get to this point. And so here are some example artifacts that induce false positives. On the top we see the tumor. This is an IGV screenshot; you guys have looked at IGV already, so you know that these gray bars are the reads, and the bases that are different from the reference are colored, and the intensity of the color corresponds to the base quality. So in the normal we don't see any of these bases, but in the tumor we do. And it turned out that in this case, if they redid the alignment with different parameters, these reads would go elsewhere in the genome.
So this is because there are similar sequences throughout the genome, and so how you do your initial alignment actually affects how well you do mutation calling and what types of artifacts are introduced. Indels are another big one. So here we see a case where there's a structural rearrangement, so this position is deleted. It's also deleted in the normal, but if the reads aren't aligned properly, then instead of opening up the proper big gap in the read, what the aligner did in this case is it just introduced some mismatches at the end. So it opened up a small gap and introduced mismatches, because most aligners have a higher penalty for opening gaps than for introducing mismatches. So if you see mismatches at the end of a read, or mismatches that go along with a small gap, then that is often a sign of artifacts that are due to indels. And so later I'll mention indel realignment; it's in positions like this that you would want to take these reads and realign them locally to find a better fit for those reads. Low base quality is another one. Maybe it's hard to see here, but you can see that the intensity of this base is not very high, and that corresponds to the base quality. If you don't see any high quality bases, or if you see a big mix of high quality and low quality bases, then that position would be a bit fishy. Another type of artifact is when all the reads that support the variant are from the same strand. So there are positions in the genome that would have a secondary structure, for instance, so some sequences will form this secondary structure and you can traverse the sequence from one end much more easily than you can traverse it from the other end. So from the other end, because you're on the opposite strand, you might just skip over a base or misread a base, whereas from the first strand you would correctly read through the obstruction.
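One simple way to quantify the strand artifact described here is a Fisher's exact test on the 2x2 table of reference and variant reads by strand. The Python sketch below uses only the standard library; it is an illustration of the idea rather than the filter any particular caller implements, and the function names and the 0.05 cutoff are assumptions made for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def looks_strand_biased(ref_fwd, ref_rev, alt_fwd, alt_rev, alpha=0.05):
    """Flag a variant whose supporting reads are skewed to one strand."""
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev) < alpha


if __name__ == "__main__":
    # Reference reads on both strands, variant reads on the forward strand only.
    print(looks_strand_biased(10, 10, 8, 0))
    # Variant reads split evenly across strands.
    print(looks_strand_biased(10, 10, 5, 5))
```

With very few variant reads the test has little power, so a site with two supporting reads on one strand will not be flagged by this check even though you would still want to look at it.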
And so if you see only one direction of read supporting your variant, then that is a very suspicious case and probably an artifact. And then sometimes there is no observable reason for why a mutation was called. I don't know why a mutation would have been called here, but clearly there is no variant. And some true positive examples: here we see nice high quality bases. We can't tell, but there are probably reads supporting it from both strands. These mutations are just in the tumor and not in the normal. Here's another case where I think the mutation, is that the mutation there, you guys? You can see better than I can on the big screen. So this mutation is very rare. So either because there's a lot of normal contamination or because it's a very subclonal event, you might see just one or two or three reads supporting the variant allele. And it could be that these are PCR artifacts or it could be that they're true mutations, and in this case this is a true positive example. So you can appreciate that there are lots of features of this data that affect how well and how accurately we can call mutations. And so Sohrab's lab developed this classifier-based approach. It's a machine learning classifier that learns the features of the data that yielded true positives versus false positives. So these are features like base quality, mapping quality, any strand biases, and so on and so forth. And the idea was to be able to separate those events that are somatic, so found in the tumor and not the germline; those events that were also found in the germline and not just the tumor; and those events that weren't really events at all, that were just false positives. So you can think of these wild types as technical biases, and these germlines as true signals that also had a signal in the germline and therefore are not somatic. And so this shows a principal component analysis over a feature space of 106 features.
So things like this, homopolymer runs and so on and so forth. And it shows that it's possible to separate somatic from germline and wild type. And so typically these tools use machine learning classifiers to significantly improve calls. And I'll not go over this in very much detail, but these are those 106 features and how they fall into different groups, with the true somatic events at the bottom and then the different classes of artifacts grouped over here. So these are cases, for instance, where we see low base quality and a certain type of error and strand bias, and you would get that class of mutations. I guess depending on the type of cancer the errors you get will be different, so you would have to train your classifier for each kind of cancer? So the errors, you mean? The errors are more platform specific. So an Illumina platform and different versions of the chemistry would generate specific errors, and so you would for instance much more likely want to retrain on a new version of Illumina sequencing, to find that, you know, perhaps Gs are underrepresented or something like that. Or certain nucleotide contexts are more associated with error or false positives. Are you thinking of the mutational signatures? The types of mutations will depend on the tissue. But the types of error are in many cases artifacts that are due to the technology, the sequencing technology, or the alignment algorithm. Any more questions? Yes. There was this one case where it was not obvious in IGV why the mutation was called. If you change the parameters, you don't really know whether that changes other things too. Yeah, so if you change parameters you could induce a different decision about which mutations are true somatic events versus false somatic events. I don't know exactly why that position was called.
Do you guys know, from Sohrab's lab, why this particular case had a mutation but when you look in IGV you don't see it? Yeah, it's hard to say without actually having the data. So I don't know. In cases that I've looked at I haven't really seen this. The only times I've seen something like a mutation called that is then not there in IGV is when you call the mutation on, let's say, the initial BAM file, the alignments, and then you do local realignment or something like that and you look at that BAM file in IGV. And in that case those mutations that were errors could be fixed, so you no longer see them in the visualization, because you did the mutation calling and the visualization on slightly different alignment files. Good questions. Any more? Yes, for the mutational signatures you look at what the overall frequency of those 96 categories is in your particular cancer or in your particular sample. That would be a separate analysis, so the mutational signature doesn't take that into account, but you could look for instance for overrepresentation of mutations on a particular chromosome or in a particular region, and that certainly does happen in cancer. Yes, the signature is genome wide.
Okay, so we've done this part, we've looked at alignments, and now we would want to do mutation calling. We've already talked about copy number calling yesterday. And so there are a number of tools available and widely used for somatic mutation calling and for visualization. Everybody uses SAMtools; if you have anything to do with BAM files you'll run SAMtools. This is implemented in C, it's fast and memory efficient. It's a suite of tools for working with alignment files in standard SAM, BAM, or now CRAM format. Have people heard of BAM files? Yes. Have you heard of CRAM files? No? CRAM is an even more sophisticated way to compress data, because BAM files are huge.
So if you do a lot of sequencing you generate a lot of SAM files. BAM is already the binary format of SAM, so there is a level of compression and BAM files are smaller. And in the last, I think, two or three years people have come up with the CRAM format, which is about, I think, a quarter to a third smaller than BAM. So SAMtools is a very useful suite of tools that you will use a lot if you do anything with sequencing. You may also have heard of the Genome Analysis Toolkit; this is from the Broad Institute. It's a Java implementation. It has some important properties, including the local realignment of indel regions that I mentioned, which significantly reduces misalignments, and it's actually a suite of tools that performs all sorts of tasks like quality control on the input data as well as germline and somatic variant calling, annotation of the variant effect, and so on and so forth. So you can read more about it here at this website. And a part of this GATK is MuTect. So MuTect, and now MuTect2, are the main somatic variant callers that are used for instance for a lot of the TCGA work. So this is probably one of the most popular tools for calling mutations. MuTect did not call indels, but MuTect2 also calls indels. And so it does joint calling on tumor and normal and has a lot of these filters that we've talked about. It looks at whether there's a gap close to the mutation, because gaps are much more penalized than base errors. It looks at strand bias. It looks at whether your mutation is in an area where reads align poorly and thus would be an alignment artifact. It looks at whether you have two alleles or three, and it would keep track of that third allele, which many tools don't necessarily do, and in a very heterogeneous cancer sample it is possible that at a very small subset of positions you might have a third allele.
It looks at whether the variant positions within the reads are clustered, and it also looks at the level of evidence for a particular mutation in the normal sample. So it does all this. It also has the option of screening your mutations against a panel of normals, because screening your mutations against the matched normal is great for finding somatic events, but there are still places in the genome that are by chance going to generate artifacts, so if you screen against a big panel of normals you will get rid of those positions. And then it also has an option to do variant classification and keep track of which mutations in your data set come from dbSNP or not, in order to prioritize them further. So it's pretty useful. The other thing that's useful about this tool in terms of cancer research is that it is fairly sensitive. What this graph on the right shows is that these different lines, the colors, correspond to the allelic frequency of a mutation. So if your mutation is at 40% allelic frequency, then you can find it with pretty good sensitivity even with 10 reads, because on average four of those 10 reads are going to carry the mutation. You might sometimes find it in one read or in six reads, because there's always a bit of variance in how many reads you're sampling from a population of DNA fragments. And then you can appreciate that as your variant gets less frequent, so here's a frequency of 0.2 or 0.1 or 5 percent, you need more and more coverage in order to find those variants in a sensitive way. So if you're looking for a rare variant, MuTect could be a better tool to use compared to some other tools, because it is more sensitive to these rare events. Another tool is Strelka, from Illumina. This is named after one of the first Russian dogs in space, so a canine cosmonaut, this dog on the right. So this is a caller that generates both SNVs and indels, and it's known for being highly specific. So it won't generate as many calls as MuTect, for instance; it's not as sensitive, but a high percentage
of the calls it does generate are going to be true positives, so they will validate. And part of the reason it's more successful than other tools at calling indels is this step where it calls candidate indels and then does a realignment in both the normal and the tumor sample at any positions found to have indels in either the normal or the tumor, and then it applies a somatic caller and does filtration to identify just somatic events. So it's actually pretty good at this, and indels from Strelka are something that, at least in our lab, we rely on more than indels from any other caller. MuTect2 now also does indels. I don't know how well it compares with Strelka; it hasn't been published yet, so the comparisons are not available for others to read. But if you have matched normal and tumor data and you're not interested in super subclonal mutations, this is a very good caller to use. MutationSeq you already know about; you used that yesterday in the lab. This is available as a Python package, it comes with some built-in visualizations for whole genome data, and you can read more about it at that link. And when we do variant calling, the information that you get about a mutation is encoded in a standardized format. So many of these tools work in different ways, but there is an effort to try to output the same kind of information, so that you can compare between tools and also have a consistent output in terms of structure and information. And so the VCF format, the variant call format, encodes a lot of metrics about the data that could be used to filter or prioritize mutations. So each line in a VCF corresponds to a mutation, and you'll go over this in the lab in more detail, but essentially each line is a mutation and you know the chromosome, the start position and the end position. If it's a point mutation, start and end are the same position; if it's an indel, then you can encode it in different ways. The reference allele,
the alternate, and then the quality of the call, and that will be something that's specific to the caller that you use, so different callers, different tools, will generate a different quality; and whether it passes or not the built-in filters of that particular tool; and then various statistics, so for instance the read depth, the variant allele frequency, so out of 100 reads how many supported the variant allele versus the reference allele, whether there was a strand bias, all sorts of statistics that would help us to prioritize mutations. So we'll look at this a little bit more in the lab. And then one of the most important things to do when you have mutations is to actually look at them. So we went over IGV just very briefly, but you'll have mutation calls later, and then in your own work, when you generate mutations, it's really important to take a quick look, at least at a random set, just to see what your data is like. Our brains are natural pattern finders, and so it's very easy for us to detect patterns or spot artifacts or see whether mutations look real or not. And so just looking at a handful will give you a really good idea of what your data is like. And then if you have any mutations that you want to follow up on, I would certainly look at those in a bit of detail. So IGV is one of the better visualization tools. It's widely used, it has a great tutorial, and it's easy to use. You can just launch it from the website or you could download it and use it on your computer.
Okay, so now we have some variants. We want to annotate them so that we can focus on those that are likely functional and of interest to follow up on, versus those that are perhaps synonymous or non-functional or non-coding, etc. Yeah? So in terms of mutation calling, true positives versus false positives, you can certainly do a screen to see if your mutations generally look real or don't look real. If a bunch of
them don't look real, then you might want to adjust the parameters or increase the stringency of some filters, and so this can tell you whether your parameters are tuned well. If your mutations look real, they're likely real, but I would still validate them. So whenever we have a cohort of patients and we've identified some mutations of interest, we will go back to the DNA and do an assay to validate them in the initial sample, especially if it's a subclonal mutation. And so it depends on what your purpose is, but if you're going to make a mouse from that mutation, I'd validate it first. Yeah, if you're going to do any in vitro work, before investing all that time and effort I would certainly validate, unless it's a very obvious or known mutation. I mean, a lot of calls from Strelka, if they look heterozygous and high quality and they're supported from both strands and so on and so forth, will very likely validate, and then it would be up to you whether you want to spend the time to validate or go directly to another study. But IGV can give you an idea of how your data looks in general and whether your mutations look real.
So, functional annotations. Okay, so I want to say a little bit about mutations versus polymorphisms. We talked a little bit about single nucleotide polymorphisms. These are mutations too, so they are changes in the genome. The difference between polymorphisms and the single nucleotide variants that we consider mutations is that polymorphisms are present in the germline of a significant fraction of the population, so more than 1% is sort of the rule-of-thumb threshold. The idea is that if these changes are deleterious to the fitness of the organism in general, or if they're going to cause a big phenotype, they will be selected against and become rare. So anything that's very rare in the general population is more likely to have a functional impact than something that is
very prevalent. Those things that are prevalent could be neutral or advantageous depending on the population structure. A lot of these polymorphisms can also be associated with disease susceptibility or drug responses, and so they will differ, for instance, between populations, and many of them, like I said, are found in the germline, so they're in every cell of the body. Mutations are infrequent, potentially harmful, usually associated with disease, and often somatic. When they're germline they're rare, and they lead to syndromes, because a person would have a germline predisposition for a particular cellular defect, and so they would be at less than 1% of the population, and I already talked about that. So where do we find these polymorphisms? Well, from databases like the 1000 Genomes Project. So this was a large-scale international research effort that was launched to establish the most detailed catalog of human genetic variation by sequencing the genomes of at least a thousand people. So these are anonymous participants that were volunteers from different populations across the world. So this catalog of human genetic variation can be used for things like association studies, to relate genetic variation specific to a population to diseases in that population, for instance. And so this figure shows the distribution of populations from which this data is derived, and just a couple of relevant findings. For instance, the size of these pie charts corresponds to the number of variants in that particular population. So there are lots of variants in the populations of African ancestry, and actually people of African ancestry have the most variants that are specific to those populations, so not found elsewhere in the world. Also in the pie charts you can see that there is a significant proportion, this dark gray area, of variants that are shared by people across the world. So we have a common ancestry and we actually share a lot of
polymorphisms. And then in the lighter gray we see variants that are shared by a subset of the continents, and then over here in these colors, variants that are shared just by specific populations. And so you can see how you can start to associate variants specific to a population with perhaps diseases that occur in that population. And so one of the findings from this is that there is a tremendous number of variants. Everyone has millions of these variants in their genome, and actually each one of us carries an average of about 300 variants that cause loss of function in protein-coding genes. And so clearly this loss of function in many of our genes is not deleterious, because we are still walking around even though we are carrying all these loss-of-function alleles. We also carry around between 50 and 100 variants that have been previously implicated in inherited disorders, and yet most of us are not affected by any of them, and so on and so forth. And so annotating the mutations that you see in a cancer with this type of information is pretty important, because let's say that you find in your tumor sample that you have a mutation which actually turns out to be a loss-of-function mutation, and it looks very interesting, it's somatic, but it's also in 50 percent of the world's population. So you would not want to prioritize that variant for any further analysis, or you would want to decrease the prioritization of that variant in favor of something else. Okay, so I added a couple of slides that are not in your slide deck. Has anyone heard of companies like 23andMe? Yeah, lots of people. Has anyone used a company like 23andMe? A couple of people. So this is a kind of a way to use that type of information. And so essentially they collect saliva in a tube: they send you a kit, you send it back, and they do what is essentially a SNP array, so they do genotyping at a few hundred thousand positions, and then they
generate a report. It costs 200 dollars. They generate a number of different kinds of reports. They can work out your ancestry, because a lot of your SNPs will match up to those known populations in the world. They can generate a wellness report, for instance, and a genetic health risk report, or a number of reports. So the genetic health risk reports include things like: do you carry variants that would put you at risk for a certain kind of heart disease, like atrial fibrillation or something. And before you read such a report you have to say, yes, I understand that having a variant that increases my risk compared to the population from which I am derived, because they worked out your ancestry, doesn't necessarily mean I'm going to get this disease. It means that this variant is associated with this disease. And then they give you the lifestyle factors you could change to affect your risk, and so on and so forth. And so here are some example reports for, I think, a theoretical person. This is Jamie K. Jamie is half native Indian and half Eastern European, so you can see that from the ancestry composition, with a bit of mixture from other populations. He is not at risk of late-onset Alzheimer's disease, because he does not carry the variants that would be associated with that risk, and genetically he's predisposed to weigh about 10% less than average. So he's genetically a skinny guy, but of course if he eats burgers every day of his life he will not be a skinny guy. And so part of this information tells you what your genetic predisposition is, and then of course your lifestyle factors will combine on top of that. So that's the kind of information you get from 23andMe, and it's just an example of how you would use this information to tell something about populations or individual people. And so this kind of information is collected in databases of variants. So dbSNP has a lot of the annotations from 1000 Genomes. The way
that data flows into dbSNP is from research labs, sequencing centers and other databases. And so the highest quality data in dbSNP will be from the large consortia like 1000 Genomes or HapMap. And then they actually curate these and come out with... it's very hard to read on my screen, one second. Yep, okay. So for every SNP you know what the alleles are (A, C, G or T) and you know what the flanking DNA is. These are things that anyone who submits SNPs to dbSNP has to provide: what the individual genotype was in each person, what the population of the individual was, if known, the population-specific allele frequency, and so on and so forth.
And so the latest version of dbSNP, dbSNP 150, has about 130 million SNPs that have a known frequency in human populations. There are way more SNPs in there that came from individual labs that sequenced a sample, found germline polymorphisms and submitted them to dbSNP, but there's no information on the frequency of that particular variant in the population. So those are not variants that we would use for cancer analysis, for instance.
Mostly this data is pretty clean. Estimates are that about eight percent of SNPs could be false positives, because a SNP could, for instance, be found by a PCR assay that uses primers that also hybridize elsewhere in the genome, to a closely related but slightly different sequence, and then that difference will show up as a SNP. That's generally not the case for these population resequencing efforts.
And then the other thing I want to mention about dbSNP that you should keep in mind is that at some point dbSNP became contaminated with clinical variants. So it doesn't just contain variants that are found in the general population; it also contains variants seen in people with, for instance, early-onset disease of various sorts. And so if you're going to screen out anything in your
cancer sample that has been seen in the general population, you probably don't want to screen out those events that have been previously associated with a clinical phenotype. And so when you get data from dbSNP, there is this non-flagged version, which does not include the SNPs that are flagged as clinically associated or that are very, very rare in the human population, because those are probably okay to keep if you find them in your data.
Okay, so that's it for dbSNP. Any more questions about that? Yes, exactly. You could find a somatic mutation that is a known variant in the population, and if 50% of people have it, then that should weigh a bit in your estimation of the functional impact of that variant. It could be that in that tumor type, or that cell type, in that cellular context, it makes a difference. But if you have another functional event that is not in dbSNP, you might want to prioritize that one first to annotate. Yeah. So yes, usually we exclude things annotated in the germline, although you could annotate germline mutations in your sample with dbSNP non-flagged, and anything that's left over is a personal, private mutation of that person that could be associated with their disease. So you would use this information to annotate your variants and then make decisions based on that, depending on the type of analysis or cohort you have. It's a good point.
Okay, so another database that we often annotate with is COSMIC, the Catalogue of Somatic Mutations in Cancer. I encourage you to check out the website. One of the useful pages in COSMIC is this Cancer Gene Census list. So this is a census of genes that have been found to be mutated frequently in cancer. So you can annotate your variants with dbSNP identifiers and you can also annotate your variants with COSMIC identifiers. And so if your mutation is found in COSMIC, it has been previously associated with some cancer sample, and so those are probably ones you might
want to take a look at. And then in COSMIC there's a lot of information for genes, some of which we know a lot more about, like BRCA2. So this is just an example where we see that BRCA2 is actually one of the genes that is known to drive a cancer hallmark. It's associated with two cancer hallmarks, we see here: genome instability and mutations, and escaping programmed cell death. The role of this gene in cancer is as a tumor suppressor gene. So whenever possible, genes are annotated in COSMIC as tumor suppressor genes or oncogenes, if known. If they don't have an annotation, then that hasn't been figured out yet. And then there is anything else that is known about this gene, like the processes it's involved in and so on. And then in a different page you can pull up all the different kinds of mutations that have been found for this gene. So in general, missense substitutions are how this gene is inactivated.
Okay, so the way we apply these annotations to a sample is using ANNOVAR, or alternatively SnpEff, but ANNOVAR is the one we're going to use in the lab, so today you'll have a chance to use this. Basically ANNOVAR annotates variants in three ways. There is a gene-based annotation, where for every mutation, if it falls in a gene, it tells you if it's protein coding and what amino acids are affected. It can use RefSeq genes, UCSC genes, Ensembl and so on, whatever version of annotations you have in your project. You can also use a region-based annotation, so you can annotate variants in conserved regions, transcription factor binding sites, and so on and so forth. And then there is the filter-based annotation, which is where you would flag any mutation in your sample that is also in dbSNP or 1000 Genomes or COSMIC and so on. And this includes predicting the effect of a mutation, so whether a mutation is damaging and how damaging it is, using a number of tools. So all these are available as databases. You can go to this link; it's actually hard to find if you
google it or if you go to the ANNOVAR site, so I put it here. If you go to this link, you'll see all the databases that you can download and annotate with using ANNOVAR.
Okay, so at this point I want to move on and talk about what we're actually measuring and how we get to cancer cell fractions. What we measure when we count reads is the variant allele frequency, the VAF. So if you have 10 reads and three of them support your mutation, that would be a VAF of 0.3. What we eventually want to get to is the cancer cell fraction and this less used term called multiplicity. And so I'll talk about these in the context of this diagram.
So imagine that we have a tumor which has two copies, so a diploid genome, and it has a heterozygous mutation. Every cell in this tumor has a heterozygous mutation. The purity is a hundred percent; there's no normal contamination. The ploidy is two. The mutation multiplicity, so the mutated copies per cell, is one. And our VAF, our variant allele fraction, is going to be around 0.5. That's because there are one, two, three, four, five, six copies of this location in our sample and three of those copies are mutated, right? So three of these six are mutated. Our cancer cell fraction is one, because this mutation is present in every cancer cell. That makes sense, right?
Our VAF is significantly impacted by normal contamination, so here's the effect. Let's assume now that our tumor is only 67 percent pure. So these are the same cells, but now we have 67 percent tumor purity and 33 percent normal contamination. Now we still have six copies of this locus, but only two of them are mutated, so our observed variant allele fraction is two out of six, or 0.33, because a lot of sequencing reads will now come from these normal cells. And so we still have a ploidy of 2n but a lower purity, the same number of mutated copies per cell and the same cancer cell fraction, but a significantly decreased VAF. And so when we do mutation calling, what's
reported is the VAF. Ideally you would have purity and ploidy as well, so that you can correct your VAF.
When we have a very pure tumor (so the estimated purity of this particular tumor is almost a hundred percent; this is a medulloblastoma) we usually see a distribution like this, where a number of mutations have a VAF close to one. So they're homozygous mutations; they're in every copy of the chromosome in every cell. There are also mutations at a VAF of 0.5, like I showed you on the previous slide, which are heterozygous mutations in every cell. There are also a number of mutations that are subclonal, so their variant allele frequency is less than 0.5, so they must not be in every cell. If they were in every cell, we would see them either here or here. And so we see subclonal mutations in medulloblastoma.
What happens to the signal, this range of peaks, when we have impure tumors is that it all gets squished to the left. So now it's much harder to say which mutations are homozygous clonal, or heterozygous, or subclonal, because these distributions, instead of being nice and clear and separated, are now overlapping. So you can't necessarily tell a subclonal mutation apart from a clonal heterozygous mutation in a case like this. And really what you'd want to do in this case is account for purity by scaling the VAF up by this proportion, right?
Okay, so how does ploidy affect our measurement? So this is ploidy and also the mutation multiplicity. In this case we have the same purity, 67 percent, and we have a tetraploid tumor. So now we have one, two, three, four, five, six, seven, eight, nine, ten copies of this locus in our sample. On the left here we have a mutation that very likely happened before the tetraploidy event, so it's present in three copies in each cell. So we have a total of six variant copies and 10 total copies, so our variant allelic fraction is three... the mutation multiplicity is three, or
sorry, our variant allelic fraction is 0.6, and the cancer cell fraction is again one: these mutations are present in every cancer cell. And here we see that the mutation happens late. So it did not happen before the whole genome doubling event; it happened later, and it happened only in a subset of cells. So the mutation multiplicity is going to be one, the allelic fraction is now one out of 10, and the cancer cell fraction is 0.5.
So cancer cell fraction, multiplicity and ploidy are associated with the timing of mutations. Here we see an example from medulloblastomas. On the left you can see some diploid tumors with heterozygous mutations. So these are those VAF plots, and we see that most mutations are at a VAF of 0.5, which basically corresponds to a tumor kind of like this. And doing a FISH analysis confirms that all of these are diploid tumors. In contrast, tetraploid tumors have this pattern where a mutation happens after the tetraploidy event and therefore it's only present in about 25 percent of the chromosomes. So the mutation allelic frequency can often tell us something about ploidy.
Okay, and here we can see how VAFs, here on the left, can be corrected for purity and copy number, and then classified into those that are clonal and present in every cell and those that are subclonal and only present in a subset of cells.
Okay, so we can calculate CCFs, the cancer cell fraction, if we have the VAF, the purity and the copy number at that locus. So this is how we would do it. We have a term here that estimates the contribution to the signal from the normal diploid cells, and here the effect of the copy number at this locus, and here the effect of purity on the VAF. So basically for an example like this, it would look like this. We have a purity of 0.67, so every purity term is replaced with 0.67, our ploidy is
four at this particular location, and our allelic fraction is 0.1, so we can see that the cancer cell fraction is 0.5. So half the cancer cells contain this mutation.
In practice, when calculating CCF you'll often see numbers that are higher than one, and that's because VAFs aren't perfect, right? There is an error associated with how many reads you will have that support the variant versus not the variant. Purity is also an estimate, right? And copy number is also an estimate. So these values are estimated by tools like TITAN, and VAFs are estimated by tools like MuTect and so on. And so what I'm showing you here on the right is the CCF at time point one versus time point two in a particular tumor. What's cut off of this graph is actually a big gray circle that shows that a lot of mutations are actually at a CCF of one, so in a hundred percent of cells, at time points one and two. The graph is cut off, but basically you will see CCFs greater than one.
Does that mean that it's generally overestimated? Well, you will see this distribution around the true CCF. So if you truly have a heterozygous variant and you sequence 10 reads because your coverage is low, you might find seven reads that support it out of 10, or you might find three out of 10, so you could estimate on either side. And then for the purity and ploidy you also have error in estimation. So actually it's not just always overestimating; sometimes you're underestimating the cancer cell fraction.
And so basically the CCF is often used to infer clonal dynamics. You would see this used in a scenario like this, where you have two different time points, or you have two different parts of a tumor, and you want to know if the cells that contain this mutation are more prevalent in one or the other sample. So you're tracking cell populations. And so you'd look for things like this red...
What this red oval represents are mutations that go along together at the same cancer cell fraction. At time point one they're found at roughly 10 percent, plus or minus some percent, and at time point two about 75 percent of the cancer cells have these mutations. And so you'd want to look for patterns like this that don't fall on this diagonal line. So whenever you do time point analysis or regional biopsies, you would be looking for events like this, and you'd want to use the CCF instead of the VAF, because different samples will have different purities and possibly different copy numbers, and the stochastic nature of how you sample reads will lead to possibly different VAFs.
Okay, so it's clear from many examples in the literature that the presence of subclonal mutations is a relevant metric in tumor biology, and in many cases having subclonal events is associated with poor outcome. So here we see this for some CLL samples. All these cases on the left have some subclonal drivers, so CCFs much less than one, and all these cases on the right do not, and there's a big survival difference between these patients. Okay, and this is the case, obviously, for more types of disease than just CLL.
Okay, so how well can we detect these important subclonal events? Well, that depends of course on our purity, our copy number and the sequencing coverage. So this is a plot from the ABSOLUTE paper that I mentioned yesterday, and basically we see curves that correspond to different combinations of purity and copy number. So let's say our tumor is this particular green point here. Our tumor has a copy number of six at the locus of interest and a purity of 0.5. If we have a sequence coverage of 30x, which is typical for a whole genome, then we would have a detection power of about 80 percent to find this mutation. If our copy number was two, so we moved over on the left here to this pink curve with
the same purity level, then we actually have a detection power of 80 percent with only 20 reads of coverage. And so you can see how tumor purity and copy number will affect our ability to detect these variants. And for whole exome sequencing you actually have read coverage up to... well, it depends how you do exome sequencing. Whole genome sequencing is typically around 30x, and some people do even up to 50x. But if you're looking for mutations that are at, let's say, 0.2 frequency, so in 20 percent of cells, so these are subclonal mutations, you don't actually have power to detect them with this amount of purity and copy number unless you had 250 reads; that would give you enough power to detect a subclonal event. So that's even above what most people do for whole exome sequencing. Some people do very much deeper exome sequencing than we've done in the past, or others have done. And so how much power you have to detect subclonal events really will depend on the amount of depth.
Okay, so just to wrap up in a few slides. Finally we have a set of mutations that we think are real. We filtered them, we annotated them, their VAFs are hopefully transformed to CCFs. So what now? That really depends on the type of cohort you have and the goals of your project. And so some general follow-up activities could include, for instance, trying to infer which mutations are important, so which ones are functional. Often we can do that by looking at their recurrence: are mutations significantly over-represented in your cohort compared with what you would expect? You could look at patterns of clonal evolution if you have the right kind of data. You could look at mutational mechanisms, so you could look at the signatures, for instance. And you can look at association with clinical variables like subtype or survival or metastatic potential, and this is something that you'll do in later modules. So, interpretation of mutations. So
this is looking at the frequency of mutations in the population with tools like MuSiC. These will predict, for instance, if the frequency of the mutation is more than you would expect by chance. And certain features of genes or of a genome will affect how often you would expect to see a mutation. So for instance, there are genes like titin that are huge, and if you sequence any sample you will find mutations in titin. It doesn't mean that titin is important in that analysis; it means that we have to correct for gene length, because the longer a gene is, the more likely you are to just get random passenger mutations in it. And so you can use these kinds of tools depending on your cohort, so if you have a large cohort or if you have a good case-control study.
Alternatively, we might look for mutations that are, for instance, early initiator events versus later maintenance events, or perhaps events that promote metastasis. So this is a study where this group, Marco Gerlinger and team, took multi-regional biopsies from kidney cancers that had metastasized, and they saw that there was convergent evolution, for instance, on certain genes. These genes were mutated in different ways in different parts of the tumor, or in different lineages of the tumor, and so clearly there is some constraint and selective pressure for specific types of mutations. So if you have this kind of data, you'd want to look for convergence on specific pathways, and you'd perhaps want to look at those events that are early versus late.
You can interpret your mutational profiles using the mutational mechanisms that we talked about. So perhaps you might find that your cancers have a specific etiology; you might want to look at the mutational processes that have generated them. And finally, a really good metric, or an important aspect of testing for functionality of mutations, is to look at the
impact of the mutation on expression of the downstream pathway. This is one way to look at functional impact. So for instance, this is uterine cancer data from the TCGA, where all these cases have a mutation in the beta-catenin gene, CTNNB1. So these are all mutant, and mutations in this gene are typically associated with activation of this pathway. Here we see the genes in the pathway that are activated or inactivated, and they show up as red or blue, and we can see that a subset of the samples, despite having this mutation, don't have activation of the pathway. And so these are likely passenger events and not actually driver events. So you can start to separate those cases that have functional mutations from those with passenger mutations by analyzing your data in the context of expression, or perhaps methylation or other measurements. And in this particular case it looks like the tumors that did not have a functional CTNNB1 mutation were ones that were POLE mutated, so that's a DNA polymerase proofreading deficiency. Those tumors are ultra-hypermutated: they have mutations in many, many different genes and a huge burden of mutations. So just because you see a mutation in a gene in a hypermutated tumor, it doesn't mean it's a functional mutation.
Okay, so I think that's all I'm going to say on mutations. Unless you've got questions, we can have some coffee and then do some hands-on. Yes, question? So, if you don't have a matched normal, you will identify mutations that are somatic plus the mutations that are germline, so then it will be very difficult to actually get just the subset of mutations that are somatic. You can try to eliminate anything that's in dbSNP, for instance, because lots of people will have those events, and those will be a lot of the germline events. But you have three million germline events and, you know, a handful of functional somatic events, maybe 20 or 100. So you could do it,
you could try, and then depending on how lucky you are and on your disease, you might converge on something, but I would say the chances are pretty low, unless you know that your disease is potentially driven by something that is also found in the germline, like a p53 or PTEN germline deletion, or SUFU or PTCH (patched) mutations. There are certain mutations where it doesn't matter whether they're germline or somatic; they will be causative.
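
To make the purity, ploidy and multiplicity arithmetic from the cancer cell fraction discussion concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not necessarily the exact expression on the slide: normal cells are assumed to carry two copies of the locus, the function and parameter names are hypothetical, and the relation used is the commonly quoted one in which the expected VAF is the mutated copies divided by all copies in the sample.

```python
def expected_vaf(purity, tumor_cn, multiplicity, ccf, normal_cn=2):
    """Expected variant allele fraction for one locus.

    purity       : fraction of cells in the sample that are tumor cells
    tumor_cn     : total copy number of the locus in tumor cells
    multiplicity : mutated copies per mutated tumor cell
    ccf          : fraction of tumor cells carrying the mutation
    normal_cn    : copy number in normal cells (assumed diploid here)
    """
    mutated_copies = purity * multiplicity * ccf
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutated_copies / total_copies


def ccf_from_vaf(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Invert the relation above to estimate the cancer cell fraction.

    Because vaf, purity and copy number are all noisy estimates, the
    result can come out above 1; it is deliberately not clamped.
    """
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


if __name__ == "__main__":
    p = 2.0 / 3.0  # the "67 percent" purity used in the lecture examples
    print(expected_vaf(1.0, 2, 1, 1.0))  # pure diploid, heterozygous, clonal
    print(expected_vaf(p, 2, 1, 1.0))    # same tumor with normal contamination
    print(expected_vaf(p, 4, 3, 1.0))    # tetraploid, early mutation
    print(expected_vaf(p, 4, 1, 0.5))    # tetraploid, late subclonal mutation
    print(ccf_from_vaf(0.1, p, 4, 1))    # back from VAF to CCF
```

With a purity of two thirds, the four forward calls reproduce the VAFs of about 0.5, 0.33, 0.6 and 0.1 from the diagrams, and the last call recovers a cancer cell fraction of about 0.5. The multiplicity has to be supplied, or inferred separately, before the VAF can be turned into a CCF.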