 Good morning, everyone. So today we're going to talk about mutations, somatic mutations in cancer, the idea of drivers versus passengers, oncogenic events, loss of function events, just like we talked about yesterday for copy number variations. We're going to talk about mutation classes and rates, as well as signatures. And we'll see some examples of somatic mutations that have clinical relevance. And then in the second part of the talk, we'll talk about some statistical considerations for modeling variant allele frequencies from next-gen sequencing data, as well as some analytic approaches that will go over kind of the basics in this lecture. And then you guys will have a chance to tackle in the lab. And finally, just a brief overview of how you might want to interpret some of your mutations. Okay, so yesterday we discussed this idea that cancer is an evolutionary process and it conforms to the rules of Darwinian selection. So today I just wanted to add a couple of more definitions, specifically that cancers arise as a result of driver mutations. So there are these initiating events. And it's thought that it takes between two to eight mutations, on average, to initiate a tumor. So here we see this green cell that became a black cell. It took at least two to eight mutations to get there. And so this process can happen over many years. In pediatric cancers, we don't see that many driver mutations, but there are probably other epigenetic events and so on that help to drive those tumors. But in adult tumors, that's the ballpark number. And so these provide a selective growth advantage. And then we see other mutations crop up, which are important in maintenance of the tumor. And we have this heterogeneity within the tumor, where some clones have a preferential growth advantage. And then after selective pressure, like chemotherapy and treatment, we see that subset of clones result in progression of the tumor. 
And what next-gen sequencing data and really deep genomic analysis have let us understand in the past few years is that these clones that give rise to progression are often present up front. So there is this big diversity of mutations in the primary tumor because mutations are occurring all the time with every cell division. And they may look like passenger mutations that have no selective advantage in the primary tumor. But in the right context, they're going to give those cells an advantage.
So somatic mutations, I'm sure you guys actually know quite a bit about this already, but we'll just go over the basics. So this is p53. It's a tumor suppressor gene. Genetic defects in this gene allow cells to evade programmed cell death and DNA repair. And so these are small changes. Yeah, one question? Okay. So let me take another sip of coffee. Okay. So yeah, these are in contrast to yesterday, where we looked at events that are hundreds to thousands of base pairs long. These are typically very small events. So one base pair, or just a few base pairs for indels. So you can see here that somatic events are changes that happen only in the soma, so in the tumor cells and not in the germline of the individual.
There are three main classes of mutations in the coding regions of genes. These include missense mutations. So these are single base substitutions that alter the amino acid of a protein. And often these occur in the first or the second position of a codon, which is a group of three bases that are recognized by the cellular machinery as a unit. And each codon is matched with a specific amino acid during translation, so during mRNA translation. There are also silent or synonymous mutations, which often occur in the third position of a codon, where there's a lot of redundancy because there are many more codons than amino acids: there are 64 codons and 20 amino acids. So many codons encode the same amino acid.
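To make the codon redundancy concrete, here is a minimal sketch in Python that builds the standard genetic code and labels a single-codon change as silent, missense, or nonsense. The function names (`classify_substitution`, `synonymous_counts_by_position`) are made up for illustration and are not from any particular annotation tool; real annotators also handle transcripts, strands, and splice sites.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"    # new premature stop codon
    if ref_aa == "*":
        return "stop_loss"   # stop codon replaced by an amino acid
    return "missense"        # different amino acid


def synonymous_counts_by_position():
    """Count single-base changes that keep the amino acid, per codon position."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                alt = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[alt] == aa:
                    counts[pos] += 1
    return counts


if __name__ == "__main__":
    print(classify_substitution("GGA", "GGC"))  # third-position change in a glycine codon
    print(classify_substitution("GTG", "GAG"))  # valine to glutamate
    print(classify_substitution("TGG", "TGA"))  # tryptophan to stop
    print(synonymous_counts_by_position())
```

Running `synonymous_counts_by_position()` shows that the large majority of synonymous single-base changes fall in the third codon position, which is the redundancy being described here.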
And basically this third base is redundant, so if they do not change the protein, these are termed silent mutations and they don't have a functional impact. And then we also have these nonsense or truncating mutations that introduce a premature stop codon in the protein, and they essentially truncate it and can affect its function.
And then another class of mutations are small insertions and deletions. So here we see the normal sequence of an mRNA, and when you have an insertion of either one or two base pairs, you change the reading frame. So a different amino acid is now going to be encoded at that position, and not only at that position, but at every following position. So you're basically putting the rest of the protein out of frame when you introduce an indel of one or two. And when you introduce an insertion or deletion of length three, then you're just adding or deleting one individual amino acid. Just a note, accurate detection of indels is actually pretty tricky. Next-gen sequencing has a number of biases that make this challenging, and so it's an ongoing computational challenge to accurately predict indels.
So just like copy number alterations, mutations or indels have typical patterns of frequencies that differ between tumor suppressor genes and oncogenes. So here we're looking at four genes, the top two are oncogenes, and you can see that each one of these little glyphs, these little symbols, represents a missense or a truncating mutation in a specific tumor. And you can see that for the oncogenes, most of the mutations are focused in specific spots. So these are important domains, for instance in PI3K, the helical and the kinase domain. So when you see this pattern of hotspot mutations, that's a really good indication that there's positive selection for this mutation. And these missense mutations typically change the amino acid of the protein, and so the new version of the protein has a gain of function or a switch or alteration of function.
And often these kinds of changes cause a protein to be constitutively active, so you would no longer need signaling from upstream molecules in order to activate this protein, it's just active without this additional input.
And the second pattern is genes that show this loss of function pattern. So these mutations shown in red are missense, and the ones on the bottom shown in black are truncating mutations. And these change the protein in such a way as to reduce its function. So, for instance, by preventing binding to obligate partners or by truncating and shortening the protein. And you can see that the distribution of these is spread throughout the gene. So there are many, many ways to inactivate something or break something, and there are just a few ways to activate something. So those are the two main patterns of mutations that we see distributed across these two types of genes.
So consortium projects like the TCGA have aimed to sequence thousands of tumors, large cohorts, in an unbiased way, and find all the mutations in tumor suppressors and oncogenes. And the goal is to find what events cause phenotypic attributes or those hallmarks of cancer, so that we could perhaps explain how these hallmarks are acquired. And so we'll just take a quick look at a summary from one of these studies. So this is a synthesis analysis of the mutational landscape of tumors of many types. So we see the types of tumors here on the bottom, from many tissues of origin and in patients across different age ranges. So in general, it looks like there are about 120 to 140 genes in total that have driver mutations that have been defined in this way. And we'll look at a couple of them as closer examples in a minute. But in any given tumor, like I said, there are between 2 and 8 acquired driver genes that change the phenotype of a normal cell to a malignant cell progressively with acquisition of each mutation.
But these are not the only mutations in the tumor, like I said. Basically there are between 33 and 66 genes with protein coding changes. So a big challenge is to tell the drivers apart from the passengers. There are some outliers in this range. These mostly include melanomas and lung cancers. So these are cancers whose etiology involves really potent mutagens. So they have hundreds of mutations instead of tens. And they're only really out-mutated by tumors with defects in DNA repair, which can accumulate thousands of somatic events. And then at the other end of the scale, we have these pediatric tumors that typically have very few alterations. And then our final point is that it looks like mutations in tumor suppressor genes predominate over oncogenic activating mutations. So that has clinical implications because, as we talked about yesterday, it's a lot easier to target oncogenes and disrupt their function. And it's a lot more difficult to add function back to genes that have lost it. But synthetic lethal approaches, like the BRCA and PARP dual inhibition, are one way to get around that.
So these studies show that by far the most frequently mutated gene in cancer is p53. So this gene is involved in regulating programmed cell death and DNA repair. And if there are significant genomic abnormalities in a cell, p53 will prevent the cell from completing the cell cycle and cause cell death. So it's a built-in safety system. And so mutations in this gene allow cells to survive despite acquiring massive genomic rearrangements and mutations, and to proceed through the cell cycle. And in many tumors, this is an early event. But in some tumor classes, this is an event that happens later on, once the genome is starting to become more unstable and to accumulate a lot of changes. So this is a view from cBioPortal. I think you guys will have a chance to go over cBioPortal in some detail in one of the labs.
So basically it's a repository of data from TCGA and big consortium efforts. And it's data that has been processed and analyzed. So if you have a favorite gene and you'd like to know what its distribution of mutations or alterations is in cancer, this is a really good resource for that. So the URL is down here on the right. But basically this plot shows the frequency of p53 mutations across many types of cancer profiled by TCGA. So you can see that in certain cancer types almost every tumor has mutations in this gene. And then there are some cancers that have a lower frequency of mutations in this gene. But I encourage you, if you ever need corroborating data for your experiments, to look at cBioPortal.
So the other 120 to 140 genes mentioned as drivers are mutated significantly more than expected by chance. These genes are shown here as rows. And they're organized based on their involvement in a wide range of cellular processes, which are broadly classified in about 20 categories, like transcription factors and regulators, histone modifiers, genome integrity, receptor tyrosine kinase signaling, cell cycle, and so on. The types of cancer that have been profiled are the columns. So one thing you can see is that some genes are mutated across many different cancer types. And here's p53 with the strongest pattern for that. So that is really one of the most frequently mutated genes in cancers. Whenever we sequence some tumor and we look at what's mutated in this tumor, inevitably we find p53. So it's not very exciting in a way, but also it's not very surprising. So it's good because it makes sense of tumor biology. This is the VHL gene. It's only mutated in kidney cancers. So there's something tissue-specific and kind of exquisitely context relevant about this gene that makes it oncogenic in just this tumor type. We see KRAS, highly mutated in colorectal cancers.
And PI3K, again, exhibits mutations across many different tumor types. And APC, again, is exquisitely specific to colorectal cancers. So there are many cellular and enzymatic processes involved in tumorigenesis. And a lot of these we knew about before, like the MAPK pathway, PI3K, and Wnt signaling. But a lot of these represent new categories that weren't appreciated as being highly mutated in cancer. So things like splicing or transcriptional regulators or metabolism and histones. So these are kind of opportunities for developing new therapies. And these mutational themes would not really have been obvious without an unbiased and large survey of cancer data. So that's really the power of these kinds of studies.
So there's a lot of clinical utility in knowing the status of mutations in particular genes, because some mutations define a cancer. So, for instance, p53 mutations are expected to be in every case of serous ovarian cancer. If you don't see a p53 mutation in those tumors, it's very likely that there's been a misdiagnosis. So in some cases, mutations are diagnostic, just like the BCR-ABL translocation in CML. It's not only diagnostic, but every patient that has that translocation can be treated with Gleevec. So there are lots of companion diagnostics emerging for therapeutics. The most common are probably for EGFR mutations in lung cancers. For melanomas, there are specific tests for mutations in BRAF. Mutations in KRAS in colorectal cancers are used as a contraindication for a specific drug, because we know those patients will not respond. So you can use mutations to assess response as well as lack of response. And mutations are also emerging as markers of drug resistance. So secondary mutations in EGFR indicate resistance to anti-EGFR therapy. So for those patients that are undergoing therapy and have this specific T790M mutation, you know those patients will recur and you need a second-line therapy ready for when they do.
So just a couple of examples. This was an important discovery in gliomas, IDH1. This is a metabolism gene. So people kept seeing mutations in this gene and it was really hard to interpret. It's a recurrent mutation and it was very difficult to say how this was a driver event. And finally, people have figured out the mechanism, which is based on a gain of function in the mutant protein, which generates a very rare metabolite. And that metabolite accumulates in cells and is a competitive inhibitor of histone demethylases. So the effect in gliomas is that these mutations increase levels of histone methylation, especially repressive marks. So that causes a block in differentiation. So we see these epigenetic effects as a result of this mutation in a metabolism gene, which is not at all an intuitive interpretation of the mutation data, but that's what turned out to be the effect. So it turns out that IDH1 mutations play a big role in tumorigenesis, and now there's a lot of research activity devoted to studying this process and how we can interfere with it.
Here's another example, FOXL2 mutations. This is from Sohrab Shah's lab. And this came from studying a really rare form of ovarian cancer. And in this case, they performed RNA-Seq and noticed in their first case that there was a particular mutation in this gene. And then they saw that mutation in every other patient that they sequenced. So this is a gene that's a transcription factor responsible for differentiation. So mutating this gene does not allow the cells to differentiate. And it's actually the causative event in this disease. So that's called a pathognomonic event. So that's a mutation that defines the disease. So it's very diagnostic. And it's important for rare cancers to have a diagnostic event like this, because sometimes they're very hard to diagnose and you don't necessarily know what the treatment should be. So this is more of a problem for rare cancers. PI3K we talked about.
But I wanted to bring it up again because I wanted to show you a pathway diagram. So how many people are familiar with pathway diagrams? Maybe just under half. So the PI3K pathway, I'll just point out a couple of things. So each one of these boxes containing the name of a gene represents that protein. And each arrow represents a direct interaction. You can see that in some cases there's a tiny little plus p. That means that there's a phosphorylation event, which is part of the interaction. Or minus p, which means there's a dephosphorylation event. These dotted arrows are indirect interactions. And these lines that end in a perpendicular line are inhibitory interactions. So you can see that mutating a gene at the top of this big cascade would have a big effect, right? Because it's at the top. So it's going to affect every one of these downstream members.
Primarily, PI3K would activate AKT by inducing phosphorylation of this molecule, this PIP2 to PIP3. So we often see that the hotspot mutations in PI3K allow it to engage and carry out this conversion process without input from its upstream regulators. So it decouples it from the upstream regulators. PTEN is the gene that blocks this process, so it dephosphorylates this metabolite. And so we often see deletions in PTEN, right? So these are mutations that are very powerful because they are really high up this hierarchy and they will affect many downstream events. If you see mutations in one of these genes further down, it's less likely to be a driver because its effect will not be as wide as mutations higher up the hierarchy. Oh, and the other thing was that mutations in genes that act in opposite ways, like inhibitors and promoters of a pathway's activation, will often be mutually exclusive. So that's the other test. If you see mutually exclusive mutations in a cohort of patients, that's evidence that they are working to affect the same pathway.
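The mutual exclusivity test mentioned above can be sketched with a one-sided Fisher's exact test on a 2x2 table of tumors (mutated in both genes, in only one, or in neither). This is a minimal illustration using only the standard library; the function name and the cohort counts are hypothetical, and published methods for mutual exclusivity typically also correct for mutation rates and multiple testing.

```python
from math import comb


def fisher_left_tail(both, only_a, only_b, neither):
    """One-sided Fisher's exact p-value for seeing this few (or fewer)
    tumors mutated in both genes, given the marginal totals.

    A small p-value suggests the two genes are mutated in a mutually
    exclusive way more often than independence would predict.
    """
    n = both + only_a + only_b + neither
    a_total = both + only_a   # tumors with gene A mutated
    b_total = both + only_b   # tumors with gene B mutated
    denom = comb(n, b_total)
    lowest = max(0, a_total + b_total - n)  # smallest possible overlap
    p = 0.0
    for k in range(lowest, both + 1):
        # Hypergeometric probability of exactly k co-mutated tumors.
        p += comb(a_total, k) * comb(n - a_total, b_total - k) / denom
    return p


if __name__ == "__main__":
    # Hypothetical cohort of 100 tumors: the two genes never co-occur.
    print(fisher_left_tail(both=0, only_a=30, only_b=30, neither=40))
    # Hypothetical cohort where the overlap matches independence (9 expected).
    print(fisher_left_tail(both=9, only_a=21, only_b=21, neither=49))
```

The first cohort gives a very small p-value (evidence for mutual exclusivity), while the second gives a large one, since nine co-mutated tumors is about what independent mutation would produce.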
So this table basically summarizes some of the important mutations that are currently being tested for, for which there are targeted agents and for which clinicians can prescribe a therapy. So we've talked about most of these, and in the next slide we're just going to see an example of the clinical relevance of the BRAF inhibitors. So specifically the V600E mutation in melanoma. So about half of melanoma patients carry this particular mutation. Amino acid 600 is changed from a V to an E, and that leads to constitutive activation of downstream signaling through the MAPK pathway. So 90% of these mutations are in that particular amino acid, and there's an inhibitor of this form of BRAF called vemurafenib. So the top graph shows patient tumor response to vemurafenib versus the classic therapy or the standard therapy for these tumors. And so what you see on the y-axis is growth in the positive direction or shrinkage of the tumor. So you can see how many patients respond to the classic therapy versus how many patients respond to this targeted therapy. And it was really a big dramatic improvement, and so people were very excited when this came out that now you could use vemurafenib for each tumor that had a V600E mutation.
And so people were excited to use it for colon cancer, which also has a BRAF mutation at this particular position. It's about 10% of colon cancers that would be expected to respond. And what these plots show is that compared to control and the standard of care for these colon cancer patients, vemurafenib, which is the PLX line in this case, made no difference even though these patients are mutant. And that's because colorectal cancer cells have an activated EGFR pathway. So that completely bypasses the signaling that you're blocking by inhibiting BRAF. And in melanoma cells, where this therapy worked really well, there is no EGFR activation. So that's why it works so well in that tumor type.
And so this group proposed that a dual inhibition of BRAF and EGFR would actually work well. And that's exactly what they saw. So this line at the bottom is shrinkage of the tumor or lack of growth of tumors that have dual inhibition. So just the presence of the mutation alone is not sufficient to predict response without the cellular context. So the cellular context actually plays a big role.
And here's just one last example. This is a more recent discovery of mutations in the regulatory regions of genes. So we've talked a lot about coding mutations, driver mutations that alter function, but this one was found specifically in melanoma, in sporadic tumors as well as familial tumors. So two studies came out, I think, in the same issue of Science. So these two groups studied melanoma-prone families as well as sporadic cases, and through linkage analysis and high-throughput sequencing, they found a mutation that segregated in the germlines of these families. And it was in the promoter of a gene, the TERT gene. This encodes the catalytic subunit of telomerase. So this mutation creates a new binding site for a transcription factor, the ETS transcription factor. And it basically causes a big increase in expression of TERT. In sporadic melanomas, these mutations are recurrent. They're found in 85% of metastatic tumors and 33% of primary melanomas. And so somatic mutations in regulatory regions are actually something that is really relevant to cancer and a bit understudied, because there was a big focus on exome sequencing for a long time. And now as whole genome sequencing is getting cheaper, we can actually start to look in more detail and more systematically at characterizing effects in regulatory regions.
Okay, so we talked about events in single genes. What about patterns of mutations across the genome? What can they tell us about the biology of mutations? And there's one property that we haven't really talked about, which is mutational signatures.
So that tells us something about the processes or the mutagenic influences that are causing mutations in a cancer genome. How many people have heard of mutational signatures? Maybe half. Okay, so a great way to think about this is to consider the patterns of mutations. So this figure is from a pan-cancer report, and it shows from the center to the outside the abundance of mutations in each tumor. So it's kind of a non-intuitive but really interesting plot where you have mutations per megabase going from the center to the outside, from 1, 10, 100, and so on. And each dot represents an individual tumor, and around this donut or circle you can see the types of mutations that can accrue. So C to T mutations, C to A mutations, different kinds of changes.
So these black dots, I know you guys don't see the exact same color, but in your printout you would. The dots are melanoma tumors. And these are not only characterized by a very large number of mutations on average in each tumor, but they also have this preponderance of C to T mutations. So does anyone know what would cause this? Yeah, sunlight, UV, UV radiation. So this is the pattern that you always see when your mutagen is UV radiation. So it's a signature for that mutagen. So this process involves the deamination of cytosines and creates C to T mutations across the genome. And it's specific to melanomas. The red dots are lung cancers, and we see these typical C to A mutations. So a similar type of thing is happening there. I'm sure you guys can, yeah, smoking. So tobacco exposure is this particular signature. So these are examples of exogenous factors, and the signature can reveal what mutational insult has happened to these cells at one point. There are endogenous factors as well. And so you can classify these substitution patterns or represent them as these so-called mutational signatures. So I'm showing you here just the first six signatures.
But basically each mutation is analyzed or considered in the context of the base before and the base afterwards. The upstream and downstream base. So these bars show the frequency of C to T mutations in the context of, you know, different preceding and downstream bases. And you can see that there is a big difference depending on the context, so it does matter. So considering this upstream and downstream base generates this list of 96 different trinucleotide combinations. So there are 96 possible mutation types. And we have about 30 signatures at this point. These have been derived from an analysis of the frequency of these 96 mutation types, these trinucleotides, across many thousands of patients from about 40 different kinds of cancers. So this is a study from the TCGA. So each signature, when you look it up, has with it the associated cancer type in which it was found, in which it's prevalent. The proposed etiology for that mutational signature, if it's known; a lot of them are still unknown. And any other mutational features that might be known about it.
Yeah? What would you use if you have your own data? Yes, there are packages where you can input your mutation data from your tumors and have a prediction of the mutational signatures and a p-value for their enrichment and so on. And we will have an example of that in the lab. Yeah? Why are there only six possible cases here? Ah, good question. So the question is why are there only six possible cases here? So C to T. So there are two parts to that question, I think. One is why isn't there a G to A? Right. So the G to A is just the opposite strand of this. And so all the reverse complement events are redundant. And so the list is collapsed from 12 to 6. And then within the C to T, you would see the different contexts. So that red bar is kind of zooming into the C to T.
And you would see that if your C is following an A and right before another C, you are much more likely to have this mutation. And reducing the redundancy really helps because we're already at 96. So collapsing, instead of doubling that and having redundant information, is just a good way to simplify.
So here are some lung adenocarcinomas and some melanomas. And this is the kind of output you would get from one of these mutation signature calling algorithms, where for each tumor that you're considering, which is each one of these bars, you would have the contribution of each signature towards the mutations. So many of these mutations belong to or correspond to signature 7. And there's only one tumor where you see a preponderance of signature 11. But you can see that in every tumor more than one mutational process is happening. And so when you're comparing between different cancer types, you can see these big dramatic differences. But even within patients of the same cancer type, you would potentially see differences. Lung and melanoma are sort of the extreme cases, they're really just driven by smoking and UV. There are other tumors that don't have a single mutational driver like this.
This is the kind of information you can get from the COSMIC site for these signatures. So here's signature 4. You get the cancer types where this has been found and the proposed etiology. This one is associated with smoking. You get what the similar patterns are, what the likely mutagens are, and any additional mutational features, which we've already talked about. So when you run mutational signature analysis on your samples, you would then go to COSMIC and browse the list for your top hits. So I encourage you to explore COSMIC as a resource for reading about these different signatures.
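The collapse from 12 substitutions to 6, and the resulting 96 trinucleotide mutation types, can be sketched as follows. This is an illustrative Python snippet, not the code of any signature-calling package; the `U[REF>ALT]D` label format and the function names are assumptions chosen for readability.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# The six substitution classes, always written from the pyrimidine (C or T) strand.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
# 6 substitutions x 4 upstream bases x 4 downstream bases = 96 mutation types.
ALL_CHANNELS = [
    f"{up}[{sub}]{down}"
    for sub in SUBSTITUTIONS
    for up, down in product("ACGT", repeat=2)
]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutation_channel(upstream, ref, alt, downstream):
    """Map a substitution plus its flanking bases onto one of the 96 types.

    If the reference base is a purine (A or G), the event is rewritten on
    the opposite strand, so that G>A in context TGA becomes C>T in TCA.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in "AG":
        upstream, downstream = (
            reverse_complement(downstream),
            reverse_complement(upstream),
        )
        ref, alt = reverse_complement(ref), reverse_complement(alt)
    return f"{upstream}[{ref}>{alt}]{downstream}"


def count_channels(mutations):
    """Tally (upstream, ref, alt, downstream) tuples into a 96-channel profile."""
    counts = Counter(mutation_channel(*m) for m in mutations)
    return [counts.get(channel, 0) for channel in ALL_CHANNELS]


if __name__ == "__main__":
    print(len(ALL_CHANNELS))
    print(mutation_channel("T", "G", "A", "A"))
    # Hypothetical mutations from one tumor.
    profile = count_channels([("A", "C", "T", "C"), ("G", "G", "A", "T")])
    print(sum(profile))
```

A vector like `profile` is the per-tumor input that signature tools work from: each tumor is summarized by how many of its mutations fall into each of the 96 types.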
So considering all the different kinds of cancers that have been profiled and all the 30 signatures, it's pretty obvious that some mutational processes are widely distributed and found across tumor types, and others, as we've seen, are relatively specific to particular malignancies. And some have only been observed in one cancer. So do you guys think that this is the complete list of signatures? No. And how would we find more? Right, you can sequence more types of cancers. Sequencing more types of cancers, more types of rare cancers, would reveal additional rare types of mutational insults that could perturb the genome in a way that can be detected with this type of analysis. So I think recently there was a big paper, a Nature paper describing a big cohort of pediatric tumors, and they proposed a new signature, and it's because TCGA does not really have a lot of pediatric cases, right? So different mutational processes are going to be observed at different stages of life.
They're also potentially clinically relevant. This is an example from the Personalized OncoGenomics project run at the BC Cancer Agency. So patients are admitted to this kind of personalized sequencing and analysis of their tumor genomes once they've gone through a lot of therapy and have failed and basically have no other options. So at the time POG was in its first version, where they were just taking these highly treated patients. And so this patient had an abdominal tumor of unknown type. She had three rounds of unsuccessful treatment, was out of options, and was enrolled. Her tumor was sequenced and an analysis of mutational signatures was done, and her abdominal tumor had the classic signature of a breast cancer, which is not something that you would ever treat an abdominal tumor for. You would never treat it with therapies that would be predicted to work only for breast cancer.
And it turned out that this was a rare case of a tumor arising on the milk line, which I never knew about before but it makes perfect sense. So this is the ridge along which mammary tissue develops. I think it's in embryonic week 7, so between 7 and 8, that this ridge develops in all mammals, and that's before sexual differentiation. So before embryos become male or female, which is why males have nipples even though they don't have functional mammary glands. And then the development of these cells regresses and they disappear, except for the ones that become nipples and mammary glands. And so for some people who have third nipples, the third nipples develop along this line. But in her case, some of those cells persisted and became a breast tumor in her abdomen. So once they figured that out, they treated her with breast cancer therapy and she responded and she's tumor free. So it can have really insightful clinical applications in certain cases.
Okay, so some statistical considerations for analyzing allelic distributions from next-gen sequencing cancer data. How do we detect mutations? Let's first revisit the properties of cancer genomes that we need to account for when performing mutation calls. So oftentimes we're interested in those mutations that are not in the germline of an individual, although there are predisposing mutations. For somatic mutation analysis, you would like the tumor DNA to be free of normal cell contamination, and as we know, that's not really possible because tumors are often admixed with normal cells. So this is a property of cancer genomes that we need to account for because it dilutes our biological signals. There's also this intratumoral heterogeneity, which is that cancer is this mosaic of cellular populations that will have different mutations. So we need to account for that when we consider the frequency of mutations. There's genomic instability, like we talked about yesterday.
So copy number events will change the observed allelic fraction of mutations. And so when you're analyzing mutations, it's best to do it knowing copy number. So it's best to do copy number and mutation analysis in parallel. And the ideal experimental design is to sequence both tumor and normal samples from the same patient.
So this is the flow chart for what an analysis might look like. So we start with a cohort of interest that's sequenced. And the first step is to align the reads to the genome, just like we did yesterday. There are numerous tools for doing that and we'll talk about that in a second. These alignments are then input to one or two or more of these mutation calling tools, as well as tools for detecting copy number. So you can have a copy number status at each position in the genome and the purity and the ploidy of each tumor. So now we would have variant allele frequencies for candidate somatic mutations, and we would want to annotate these with gene information, filter out germline events, filter out common polymorphisms in the population, and then be left with a short list of interesting events that we would want to follow up on. So for these interesting variants, we want to transform the variant allele frequencies, so the observed ratio of reads that support the mutation versus reads that support the wild type, into something called the cancer cell fraction, CCF. So we're going to talk about that in a bit more detail because we're going to go through each of these steps in the next few slides. And then we'll stratify mutations into clonal and subclonal and perform some further interpretation.
So first up are alignments. There are many tools for aligning. I think the take-home message from yesterday was that BWA-MEM was the best, maybe marginally the best. So basically all these tools take in your reads, align them to the genome, and try to infer where there are any differences.
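Returning to the cancer cell fraction step in the workflow above, here is a minimal sketch of how a variant allele frequency can be converted to a CCF once purity and local copy number are known. It uses the commonly used relationship between expected VAF, purity, total copy number, and the number of mutated copies per cell (multiplicity); the function names, the default of one mutated copy, and the example values are assumptions for illustration, and real tools model read-count noise rather than inverting the formula directly.

```python
def expected_vaf(ccf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Expected variant allele frequency for a mutation.

    ccf          : fraction of cancer cells carrying the mutation (0 to 1)
    purity       : fraction of cells in the sample that are tumor cells
    tumor_cn     : total copy number at the locus in tumor cells
    multiplicity : mutated copies per mutated cell (assumed 1 by default)
    normal_cn    : copy number in normal cells (2 for autosomes)
    """
    mutant_copies = purity * ccf * multiplicity
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant_copies / total_copies


def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Invert expected_vaf: estimate the CCF from an observed VAF.

    Estimates can exceed 1.0 because of sampling noise or a wrong
    multiplicity; they are deliberately not clipped here.
    """
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


if __name__ == "__main__":
    # Hypothetical sample: 60% purity, diploid locus, heterozygous clonal mutation.
    print(expected_vaf(ccf=1.0, purity=0.6, tumor_cn=2))
    # Same observed VAF interpreted at a locus amplified to four copies.
    print(cancer_cell_fraction(vaf=0.3, purity=0.6, tumor_cn=4))
```

The second call illustrates why mutations are best analyzed with copy number in hand: the same observed VAF implies a different cancer cell fraction depending on the local copy number that is assumed.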
So we won't go over alignment in this module, but the idea is that we reduce the sequence read data to just a set of allele counts, counts that support the mutation versus the wild type at every position of interest. And the positions of interest are those that are different from the reference as well as those that are somatic. So conceptually the way we do this is that we have the normal genome and the tumor genome, and we count the number of reads covering each position and how many of those reads correspond to the reference sequence, which is shown up here in green, versus how many correspond to an alternate allele. So in this particular case these blue bars represent bases that differ from the reference. So this is a heterozygous position: out of seven reads in the normal we see three bases that support a C, and the reference is a G, but in the tumor we also see three out of eight bases that support a C. So it's heterozygous, AB and AB, in both normal and tumor. And same with this position. I think it's a G to a C, so we see basically zero Gs out of seven and zero Gs out of seven. AB, AB, AB, just like we talked about, that's the minor allele. But in this case we see a homozygous A in the normal and then we see a heterozygous AC in the tumor. So this would be the somatic mutation that we're interested in. Yeah? (Audience question about whether the germline variants are of interest.) They might be, so you would look at them if you're interested in those variants. Although if you're looking for somatic events you would not be, because they're germline, so there's no difference between the germline and the somatic, so you wouldn't think it's a driver necessarily, especially in an adult tumor. So here you're just looking for those variants that are specific to the tumor and not found in the normal. So you would not be able to do this analysis without the normal, and without that individual's normal, because every person is going to differ from the reference at something like three million positions at least.
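The tumor-versus-normal counting described above can be turned into a rough classifier. The sketch below is an illustration only: it uses a one-sided Fisher's exact test on the four allele counts, which is one simple way to compare the two samples and is not the statistic used by any particular caller. The function names, the 0.05 cutoff, and the 0.2 germline VAF threshold are all assumptions chosen for the example.

```python
from math import comb


def fisher_right_tail(tumor_alt, tumor_ref, normal_alt, normal_ref):
    """One-sided Fisher's exact test: probability of seeing at least
    `tumor_alt` alternate reads in the tumor if alternate reads were
    distributed between tumor and normal purely by chance."""
    tumor_depth = tumor_alt + tumor_ref
    normal_depth = normal_alt + normal_ref
    total_alt = tumor_alt + normal_alt
    denominator = comb(tumor_depth + normal_depth, total_alt)
    upper = min(tumor_depth, total_alt)
    numerator = sum(
        comb(tumor_depth, k) * comb(normal_depth, total_alt - k)
        for k in range(tumor_alt, upper + 1)
    )
    return numerator / denominator


def classify_site(tumor_alt, tumor_ref, normal_alt, normal_ref,
                  alpha=0.05, germline_vaf=0.2):
    """Label a site from its allele counts (thresholds are illustrative)."""
    if tumor_alt == 0:
        return "reference"
    normal_depth = normal_alt + normal_ref
    normal_vaf = normal_alt / normal_depth if normal_depth else 0.0
    if normal_vaf >= germline_vaf:
        return "germline"  # the variant is already present in the normal
    p_value = fisher_right_tail(tumor_alt, tumor_ref, normal_alt, normal_ref)
    return "candidate somatic" if p_value < alpha else "uncertain"


print(classify_site(3, 5, 3, 4))     # heterozygous in both samples
print(classify_site(12, 18, 0, 30))  # absent from a well-covered normal
print(classify_site(4, 4, 0, 7))     # same pattern, but too few reads
```

The last call shows why depth matters: with only seven or eight reads per sample, even a variant that is completely absent from the normal does not reach the cutoff.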
And so basically there's a statistical component to then interpreting these ratios and assigning a probability that an event is somatic given the allele frequencies in the normal and the tumor. So this is kind of the way the mutation calling strategies work. There are a number of artifacts that are going to generate false positive mutation calls, so it's really important to visualize your data, and these artifacts can be caused by a number of different things. Here we see this T instead of an A. Any idea what could be causing this artifact? You can see that the T has poor quality because the intensity of the color fades. Yeah, exactly. There's a long homopolymer stretch of Ts, and Illumina sequencing kind of has the tendency after a long homopolymer stretch to just add one more of those instead of the next base, so it often makes errors like this. So, substitution errors. This is another example. This is an indel. So here, and you guys have gone over IGV, we see the reads. The gray reads are ones that have a decent mapping quality. These black bars indicate where the aligner has decided that there is a gap. So that read aligns and then it jumps, so there is a deletion relative to the reference, and then it keeps aligning. And often at sites with indels there is misalignment. So what you can see is that for some reads the aligner decided, I'm going to introduce this gap, and that will be a better alignment than not introducing a gap. But often for reads where the end is pretty close, and you don't have a lot of sequence that would anchor the read to the other side of the gap, it decides instead to just keep aligning the read and introduce mismatches. And that's because it assigns different penalties. It assigns a higher penalty to introducing a gap than to introducing single mismatches.
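The penalty trade-off just described can be made concrete with a small scoring sketch. The scores used here (match +1, mismatch -4, gap open -6, gap extend -1 per base) are illustrative values assumed for the example, and the function names are hypothetical. Real aligners also consider other options, such as soft-clipping the read end, which this sketch ignores.

```python
def ungapped_score(matches, mismatches, match=1, mismatch_penalty=4):
    """Score for aligning the read end straight through, without a gap."""
    return matches * match - mismatches * mismatch_penalty


def gapped_score(aligned_bases, gap_length, match=1, gap_open=6, gap_extend=1):
    """Score for opening a gap so that every base past it matches."""
    return aligned_bases * match - (gap_open + gap_length * gap_extend)


def preferred_alignment(overhang, mismatches_if_ungapped, deletion_length):
    """Decide which alignment scores higher for the part of a read that
    extends `overhang` bases past a deletion of `deletion_length` bases."""
    straight = ungapped_score(overhang - mismatches_if_ungapped,
                              mismatches_if_ungapped)
    with_gap = gapped_score(overhang, deletion_length)
    return "gap" if with_gap > straight else "mismatches"


# A read ending 3 bases past a 5 bp deletion: two mismatches are cheaper.
print(preferred_alignment(overhang=3, mismatches_if_ungapped=2, deletion_length=5))
# A read with 20 bases past the same deletion: the gap is clearly cheaper.
print(preferred_alignment(overhang=20, mismatches_if_ungapped=14, deletion_length=5))
```

Under these assumed scores, reads that only just reach past the indel end up carrying mismatches, which is the source of the clustered false SNV calls around indels.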
And so that's how the math works out around indels, and you can get these clusters of SNVs around indels that would go away if you were to assemble this region. And so many tools have a local reassembly step around indels and these positions of clustered mutations to kind of resolve this problem, and they do mutation and indel calling after that reassembly step. And you can see it even down here, this is kind of what the pattern looks like. These two reads, this one up here and this one down here, have the gap introduced, and then the reads that don't have the gap share the same pattern of mismatches. Here's a case where we see a few reads that have low base quality. So the sequencers do make errors. We talked about errors. Oh, I think we just talked about errors kind of after the lab at some point. But basically each sequencing cycle adds on a new base, so you're sequencing by synthesis. And so hopefully in your whole cluster on the Illumina flow cell, which contains 1,000 templates, you're adding the same base and you see the same fluorescence. If the signals become out of sync, then you're going to see a messed up signal. So you'll see the combination of a C and a T. And when the signal is mixed like that you get a poor quality base call, because the sequencer is not quite sure of the specificity of the signal. And it can detect that in many cases, so this position is probably just low quality for a number of reads. Here's an example where all the reads that carry the mutation come from the same strand. So if you look at forward versus reverse, with blue being reverse, you can see this pileup of Cs that you don't see in any of the reads on the opposite strand. So there are definitely strand specific events. On the flow cell, especially in GC rich templates, you can have secondary structures form, and so when the polymerase tries to go through them it can impede the progress of the polymerase or it can cause it to make mistakes.
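The strand-specific pileup described above can be checked numerically by comparing how the alternate and reference reads split between the two strands. This is a minimal sketch using a two-sided Fisher's exact test written with the standard library only. It illustrates the idea and is not the strand-bias statistic of any specific caller; the function names are hypothetical.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def table_probability(k):
        return comb(row1, k) * comb(row2, col1 - k) / total

    observed = table_probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probability of every table at least as extreme as the observed one.
    return sum(
        table_probability(k)
        for k in range(low, high + 1)
        if table_probability(k) <= observed * (1 + 1e-9)
    )


def strand_bias_p_value(alt_forward, alt_reverse, ref_forward, ref_reverse):
    """Small values suggest the variant is confined to one strand."""
    return fisher_two_sided(alt_forward, alt_reverse, ref_forward, ref_reverse)


# All ten variant reads on the reverse strand, reference reads balanced.
print(strand_bias_p_value(0, 10, 15, 15))
# Variant reads split evenly, as expected for a real mutation.
print(strand_bias_p_value(5, 5, 15, 15))
```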
So there are some sequence context dependent mistakes that polymerases make, and often those are strand specific. Here's an example of a true positive. You can see, whether or not we color by strand, there's a strong signal for this T. There are no low quality bases. Forward and reverse reads both support the variant. The variant is supported by reads where it's in the middle as well as reads where it's at the end. Here's another example of a true positive. This event actually validated. It's very rare. There's only one read that supports it, but it validated. So this is a very subclonal event, but nonetheless a real somatic event in this tumor. Yes, there's another one here, although this one is probably not used in the mutation calling because it's the last base, or it's actually the first base, of a read. So it may or may not be used. Often the first base of the read would actually be used. It would be better; it would have higher base qualities. You can also get these low frequency, high quality calls that are errors if there was, for instance, a PCR induced error when you were amplifying your library. So you may not see poor base qualities, but it could nonetheless be an artifact. So it's important to validate mutations, and equally they could be real despite you thinking they're not. It could be a hard-to-sequence region where the base qualities are low, so it's really important to look at your data and then validate what you think is promising. Did you have a question? Do you filter out reads before aligning based on base quality? The aligner has a number of parameters with default filters for which reads make it into the alignment, which you can change. Usually if it's a read which is just all Ts and low base qualities, that's not going to align anywhere.
I've run alignments with different tools. With some, by default, all the reads show up in your final BAM, and with others any reads that are excluded for certain reasons don't show up. So it depends how you want to run your alignments, whether you want to preserve all reads or not, but regardless of what you do, when you run mutation calling only certain reads are used for the analysis. So how would you validate that mutation? With Sanger sequencing. So if you think this was an important mutation or a driver event or something you wanted to follow up on, you basically have to use a different sequencing technology to prove that you validated it. So if you're going to publish this, reviewers will ask you about other sequencing technologies. So Sanger sequencing is like a gold standard method for that. Mike? What if you see only two reads that support something, but they both have the exact same starting position? Yeah. So the question is about duplicate reads because of PCR amplification. So often for mutation calling you remove duplicate reads first, because they can skew your interpretation of the variant allele frequencies. So there's a step in the mutation calling tools to remove duplicate reads. For RNA-seq data, that's still done, even though by chance you could have the same fragment twice, because your mRNA is only so long, and when you fragment it, and for highly expressed genes you have possibly thousands of copies, by chance you could get a fragment that has the same start and end, which is how you call duplicate reads: the first read and the second read start at the same points as those of another read pair. So in RNA-seq data you still remove duplicate reads, and it definitely makes sense and you should do that for all genomes. Yes? Mm-hmm. Yes, good question. So for subclonal mutation detection, certain tools are more sensitive than others.
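The duplicate-removal step discussed a moment ago can be sketched as follows. This is a simplified illustration that assumes read pairs have already been reduced to a few fields; the `ReadPair` record, its field names, and the rule of keeping the pair with the highest mean base quality are all assumptions made for the example. Dedicated duplicate-marking tools work on BAM files and handle many more cases, such as clipping and unpaired reads.

```python
from collections import namedtuple

# Hypothetical minimal record for an aligned read pair.
ReadPair = namedtuple(
    "ReadPair", "name chrom read1_start read2_start strand mean_base_quality"
)


def remove_duplicates(pairs):
    """Keep one pair per (chromosome, read 1 start, read 2 start, strand),
    preferring the pair with the highest mean base quality."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.read1_start, pair.read2_start, pair.strand)
        if key not in best or pair.mean_base_quality > best[key].mean_base_quality:
            best[key] = pair
    return list(best.values())


pairs = [
    ReadPair("pair_a", "chr17", 1000, 1250, "+", 34.0),
    ReadPair("pair_b", "chr17", 1000, 1250, "+", 36.5),  # duplicate of pair_a
    ReadPair("pair_c", "chr17", 1000, 1260, "+", 30.0),  # different mate start
]
for kept in remove_duplicates(pairs):
    print(kept.name)
```

Two reads supporting a variant that share both start positions therefore count as one observation, which is why they add little confidence to a call.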
So if your goal was to detect subclonal mutations, and we'll go over that in a little bit when we talk about mutation callers, you know, not every tool is going to be good at that. Some tools are more specific and less sensitive, and some tools are more sensitive but have more false positives, so you'd have to make a decision based on your particular interest for your samples what tool to use, but we'll have some recommendations. Any more questions? Okay, that concludes our examples. So now we've done this step of aligning and SNV calling. Ideally one is doing copy number analysis on the data as well, in parallel. For somatic mutation calling and visualization there are many, many tools out there. We'll talk about some of the ones that are more commonly used and that are pretty specific and sensitive. One of the tools you'll use for a lot of analysis is SAMtools, which you've already used. So it's a suite of tools essentially for working with alignments, but it's also possible to call mutations with it. So you guys have done samtools mpileup in the lab, and that basically gave you the output of, at this particular position, I count this many reads with the reference allele and this many reads with the alternate allele, and these are all their base qualities. And that's the information that would then go into the next step of mutation calling using SAMtools. So that's one approach, but it's not very specific. GATK, the Genome Analysis Toolkit from the Broad Institute, is another good suite of tools. It's a Java implementation and it has some important properties, including local realignment around indels. So that resolves a lot of the false positives that would come up from, let's say, samtools mpileup. So it significantly reduces misalignments, and you can do quality control on your input data. You can call germline and somatic variants. You can annotate the variant effect, so whether it's protein changing or non-protein changing, whether the effect is damaging, and so on.
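The per-position counts that mpileup reports can be recovered from its read-bases column with a short parser. The sketch below reflects my understanding of the pileup encoding: `.` and `,` are reference matches on the forward and reverse strand, letters are mismatches, `^` plus one character marks a read start, `$` marks a read end, `+N`/`-N` introduce an indel of N bases, and `*` is a deleted base. The function name is hypothetical, and the example string is made up rather than taken from real data.

```python
import re
from collections import Counter


def count_pileup_bases(reference_base, read_bases):
    """Count how many reads support each base at one pileup position."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        symbol = read_bases[i]
        if symbol == "^":
            i += 2  # read-start marker plus its mapping-quality character
            continue
        if symbol in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            length = re.match(r"\d+", read_bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
            continue
        if symbol in ".,":
            counts[reference_base.upper()] += 1
        elif symbol.upper() in "ACGTN":
            counts[symbol.upper()] += 1
        # Anything else ('$', '*', '<', '>') does not add a base observation.
        i += 1
    return counts


counts = count_pileup_bases("A", ".,.$^]T,tC-2ag.")
print(dict(counts))
```

Strand information is discarded here for brevity; keeping upper and lower case separate would give the forward and reverse counts needed for a strand-bias check.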
So you can read about it in a bit more detail here. MuTect is what you guys will use in the lab. So this is a somatic variant caller, part of GATK. This has been used for a lot of TCGA data and it's probably one of the most popular tools. It accounts for various sources of error. So you input tumor and normal sequencing data, and it performs this variant detection statistic while accounting for whether the mutation you're interested in is next to a gap, just like we saw, where there are often misalignments. It accounts for strand bias. So it assigns a p-value to the presence of a proximal gap. It assigns a p-value to whether there is strand bias or not. It looks at whether there's poor mapping, and whether the mutation call is always in a clustered position, so like a certain distance from the end of a read. It has various thresholds for having observed this mutation in control samples, and it also handles triallelic sites. So we've talked about biallelic sites, where you have a reference and an alternate. Sometimes you see a third base. And you have the option as well of using a panel of normal samples. So some sequencing artifacts are going to be obvious when you have a big set of samples, but they may not be obvious in your own set of controls. And so when you use a panel of normals, those variants go away because they're observed recurrently in the panel of normals. And so that can be a good way to filter out additional artifacts. And then it does this variant classification step where it annotates your... Yes? Question? (Audience question, partly inaudible, about whether the panel of normals should come from the same population as the patients, for example samples from Western Canada.) So the question, for the recording, is when does the panel of normals provide the most benefit? Is it when you would have a population from a specific region that maybe share variants?
So the idea is that sequencers might make mistakes at some rate, and because you only sequenced 10 tumors and 10 normals, maybe those sequencing errors happened to some degree in a couple of your tumors. By including this additional panel of normals, you have a chance to observe that same type of mutation. Maybe it's a specific location or a sequence dependent thing where at some rate you're going to see an event which is actually an artifact. So the panel of normals allows you to correct for that. Getting to your, you know, your population question, you would want to be careful about using a panel of normals if you thought, for instance, that the things you might be interested in could be germline events. And so you wouldn't use, for instance, the normals of some other patients that had cancer, because that TERT mutation that is somatic in your tumor could actually be a germline predisposing event in another patient. So you would ideally use a panel of actual normals, not just the matched normals from patients with disease, but actually people that are normal and are disease free. So you have to be careful about how you choose your panel of normals, and ideally you would want to match technology as much as possible, because the goal is to remove these technology induced biases. So, yes, so then MuTect does some dbSNP annotations and then you get your list of candidate mutations. The attractive part of using MuTect is that it has sensitivity to low frequency mutations. So when you're looking for subclonal events, this would be a tool you'd want to use. So what you can see here is that, as a function of tumor sample sequencing depth, you have different power or sensitivity to detect various mutations. So if your mutation is at a frequency of 0.4, you can detect it in a tumor where you have 20 reads of coverage. If your mutation is at a frequency of 5%, then your sensitivity is just over 50% when you have 60x coverage.
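The relationship between depth, allele fraction, and sensitivity can be approximated with a simple binomial model. This is a back-of-the-envelope sketch and not how MuTect computes its power curves: it assumes each read independently carries the variant with probability equal to the allele fraction, ignores sequencing error, and calls the mutation detected when at least `min_alt_reads` reads support it. The threshold of three reads and the function name are assumptions.

```python
from math import comb


def detection_power(depth, allele_fraction, min_alt_reads=3):
    """Probability of observing at least `min_alt_reads` variant reads
    at a site covered by `depth` reads (binomial sampling only)."""
    probability_too_few = sum(
        comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - probability_too_few


for depth, fraction in [(20, 0.4), (60, 0.05), (200, 0.05)]:
    power = detection_power(depth, fraction)
    print(f"depth={depth:>3} allele fraction={fraction:.2f} power={power:.2f}")
```

Even this crude model shows the same qualitative behavior as the curves on the slide: a high-frequency mutation is nearly always sampled at modest depth, while a 5% mutation needs far deeper coverage to be seen reliably.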
So there's really a big relationship between coverage and sensitivity to detect mutations, and MuTect has better curves for this sensitivity than other callers. So that is one of the selling features, if you will. MuTect and now MuTect2 both do this well. MuTect2 also calls indels. So it used to be that if you were using MuTect, not MuTect2, you also had to run an independent indel caller, but now MuTect2 will call mutations and indels, so it's kind of a good suite for calling both types of variants. Another tool that is quite popular is Strelka. This is from Illumina. It's named after one of the first Russian dogs in space, a canine cosmonaut, this one. So it's a caller that generates both SNVs and indels, and it's known for having highly specific predictions. So it's not going to generate as many calls as MuTect and it's not as sensitive, but a very high proportion of these calls are going to validate. So it's very stringent in what it calls. And part of the reason it's more successful than other tools at calling indels is this step where it performs local realignment at positions where it detects that there could be indels or where there are clusters of mutations. Yes, exactly. So when you have an indel and your aligner has decided that it's less costly in terms of penalization to just introduce a bunch of mismatches, then this tool will perform local realignment of those reads to decrease the mismatches, basically by not penalizing insertions or deletions as much. Mutation call information is encoded in this somewhat standardized format called VCF, the variant call format, which you're going to learn more about in the lab. There's a benefit to having a standardized format, which is that anyone can use any of these tools I just mentioned and have a fairly consistent output in terms of structure and information. VCF encodes a lot of metrics about the data that can be used to then filter or prioritize mutations, and each line in a VCF file
corresponds to a mutation. So they always start with the chromosome, then the position of the variant. Sorry, the writing is so tiny. You guys are going to go through a more detailed version of this. Usually it's the chromosome, the position, and then some sort of annotation. In this case this VCF file has been annotated, and this is an identifier for a SNP, so this is a known SNP. Then the reference allele, the alternate allele, and the quality. The quality is going to differ for every caller, because they each have their own way to assess the quality of the mutation call based on different metrics. Sometimes mutation callers will assign a filter status, so whether the mutation passes some set of thresholds or it fails for whatever reason, and then it tells you the reason. And then there's the INFO field, and then this FORMAT field, which is colon delimited, and it tells you what each one of the values that you're going to see in the last column means. So genotype, and the header of this file is going to tell you in detail what GT stands for, so genotype, you know, depth is DP, and so on. So you see that this variant, 1/1, is homozygous, this one is 0/1, heterozygous, and so on and so forth. So we're going to explore this format in the lab, but basically there are many features about the data that are useful in interpreting your variants. And then, as I mentioned, one of the most important things you can do is to have a look at your mutation calls, at least a subset of your mutation calls, to kind of get a feel for your data. Okay, so now that we have variants, we need to annotate them so that we can focus on those that are likely functional and of interest. And I just want to talk a little bit about mutations versus polymorphisms here. How many people know the difference between SNPs and SNVs? I guess it's up here. Yes, sure, I know, I should have asked this question before I put up this slide. So yeah,
polymorphisms are common mutations present in a population. It turns out it depends: populations differ, and so some SNPs are more common in some populations than others. But basically there's a threshold, this is kind of a rule of thumb, of one percent of the population. If variants are deleterious to fitness they would be selected against, so they would be rare in the germlines of individuals. If they're advantageous for whatever reason they would be selected for, and they would become more prevalent in that population. Some polymorphisms are associated with disease susceptibility, with drug responses and so on, and so there's been a big effort to collect information on polymorphisms in different populations. And then single nucleotide variants are infrequent, potentially harmful, and usually associated with disease. The frequency threshold we use to separate these two things is about one percent, although that can be argued. And if you see germline occurrence of some of these SNVs, those are often associated with disease predisposition. So mutations in p53 are very rare in the germline, but if you have them, then those kids get Li-Fraumeni syndrome. People with germline mutations in PTCH1 develop Gorlin syndrome. So these are deleterious and rare events. Rare SNVs are usually heterozygous. It's very, very rare to see two heterozygous people who have survived with their predisposition to whatever actually have a child who is then homozygous, and have that be viable. So most of these are going to be heterozygous, so most predispositions are heterozygous. And often a pattern you see, which kind of supports their role in disease development, is that the second hit in tumors in these people basically results in, you know, copy neutral LOH or a second hit in that same gene. So this is one of the projects that's collected this kind of data. So, you know, people from around the world, normal people, anonymized, have volunteered for
these kinds of genotyping assays. And I mean, I don't want to spend time and go through this in detail, but, you know, there are variants that are shared across all continents, and some that are specific to a continent or specific to a population. So you can kind of see the distribution. Many people share the same events, but then there are population specific events that are going to differ for the different populations around the world. Each person carries about 250 to 300 loss of function variants in annotated genes, and 50 to 100 variants previously implicated in inherited disorders. So having one of these variants doesn't mean you're going to have this disease, it just means that you have one of these variants. There's more to it than just having a variant involved in disease, so there are other things that are going to affect penetrance and so on and so forth. So there are lots of these events that are non-functional. And we can also estimate the rate of de novo mutations, which is about 0.1 to 1 mutation per cell cycle. And then this plot just shows the distribution of how many events are specific to each population, with African populations having the most variety of mutations. So we can estimate the rate of polymorphisms. This has also been commercialized. Some of us talked about 23andMe yesterday. Basically you can pay $200, unless you catch a sale, and with 23andMe you collect saliva in a tube that they send you, and then you send it off to the company. They extract DNA and they basically perform a SNP chip analysis. I think at the moment they profile about 600,000 positions. So they have a custom chip; they used to use one of the Illumina genotyping arrays. They will provide information on your ancestry, so they can put you into one of those populations. They provide you reports on various wellness traits. So for me it was that I'm a fast coffee metabolizer; for others it's whether you have curly hair or not, freckles, lots of just regular phenotype
information. They also provide information on genetic health risks. So given the risk of a particular disease in your population, on that genetic background you have an increased or decreased risk of developing this particular disease. So they provide this kind of information. They also collect lots of personal information through surveys. So they collect everyone's genotype and then whether you've ever been diagnosed with whatever, so you have the chance to fill out a whole bunch of surveys. So they actually have genotyping data on about 2 million people and deep phenotypic data on a big subset of those. So they do use this data for research. They found new risk loci for, I think, Parkinson's and depression and other traits. So they mine this data for information. They provide to you your ancestry and various things about your traits. Like I mentioned, this example person is genetically predisposed to weigh 9% less than average. They also provide you the raw data, so if you're interested, or if you ever did 23andMe, you could download your raw data and, as we did yesterday in the lab, figure out your copy number and BAF
state across your genome. So all this data about polymorphisms is collected in databases. So dbSNP is something you're going to use. It accepts submissions for any organism from a wide variety of sources, including individual research laboratories (where's my mouse?), collaborative polymorphism discovery efforts, large scale genome sequencing centers, other databases, and private businesses. So lots of information goes into this. They do have a curation process. One of the things that I would suggest is to threshold, or to use their high quality calls, which come from population resequencing efforts like 1000 Genomes. Currently dbSNP 150, this is I believe the current version, has about 130 million SNPs with a known frequency in the human population. Up to 8% of those may be false positives due to related sequences, so paralogs. So some investigators are interested in a specific region, and they'll design primers that amplify that region and then screen cohorts of patients, and it turns out those primers cross hybridize to a paralogous region in the genome. So it's not a mutation so much as a false positive. There's also a version of dbSNP called non-flagged. So about the flagged SNPs: at some point dbSNP became contaminated, because anyone can submit entries to it. It's supposed to be polymorphisms from normal individuals, but people did submit polymorphisms found in individuals with cancer, and so it became contaminated. So then there was a big effort to flag those SNPs that are associated with some clinical feature or that have a really low frequency in the population. And so this non-flagged version is what you should strive to use if you're going to use dbSNP to filter out events not of interest, because you don't want to filter out events that are associated with disease. A note on hg38: it does not yet have the non-flagged dbSNP version. So there are lots of resources for hg19. If you're going to undertake a study and you've got your reference genome set, you want to do a
bunch of work and you want to use dbSNP non-flagged, it's not available yet for hg38. So it will become available at some point. Whenever we go up a version in the reference genome, all these other resources have to catch up, so we are still a little bit in the tail end of the catch up phase. So that's something to consider when designing your experiments. Another database of variants, and these are variants associated with disease, is COSMIC. So I encourage you to look at the site, especially the Cancer Gene Census. This is something else you would want to annotate your variants with, whether they've been previously implicated in cancer. And so for COSMIC you could go to this website and explore it, but for instance here's what you would see for BRCA2. It's a familial breast cancer and ovarian cancer predisposition gene. It's involved in the hallmarks of cancer, specifically these two, and they're kind of detailed here: genome instability and mutations, and escaping programmed cell death. You know, you have a link to the papers that support those associations. It classifies genes, if it's known, into tumor suppressors, oncogenes, or uncertain or unknown function. You get information about the processes that they're involved in and the kind of mutations that are seen. So for BRCA most of the mutations are missense and nonsense; for p53 they would mostly be missense. And so tools that we use to add all these annotations on to our variants include ANNOVAR and SnpEff. There's also Oncotator. We're going to use ANNOVAR in the lab. It's a very popular and widely used tool. So basically it allows you to do functional annotation of variants in three ways. Gene based, where you annotate what the effect of the variant is on the protein and which amino acids are affected. It will also consider the different isoforms of the protein. So in this isoform it's going to have this effect; in this other isoform, which maybe has a different combination of exons, it will have a different effect, or the mutation will affect a different amino acid, or the number will
change but not the actual amino acid. So you can put in whatever gene annotation you want: Ensembl genes, UCSC genes, GENCODE. GENCODE is a good place to start for doing gene annotations. Ensembl is also very comprehensive. RefSeq genes are more focused on protein coding genes, so a lot of non-coding genes and novel genes, novel predicted genes, would be more in Ensembl and GENCODE rather than RefSeq. Then region based annotations, where you can annotate your variants with whether they're present in conserved regions of the genome, whether they're present in transcription factor binding sites, whether they correspond to GWAS hits, if they're in segmental duplications. Whatever set of annotations you want to use that are available, or that you can download from UCSC and make into a database, can go in this region based or gene based annotation. Then filter based, and this is where you would put dbSNP, 1000 Genomes and so on. It's going to predict the mutation effect and impact, and the available databases that you can use to do all these predictions are here at this link, which is actually a little bit tricky to find if you're just looking online, so I've provided it here. It also has instructions on how you would make your own database if you so desired. Okay, so now that you have this functional annotation, you might filter your list to mutations that match certain criteria, so whether they're coding, or predicted to be damaging, or in regulatory regions, or splice site mutations, or what have you. And then the thing we want to do is turn our VAFs into CCFs. So CCF is this idea of cancer cell fraction. So the VAF, the variant allele frequency, is great if you have no normal contamination in your tumor and no copy number events. And so here we see a mutation, I guess the mutational event here is the lightning, so this little red X represents that one of our copies of DNA is mutated. So we're going to profile 1, 2, 3, 4, 5, 6 pieces of DNA and 3 of them will be mutated, so our variant
allele fraction is going to be 3 out of 6, so 0.5. So we have a fraction of 0.5. The mutation multiplicity is the number of copies of the mutation you have per cell, so that's 1. The ploidy of these cells is 2n and the purity is 100%. So our cancer cell fraction is essentially how many cancer cells have this mutation, so 100%. When we have normal contamination, we're still profiling, in this view you can see, 6 pieces of DNA. The purity is 67%, the ploidy is 2n, and there's still one copy of the mutation for each cancer cell, but now because of this normal contamination we're only going to detect 2 out of 6 reads with a mutation. So that's a variant allele frequency of 0.33, but the mutation multiplicity is still 1 and it's still in every cancer cell, so we have to correct the VAF to get this cancer cell fraction of 1. So having the lower purity pushes the VAF towards the left when we plot VAF on the x-axis here. In a normal tumor what we see... Question? Yeah, it's how many of your cancer cells are mutant, based on this number, the variant allele fraction. So if you had perfect conditions, no copy number events, no ploidy events, and purity was 100%, your variant allele fraction would correspond to the cancer cell fraction in a different way than it would here, when you have normal contamination. So you can't really use VAFs for some of the downstream applications. It's much better to use this cancer cell fraction; it's more informative. Knowing that 50% of your cells have a mutation versus 100% of your cells is more important than, I picked up this many reads that support my mutation. So the VAF in a perfect scenario would look like this. So this is a density plot of the VAFs of the mutations in this particular sample, which has an estimated purity of close to one. So you see a bunch of mutations that are homozygous, then you see more mutations that are heterozygous, and there's always some noise in measuring and so
you don't see a perfect peak; you see this distribution. And then we see a number of mutations that are subclonal, and this population of mutations that are even more subclonal. When you have low purity... okay, right, so this is clonal homozygous, clonal heterozygous, and then the subclonal events. When you have low purity, you basically just squash this towards zero. So, you know, you can't use a threshold on VAF for determining what's a clonal mutation unless you know the purity of your sample and whether there are any copy number effects. If your VAF is 0.2, you might say 0.2 is kind of a subclonal event, but in this tumor it would actually be a clonal event. So here are a couple of examples where multiplicity is different. In this case on the left we have a purity of 67% and a ploidy of 4n, so now ploidy changes how much DNA comes from the tumor versus the normal. The mutation multiplicity is 3 copies per cell, the variant allele fraction is 6 out of 10, and the cancer cell fraction is 1. And here we again have the same purity and the same ploidy. The allelic fraction is going to be 1 out of 10, because there's a mutation just in one cell, so this is going to be a subclonal event. So our cancer cell fraction is going to be 0.5. This multiplicity and CCF tell us something about the timing of mutations. So these are some medulloblastoma tumors, and medulloblastoma is a pretty pure tumor, so we often have close to 100% purity. Here we can see that diploid and tetraploid events have different mutant allele frequencies, so the VAFs. Some tumors typically have a peak that's around 0.25 as opposed to 0.5, and that's because there's often an early genome duplication event, and then lots of mutations are acquired after that event, so you only see them in one of four copies. So you see this preponderance of things at 0.25. And so one can calculate CCF with this formula, which you probably don't need to know. I just wanted to put it up in
case you guys are interested. But you have to correct for purity, correct for copy number, so the local copy number at that mutation, and the copy number in normal diploid cells, which is always going to be 2. So that's kind of how the numbers would work out for this particular scenario, and you would come up with a cancer cell fraction of 0.5. So this is important because you can then use the CCF to infer clonal dynamics, and this is the number that you would use for that kind of analysis. So here's an example where a tumor was profiled twice. The first time point was in the primary disease and the second time point was at recurrence. And basically some of the mutations in this tumor, which are highlighted here in red, were very rare in the primary, it's this thin strip of red here, and became very prevalent at recurrence. So when you plot the CCF at time point one versus time point two and you cluster mutations by their CCF, you can find this group of mutations that kind of travel together, because they're in the same cells, that had a CCF of 0.1 initially and now are around 0.8. Whereas most mutations are clustering up here and were just common to both populations, because they happened early and they're in every cell. So CCF is what people use to do this kind of analysis, and it's what you turn your VAFs into. Yeah? Oops. The plot on the left? Oh, that's just an artistic depiction of what likely happens. Now, there is a tool to draw these kinds of plots; I can send it to you after if you want. Often they're difficult to draw because you have information on what mutations and what frequencies, the VAF of each mutation in your sample, but you don't know if one mutation at 10% is on a background of the clone at 40% or the other clone at 40%, so you don't necessarily know where to assign it. So there's enough ambiguity in how you could actually draw this that it's pretty hard to do accurately. So these depictions always kind of have the caveat that this is an interpretation of what your tumor
structure could look like. There are other tools that will predict the phylogeny of your tumor cells based on these frequencies and copy number events, and those are a bit more accurate, but they don't generate this nice kind of figure; they will generate a tree, a phylogenetic tree. Especially with real-world tumor samples, everything in between... exactly, yes. So yeah, just for the recording, the comment was that you often don't have time point one and time point two, and if you do, you don't have all this stuff in between, so you're just inferring what happens before and in the middle. But that kind of data is becoming more prevalent and it's being collected by more people, so maybe we'll be able to draw accurate versions of these at some point soon. One more question? The number of subclones? From here we can tell that there is at least one subclone that has become very overrepresented in the recurrence relative to the primary. Whether these mutations occur in the same cells or in two groups of cells is not something we can tell from this particular plot. We just know that their frequency was equally low in the primary and now is equally high and correlated in the recurrence. You would need to do more of a phylogenetic analysis to predict lineages. There's also this clone down here, this little gray one, that was at a decent proportion of the primary, which is now pretty much completely absent. So you can see changes in the composition of the tumor and in what mutations are segregating in the recurrence versus the primary. And again, this is important because we now know from many studies that the presence of subclonal drivers adversely impacts outcome. So here, anytime it's not black, when it's red or yellow, the events in these genes are subclonal, as opposed to this cohort of patients where they're all clonal. And when you look at their survival analysis, you can see that having a subclonal driver is much worse for prognosis. Okay, so a few more slides. The effect of purity and
ploidy on the power to detect somatic SNVs. So I want to start, I think, in the middle plot. These lines in the middle are your detection power for a particular mutation given sequencing coverage, and this green dot on the line, which is the delta, is essentially the frequency of your mutation. And you can get a frequency of 0.125, so 12.5%. This is using ABSOLUTE, right, so it tries to fit a purity and copy number that will give you that particular observed frequency. So in this case we have a tumor with a purity of about 50% and a copy number of about 6, so a clonal event in such a tumor will show up at a frequency of 12.5% in your reads. So to have the power to detect this event, you need about 33 reads. A typical whole genome sequencing experiment will give you 30x coverage, so for this kind of tumor that would give you just enough power to detect a clonal event, because of the amount of normal contamination present. So if you have more normal contamination, you lose power to detect events. If your mutation is subclonal, then to have an 80% power of detecting such an event you would need about 300x coverage. Right, so here we're in the realm of WGS, and here we're in the realm of whole exome sequencing, and then here you would be in much deeper sequencing than you typically get from whole exome sequencing. So purity and copy number, or ploidy, really affect the relationship between coverage and the power to detect variants. So I guess this means, in the case of the germline, you can sequence the matched germline... so in the case of the tumor you sequence deeply before you make a clonal event? Yeah. Would you know in your tumor sample if the event was germline or not? I guess you would compare to your actual germline. So you're saying, how low can you go in terms of coverage for the germline if you have all this normal contamination? A big utility in having a germline with decent coverage is that you can exclude artifacts that are going
to be observed at a specific coverage. So people have done comparisons with deeply sequenced tumors and deeply sequenced germlines, and then started to reduce the amount of coverage in the germline to see if you can still accurately find your somatic events. And after a certain point, if your coverage isn't enough in the germline, you start to have lots of false positives. So I think there is usefulness in sequencing your germline to a depth comparable to your actual experiment, but in practice, you know, don't do less than two thirds of your coverage. I know it's expensive to do sequencing and it's attractive to do low coverage, but it really does help you to eliminate false positives. So just for the recording, the comment was: if you have a high coverage tumor sample and a low coverage control, so input, then some tools will downsample the higher coverage sequencing run so that you have comparable data sets, and then you're just wasting your sequencing. So, something to consider. Yes, people do look for non-coding events associated with cancer, so like the TERT mutations. But also it depends what you're annotating your variants with. So if you're interested in non-coding variants, don't use RefSeq to annotate your variant list; use GENCODE, and then you'll have all the lincRNAs and small non-coding RNAs and all the non-coding annotations in the genome. If you want to do prediction of novel non-coding things, you would need to use RNA-seq data, and if you want to look for non-polyadenylated transcripts, you would need to sequence in a specific way, because you can't do poly-T selection. So it depends on your experimental setup and on what your interest is in finding non-coding driver events, because they could be just an expression difference, not necessarily a mutation. So in a non-coding transcript, a mutation is going to be silent by default because it's not coding, although some of these non-coding RNAs have now been shown to have a small open reading frame,
so that's still, I guess, an area of active investigation. But if you see recurrent mutations in a non-coding gene, that would be of interest and you should probably follow up. But regulatory elements are the other big category to look at. These graphs, do you know whether the graphs continue to look the same at, like, an order of magnitude greater sequencing depth? So you can use sequencing depth to put past any verification? I don't know how these particular graphs look when you go further to the right, but, you know, this is why deep sequencing and amplicon sequencing studies exist: to find these really rare events that you're not necessarily powered to find using 30x or even 200x. So in previous work we've done amplicon sequencing with, you know, an average of 10,000 reads of coverage for a particular position, and we were looking for clonal events in the recurrence and whether we could find them at all in the primary tumor. So we could find these really rare events with that level of coverage. This was in medulloblastoma, which like I said has really good purity. So if you are also suffering from low purity, your power would basically be proportional to the coverage. So it kind of depends on what your question is, and if you're taking a targeted approach or what your approach is. But for this particular data, I don't know what the curves look like on the right, but I assume that they would converge at one point. With enough coverage you would find everything, except you would still have false positives, which would correspond to those rare PCR mutations or other mutations that are going to be artifacts of the technique you're using. So at some point, with rare enough variants, you won't be able to distinguish them from the sequencing or PCR-induced artifacts. So if you're going to go really deep, you need to use a really accurate polymerase. Okay, so now that we have these cancer cell fractions, we can determine which mutations are clonal versus subclonal. So now people do various types of analyses that would be
specific to your experimental setup and what you hope to learn from your experiment. So you could do a recurrence or a significance analysis to find which mutations are more likely to be drivers because they're observed more than you would expect by chance. You could look for patterns of clonal evolution, like we saw in the last slide. You could try to look for mutational mechanisms, or you could associate your mutations with clinical variables like subtype, survival, metastasis and so on. Frequency in the population is one of the things that's really broadly used to interpret the importance of mutations. So titin (TTN) is a huge gene; it's going to come up, and you'll find mutations in titin no matter what you sequence, almost. And so you need to normalize for gene length, for instance, in order to determine if the frequency of observed events is more than you would expect by chance. So there are various tools, like MuSiC or MutSigCV or other tools, that account for various properties of genes that would affect the mutation count, like length of gene. Also, mutation rates turn out to be lower in genes that are highly expressed in normal cells, because there's this process called transcription-coupled repair. So in highly transcribed genes you have lower mutation rates than in genes that are more infrequently transcribed. The replication timing of a DNA region is also important: genes that are replicated late have higher mutation rates, so genes that are replicated early can be more easily repaired than ones that are replicated late. And one hypothesis for why this might happen is that by that point in the process there is a smaller available pool of free nucleotides. And so when a repair needs to get done, you only have so much time to do it; if you don't have an available T, or whatever base is needed to go into the repair, then you have less chance to repair that accurately. So these are some of the things that you need to account for in interpreting
whether the frequency of your observed mutation in the population is more than you would expect by chance or not. Like I said, we can do this clonal evolution kind of analysis. So this is an interesting paper: they looked at kidney cancer and they profiled the primary tumor as well as multiple metastatic sites. What we are seeing here in grey are the mutations, and black is lack of mutations. And so we see that in all these different regions of the tumor there are a lot of mutations that are shared, and then some that are specific to different metastases. And interestingly, when you do a phylogenetic analysis of these mutations, you can see that some things are consistently mutated. Like SETD2 is recurrently mutated, and it's different mutations in different parts of the tumor. PTEN is recurrently mutated in different parts of the tumor. So there seems to be a biological dependence on certain pathways for the growth of this tumor. So this type of convergence of mutational events between different regions really tells you something about the biology of tumors. So these kinds of studies are really powerful for analyzing clonal evolution. We already talked about mutational profiles, and you guys will have a chance to do that in the lab. And then just a last slide, which really describes kind of the other important aspect: testing the functionality of mutations. So you're going to have these predictions, and you're really fond of what you think is the key to your tumor; you have to validate it. Before you validate it, if you have expression data, you could do this type of analysis where you look at the impact of the mutation across the pathway that it's involved in. So this particular study looked at uterine cancer in TCGA data. So all these cases of uterine cancer, which are all the columns... CTNNB1, which is this track right here, so these are all mutant tumors. So a mutation in this gene is known to activate the Wnt signaling pathway, and the rows here are the genes
involved in this pathway. So red and blue indicate activation and suppression of genes in this pathway. And so we know that these mutations, for instance, are active because they're altering the transcriptional output of this pathway. But it turns out that there are some cases with a mutation which don't activate the pathway. And so when you look at this additional row, POLE: POLE-mutant tumors are hypermutated, so they often carry mutations in many, many genes, and a lot of them are going to be passengers. So a lot of these mutant CTNNB1 cases that don't have pathway activation are these hypermutant tumors. So these are actually passenger events in those tumors; something else is driving those tumors. And so even though they contain a mutation, it's not functional. So it's not just a matter of finding the frequency or overrepresentation of your mutation; there are other analyses like this that you can do to then infer the activity of your mutation in a particular sample. Okay, so I think we'll end there.
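
For anyone following along with the recording, the purity and copy number correction from the VAF and CCF slides can be written out as a minimal Python sketch. It assumes the simple mixing model implied by the worked examples in the talk: mutant copies come only from tumor cells that carry the mutation, and contaminating normal cells contribute 2 copies each. The function names `expected_vaf` and `ccf_from_vaf` are illustrative only; they are not from ABSOLUTE or any other tool mentioned here, and the formula on the slide may be written differently.

```python
def total_copies(purity, tumor_cn, normal_cn=2):
    """Average number of DNA copies per cell at the locus, tumor plus normal."""
    return purity * tumor_cn + (1.0 - purity) * normal_cn


def expected_vaf(purity, tumor_cn, multiplicity, ccf=1.0, normal_cn=2):
    """Expected variant allele fraction under the simple mixing model.

    purity       -- fraction of cells in the sample that are cancer cells
    tumor_cn     -- local copy number in the tumor at the mutated locus
    multiplicity -- mutated copies per mutant cancer cell
    ccf          -- fraction of cancer cells carrying the mutation
    """
    return purity * multiplicity * ccf / total_copies(purity, tumor_cn, normal_cn)


def ccf_from_vaf(vaf, purity, tumor_cn, multiplicity, normal_cn=2):
    """Invert expected_vaf: correct an observed VAF to a cancer cell fraction."""
    if purity <= 0 or multiplicity <= 0:
        raise ValueError("purity and multiplicity must be positive")
    return vaf * total_copies(purity, tumor_cn, normal_cn) / (purity * multiplicity)


if __name__ == "__main__":
    # The four worked examples from the talk:
    # (observed VAF, purity, local tumor copy number, multiplicity)
    examples = [
        (3 / 6, 1.0, 2, 1),     # pure diploid tumor, heterozygous mutation
        (2 / 6, 2 / 3, 2, 1),   # same mutation with normal contamination
        (6 / 10, 2 / 3, 4, 3),  # 4n tumor, 3 mutated copies per cell
        (1 / 10, 2 / 3, 4, 1),  # 4n tumor, mutation in one of two cancer cells
    ]
    for vaf, purity, cn, mult in examples:
        ccf = ccf_from_vaf(vaf, purity, cn, mult)
        print(f"VAF={vaf:.2f} purity={purity:.2f} CN={cn} mult={mult} -> CCF={ccf:.2f}")

    # The power slide: purity of about 50%, copy number of about 6, clonal event.
    print(f"expected clonal VAF: {expected_vaf(0.5, 6, 1):.3f}")
```

Under these assumptions the first three examples come out at a cancer cell fraction of 1 and the fourth at 0.5, matching the values given in the talk, and the 50% purity, copy number 6 tumor gives the 12.5% clonal frequency quoted on the power slide. In real data the multiplicity is usually not known up front and has to be estimated along with purity and copy number, which this sketch does not attempt.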