We're going to talk today about somatic alterations, and I will just reiterate that all these slides are under a Creative Commons license, so you can share them, reuse them, and change them; just remember to share them under the same license. For today's lecture, we're going to talk about a number of types of somatic alterations, and we're going to cover structural variants, specifically copy number alterations. We're going to talk about the different kinds and how these relate to cancer evolution and genetic heterogeneity. We're going to have some examples of how to detect copy number alterations, what some of the confounding factors are, and strategies to overcome these. We're going to talk about loss of heterozygosity as well. Some of the confounding factors that are relevant are purity, ploidy, and intratumoral heterogeneity. Then for the second part of the lecture, we'll focus a bit more on single nucleotide variants and small indels and how we do mutation calling and annotation. And again, many of the confounding factors are very similar for both of these classes of aberrations. Okay, and feel free to just jump in: unmute yourself and ask a question if something is unclear or you're wondering about something. So what do we measure when we sequence a genome? Yesterday we talked about the fact that we break a genome apart, and we have these fragments of DNA that we sequence from both ends. What are we actually capturing there? We're going to capture all the variants in that individual's genome that are different from the reference genome, which is a static reference that we use to align everything to. So we're going to catch all the germline variation between our individual and the individual or individuals that contributed to the reference genome. And these are going to be either heterozygous or homozygous.
And let me know if that's something you would need clarification on, but heterozygous means that the variant is present in one of your two DNA copies, and homozygous means it's present in both. These variations are either common in the population, or they could be rare, or they could be personal, so maybe it is a mutation that has just arisen in the germline of the individual we're sequencing. We're also going to catch all the somatic variation. These are not things that the person has inherited from their parents. These are mutations that have been acquired over time. Many of them will be in the tumor, or they'll be highly prevalent in the tumor. They're going to be homozygous, heterozygous, or they'll be in just a few tumor cells. We're also going to catch all the PCR errors that are introduced during library preparation, sequencer errors, and then any errors that have been introduced by the choices made during alignment, mutation calling, and so on. If you think about the way that all these variants arise in a human genome, you can put it in the context of the developmental timeframe of the person or organism that you're sequencing. In our case, we start out with the germline we inherit from our parents, right? The egg and the sperm come together. That's the starting point. As soon as that happens and cells start to divide, errors can occur every time that DNA is copied and partitioned into the two daughter cells, and then copied again and partitioned again. So almost immediately, you start to acquire somatic mutations that look like they're in the germline if they happen really early on, or that could be restricted to a tissue if they happen later. At some point, one of these events will give that cell a growth advantage relative to the other cells.
And so this is one of the driver events for cancer. It's typically never a single driver. Usually you need to acquire a few of these drivers. I think the average is two to eight drivers. Once you have those, you have robust cellular growth and you detect the cancer. Then there's treatment, which induces additional changes in the genome. So depending on what sample you're sequencing, you're going to catch cells along this pathway of acquisition of mutations. Whenever we do tumor-normal sequencing, we're comparing the sequence of a tumor to the germline of that individual by aligning both to the reference genome. So we have a three-way comparison, and we'll talk a bit more about that. When we measure these mutations in a cancer, it turns out that we're measuring not just one population of cells. We talked about clonality, and there was a question in the Slack channel about clonality yesterday. Tumors are constantly evolving in response to their environment. They're constantly evolving because they have genomic instability, so they keep acquiring mutations. And so what you end up seeing across different kinds of cancer is, and I didn't put in the annotations, but this is the number of cancers of this type that have a single detectable clone. In light blue is the number of cancers of this type that have at least one detectable subclone, that is, a smaller population of tumor cells that are genetically distinct. And then with increasingly dark blue bars, you get more and more. I think I have to take my mouse over here. Here we go. So the increasingly dark bars show us that some cancers are really heterogeneous. In a case like this, you're only going to detect lots of competing cellular populations that are genetically distinct. There's no one homogeneous genotype. So when we analyze these genomes, we're always going to be in a situation of trying to infer the cellular frequency of a mutation or a copy number aberration.
And unless you're lucky and you're working with one of these cancers, the signal will be mixed. Okay. When we look at cancer genomes, we're going to find that both small variants, so single nucleotide variants and small indels, and structural variants, so big events, are really relevant. This is from a paper that's showing, again pan-cancer, that some tumor types exhibit this mutational class, as opposed to others like ovarian cancer or breast cancer, which are driven by copy number aberrations, so the C class. And so you see that cancers span the spectrum. If you look here at the graph on the right, cancers that have high levels of mutations or high levels of CNAs are really distinct, but there are plenty of cancers in this middle space where we're going to need to analyze both copy number and small SNVs and indels. Okay, so back to what we measure when we sequence a genome. We have all these reads. We pile up our reads on the reference genome. And then we use the information that we talked about yesterday, base qualities, in order to infer that a difference from the reference genome is actually a true variant and not, let's say, a sequencing artifact. So we use base qualities to give us evidence for that. We use the mapping qualities to be sure that the fact that we detect fewer reads at a certain position is really because we have a deletion, and not because we are unable to map reads to that location because of repetitive sequence or something like that. So aligners, mutation callers, and copy number callers will use this kind of information to make a decision on whether there is a gain in a specific region or a loss compared to the average coverage in the genome, and similarly, whether there is an indel, an SNV, or a translocation.
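
As a rough illustration of how base and mapping qualities act as evidence, here is a minimal sketch that counts alleles in one pileup column after dropping low-confidence observations. The `PileupBase` record and the thresholds (base quality 20, mapping quality 30) are hypothetical choices for this example, not the defaults of any particular caller; the only fixed fact used is the Phred definition, Q = -10 log10(p).

```python
from dataclasses import dataclass

@dataclass
class PileupBase:
    base: str       # base observed in the read at this position
    base_qual: int  # Phred-scaled base quality
    map_qual: int   # Phred-scaled mapping quality of the read

def phred_to_error_prob(q):
    """Convert a Phred score to an error probability (Q = -10 * log10(p))."""
    return 10 ** (-q / 10)

def count_alleles(column, min_base_qual=20, min_map_qual=30):
    """Count bases at one position, ignoring low-confidence observations.

    The thresholds are illustrative assumptions only.
    """
    counts = {}
    for obs in column:
        if obs.base_qual < min_base_qual or obs.map_qual < min_map_qual:
            continue
        counts[obs.base] = counts.get(obs.base, 0) + 1
    return counts

column = [
    PileupBase("A", 37, 60),
    PileupBase("A", 35, 60),
    PileupBase("T", 36, 60),
    PileupBase("T", 8, 60),   # likely sequencing error: low base quality
    PileupBase("T", 38, 3),   # ambiguous placement: low mapping quality
]
print(count_alleles(column))        # only well-supported bases are counted
print(phred_to_error_prob(20))      # Q20 corresponds to a 1% error rate
```

Only one of the three T observations survives the filters here, which is the kind of evidence weighing that callers do in far more sophisticated ways.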
So in the case of translocations, and you look at some structural variants in the IGV lab, you'll have reads that align really well up until the translocation breakpoint. The rest of the sequence is actually really high quality, so you would see really good base qualities, but those bases would align really well somewhere else. You can see that either with the reads themselves or with pairs of reads, where one read of the pair aligns to one side of the breakpoint and the other aligns to the other side. Okay, so that's copy number differences and small indels. There's a lot more we can do with this kind of data. We can predict, for instance, how many copies of a certain mutation there are and what the arrangements of those mutations are. Some of those are annotated here. Let me just point to them. I'm working with two screens. You know what, I'm going to switch my screens. I'm going to stop sharing for a second. Sorry. And I have a quick question for everyone. Do you have access to the slides? If you refresh, do you have the latest slides? So do you see my mouse? Yeah, you can see where I'm pointing. Okay, super. So yes, there's a lot we can do with this information. We can tell, for instance, that this mutation is on one allele, so it's heterozygous. We have one allele that's wild type and one that's mutant. In this case, we can tell the allele ratio is one wild-type to 19 mutated alleles. This can happen if you have a tandem duplication, where you just copy the same thing over and over in series. Or in some cases, you have these double minute chromosomes, which are little circles of DNA that contain the driver gene, and they are amplified at high levels.
You can also see genomic losses of different kinds: small focal events, whole chromosome arms missing, or one of the whole chromosomes missing so that you're left with one copy. We're going to talk about copy-neutral loss of heterozygosity, where essentially you have just one copy of the chromosome left, but it's duplicated. So you don't see a copy number difference; however, you've lost heterozygosity. We'll have some examples of that. Then there is whole genome duplication. This is typically an early event in cancer evolution, and it's where the whole genome is duplicated. It is an unstable state, so often you don't see a perfect tetraploid tumor, because as soon as this happens, cells will start to lose different pieces of DNA. And so you'll see a ploidy somewhat less than four, and lots of breakpoints and copy number variants will be observed. And then there are other forms of imbalance. So those are the kinds of outcomes that you can get from these copy number events. They fall into the simple structural variant category. Structural variants also include, like I said, translocations, or insertions where you have DNA from elsewhere landing in a spot it's not from. There are also tandem duplications, foldback inversions, and chromothripsis, where part of your chromosome is shattered into pieces that then get reassembled in a more or less random way. So you get lots and lots of connections that are new, and it's actually really difficult to figure out exactly what happened and in what order. For the purpose of the talk and the lab, we're going to focus more on these simple structural variants. There is another aspect to this that we're not going to talk about in this year's workshop, but we talked about it in 2019, when this module was actually two modules and we had a bit more time. We're not going to do any of this in the lab, but you can check out the 2019 lab for information on this.
It's mutational signatures. So yes, we have these individual indels or point mutations that we can detect, and those might be in driver genes. But as a whole, the process that generates those mutations typically has a signature. For instance, tobacco smoke in lung cancer, sorry, not melanoma, induces a certain type of C-to-A mutation. So if you see tons of C-to-A mutations in a lung cancer, that lung cancer was most likely caused by tobacco smoke. Similarly, the C-to-T mutation signature is what you see when cells have UV damage. And these can be prognostic. For instance, the signature SBS6 is due to defective DNA mismatch repair, and that typically means that patients who have that signature will be responsive to certain types of immunotherapy. So we're not going to talk about those. I want to talk in a bit more detail about copy number and structural variants, and first of all, why they're important. Yesterday in the Slack channel, there was a question of how many of these are actually germline in the human population, right? Because we always talk about dbSNP and how there are so many SNPs that make us unique, and they're variants in the human population. It's actually the same for structural variants and copy number variants. A lot of the variants that are common in humans are described in the Database of Genomic Variants, DGV. It turns out that people differ by a few hundred inversions, a few hundred duplications, and about 3,000 deletions that are at least 500 base pairs. Sorry, back up. We still have active retrotransposons in our genome, so every once in a while, they jump around. Unlike the SNPs, our small one-base-pair alterations, these affect a large fraction of the genome, so they actually end up affecting many genes. And as such, they form the genetic basis of traits because they affect gene dosage.
And so if you have an additional copy of 10 genes, you'll potentially see that at the transcriptional level and at the proteomic level, at least in certain cell types. As a result, they've been implicated in various diseases, for instance, neuropsychiatric diseases. And in general, speciation is driven by rapid changes in genome architecture. Many species have diverged, for instance, through a tetraploidy event, like all the salmon and rainbow trout and so on. Lots of plants use polyploidy as a way to speciate. So we see this in speciation. We also see this if you think of the tumor as a small ecosystem where cells are evolving. This is what our normal human karyotype looks like. This is a diploid female karyotype. Each one of the chromosomes is painted a different color. You can see that there are two copies of every chromosome, one inherited from your mother and one inherited from your father. When we look at cancer, this looks dramatically different. So I'm going to show you an ovarian cancer, which, if you remember that plot, is one of the cancer types that was the most highly aberrant in terms of copy number variation. Here's what the karyogram of four different cancer patients looks like, and you can see a lot going on. First of all, there are lots of translocations, so these are new connections between chromosomes. Here we have, for instance, a translocation between the end of chromosome two and the end of chromosome three, and you can see that from the color. You can see lots of translocations in these genomes. The genomes of these patients' tumor cells are extremely unstable, and chromosomes are hit by multiple translocations as the tumor is growing. The other thing to note is that there's very high ploidy, so the whole genome has been duplicated multiple times. Here we see mostly the broad events.
These are giant regions that we're able to observe by eye, but there are lots of focal events that happen as well. For those, we need to zoom in to base-pair resolution instead of this high-level genome resolution. And remember, each one of these karyotypes is from a single cell. So when you look at the cancer, you're going to see a ton of variability, especially in these ovarian cancer types. Okay, so now we're zoomed in to the level of one chromosome. This is chromosome five. This is not an ovarian cancer, but we can see the different classes of copy number alterations that are interesting and that we would want to analyze. These, again, are measured relative to the germline. On the x axis of these plots, we have the base pair position. So we start at the beginning of the chromosome here on the left, and we go to the end of the chromosome on the right. On the y axis, we have a scale that describes the ratio of the number of reads that we're getting in our tumor sample versus the number of reads that we've sequenced in our germline sample from the same individual. If there's no difference between tumor and germline, you would see a whole bunch of points centered around zero. That's what this blue segment is, essentially. This is a diploid segment of the genome. There is variability, and you always have some measurement noise, right? But if you look at the segment as a whole, it's centered around zero. There's no difference. When you have a gain of one copy, you suddenly have a consistent measurement that is higher than zero and a bit closer to one. This is a hemizygous deletion, so deletion of one copy. You can see a very nice discrete focal deletion; this is actually the centromere of the chromosome, so between the centromere and this other segment, there's a deletion. And then there's another hemizygous deletion that spans a large part of the chromosome.
And there is one point that has undergone an additional event, and this portion is deleted completely. So this is the homozygous deletion: both alleles are gone at this location. At the end here, we see an amplification. Amplifications are typically more than a single copy gain. And the other big event at the end of the chromosome here is a copy-neutral loss of heterozygosity. Here you see again that this region is blue because there's no difference in copy number compared to the germline sample of this individual. The only way we know that this is actually copy-neutral LOH is from the so-called BAF plot, the B allele frequency plot, or the allele ratio plot; there are a few ways to describe this plot. Essentially, this plot shows us the signal from every SNP along this individual's chromosome five. These are sites that are polymorphic in the human population. We talked about dbSNP a little bit yesterday. Now, a lot of the SNPs in a certain individual will be homozygous, and those are not informative for this analysis. Many SNPs, though, will be heterozygous. When you count up the reads that correspond to one allele versus the other allele, you'll have a ratio of about 0.5. So in a normal diploid area of the genome, you have this population of data points that are centered around 0.5, and you can see this corresponds to the diploid segment. When you have a gain of a copy of the chromosome, you're taking one of those alleles, and instead of having one to one, so 0.5, you're now going to have two to one. So you're going to have 0.66 and 0.33, and you see your distribution change towards 0.66 and 0.33. The BAF plot is always symmetric around 0.5. And what we can see here on the right is that there's no more heterozygosity, right? It is completely lost. So the points are pushed from being centered around 0.5 out to one or zero, because we've lost one copy.
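
The states described on the chromosome five plot can be tabulated with a small sketch. It assumes a 100% pure tumor, a diploid germline, and no measurement noise, which real data never satisfy; the function names are made up for this example. Each state is written as (major, minor) allele copies in the tumor.

```python
import math

def log2_ratio(tumor_copies, normal_copies=2):
    """Log2 of tumor over germline copy number; -inf when nothing is left."""
    if tumor_copies == 0:
        return float("-inf")
    return math.log2(tumor_copies / normal_copies)

def baf_bands(major, minor):
    """The two BAF bands expected at heterozygous germline SNPs.

    Returns None when both alleles are deleted, because no reads remain.
    """
    total = major + minor
    if total == 0:
        return None
    return (minor / total, major / total)

# (major allele copies, minor allele copies) in the tumor
states = {
    "diploid": (1, 1),
    "gain of one copy": (2, 1),
    "hemizygous deletion": (1, 0),
    "copy-neutral LOH": (2, 0),
    "homozygous deletion": (0, 0),
}

for name, (major, minor) in states.items():
    ratio = log2_ratio(major + minor)
    print(f"{name:20s} log2 ratio = {ratio:6.2f}  BAF bands = {baf_bands(major, minor)}")
```

Under these assumptions, the diploid and copy-neutral LOH states share a log2 ratio of zero and differ only in their BAF bands, which is why the second plot is needed.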
So we've either kept or deleted the SNP allele at this position. Each point here is a SNP. I hope that makes sense. I'm going to talk a little bit more about SNPs, and then I hope that if you have any questions, you can ask so we can clarify this. This is an important thing to get, but it's also kind of complicated. I just want to give you a visual example of some of these SNPs. dbSNP 150, I think, is the current, or at least in the last year one of the most current, updated databases of single nucleotide polymorphisms in the human population. There are 130 million SNPs annotated in this database that have a known frequency in human populations. That frequency could be that many people have the allele, say 80%, or that 1% of people have that allele, but it's been detected in some human and you can quantify its frequency. If you take three billion base pairs, which is how many base pairs we have in our genomes, and divide by this number, you're going to have on average a variant every 23 base pairs or so. So that is a lot of points that we could potentially use for this type of LOH analysis. Now let's look at BRCA2 for a second. In this gene, which has 27 exons and about that many introns, there are 8,200-plus SNPs. So even a small region like one gene will potentially give you thousands of data points. When you look at some of these SNPs, you realize not all of them are informative, right? This SNP, for instance, is a change from a T to a G, where the G is in 37% of people and the T is in the rest. In this case, the change is from a T to a C, where the C is the minor allele. It's only in 9 out of 1,000 people. So it would be pretty rare to find a person that's heterozygous for this SNP. It's possible; it's just much more unlikely than finding a person that's heterozygous for the first SNP. And then similarly with this C to T, the T is extremely rare, right?
It's in much less than 1 in 10,000 people, or who knows; they don't have more than three significant digits here. But the point is that some of this amazingly large number of SNPs will be informative and some will not. And so when you do the analysis for an individual person, you're going to use those SNPs that are heterozygous in that person. I have an example of that, and this is from one of the papers that was on the pre-workshop reading list as optional reading, so you may or may not have seen it. It's the cobalt genome medicine paper. So here's a person; the proband is a child with some disease. This is the genome of the mother and the genome of the father at a specific position. You can see some interesting things happen, and I've annotated some of them here. The child is heterozygous for three SNPs in the region we're looking at. They inherited an A from their mom, a G from their dad, and so on. You can see the parents are homozygous for these positions, right? The mother is homozygous for the non-reference variant, which is why it shows up in IGV as a color. And the father is homozygous for the variant that is actually in the reference genome, so we don't see a band here, because IGV only shows us differences. It turns out the mother had a deletion here that the child has inherited. The child is heterozygous because they got one allele from their mother, who has a homozygous deletion, and you can see there's no read coverage, and one allele from their father. So all of these variants in the child would actually be informative for LOH analysis, right? All of these variants in either parent would not be informative at all. In the parents, it would be other variants that are informative. So why does this matter, and how does it happen? To address that, I'll just go through this plot.
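
To put rough numbers on which SNPs tend to be informative, here is a minimal sketch. It assumes the quoted percentages are allele frequencies and that genotypes follow Hardy-Weinberg proportions, so the expected fraction of heterozygous people is 2pq; both assumptions are mine, not something stated in the lecture, and the function name is made up.

```python
def expected_het_fraction(minor_allele_freq):
    """Expected fraction of heterozygous individuals, assuming Hardy-Weinberg (2pq)."""
    p = minor_allele_freq
    return 2 * p * (1 - p)

# Average spacing of annotated SNPs, using the figures quoted in the lecture.
genome_size = 3_000_000_000
annotated_snps = 130_000_000
print(f"one annotated SNP every {genome_size / annotated_snps:.1f} bp on average")

# The two BRCA2 SNPs mentioned: minor allele at 37% and at 9 in 1,000.
for maf in (0.37, 0.009):
    print(f"minor allele frequency {maf}: "
          f"about {expected_het_fraction(maf):.1%} of people heterozygous")
```

Under these assumptions the common SNP is heterozygous, and so potentially informative for LOH, in close to half of people, while the rare one is informative in under 2%.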
I think this is a plot of the common copy number and copy-neutral aberrations in patients with myelodysplastic syndrome, from this paper. This is the percent of patients with aberrations, so it's like a cumulative plot of how many patients have a gain or a loss at specific regions across the whole genome. You can see lots of patients have a loss on chromosome 5. So there's probably some important gene on chromosome 5 that you need to delete in order for cells to proliferate and grow. And then there are lots of areas in the genome where there's nothing happening at the copy number level. But if you look at the B allele frequency, or if you look for copy-neutral LOH, you see that tons of patients have copy-neutral LOH at these regions. Here is one locus, JAK2, on chromosome 9. This is the short arm of chromosome 9, so 9p. And here's what has happened in these patients. First, there's a mutation that activates the gene. This is an oncogene, and it gets turned on by the mutation. Now, you only have one copy of this, so you have some survival advantage. But if you lose the other copy and duplicate your mutation, you end up having a huge proliferative advantage. So what ends up happening is you acquire this mutation. Then when you double your chromosomes to get ready to separate into two cells, there's a crossover event. And then you end up separating such that your mutations go together in one cell and the wild types go together in the other cell, which has no growth advantage. So this one is lost from the population, and this one grows a lot. That's what you see when you sequence the tumor. When you end up sequencing these patients, you can see that in their LOH plots, the 9q arm is perfectly heterozygous. So we see this band at 0.5 and the bands at 0 and 1 in every case.
But then for the 9p arm, we see loss of heterozygosity in every case; at least enough of the p arm is lost that it encompasses this JAK2 locus. Okay. So this is a way to duplicate an important mutation. Usually it's a loss of function in a tumor suppressor, but this is a really interesting case of a gain of function in an oncogene. So hopefully that makes sense and LOH is a bit more clear. We're using read pileups for copy number detection, and we're using SNPs and their heterozygosity for detecting LOH. Okay. So we have this kind of information, and we need to infer which copy number events and which copy-neutral events have happened. This is challenging because of a number of factors. One of them is normal contamination. As we heard yesterday, anytime you sequence a tumor sample, you're not just grabbing pure tumor cells. Tumors infiltrate into tissue. They have cells that they recruit to help them grow. So you're always going to have some other cells that are germline and do not have the aberrations that the tumor cells have, and you're going to dilute your signal by including and sequencing a lot of these cells. For this reason, some of the genome sequencing efforts like TCGA have a criterion for minimum tumor cell content. For TCGA, you must have a purity of 80% or higher to be included. If you don't have that, you could sequence at higher depth, but it comes at a price. The price is the cost of sequencing: you're going to sequence a whole bunch of reads, pay for them, and then end up not using them. So if your tumor is only 20% pure, you're going to have to sequence quite a lot to identify copy number variants, especially ones that are subclonal, which gets to the second point. You always have these populations that are potentially genetically distinct in different ways, and their signals are mixed in the sample that we're sequencing.
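
To make the cost of low purity concrete, here is a minimal sketch of how many reads would be expected to support a somatic point mutation, used as the simplest stand-in for the dilution that also affects copy number signal. The 30x depth, the diploid tumor, and the 25% subclone are illustrative assumptions, and the function name and parameters are made up for this example.

```python
def expected_variant_reads(depth, purity, cell_fraction=1.0,
                           tumor_copies=2, mutated_copies=1, normal_copies=2):
    """Expected reads carrying a somatic mutation at one position.

    purity        : fraction of cells in the sample that are tumor cells
    cell_fraction : fraction of tumor cells carrying the mutation (1.0 = clonal)
    """
    total_copies = purity * tumor_copies + (1 - purity) * normal_copies
    mutated = purity * cell_fraction * mutated_copies
    return depth * mutated / total_copies

depth = 30  # assumed sequencing depth for this example
for purity in (0.8, 0.2):
    clonal = expected_variant_reads(depth, purity)
    subclonal = expected_variant_reads(depth, purity, cell_fraction=0.25)
    print(f"purity {purity:.0%}: clonal ~{clonal:.1f} reads, "
          f"25% subclone ~{subclonal:.2f} reads")
```

At 20% purity the subclonal mutation is expected in less than one read at this depth, so matching the evidence available at 80% purity would take roughly four times the sequencing.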
You can get around this with a different experimental design. By sequencing different regions in space, you can get around this, because then you're picking cells that have evolved to deal with hypoxia or some other physical feature that's different in space, and you can start to detect these different populations. But in most cases, you just have a single sample from a tumor and a single germline. Your noise will affect how well you can distinguish somatic aberrations from germline polymorphisms, which again exist in every person. And the final thing that makes a big difference is the tumor ploidy, so how many copies of the genome you're measuring. Tumor purity and tumor ploidy can influence your signal in different ways, and in many cases, different combinations of purity and ploidy can explain your data. This is called the identifiability problem. I'm going to give you an example of this. You might detect that your copy number for a certain region is 1.2. This could come about in two ways. It could be because you have a homozygous deletion of that region and a 30% tumor purity. So in the normal cells, which are 70% of your sample... oh, I think that it's actually 30% tumor purity, so I think the number is a little bit wrong here. But anyway, you have two copies of that DNA from your normal cells, and they contribute 60% of your sample. I actually have to change this number and this equation. And then in the tumor, you have a homozygous deletion, so you've lost both copies. The point is going to be the same; I'm going to fix the numbers when I reupload the slides. So the tumor is only going to contribute 30% of the data, and it will contribute zero copies. The normal will contribute two copies of DNA, and it will be a much higher signal. So you're going to get 1.2 if you work out these numbers. You get the same number if you have a heterozygous deletion combined with a 60% tumor purity.
So in this case, the tumor is contributing one copy because you only deleted one, and that's 60% of the signal. And now the normal cells are contributing two copies, but only 30% of the signal. So you get 1.2. And there's no way to tell from the read counts which of these situations has happened. Okay. Similarly, for copy number and B allele frequency, there are different ways you can get the same BAF. In this case, if you start out with a diploid tumor, which has a BAF of 0.5, and you gain a copy, you're going to end up with a B allele frequency of 0.33 and 0.66. Or you might start out with a tetraploid tumor, where your BAF is still 0.5, and end up with the same shift to 0.33 and 0.66. And without additional information, you don't know which type of genomic event has happened to give you the same BAF. So hopefully that's clear even though the numbers aren't perfect. There are different ways that people have come up with to solve this identifiability problem. One possible solution is to look at what has happened to the somatic mutations that are in these regions, and there are different ways to do that. For instance, when you look at somatic mutations and plot how many mutations have a certain variant allele frequency, you'll see in most cases there are lots of heterozygous events. The tumor has acquired a mutation, and it's in one copy of the DNA, so you see a variant allele frequency of 0.5. If it's a homozygous mutation, the distribution of these mutations will be around one. And then there are always some subclonal events at different frequencies. So the distance between one and the peak of this distribution reflects your purity, right? In this case, it's a really highly pure tumor. When you have low purity, you squish this whole distribution to the left, because now you're adding in the diploid normal cells, which do not carry the somatic mutations.
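
The purity arithmetic in the deletion example can be worked through with a minimal sketch. It assumes the simplest mixture model, in which the observed copy number is the purity-weighted average of the tumor and normal copy numbers and the normal cells are diploid and heterozygous at the SNP; the function names and the allele counts used for the BAF comparison are illustrative choices of mine.

```python
def observed_copy_number(purity, tumor_copies, normal_copies=2):
    """Average copy number measured in a tumor/normal cell mixture."""
    return purity * tumor_copies + (1 - purity) * normal_copies

def observed_baf(purity, tumor_b, tumor_total, normal_b=1, normal_total=2):
    """B allele fraction at a germline heterozygous SNP in the mixture."""
    b = purity * tumor_b + (1 - purity) * normal_b
    total = purity * tumor_total + (1 - purity) * normal_total
    return b / total

# Two different scenarios give the same read-depth signal.
print(round(observed_copy_number(0.30, tumor_copies=0), 3))  # homozygous deletion, 30% purity
print(round(observed_copy_number(0.60, tumor_copies=1), 3))  # heterozygous deletion, 60% purity

# An observed value of 1.2 would instead correspond to these purities.
print(round(observed_copy_number(0.40, tumor_copies=0), 3))
print(round(observed_copy_number(0.80, tumor_copies=1), 3))

# In a pure tumor, 1 of 3 copies and 2 of 6 copies give the same BAF.
print(round(observed_baf(1.0, 1, 3), 3), round(observed_baf(1.0, 2, 6), 3))

# The allele signal differs between the two deletion scenarios.
print(round(observed_baf(0.30, tumor_b=0, tumor_total=0), 3))  # only normal cells contribute
print(round(observed_baf(0.60, tumor_b=1, tumor_total=1), 3))  # B allele retained in tumor
```

Under this model the 30% and 60% purities quoted in the lecture both give 1.4 rather than 1.2, and a value of 1.2 corresponds to 40% and 80%. Either pair shows the same ambiguity in read depth, while the allele fractions of the two scenarios are different.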
And so you are far away from one and far away from 0.5. Now, it turns out that when you squish your distributions in this way and they're overlapping, you can't tell for sure whether this first big peak is the heterozygous peak or the homozygous peak. It could be that you don't actually have any homozygous mutations in this sample, which can happen. So either way, you would know that you're at least at 30% purity, but maybe at 60% purity. So it's not perfect, and it really depends on how many somatic mutations you have and what your purity ends up being. So there are other solutions. There are tools that will try to fit your data: they will come up with different explanations for your data, right? They'll come up with these scenarios, and maybe other scenarios, that could fit your data. They do this in different ways, and they choose between the options in different ways. Some tools do not choose between the options at all; they just give you all the possible solutions of purity and ploidy that can explain your data. Some will use experimental data. So if you're analyzing an ovarian cancer sample and you have to choose between these two scenarios, well, it's very likely that it's high ploidy, so you're more likely to choose that it's a tetraploid tumor, for instance. CNAnorm will favor solutions closest to diploid, so CNAnorm for the same tumor might choose the other solution. There are other strategies for this. One is to look at copy number and the BAF combined and try to pick purity and ploidy based on that. So take a tool like PyLOH, in this exact kind of scenario, where you have one segment where there is no copy number difference: you're getting 2,000 reads in the normal and 2,000 reads in the tumor. What you see in the LOH for that kind of segment is a big heterozygous set of variants in the tumor and the normal.
But in a segment like two where you have a deletion and you don't know if it's a homozygous deletion with a low tumor purity or a heterozygous deletion with a high tumor purity, you can tell from the LOH. If you still have heterozygosity, then the signal is coming from the normal cells, so you will know it's the homozygous deletion at low purity. If you don't have heterozygosity, you will know that it's a heterozygous deletion at high purity, because the tumor cells have kept only one allele. So there are different ways to do this. None of them are perfect. You're always going to get an estimate of purity and ploidy, and you have to take it with a bit of a grain of salt. In the LOH, we're going to... Yeah. Is it possible to just explain what is BAF? I don't understand. So BAF is the B allele frequency. So it's the plot here. So BAF, the B allele frequency, in every one of these positions is a SNP where you have two alleles, one from your mom, one from your dad. The two alleles could be C or T, A or G, all the combinations of nucleotides that we have. And so we call the one that is more prevalent in the population the A allele, and the one that's least prevalent in the human population the B allele. So in this case, C is the B allele because it is less prevalent. So the A allele is T. Instead of using all the possible combinations, we're just reducing to A and B. So B is the minor allele in the population. So when you have a plot like this, you don't care that it's a C or T. You just want to see a distribution of, for this heterozygous SNP, do we see that this individual has a 50-50 ratio in their genome, or do we see a shift away from that? So it's the A versus B, or the allele ratio, or the BAF, or there's a bunch of names for it. And for the figure above, what was the copy number exactly? The copy number is calculated in this case. It's the ratio of the read depth in the tumor versus the normal. So I guess going back to here, you're going to have something like this for the germline sample of this patient, maybe blood, and you'll sort of see an even coverage.
And then in the tumor, you see this really huge peak of reads here, and then way fewer reads than you would expect here. And then you see kind of an equivalent number of reads over the rest of the regions. So relative to this person's germline, this is a gain because you're now going to have two times as many reads. So you have a log R ratio of one, or you have way fewer reads. So you have a log ratio of something smaller than zero. And you can see that here, right? Compared to the normal genome of this person, there's four copies of chromosome 5. So you're going to have way more reads for chromosome 5 than from the blood of this person, and so on. So does that make sense to everyone? Any more questions about this? Just sort of shout them out. Okay, so this is where we had paused. So Control-FREEC is one of the tools that we're going to use in the lab for identifying copy number variants and loss of heterozygosity regions that are copy neutral or not copy neutral. It is a very nice package to use. It's actively maintained and updated. It does purity and ploidy estimation. There's a number of steps for how it works, and you'll see this in the lab. One of the important things is to determine the parameters for how you're going to look for copy number variants. And for many of these tools, you end up having to set a window size. You can either let the algorithm pick a window size or you can set a window size. And a window size is how many bases you want to make an assessment for. So in this case, we're going to set the window size at 50,000 base pairs. So for 50,000 base pairs, you're going to compare the tumor versus the normal. And then you'll move on to the next 50,000 base pairs. So every one of the points on this plot in your case will be a 50,000 base pair wide window. So you're breaking up your genome into, I think it's 60,000 windows of 50,000 base pairs to cover the 3 billion base pairs.
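The windowing idea can be sketched in a few lines. This is a toy illustration and not how Control-FREEC is implemented: it assumes reads are given as start positions on one chromosome, that tumor and normal were sequenced to the same depth, and the function names and the pseudocount are made up for the example.

```python
import math


def count_reads_per_window(read_starts, chrom_length, window_size=50_000):
    """Count how many reads start inside each fixed-size window."""
    n_windows = math.ceil(chrom_length / window_size)
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window_size] += 1
    return counts


def log2_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-window log2(tumor / normal); the pseudocount avoids log of zero."""
    return [
        math.log2((t + pseudocount) / (n + pseudocount))
        for t, n in zip(tumor_counts, normal_counts)
    ]


# About 3 billion base pairs in 50 kb windows gives 60,000 windows.
print(math.ceil(3_000_000_000 / 50_000))

# Toy counts for three windows: unchanged, doubled, halved.
normal = [100, 100, 100]
tumor = [100, 200, 50]
print(log2_ratios(tumor, normal, pseudocount=0.0))  # 0 is neutral, +1 gain, -1 loss
```

With equal depth in both samples, a window with twice as many tumor reads lands at a log ratio of one and a window with half as many lands at minus one, which is the scale used on the ratio plots.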
So you're going to get all these data points for your normal, which is shown up here on top, and for your tumor. And then you'll get a ratio of the tumor versus the normal. So in this case, the normal has two copies of DNA. The tumor has two copies of DNA for this part of chromosome 17, and then it goes to three copies and four copies. So your ratio plot will reflect this relative difference. For this example in chromosome 19, there is no copy number difference, right? So your ratio is going to be centered at zero. But then when you look at the B allele frequency, you'll see, in this case, we see a shift away from heterozygosity for the p arm of chromosome 19. So this is like the JAK2 example, right? There's probably a gene on here that has a mutation, and it was advantageous for this tumor to duplicate that mutation and lose the wild type copy. So these kinds of events give you a clue as to the important mutations that are harbored in this region. So you're going to set a window size that will increase and decrease your resolution for finding things, right? If you have a giant window size, you're not going to see small events. If your window size is too small, you're going to have tons of noise. The next step after setting window size is to normalize your read counts for GC content in the genome and for mappability, and we'll talk about this in the next slide. And then you do your copy number ratios. You calculate your B allele frequency for all the positions in this tumor where, in the germline, the individual is heterozygous. And then you want to do segmentation. So you have all these points that you've measured. You actually just want a few segments of gain and loss. You don't want 60,000 points of information. So we'll look at a couple of details to do with these steps. I just wanted to mention that even after you do normalizations, you will see this variance in copy number ratios and the BAF.
And if you increase your window size, you'll reduce noise, and you'll reduce your false positives, for instance, this signal that we see around the centromere. So does anyone know why there's such a huge amount of reads essentially at this position? Any guesses? This is right next to the centromere. Do you know what kind of sequence the centromere is? It is highly repetitive, yes. So aligners will have a problem with distributing the reads. When you take your tumor sample, you fragment it, including all the centromere pieces, and then you sequence these short fragments, and you have to align them to your genome. So all of those short fragments will go into typically one of the positions that represents this repetitive sequence. The centromeres are often taken out from the reference genome, but the flanking regions of the centromere are often left in. So all those reads will end up piling in that one position where they can align kind of well, and you'll see this huge pileup in the normal and the tumor. So you know that it is not a somatic event. It is specific to the sequence content around the centromere, and you see it again here around the centromere on chromosome 19, and you'll see it at many chromosomes. If you increase your window size, you get rid of that. But you also reduce sensitivity for small events like focal amplifications. So it's a trade-off. So you have to decide in your analysis. You might want to run it twice, with a large window size and a small window size, and then look at the differences. Okay, so why does normalization need to happen? And the reason has to do with the GC content of the human genome we talked about yesterday. If you look here, GC content versus read count that's not corrected, you will see that it is not a flat line, which is what you hope to see. You would hope to see that you get kind of the same number of reads, regardless of how many ATs or GCs you have, and it's not the case.
If you have an AT-rich region, you get many fewer reads than if you have a more balanced GC content. And if you start to get into really high GC content, again, you have fewer reads. So you want to be able to correct for this. So many of these copy number detection algorithms will have a correction step for GC content, where you're now not going to detect a deletion wrongly just because you had a high AT content region. And then similarly, mappability. This is like the centromere example. There will be some places in the genome which are highly repetitive. So you're either less likely to align reads with a high mapping quality or you'll have pileups that are driven by low mappability. And so you want to exclude these regions. And instead of having sort of a relationship here, you want to have no relationship between mappability and read count. So these are two critical normalization steps. The effect of these is to take read counts that sort of fluctuate along the chromosome and essentially flatten them, such that your variance now reveals the copy number events and not lots of GC content variation, for instance. And again, once you have these really nice points, whatever your window size was, you're going to have that many points across your whole genome. So in our case, if you have 50 kb windows, you're going to have 50,000 or 60,000 of those across your whole genome. You actually want 50 to 200 segments, right? You don't want 60,000 segments. So you want to merge all these points, for instance, the ones that are red and agree in copy number, into one. You're going to have a minimum of 23 segments because that's how many chromosomes there are. You probably have double that amount because there's a centromere and usually you have the p arm and the q arm separately. And then you'll have additional segments for copy number variants. And in this case, for this chromosome, I think there's 23 or 25 discrete copy number segments as you go along the chromosome.
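The merging step can be illustrated with a deliberately naive sketch. Real tools use statistical change-point methods that tolerate noise; the version below assumes a 100% pure sample with a diploid baseline, rounds each window's log2 ratio to an integer copy number, and joins neighbouring windows that agree. Both function names are hypothetical.

```python
def call_copy_number(log2_ratio, baseline_cn=2):
    """Round a tumor/normal log2 ratio to an integer copy number.

    Assumes a pure tumor and a diploid normal, which real data rarely offers.
    """
    return round(baseline_cn * 2 ** log2_ratio)


def merge_windows(window_calls, window_size=50_000):
    """Join adjacent windows with the same call into (start, end, cn) segments."""
    segments = []
    for i, cn in enumerate(window_calls):
        start = i * window_size
        end = start + window_size
        if segments and segments[-1][2] == cn:
            # Same copy number as the previous segment: extend it.
            segments[-1] = (segments[-1][0], end, cn)
        else:
            segments.append((start, end, cn))
    return segments


ratios = [0.02, -0.03, 0.60, 0.57, 0.61, 0.01]
calls = [call_copy_number(r) for r in ratios]
for segment in merge_windows(calls):
    print(segment)
```

Six windows collapse into three segments here (two copies, three copies, two copies), which is the same reduction from tens of thousands of points to a few dozen segments described above.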
So hopefully, that makes sense. And we'll do this in the lab to a certain extent. Any other questions on this section before we move on? Yeah, I had a question about the study design. If you could go to the slide that compares the normal and the cancer tissue, just a couple of slides above. This one? Yes. Okay. So when we're looking at the normal, are we looking at the surrounding normal tissue or the blood? So that depends on your study design, right? If you have access to matched blood from the patient whose sample you're sequencing, typically that's what is used for the germline sample. Sometimes you don't. Sometimes it was like a sarcoma and you have, I don't know, FFPE material or something, or there was no matched blood collected. You could find some muscle tissue that's normal, that's beyond the tumor margin, and use that as your germline. The caveat being, hopefully there's no tumor cells in it. Yeah, that's what I was wondering, because, let's say in this case, if the normal is from the surrounding tissue, then that would cause aberrations in the normal plot, right? Because of the contamination. So then you have the reverse purity problem. Sorry. You have the reverse problem in your normal sample. Sorry. My mouse is super sensitive. So your normal sample, if it has, let's say, 10% tumor cells in it, you'll see a bit of the signal of the somatic aberrations in your normal sample. So when you do the subtraction, right, of your read ratios, and let's say your tumor sample is not perfectly pure either, right? Then they're just so close, you will lose your power to detect events. Right. And this could happen for copy number events. It's going to be much more of a problem for SNVs, where you often have a much more stringent filter on evidence from the germline that, you know, an event is somatic.
So you want to then do the analysis in a way where you're going to be more permissive with the difference between germline and tumor. Right. Okay. Thank you. Thank you for clarifying. Appreciate it. Okay. Any other questions before I move on to SNVs? Just on this slide that you were on, I guess, could you talk a little bit more about the normalization? So it's normalized to the GC and mappability? Or what was it normalized to? It's from the graphs on the next slide, right? Yeah. So if you have a low GC content, so it's a high AT region, you will see fewer reads, not because there's less DNA, but because of technical difficulties with profiling and sequencing as many reads from that DNA as from DNA that has a normal GC content. And same with high GC content regions, you'll also see fewer reads. And I don't know if you emphasized this, or if you looked at this in IGV yesterday. But if you look in IGV at the genome in a certain region, you had a track at one point of GC content at the top. And you could see it's not flat. It goes like that. And then there are some regions that have very low GC content and they're AT rich. And if you look at your coverage track in a genome, you will see your coverage dips there. But it's not because you have a deletion. It's because it's high AT content. So you want to be able to correct for that. So these tools essentially divide your read count by the read count you'd expect for that GC content. It's not that simple, but that's conceptually what is done. So is it in the lab that we'll go over the details of the normalization? Yeah. And basically it's integrated into the tool. Okay. Okay. So it's pretty standard. The other thing that helps you a lot is having tumor versus normal. In theory, they will be subject to the same, you know, variability in read coverage based on GC content. So you could skip this step if you have a matched tumor-normal sample.
It's just a good idea to always include it. And if you just have a tumor sample and you don't have a matched germline, still do the analysis, but then you have to include GC content. And so the normalization on the previous slide is based on the bottom plots on that slide, which is mappability and GC content. Yeah. This is normalized exactly for both. So you do both normalizations and then you hope to see this kind of flat profile, right? If you still see variation in your normal sample after the normalization step, there's something else that's causing variability. Okay. Great. Yeah. Sorry, is it possible to just define what is mappability? So I'm not sure if you went over this in the IGV lab, but remember how your reads were gray? Yeah. Yeah. And then did you ever come across some reads that were clear? Yeah. Yeah. So those reads that were clear had a low mapping quality because the aligner is not sure that that read goes there. Okay. So the mapping quality is from the aligner. And the aligner, you know, it's trying to align your read somewhere in the genome. If it finds one spot where it could go really well, and there's no other spot that's even close to scoring so well, then you have a high mappability score. If you have two possible locations where it could go and it's not sure, you have a low mappability score. And typically those are positions that are going to be repetitive in nature, right? So there are some parts in the genome where you just have low mappability because they're really similar to multiple other parts in the genome. And you're never going to have a high mappability in those areas. So you don't want to use the reads piling up in those areas because the aligner will make a random decision as to where it goes, right? It could be like one of five places. So it's going to stick it in one of those places. It doesn't mean that you have a copy number gain there if you see lots of reads there.
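Both corrections from this exchange can be sketched together. Conceptually, each window's read count is divided by the count that is typical for windows with similar GC content, and windows with low mappability are set aside. This is only a sketch of that idea: the binning scheme, the 0.85 mappability cutoff, and the function name are assumptions for illustration, not the procedure of any particular tool.

```python
from statistics import median


def gc_normalize(counts, gc_fractions, mappability,
                 min_mappability=0.85, n_bins=20):
    """Divide each window's count by the median count of its GC bin.

    Windows below the mappability cutoff are returned as None (masked)
    and are not used to estimate the per-bin medians.
    """
    def gc_bin(gc):
        return min(int(gc * n_bins), n_bins - 1)

    per_bin = {}
    for c, gc, m in zip(counts, gc_fractions, mappability):
        if m >= min_mappability:
            per_bin.setdefault(gc_bin(gc), []).append(c)
    expected = {b: median(values) for b, values in per_bin.items()}

    normalized = []
    for c, gc, m in zip(counts, gc_fractions, mappability):
        typical = expected.get(gc_bin(gc), 0)
        if m < min_mappability or typical == 0:
            normalized.append(None)
        else:
            normalized.append(c / typical)
    return normalized


# Two AT-rich windows with low counts, two balanced windows, and one
# repetitive window where reads pile up.
counts = [50, 52, 100, 98, 500]
gc = [0.31, 0.32, 0.50, 0.51, 0.50]
mapp = [1.0, 1.0, 1.0, 1.0, 0.2]
print(gc_normalize(counts, gc, mapp))
```

After the correction, the AT-rich windows sit near one like the balanced windows, so they no longer look like a deletion, and the pileup window is masked instead of being read as a gain.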
So you want to be able to correct for that. And for that example that you had with the normal and tumor cell, do we need to start with a heterozygous normal cell or can it be anything? So your normal sample is, let's say, blood, right? But if you're talking about heterozygous, you're referring to the SNPs. So any given person will have like three million heterozygous SNPs in their genome, three million to 10 million. I mean, do you have to have something like this, or is it just an example that you're showing here? Can we have AA or BB instead? So AB means heterozygous, right? So in this case, the whole chromosome is AB, or there's evidence for heterozygosity across the whole chromosome, which is what you expect if you have two copies, right? So for both these chromosomes, this person, because it's the normal sample, has AB. But in the tumor sample, there is a shift from the heterozygosity because there's a one copy gain, right? So you go from AB to AAB and ABB, but that's redundant. So you only write one of them. And then with four copies, you just duplicate everything, right? So now you have two As and two Bs. So it's still heterozygous because you still have a 50-50 ratio. It's just heterozygous, but with a two copy gain, sorry. So does that make sense why you would have AAB? I didn't understand the AB in the beginning. Okay, so that has to do with the two alleles, right? Every one of these points is a SNP position, with two alleles. Okay, so we know that's the SNP position. Okay, got it. So one of the inputs to these algorithms is dbSNP. And then for every one of these genomes, your copy number tool is going to check for every position in dbSNP, is this person heterozygous? If yes, then you can assess loss of heterozygosity at that position. So the ability to do this depends on having SNPs. Got it. Thank you so much. Okay. I have one just quick question I think about in terms of the normal versus tumor.
In the context of like myeloid cancers, for example, you're saying that blood is often used as the normal for comparison, but we do myeloid cancers in my lab, and in my experience, we don't usually have a matched normal sample in that case. So is this still quite doable without having matched normal samples? I know you mentioned you have to look at the GC content more closely and whatnot. So you want to for sure normalize for mappability and GC content. So you can do control-free, right? Where you run the analysis with this tool without a control sample. You could also use a different source of normal, for instance, skin, or a cheek swab. In these cases, I don't know if you would have that, but that would be another source that's typically free of blood, right? Or hair follicles or something like that. Yeah, it is more challenging in those cases, but you can still do the analysis. Okay. So those are large events. Let's talk about small variants. So SNVs or indels. And also in the context of these confounding factors of purity, ploidy, and copy number variants, which are going to affect how well we can detect an SNV or an indel. And I just wanted to show you conceptually what an analysis of SNVs would look like. So you would start with your cohort of interest. It could be one sample. It could be a bunch of samples. We have these alignments, and you talked yesterday about alignments. And then on the BAM files that come out of the alignments, you're going to run a mutation caller. Also a copy number detection tool like Control-FREEC or TITAN or ABSOLUTE, et cetera. There's a slew of them. And what these will give you are copy number ratios, LOH, so BAF ratios, tumor purity, and the variant allele frequency of any somatic mutations that are found. Once you have this, the variant allele frequencies, and then you could have this related information, but you don't need it.
But once you have a set of variants, you want to annotate them to see what they do. Is the mutation you detected in a gene? Is it predicted to disrupt the protein coding sequence, et cetera? So you want to do this functional annotation step. And then using the copy number and purity, if you have it, you want to correct the variant allele frequency, which is just how many reads you're observing to support your variant versus your reference, to get what's called a cancer cell fraction. So we'll talk about that. And then you can sort of distinguish between the early mutations that are in every cell versus those late mutations that arise further on during tumor evolution. And then you want to interpret and validate these mutations. And that, especially interpretation, will be the topic of other modules. Okay, so here's conceptually what happens when we are doing mutation calling. You're aligning your reads to the reference genome, right? And then you're doing this for the normal sample and the tumor sample. And then you want to jointly assess whether there are any differences from the reference and whether there are any differences between the normal and the tumor. So in this case, we see that there are two germline events in this patient, so they're in the normal sample. This is why they're germline, right? So compared to the reference sequence, which is a G, this person has a C. And in the tumor, it's also C. So this is just a germline variant. You're going to pick it up because you're doing mutation calling relative to the reference, but you don't care about it because it's not different between normal and tumor. In this case, the person is heterozygous at this position where the reference is a G. They have two alleles, a C and a G, right? And you see that in the tumor as well. So again, no difference: AB and AB, BB and BB. This is the interesting case where the person's germline shows an A, which does match the reference.
But in their tumor sample, they have a heterozygous mutation to a C, right? So it's an A to C mutation. So this is the kind of event that you would hope to be able to shortlist from these analyses. And these are the events that you would then want to functionally annotate, etc. There are lots and lots of tools that will let you perform somatic mutation calling. I think we mentioned SAMtools as the suite that allows you to do lots of different kinds of analyses on alignments. So this is a suite for working with files. It's a community standard. Files can be in SAM, BAM, or CRAM format. You can do mutation calling within SAMtools. So samtools mpileup will generate essentially these pileups of reads. So at every position, how many of one type of nucleotide you have and what the base quality is. And then you can pass that pileup to BCFtools and do a bunch of filtering, and you can output somatic mutations. GATK is another suite of tools. You can see the whole tool index here. And they have a best practices recommendation for different kinds of analyses, including somatic variant calling. And so that link is here. And their somatic variant caller, which we're going to use in the lab, is Mutect2. So this is a high sensitivity caller. And I'll just point out a couple of features of it, which is it's a paired caller. So you're always comparing your tumor to the normal of that person. There are a number of filters on the reads themselves to make sure you're only using the high quality data, so high base quality, high mapping qualities. And then there are a number of filters. And I'll just highlight that they take into account some of the features we've already discussed, right? Is your mutation called next to a gap, where realignment might be really important? Is there a strand bias when you're calling these mutations?
So if there is a strand bias and only reads on the forward strand are supporting your variant, that is a red flag that something is not okay, because you can get these kinds of errors during sequencing. So if you're sequencing a high complexity region or certain kinds of regions in one direction, the DNA can form like a tertiary structure. And instead of getting through it okay on the Illumina sequencer, the enzyme always makes a systematic error. But in the other direction it doesn't. So then you end up with this single strand evidence for a mutation. Sorry. If there's poor mapping, so if you see all those reads with low mapping quality at a position where you're detecting a variant, that is also suspect. So you would not want to consider those as potentially real variants. So all these filters are being done for you. If it's a triallelic site, that's marked. If the position of your SNV is clustered close to the ends of reads, that is also a concern, because remember at the end of reads your base quality starts to decline. So you start to make mistakes. If evidence for your mutation is observed in the control, then it is excluded. So this is where you want to relax this filter if your control is adjacent normal. So you would want to be able to consider the parameters that you're using for filtering based on your data. And then once this filtering is done, there's a high quality call set which you can choose to also filter based on a panel of normal samples. So if you have lots and lots of samples that you've sequenced on that instrument or at that institution and you see systematic mutations come up, you can remove those because they are likely associated with the instrumentation or something to do with the workflow rather than your actual sample. So then you can have this high quality and panel of normals filtered call set which you can then analyze in various other ways. So the filtering is very important.
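The kinds of checks described above can be sketched as a small rule-based filter. This is purely illustrative: the record fields, filter names, and thresholds below are invented for the example and are not Mutect2's actual filters or defaults.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-variant summary; field names are illustrative."""
    alt_forward: int                 # variant-supporting reads, forward strand
    alt_reverse: int                 # variant-supporting reads, reverse strand
    median_mapq: float               # median mapping quality of those reads
    median_dist_from_read_end: int   # median distance of the variant from read ends
    normal_alt_reads: int            # variant-supporting reads in the matched normal
    in_panel_of_normals: bool        # seen recurrently in unrelated normals


def failed_filters(c, min_mapq=30, min_end_dist=5, max_normal_alt=0):
    """Return the names of the illustrative filters this candidate fails."""
    failed = []
    if c.alt_forward == 0 or c.alt_reverse == 0:
        failed.append("strand_bias")
    if c.median_mapq < min_mapq:
        failed.append("poor_mapping")
    if c.median_dist_from_read_end < min_end_dist:
        failed.append("clustered_at_read_ends")
    if c.normal_alt_reads > max_normal_alt:
        failed.append("seen_in_normal")
    if c.in_panel_of_normals:
        failed.append("panel_of_normals")
    return failed


clean = Candidate(6, 5, 60, 40, 0, False)
suspect = Candidate(9, 0, 12, 2, 1, True)
print(failed_filters(clean))
print(failed_filters(suspect))

# With an adjacent-normal control that may contain some tumor cells, the
# evidence-in-normal threshold can be relaxed.
print(failed_filters(Candidate(6, 5, 60, 40, 1, False), max_normal_alt=2))
```

Keeping the thresholds as parameters mirrors the point made above: what counts as acceptable evidence in the control depends on what your control actually is.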
Mutect2 is a caller that was built so that it is more sensitive to low frequency mutations. So for instance in a 30x genome, which was the standard for a while but now we've gone to 60x or above, you're sensitive to mutations that have a frequency of more than 20% in your sample. So you need at least like 20% of your reads to support the mutation before you're going to call it as a real mutation, right, at 30x. And if you're interested in a mutation that's at 5%, there's almost no chance that you will call that mutation. You need to go to 60x or much higher coverage in order to have sensitivity or power to detect lower frequency mutations. And of course your purity will affect the frequency of your mutations. So you'll have to compensate for lower purity potentially with coverage. Strelka is another one of the well used algorithms for this. It has a built-in realignment step for indels. So the indel call set is really high specificity and you have fewer false positives, which is typically one of the issues with calling indels. Lots of them just end up being wrong, so you have lots of false positives that you have to sift through to get to your true positives. But Strelka does a good job at that. So there's lots of tools. There's others I haven't mentioned. How many tools should you use? Which ones? This is something that you have to decide in your analysis. We use three in my lab. Different labs use different numbers and different ones. So how do you know which to use? You could go based on literature. I'm just going to show you one example very quickly from PCAWG, the Pan-Cancer Analysis of Whole Genomes, where they reanalyzed some data. So they took a bunch of data. They aligned it in the same way and then they ran three different mutation, copy number, indel, etc. calling pipelines on this data and compared the results.
And so when they did this, so they ran these three different pipelines, and things they marked with one are their SNV caller. Things they marked with two are their indel caller. Not all mutation callers do both SNVs and indels. At this time, MuTect1 only did SNVs, so they also had an indel caller. Structural variants, copy number aberrations. So they ran all these in parallel and they compared the results. One way to compare them is to look at the F1 score, which summarizes how sensitive and how precise the call set was, if you use just one tool, or two of these pipelines, or three pipelines together. And all three pipelines were the best. It took the most time. So the cost per donor is like AWS credits. If you used two pipelines, you did really well. But if you used only one pipeline, you did just as well as two pipelines. So you have to make a decision. Do you go with one caller, or do you add another, or do you go with three and then take the intersect or the union? So this is something where there's no one solution, and it depends on sort of your data type and your analytic goals. After you have these mutation calls, filtering is really critical, like I mentioned. Visualization can help guide how you filter. So always look at your data in IGV. Often when you look at something, you'll just spot an obvious artifact, or you'll just see that your data is terrible. You know, you have to look at your data. And then after you're happy with your analysis results, you want to annotate. So you want to look for those mutations that are damaging and change the amino acid sequence in some way, either because they're non-synonymous variants, or they introduce a stop or remove a stop codon, or there's splice site mutations perhaps and they end up changing the protein. And you probably want to focus on these rather than the silent mutations or the known polymorphisms. So for annotation, there are lots of different tools.
Again, we use ANNOVAR and we're going to use ANNOVAR in the lab. ANNOVAR runs three different types of annotation and you can use lots of different data sets to annotate your data or to inform your data. You can use many of the available databases on their website, or you could just download tables from UCSC, or you could make your own data. So you might want to annotate, let's say, your variants, your VCF file, with information about genes. So you want to know, are these mutations in exons, or are they in introns, or are they in between genes, intergenic? If they're exonic, do they result in a missense mutation, or are they frameshifts, and so on? And if they are exonic and they change your protein, what is the amino acid change? So this is the kind of information that you'll get from ANNOVAR from a gene-based annotation. You can also do a region-based annotation. So if it's not in a gene, maybe you want to look for mutations in transcription factor binding sites or regulatory elements or repetitive sites or who knows, or regions of copy number gain or loss. And you can also run it in a third way, which is a filter-based annotation. And you might want, for instance, to remove anything that has a high frequency in dbSNP. So if it's in 1% or more of people, just in the normal germlines, you're not going to consider it for your analysis. So you're able to set these kinds of thresholds and filter out things from your VCF that don't pass the filters you've set. So it's very modular and powerful in how much you can tune it, and you'll give it a try in the lab. Any questions on this part? It's pretty straightforward. I just have a few more minutes then. We're going to end with getting back a little bit to clonality. So the metrics for reporting mutations are, so we talked about VAF, the variant allele frequency, how many reads support your variant versus how many reads don't support your variant.
And when we went over VCFs with Richard yesterday, he went over proportions of the variant allele frequency. What we want to end up with is actually a CCF, which is a cancer cell fraction, and this thing called multiplicity. And I'm going to explain why this is important. And we'll start with looking at kind of a perfect example of a mutation where you have 100% purity of your sample, right? It's a diploid tumor, so nothing funky is happening with the copy number. And the multiplicity is how many copies of the mutation you have per cell, and in this case, it's one. So every single cell has a mutation, and every single cell is heterozygous. So now imagine we're extracting the DNA from these cells and sequencing it. Your variant allele fraction, so the number of reads, let's assume everything is proportional and there's no noise. The variant allele fraction is going to be three out of six, right? You're measuring six pieces of DNA, three of them have the mutation. Your VAF is 0.5. The cancer cell fraction is one because every one of your cancer cells has this mutation. So we want to be able to report this one. Here's how purity and ploidy are going to change this. Imagine now that we have a more realistic scenario where you have some lower purity. In this case, we have some normal DNA. So now our purity is 67%. The ploidy is still 2N. There's nothing funny happening with the copy number. The mutation multiplicity, so the copies per cell in the tumor cells, is still one. But our VAF is now two out of six because now we're also measuring this DNA from the normal, which doesn't have the mutation. So our VAF is 0.33 because we didn't adjust for purity or anything else. The cancer cell fraction is still one. Here's another example. In this case, we have the same purity of 67%. But the DNA that our mutation is in is now at a copy number of four. And the mutation happened before the duplication. So you've also duplicated the mutation.
So you end up having three copies per cell of your mutation. Now, when you sequence the sample, you're going to have a VAF of 0.6 and a cancer cell fraction of one. And in this case over here, the gain in copy number happens before the mutation. So your mutation comes after, and it only happens in one cell. So this is going to be like one of those subclonal events that happen later on in tumor evolution. All these things are the same, except now you only have one copy per cell instead of three. Your allelic fraction is very low, so you may or may not be able to detect this mutation. And your cancer cell fraction is 0.5. So this tells you that this mutation is subclonal and it comes later on in tumor evolution. So I'm going to ask you a question. For the mutation multiplicity, the copies per cell, how can you get that with your bulk sequencing? How are you distinguishing reads from cells? If you, for example, have a tumor and you sequence that, you don't really know how many reads come from each cell. So I guess we can go back to this simple example. If you know your purity and your ploidy and your variant allele fraction, you can work out the multiplicity, I guess if you know the cancer cell fraction. So sometimes you can find the multiplicity and sometimes not, right? You're going to know some of these. This is a scenario where we know everything, right? And you can see how changing one of these parameters will change your VAF. So the goal with this is to be able to calculate a CCF. Often you don't end up being able to figure out your multiplicity, but sometimes you can. Okay. Thank you. One clue to the multiplicity is the LOH in this region. So in this case, this mutation is in a heterozygous region, right? In this case, there's been a duplication event, and one of the alleles has been duplicated more than the other allele. So the LOH will be different.
So there will be loss of heterozygosity in this region. So in theory, it would be possible to figure out your multiplicity. In practice, people figure out the cancer cell fraction and don't really worry about the multiplicity, although that is an important parameter to know, right? But the cancer cell fraction is the important part that you'll often see reported in publications. So in most cases, as I already mentioned, the whole genome duplication is kind of an early event. And so here is what you end up seeing in tetraploid tumors versus diploid tumors. When you have a mutation in a diploid tumor and your tumor purity is pretty high, you're going to see that most of your mutations have a VAF around 0.5, right? So this kind of matches that first example we looked at. If you have a tetraploid tumor, most of your mutations are now around 0.25, because they occurred after the whole genome duplication. So this kind of analysis also allows you to time your mutations relative to your copy number or genome duplication events. It is possible to calculate the CCF if you have the purity of your sample, your VAF, which you always have from your VCF, and the copy number of the DNA segment at the mutation that you want to convert from VAF to CCF. If you have this copy number, you can calculate the CCF. So you're basically correcting your VAF for purity, for the copy number in the tumor cells, and for the copy number in the normal diploid cells. So take a case like this, where your purity is 67%, your ploidy is 4, or rather your copy number is 4, and your observed allelic fraction, your VAF, is 0.1. You want to be able to show that your cancer cell fraction is 0.5. And you can do it with this formula. So it's actually pretty straightforward to do. I don't have a formula for multiplicity, but for CCF you can definitely do it, and you do want to correct your VCFs, or sorry, your VAFs, for purity and copy number and end up with CCFs. So you don't need to write this down.
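As a rough sketch of what that conversion looks like in code: the formula on the slide is not captured in this transcript, so the version below is the standard purity and copy number correction, which is an assumption about what was shown. Under a noise-free model, the expected VAF is `purity * multiplicity * CCF / (purity * tumor_cn + (1 - purity) * normal_cn)`, and solving for CCF gives the inverse. The function names are hypothetical, the normal copy number is assumed to be 2, and the multiplicity has to be supplied (defaulting to 1 when you don't know it is itself an assumption). The scenarios are the four worked examples from the slides plus the high-purity tetraploid case.

```python
def expected_vaf(purity, tumor_cn, multiplicity, ccf, normal_cn=2):
    """Expected VAF under a noise-free mixing model (assumed, not from the slides).

    purity       : fraction of cells in the sample that are tumor cells
    tumor_cn     : total copy number of the segment in tumor cells
    multiplicity : mutated copies per tumor cell that carries the mutation
    ccf          : fraction of tumor cells that carry the mutation
    normal_cn    : copy number in normal cells (assumed diploid, 2)
    """
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return purity * multiplicity * ccf / total_copies


def vaf_to_ccf(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Invert expected_vaf: correct an observed VAF for purity and copy number."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


# (label, purity, tumor copy number, multiplicity, CCF) for each lecture scenario.
scenarios = [
    ("pure diploid, clonal", 1.0, 2, 1, 1.0),
    ("67% purity, diploid, clonal", 2 / 3, 2, 1, 1.0),
    ("67% purity, CN 4, mutation before gain", 2 / 3, 4, 3, 1.0),
    ("67% purity, CN 4, mutation after gain", 2 / 3, 4, 1, 0.5),
    ("pure tetraploid, mutation after duplication", 1.0, 4, 1, 1.0),
]

for label, purity, tumor_cn, mult, ccf in scenarios:
    vaf = expected_vaf(purity, tumor_cn, mult, ccf)
    recovered = vaf_to_ccf(vaf, purity, tumor_cn, mult)
    print(f"{label}: VAF = {vaf:.2f}, CCF recovered = {recovered:.2f}")
```

Note that the conversion back to CCF only works cleanly when you plug in the right multiplicity. If you assume a multiplicity of 1 for a mutation that actually sits on three copies, you get a CCF above 1, which is one hint that the multiplicity is higher.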
We're not going to do this in the lab. We did it in a 2019 lab. It's easy to do. If you want to do this, as long as you have all the right numbers, you can just plug them in and calculate it. So I'm going to end there with the slides and just mention that once you have all this analysis done, you will have a set of mutations that are now interpretable, because they will be annotated with their clonality and with whether they affect genes or not. And then you want to go on and do something else with them, and that really depends on your analysis. You might want to look at the pathways that these genes are involved in. You might want to look at mutational signatures. You might want to look at association with clinical variables, et cetera. And that will be the topic of future modules.