Hi everyone. I'm Serana. I did my PhD in bioinformatics at the Genome Sciences Centre in Vancouver and then did my postdoctoral training at SickKids, where I did a lot of cancer research work, in cancer genomics. And now I've started my own lab at the University of Calgary. I've been part of the CBW for about five years now, and I'm doing two modules: the copy number variations module now, and the module tomorrow morning, where we're going to talk about mutations. When I started participating as an instructor, I took some slides from previous lectures by Sohrab Shah, who's a faculty member at the BC Cancer Agency, so thanks to him for some of those slides. As I go through, it would be great if you guys ask questions so that we can make sure the concepts are clear before getting to the end of the talk.
The talk is structured in two main parts. Today we're going to explore the idea and impact of copy number aberrations in cancer. We're going to talk about genomic instability, cancer evolution, and genetic heterogeneity; tumor suppressors versus oncogenes; and what actionable mutations are. So, the things that we hope this knowledge of inferring copy number aberrations is going to transfer to the clinic. Then in the second part of the talk, we're going to talk more practically about detecting copy number aberrations: what are some confounding factors and strategies to overcome them, how to measure copy number aberrations, and what are some of the computational tools that are typically used, including some of the tools that you guys will use in the lab. Between these two sections we should have a small break, so that will be another opportunity to ask questions.
Okay, so you've heard in Trevor's talk that cancer is a disease of the genome. And tumorigenesis is this multi-step process that requires several mutations in order to get a cancer going.
So here in this diagram, we see some normal cells on the very left. Some of these cells will have somatic mutations, and the right mutations will confer a selective advantage on some cells: either they're less likely to die, or they proliferate at an increased rate. When a mutation confers this highly advantageous phenotype, these cells outcompete their neighbors. And each time they divide, they can acquire additional mutations. You can see these additional mutations here as the different colors, although this green is a little bit hard to see. So this green, this pink, and those orange lineages are characterized by mutations in these particular genes. These selective pressures cause a big increase in the population frequency of these cells. And then under a selective pressure, for instance during treatment, you can see really big reorganization of the clonal structure of these tumors.
As we go along to the right, time is passing. You can see here that chemotherapy starts, so this tumor is starting to be treated. There's probably surgical resection of the tumor mass and then chemotherapy, which puts a huge selective pressure on these clones. And what survives and comes back as a recurrent tumor isn't necessarily the genetic clones that were very successful in the primary tumor. So this green lineage is completely gone. And this red lineage, which was not really advantageous in the primary tumor, now makes up the majority of cells. We also see subsequent mutations acquired in the recurrence, like that blue clone. So these malignant cells, even within a single tumor, can differ from each other in space, so at the same time, in different parts of the tumor, you can see different genotypes, and they certainly evolve over time. And very rarely is a tumor 100% pure.
So often, tumor samples will contain infiltrating cells, like immune cells, as well as various types of cells from the microenvironment. Essentially, this makes tumors complex, like a normal tissue. We see stromal cells, and their presence and composition can significantly change the biology of the tumor; it could make a tumor more or less sensitive to chemotherapy. So the combination of genetically distinct clones in a tumor, together with the specific types of cells in the microenvironment, is what is involved in the pathogenesis of some and perhaps all cancers. It's important to appreciate this aspect of tumor biology.
And you guys have seen this before. Our foundation for understanding the biology of cancers is easily presented as the set of six hallmarks. These are the acquired functional capabilities that enable cancer cells to survive, to proliferate, and to disseminate to other parts of the body. The acquisition of these capabilities is made possible by two enabling characteristics, shown here on the bottom right: inflammation and the tumor microenvironment, like I was saying, and, most relevant to our discussion today, genome instability and mutation. Genomic instability is what endows cancer cells with the genetic alterations that can drive tumor progression. So understanding tumor biology is then an exercise in measuring or detecting these clonal genotypes, shown here by the different colors of these cells, and linking them to disease progression and response to treatment. One way to infer clonal lineages is to consider their population frequency in the tumor. However, there are several confounding factors to consider, and today we're going to talk in a little bit of detail about two of them. One is normal contamination, like I mentioned. These are non-malignant cells from the tumor stroma.
And the other is the simultaneous presence of multiple genetically distinct lineages, whose relative frequencies need to be deconvoluted.
Okay, so before delving into the details of how we're going to do that, let's go over some background on copy number alterations. We can start by considering this karyotype of a normal human cell. This is a spectral karyotype with chromosome painting, which gives each chromosome in the human genome a distinct color. It makes it easy to appreciate that the structure of the human genome is diploid: we have two copies of each chromosome, one from our moms and one from our dads, except for the X and Y chromosomes. And these are some genomes from ovarian carcinomas, a cancer that has some of the highest burden of copy number aberrations. Their genomes look almost nothing like the normal karyotype we saw, right? It's obvious from these images that copy number changes are a major feature of cancer, so it makes sense to study copy number profiles in detail to get insight into the biology of these diseases. And actually, the biology of ovarian carcinomas is driven by copy number events.
These types of karyotypes are really laborious to produce, but they're really interesting. I wanted to show them because they reveal a couple of features of copy number aberrations that we should keep in mind as we keep going. First, it's really obvious here when translocations between chromosomes happen, because you'll see different colors linked together in the same physical DNA molecule. Second, it's obvious these are not diploid genomes, right? Most of these chromosomes are found at copy numbers of three or four or six or even more. So that means that at some point in the evolution of these tumors, at least one round of whole genome duplication occurred.
So that's exactly what it sounds like: you have duplication of the genome before the cell splits into two cells, but you don't have adequate segregation of chromosomes, so all of that DNA ends up being segregated to one cell. This is a fairly prevalent feature in cancer, and a genome duplication event is strongly associated with the propagation of chromosomal instability. In many tumors, these genome duplication events happen early, and they essentially provide the material for major chaotic reorganization of the tumor genome. And then finally, I wanted to point out that at this resolution, it's really obvious what the broad events are, the ones that encompass whole chromosome arms. But actually, there are lots of focal events in cancer, and we're going to talk about those in more detail in a little bit.
Okay, so here are some examples of the types of copy number events that we would try to detect. You can see here a normal chromosome, where this region has three loci, A, B, and C. We can look for deletions, where A is now next to C because B is deleted, or insertions of another portion of the genome, D. We can look for inversions, where we have CBA instead of ABC, or we can look for copy number variations, where we see multiple copies of a particular locus, or even segmental duplications, which tend to be bigger. Just a note on nomenclature: CNVs, which is what this is, are variations or polymorphisms present in the general population. We all have these variants. If we compare our genomes, there will be lots of these kinds of events that differ between any two normal people. CNAs, or copy number aberrations or alterations, are somatic changes that are present in tumor genomes but not in the germline. So we can see that up here, right?
So these are amplifications or deletions that typically range from 1 kb to a whole chromosome arm. Deletions, of course, involve a loss of DNA content and bring two parts of the genome that were previously distal into close contact. And amplifications involve multiple copies, not just two; that would be a duplication. Amplifications would typically be four or five or more copies of a particular region. These types of somatic rearrangements are a hallmark of tumor genomes. Loss of key tumor suppressor genes like BRCA or p53 has a significant impact on the biology of a cell, and amplifications of growth factors or proliferative genes like PI3K or ERBB2 can promote proliferation. So there's been a huge amount of effort to profile cancer genomes and find copy number aberrations that are diagnostic or prognostic, and some of these have become targets for therapy. That's really the goal of this whole exercise.
So just conceptually, how do we find amplifications, deletions, and so on? The concept I want to bring up here is heterozygosity and how heterozygosity is used to infer copy number events. Our genomes are peppered throughout with positions that naturally vary between individuals, right? These are called single nucleotide polymorphisms, or SNPs. We have about 10 million SNPs. And for ease of nomenclature, the two alleles that are the most common in the human population are denoted A and B. So here we see that in the case of no copy number aberration or variant, these three positions have a particular genotype: AA, AB, and BB. When we have a duplication, we can see a change in the genotype. So if we duplicate the AB locus, we go from AA to AAA, we maintain BB, but now we add a second B. So we've added a second B and a second A.
A hemizygous deletion, where you just have deletion of one copy, would take us from AA to a single A, and from AB to a single A, because we've deleted this B allele here. And when we have a homozygous deletion, we just don't see any evidence for either of these two alleles. This idea that the two alleles most common in the human population are denoted A and B is initially very confusing for many people. I wanted to draw on the board the thing that I usually draw when people come and ask me about this afterwards, but we don't have a board here. So I made an extra slide that you guys don't have in order to explain this well.
Okay, so this is a normal genome. You get one copy of your DNA from your mom and one copy from your dad, and almost all the positions are going to be the same. We're just going to look at four of them for reference. We have three billion bases, right? And there are going to be 10 million SNPs. I've highlighted four: one which is a heterozygous SNP, an A from the mom and a C from the dad; a homozygous SNP, where you get the same allele from both parents; another heterozygous SNP; and another homozygous SNP. In the population, each of these alleles is associated with a specific frequency. So if you profile hundreds or thousands of people, 80% of them are going to have an A, so that's the A allele, and 20% will have a C, so that's the B allele. Back in the previous slide, all this A and B stuff is actually this: the minor allele is always called B, and the major allele in the population is called A. So for this CC position, most people, 60%, would have an A, so the A allele is not present. This site has the C, which is the B allele. Similarly, here we see an AB, where now the G is the major allele, so we set that one to A. And similarly, here we can see which one is A and which is B based on the population frequencies.
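The A/B relabeling described here can be sketched in a few lines of Python. This is only an illustration, not how any particular genotyping pipeline does it: the function names are made up, and only the 80/20 and 60/40 frequencies come from the slide; the frequencies for the third and fourth sites are placeholders chosen for the example.

```python
def label_alleles(pop_freqs):
    """Map the more common base to 'A' and the less common base to 'B'.

    pop_freqs is a dict of base -> population frequency for one SNP site.
    """
    major, minor = sorted(pop_freqs, key=pop_freqs.get, reverse=True)[:2]
    return {major: "A", minor: "B"}


def genotype_as_ab(genotype, pop_freqs):
    """Rewrite a two-base genotype such as 'AC' in A/B notation."""
    labels = label_alleles(pop_freqs)
    return "".join(sorted(labels[base] for base in genotype))


# Four example sites: (individual's genotype, population frequencies).
sites = [
    ("AC", {"A": 0.8, "C": 0.2}),  # heterozygous; A is the major allele
    ("CC", {"A": 0.6, "C": 0.4}),  # homozygous for the minor allele
    ("TG", {"G": 0.7, "T": 0.3}),  # placeholder frequencies; G is major
    ("GG", {"G": 0.7, "T": 0.3}),  # placeholder site, homozygous major
]

for genotype, freqs in sites:
    print(genotype, "->", genotype_as_ab(genotype, freqs))
```

The point of the relabeling is that the actual nucleotides no longer matter: every site reduces to AA, AB, or BB.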
So now when we translate this back into B allele frequencies for our particular individual, we can see that at this position the B allele frequency is 50%. At this position, because the B allele is the C, and this person only has a C, the B allele frequency is one. Here again, we see a heterozygous SNP. And here we see only the A allele present, so the B allele frequency is zero. When we plot this on a graph where along the x-axis we have all the variants we're going to detect in the genome, though I'm only showing you points for these particular four, and we plot the B allele frequency, we're going to see a 0.5 here, a one here, a 0.5 here, and a zero here. So keep this in mind, or maybe draw a little diagram of this, because it's really useful for interpreting the plots that are going to come up. I'll show you the next plot in a second. But does this kind of make sense? We're using the population frequencies so that instead of saying AC, CC, TG, whatever the combination is of these four nucleotides, we're just going to call them A and B. It's a way to simplify the data.
And the reason to do that is to generate these kinds of plots, where I've hidden a bunch of things that I think you have printed out on your slides. The whole goal of this exercise is to go through the genome. Here we see one chromosome, and you can see the ideogram on the bottom. Each one of these points in the bottom plot represents the copy number at that locus. And the top plot is the B allele frequency plot. You can see that for most of the chromosome, you have lots of points around 0.5, so these are heterozygous SNPs. You have a bunch of stuff up here at one, where these positions are homozygous for the B allele, and a bunch of stuff here at zero, where these positions are homozygous for the A allele, so the B allele frequency is zero.
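As a small worked example, the B allele frequency at a site is just the fraction of allele copies that are B. The sketch below assumes the genotype is already written in A/B notation; in real data the value is estimated from array intensities or read counts and is noisier, and the function name is hypothetical.

```python
def b_allele_frequency(ab_genotype):
    """Fraction of allele copies at a site that are the B allele.

    Returns None for a homozygous deletion, where no copies remain.
    """
    if not ab_genotype:
        return None
    return ab_genotype.count("B") / len(ab_genotype)


# The four highlighted sites in A/B notation.
for genotype in ["AB", "BB", "AB", "AA"]:
    print(genotype, b_allele_frequency(genotype))
```

For these four sites the values are 0.5, 1.0, 0.5, and 0.0, which are the three bands seen in a normal diploid region.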
And then seeing deviations from these three bands indicates that something funny is happening. So that's the pattern to look for. And then there are specific patterns that we're going to talk about that indicate what is happening in terms of copy number and heterozygosity. This graph at the bottom basically tells us the copy number. It's the log R ratio of how much coverage you have, or the copy number, in your tumor versus your normal. So if you have a diploid tumor and a diploid normal, the difference is going to be zero. If this thing is around zero, like we see for most of the chromosome here, then there's no major copy number event. But here, just after the centromere, we see higher copy numbers, so there's been a copy number gain. Or actually, there have been two copy number gains.
Okay, so we've gone over the band pattern in a normal scenario. We can also see cases where there's a copy neutral event. So the copy number doesn't change, but this pattern changes, and we've lost heterozygosity. I don't know how many of you guys have heard of the concept of copy neutral LOH. Show of hands, maybe? Yeah, a few of you. This is kind of a typical pattern for tumor suppressors, and we'll go over it in the next slide. But you have to have this B allele frequency in order to pick up these kinds of events. When you have three copies, that changes your B allele frequency, because you no longer have one out of two. Now your B allele is going to be either one out of three, or two out of three, or completely gone, or all three copies will be the B allele. So then you start to see this banding pattern where you don't have anything at 0.5; you now have one out of three and two out of three, right? So 0.33 and 0.66, and these are always symmetrical.
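Both plots can be reasoned about with a couple of lines of arithmetic. The sketch below is idealized: it assumes a pure tumor sample and a diploid normal, whereas real log R values are compressed by normal contamination and noise. The function names are made up for the example.

```python
import math


def log_r(tumor_copies, normal_copies=2):
    """Idealized log2 ratio of tumor to normal copy number."""
    if tumor_copies == 0:
        return float("-inf")  # homozygous deletion: no tumor signal at all
    return math.log2(tumor_copies / normal_copies)


def baf_bands(total_copies):
    """All possible B allele frequencies when a locus has total_copies copies."""
    return [b / total_copies for b in range(total_copies + 1)]


for copies in (1, 2, 3, 4):
    bands = [round(x, 2) for x in baf_bands(copies)]
    print(copies, "copies: log R =", round(log_r(copies), 2), "BAF bands =", bands)
```

With two copies the log R is zero and the bands sit at 0, 0.5, and 1. With three copies the 0.5 band disappears and is replaced by bands at one third and two thirds, and the set of bands is always symmetrical around 0.5.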
And if you have four copies, you see an additional band, right, where you regain this heterozygosity, because you can have two A alleles and two B alleles, or three A alleles and one B allele, or one A allele and three B alleles, and so on. So this changing pattern of bands is correlated with the number of copies of DNA that you have, and you can see that down here in the copy number plot as well. What copy number by itself won't tell you about is these copy neutral loss of heterozygosity events. This is kind of a classic way to take out a tumor suppressor gene. You have a heterozygous position, or in the case of a tumor suppressor gene, you would have a mutation, for instance in PTCH or SUFU. And then when your cell duplicates its DNA and divides, there is a non-disjunction event. So you segregate two copies carrying the mutation into one cell and lose the wild type. Now you have a copy neutral event where that region is homozygous, because it's actually the same chromosome twice.
So here's an extremely clean and beautiful example of a tumor that has a number of different types of events going on. These are colored to indicate discrete regions of distinct copy number states. On top, we can see the copy number profile. In blue are all the regions that are diploid, so these have two copies of the chromosome and sit at zero on the y scale. A single copy gain shows up at the beginning, and that corresponds to a shift here in the BAF plot away from 0.5. So here, the gray indicates that we have heterozygosity, and here we have a shift away from heterozygosity, because now we have an extra copy. A loss is the section in green. Do you guys see green or black? Yeah, so you can see that you also have a shift away from heterozygosity.
And here we have another example of copy neutral loss of heterozygosity, where you see two copies of the genome but a complete lack of heterozygosity. We also see a homozygous deletion, where a small region is completely lost.
Okay, so let's look at a couple of quick examples of copy number aberrations involving driver genes that we know are important in cancer. This is amplification of a potent oncogene, ERBB2, on chromosome 17 of a breast cancer patient. The x-axis again is the position along the chromosome, and the y-axis is the copy number relative to the normal genome of this individual. So the expectation is that there are two copies in the reference. In the region of the genome encoding ERBB2, this very red signal indicates that there are actually many copies of the gene. You can see the median copy number is around five or maybe even six at this locus. ERBB2 is amplified in this way in about 15% of all breast cancers, and it's a driver event that leads to proliferation and growth of tumor cells. Patients that have this high-level amplification can be treated with a drug called Herceptin, and they respond really well. So that's a great example of personalized or precision medicine: knowing the driver event in a tumor lets you make relatively rational predictions for a treatment.
In clinical practice, a technique that's often used to validate that an event is happening is fluorescence in situ hybridization. This uses a sequence-specific probe that's fluorescently tagged, and you can use it to label the genomic content of cells. The blue areas here are the nuclei. The green probe, and you can see these little green dots, hopefully, is recognizing some sequence that we know is diploid in these tumors. And the red probe corresponds to ERBB2. So you can see that some cells have literally hundreds of copies of this gene.
So this FISH assay is clinically approved. An alternative to FISH, which is also clinically approved, is to measure the level of the protein through immunohistochemistry, so that's often done as well. On the other end of the spectrum, we have complete losses of tumor suppressor genes. This is what a deletion looks like. I circled this event because it's so tiny. This is a really focal deletion of the tumor suppressor PTEN. Often, tumor suppressors that need to be homozygously deleted or completely lost will have these very small, focal regions of loss.
Okay, so we have these gains and losses. The clinically relevant subset of these alterations, the ones that are functional, are going to give rise to gene expression changes. That is certainly the case for ERBB2. Here on the right, we're seeing a plot of the expression level, and the colors indicate the copy number state, with red being a high level of amplification. You can see that there's definitely a relationship between copy number and expression for this gene. In general, as a rule, there's a much better correlation with expression for focal events compared to broad events. Here we see high-amplitude gains and losses versus broad gains and losses, and the difference in gene expression as a function of whether the events are homozygous deletions, balanced in copy number, or high-level amplifications. You can see that the focal events actually drive gene expression changes to a much higher degree. And that's because gene expression is sensitive to copy number, but also to many other regulatory events, right? How many transcription factors bind to that locus, their regulation, and so on. Yes. So they are affected, but not to the level that you would see for something like ERBB2 amplification.
So that focal event would generate, in some cells, 10 copies or more. That will be much more detectable as a gene expression change than going from two copies to three, which might have a small effect on gene expression. But then you might have down-regulation of some transcription factors to correct for this, so there are a lot of regulatory inputs into the expression of genes. And, I mean, a broad event is usually one copy more or one less, right? And the focal event will usually delete two copies of a tumor suppressor gene. So then you start to see this on the left: these homozygous deletions affecting gene expression and making it go significantly down, and conversely, these amplifications really driving gene expression up.
There are a number of genes that are known to be affected by recurrent copy number aberrations in many cancers, especially these high-level gains and homozygous deletions. We see certain genes coming up as targets across a number of cancers. These correspond to the known oncogenes that drive proliferation, ERBB2, EGFR, and so on, and known tumor suppressor genes like PTEN and BRCA1 and 2, which we'll talk about in a little bit more detail in a second. Identifying the full repertoire of these driver events in cancer, especially the more rarely mutated genes, which are harder to find, takes large cohorts of patients across cancer types. So this is just a very short list, now in a bit of need of updating, of papers that describe the efforts of big international consortia to look at large cohorts of patients with copy number changes measured either with genotyping arrays or with whole exome or whole genome sequencing approaches. And of course, the ultimate goal of this activity of profiling cancer genomes is to find actionable targets.
These are genes or pathways that cancers rely on to proliferate, and the aim is then to develop therapeutic agents against those targets. So this is a brief list of specific actionable copy number changes in cancers that can be targeted. I think it's pretty clear that amplifications, or gains of function, are much more feasible to target with small molecule drugs that inhibit the action of a protein than tumor suppressors are, because it's very hard to add back functionality and very easy to just mess up the way something works.
In addition to guiding treatments, these genomes can also be used to stratify patients. This is a nice synthesis study showing that cancers reside on a spectrum, where at one end tumors harbor a lot of point mutations, shown here on the left, and at the other end, like the ovarian carcinomas we saw the karyotypes for, they harbor a lot of copy number alterations. So it seems that there is either selection for processes that promote defects in the DNA repair that fixes double-strand breaks, and thus lead to genomic instability, or selection for processes that lead to a deficiency in mismatch repair, which repairs single base changes. And we see cancers falling at the two ends of this spectrum. Having both DNA repair mechanisms altered is very, very rare and likely selected against, so we don't really see anything in this middle space. So yes, most cancers reside at the ends of this scale. Stratifying patients in this way opens up a therapeutic opportunity, because drugs have been specifically developed to interfere with each of the acquired capabilities necessary for tumor growth and progression that you've heard about already. Many of these drugs are already in clinical trials, and in some cases they have already been approved for clinical use.
And we can take a quick look at one category of genomic instability drugs, the PARP inhibitors. This is a class of drugs targeting genomic instability. The key idea here is that DNA is damaged thousands of times during each cell cycle, so that damage has to be continually repaired in order for each of your normal cells to proceed through the cell cycle. PARP1 is a protein that's important for repairing single-strand breaks, which are nicks in DNA. If these nicks persist unrepaired until DNA is replicated, then the replication process itself can cause double-strand breaks at those positions. So drugs that inhibit PARP1 will cause multiple double-strand breaks to form in this way. Normal healthy cells survive inhibition of PARP because they have intact BRCA1 and BRCA2 proteins, which are involved in the repair of double-strand breaks through the homologous recombination repair pathway. But in the subset of tumors where these genes are mutated, like breast cancers that are BRCA1- or BRCA2-mutated, double-strand breaks cannot be efficiently repaired. So they accumulate in increasing numbers and lead to the death of these cells.
These two events, BRCA1 mutation and PARP inhibition, are considered synthetically lethal to one another: each individual event can be tolerated, but together they cause cell death. So this drug won't affect your normal cells, because your normal cells have a working copy of BRCA1, but it will selectively kill the breast cancer cells with these mutations. In a way, this opens up a way to target tumor suppressor genes, which are very difficult to target with small molecule drugs, and we can take advantage of this synthetically lethal combination to come up with better therapies. There are currently two clinically approved drugs available for BRCA-mutant breast cancers.
So women can have a test for the presence of BRCA mutations. These drugs are approved, and there are others in clinical trials as well. So this is a great example of how stratifying patients can lead to better approaches to therapy.
Okay, so I'm going to move on to talking about some confounding factors that make copy number inference challenging. Essentially, the challenge comes from three main reasons. First, when you're profiling a sample, as I mentioned, cancer cells are almost always intermixed with some normal cells. So we can have low purity in some cases, and that's going to confound our analysis. Second, the actual DNA content of the cells, the ploidy, is unknown. You don't know if you have a diploid tumor or a tetraploid tumor and so on. And third, the cancer cell population could be heterogeneous because of this clonal evolution process, where you gain and lose things as a function of time and only some cells will have specific events. When these values are unknown and have to be predicted, there's often more than one combination of purity and ploidy that can explain an observed copy number state. This is called the identifiability problem. So for instance, let's say that we have a homozygous deletion in a sample with 40 percent purity. The way that you would calculate the copy number is you would take the contribution of DNA from your normal cells, which are diploid. So that particular locus, let's say we're looking at PTEN, has two copies in 60 percent of cells, because 60 percent of cells are normal. And in the 40 percent of cells that are part of your tumor, you see zero copies. When you add these two things together, you get an overall copy number in your sample of 1.2 instead of 2. So you know there's been a deletion, but you don't see that there's been a homozygous deletion.
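The mixing arithmetic can be written out as a weighted average of the normal and tumor contributions. This is a minimal sketch with hypothetical function names, assuming a single tumor clone and diploid normal cells. Note that an observed value of 1.2 with zero tumor copies requires 60 percent normal cells, that is, a tumor fraction of 0.4.

```python
def observed_copy_number(purity, tumor_copies, normal_copies=2):
    """Average copy number in a mixed sample.

    purity is the fraction of cells that are tumor; the rest are normal.
    """
    return (1 - purity) * normal_copies + purity * tumor_copies


def purity_for(observed, tumor_copies, normal_copies=2):
    """Purity that would explain an observed value for a given tumor copy number."""
    if tumor_copies == normal_copies:
        raise ValueError("purity cannot be determined from a copy-neutral locus")
    return (normal_copies - observed) / (normal_copies - tumor_copies)


# Homozygous deletion in a sample with 60 percent normal cells.
print(round(observed_copy_number(0.4, 0), 2))

# The same observed value, explained by different tumor copy numbers.
for tumor_copies in (0, 1):
    print(tumor_copies, "tumor copies -> purity", round(purity_for(1.2, tumor_copies), 2))
```

The loop at the end shows why the measurement alone is ambiguous: a homozygous deletion at purity 0.4 and a single-copy loss at purity 0.8 both average out to 1.2 copies.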
You can also get 1.2 in a heterozygous deletion case in a tumor with 80 percent purity. In this case you have two copies coming from 20 percent of cells and one copy coming from 80 percent of cells. So you have the same measured copy number, but two ways in which it could have happened. You could also have an equivalent B allele frequency, for instance in a case where you have a copy number gain in a diploid tumor, so you go from AB to AAB, or a copy number loss in a tetraploid tumor, so you go from AABB to AAB. So there's this identifiability problem, where purity, ploidy, and subclonal events all make this inference more difficult.
Yes, what was the question? So you have, I mean, I guess a way to measure it is a main cell. So, I mean, one way, which is going to be on the next slide, is to infer the best fit of the copy number, ploidy, and purity. I think there are also ways to measure purity, kind of in vitro, to validate your prediction, but I don't have expertise in that. Do you guys in the back? How would you measure how many cells have for the period like that? Yeah. So there is some. Yeah. So you know that for that small population. What your purity is? Sub-sample. Yeah. Cancer cells. So for some of these methods that try to infer purity, the gold standard they compare to is some of the samples in TCGA, right, The Cancer Genome Atlas, which has sequenced thousands and thousands of tumors. For some of them, pathology slides exist, and a review has been done to determine what the purity was in those samples. Yeah. So it's definitely a difficult problem. This is one of the approaches. This is a computational tool, and most of the tools that are going to predict purity are computational tools. If you have a pathologist handy, then you could do a bit more to confirm your predictions. But this is the ABSOLUTE algorithm.
So this takes in processed copy number segments like we were seeing with the BAF and copy number plots from before, so the loss of heterozygosity and inferred amplifications and deletions. And then it tries to infer the best combination of purity and ploidy, that's this graph here on the bottom left, that would explain the particular amount of gains, losses, and LOH in that individual sample. I could go through this whole thing, but it's a published paper. And I think we're using ABSOLUTE in the lab, is that right, Hamza? Are we using ABSOLUTE in the lab? Okay. So it's a paper you guys could read, or we could go into more detail on these plots afterwards if you want. But the whole point is that it has a model that tries to fit purity and ploidy and come up with a prediction. And in cases where different combinations of purity and ploidy are equally likely, it picks one based on information from a database of what's normally seen in cancer. So it's more common to see a tetraploid tumor than a tumor with seven copies. And so if there's an equal choice between your tumor being tetraploid or having seven copies, it's slightly more likely to be called tetraploid. Taking this approach and looking at purity and ploidy across 5,000 cancers, it turns out that over a third of all cancers have a ploidy of three or greater, meaning that they must have undergone a genome doubling event at some point in their evolutionary history. We see this for multiple types of cancers: the different kinds are here on the bottom, ploidy is on the left, and here's a cumulative distribution. You can see there's a peak at two, where most tumors are diploid, but also a significant peak around three and just under four. So can anyone tell me why it's not actually four? What do you guys think? Why don't we see four?
Well, this is after inferring purity. So it could be normal contamination if your purity estimate isn't quite right, but usually it's because when you do have a genome duplication event, you then have lots of other rearrangements that happen, and in many cases those are deletions. So your tumor duplicates its genome, and then all sorts of events happen, including massive losses, because you have plenty of extra DNA, so you can afford to lose a lot. So you see a lot of losses in these tetraploid genomes, and that's why the overall copy number goes down to three point something. So you won't typically see four. Yeah. Yes. So you would go from AABB to AAB or ABB if you lose one copy, but you could also lose two copies. Yes, it's an average over the whole genome. And then these graphs show that there's compelling evidence that this genome doubling event happens early. What we see here, for broad events on top and focal events on the bottom, with gains in red and losses in blue, is a comparison in which every pair of bars is the diploid versus the genome-duplicated samples. There are many more amplifications and deletions that occur after the genome duplication. So these genome duplication bars are a lot higher, and specifically the deletions are a lot higher. So more CNAs occur after genome duplication than before, and deletions outnumber gains. And so just one final example for this section, on the clinical relevance of genome doubling events in cancer, is from this recently published cohort of 100 lung cancers, where each patient's tumor was genomically profiled using exome sequencing of multiple spatially separate samples of the primary untreated tumor. So each tumor has three or more individual pieces.
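Going back to the point that ploidy is an average over the whole genome, here is a minimal sketch of that calculation. The segment lengths and copy numbers are invented for illustration, and `average_ploidy` is a hypothetical helper, not a function from any of the tools mentioned.

```python
def average_ploidy(segments):
    """Length-weighted mean copy number over (length, copy_number) segments."""
    total_length = sum(length for length, _ in segments)
    if total_length == 0:
        raise ValueError("segments must cover a non-zero length")
    return sum(length * cn for length, cn in segments) / total_length


# A toy genome-doubled tumor: 60% of the genome is still at 4 copies,
# 30% has lost one copy, and 10% has lost two copies.
doubled_then_lost = [(60, 4), (30, 3), (10, 2)]
print(average_ploidy(doubled_then_lost))
```

With losses layered on top of a doubling, the average lands between three and four rather than at four.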
And so this is the conceptual design of this experiment, where you do copy number analysis and mutation calling for each region and then use that information to work out the phylogeny of events that generated this tumor. So at the bottom here you would have the normal cells that then acquire mutations, and the clone that has all of these mutations grew, and then a portion of those cells acquired some mutations or copy number aberrations that are specific to this subset of cells, so these are subclonal. And so you have this diversification, and you can infer these phylogenies from multi-regional samples or from single cell samples of tumors. In this paper, and I'm just going to highlight a couple of findings that are relevant to our discussion, they found that nearly 50% of copy number alterations were subclonal and restricted to a single part of the tumor. So if you're just biopsying one piece and sequencing that, and you're seeing an event, you have a 50% chance in this kind of tumor that it's actually just a regional event and not one of those trunk events. So 70% of these subclonal events can look clonal, because from one sample we can't tell that they're not in all cells, so it's important to do this multi-regional sampling or to keep in mind this idea that there is clonal evolution. Early genome doubling events were highly associated with the presence of subclonal events, right? So that's the idea that this really propagates genomic instability: every cell has a chance for something else happening, and you see lots of these subclonal events. And here on the right we see this survival plot, where you can appreciate that those patients with lots of subclonal heterogeneity actually do a lot worse in terms of disease outcome compared to those patients that have a more bland genome. So it's a clinically relevant aspect of the biology of these tumors.
So it's also possible to classify mutations in genes by their timing, into those that happen early in tumorigenesis, so pre genome doubling, which are going to be events that are involved in tumor initiation, and those that are subclonal and occur later, which likely have a part in tumor maintenance or even resistance to therapy. So if you're thinking of trying to find druggable genes in a disease, then the idea would be to focus on the events that are going to be present in every cell rather than the events that are going to be present in only a subset of cells. So I recommend reading this paper. It's full of interesting facts that are relevant to our concepts, so it's a good thing to read if you get a chance. That concludes the first part of the talk, and now there's going to be a more detailed second part. I just want to take maybe a five-minute break. If there are any questions, we can chat about things. Otherwise, I don't know, Anne, if you want to pause this for five minutes. Okay. [Break, followed by a microphone check.] Okay guys, I think we should get going. I will speak up. Is this a little bit better? Great.
Okay, so measurement technologies for copy number analysis. There's been a progression of these technologies over a number of years, ranging from low resolution, high accuracy approaches like FISH, which we mentioned, to this middle ground of higher resolution but still lower accuracy, and coming to the current day with high resolution and high accuracy methods like whole genome sequencing. So FISH is this method where we can look at the actual copy number at just a few loci simultaneously in single cells. And then in the early 2000s, these hybridization array platforms were developed that were capable of probing between 30,000 and 100,000 positions in the genome. One could wash total DNA over this array and generate intensity signals that corresponded to the copy number state, but no LOH calls, so no B allele frequencies. And then the mid-2000s saw the advent of very high density genotype arrays that came on the market, mostly from Illumina and Affymetrix. These really drove analysis of copy number in cancer for many years. So large cohorts of many types of cancers have been generated on these platforms, for instance from the TCGA. And at the moment, the large consortia and many labs have moved to whole genome or exome sequencing data. So we're going to take a look at both array and sequencing data analysis in the next few slides and also in the lab. So just to remind us, the challenges that have to be overcome in order to accurately measure copy number changes are, first, that cancer is a mixture of normal cells, the tumor, and the microenvironment. These essentially dilute the signals of gain and loss, because the normal cells have a normal diploid genome. So that alters our sensitivity. There's also this intratumoral heterogeneity, where we have copy number profiles in just subsets of cells. So this creates a lot of biological noise when we try to infer signals.
So deconvoluting these mixtures of cells is an important part of analysis. And then finally, the other confounding factor is that we're looking for somatic events in the presence of germline alterations. Germline events are going to be the strongest signals we see, because they're present in every single cell. And we have the problem of polyploidy, so all the calls we make are actually relative to a diploid copy number in the germline. The original algorithms that were devised for analyzing microarray data were designed for population studies, where people were looking for differences in germline copy number events and heterozygosity between normal individuals in different populations, for instance. And so when these algorithms were applied to cancer data, it became obvious they were not very well suited to dealing with all these sources of biological variation. In the last few years, there's been tremendous advance in the computational approaches to interpret these copy number signals. This is just a good review from Terry Speed, the godfather of bioinformatics, that describes the statistical considerations in cancer genomics data. So it's a good read in general for the problem at hand. And so we've talked about total copy number calling and genotypes. This is just to remind us of the scenario where a single copy duplication leads to a shift in genotype from AB to AAB or ABB, and a deletion shifts AB to just A, or to nothing in the case of homozygous deletions. So we have these changes in genotype. When we have two copies of the genome, there are three possible genotypes: homozygous, heterozygous, or homozygous for the other allele. In the case where we have a copy number gain, we can now denote three alleles. So we would have AAA, AAB, ABB, and so on and so forth when we have four copies or five copies. So this type of table will summarize the types of genotypes that you would see at different kinds of copy number events.
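A table like the one described can be generated programmatically. The sketch below is illustrative only: `genotype_table` is a made-up name, and it simply enumerates every A/B combination for each total copy number along with the B allele frequency you would expect in a pure sample.

```python
def genotype_table(max_copies=4):
    """Map each total copy number to its possible genotypes and expected BAF."""
    table = {}
    for copies in range(1, max_copies + 1):
        rows = []
        for b_count in range(copies + 1):
            genotype = "A" * (copies - b_count) + "B" * b_count
            rows.append((genotype, b_count / copies))
        table[copies] = rows
    return table


for copies, rows in genotype_table().items():
    summary = ", ".join(f"{g} ({baf:.2f})" for g, baf in rows)
    print(f"{copies} copies: {summary}")
```

At two copies this gives AA, AB, and BB with expected frequencies of 0, 0.5, and 1, and each extra copy adds one more possible genotype.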
Oh, there was one more thing I wanted to say. So it's not just the copy number; this actually also tells us whether the tumor has a balanced zygosity or a complete loss of one allele in favor of the other. So having that B allele frequency is really important. OK, so inferring these B allele frequencies, both for array data and sequencing data, relies on measuring the relative frequencies of A and B alleles at polymorphic positions in the genome. So these are SNPs. I don't know how many of you guys have used dbSNP, but you've probably all heard of it. dbSNP is a database that warehouses this information. In the latest version, it has a set of 130 million SNPs that have a known frequency in the general population. So these are useful for this type of work. And we see here some examples of SNPs in the BRCA2 gene, which has over 8,000 variations annotated along its length. A lot of these are non-coding and would occur in the UTRs or introns. But in all cases, the two alleles are listed. You can see that here in this third column, along with their known frequency in the population. So for instance, in this case, the C is the minor allele, and it's found in 0.009% of the general population. So the T is the major allele. The T would be the A allele, and the C would be the B allele. This wouldn't be a great example of an allele to include on a genotyping array, because most people are not going to have this variant. The second variant is a good example of one that you would want to include on a genotyping array, because 30% of people will have the T, and the other 70% will have the G. So these are informative heterozygous SNPs. And so the Affymetrix SNP6 arrays were designed to measure the presence of SNPs that have this evidence for heterozygosity in the general population. Essentially, it consists of probes that are 25 base pairs long.
They're oligonucleotides which contain the polymorphism in the middle. There are 900,000 probes in genomic regions known to have SNPs, as well as over 900,000 probes in genomic regions which are known to vary in copy number but don't necessarily have polymorphisms. These probes hybridize with labeled DNA and generate a continuous signal of intensity that corresponds to the amount of DNA at that locus in the library. So the more copies of DNA you have, the brighter the signal of binding to that specific oligo probe. And because we know the positions of the probes on the genome, we can plot the intensities of the probes on the chromosome plots as dots, as we've seen. And by the way, a good question came up during the five-minute break, which was that when we look at those B allele plots and we have the three bands, it's easy to assume that at any position you see three bands, but actually each dot is an independent point along the genome. So at a certain position, you're going to have a 1 in your B allele frequency, and then a few bases later, you'll have a 0. And a few bases later, you might have a 0.5. And because there's noise in all this data, you have that fuzziness to the bands. But you never have a position at which you see more than one of these values. So this describes what actually happens on a SNP6 array. Basically, we have this DNA that has a SNP. One allele will have an A, and one allele will have a C. And the probes that correspond to the SNP are going to have the complementary bases, so some will match the A, and some will match the C. This is so tiny on mine. Do you guys see different colors here? It's this T. I think this A or C would correspond to T or G. And then when the DNA that has allele A binds, so this is the DNA here on the left, it binds the probe that has a T, because it has perfect complementarity. And this binding always has kinetics.
So the DNA and probe stick together, there's some fluorescence signal given off, and then they come apart. So you always have some binding, even for non-specific or less specific interactions. You're always going to have some background with SNP6 arrays. But you'll have a lot more signal at those bindings that are perfect. So this AT and this GC, which have perfect binding, will give a stronger signal. You always have signal for all the different genotypes, but you can tell which ones have more support. So that's the short notes version of how this works. People still run SNP6 arrays, and there's a wealth of data publicly available from these large consortia. TCGA, for instance, has about 11,000 tumor samples profiled with SNP6 arrays, as well as other platforms. These span a range of diseases, so you can see here lung cancer, brain tumors, breast, and so on. Until recently, there was no equivalent for the mouse, but now there is a genotyping array that is able to characterize a wide range of mouse strains and uncover genetic events in mouse models of disease. So part of what you're going to do today in the lab is to take genotyping arrays on the Affymetrix SNP6 platform, where we start out with CEL files, which contain the intensity values for all of these probes. So that's what comes off the machines. The next step in the workflow is to pre-process these signals from all the probes on the arrays, so that you end up with normalized and comparable signals across the whole genome and across all your samples. Some samples may not work very well. Some probes may work better than others or be noisier than others. So this pre-processing step is pretty important. And if you have data and you want to analyze it, this is the time to get all your samples in at once, because you want to do this normalization of everything together.
If you then bring in another cohort because you've profiled a few more patients, then depending on how many you're bringing in, it's actually worth redoing this whole normalization with the whole cohort again. So once you do this, it's followed by a couple of different extraction techniques. On the left, we see generation of calls for copy number and minor allele frequency. Then those measurements are processed with a statistical model that can infer where the copy number and B allele ratio changes occur, so what the breakpoints are across the genome. Once we have those segments, we can work out what genes are encoded in the different regions of gain or loss. And we can follow up, like in the other modules that are coming up in this workshop, with things like pathway analysis or clinical applications. So this is the workflow for SNP6, but it's really generalizable to sequencing data as well. For any kind of data, normalization is absolutely required in order to remove platform-induced artifacts. These probes, like I mentioned, are actually not very specific. They will hybridize with other parts of the genome that they're not intended to hybridize with. That degree of hybridization can be affected by the length of the DNA fragments that are washed over the array. And some probes may have worse binding kinetics in the presence of mutations or clusters of SNPs. So if you have multiple mismatches between your DNA and your probes, that could affect your binding. The aroma.affymetrix package handles a lot of these artifacts, so that each experiment is comparable with other experiments, and it outputs copy number and B allele frequencies. So hopefully these reflect biology rather than artifacts. Once we've normalized this data and removed artifacts, we can start to infer copy number aberrations, loss of heterozygosity regions, and allele-specific copy number changes.
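To make the idea of inferring breakpoints concrete, here is a deliberately simple toy segmenter. It is not the statistical model used by any of the packages mentioned in the talk; it is a greedy rule, assumed here for illustration, that starts a new segment whenever a probe value departs from the running segment mean by more than a fixed `tolerance`.

```python
def segment(values, tolerance=0.3):
    """Greedy toy segmentation of an ordered list of copy number signals.

    Returns (start_index, end_index_exclusive, segment_mean) tuples.
    """
    segments = []
    start = 0
    total = 0.0
    for i, value in enumerate(values):
        count = i - start
        # Open a new segment when this probe is far from the running mean.
        if count > 0 and abs(value - total / count) > tolerance:
            segments.append((start, i, total / count))
            start, total = i, 0.0
        total += value
    if values:
        segments.append((start, len(values), total / (len(values) - start)))
    return segments


# Toy log-ratio track: neutral region, then a gain, then back to neutral.
signal = [0.0, 0.1, -0.1, 1.0, 0.9, 1.1, 0.05]
for start, end, mean in segment(signal):
    print(start, end, round(mean, 2))
```

Real segmentation methods model noise explicitly and use the B allele signal as well; this sketch only shows what the output of a segmentation step looks like.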
This slide lists a number of methods for high-density genotype arrays, including OncoSNP, which is the package that we're going to use in the lab, and it infers purity and ploidy. In the copy number field, these genotyping arrays were just dominant for many years. Recently, whole genome sequencing has taken over as being routinely performed, because the cost has dropped significantly. So I think currently it costs about $1,400 to do a whole genome at 30x coverage, and about $650 or $700 to do a whole exome. So it's much more reasonable than it used to be. In a whole genome experiment, and you've seen one of these kinds of slides before, libraries are essentially made by shearing or fragmenting your DNA into pieces that are relatively uniform, although you always have a distribution. Let's say you're aiming for about 300 base pairs. Most DNA fragments will be in that range. And then you sequence them from both ends, so you sequence 100 base pairs from each side of each piece of DNA. You get these paired-end reads, which are these orange bits, with the unsequenced portion of the fragment in gray in the middle. And then when you align these read pairs to the genome, hopefully you see obvious patterns when it comes to coverage. An average amount of coverage would be what you would get in a diploid genome. Extra coverage is due to copy number gains. Loss of coverage is due to deletions. No coverage is due to homozygous deletions. There are also technical reasons for seeing variations in coverage, which we have to account for. But these sequence reads also give us the allelic ratio at single nucleotide polymorphisms across the genome. So we can infer these copy number events, and we can also infer the B allele frequencies in an analogous way to array data. But of course, we do this using read counts instead of intensity signals, which can be quite a bit cleaner.
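As a rough sketch of how read counts turn into the two signals discussed here, the snippet below computes a B allele frequency at one SNP and a log2 depth ratio for one region. The function names are invented for this example, and it assumes the tumor and normal depths have already been scaled for differences in total sequencing depth.

```python
import math


def b_allele_frequency(a_reads, b_reads):
    """Fraction of reads supporting the B allele; None if the site is uncovered."""
    depth = a_reads + b_reads
    if depth == 0:
        return None
    return b_reads / depth


def log2_depth_ratio(tumor_depth, normal_depth):
    """log2 of tumor over normal depth (both assumed library-size normalized)."""
    if tumor_depth <= 0 or normal_depth <= 0:
        raise ValueError("depths must be positive")
    return math.log2(tumor_depth / normal_depth)


# A heterozygous SNP in the normal, and loss of the B allele in the tumor.
print(b_allele_frequency(a_reads=14, b_reads=16))
print(b_allele_frequency(a_reads=20, b_reads=0))
# Twice the normal depth corresponds to a log2 ratio of 1.
print(log2_depth_ratio(tumor_depth=60, normal_depth=30))
```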
So we move from an analog technology with the arrays to a digital one, so from intensities to counts. Now, some of the biases in whole genome data. Often GC content is the predominant contributor to the number of reads that show up at a given locus. So the more GC content you have, shown here on the x-axis, the more read depth you have. There's a positive correlation. That's one aspect of the data that needs to be corrected for, and regression techniques are typically used to correct that bias. Here on the right, you see the corrected coverage considering GC content. Another source of bias is that the human genome has many, many repetitive sequences, as you've heard already. So some reads cannot be aligned or mapped unambiguously. When you have multiple alignment options for any given read, depending on the parameters with which you're aligning, the aligner will pick one of those three spots that a read might go into. This could be a random selection, or it could be the first position of alignment, depending on the parameters you're using. And that generates those stacks of reads you guys were seeing before, right? It's always going to make the same decision out of the three possible positions, so you'll see those stacks of reads. That's actually something we can account for, because we know what the repetitive elements are. So this repeat content makes things difficult, but we can correct for it to a large degree. And so the main effect of all this processing is that we go from a copy number plot that looks like this on the top, which is very fuzzy and messy, to something cleaner. We can bin regions of the genome in order to get rid of some of this variability. So if we bin reads into 1 kb regions and calculate copy number in each of these bins, then we eliminate some of this variability.
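The binning and GC correction steps can be sketched as follows. Note the swap: the talk mentions regression for GC correction, whereas this toy version uses a simpler stand-in, dividing each bin's count by the median count of bins with similar GC content. All names and the `gc_step` width are assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import median


def bin_read_starts(positions, bin_size=1000):
    """Count reads per fixed-width bin, keyed by bin index."""
    return dict(Counter(pos // bin_size for pos in positions))


def gc_correct(bin_counts, bin_gc, gc_step=0.05):
    """Scale each bin by the median count of bins in the same GC stratum."""
    strata = defaultdict(list)
    for bin_id, count in bin_counts.items():
        strata[round(bin_gc[bin_id] / gc_step)].append(count)
    corrected = {}
    for bin_id, count in bin_counts.items():
        reference = median(strata[round(bin_gc[bin_id] / gc_step)])
        corrected[bin_id] = count / reference if reference > 0 else 0.0
    return corrected


print(bin_read_starts([10, 999, 1000, 1500, 2500]))

# High-GC bins (2, 3) have inflated raw counts; after correction the
# relative signal within each GC stratum is what remains.
counts = {0: 10, 1: 20, 2: 40, 3: 80}
gc = {0: 0.40, 1: 0.40, 2: 0.60, 3: 0.60}
print(gc_correct(counts, gc))
```

A mappability correction could follow the same pattern, using a per-bin mappability score in place of GC content.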
And then once we account for GC content and for mappability, the signal becomes much cleaner and really represents the biological data to a greater extent than what we started out with. So we can start to see the events in the cancer. So in the lab, you'll use... [Audience question, partly inaudible: is there a ground truth or some test data to compare these methods against?] In terms of the effect of pre-processing, or whether these gains, for instance, are real? I mean, you could biologically validate that in your samples using FISH probes. That would be the ultimate validation of this copy number gain, for instance. You could use other methods that work in a different way. You could PCR across breakpoints for inversions and translocations, which are not very well represented here. That's actually a big advantage of the whole genome approaches over SNP arrays: you can start to see translocations and other kinds of copy-neutral events that are just invisible on the SNP6 platform. On SNP6, you can look at copy number gains and losses and LOH at the resolution of the SNPs, which I think are on average about 700 base pairs apart. So that's pretty much going to be the resolution of your breakpoints, right? Whereas with whole genome data, you can often have a much higher resolution. If you're in a repeat region, then your resolution is going to be slightly affected, so you won't always have base-pair-level resolution. But for deletions, it's easy. You design primers on both ends and prove that you get a band for the deletion versus the wild type. Same for translocations and inversions and so on. I think there are some gold standard data sets that have had a lot of sequencing and that people use to compare different algorithms in order to assess sensitivity and specificity. [Audience follow-up, partly inaudible.] Yeah, the copy number... I see what you're getting at, yes.
So the reproducibility is a bit tough, and the copy number events are not as robust as mutation calling, for instance, and even mutation calling has problems. Big events or events with a high copy number, I would say, are more believable than small changes or changes with a lower copy number. You can see them in a mic. So once we have this pre-processed and normalized data, and you can see here that we have these regions going up and down across the chromosome, segmentation is applied. This is the step where we have these different regions that have a concordant copy number for a while and then change to a different copy number. So segmentation is basically inferring where the start and end of your event is. This particular chromosome has about 30 segments that each differ in copy number from the previous segment and from the next segment. It's possible to do this with exomes. It's very desirable, because it's much cheaper, so there's a lot of interest in performing this type of analysis on exomes. When you have exome data, you're only working with about 1 to 2% of the data you get from a genome. So it's actually a bit more difficult, and it took a lot longer for methods to become available that did a decent job of working with exomes. Now there are definitely tools that you could use if you have exome data. And there are plenty of samples in TCGA, for instance, which have exomes and copy number events inferred from those exomes. Okay, so this is the slide I showed before, which shows the super clean example of the typical features we look for in a copy number analysis, right? Single copy gains, amplifications, single copy and homozygous losses. What this doesn't show us are subclonal events, which we know are prevalent in cancer genomes. So how can we find those? They actually show up as weaker signals that are centered around non-integer copy numbers.
So instead of going from zero to minus one to minus two, you'd go from zero to minus 0.2. And again, that goes back to Francis's comment about whether we can robustly find copy number events. That problem gets even worse when we consider these subclonal events. But the tool you're running in the lab, TITAN, can predict subclonal events. However, I think the data that you're using has mostly clonal events, except possibly one small subclonal deletion, right? So that's the example we can focus on here. And that's because we can't really use patient data in the context of the CBW. If you do have patient data, then it's much more feasible to find subclonal events. So this is how they would show up. Here we have an area of the genome that's diploid. We see it here in blue. And we see a single copy loss here in green. You can see that the subclonal events, the losses, are just a little bit under the diploid line but not quite as pronounced as the single copy loss. And in contrast to the single copy loss, where in the BAF plot you go from a 0.5 heterozygous state and lose that heterozygosity, in these regions you don't completely lose the heterozygosity, but you move away from it. That's what the data would look like in some of the examples where you would have subclonal events. So in the lab you're going to use TITAN, as I mentioned. The conceptual framework for this tool is that we typically profile a matched normal and a tumor sample. We look at those positions in the normal and the tumor where we have a heterozygous SNP. That determines the positions of interest that are going to be informative for this analysis. This is going to be about three million SNPs per individual on average. And at every one of these positions, you count the alleles, so allele A versus B.
And then you apply a statistical model that takes these genotypes and coverage as input and tries to learn where the copy number and loss of heterozygosity segments are. Then it tries to determine their cellular prevalence. So you can see here in the normal, there was a heterozygous event, three reads versus four reads for the A and B, and you go to homozygosity in the tumor, so you have five reads supporting just one allele. Question, yes. How do you determine the purity of the tumor? So there is a step where it tries to infer purity. Hamza, do you know the exact way TITAN calculates purity? [Hamza, partly inaudible: it starts from copy number two and then determines the probability based on the degree of gains elsewhere. Basically, it's trying to minimize the standard deviation of allele frequencies, given that you might have normal contamination.] So the conceptualization of that answer is shown here, where you have a sample with 80% tumor purity and one with 70% tumor purity. These would be different regions of the tumor, and you put them together to get an average. You can think of this as a clonal mixing example. In sample A, you have a homozygous deletion, or sorry, actually a one copy deletion and a one copy gain, and in sample B, you have a one copy deletion. When you merge these data together, because this gain is only in a subset of tumor cells, you're going to see it averaged to a lower frequency. And then the cellular prevalence is estimated. There's an overall estimate of normal contamination, in this case 25%. So all these signals are going to be at most a fraction of the possible signal because of this normal contamination. TITAN has been benchmarked against other tools. All tools do pretty well for clonal events. For subclonal events, TITAN outperforms the other tools.
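The way normal contamination and cellular prevalence dampen a signal can be illustrated with a small calculation. This is a simplified sketch and not TITAN's actual model: it assumes normal cells and tumor cells without the event are diploid, and `expected_copy_number` is a hypothetical helper.

```python
def expected_copy_number(normal_fraction, cellular_prevalence, event_cn,
                         baseline_cn=2):
    """Average copy number seen in bulk for a clonal or subclonal event.

    normal_fraction: fraction of all cells that are normal.
    cellular_prevalence: fraction of tumor cells carrying the event.
    """
    tumor_fraction = 1.0 - normal_fraction
    tumor_average = (cellular_prevalence * event_cn
                     + (1.0 - cellular_prevalence) * baseline_cn)
    return normal_fraction * 2 + tumor_fraction * tumor_average


# With 25% normal contamination, a clonal single-copy loss is already
# diluted, and a subclonal one (40% of tumor cells) sits closer to 2.
clonal = expected_copy_number(0.25, 1.0, event_cn=1)
subclonal = expected_copy_number(0.25, 0.4, event_cn=1)
print(round(clonal, 3), round(subclonal, 3))
```

This is why subclonal losses appear just under the diploid line instead of at a full one-copy drop.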
And so this is a good way to approach this type of copy number calling and improve your sensitivity for subclonal events. So just to finish with a nod towards emerging technologies like single cell DNA sequencing: the idea is that instead of deconvoluting these mixtures from bulk samples, where tumor cells are intermixed with normal cells, you can directly measure the DNA content of individual cells. So you can directly measure clonal composition. This paper describes single cell sequencing from a triple negative breast cancer sample. Here you see the frozen tumor, and then cells are individually dissociated. The DNA content of nuclei is shown here. Some nuclei are diploid, some are hypodiploid, and some are aneuploid, so they have copy number gains or may even have undergone a genome duplication event. You can cluster the copy number of these cells and show that these diploid cells are actually the ones that are normal. So when you do sequencing and you call mutations and copy number events, you can see that the normal cells, which are diploid, have essentially nothing happening. And the tumor cells, which are these hypodiploid or aneuploid cells, have events in many genes. You can see the population structure start to emerge, because you can appreciate that a small subset of cells have particular drivers, and these other cells, which share a lot of events in common with the tumor cell population, have a non-overlapping set of drivers. So this data is not necessarily cheap or easy to generate or work with. There are certainly biases with this kind of data, but there are many efforts ongoing to couple these single cell and bulk measurements together. As the technology evolves and becomes cheaper, we will see more and more insight into this clonal evolution of cancer from this type of single cell experiment.
Unfortunately, we're not going to do anything with single cell data in the lab today, but TITAN can be used for single cell analysis. So if you look back at last year's lab, I think you could use those methods if you guys want to try it out on single cell data. That would be one way to do it. So I'm just going to end here and summarize. Genome architecture is fundamentally important in cancer. Somatic copy number aberrations will change gene dosage, so that's the functional consequence of these aberrations, specifically for oncogenes and tumor suppressors. We can measure these aberrations in a few ways, and SNP6 and whole genome are what you guys will talk about next in the lab. This really is an opportunity to find therapeutic opportunities for tumors. And not only that, but to track these clonal populations in tumors and see what events are going to be responsible for recurrence or metastasis or resistance to therapy. Here are some tools that are common in the field, including the two we're going to use. It's a good resource. And I'm happy to take any more questions.