Hello everyone. My name is Sorana. I am an assistant professor at the University of Calgary, where I've been for about a year and a half. Previously I did my postdoc here at SickKids in Toronto, so it's kind of nice to be back. I do cancer genomics, so I'm interested primarily in brain tumors and how they evolve through time and in response to therapy. One thing that's not on my CV is probably my first paper, which I published in undergrad when I worked in a zoology lab. My research was putting aphids on these fake leaves that we made so that we could test the interactions between aphids and ladybugs, to see how they communicate and how their communication is interfered with by the bean plants. Anyway, I started off in zoology and I ended up in cancer genomics. Ecology is very interesting, and actually maybe I should put it on my CV because it makes perfect sense: I still study ecology, but now of cancer cells. Okay, so we're going to talk today about somatic copy number alterations in cancer. The learning objectives are really kind of broad. We're going to talk about some biology, and we're going to talk about some computational approaches to detecting copy number variations and interpreting these signals. We'll start out with a bit on cancer evolution and genetic heterogeneity. We'll talk in broad strokes about copy number and structural variations and genomic instability and how that feeds into cancer evolution, about tumor suppressors and oncogenes, and about what actionable alterations are. Then we'll go into a bit more detail on how to detect copy number aberrations, what the confounding factors are, and what strategies there are to overcome these, specifically the purity and ploidy of tumor samples and genetic heterogeneity within individual tumors. We'll talk specifically about two technologies for measuring copy number alterations, SNP6 array data and whole-genome sequencing data, and then about analytic approaches for analyzing these.
So feel free throughout the lecture to put up your hand or just shout out with questions if you have them. I'm happy to chat as we go along. Okay, so you've heard, I think from Trevor, that cancer is a disease of the genome. Tumorigenesis is really this multi-step process that requires several mutations. We start out with a normal cell here on the left that acquires some event that gives it a proliferative advantage. It either replicates more often than its neighbors or it dies less often, and so its frequency increases in time. These cells continue to acquire mutations or copy number aberrations as they proliferate, and so we have these expanding populations of cells. On the x-axis is time, so when we profile a tumor, we're really measuring these different populations of cells. When we treat a tumor and then profile the recurrence, we often see that the genomic architecture is quite different at recurrence. So the malignant cells within a single tumor can differ significantly from each other both in space and in time. And very rarely is a tumor 100% pure. Tumor samples often contain infiltrating cells like immune and inflammatory cells, as well as various types of cells in what is called the microenvironment, which essentially makes tumors as complex as normal tissues. These can include various stromal cells, and their presence and composition can significantly change the biology of the tumor. It could, for instance, make it less sensitive or more sensitive to chemotherapy. So we're increasingly appreciating that it's this combination of genetically distinct clonal lineages of malignant cells and the composition of the tumor microenvironment that underlies the pathogenesis of some and maybe all cancers. So understanding this tumor biology is an exercise in two parts. First, we have to measure or detect these clonal genotypes.
These are shown here in different colors. And we have to link the observed changes in their population frequencies to disease progression and response to treatment. Second, we have to realize that detection of these distinct lineages is going to be affected by the presence of non-malignant cells from the tumor stroma, which is interesting in and of itself but actually represents a significant confounding factor in our attempts to analyze the genomes of tumors. So we have to account for this. Okay, so before delving into details, let's go over some background on copy number alterations. Why do we care about them? They're a feature of our normal genomes, so they're really common in the population. This database, the Database of Genomic Variants, DGV, has a compilation of the genetic repertoire of normal genomes and how they differ from each other. Every one of us differs by a few hundred inversions, duplications, about 3,000 deletions that are rather large, so more than 500 base pairs, and probably tens of retrotransposon insertions. And so compared to single nucleotide variants, which we're going to talk about tomorrow, these types of variants affect a large fraction of the genome. They form the genetic basis of traits. So there are gene dosage effects. We have two alleles of each gene, one from our moms, one from our dads, and the dosage is often very important for many of these. Changes in those dosages underlie different diseases like cancer. And finally, we see genome evolution both at the macro level, in speciation, and at the micro level, within tumors. Speciation can be driven by rapid changes in genome architecture, for instance by doubling the whole genome and then losing various parts. Here are some examples of diseases caused by structural variants. I won't go through the whole list, but basically I just want to note that the nature of the disease is a consequence of the type of event.
So these first two examples, for instance, involve 7q11, so this locus on the q arm of chromosome 7. It's either deleted, which leads to one disease, or duplicated, which leads to another disease. So again, dosage really matters. And we see that copy number variants and structural variants contribute a significant burden of disease in humans. So this image depicts the karyotype of a normal human cell. This is a spectral karyogram with chromosome painting, which marks each chromosome in the human genome with a distinct color. So it makes it really easy to appreciate that the structure of the human genome is diploid, with two copies of each chromosome, one from each parent, except for the X and Y. And it looks really nice. It looks like pairs of socks, right? When we look in cancer, this is what it looks like, an even more amazing collection of socks. These genomes are actually ovarian carcinomas, and they have some of the highest burdens of copy number changes in cancer. So these look nothing like the normal karyotype we saw. It's obvious that copy number changes are a major feature of cancer. Some of the features to point out here are the chromosomal translocations, which connect pieces from different chromosomes. So whenever you see one of these chromosomes that has multiple colors, and in some cases you can see quite a distinct combination of more than two colors, those are translocations. Another thing to point out is that these are not diploid genomes. Most of these chromosomes are found at copy numbers of three or four, or even six or more. So this means that at some point in the evolution of these tumors, at least one round of whole genome duplication occurred. And that's exactly what it sounds like. It's a replication error where, when a cell splits into two, one daughter cell gets way more of the chromosomes than the other daughter cell.
And so having this hyperploidy is actually associated, we know now, with the propagation of chromosomal instability. In many tumors we know that these genome duplication events happen really early, and they essentially provide the material for major chaotic reorganization of the genome. And then finally I just want to point out that at this resolution we see really broad events that encompass whole chromosome arms. But in cancer, where we have selection for cells that have acquired an ability to be fitter than their neighbors, we also see lots of focal events that affect single genes. Okay, so what are copy number and structural variants? A note on nomenclature: copy number aberrations are somatic changes present in tumor cells but not in the germline of an individual, whereas copy number variants or structural variants are variations or polymorphisms present in the general population. So those would be the things that we see annotated in DGV, the Database of Genomic Variants. Conceptually we can imagine how a copy number alteration can appear when we consider the structure of a chromosome in a tumor versus a normal sample. So for instance, let's say that this is the reference genome and we're looking here at four contiguous loci. A number of structural variations or aberrations can happen, and copy number variants are a subset of those. We could have, for instance, a deletion of this locus B. We could have an insertion of some other region from the genome or some novel sequence, which would just cause a structural variant. We could have a duplication of one of these loci, for instance B, which would cause a copy number change. Or we could have an inversion, which would not be detectable as a copy number change but is nevertheless a structural variant.
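The four-locus picture above can be sketched in a few lines of Python. This is a toy illustration only, not a variant caller: the locus names `A` to `D`, the inserted sequence `X`, and the helper names `copy_counts` and `changes_copy_number` are all hypothetical, and strand orientation is ignored. It shows why a deletion or duplication changes the copy count of a reference locus while an inversion or an insertion of novel sequence does not.

```python
from collections import Counter

# Four contiguous loci on a hypothetical reference chromosome.
REFERENCE = ["A", "B", "C", "D"]

# Toy rearrangements of that reference (illustrative, not real data).
variants = {
    "reference":   ["A", "B", "C", "D"],
    "deletion":    ["A", "C", "D"],            # locus B is lost
    "insertion":   ["A", "B", "X", "C", "D"],  # novel sequence X is inserted
    "duplication": ["A", "B", "B", "C", "D"],  # locus B is duplicated
    "inversion":   ["A", "C", "B", "D"],       # the B-C segment is flipped
}


def copy_counts(chromosome):
    """Count how often each reference locus appears on one chromosome copy."""
    counts = Counter(chromosome)
    return {locus: counts.get(locus, 0) for locus in REFERENCE}


def changes_copy_number(name):
    """True if the variant alters the copy count of any reference locus."""
    return copy_counts(variants[name]) != copy_counts(variants["reference"])


for name in variants:
    print(f"{name:12s} copy-number change: {changes_copy_number(name)}")
```

In this sketch only the deletion and the duplication register as copy number changes; the inversion and the insertion are structural variants that a purely count-based signal would miss.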
This is, by the way, what long reads are really good at picking up: these big structural variants, where with short-read sequencing libraries you would not necessarily be well powered to detect that such an event has happened. We can also have translocations, where one part of your chromosome is now fused to a part of a different chromosome. Generally these copy number aberrations range from one kb to a chromosome arm, so we're talking about large events. These types of somatic rearrangements are a hallmark of tumor genomes, as I mentioned. Loss of expression of key tumor suppressor genes like BRCA or p53 has a significant impact on the regular function of a cell. Amplification and consequent overexpression of a growth factor or proliferative gene like PI3 kinase or HER2 can have an impact as well, by promoting proliferation. So there has been a huge amount of effort to profile cancer genomes and find copy number aberrations that are diagnostic or prognostic, because these could then be used as targets for therapeutic agents. I'm going to show you a few examples, but in many cases the clinically relevant subset of alterations that are functional are going to be the ones that drive gene expression changes. And in general there's a much better correlation with expression for focal events, so small events that target one or two or three genes, compared to broad events that target a whole chromosome arm or a whole chromosome. What we see in these plots are the standardized expression values of genes that are encompassed by broad or focal events. You can see that high-level amplifications really drive up the expression of genes and homozygous deletions really down-regulate the expression of the genes involved, whereas broad gains and losses often don't have such a high magnitude of effect. And can anyone tell me why that might be? Yes. Yes, there are different layers of regulation.
One of them is, you know, the number of copies of DNA you have, and then there are transcription factors and various other layers that happen later. So when you have a gain or loss of one copy, often cells can compensate by modulating these other regulatory mechanisms. Whereas if you have, you know, a 10-copy gain of a gene, there's only so much you can do with these other mechanisms of regulation. So here we see, for instance, ERBB2, and you can see the expression on the y-axis and copy number state on the x-axis, and each data point is a patient. Patients with a blue dot have no change in copy number, so this is the normal variation of the gene when the copy number is unperturbed. And then as we gain copies you can see that there's a really nice correlation between copy number and expression. Okay. So we're going to look at some examples of data. But first of all, I want to talk about how we detect copy number alterations, and that has to do with heterozygosity. So people have heard of SNPs before, right? Show of hands. Have you heard of the nomenclature A and B alleles? Show of hands. Fewer hands, but some people have. Okay. So let's talk about heterozygosity and how it is used for copy number events. The concept here is that our genomes are peppered throughout with positions that naturally vary between individuals. These are SNPs, single-nucleotide polymorphisms, and they are really useful for inferring copy number variants. There are about 10 million or so of these positions that are polymorphic in the general population. And of course we have four nucleotides in our genomes, right? A, C, T, G. But to make our lives simpler when talking about SNPs in the population, we talk about the two most common alleles: the major allele, A, and the minor allele, B. And so here we're looking at some portion of an example genome where we have three SNPs. At this first SNP we have the major allele.
So at this position it might be a C, and the C might be the major allele in the population. And so that's A. And here we might have a C, and so we have a heterozygous polymorphism at this location, the green location. And here we see the minor allele on both strands. One would be the maternal and one would be the paternal strand. So the genotypes here are homozygous for the major allele at the first position, heterozygous in the middle position, and homozygous for the minor allele in the third position. So what happens when we have a duplication? For the red allele we now have three A's, right? We have three copies of the major allele. So we've duplicated this, the A and B. For the green allele we now have an A, B, B. So we have three copies of the allele and it's still heterozygous, we have an A and a B, but it's no longer 50-50. And then for the third allele we didn't make any difference, so it's still B, B. When we have a deletion of one copy, we've lost the second copy of A, we've lost this B, and we retain our genotype at the third position. And in cases of homozygous deletion, of course, we no longer have any signal for A or B at these two positions. Okay, so the next couple of slides are going to set us up for understanding the allele frequency plots. I don't know how many of you have had a chance to go through the lecture notes. I added a slide compared to the lecture notes just to help explain this concept, in response to previous questions that I've had. Yes? So we will update the GitHub page later. Yeah, okay. Okay, so here we have again an example of the maternal and paternal strands, so the two inherited DNA strands. And remember that the two most common alleles in a human population are denoted A and B. So this person has an A and a C, two C's here, a T from their mom and a G from their dad, and two G's from their mom and dad.
When we look at the population frequency of these alleles in the general human population, the A is in 80% of people and the C is in 20% of people. So those are the A allele and the B allele. In this second case, the C is the minor allele, so here we just have the B allele. In this third case, the major is the G and the minor is the T. And in this last case, the G is the major allele. So if we're going to encode this sequence for this individual with the A, B nomenclature, it would look like this, where the first position is heterozygous, A and B. At the second position, we have a homozygous minor allele. Here again, we have a heterozygous SNP, and here we have a homozygous major allele. The B allele frequency is simply the frequency of the B allele, so this is 50%, 100%, 50%, 0%. And you're going to see plots conceptually like this, where as you go along the chromosome you're plotting this B allele frequency from 0 to 1 on the y-axis, and the x-axis is your location along the chromosome. In a perfect world where we have no noise, it looks like this. So that's again the same plot. In our world, which has lots of measurement noise, we're not really going to get the right frequency every time, because we're using either SNP arrays or whole-genome sequencing, and there's going to be some technical variance or something that's going to throw off our measurement, and we won't see the perfect ratio. We're going to see a ballpark number that's close to that ratio. So we're going to have noise. And then of course, I only showed you four of these positions, but we're going to plot all of the positions in order along the chromosome.
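As a minimal sketch of the encoding just described, the B allele frequency at a SNP is the number of B alleles divided by the total number of alleles observed there. The function name `b_allele_frequency` and the genotype-string representation are assumptions made for this illustration; real arrays and sequencing data yield noisy intensities or read counts rather than clean genotype strings.

```python
def b_allele_frequency(genotype):
    """Fraction of B alleles in a genotype string such as 'AB' or 'ABB'.

    Returns None for an empty genotype (homozygous deletion), because
    there is no A or B signal left to measure.
    """
    if not genotype:
        return None
    return genotype.count("B") / len(genotype)


# The four example positions, encoded with the A/B nomenclature.
genotypes = ["AB", "BB", "AB", "AA"]
bafs = [b_allele_frequency(g) for g in genotypes]
print(bafs)

# A heterozygous SNP after a one-copy gain is still heterozygous,
# but no longer 50-50.
print(b_allele_frequency("ABB"), b_allele_frequency("AAB"))
```

The first list reproduces the noise-free values 0.5, 1.0, 0.5 and 0.0; the two three-copy genotypes give 2/3 and 1/3.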
So our plots are going to look like this, where we're going to see a bunch of positions at 1, which are homozygous for the B allele, a bunch of positions at 0, which are homozygous for the reference allele, and a bunch of positions at 0.5, which are all the heterozygous SNPs. And so this is what plots typically look like. When we talk about copy number, we have two plots. The one on the bottom is the one that tells us something about the copy number state along the chromosome. Here we are on chromosome 15. Here's the p arm, the centromere, and the q arm. We don't really have data for the p arm. It's probably highly repetitive, or there are not a lot of SNPs, or for some reason there's no data. And we can see here that this is the log ratio, so the ratio of signal in a tumor versus a normal. Whenever the log ratio is 0, that means there's no difference, so it's copy neutral. The normal is always diploid, so the tumor along this part of chromosome 15 is diploid, with a copy number of 2, and the difference is 0. And here we see copy number gains. We see two types of gains, this first gain and the second gain. The top plot is the B allele frequency plot, and there are some things to appreciate. The normal copy number looks like this pattern that I showed you before. As we go along, we see some SNPs that have a B allele frequency of 0, which means they are homozygous reference, or homozygous for the major allele. We see some SNPs that are heterozygous, and we see some SNPs that are homozygous for the minor allele. And so when we have a normal copy number, we see this pattern of three bands, at 0, 0.5 and 1. If we've lost one of the copies and then duplicated that region, which often happens in cancer, and I think the next slide is going to talk about that in a bit more detail, then we see this copy-neutral LOH. These are regions where we've lost the heterozygosity. Loss of heterozygosity is what LOH stands for.
So we don't have the band in the middle anymore. We just have the homozygous positions. But the copy number is neutral. We'll talk about that in a second. The other pattern that we observe is moving away from the three-band pattern to two middle bands that are shifted off the 0.5 heterozygous line. That's when we gain one copy of the chromosome. So we go from BB, AB, AA to BBB, ABB, AAB, AAA, the different possibilities of how three chromosomes are going to look. And similarly for four copies, we now see a different pattern. So I hope that makes sense to folks. Let me know if you have questions about that. We're going to see more and more of these plots. The copy number state and the B allele frequency, or LOH, plots are really informative and tell us a lot about copy number aberrations that are somatic. Okay, so just a quick note on copy-neutral LOH. This is actually a pattern of mutations that marks tumor suppressor genes. The idea is that there's a loss-of-function mutation which is really important. For instance, it takes out a tumor suppressor gene's function. And so there's strong selective pressure to have that event be homozygous. So after that mutation initially happens, the other chromosome is lost and there's a duplication event. That's what is shown here. When this cell that has a heterozygous mutation duplicates its DNA and splits into two, you get an uneven assortment of chromosomes. Then you get loss of this useless wild-type copy that the cancer cell doesn't want, and you end up with two versions of the chromosome that carry this driving event. And so we see loss of heterozygosity for this region, because now we just have the maternal allele twice, right? Instead of a maternal and a paternal. Okay, so this is a very beautiful and clean example of a tumor that has a number of different types of events going on. They're also colored to indicate discrete regions of distinct copy number states.
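The band patterns described above follow from simple arithmetic, sketched here under idealized assumptions: a 100% pure tumor, a diploid matched normal, and no measurement noise. The function name `expected_signals` and the (major, minor) parameterization of allele-specific copy number are choices made for this example, not part of any particular tool. For a germline-heterozygous SNP, the B allele can sit on either parental copy, so each state gives up to two mirrored bands.

```python
import math


def expected_signals(major, minor):
    """Ideal log2 ratio and heterozygous-SNP BAF bands for a pure tumor.

    major, minor: copies of the more and less abundant parental chromosome.
    Returns (log2 ratio versus a diploid normal, sorted list of BAF bands).
    """
    total = major + minor
    if total == 0:
        # Homozygous deletion: log ratio is unbounded and there is no BAF signal.
        return None, []
    log_ratio = math.log2(total / 2)
    # The B allele may lie on the minor or the major copy.
    bands = sorted({minor / total, major / total})
    return log_ratio, bands


states = {
    "diploid (AB)":           (1, 1),
    "copy-neutral LOH (AA)":  (2, 0),
    "single-copy gain (AAB)": (2, 1),
    "one-copy loss (A)":      (1, 0),
    "four copies (AAAB)":     (3, 1),
    "four copies (AABB)":     (2, 2),
}

for name, (major, minor) in states.items():
    log_ratio, bands = expected_signals(major, minor)
    print(f"{name:24s} log2 ratio = {log_ratio:+.2f}  BAF bands = "
          f"{[round(b, 2) for b in bands]}")
```

Copy-neutral LOH keeps the log ratio at 0 while the heterozygous band collapses onto 0 and 1, and a single-copy gain moves the middle band to roughly 0.33 and 0.67. In real samples, normal-cell contamination pulls these bands back toward 0.5.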
So on top here, we can see the copy number, so the copy number ratio. Diploid regions are in blue. A single-copy gain shows up as this red or dark red region. And the diploid region, when we look at the B allele frequency, has this band around 0.5. For simplicity, the zero and one bands are just not shown. So we see heterozygosity here, and we see a copy number ratio of zero compared to normal. When we have a single-copy gain, then we shift away from 0.5, right? We shift away in both directions. We see a big area of heterozygous deletion, so loss of one copy, right? So we see another shift away from 0.5. And we see this little area of homozygous deletion where we just completely lose that region. And here's an amplification. So it's very easy to pick out these patterns that we hope to identify. This is kind of a key slide that you can refer back to when we talk about copy number signals. Okay, a couple of quick examples involving driver genes. This is an amplification of one of the potent oncogenes I mentioned, ERBB2. This is on chromosome 17 of a breast cancer patient. The x-axis, again, is the genome position along the chromosome, and the y-axis is the copy number in the tumor. The expectation is that there are two copies of the reference genome, so again, 0 is no difference from the blood of this individual. And in the region encoding ERBB2, we see this spread signal, which indicates that many copies of the gene are present, around 10 copies or so. ERBB2 is amplified in this way in about 15% of breast cancer patients. It's a driver event that essentially leads to the proliferation and growth of tumor cells. If a patient has this high-level amplification, she would respond to a drug called Herceptin. So this is one of the great examples of precision medicine based on genomic data. This is not a clinical test.
In a clinic, you would use a technique like this, fluorescence in situ hybridization, where a fluorescent, sequence-specific probe is used to label the genomic content of cells. The blue areas indicate the nucleus of a cancer cell. And for the green probe in this case, you can see two little green dots. Those are two regions of chromosome 17 that have normal diploid copy number, so we see two of them. And the red probe marks the HER2 locus, or ERBB2 locus. You can see that some cells have literally hundreds of copies of this gene, and that the number of copies varies between cells, which indicates that there's genomic instability happening. Here's another example. On the other end of the spectrum, this is a homozygous deletion of PTEN. This is typical of a tumor suppressor gene. And again, we see that most of the chromosome is diploid, with a broad copy number gain of the p arm and the end of the q arm, but only one focal region of loss. So this is the driving event in this case. By looking at the gene content of these recurrent copy-number-aberrant regions in the genomes of many cancer patients, and specifically focusing on these high-level gains and homozygous deletions, we can see that certain genes come up as targets across numerous cancers. They correspond to the known oncogenes that drive proliferation, some examples here, and tumor suppressor genes like PTEN that I talked about. There have been many people involved in global efforts to really identify the full repertoire of these driver genes in different kinds of cancer, and this takes large cohorts of patients across cancer types. This is just a short list of some of the papers that have summarized these data using genotyping arrays or, more recently, whole-genome sequencing.
And the ultimate goal, of course, is to come out with actionable targets, which are going to be genes or pathways that the cancer cells rely on to proliferate, and then develop therapeutic agents. In many cases, this has been a successful exercise. This is a brief list of specific actionable copy number changes in cancer that can be therapeutically targeted. Amplifications or gain-of-function events are much more feasible to target with small-molecule drugs that work by inhibiting the action of a protein. It's easy to break something or to stop something. It's much more difficult to add functionality back. So if a tumor suppressor gene is gone, it's very difficult to recreate its function unless you have a creative way to work on a pathway, which has been done. And so for instance, for HER2, or ERBB2, every breast cancer case is now tested for HER2 positivity, and if there's a high-level amplification, then that patient is eligible for targeted therapy. Okay. So in addition to guiding treatment, the nature of genomes in cancer can be used to stratify patients. This is a nice synthesis study showing that cancers reside on a spectrum where at one end tumors harbor a lot of point mutations, so these are the cancers here on the left, and at the other end they harbor a lot of copy number alterations. So there's selection for either a defect in the DNA repair that fixes double-strand breaks, which leads to genomic instability, or a deficiency in mismatch repair, which repairs single-base changes. Having both DNA repair mechanisms altered is pretty rare and likely selected against, meaning we have cancers that reside mostly at these two ends of the spectrum. The ovarian cancers that I talked about are the most dramatic example of cancers that are defined by genomic instability. And tomorrow we're going to talk about some of the cancers that are defined by somatic variants. Okay.
So let's move on to talking about some of the main confounding factors that make copy number inference challenging. Inferring the absolute copy number of a tumor sample is challenging for three reasons. First, cancer cells are always or nearly always intermixed with an unknown amount of normal cells, right? We don't know ahead of time what the purity of our sample is. Second, the actual DNA content of the cancer cells, or the ploidy of the cancer cells, is initially unknown. And third, the cancer cell population may be heterogeneous. Maybe there's subclonal evolution, and some cells will contain some event and other cells won't. When these values are unknown and have to be predicted, there is often more than one combination of values that can explain our data. This is the identifiability problem. So for instance, we could have a homozygous deletion, so both copies of the DNA in the cancer cells are gone, but only 40% tumor purity. The tumor cells are 40% of the signal and they contribute zero copies, and the normal cells are 60% of the signal and they contribute two copies each. So we're going to come out with a copy number of 1.2 if we use these values to calculate our copy number. Alternatively, we might have a heterozygous deletion and a situation with 80% tumor purity. In that case, there's a heterozygous loss in the tumor, and those are 80% of cells, contributing one copy each. The normal cells are also there at 20%, and they're going to contribute two copies. And we still have an observed copy number of 1.2. We just don't know which situation is the one that led to this observed copy number state. And similarly for the allele frequency, we might have a one-copy gain in a diploid tumor. So we go from AB to AAB because we've gained the A allele. Or we might have had a whole genome doubling event, so now our cancer is AABB, and then we have a loss of one allele.
So that's going to yield the same observed B allele frequency. Computational tools have been developed specifically to address this problem. One such tool, ABSOLUTE, is an algorithm that takes in the processed copy number segments and loss of heterozygosity calls for a sample and tries to infer the best combination of purity and ploidy given pre-existing knowledge of karyotypes. So if it's more common to see a ploidy of 4 in a specific cancer, and you might have a solution that involves a ploidy of 7 or a ploidy of 4, then you would pick the ploidy of 4. You're welcome to read this paper, but just as a broad overview, here every time there's a normal copy number we see purple, and any time we see a gain or loss we see red or blue. So it's a bit different from the previous plots we've looked at, but I won't go into too much detail. We have basically this profile of gains, normal copy number, and losses, which can be explained in different ways. We can explain it with a ploidy of 2 and a purity of 0.35, or a ploidy of 4 and a purity of 0.6, so 60% cancer cells, or a ploidy of 7 and a purity of 0.4. And based on a compendium of genomic data sets for these kinds of cancers, the best solution in this case is a ploidy of 4 with a purity of 0.6. So often we try to identify these two numbers at the same time. Okay, so taking this approach and looking at purity and ploidy across 5,000 cancers, it turns out that over a third of all cancers have a ploidy of 3 or greater, meaning they must have undergone a genome doubling event at some point in their evolutionary history. Here we see different cancer types: some lung cancers, head and neck cancer, kidney cancer, breast cancer, et cetera, ovarian, GBMs. The purity ranges quite a bit, right? We have some really impure and some really pure tumors. And for a number of these cancers, the ploidy is around 2, so that's the histogram here, describing this pan-cancer result.
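The identifiability problem behind purity and ploidy inference can be checked with a few lines of arithmetic. The sketch below assumes a simple two-population mixture (tumor cells plus diploid normal cells) and uses purity values of 0.4 and 0.8 chosen for illustration so that the tumor and normal fractions sum to one. The function names `observed_copy_number` and `observed_baf` are hypothetical; this is not how ABSOLUTE is implemented, only the mixing model that makes several purity and ploidy combinations fit the same data.

```python
import math


def observed_copy_number(purity, tumor_cn, normal_cn=2):
    """Average copy number measured in a mixture of tumor and normal cells."""
    return purity * tumor_cn + (1 - purity) * normal_cn


def observed_baf(purity, tumor_b, tumor_total, normal_b=1, normal_total=2):
    """B allele frequency at a germline-heterozygous SNP in the same mixture."""
    b_copies = purity * tumor_b + (1 - purity) * normal_b
    all_copies = purity * tumor_total + (1 - purity) * normal_total
    return b_copies / all_copies


# Scenario 1: homozygous deletion (0 copies) in a 40% pure sample.
cn_homdel = observed_copy_number(purity=0.4, tumor_cn=0)
# Scenario 2: one-copy loss (1 copy) in an 80% pure sample.
cn_hetdel = observed_copy_number(purity=0.8, tumor_cn=1)
print(cn_homdel, cn_hetdel, math.isclose(cn_homdel, cn_hetdel))

# AAB reached by a one-copy gain from AB, or by genome doubling to AABB
# followed by loss of one B: the locus ends up with the same genotype,
# so the B allele frequency is identical at any given purity.
for purity in (0.35, 0.6, 1.0):
    print(purity, round(observed_baf(purity, tumor_b=1, tumor_total=3), 3))
```

Both deletion scenarios give an observed copy number of about 1.2, so the total signal alone cannot separate them. That is why tools in this space combine copy ratios with allelic information and prior knowledge of plausible karyotypes.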
But many of these cancers have had one or even two genome doubling events. And so if there was a whole genome doubling event, we would actually expect to see a peak at 4. So why do you think we don't see a peak at 4? Any idea what happens after a genome doubling event? Yeah, so a whole genome doubling event is going to lead to a ploidy of 4, but that's a really unstable scenario, so immediately cancer cells will lose some of the chromosomes. The other way to increase ploidy is not through a whole genome doubling event but through gains of some chromosomes only. So we don't really see 4. We know that these events happen and then subsequent deletions take the ploidy down to 3 plus something. So here's an example. Here we see amplifications in red and deletions in blue for diploid samples and for samples that have undergone a whole genome duplication, where we can tell whether the amplification must have happened before or after the whole genome duplication event based on the B allele frequencies. What you can see for broad events is that in the cases with whole genome duplication, more chromosomal gains happen after the duplication event than in cancers that are diploid, and lots and lots of deletions happen after you've duplicated a whole genome. And it's the same for focal events. For both broad and focal events we see the same pattern, where there's just much more genomic instability once you've duplicated a genome, and that leads to a proliferation of copy number alterations. So just one example of the clinical relevance of genome doubling events in cancer: this is from a recently published cohort of, I think, about a hundred lung cancers, where each patient's tumor was genomically profiled using exome sequencing, but they did the sequencing from multiple parts of the tumor.
So from that individual tumor they took, in this example, four pieces, and then they did exome sequencing on each one. They did mutation calling and copy number calling in each one, and then they used this data to work out the phylogeny of how the genomic events must have happened: so the evolution of this genome in time, and how these clonal lineages are represented in each tumor region. At the base of this tree we would have some of the first events that happened and are present in every cell of the tumor, and then as some cells diverge we see these branching patterns. So this is an excellent paper, and I just want to highlight a couple of things from it. First, it turns out that nearly 50% of copy number alterations are subclonal, so they're restricted to only a certain part of the tumor. Second, without multi-regional sampling, 70% of these subclonal events would look clonal, because we wouldn't know that in some other part of the tumor those cells were diploid. Third, these early genome doubling events are strongly associated with the presence of subclonal events. So once you have that genome doubling event there's genomic instability, and different cells will lose different pieces, and that's going to really boost this subclonal heterogeneity. And patients with subclonal heterogeneity do a lot worse in terms of disease outcome compared to those that have really bland, boring genomes that did not undergo whole genome duplication. And so if you look at these tumors in time, you can tell which things are clonal and present in every cell, these are the blue and red, and which things happen later and are subclonal.
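The clonal versus subclonal distinction from multi-region sampling can be sketched in a few lines. This is a toy illustration, not the method used in the lung cancer study: it simply calls an event clonal if it is seen in every sampled region, and it ignores subclonality within a single region. The region names and event labels are hypothetical.

```python
def classify_events(region_calls):
    """Label each copy number event as 'clonal' or 'subclonal'.

    region_calls maps a region name to the set of events called there.
    An event is 'clonal' only if it appears in every sampled region.
    """
    all_events = set().union(*region_calls.values())
    n_regions = len(region_calls)
    labels = {}
    for event in sorted(all_events):
        n_seen = sum(event in calls for calls in region_calls.values())
        labels[event] = "clonal" if n_seen == n_regions else "subclonal"
    return labels


# Hypothetical calls from four pieces of one tumor.
regions = {
    "R1": {"chr7 gain", "chr10 loss", "chr1q gain"},
    "R2": {"chr7 gain", "chr10 loss"},
    "R3": {"chr7 gain", "chr10 loss", "chr13 loss"},
    "R4": {"chr7 gain", "chr10 loss", "chr1q gain"},
}

multi_region = classify_events(regions)
single_region = classify_events({"R1": regions["R1"]})
```

With all four regions, the chr1q gain and chr13 loss come out as subclonal. With only R1, every event in R1 is labeled clonal, which is the illusion of clonality that single-region sampling creates.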
And so it's actually possible to then classify genes by the timing of their mutation, so you can see which genes are either mutated or have their copy number altered before the genome doubling or after the genome doubling event, which is critical information for guiding the choice of rational therapies to test in this disease, for instance. And it really drives home the message that this chromosomal instability is an oncogenic driver and a predictor of outcome. So genomic instability equals poor outcome. Okay, so how do we measure copy number variants, and what are the technologies and tools at our disposal? There has been a progression of technologies for copy number changes, ranging from those that are low resolution but really high accuracy, to this middle ground of higher resolution but still relatively low accuracy, to today's high resolution, high accuracy methods. So some of these on the left are things like FISH, which we mentioned. That's a method that lets us look at the copy number of just a few loci, so it's not genome-wide. You have to have a hypothesis for something that you're testing and a probe for that event, but you have single-cell resolution, so you know exactly, in each cell, how many copies of that piece of DNA you have. And then in the early 2000s hybridization array platforms were developed, and you could profile between 30,000 and 100,000 positions in a genome. So you could wash your whole genomic DNA over this array and generate intensity signals across the genome that corresponded to copy number state, but this did not include loss of heterozygosity information, so no B allele frequencies. And then in the mid-2000s we moved to high-density genotyping arrays from Illumina and Affymetrix. These really drove the analysis of copy number in cancer for many years, and there are large cohorts of many types of cancers that were generated, for instance, by the TCGA effort.
Yeah, and then at the moment these large consortia and many labs have moved to whole genome or whole exome sequencing. I don't think the large consortia are doing long read sequencing at the moment, so for the purposes of this module we're going to look at array data analysis and at whole genome sequencing with short reads. So just to remind us, these are the challenges that have to be overcome in order to accurately measure copy number changes. One, cancer is a mixture of normal cells in the tumor microenvironment and cancer cells, and basically the normal cells dilute the signal, so we have to infer how much contamination there is. There's intratumoral heterogeneity, so there are these clonal populations. We know that cancers evolve and that this is an ongoing process, so part of the challenge for these computational tools is to detect subclonal events, and so there's a lot of biological noise as well. The other confounding factor is that we're looking for somatic events in the presence of germline alterations. These are often the strongest signals we see, because germline events are of course going to be in every cell of the cancer, and so the strategy for cancer genomics is to sequence both the tumor and the normal.
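As a rough illustration of why the matched normal matters, here is a sketch that removes germline events from a list of tumor copy number calls. Real callers model the tumor and normal jointly rather than subtracting intervals, so treat this only as a picture of the idea. The tuple layout `(chrom, start, end, kind)` and the coordinates are assumptions made for the example.

```python
def somatic_only(tumor_events, normal_events):
    """Keep tumor events with no overlapping same-kind event in the normal.

    Each event is a tuple (chrom, start, end, kind) with half-open
    coordinates; kind is a label such as 'gain' or 'loss'.
    """
    def overlaps(a, b):
        same_chrom_and_kind = a[0] == b[0] and a[3] == b[3]
        return same_chrom_and_kind and a[1] < b[2] and b[1] < a[2]

    return [
        event for event in tumor_events
        if not any(overlaps(event, germline) for germline in normal_events)
    ]


# Hypothetical calls: the chr9 loss is also present in the normal sample,
# so it is a germline copy number variant rather than a somatic event.
tumor_calls = [("chr7", 100, 500, "gain"), ("chr9", 200, 300, "loss")]
normal_calls = [("chr9", 150, 350, "loss")]
somatic_calls = somatic_only(tumor_calls, normal_calls)
```

Only the chr7 gain survives the filter here, because the chr9 loss is explained by the germline.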
When these algorithms were devised for microarray data they were basically designed for population studies, GWAS studies, where people are looking for differences in germline copy number and heterozygosity between different populations of normal individuals, and not the weird things that happen in cancer. And so applying those algorithms to cancer data did not really work. They were not well suited to dealing with these sources of biological variation. So in the last few years there have been advances on the computational side to interpret copy number signals from cancer samples that account for these sources of noise, and one of the best things you can have is the normal germline sequence of the person whose cancer you're studying. Okay, so when we have two copies of the genome there are three possible genotypes, AA, AB and BB, as we discussed. When we have AB, so when we have heterozygous positions, those are really informative, because that's when we can detect LOH, right? Moving away from the heterozygous state in tumor versus normal lets us know that a loss of heterozygosity has happened. And as we gain copies we have more of these possible genotypes, and by the time we get to five copies we have six possible genotypes, and we have various versions of zygosity that may have happened, right? We may have complete loss of heterozygosity on these two ends, or we could have heterozygosity, or allele-specific copy number alterations where you have way more of one allele than the other. So we use the copy number and the zygosity status to figure out what the genotype is, and inferring the B allele frequencies in both array and sequencing data relies on measuring the frequency of the A and B alleles. And so I just wanted to talk a little bit more about SNPs. dbSNP is basically the database that warehouses this information, and in the latest version there are about 130 million SNPs that have a known frequency in the general population, and so
they would in theory be useful for this type of work. Here's an example of SNPs in the BRCA2 gene. This gene has over 8,000 variations annotated along its length. A lot of these are non-coding, which doesn't matter at all for our purpose of looking at copy number, so they might occur in introns or in the untranslated regions, the UTRs. But in all cases in dbSNP the two alleles are listed. So here we see the alleles T and C, and at this position, for instance, C is the minor allele and the minor allele frequency is nine in a thousand in the general population. Here's a G and a T where the minor allele is at 37 percent, so 37 percent of people have a G and the rest of the people have a T. And here's a C and a T where the T is at less than one in a thousand. And so what do you guys think you would want to include on a SNP array if you wanted to detect LOH? Any guesses? Yeah, the middle one, where you're going to have at least a decent chance of detecting heterozygosity if everything is normal, because the two alleles are both rather frequent in the population, so in any one individual you might see both of them. So you want to include these kinds of sites so that you can detect heterozygosity in the normal case and then detect a shift away from heterozygosity in tumors. Okay, so the Affy SNP6 arrays were basically designed to measure the presence of all of these alleles, all these SNPs that have evidence for heterozygosity in the general population, like that middle one. The probes are basically 25 base pairs long. They're oligonucleotides, and they contain the polymorphism in the middle of the probe, because you want your probe to bind your DNA as well as possible, and the middle is where you have the highest stringency of binding. So there are on this array about 906,000 SNP probes and also a lot of copy number probes. These are probes in genomic regions that are known to vary in copy number but don't necessarily have SNPs in them. These probes all hybridize
with labeled DNA and generate pretty much a continuous signal of intensity that corresponds to the amount of DNA at that locus in the sample. So the more copies of DNA you have, the brighter the signal you would have for these copy number probes. And because we know the positions of the probes on the genome, we can plot the intensity of the probes on those chromosome plots, as we've seen. To analyze these arrays we need this chip definition file, which tells us where the probes are and what their sequences are. And so imagine that we're interested in this SNP, which is a heterozygous SNP, so we are going to have either an A or a C. There will be four probes for the SNP: two probes that are the A or the T, which correspond to one of the alleles, and two probes that are the C and the G, so for the forward and reverse strands. So we're going to have four probes. Then we wash labeled DNA over those probes, and the DNA that binds the best is going to generate a signal. So if you have in your sample an allele that's either the A or the T, you will see signal for the probe that has the A or the T, and if your sample has a C or G you will get signal for the other allele. With SNP arrays you always get some background binding, so it's always a relative ratio of signal to noise. And so we analyze these signal intensities and basically figure out which is the highest and which allele the highest intensity supports. Okay, so people still use SNP6 arrays. There's a wealth of data available publicly from these large consortia. TCGA, for instance, has 11,000 tumor samples that are profiled on SNP6 arrays as well as other platforms, and these span a range of diseases, as you can see here. There's now an equivalent for mouse, so this is a genotyping array that can characterize a wide range of strains and uncover genetic events in mouse models of disease. So essentially we can do human and mouse, and this only became available in the last, I think, two or three
years. So part of what you're going to do in the lab is to take these genotyping arrays on the Affymetrix SNP6 platform, where we start with a CEL file, so that's what comes off the machines. The workflow is to next pre-process the signals from all the probes on the arrays so that you end up with normalized and comparable signal across the genome and across different samples. We don't want to have batch effects, and that's what the normalization is important for. This is then followed by a couple of different extraction techniques, on the left to generate calls for copy number and on the right to call the B allele frequencies, and then these two measurements are passed to a statistical model that's going to infer where the copy number and B allele ratio changes occur across the genome. So we're going to call segments of gain and loss and LOH and so on and so forth. And then once we have those segments we can project what genes are encoded in the different regions of gain or loss, and you can follow up, as in the other modules, with things like pathway analysis and clinical correlations. So this is really a general workflow. I'm showing it for SNP6, but it's generalizable to sequencing data as well. Okay, so the important first part: for any kind of data, normalization is absolutely required to remove these platform-induced artifacts. For SNP6, for instance, we have these probes that are 25 base pairs long, and you can imagine that they might bind different regions of the genome, so they're not totally 100% specific. Efforts have been made to design probes that are as specific as possible, but you're going to have non-specific binding to some degree. The degree of hybridization can be affected by, for instance, the length of the DNA fragments that are washed over the array. So if you imagine that this 25 base pair probe is going to tether to something that's 50 base pairs long or 500 base pairs long, it's going to be a much tighter binding with the 50 base pair fragment, just
because you don't have all the extra parts trying to pull the DNA fragment off. And so the binding kinetics are going to be affected by the length of the DNA fragment, as well as by the presence of mutations, which will reduce the binding specificity. Or in some cases you might have clusters of SNPs, so part of what people often do is to exclude probes where you have more than one SNP, because then you're going to have a lot of variance in the binding of DNA to those probes. And so this aroma.affymetrix package is what we're going to use in the lab. It handles a lot of these artifacts, so each experiment is then as comparable to every other experiment as possible, and it outputs copy number and B allele frequencies which hopefully reflect biology rather than artifacts of the platform. So we'll try that out tomorrow morning. Once we have normalized data we can start to infer copy number aberrations, loss of heterozygosity and allele-specific changes. There are different tools to do this. We're going to try out a couple of them. There are many others, it's a rich field for tools. I'm going to talk about whole genome sequencing, but first I also wanted to mention high-density DNA methylation arrays. These are arrays that were generated to basically integrate genomic and epigenomic data. So how many people have heard of methylation arrays? Maybe half of folks. So this Infinium HumanMethylation450 BeadChip basically profiles, I think it's 450,000 positions across the genome, and it is able to detect DNA methylation, so methylated C's, as well as copy number alterations, and it's basically on par in terms of sensitivity with the SNP6 platform. So there are a lot of data sets now that are generated on this platform which are also useful for detecting copy number. The probes are designed for detection of C-to-T alterations based on this chemical process, bisulfite conversion, where methylated C's are protected from conversion but unmethylated C's are
converted and, after PCR amplification, read out as T's. And so you can look for differences in methylated and unmethylated C's across the genome or in a targeted way, but you can also look at copy number. These arrays have 485,000 probes that are going to be useful for this. Their median inter-marker distance, so that's the median distance between any two probes, is pretty small, so the probes are pretty frequent. They have a different distribution across the genome compared to SNP6, but they pretty much generate the same copy number profile as SNP6 data in some cases. So here we see SNP6-derived copy number profiles on top and 450k array copy numbers on the bottom. In some cases you see some things on the 450k array that you don't on the SNP6 array, for instance this deletion, and that's because the probes are in different regions, so you're going to be able to pick up different focal events on one platform but not another. So keep that in mind: if you do happen to have methylation array data you can also use it for copy number calling. You guys have talked about next generation sequencing, you've talked about alignments and assemblies, and so this figure is self-explanatory to many of you, right? We've generated 300 base pair or so fragments of DNA that we then sequence from both ends, maybe a hundred base pairs from each end, so we have these sequenced regions, which are colored, and the middle part, which is unsequenced, which is gray. You basically align the reads, and the coverage tells you something about the copy number. So when you have no coverage you have a homozygous deletion compared to the reference, so this is alignment based. If you have loss of one copy you'll see a reduced amount of coverage in terms of your reads at that position. For gains you'll see more coverage than you would expect at that position. And you can also pick up structural variants. For instance, one of your two reads might align to chromosome
one and in this case the other read aligns to chromosome five, or you might have a split read alignment, as is shown here. And so the sequence reads also give us the allelic ratio for single nucleotide polymorphisms. Because we're actually sequencing through the whole genome, we're not limited to the polymorphisms included on the SNP6 array. We're sequencing every position, and if there's sufficient coverage you can just count how many A's versus C's you see, how many G's versus T's you see, and so you get an allelic ratio for every SNP in an individual. Not surprisingly, there are also biases with whole genome data. GC content is one of the significant contributors to variance in the number of reads, so if you're using the number of reads as your measurement of copy number gains or losses you have to correct for GC content, because GC-rich regions will have a higher read depth. And so what we see here in this plot on the top left is that in a given whole genome sequencing library there's a strong correlation between GC content and coverage, and if you correct for this bias, using regression techniques for instance, you'll see that this correlation is lost, which is what you want. Mappability is the other confounding factor. The human genome has a multitude of repetitive regions, so some reads cannot be aligned unambiguously, and depending on how you run your alignment the aligner might preferentially choose one region for these reads that could align in two different places, so you'll see a peak at that position and then a loss or a lack of coverage at the other position. And so you have to account for this repeat content. It does make it impossible to align reads to some parts of the genome, so centromeres and telomeres are the classic regions where we just can't use alignment and short reads, and that's what long reads are incredibly useful for. So the main point is that if we don't do preprocessing, when we for instance bin reads into
one kb regions and then calculate copy number in each of these bins and plot that value as a point along the chromosome, so here's chromosome one, we see that the signal is influenced by a lot of noise. Once we account for GC content we see that noise reduce, and once we account for mappability we see that noise reduce further, and so what we hope to get is the actual genome of the cancer sample without any of this confounding noise. One of the tools that we're going to use in the lab, TITAN, has all this preprocessing built into it, but it's important to appreciate why it's so critical. Okay, so we're going to have this preprocessed and normalized data, after which we apply segmentation. That's where we have all these data points where we think, ah, the copy number ratio is zero, or minus something, or plus something, and then we have a whole bunch of those in a row, and segmentation basically determines contiguous probes that agree in terms of their copy number state. So here, instead of probably 3,000 points, we just have about 20 segments of copy number gain or loss or copy-neutral areas. So we can do this with whole genome sequencing, which is cheaper compared to a few years ago. Exomes are even cheaper, and many people have generated exomes, and even though you only get about one to two percent of the data that you would get from a whole genome, you can actually still call copy number variants. TCGA, for instance, is just dominated by exome capture data. You can call copy number variants from exome data, but it's very difficult to call loss of heterozygosity, because now instead of having all these millions of polymorphisms that are basically tiling your genome, you just have tiny snapshots wherever there's an exon in the genome, and so often there isn't enough data to make a really good call for LOH. So LOH is a problem, which is a drawback. Okay, so we've seen this slide before. This is a super clean
example of the typical features we look for in copy number analysis. What it doesn't show us are subclonal events, which as I mentioned are really prevalent in cancer genomes, and so we would like to be able to measure and quantify subclonal events. The tool we're going to use in the lab, TITAN, has the ability to do that, and so I just want to take you through a couple of slides where we look at what that might look like. So here's the problem with subclonal events. When we have a clonal loss, for instance, we see a really big drop in signal. When you have a subclonal event you're going to see a little drop in signal, and so you're going to go from a median loss of, you know, one copy to maybe 0.25 or 0.2. It really depends on your tumor purity and how many cells contain that additional loss. So the conceptual workflow for TITAN is that we profile the normal sample and the tumor sample, and then we extract the heterozygous SNPs from the normal sample, right, because in that individual you're going to have lots of heterozygous positions, maybe two to three million SNPs. And then at each one of those positions we're going to count the alleles, both in the normal and in the tumor. So in this case, for instance, we see that the position is heterozygous in the normal, with both the A allele and the B allele, and that we just see the A allele in the tumor, so there's loss of heterozygosity at this position. And so after doing this for everything, what we end up with are genotypes and coverage, and we input those into a statistical model that tries to learn where the copy number and loss of heterozygosity segments are, and then tries to determine their cellular prevalence. So to do that, here's a conceptual example. This is a schematic of what it would look like. In this case we have two regions of the tumor, sample A and sample B. They have different tumor purity, so this one is a bit more pure than this one, 80
percent tumor cells versus 70. This first tumor piece has a gain and a deletion, and the second piece of the same tumor only has the deletion. So the deletion is clonal, it's basically present in every cell that we've profiled, but the amplification is subclonal, it's only present in some of the cancer cells. So when we actually sequence this tumor and these cells are mixed, we're going to see a really good signal for the loss and only a partial signal for the gain. And so the goal is to decouple this and try to determine, for each event, what the cellular prevalence is while accounting for normal contamination. And I think I'll leave it there, with just a list of tools that are useful for copy number analysis, including for whole genome sequencing data as well as methylation arrays, if that's something that you guys want to do. We're going to have some hands-on time tomorrow, but I'm happy to chat further about any of this in the next little while as well as tomorrow.
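The difference between the signal from a clonal and a subclonal event can be sketched with a small calculation. This is not TITAN's statistical model. It assumes a diploid baseline, a two-population tumor in which a fraction `prevalence` of tumor cells carry the event, and normal cells with two copies. The function name and parameters are made up for illustration.

```python
import math


def expected_log_ratio_subclonal(event_cn, purity, prevalence, baseline_cn=2):
    """Expected log2 copy ratio for an event present in a subset of tumor cells.

    purity:      fraction of all cells in the sample that are tumor cells
    prevalence:  fraction of tumor cells carrying the event (1.0 = clonal)
    baseline_cn: copy number of normal cells and of tumor cells without
                 the event (assumed diploid here)
    """
    carrying = purity * prevalence  # fraction of all cells with the event
    average_cn = carrying * event_cn + (1 - carrying) * baseline_cn
    return math.log2(average_cn / baseline_cn)


# Sample at 80% purity: a clonal one-copy loss versus a one-copy gain
# present in only half of the tumor cells (both values are hypothetical).
clonal_loss = expected_log_ratio_subclonal(event_cn=1, purity=0.8, prevalence=1.0)
subclonal_gain = expected_log_ratio_subclonal(event_cn=3, purity=0.8, prevalence=0.5)
```

Under these assumptions the clonal loss gives a log2 ratio of about -0.74, while the subclonal gain gives only about 0.26. That weaker signal is what a tool has to separate from noise and from normal contamination in order to estimate cellular prevalence.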