I'm Serana. I work at SickKids, where I'm just finishing up a postdoc working on clonal heterogeneity in brain tumors. I'm interested in a pediatric brain cancer called medulloblastoma, and in January I'll be moving to Calgary, where I'm going to start my own group, also looking at cancer evolution and heterogeneity. So today I'm going to tell you about somatic copy number alterations in cancer, and tomorrow we're going to talk about mutations in cancer. Hopefully I'll teach you everything I know about this topic, although I'm sure there are lots of things that others on the panel are also able to chime in with. Just a note on the lecture: some of the slides that I use are from a lecture given in the past on this topic by Sohrab Shah, so thanks to him for those, and a lot of the slides are new for this year. So just to look at the learning objectives for this module: we're going to have an overview of the impact of copy number alterations in cancer. We're going to discuss genomic instability, cancer evolution and genetic heterogeneity. We're going to talk a little bit about tumor suppressor genes versus oncogenes, and we'll talk about some examples of actionable alterations. Then we'll dive a little bit more into how we detect copy number aberrations, what some of the confounding factors are that we have to account for, and strategies to overcome these, so things like purity and ploidy of tumors and also intratumoral heterogeneity. And then we'll finish up by looking at specific measurement technologies and computational tools, focusing on arrays, whole genomes and exomes. So before we start, can you tell me, just with a show of hands, how many people have done copy number analysis before? A few. How many people want to do copy number analysis as part of their work? Quite a few more. Okay, great. So you've already heard, I think from Trevor's lecture, that cancer is a disease of the genome.
So tumorigenesis is really this multi-step process that requires a series of several mutations, and each mutation drives a wave of expansion of the cells that harbor the mutation. If the mutation confers a highly advantageous phenotype to the cell, those cells outcompete their neighbors. We can see this on the graph: the initiator mutations are in gray, and subsequent mutations that confer a selective advantage generate these green clones or these orange clones. Mutations that don't confer a selective advantage are not selected for, and so they remain at a constant frequency or eventually disappear. And so when we have a huge selective pressure like therapy, in this case this time point is resection and then treatment, what we see is a huge population drop and then preferential survival of those cells that are resistant to the treatment. In this particular illustration, we can see that all the cells in the recurrent tumor still contain all those initiating mutations, the gray background, but they have a completely different genotype than the tumor at diagnosis. So this red clone is now the majority of cells, where it was a very small minority to begin with. And so malignant cells within a single tumor can differ significantly from each other both in space, so in different regions of the tumor, and in time, if we track a tumor longitudinally. And very rarely is a tumor 100% pure. I don't know how many of you are planning to work with mouse models of disease; perhaps that's one case where you can have a pure sample. In my case I'm pretty lucky, because medulloblastoma is about 90% plus pure, but many cancers have as low as 30% purity. Often tumors will contain infiltrating cells like these immune inflammatory cells, as well as many other types of cells from the microenvironment, basically the stroma.
And so the presence and composition of these other cells can significantly change the biology of the tumor. For instance, it might make it less or more resistant to chemotherapy. I'm not going to go into detail on the microenvironment, but I mention it because we are increasingly appreciating that it's a combination of genetically distinct clonal lineages of the malignant cells and the tumor microenvironment that together are involved in the pathogenesis of some, and maybe all, cancers. In fact, our foundation for understanding the biology of cancers, which you talked about on the first day, can really be described as the set of six hallmarks. These are the acquired functional capabilities that enable cancer cells to survive, proliferate and metastasize. The acquisition of these capabilities is made possible by two enabling characteristics. One of them is inflammation from the microenvironment, and the one relevant to our discussion today is genomic instability. Genomic instability enables cancer cells to acquire the genetic alterations that drive tumor progression. And so understanding tumor biology is really an exercise in measuring and detecting these divergent clonal populations and linking them to disease progression and response to treatment. One way to infer that a certain clonal lineage has a selective advantage, or a fitness advantage, is to measure its frequency in the population. However, there are significant confounding variables to consider. Today we're going to talk a little bit about two of them: normal contamination by nonmalignant cells found in the tumor stroma, and the simultaneous presence of multiple genetically distinct lineages, which have different copy numbers and relative frequencies. Those things have to be deconvoluted so that we can understand what's really a driver event in these cancers.
Okay, so before delving into the nitty-gritty details of how we do the analysis, let's go over some background on copy number alterations. This is a normal human karyotype. You've seen this plot before. This is a spectral karyogram with chromosome painting, so each chromosome shows up as a unique color. It really makes it easy to appreciate that the human genome is diploid, with two copies of each chromosome, one from each parent. Now the chromosome theory of inheritance is over 110 years old, and it was independently proposed by two scientists, Sutton and Boveri. They identified chromosomes as the linear structures that carry the genetic material of a cell and which behave in a way that's concordant with Mendel's rules of inheritance. They're present in all dividing cells and they pass from one generation to the next. So this scientist, Boveri, was a biologist who was interested in cell organization, and his work was focused on embryonic development in sea urchins. He observed that a certain phenotype of high proliferation and growth was often associated with a change in copy number of one of the chromosomes. And so he made the link that human cancer and cancerous growth is a result of aberrations in chromosome structure that cause cells to proliferate uncontrollably. And in 1960 he was proven correct when folks discovered that CML is caused by a fusion between two genes, BCR and ABL. So now these genomes are from ovarian carcinomas, which have some of the highest burdens of copy number aberrations among cancers. They look nothing like the normal karyotype we saw earlier, and it's obvious from these images that copy number changes are a major feature of cancer. So it makes sense to study the copy number profiles in detail to get insight into these tumors. These karyograms are really laborious to produce but pretty fascinating to look at, so I'll just point out a few key features.
One is that you can spot chromosomal translocations, because we have these chromosomes that have two colors. You can see where different parts of different chromosomes are now stuck together. Second, it's obvious that these are not diploid genomes. In many of these cases, if you look at chromosome one, you can see many, many copies of chromosome one. So some of these tumors have a ploidy of three or four or six or even more. This means that at some point in the evolution of these cancers there was a whole genome duplication event. That's exactly what it sounds like. At some point mitosis failed in a way, and instead of evenly dividing the chromosomes between two daughter cells, all the chromosomes went to one daughter cell. Genome duplication events are a fairly prevalent feature in cancer, and they're one of the ways in which the building blocks of these chaotic genomes are made available. And lastly, I want to point out that we mostly see broad events here, so events involving whole chromosome arms or whole chromosomes, but there are plenty of focal events that we can detect with finer-resolution methods. Conceptually, we can imagine how a copy number alteration might appear when we consider the structure of a chromosome. We see a chromosome here on the left and on the right in these two images, and specifically we want to see how that structure is different between a normal sample and a tumor sample. Just a note on nomenclature: copy number alterations or aberrations are what we use to refer to somatic changes, so these are present in the tumor cells and not the normal, whereas copy number variations are polymorphisms present in the general population. There are databases of common copy number variations that we would use to rule out specific gains or losses as putative drivers.
Okay, so we're talking about somatic amplifications or deletions. These are changes that involve anywhere from 1 kb of DNA to a whole chromosome arm. Deletions reflect the loss of DNA content, and typically loss of a tumor suppressor gene, whereas amplifications involve gain of DNA content and multiple copies of things like oncogenes. So these are the hallmarks of tumor genomes: loss of tumor suppressor genes and gain of oncogenes. Okay, so before we look at some examples of data, I want to mention heterozygosity. The concept here is that our genomes are peppered with positions that vary naturally between individuals. These are single nucleotide polymorphisms; you've heard of SNPs earlier. In fact, there are about 10 million or so of these polymorphic positions in our genomes. For ease of nomenclature, the allele that's most common in the human population is termed A and the allele that's less common is termed B. So it could be that at a specific position in our genomes, 80% of us in the room have a G and the rest of us have an A. Then the G would be the A allele and the A would be the B allele. Here in this image, we see three positions, three SNPs. At the first position this individual is homozygous A, so AA. At the second position, it's AB. And at the third position it's homozygous B. When we see a duplication, this changes, because now we have another copy. At the first position we have three As, so the genotype is now AAA. The heterozygous position becomes ABB, and the homozygous B remains BB because it's not involved in the copy number event. When we have a hemizygous deletion, so we delete just one copy, we lose heterozygosity at this position, so we are left with A and A. And in the case of homozygous deletions, we just lose both alleles.
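To make the genotype bookkeeping above concrete, here is a minimal Python sketch that turns a genotype written as a string of A and B letters into the fraction of B alleles. The helper name `b_allele_fraction` and the particular events listed are illustrative assumptions, not something taken from the slides.

```python
from fractions import Fraction


def b_allele_fraction(genotype: str):
    """Fraction of B alleles in a genotype string such as 'AAB'.

    Returns None when no copies remain (homozygous deletion),
    because there is nothing left to measure.
    """
    if not genotype:
        return None
    return Fraction(genotype.count("B"), len(genotype))


# Hypothetical events applied to a SNP that is heterozygous (AB) in the germline.
events = {
    "normal": "AB",
    "gain of the A copy": "AAB",
    "gain of the B copy": "ABB",
    "hemizygous deletion of the B copy": "A",
    "homozygous deletion": "",
}

for label, genotype in events.items():
    print(f"{label:35s} genotype={genotype or '-':4s} "
          f"B fraction={b_allele_fraction(genotype)}")
```

Under these assumptions a heterozygous site sits at 1/2, a single-copy gain moves it to 1/3 or 2/3, and a hemizygous deletion moves it to 0 or 1, which is the loss of heterozygosity described above.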
So another pattern that marks tumor suppressor genes is copy-neutral loss of heterozygosity, often coupled with a somatic mutation. The idea here is that typically there's a loss-of-function mutation in one allele of a tumor suppressor, which is denoted by this red mark here on the A allele. Then the tumor gets rid of the other copy through a process like, for instance, nondisjunction, which is an error in mitotic division, followed by duplication of the remaining allele. So now we have two copies of the mutated tumor suppressor gene. This pattern of aberrations is often specific for tumor suppressors, and it's something to keep an eye out for when analyzing data. Okay, so let's look at copy number data for chromosome 15. This is the type of plot that you'll see over and over in this module, in the lab and in publications. Below the lower panel, we see the chromosome. It turns out that the p-arm of this chromosome is too repetitive and doesn't have any useful data, so we are just looking at the copy number on the q-arm. On the bottom panel, we see the log ratio of the copy number. In a well-designed experiment, you're always testing the copy number state of your tumor versus a normal, and ideally the normal is matched. You can get around having a matched normal, but that is the ideal scenario. The way that copy number is presented is as the log ratio to the matched normal. So a log ratio of zero means that there's no difference between the copy number in the tumor and the copy number in the normal. For most of this chromosome, we can see that there's a log ratio of zero, with a small gain here and a larger gain here. These are values on the y-axis that are above zero. And then on the top, we have a plot showing the B-allele frequency. And so for the majority of the chromosome, that's heterozygous.
So this person has AB, so it's heterozygous for most of these positions along the chromosome, or it's homozygous for the B allele or homozygous for the A allele. So we see this pattern of 0.5 being heterozygosity, and then 0 and 1 showing that some positions in the genome are homozygous. When we have a copy number gain, so here we have three copies, we now see a departure from the heterozygous state, because instead of having AB at a majority of positions, now we have ABB or AAB. So instead of 0.5, we see a shift towards roughly 0.33 and 0.67. And we still see 0 and 1, because we can still have triple A and triple B. When we have a bigger copy number gain, of an additional copy, we can see a pattern of 0.25, 0.5, 0.75, and then the 0 and the 1. And then there's copy-neutral LOH, where you have no difference in copy number. If we just had copy number, we would never focus on this region at all. But when we look at the B-allele frequency, we see that there's a dramatic shift from 0.5 out to 0 and 1. So we've completely lost heterozygosity there. These are the kinds of features that we look for when we analyze copy number data: we look at the copy number ratio, and then we look at the B-allele frequency, and together they tell us the copy number state along the chromosome. Yes. [Audience question:] I'm sorry if it's really obvious, but how do you know it's not ABC? Like, how do you know, is it by your knowledge of the chromosomes or the change itself, or if it's a three-copy? [Answer:] So most positions that are used for this kind of analysis are heterozygous in the germline of the person or in populations. And yes, you're right. Most of us are G or T at a particular position, but some of us will have an A. It will be a very small percentage, and usually this type of data does not account for that very low frequency. Most SNPs have a major and a minor allele.
And when we talk later about the way that SNPs are chosen to be put on, for instance, the Affymetrix SNP 6.0 array, you'll see that they're chosen to be biallelic. So we usually just have two alleles of interest at a particular position. Any other questions? This is a pretty important plot. Yes. So this is one of the trickier things to get about this kind of data. Let me go back to this. The B allele is the minor allele in the population. And each of us, at each SNP position, we all have, let's say, three million SNPs. We actually have about 10 million SNPs, but at many positions we're homozygous. So we're either homozygous for the A allele, the allele that most people in the population have, or we're homozygous for the B allele, the allele that's more rare in the general population, or we're heterozygous. One of these chromosomes is inherited from your mom and one is from your dad, so it depends on what the genotypes of your mom and dad were. And so when we talk about the B-allele frequency, it's whatever that B allele was. So if it's a heterozygous position and it's a GC, with C being the B allele, and you lose copies of the chromosome that carried the C, then your B allele goes down towards zero, and your A allele will consequently go up towards one. Yes. Yeah, so over many regions in the genome, people are heterozygous for these positions, so you'll have a frequency of 0.5. And when you lose heterozygosity, or when you gain a copy, you'll gain just one of the two copies, or you could gain both. You could have a copy gain of two, where you've gained one of each chromosome, and then you've made one. So maybe this is more helpful: here's another example where we see a diploid area of this particular chromosome, where the allelic ratio is 0.5. This is like a normal part of the genome; nothing has happened to it. Next to it is a single copy gain, so we now see a shift out from 0.5.
If we have a deletion, we also see a shift. We no longer have heterozygosity. So the pattern to look for is heterozygosity, and then loss of heterozygosity in different ways. And here, you can see at the end of this chromosome, we have a copy-neutral loss of heterozygosity. So the copy number ratio is, does that say 0? Yes, it says 0. So there's no copy number gain or loss, but we can see that there's no heterozygosity over the end of this chromosome. This is what I was showing a couple of slides ago, where you lose this region on, let's say, the paternal allele, and then the maternal allele is duplicated. We also see an amplification here. This is a focal amplification, and this is the way that oncogenes are often gained and activated. And we also see a homozygous deletion, in this light green. This is kind of a classic way for tumor suppressor genes to be deleted. These focal events often target things like PTEN and CDKN2A. Here's another example, where we have really deep sequencing data from a tumor that generates this very clean profile. On top, we have the copy number ratio, the log ratio, with 0 being normal. And on the bottom, we see the B-allele frequency. In contrast to the previous tumor, in this tumor we can actually spot subclonal events. When we look at a region like this, monosomy 4q, we see that there's a copy number loss of 1, so we only have one allele left. Our heterozygosity goes from 0.5 out towards 0 and 1, and the difference is pretty dramatic. So this loss is in essentially every cell. This is an early event that all cells carry. In contrast, we can see here on chromosome 13 that there is a deletion, but it's not quite at minus 1. And when we look at the B-allele frequency, you can see that the shift away from heterozygosity is not as pronounced as that for chromosome 4. And so this event is very likely in a subclone of the tumor.
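The reasoning about a deletion that is "not quite at minus 1" can be sketched numerically. The Python below is a rough illustration under simplifying assumptions that are mine rather than the lecture's: a diploid background, a perfectly pure sample, a log base 2 ratio, and a single one-copy loss. The function names are hypothetical.

```python
import math


def expected_log_ratio(tumor_copies: float, cell_fraction: float,
                       normal_copies: int = 2) -> float:
    """log2 ratio when `cell_fraction` of cells carry `tumor_copies`
    at a locus and all remaining cells carry `normal_copies`."""
    mean_copies = (cell_fraction * tumor_copies
                   + (1 - cell_fraction) * normal_copies)
    return math.log2(mean_copies / normal_copies)


def fraction_with_one_copy_loss(log_ratio: float) -> float:
    """Invert the model for a one-copy loss in a diploid background.

    mean copies = 2 - f, so f = 2 - 2 * 2**log_ratio.
    """
    return 2 - 2 * 2 ** log_ratio


# A loss present in every cell sits at log2(1/2) = -1.
print(expected_log_ratio(tumor_copies=1, cell_fraction=1.0))

# A loss carried by only 40% of cells gives a shallower dip.
shallow = expected_log_ratio(tumor_copies=1, cell_fraction=0.4)
print(round(shallow, 3), round(fraction_with_one_copy_loss(shallow), 3))
```

In real data the estimate would also have to account for normal contamination and tumor ploidy, which this sketch deliberately ignores.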
So it's in a subset of tumor cells. And so what we can tell from this data is that monosomy 4 must have happened first, and then the subclonal deletion on chromosome 13 must have happened second. So you can start to time events by looking at this kind of data in this way. Does that make sense? So let's look at a couple of quick examples involving driver genes. This is what amplification of ERBB2, which is a potent oncogene, looks like. This is chromosome 17 of a breast cancer patient. Again, the x-axis is the position along the chromosome. And we can see here that where ERBB2 is encoded, we have this skyscraper of red, which is essentially the signal for copy number gains of this locus. The expectation here is that the tumor genome is diploid except for this region. ERBB2 is amplified in this way in about 15% of breast cancer patients, and it's a driver that leads to proliferation and growth of cells. Patients that have this high-level amplification can be treated with a drug called Herceptin. So this is a great example of personalized medicine based on genomic data. If you find this type of mark in a breast cancer patient, you would think that Herceptin is likely to work. In fact, a technique that's very often used in clinical practice is fluorescence in situ hybridization, where a fluorescent sequence-specific probe is used to label the genomic content of cells. Here the blue blobs are the nuclei of cells. In green, we have fluorescence for a probe that binds to a copy-neutral region of chromosome 17. So you can see that in most of the nuclei we have two copies of chromosome 17. You can see that here, and you can see that here as well. The red probe is a probe that binds to ERBB2. And now you can appreciate that some of these cells have hundreds of copies of ERBB2. So this is a clinically approved way to infer this amplification, or measure it.
An alternative would be to look at protein expression through immunohistochemistry, which is often done as well. And then on the other end of the spectrum, we have homozygous deletions, like PTEN. Maybe it's a bit hard to see this tiny green blob here. This is a complete absence of copies over the region of PTEN. It's focal, so it's very classic for a tumor suppressor. And we see that it's in an area of copy-neutral LOH. Copy-neutral LOH, right? The clinically relevant subset of these alterations, because there are plenty of copy number gains and losses across these genomes, are those that are functional. And we can get at those that are functional by looking at gene expression changes. If you have a copy number event and the gene expression doesn't change, that means the protein expression doesn't change, and the likelihood of it being a driver is very low. That's certainly not the case for ERBB2. Here in the plot on the right, each dot is one patient, so this is a cohort of breast cancer patients. On the x-axis, we see copy number, and on the y-axis, we see expression. Copy-neutral cases will have a range of ERBB2 expression, but as you start to gain copies of ERBB2, you see a concordant increase in its expression. So ERBB2 is not only amplified, it's also highly expressed in all cases that are amplified. That's generally true for focal gains and losses. Here on the left, on the top, we can see that expression, on the x-axis, being high or low, is really dramatically different for alterations that are either high-level focal amplifications or focal deletions. But when you have big broad segments of gain and loss, that's not generally the case. You don't necessarily see big differences in gene expression when you have whole chromosome arm gains and losses. So often the consequences of copy number alterations and mutations are difficult to predict.
There are many computational approaches that involve looking at amino acid changes that result from point mutations or indels or, in the case of copy number, looking at whether gains and losses are found more often at a particular gene than you would expect by chance. But as we saw in the last slide, it's important to consider other molecular measurements like gene expression. And so when inferring function, the patterns of loss of heterozygosity and copy number can be integrated with mutation and expression data to infer which genes have patterns that would correspond to a loss-of-function mutation. In this particular paper, the method xseq was applied to detect putative novel tumor suppressors across a range of cancers, so 12 tumor types and 2,700 tumors, by looking for genes with biallelic inactivation and loss of expression, so concordant changes in their genomes and transcriptomes. OK, so just to finish talking about tumor suppressor genes: we've gone over the idea of the classic two-hit hypothesis, where both copies of a tumor suppressor gene are required to be inactivated before a phenotypic outcome is observed. And just to complete the picture, we can also have what's called haploinsufficiency, where you only need to lose one allele of the gene in order to initiate oncogenesis, and then losing the other allele simply increases the severity of the disease. And in some cases, we have what's termed quasi-insufficiency, where even a small decrease in expression can result in a phenotype, but where the tumor cell cannot tolerate the full loss of the gene, because, for instance, there needs to be an interaction between the wild-type and the mutant allele. So there are different patterns of tumor suppressor loss.
So by looking at the gene content of these recurrent copy number alterations, and specifically at the high-level gains and homozygous deletions, we can see that certain genes come up over and over as targets across numerous cancers. These correspond basically to the known oncogenes or tumor suppressors that we've heard a lot about: ERBB2, EGFR, PI3 kinase, and so on, and also tumor suppressor genes like PTEN and, as I'll mention in a second, BRCA1 and 2. Identifying the full repertoire of these driver events in cancer, especially the more rarely mutated ones, takes lots of cancer patients, and so large cohorts of data. This is just a very short list of some big consortium efforts that have applied either array technologies or sequencing technologies to characterize, in many cancers and across many tumor types, the copy number landscape of cancer. And of course, the ultimate goal of all this activity is to find actionable targets. These are the genes or the pathways that cause cancer cells to proliferate. Once we figure out what they are, we can develop therapeutics against those targets in the hopes of better outcomes for patients. And so this is just a brief list of specific actionable targets, specifically copy number alterations in cancer that can be therapeutically targeted. It's a pretty short list. And it's pretty evident, I think, that gain-of-function events, so amplifications, are much more feasible to target with small-molecule drugs that work by inhibiting the action of a protein. When you have tumor suppressor loss, there's no small molecule that's going to easily give you back the function of that tumor suppressor. And so most of our targeted therapies are against gain-of-function mutations or amplifications of these types of genes. For instance, in breast cancer, every case is tested for ERBB2 positivity. And in cases where the patient shows a high level of amplification, she's eligible for Herceptin treatment.
So that's one of the poster children of personalized medicine. In addition to guiding treatment, the nature of the genomes in cancer can also be used to stratify patients. This is a nice synthesis study that shows that cancers reside on a spectrum, where at one end, here on the left, tumors harbor a lot of point mutations, and at the other end, they harbor a lot of copy number alterations. So we can infer that there's selection either for defects in the DNA repair that fixes double-stranded breaks, which leads to genomic instability and the accumulation of many copy number aberrations, or for a deficiency in mismatch repair, which causes point mutations, single base changes. It's more rare to see both of these processes operate in a single cancer. And indeed, it looks like tumors kind of fall to the outskirts of these two ranges. We can see that the ovarian cancers, which are this red line, are highly enriched in this part of the spectrum. They have those chaotic karyotypes we looked at, and some of the highest levels of copy number change and genomic instability. This way of stratifying patients also opens up a different way to look at therapeutic opportunities, because drugs have been specifically developed to interfere with each of these acquired capabilities that are necessary for tumor growth and progression. Many of these are in clinical trials or have been approved for clinical use. And I just want to spend a moment briefly looking at PARP inhibitors, which are a class of agents that target genomic instability. The key idea here is that DNA is damaged thousands of times during each cell cycle. Every cell in your body that's dividing will have nicks that occur somewhere along the genome, and that damage has to be repaired in order for the cell to proceed through the cell cycle. PARP is a protein that's important for repairing these single-stranded breaks, these nicks.
If these nicks are not repaired, so if we have an inhibitor against PARP, then during replication those nicks turn into double-stranded breaks. And so these PARP inhibitors have the effect of inducing hundreds to thousands of double-stranded breaks in the genome of cells, which is OK in a normal cell that has a functioning copy of BRCA1 or 2, because BRCA1 and 2 are involved in repairing double-stranded breaks. Breast cancer cells that are mutated for BRCA, so they have homozygous loss of BRCA, cannot perform homologous recombination repair, and so many double-stranded breaks accumulate that those cells die. So these two events, BRCA mutation and PARP inhibition, are in a way synthetically lethal to one another. Either one on its own is fine; the cells will survive. But if you put them together, they're synthetically lethal. And so despite these tumors having a mutation in what is a tumor suppressor, which is really hard to target with small-molecule drugs, we can take advantage of the synthetic lethality to come up with effective therapies in a different way. Currently there are tests for BRCA1 and 2 deletions or mutations. In those cases where a woman has a BRCA-mutant breast cancer or a germline mutation, she would be eligible for treatment with a number of these drugs. I think two are approved, and then there are a number in clinical trials. So this is a great example of stratifying patients based on driver genomic alterations and then identifying combinations of tumor dependencies that are amenable, in principle, to targeted therapeutic intervention. OK, so let's move on now to talking about some of the main confounding factors that make copy number inference challenging. Basically, this task is difficult for two or three main reasons. First of all, cancer cells are nearly always intermixed with some unknown amount of normal cells. The proportion of cancer cells in the sample is referred to as tumor purity.
And the actual DNA content of the cancer cell, or the ploidy of the cancer cell, is unknown at the beginning. So we don't know if the karyotype is like those crazy ovarian cancers we looked at, where you might have six copies of the genome, or if it's a diploid tumor. But this will influence the copy number analysis in a significant way. And then, of course, we also have genetically divergent lineages in a cancer sample, which may differ at various loci in terms of gains or losses. When these values are unknown, they have to be estimated or predicted. And often, there is more than one combination of purity and ploidy that can equally well describe an observed copy number state. Here are two examples of different combinations of purity and ploidy that give us the exact same copy number. In the first case, we have a homozygous deletion and a 40% tumor purity. At this locus that's been homozygously deleted in the tumor, which is 40% of the sample, we have zero copies, so that's zero. And then we add two copies in the normal cells, times 0.6, because normal cells make up 60% of the sample. So we get 1.2. And if we have a heterozygous deletion in a tumor with 80% purity, then we have the one copy that's left in 80% of cells, and two copies in the normal cells, which are 20% of cells. So again, we get 1.2. And so we can't tell which one is more likely, or we need clever ways to tell which one is more likely. Here's another example where we get equivalent copy number and B-allele frequency. In one scenario, we start out with a diploid tumor where the heterozygous genotype is AB, so we have a B-allele frequency of 50%. And we gain an A, so our genotype is AAB. Or in another scenario, we have a tetraploid tumor, so we start with AABB, and we have a one-copy loss, so we end up with AAB. In one case, we ended up there through a gain. In the other case, we ended up there through a loss.
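The purity arithmetic above can be written as a one-line mixing model. The Python below is a minimal sketch: the function name `observed_copy_number` is hypothetical, and the purity values are chosen on the assumption that the tumor and normal fractions of the sample sum to one and that normal cells carry two copies.

```python
import math


def observed_copy_number(tumor_copies: float, purity: float,
                         normal_copies: float = 2.0) -> float:
    """Average copy number measured in a bulk sample that mixes
    tumor cells (fraction `purity`) with normal cells (the rest)."""
    return purity * tumor_copies + (1.0 - purity) * normal_copies


# Scenario 1 (assumed values): homozygous deletion, 40% tumor / 60% normal.
homozygous = observed_copy_number(tumor_copies=0, purity=0.4)

# Scenario 2 (assumed values): heterozygous deletion, 80% tumor / 20% normal.
heterozygous = observed_copy_number(tumor_copies=1, purity=0.8)

print(round(homozygous, 3), round(heterozygous, 3))
print("indistinguishable:", math.isclose(homozygous, heterozygous))
```

Both scenarios average out to about 1.2 copies, so the copy number at this one locus cannot tell them apart; the purity has to be pinned down from the rest of the genome.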
And so knowing whether you've gained or lost that piece of DNA is kind of important. And so purity and ploidy are something that need to be accurately estimated from the data before we can do any more inference. And so computational tools have been developed to specifically address these problems. This is an overview of the ABSOLUTE algorithm, which takes in processed copy number segments and B allele frequency calls from a sample. And it tries to infer the best combination of purity and ploidy given that tumor's profile and also pre-existing knowledge about cancer karyotypes. And so in this particular case, here in B, we see a genome-wide view of copy ratios for allele A and B. And whenever we have a purple segment, A and B are at equal ratios. And then whenever we have a divergence between blue and red, we have gains or losses of one or the other allele. And the gray and white bars are alternating chromosomes. So this is a genome-wide view. And here on the right, we have the sum, the whole genome profile of this tumor. And so we can see that most of the genome is in this state. A sizeable minority is in this other state. And then there are some losses and additional gains up here. And so you can actually explain this copy number status in at least three different ways, so at least three different combinations of purity and ploidy, which are shown here on the bottom left. So we can have, for instance, a ploidy of 4n, which means that each allele is at 2. So we have 2 and 2 for the maternal and paternal allele over most of the genome. And then in many cases, we would have loss of one of those alleles, or gain of one of those alleles, or gain of both, or a subclonal gain. We can also explain that copy number ratio with a ploidy of 7 or a ploidy of 1. It turns out that ploidies of 7 and 4 fit the data about equally well. But if you look at cancer karyotypes, 7 is not as probable as 4. It's much more likely to have a tumor with a ploidy of 4 than a ploidy of 7. 
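Here is a small Python sketch of why several purity and ploidy combinations can explain the same copy ratios. It assumes a commonly used mixture formula in which the copy ratio at a locus is the purity-weighted copy number at that locus divided by the purity-weighted average over the genome. This is an illustration under that assumption, not the actual ABSOLUTE implementation, and `copy_ratio` is a hypothetical helper name.

```python
import math

def copy_ratio(purity, tumor_ploidy, tumor_cn, normal_cn=2):
    """Expected tumor/normal copy ratio at a locus under a simple mixture model (assumption)."""
    locus = purity * tumor_cn + (1 - purity) * normal_cn
    genome_average = purity * tumor_ploidy + (1 - purity) * normal_cn
    return locus / genome_average

# Solution 1: a diploid tumor at 50% purity with segments at 1, 2 and 3 copies.
diploid = [copy_ratio(0.5, 2, cn) for cn in (1, 2, 3)]

# Solution 2: a tetraploid tumor at 1/3 purity with segments at 2, 4 and 6 copies.
tetraploid = [copy_ratio(1 / 3, 4, cn) for cn in (2, 4, 6)]

for d, t in zip(diploid, tetraploid):
    print(round(d, 3), round(t, 3))

same = all(math.isclose(d, t) for d, t in zip(diploid, tetraploid))
print("same observed ratios:", same)
```

Both solutions predict identical ratios, so the data alone cannot pick one. That is where prior knowledge about which karyotypes are common comes in.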
And so this algorithm takes into account your copy number states and then everything else that's known about cancer karyotypes in general, and will make a call and give you a prediction score for which purity-ploidy combination it inferred. So taking this approach and then looking at purity and ploidy across 5,000 cancers, it turns out that over a third of all cancers have a ploidy greater than 4, meaning that they must have undergone a whole genome duplication event at some point in their evolutionary history. And so here we see that for 12 different kinds of cancers across a range of different purities. So some cancers are more pure than others. But within each type, you have some more pure samples and some less pure samples. There's a significant proportion of tumors that are diploid. So these are the purple samples. So they have a ploidy of 2. But more than a third have undergone at least one genome duplication, and then a minority have undergone more than one genome duplication. So those are the green and the red. And there's compelling evidence that genome-doubling events actually happened fairly early in tumorigenesis. So what this plot shows, for broad events on the top and focal events on the bottom and for gains in red and losses in blue, is that in samples with whole genome duplication, which for each cancer type is the bar on the right, more amplifications and deletions occur after the genome duplication. So the way to read this graph, the bar on the left is always the tumors in each type that are not duplicated and the bars on the right are the ones that are duplicated. And we can tell which events must have happened before the duplication or after the duplication. So if we look at the overall signal from amplifications and losses that are broad, we can see that before genome duplication, some events happened, but a lot more happened afterwards. 
And then in whole genome-duplicated tumors, we see a lot more events than in non-duplicated tumors. So whole genome duplication is kind of a hallmark of cancer instability. So you generate all these extra copies of every chromosome, and then you are free to lose various parts of your genome because you have backups, unlike a diploid genome, where if you lose large chunks of your genome, often that will cause a cell cycle arrest or death of that cell. And so losses are actually much more prevalent in these genome-duplicated cases than in non-genome-duplicated cases. OK, so one final example of the clinical relevance of genome-doubling events in cancer. This is from a recently published cohort of 100 lung cancers where each patient tumor was genomically profiled using exome sequencing from multiple spatially separated biopsies of the primary untreated tumor. So you can see here they took four different biopsies. And then, this is just a conceptual design, they performed copy number analysis and mutation calling on each region. And this data can then be used to work out the phylogeny of this cancer. And so this phylogenetic tree really depicts the mutational trajectory of this cancer. So it starts with some initiating events here in gray, a genome duplication event that happens fairly early on, a few more events that are found in every single cancer cell, and then branching of these into different clonal lineages that have distinct mutations or copy number gains and losses. So these are subclonal events. And so I'm only going to highlight a few of the most relevant findings for our discussion now. First, they found that nearly 50% of copy number alterations in these lung cancers were subclonal. And that means they were restricted to a certain part of the tumor. Second, without multi-regional sampling, 70% of the subclonal events would look clonal because you wouldn't know that they're missing in other cells. 
So if you just profile one biopsy, you'd be wrong 70% of the time in calling a subclonal event clonal, because it's in every cell of that specific region of the tumor but not elsewhere. Third, early genome-doubling events were actually associated with higher levels of subclonal events. And the patients with high levels of subclonal events did a lot worse than patients without them. So genomic instability in this disease is prognostic. And then finally, this approach, when considering mutations and copy number alterations as a function of the whole cohort, which is sort of summarized here on the left, really allows us to classify genes into those that are lost or gained or mutated before a genome-doubling event and those that are lost or gained or mutated after a genome-doubling event. And so those that are lost or gained or mutated early are possible initiating events. And these later ones are possible maintenance events. And so when you think about therapeutic targeting of lung cancers and you've just profiled one region and you've picked a clonal mutation or alteration that you think is a driver and you're going to target it, it makes a difference whether you had multi-regional sampling or not, because that mutation may not actually be clonal. And so this is kind of a great paper. It goes over a lot of the concepts that we're going to talk about today and tomorrow. And I encourage you to read it if you get a chance on top of all your other reading. OK, so let's now talk about measurement technologies and computational methods. This slide essentially shows a progression of measurement technologies for copy number changes that range from low resolution and high accuracy here on the left, to a middle ground that has a higher resolution, and coming up to the present day with really high resolution and high accuracy measurement technologies like genotype arrays and whole genome sequencing. And so I mentioned FISH already. 
This is a method where you can target a small number of loci or genes or regions of interest and look at their copy number at single cell resolution, which is great, but it's very laborious and it's not high throughput at all. And then in the early 2000s, hybridization array platforms were developed. So these had between 30,000 and 100,000 probes. So these were positions in the genome and one would label a tumor and a normal sample, wash them on these arrays, and then look at the differences between the red and the green label. And so you could focus in on gains or losses in a tumor sample versus a normal. But you had no information on loss of heterozygosity or allele frequencies, for instance. And so Affymetrix and Illumina really drove the cancer genome analysis copy number field in the mid to later 2000s with the advent of these genotyping arrays. So these not only measure copy number, they measure the allele frequencies of polymorphisms in the general population. And now, of course, we've moved on to 3 million differences with whole genome sequencing. So we are really at the high resolution and high accuracy end of the scale. And in today's lab, you're going to have a chance to look at both array data, so Affymetrix array data, and sequencing data. OK, so from this data, we're going to try to infer copy number gains and losses and loss of heterozygosity, which is not easy. So this is a challenging statistical inference problem. And I've already mentioned some of the challenges, normal contamination. So we have to account for purity. We have to account for tumor ploidy. We have to account for the possibility that there are multiple genetic clones that are different in our cancer sample. And currently, most experimental designs don't actually take multi-regional sampling. So we can't use multiple samples to infer whether something is clonal or not clonal. And so this is also a challenge. 
But basically, there are now statistical tools that will perform this prediction of purity and ploidy, and deconvolution of these mixtures of cells, which is a really important aspect of the analysis. And if you want to review some of these statistical considerations, this is a really good paper by Terry Speed. It's a good read in general and for the problem at hand. And just to remind us, before looking at some examples, this is the kind of data that we're going to look at. Or this is the basis for the inference. In other words, ideally normal samples where we have heterozygous or homozygous polymorphisms and no copy number differences, and tumors which may have duplications and changes in heterozygosity. So from a heterozygous state to a loss of heterozygosity or to an altered genotype frequency. And then a quick reminder on genotypes. In a diploid genome, we have the two alleles. A is the more prevalent in the population and B is the less prevalent. So at a given position, for example, the C will be the A and the T will be the B. And you use that A and B for everyone. So when we have two copies of the genome, there are three possible genotypes, AA, AB, and BB. So that's shown up here. If the germline of the person is heterozygous at a particular SNP and you see AB, then that position is still heterozygous. If you see A or AA, then that is loss of heterozygosity. B or BB, loss of heterozygosity. When we have a copy number gain, so three copies, we now have four possible genotypes. And so these are the ones we talked about before, where you're shifting away in heterozygosity from 0.5. So your B allele frequency would be 0, 0.33, 0.67, or 1. And the same by the time we get to a copy number of 5, where we can have six possible genotypes. And so these are the zygosity states where we can have complete loss of heterozygosity, allele-specific copy number alterations, ASCNAs, or retention of heterozygosity. 
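The genotype counting above can be written out in a few lines of Python. This is just an illustrative enumeration, assuming the position was heterozygous (AB) in the germline; the function names are hypothetical.

```python
def genotypes(copy_number):
    """All possible A/B genotypes at a given total copy number (copy_number >= 1)."""
    return ["A" * (copy_number - b) + "B" * b for b in range(copy_number + 1)]

def b_allele_frequency(genotype):
    """Fraction of copies that carry the B allele."""
    return genotype.count("B") / len(genotype)

def zygosity_state(genotype):
    """Label a tumor genotype, assuming the germline genotype at this SNP was AB."""
    a, b = genotype.count("A"), genotype.count("B")
    if a == 0 or b == 0:
        return "LOH"
    if a == b:
        return "heterozygous, balanced"
    return "heterozygous, imbalanced"

for n in (2, 3, 5):
    print(f"copy number {n}: {len(genotypes(n))} genotypes")
    for g in genotypes(n):
        print(f"  {g:<6} BAF={b_allele_frequency(g):.2f}  {zygosity_state(g)}")
```

A copy number of n always gives n + 1 possible genotypes, which is why the BAF bands on the plots split into more and more levels as copy number goes up.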
And so we do want to ascertain what the genotype actually is, because we often want to know if the tumor has balanced heterozygosity or complete loss of one allele in favor of another, because that actually tells us something about the biology of the disease. So it's important to do this process. Here's an actual example of a SNP. And so when we infer the B allele frequency in both array and sequence data, that relies on measuring the relative frequency of the A and B allele at particular positions. And those positions we usually get from dbSNP. So this is a database that warehouses all the polymorphic information for humans and other species as well. So in the latest version, dbSNP 150, there are about 130 million single-nucleotide polymorphisms that have a known frequency in the human population. So here are three examples. This is in the gene BRCA2. And this gene has 8,200 such variations. So a lot of them don't necessarily have an allele frequency that's known. But these three examples do. These 8,200 will actually mostly be non-coding. And for each one, we can see what the alleles are. For instance, at this position in the genome, it's a T and a C. The global MAF stands for the global minor allele frequency. So the C is the minor allele, and it's found at 0.009, and that's the proportion in the population. So 0.9%. This position is a G/T. The G is the minor allele, and it's found in 37% of people. So at this position, the A allele is T, and the B allele is G. And same over here, we see that the T is found at such a rare frequency in the population that you need more than three decimal places to write it down. And we only have three here. So that wouldn't be on a genotype array, for instance, because it's always going to be a CC. Almost no one will have the T. So it's useless to put it on the genotype array. So on the Affymetrix genotype array, the SNP6 array, we have probes that contain these polymorphisms. 
So these are 25-mer oligonucleotide probes. There are almost a million of them, so 900,000 probes that will be a perfect match to each allele of the SNP. And then there are also almost 950,000 probes that don't bind to SNPs. They bind to areas of the genome that are known to be variant in terms of copy number in the general population. So these are the CNV probes. The average heterozygosity of SNPs on the array is 25%. So for each SNP, on average, it'll be heterozygous in 25% of people. It's kind of hard to read that. And what we measure on this array are hybridization intensities. So each probe will bind to its target. So we add the person's DNA to the chip. And if there is binding, it's a fluorescence-based assay. And so we read out basically a continuous signal of intensity. And because we know where each one of these polymorphisms is in the genome, that's how we can make those plots of signal along the chromosome. And we can see areas of gain or loss. OK, so this is a little bit more detail about... oh, yeah? (Audience: Does that work the other way? So across your own genome, are you...) So you have a 25% chance of being heterozygous at that one position. And then at some other position, you have a 25% chance of being heterozygous. And you have 900,000 of these positions. So chances are pretty good that you'll be heterozygous at a lot of them. If the average heterozygosity of these SNPs was 0.001, you would likely be homozygous at most of these positions. And so you would never be able to do allelic ratios because it would almost always be 0 or 1. So these were chosen specifically to tip the scale in favor of finding heterozygous markers. And so the way that that works in the assay is that, so let's say we have a SNP of interest here. It's an A to a C. A is the red. C is the green. The probes that are designed will be complementary to one or the other allele. So we have a probe here with a T. So that's complementary to the A. 
And a probe here with a G, which is complementary to the C. And not only do we have one probe for each allele, we have multiple probes where the base that's changing is in the center of the probe or a few base pairs up or down. So we have about 20 probes per position on a SNP6 chip. And let's assume that we have an individual who is homozygous for allele A. The DNA of this person will bind to the probe that has a T. It will bind to the other probe that has a T in a different position. But it will also bind to the probe that has a G. It's just that the binding is much weaker. So because it's not a perfect binding over the 25 base pairs, that fragment will dissociate a lot of the time. And so there's always a background signal. But that signal will not be as strong as the ones from a perfect binding. And then same for a person who has the C allele, for instance, the DNA of that person will bind to the G, but also to the T. And so again, we're looking for a much increased signal on one of the probe sets. And so the intensities are kind of summarized here on the right. We either have homozygous A, where we just see signal for the probes complementary to the A, homozygous B, which are the probes complementary to C, in this case. Or we'll see both probes light up, or all probes light up, which will be the case for individuals that are heterozygous at this position. So this is the data that we read out as the signal intensities. OK, so people still use SNP6 arrays. There's a wealth of data available out there from large consortia. So TCGA, for instance, has 11,000 tumor samples profiled with SNP6 across a range of different cancers. And many of these have other types of data, so expression or methylation and so on. And until recently, there was no equivalent for other species, or specifically for mouse. But now there is a genotyping array for mice. 
And the way that it was designed, you can actually characterize a wide range of strains. So if you're working with model systems, this is probably pretty useful. So you could uncover genetic events in mouse models of disease, for instance. And it would be the same type of analysis as we talked about for the human array. So the workflow, and part of what you'll do in the lab today, is to take genotyping arrays on the SNP6 platform, where we're going to start with a CEL file. So this is the data that comes off the machines. And the workflow is to pre-process the signals from these probes and perform normalization so that we have signals that are comparable across the genome and across samples. And then this is followed by a couple of different extraction techniques, one on the left to generate calls for copy number, and then on the right to generate calls for the B allele, the minor allele frequency. And then those measurements are processed with a statistical model that can infer where the copy number and B allele ratios change across the genome. So when we go from one copy number state to another. And then once we have those segments of gain and loss or LOH, we can follow up with some of the activities you'll do in the other modules, like finding genes that are overrepresented in these regions or looking at pathways that may be hit significantly. And so this is the workflow for SNP6, but it's really generalizable to sequencing data as well. And just a quick note that for any kind of data, normalization is absolutely required to remove platform-induced artifacts. And so the probes are actually not perfectly specific. They will actually hybridize with other parts of the genome. There's that background signal I mentioned. The degree of hybridization can be affected by the length of the DNA fragments, for instance. And the probe may have worse binding in the presence of mutations, or if there are clusters of SNPs, so some filtering can be done there as well. 
And so you need a package, for instance, like the aroma.affymetrix package that handles a lot of these artifacts and does the normalization so that the experiments are comparable with each other. And you have outputs, which are copy number and B allele frequencies, which actually reflect biology and not the artifacts of the platform. OK, so once we have normalized data, we can start mining them for copy number aberrations, LOH, and allele-specific changes. And so again, here are just a couple of examples of some very clean signals, especially for the B allele ratio for gains and losses. So here we see a small gain. Here we see copy neutral LOH and so on for this. And in this slide, I'm just listing a number of methods for high density genotype arrays, including OncoSNP, which I think is the one that we're going to use in the lab, as well as ABSOLUTE, which I talked about earlier. OK, so the copy number field was dominated by genotyping arrays for many years. And relatively recently, whole genome sequencing has been routinely performed as the cost of sequencing has dropped. So I think it currently costs $1,400 to do a 30x genome and much less to do an exome. I think it's about 650. But basically, in a whole genome or exome experiment, libraries are essentially made. You've heard of this or a version of this already. But you make libraries by shearing or fragmenting DNA and then selecting some fragments that have a reasonable size, for instance, 300 base pairs, and then sequencing from both ends of each fragment. So here in this diagram, we see a fragment. And the sequenced parts are these orange ends. And so what we can see in a sequencing experiment is that coverage corresponds to copy number. And so we might infer that the average coverage is a diploid genome. And then changes in coverage where we have more reads would be gains. And changes in coverage where we have fewer reads would correspond to deletions. 
And then areas of the genome where we don't have any reads would be homozygous deletions. And sequencing reads also give us the allelic ratio at mutations and single nucleotide polymorphisms. So we can infer B allele frequencies in an analogous way to array data. Of course, we do this by read counts instead of intensity signals. And actually, that's one of the major differences between these two platforms because we're going from an analog technology to a digital one. So we're going from intensities to counts. Not surprisingly, whole genome data is also subject to different biases. So GC content is actually a major culprit contributing to changes in coverage that are not due to copy number. And so we see here in this plot on the top left that regions with a high GC content have a higher coverage. There is a correlation. And so this is an aspect of the data that we need to account for and correct. And regression techniques are typically used to correct this bias. So after correction, we see that read coverage or read counts are no longer correlated with GC content. So we want to remove this. And then another interesting feature of genomes is mappability. And so lots of areas in our genome are repetitive sequences. And if you cannot uniquely align a sequence, you have low mappability. So there's this inverse correlation of mappability or read coverage with repeats. And this is also something that we should and can correct. So these are the two main things to correct. So when we do preprocessing on sequencing data, we take data that looks like this. So when we look across this particular chromosome, if we look in 1,000 base pair bins and count the read coverage or copy number, we can see that there's a variation that goes up and down. And once we correct for GC, a lot of that variation is gone. 
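As a rough illustration of what GC correction does, here is a short Python sketch on made-up bin counts. Note that it swaps the regression fit mentioned above for something simpler: each bin is rescaled by the median count of bins with similar GC content. The function name, the 5% GC strata, and the toy numbers are all assumptions for the example, not what any specific tool does.

```python
from collections import defaultdict
from statistics import median

def gc_correct(read_counts, gc_fractions, stratum_percent=5):
    """Rescale each bin by the median count of bins in the same GC stratum (assumed scheme)."""
    def stratum(gc):
        return round(gc * 100) // stratum_percent

    by_stratum = defaultdict(list)
    for count, gc in zip(read_counts, gc_fractions):
        by_stratum[stratum(gc)].append(count)

    overall = median(read_counts)
    corrected = []
    for count, gc in zip(read_counts, gc_fractions):
        expected = median(by_stratum[stratum(gc)])
        corrected.append(count * overall / expected if expected else 0.0)
    return corrected

# Toy data: coverage rises with GC, and the last bin of each GC group carries a 1.5x gain.
gc = [0.30] * 4 + [0.50] * 4 + [0.70] * 4
raw = [80, 80, 80, 120, 100, 100, 100, 150, 120, 120, 120, 180]

corrected = gc_correct(raw, gc)
for g, r, c in zip(gc, raw, corrected):
    print(f"GC={g:.2f} raw={r:>3} corrected={c:.1f}")
```

In the raw counts, the gained low-GC bin (120 reads) looks the same as an ordinary high-GC bin (120 reads). After correction the unaltered bins share one level and the gained bins share another, which is the property the segmentation step relies on.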
And once we correct for mappability and GC, we are left with a pretty clean signal where it's very easy to infer that a copy number event, so a gain or loss, is happening. So in the lab, you'll use TITAN, which has this processing built into it. And then once we have preprocessed and normalized data, segmentation is applied to infer which of these contiguous regions have concordant copy number and B allele frequencies. So we go from these plots that have all these black dots to these plots where different colors denote different segments with concordant copy number states. So you can see about 30 different segments here. So here we have, it's very hard to read. Blue is 0. And then you can see there's a gain. And then another copy, neutral. And then a homozygous loss, a heterozygous loss, a subclonal loss, another heterozygous loss, another piece of normal. We're missing information at the centromere because that is one of the low mappability areas. It's full of repeats. So we often don't have signals at centromeres and telomeres. And then we keep going like this. So this is just an example of a genome that was processed with a tool called APOLLOH, which is the precursor to the tool you'll use in the lab, TITAN. And this is from the publication that compared SNP array data, which was the gold standard, to doing this kind of analysis using whole genome data. And so these data sets were generated from the same aliquot of DNA. And you can see that they have really concordant calls. And this tool also takes into account a stromal parameter, SP. So this reflects the amount of normal contamination in the sample. It also infers the ploidy of your tumor sample. And SC is spatial correlation. And I think that's just a metric that will account for how far away your heterozygous positions are. OK, so whole genome is cheap, as I mentioned, compared to a few years ago, and whole exome is even cheaper. It's about $650. 
So there's actually a lot of interest in performing this type of analysis on exomes. But they only give you about 1% to 2% of the data you get from a whole genome. And so a lot of people who work with exomes, and TCGA is dominated by exome capture data, want to perform copy number and LOH analysis on this type of data. And so that's actually very possible. You can get fairly good specificity and sensitivity in finding deletions and amplifications. It's not very good for finding areas of LOH. Often the borders are very fuzzy. And you don't have a lot of statistical power because you don't have as many heterozygous positions. There are not that many SNPs in the exome compared to the genome. And so often LOH suffers in exome data. But it is possible to analyze copy number and LOH from exomes. Control-FREEC will only do copy number. TITAN does both copy number and LOH. And here are just some examples from Control-FREEC where it corrects for GC content and mappability and generates these nice plots. Although you can see that the dots are much more sparse than in whole genome data. So this is, again, that super clean example of the typical features we look for in a copy number analysis. This doesn't give us anything about subclonal events, which we know are prevalent in cancer genomes. And so how can we find those? They actually show up as weaker signals that are centered around non-integer copy numbers. So here we see this region is a copy number loss of 1. And this other region is a copy number loss of 0.5. So this is a subclonal event. It is not in 100% of cells. And so there are tools like TITAN which can predict subclonal events and deconvolute the likely lineages that are present in the sample. So for instance, in this case, we have this clonal event, this loss, which happens early on. It's in 100% of cells. We have heterozygous loss. And then there's a subclonal cluster, this cluster 2, which involves deletion of this region and this region. 
And that explains why you would have a copy number of 1.5 instead of 1 at these regions. So tools will attempt to, with information from purity and ploidy, deconvolute the remaining signal into subclonal populations. So the conceptual framework for this tool is that we ideally profile a matched normal and tumor sample. We extract the positions in the genome that are heterozygous in the normal. So here are two examples where we see that both of these alleles are around 50%. And then we look in the tumor and we determine whether the heterozygosity is retained or lost, so LOH. And typically there are about 2 to 3 million SNPs per individual. And at each of these positions, we basically count the alleles. And so we then apply a statistical model that takes genotypes and coverage as input and tries to learn where the copy number and LOH segments are and then tries to determine their cellular prevalence, which schematically kind of looks like this. Let's say we have a tumor with a mixture of cells corresponding to two genotypes. The first has a deletion and a gain. The deletion and the gain are in basically 100% of cells in this genotype. The second lineage has just a deletion and no gain. And so the idea is that when you mix these two lineages together and then add in some normal contamination, you have a mixture of lineages and the confounding factor of lower tumor purity. And so you have a signal like this, where you have the loss, which retains its signal because it's in every cancer cell, and the gain, whose signal is suppressed because it is not in every cancer cell. So now we see it as proportional to the amount of cancer cells that were in that lineage with a gain. And so the task of these tools is basically to account for normal contamination and deconvolute the events that are in both or single lineages. So benchmarking says we can do this pretty well compared to other tools. 
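The mixing idea described above can be sketched in Python as a forward model plus its inverse. This is a deliberately simplified illustration, assuming a linear mixture and that tumor cells without the event sit at two copies; it is not how TITAN itself is implemented, and the function names and the 80% purity are made up for the example.

```python
import math

def expected_copy_number(purity, lineages, normal_cn=2):
    """Average copy number at a locus.

    lineages: list of (fraction_of_tumor_cells, copy_number) pairs whose fractions sum to 1.
    """
    tumor_average = sum(fraction * cn for fraction, cn in lineages)
    return purity * tumor_average + (1 - purity) * normal_cn

def cellular_prevalence(observed_cn, purity, event_cn, normal_cn=2):
    """Fraction of tumor cells carrying the event, assuming all other cells are at normal_cn."""
    return (normal_cn - observed_cn) / (purity * (normal_cn - event_cn))

purity = 0.8  # hypothetical sample with 20% normal contamination

# A one-copy loss in every tumor cell, and a one-copy loss in only half of the tumor cells.
clonal_loss = expected_copy_number(purity, [(1.0, 1)])
subclonal_loss = expected_copy_number(purity, [(0.5, 1), (0.5, 2)])
print("observed, clonal loss:", round(clonal_loss, 3))
print("observed, subclonal loss:", round(subclonal_loss, 3))

# Inverting the model recovers how many tumor cells carry each loss.
print("prevalence, clonal:", round(cellular_prevalence(clonal_loss, purity, event_cn=1), 3))
print("prevalence, subclonal:", round(cellular_prevalence(subclonal_loss, purity, event_cn=1), 3))
```

If purity is not accounted for first, the clonal loss here would itself look like it is present in only 80% of cells, which is why purity and ploidy have to be estimated before any subclonal deconvolution.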
We can do this with high precision at various mixing ratios of those two lineages, so where one lineage is closer to 0% or 100%. And it shows that considering this population structure increases sensitivity to subclonal events. So you need to account for this to find subclonal events. If you don't have this as a parameter in your statistical tool, then your chances of finding subclonal events are significantly reduced because you're essentially treating them as noise. So here's an example where we applied TITAN to exome data from brain tumors that were in many cases highly genetically heterogeneous. So genetic heterogeneity in this case is just illustrated here as different colors. So these are cells in a tumor. And we picked multiple regions from these tumors to sequence. And I'm just showing you here four examples on the right of some high-grade gliomas and some medulloblastomas. And we can see that a lot of the genome is copy neutral. That's gray. And then we can see for each region, so 1, 2, 3, 4, 5, 6, in this high-grade glioma, we had eight regions. For each region, we can see copy number gains on top in red and losses on the bottom. And the intensity of the color indicates how clonal that event is. So in medulloblastoma, loss of 9q is an early event. That is essentially one of the events that takes out PTCH, which is a tumor suppressor. So you mutate PTCH on one copy, and then you lose the other copy and duplicate the mutated one, or you just lose the other copy. So you're left with a mutated copy of PTCH. So that's an early event. And then I think in the glioblastomas, you have gain of 7 and loss of 10, which are early, early events. But there are plenty of events that look clonal in one region and are not found anywhere else in the tumor or that are subclonal everywhere, which means probably that there is a different part of the tumor we didn't profile, let's say this purple region, where that event is actually clonal. 
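One way to think about reading those multi-region plots is as a simple classification over per-region cellular prevalence. The Python sketch below is purely illustrative: the 0.9 cutoff for calling an event clonal, the region names, and the prevalence values are all hypothetical and are not taken from the study.

```python
def classify_event(prevalence_by_region, clonal_threshold=0.9):
    """Classify one copy number event from its cellular prevalence in each sampled region."""
    values = list(prevalence_by_region.values())
    detected = [p for p in values if p > 0]
    clonal = [p for p in detected if p >= clonal_threshold]
    if not detected:
        return "not detected"
    if len(clonal) == len(values):
        return "clonal in every region"
    if clonal:
        return "clonal in some regions only"
    return "subclonal wherever detected"

# Hypothetical prevalence values for three events across four biopsies of one tumor.
events = {
    "early arm-level loss": {"R1": 1.0, "R2": 1.0, "R3": 0.95, "R4": 1.0},
    "regional gain": {"R1": 1.0, "R2": 0.0, "R3": 0.0, "R4": 0.0},
    "late loss": {"R1": 0.3, "R2": 0.5, "R3": 0.0, "R4": 0.4},
}

for name, prevalence in events.items():
    print(f"{name}: {classify_event(prevalence)}")

# Looking at biopsy R1 alone, the regional gain would have been called clonal.
print(classify_event({"R1": events["regional gain"]["R1"]}))
```

The last line is the single-biopsy trap: with only R1 sampled, an event that is absent from the rest of the tumor is indistinguishable from a truly clonal one.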
And there are many examples, if you look at these plots for a while, where just profiling one biopsy doesn't actually give you the true copy number states of these tumors. And so this is just like the lung cancer study that showed that 70% of subclonal copy number aberrations looked clonal in one region but weren't really clonal when you looked at other regions of the tumor. So let's finish with a nod towards emergent technologies that have a goal of deconvoluting mixtures of genetic lineages so that we can understand population structures. And this is directly achieved by single cell sequencing. And so you can sequence hundreds, to thousands, to hundreds of thousands of single cells by purifying nuclei from cells. And in contrast to sequencing a bulk tumor and then applying fancy algorithms to deconvolute the signal into likely lineages, instead, we are directly measuring the copy number gains and losses in each cell. So for a subclonal event, found in only a subset of cells, you know exactly what events go together. So you know exactly what is happening in each cell. Now, the data is not necessarily cheap or easy to produce or easy to work with. And it's not bias-free. But there are many ongoing efforts that actually work to couple single cell profiling with bulk measurements and gain insight into this clonal heterogeneity. And so as the technology evolves and becomes cheaper, we'll see more and more insight into clonal evolution from these types of experiments. OK, so just to summarize, genome architecture is a fundamental aspect of cancer. Somatic copy number alterations change gene dosage. So drivers will often have concordant gains or losses and increases or decreases in their expression. We can measure these alterations using array-based technologies or next-gen sequencing. The properties of the genomes that we can learn about through copy number profiling actually can highlight therapeutic opportunities if we do the analysis right. 
And so we need to account for all the caveats that I talked about in order to be able to infer the biological signal and infer the structure of the clonal populations within bulk samples. OK, so I'm going to leave you with a set of tools. We'll use TITAN and OncoSNP in the lab, and hopefully you can fill up on coffee so that we can keep going for another few minutes this afternoon.