The name of this module is not "module name." The name of this module is inferring copy number changes from high-density genotyping arrays. I'll just start by giving you a little bit of background on what I do and where I come from. So my lab is in Vancouver at the BC Cancer Agency, and I'm affiliated with the University of British Columbia. My appointment is in the Department of Pathology there, and I have a cross-appointment in the Department of Computer Science. So, not surprisingly, I really operate at the interface of these two fields: cancer genomics, where I study mutational profiles of different cancers as well as tumor evolution, and computer science, on the topics of machine learning, statistical models, algorithms, and data analysis. And really, these two once disparate fields are, of course, overlapping in a major way these days, and it's probably one of the reasons why you're all here. Cancer genomics, well, genomics in general, has become a quantitative science and necessitates the use of sophisticated computational analysis to make sense of the huge amounts of data that are being generated. So this is where I operate. And a couple of faces you may recognize, Jamie and Kerry, are in my group, and they're here somewhere, I think. Where are they? They probably thought it started at two as well. Maybe I gave them misinformation. Maybe someone else gave them misinformation. They're usually very dependable people. In any case, I have a group of about five or six grad students and a few post-doctoral fellows. Robert is a colleague in a different lab, so nice to see you. Good. Okay, so that's enough of the introduction about me. I guess one thing that could be considered special about me is that I spent a few years being a Bohemian jazz musician, having nothing to do with science, and that was fun. Okay, so here we are.
Today we'll go through some major topics. I'm really going to talk about copy number alterations in cancer and some of the biological relevance and impact of that: why we should study copy number changes in tumors. I'll go over a little bit about the measurement technologies that exist today for studying copy number changes, and then go into some detail on high-density genotyping array analysis. And ultimately, we'll talk about three topics: pre-processing and normalization, segmentation, and then how to interpret the segments from this analysis. This is really the guts of it, and this is what we'll talk about in the lab. So I should actually probe: how many people come from a biology background? Almost all of you, okay, good. Any oncology or pathology specialists in the crowd? A couple, okay, pretty good. Any analytical or computational people in the crowd? Okay, excellent, good. So we have a nice mix of people, and I'm probably going to touch on something for everyone in this topic. Hopefully some of you are familiar with a picture like this. This is a spectral karyogram, and it depicts a normal human karyotype. The important point here is that the DNA in all of our cells is neatly organized into 23 pairs of chromosomes: 22 pairs of autosomes and then the two sex chromosomes. So we really have two copies of our genome neatly packed into these chromosomal arrangements. As early as the early 1900s, Boveri suggested that the distribution of chromosomes may actually affect how a cell proliferates. He says that we start with the assumption that the qualities of malignant cells have their origin in a defect that exists within them. So he had some very insightful views very early on: disruption of the genome itself, a misconfiguration of the genome, can lead to abnormal behavior in terms of cell growth and cell proliferation.
So he studied sea urchin cells, and that really led him to believe that an abnormal distribution of chromosomes in the cells can lead to cell proliferation. That gave him the insight that the content of the chromosomes is actually different, and that having extra copies of particular chromosomes is what led to cells that were proliferating in an uncontrolled manner. It wasn't until 1960 that he was actually proven correct, with the discovery of the Philadelphia chromosome in CML. This is an abnormal chromosome resulting from a translocation between two chromosomes that creates a chimeric protein that's an oncogene. So this just gives you the historical context: people have been thinking about how chromosomes are organized and how they lead to cancer for a very, very long time. This is a depiction of six different high-grade serous cancers of the ovary, and you can see that this picture looks very different from the picture I showed you before. There are numerous chromosomes that have multiple copies, here and here. There are some chromosomes that are missing arms. There are chromosomes that are fused together, with two different colors, here. So these genomes are highly disrupted. It's really a property of nearly all cancer genomes that they have completely abnormal karyotypes, and missegregation of chromosomes, with an abnormal distribution of chromosomes, is a hallmark feature of most cancers. So copy number variations, or CNVs, are very simply defined as losses or gains of genetic material. This picture is from a germline perspective, where we have the alleles of the parents coming together to form the genome of a child. Sometimes what can happen is that we get a de novo loss of a particular genetic marker, here. This would be considered a deletion. Sometimes we can get a gain of material, as is depicted here.
And then sometimes we can get deletions and duplications in the following way. You guys are late. I was late too, it's okay. Okay, so that gives you an example, in a very simple schematic diagram, of the types of events we're trying to depict. This is some real data from 1,000 breast cancers, and it shows you just how disrupted breast cancer genomes actually are. What I'm showing here is that red, going up, shows the frequency in the population of 1,000 tumors with which a given locus is amplified, so how often it's duplicated. The chromosomes are just arrayed along the bottom here. This region here is chromosome 1, and gain of chromosome 1q is probably the most frequent event, happening in almost 50% of all breast cancers. The same can be said of chromosome 8q. And then we have very frequent deletion events here on 1p, 8p, and 16p. If we made this same picture for normal genomes, about 98% of the genome would just be flat. There would be no changes at all. So these are somatically acquired changes in cancers, and they show a really severe disruption. This is called the landscape, if you will, of the genome. Any questions so far? So the major reason why we want to study these copy number changes is that they actually disrupt normal cellular behavior. Here are three examples of the types of things I've been talking about, just depicted in a different way. Here's a deletion, where we may have three genes in this region, and this shows a deletion here. Here we have a copy number change that is a duplication, or many copies, of a particular gene. Then you can have whole regions that undergo segmental duplication, like this. And here's what that looks like on a chromosome.
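As a rough illustration of the frequency landscape just described, here is a minimal Python sketch that counts, at each locus, the fraction of tumors called gained or lost. The function name `alteration_frequencies`, the log2-ratio input, and the thresholds of plus or minus 0.3 are assumptions made for the example, not values from the talk or from the 1,000-tumor study.

```python
from typing import List, Tuple


def alteration_frequencies(
    log_ratios: List[List[float]],
    gain_threshold: float = 0.3,   # assumed cutoff for calling a gain
    loss_threshold: float = -0.3,  # assumed cutoff for calling a loss
) -> Tuple[List[float], List[float]]:
    """Per-locus fraction of tumors called gained or lost.

    log_ratios[t][i] is a hypothetical log2 copy ratio of tumor t at locus i.
    """
    n_tumors = len(log_ratios)
    n_loci = len(log_ratios[0])
    gain_freq = []
    loss_freq = []
    for i in range(n_loci):
        # Collect the values of every tumor at this one locus.
        column = [tumor[i] for tumor in log_ratios]
        gain_freq.append(sum(v > gain_threshold for v in column) / n_tumors)
        loss_freq.append(sum(v < loss_threshold for v in column) / n_tumors)
    return gain_freq, loss_freq


# Toy cohort: 4 tumors measured at 3 loci.
cohort = [
    [0.6, 0.0, -0.5],
    [0.5, 0.1, -0.6],
    [0.0, 0.0, 0.0],
    [0.7, -0.1, 0.05],
]
gain_freq, loss_freq = alteration_frequencies(cohort)
print(gain_freq)  # fraction of tumors gained at each locus
print(loss_freq)  # fraction of tumors lost at each locus
```

Plotting the gain fractions upward and the loss fractions downward along the genome gives the kind of picture shown on the slide; a flat line at zero would correspond to the normal genomes.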
So you can imagine that if you have a gene contained in a deletion, let's say a tumor suppressor, then that genetic material is no longer available. The gene doesn't get expressed and doesn't get made into protein, so the function of that tumor suppressor is gone, and whatever repressive function it had is also lost. Amplifications are the opposite. If you have a growth factor, let's say, within an amplification, you get multiple copies, so more copies of that gene are expressed and more copies of the protein get made. This is called the gene dosage effect. And the major focus is to find the CNAs that lead to adverse expression changes in the targeted genes. Ultimately these copy number changes can be used for diagnostics and prognostics. CML is the classic example. BCR-ABL is a translocation. It's not a copy number change, but it's a translocation that basically defines the disease. We can also look for gene-disease associations and targets for therapeutics, which I'll talk about as well. I would say that there are really three main types of CNVs. First, congenital abnormalities. These are CNVs that people are born with, and a classic example is trisomy 21, where an extra copy of chromosome 21 causes Down syndrome. There are other CNVs implicated in mental retardation and autism. Second, somatic alterations, which are acquired over the course of a cell's life and are tissue specific. These are changes that one is not born with, and they are a feature of most, though not all, cancers. Third, benign variations, which are just polymorphisms in the genome that make two people different. These are a classic type of human variation.
It really wasn't appreciated or discovered until about five or six years ago, and it contributes at least as much as single nucleotide polymorphisms to human variation. Yes, the 1 kb is an arbitrary number, but that's basically the standard: anything below that is considered a sequence-level change, which we'll discuss tomorrow. These are grosser changes that are considered structural alterations of the genome, and the convention is to say 1 kb, but that's really an arbitrary threshold. Now, focusing on the cancer types, there are three different types of alterations that I'd like to discuss. The first is segmental aneuploidies, and these are usually large-scale, often whole-chromosome or chromosome-arm-length events. These are giant events that contain all the genes within them. They are much more difficult to interpret, because you can imagine that the gene dosage of every single gene in a chromosome arm would be affected, so narrowing down the functional impact of those large-scale changes is difficult. However, a lot of cancers exhibit focal copy number changes, where the change targets one or just a few genes, and these can be really good indicators of so-called driver events. Have we gone over driver versus passenger? Okay, probably ad nauseam at this point. So these can be good indicators of driver events. And then finally we have rearrangements, where part of one chromosome fuses to a different chromosome. These are called translocations, and often they induce what we call gene fusions. They can create chimeric proteins, where you have a protein that's half of one gene and half of another gene, which does not occur endogenously in normal cells. BCR-ABL, again, is an example of this, and I've done some work in gene fusion discovery as well. So gene fusions are highly sought after.
Another example that was discovered fairly recently: in 2005, Tomlins et al. reported gene fusions in prostate cancer as being about 50% recurrent, and this is implicated as a very important driving event in prostate cancer. With next-gen sequencing technology, more gene fusions are being discovered almost on a daily basis, as you can see in the literature. So these are important. I don't think we're talking about gene fusions in this workshop, but nonetheless it's an important consideration. Okay, so here's an example. Yes? SNPs? SNPs definitely can. That's tomorrow's topic, but definitely. We know that hereditary breast cancers, and certainly other forms of cancer as well, exhibit a very large hereditary component. In breast cancer this has been localized to BRCA1 and BRCA2, where inherited polymorphisms disrupt those genes and confer susceptibility. So I think what you're getting at is the difference between germline polymorphisms and somatic changes. Is that correct? You can think about hereditary cancer and sporadic cancer. Hereditary cancer has a familial component to it, where basic genetics plays a role, and the susceptibility is passed down through hereditary means. In sporadic cancers, on the other hand, there's no hereditary factor, and people speculate that it's environmental causes that initiate the first mutation. But in fact, we're all walking around with a lot of mutations, and if we all lived long enough, we'd all die of cancer, because mutations accumulate over time. So somatic mutations occur stochastically over the course of one's life, and they're quite distinct from germline polymorphisms, which have the genetic component. So here is another example: 20% of breast cancers are affected by this. This is a somatic change on chromosome 17 that affects the ERBB2 locus, which is also called HER2.
And this is probably the single greatest success story in targeted cancer therapy that exists. This gene is a growth factor receptor, and it stimulates growth. In the 90s, between Dennis Slamon and a company called Genentech, a targeted therapy was developed for this particular event. What I'm showing here is the number of copies of this gene on the y-axis, and the locus where it sits on the chromosome on the x-axis. These are regions where you have no change compared to the reference, so you can imagine this is a diploid region and this is a diploid region here, and then you have a massive amplification of HER2. So many, many copies. It gets expressed, and it gets made into protein. Through the development of an antibody, however, this can be inhibited. So everyone gets tested for the abundance or expression of this protein when they're diagnosed with breast cancer, and the 20% who have high levels are immediately prescribed a drug called Herceptin, which was developed by Genentech. What used to be essentially a death sentence for this type of cancer now has a very favorable outcome under Herceptin. So this is targeted therapy that started about 20 years ago. Here's some evidence from real data. Again, this is the same data set of a thousand breast cancers I was talking about. On the x-axis we plot the genome copy level; you can think of this as the number of copies in the genome. We had matching gene expression data from the transcriptome, so we could correlate whether changes in the genome actually translate into changes in the transcriptome. You can see for ERBB2 that these are the patients that have amplifications, in red. There's a very nice correlation there. This gene is next door to it as well. This is GRB7. People have speculated that GRB7 is also an oncogene.
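The copy-number-versus-expression comparison described above can be sketched with a plain Pearson correlation. This is only an illustrative sketch: the helper name `pearson` and the toy values are invented for the example and are not data from the breast cancer cohort, and the analysis in the study may well have used a different statistic.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for one gene across seven tumors:
# estimated DNA copies and a log-scale expression measurement.
copy_number = [2, 2, 2, 3, 6, 8, 10]
expression = [5.1, 4.9, 5.3, 6.0, 8.2, 9.1, 9.8]

r = pearson(copy_number, expression)
print(f"copy number vs expression: r = {r:.2f}")
```

A coefficient close to 1 for a gene inside a focal amplification is the pattern the slide shows for the amplified patients; a gene whose expression does not track its copy number would give a value near 0.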
And you can see that it is affected dramatically by the copy number change as well. So the copy number change here often spans more than just HER2. It also includes neighboring genes, and usually contains anywhere between three and eight genes around it. So the surrounding genes are often affected as well. Here's another locus that's commonly affected in breast cancer. This is the CCND1 locus, and you can see a very similar trend. So copy number changes, and especially the focal ones, can and usually do affect gene expression, and that consequently affects how much protein is made and the cellular function. So those are amplifications. Here's an example of a deletion. I hope you can see this. This is the gene PTEN. It's a tumor suppressor gene. What's shown here is just a part of the chromosome. Sorry, the exact location isn't marked here, but it's basically a zoomed-in region. The dotted lines represent the boundaries of PTEN, and the bright green represents a two-copy deletion, where the gene is completely obliterated. It's no longer there. We think that PTEN loss is a very early initiating event in these tumors. It's not a huge percentage of breast cancer patients that exhibit PTEN deletion, but nonetheless the gene is targeted by copy number changes that inactivate the protein. So we've seen one example of an oncogene, ERBB2, that gets amplified, and now we're seeing an example of a tumor suppressor that gets deleted. Yeah. So what is the solid definition of copy number variation? What's the difference between copy number variation and insertions and deletions? Just going back to that arbitrary 1 kb distinction: it's more or fewer copies of a particular segment of the genome that is greater than 1 kb. As you mentioned, this definition is not something fixed, and it was a definition based on the resolution of the... Correct. That's absolutely correct.
As I said, it's an arbitrary distinction. Ultimately an insertion or a deletion is a copy number variation, just at base pair resolution. So you could make the distinction that way. Maybe it's more appropriate to say that these are changes larger than base pair resolution. Okay, so here is a list of genes that have been found over the years to be affected by amplifications and deletions. ERBB2, which I mentioned; EGFR, epidermal growth factor receptor, which is a close cousin of ERBB2; the MYC oncogene; PI3 kinase; IGF1R; FGFR1 and FGFR2; and KRAS. Deletions have been recurrently found in many cancers: RB1, which is the classic retinoblastoma gene, yes? I was just wondering, how much is known about the structure of amplification? In what sense? Well, okay, for a deletion it's not too hard to imagine a section that's missing and two ends that are fused together. So you can imagine that during replication there's a bit of a stutter step. A piece of the chromosome will get stuttered, essentially replicated like this, as it's being copied from one cell to another. Then there's homologous end joining and non-homologous end joining, and those different processes lead to this as well. So there's quite a bit known. I probably won't elaborate much more than that. So these proteins here are known to be affected by deletions. And I'd encourage you to read these papers if you haven't already. These are recent high-resolution interrogations of somatic copy number landscapes. This paper here, from Bignell et al., looked at the COSMIC resource of about 700 cell lines from various cancers, profiled them all with Affymetrix SNP 6.0 arrays, defined recurrent changes across all these cell lines, and identified some novel tumor suppressors.
This paper here, Beroukhim et al., in Nature, from Matthew Meyerson's group, looked at more than 3,000 tumors on a slightly lower-resolution platform, again across all tumor types, and identified some interesting patterns across tumor types. And then finally we have some tumor-specific interrogations. The TCGA glioblastoma dataset, which you may have visited already, and the recent high-grade serous ovarian paper each discuss at least 300 tumors from these tumor types profiled with high-resolution SNP genotyping arrays. And I hope that next time I see you in this course I'll be able to talk about these 1,000 breast cancer tumors, which are very close to being published. Okay, so look at this section of this table here. These are genes that undergo somatic copy number changes in cancers and that, for the most part, have therapies that can be administered to inhibit the proteins. Something happened. I may have hit the switch there. Oh, there we go. That was easy. Don't touch. Okay. You can put a glass barrier here. Yeah, close. Is there a door? There are doors. Okay. There we go. Okay, excellent. That won't happen again. Is the microphone off? The microphone's off. Here we go. I can speak loudly. Just let me know if you can't hear me. So here is the example I was talking about. Trastuzumab is Herceptin. It's the scientific name for Herceptin. And there are other drugs, such as erlotinib, which targets EGFR, and PI3 kinase inhibitors, which target PIK3CA. So these are, in fact, clinically important and clinically actionable events in cancer. Ultimately, the reason we're all working hard to discover these changes is so that inhibitors and therapies can be designed around them. And the list, as you can see, is really quite small. When you look at the landscape of breast cancer, a huge part of the genome is affected by these changes.
And we've got a list of genes we can actually target that you can almost count on one hand. So that's why a lot of us are in this area, trying to discover these changes. Okay, so have we talked about genotypes, Francis? Not really, okay. So a genotype is best described like this. You get one copy of the genome from your mother and one copy from your father, and they can differ. Just look at this here, okay? Let's say we have a single nucleotide polymorphism at a given locus. Its genotype is AA if the major allele from the mother and the major allele from the father are present in the child. It's the same for minor alleles: that's BB. And if it's a heterozygous position, then it's AB, okay? So this is the standard state. Now you can imagine that if you get a deletion of one copy of your genome, then you don't have two copies to talk about anymore. You just have one, so it can be either A or B. Then let's look at a single-copy duplication, where all of a sudden we have three copies... Oh, there we go. It's back on. Three copies of the genome, and they can take four possible genotypes, okay? So we have AAA, AAB, ABB, and BBB. Does this make sense? Yeah, okay. Excellent. And so on: as you increase the number of copies, the number of possible genotypes changes as well. Okay? What's particularly important here is this line, copy-neutral loss of heterozygosity. You can imagine that there may be a deleterious allele that's essentially protected by a normal allele. If that normal allele is gone and only the deleterious allele is expressed, then we might have some functional effect. Similarly here, deletion of one copy is also loss of heterozygosity, because you only have one copy left. And this is exactly what led to the discovery of the retinoblastoma gene way back when.
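The genotype counting above can be written down directly: with n copies of a locus and two alleles there are n + 1 unordered genotypes. The sketch below is a minimal illustration; the function names `possible_genotypes` and `is_heterozygous` are made up for this example.

```python
from typing import List


def possible_genotypes(copies: int) -> List[str]:
    """All unordered A/B genotypes for a locus present in `copies` copies."""
    if copies < 1:
        return []  # homozygous deletion: no allele is left to genotype
    # b counts how many of the copies carry the B allele (0 .. copies).
    return ["A" * (copies - b) + "B" * b for b in range(copies + 1)]


def is_heterozygous(genotype: str) -> bool:
    """True when both alleles are still present at the locus."""
    return "A" in genotype and "B" in genotype


for n in range(1, 5):
    print(n, possible_genotypes(n))
```

With one copy the only genotypes are A and B, so no heterozygous state exists, which is why a single-copy deletion implies loss of heterozygosity. With two copies, a region where only AA and BB are ever observed is the copy-neutral case.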
Knudson noticed that in a very rare form of cancer there's actually a two-event system, where one allele has a mutation in it and the other one is subsequently lost through somatic mutation, and that leaves the protein completely gone. So loss of heterozygosity analysis is important and can, in the context of mutations, give clues as to which mutations might fall under this two-hit idea. Yes, that's right. That's exactly right. Well, it's not loss of heterozygosity if you didn't have heterozygosity to begin with. Okay, so here's an example of what this looks like in action. Here we have a picture of an array. Each black dot represents a probe on the array, and the probes are ordered by their physical position on the chromosome. We can see a few different events here. Here's a copy number gain. You can see that it induces four copies, and they can be detected by the different genotypes that result. So here we have examples of all four, sorry, all five resulting genotypes. There's an example here where we have no evident change in the genome at the copy number level. However, there's no evidence of heterozygosity here either. This is what we call copy-neutral loss of heterozygosity. And then this region is the normal state, where at different loci you should see AA, AB, or BB. So this is the normal state, and these two regions are the abnormal states. Okay, so why should we model alleles in cancer? I've touched on this a little bit before, but this is a picture from a very nice review paper, and I encourage you to read it. Knudson is actually the second author on this paper, so it's really a follow-up to his two-hit hypothesis 40 years later, and he's refined it quite a bit. So this is quite an interesting read. It's a review paper. Essentially we have two alleles, and there are really three different paradigms.
In the two-hit paradigm, if we have two alleles, then we have no phenotype. If we lose one, then we have some cancer susceptibility. And if we lose two, then that initiates tumorigenesis, and this is what happens in retinoblastoma. In this paradigm here, we have what's called haploinsufficiency. That means that losing one allele can already be enough to initiate cancer, and as we lose more of the protein, in terms of expression, the severity of the disease increases. A particular protein that's subject to haploinsufficiency is p53. Then we go on to what he calls quasi-sufficiency or obligate haploinsufficiency. This is where, as we start to lose copies and expression of these alleles, the severity of the disease increases until it reaches a point where some of the protein has to be around in order for the cell to survive. That's why it's called obligate haploinsufficiency: you have to have a little bit of the protein remaining in order for those cells to function properly. Otherwise even the tumor cells will die. So measuring alleles can reveal the genes that follow these different patterns in terms of heterozygosity. Yes? In the case of PTEN, you showed earlier that in breast cancer it's basically deleted. Yeah. So is there an example of this? So what's your name? Fias. So Fias was just mentioning that I showed some pictures of homozygous deletions of PTEN, which effectively suggest that the protein is no longer there, and that really conflicts with this idea of obligate haploinsufficiency. He does discuss certain examples of this phenomenon in different cancers. And it doesn't have to be true everywhere. The characteristics can differ between cancers.
So this may only be true in certain types of cancers and not in others. Okay, so let's spend a bit of time on measurement technologies. Essentially we have increasing resolution here. We start with fluorescence in situ hybridization. This is a very elegant way of targeting just one region with probes: fluorescently labeled probes work their way into the nucleus, and we can measure copies of that probe within actual cells using microscopy and fluorescence. What's being shown here is that each blue blob is a nucleus. The control probe is green, and you can see that a lot of cells have two green blobs. The actual test probe is red, and this is an insulin receptor amplicon that we found in a particular breast cancer. You can see that the number of red dots greatly exceeds the number of green dots. This is a low-throughput but highly reliable validation technique for looking for copy number changes, and it is often used in the clinic to detect HER2. Okay, so moving on from this very low-throughput but clinically useful technology, we go to what's called array comparative genomic hybridization, or array CGH. This technology emerged in the early 2000s and can look at between about 30,000 and 100,000 loci in the genome in parallel. I'll go over what this does in a bit more detail, but essentially you can design probes, say 100,000 probes, and array them on a glass slide. Then, in parallel, you interrogate the number of copies at all of those probes. You know where they sit in the genome, and so you can look at patterns across the genome. In the late 2000s, very high density genotyping arrays started to emerge, really as a consequence of GWAS studies for germline heritable diseases.
That's what really drove the technology, but they're actually quite useful for cancer. These genotyping arrays can contain up to about 2 million probes, and they query two alleles at each locus as well, so that's a very big distinction from array CGH. And finally we get to what I call 3G technology. These aren't cell phones, these are genomes. We get to base pair resolution through sequencing. We now have the ability, which you probably looked at this morning, to find these copy number variations at base pair resolution, which is quite extraordinary. At this point this technology still costs about 10 to 15 times, or maybe 20 times, what a high-density genotyping array costs, so it's not used in large cohort studies the way SNP arrays have been. Even today you see studies in high-profile journals of a thousand or several hundred tumors with SNP genotyping arrays. We're not there yet with sequencing technology; we're still in the tens. Okay, so a little bit on how this works. As I mentioned, we design probes where we know their location in the genome, and we put them on some sort of surface, a glass slide, so that DNA fragments can be washed over them. The readout is through image processing, very much in the same way that a gene expression microarray works. I'm sure Paul went over that. Is that a yes? Can you confirm that?
Okay, so this is very similar in nature, but the difference is that the dynamic range of copies in the genome is considerably smaller than the dynamic range of copies in the transcriptome. The other thing is that, since these aneuploidies happen over large chromosomal regions, we can expect adjacent probes to exhibit similar behavior, because you're interrogating many, many probes within a single biological event. So the analytical tools for copy number analysis are necessarily different from those for gene expression analysis. What I'm showing here is the result for one chromosome. This purple line represents no change between the sample and either the matched normal or a pooled normal reference, and deviations from that line can indicate copy number changes. So here's a chromosome-arm-level deletion, here you have some amplifications, and here's a very focal deletion that looks like this if you zoom in. Again, each black dot represents the relative copy number of that part of the genome. It's a noisy measurement, but a fairly representative measurement of the copy number. And here you have a segmental deletion, which shows up as negative values in log space. Events like this are really what we're trying to find. This is very low tech by today's standards, but it schematically represents the same concept. Okay. Any questions so far? Good. Excellent. Okay, so moving on to high density genotyping arrays. What this really comes down to is the measurement of two alleles. These are specific loci in the genome that we know are variant in the human population. They will have been discovered by the HapMap project or the 1000 Genomes project, or projects like that, which are targeted at discovering loci in the genome that are frequently variant in the general human population.
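To make the log-space picture concrete, here is a small Python sketch that turns tumor and reference intensities into log2 ratios and then smooths them with a running median, which leans on the point above that adjacent probes within one event behave alike. The running median is only a simple stand-in chosen for illustration; it is not the segmentation method used in the lab, and the function names, window size, and intensities are all assumptions.

```python
import math
from statistics import median
from typing import List


def log2_ratios(tumor: List[float], reference: List[float]) -> List[float]:
    """Per-probe log2(tumor / reference); 0 means no change from reference."""
    return [math.log2(t / r) for t, r in zip(tumor, reference)]


def running_median(values: List[float], half_window: int = 2) -> List[float]:
    """Median over each probe and its neighbours, in chromosome order."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Hypothetical intensities for 11 probes ordered along one chromosome.
# Probes 4 to 8 sit inside a single-copy loss (about half the signal).
reference = [100.0] * 11
tumor = [100.0, 104.0, 97.0, 101.0, 52.0, 48.0, 50.0, 55.0, 47.0, 99.0, 102.0]

raw = log2_ratios(tumor, reference)
smoothed = running_median(raw)
for position, (r, s) in enumerate(zip(raw, smoothed)):
    print(f"probe {position:2d}  raw {r:+.2f}  smoothed {s:+.2f}")
```

In this toy data the probes inside the loss sit near a log2 ratio of -1 (half the reference signal) and the flanking probes sit near 0, which is the pattern of negative values in log space described for the segmental deletion.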
And there are about a million of these loci across the genome that are measured, and the major and minor alleles, the major being the most frequent and the minor being the less frequent allele, are measured essentially separately and independently. This is really the key distinction between this and array CGH. With array CGH you get a single measurement per locus, and with SNP genotyping arrays you get two measurements per locus. Okay, does that make sense? Yes, that's correct. Because they're actually measured independently, they're at different locations on the array. Okay, so as I mentioned, the original motivation for the design of these arrays was in the post-human-genome era, when genome-wide association studies were widely carried out, and really this is for associating inherited SNPs with human disease. If you track the literature from about 2005 through to 2009, there are probably several thousand papers on genome-wide association studies. They started to emerge in Nature and then Nature Genetics, and now they're probably more spread out across more specialized journals. Okay, so that was really the original motivation, that's what drove the development of the technology, and the major vendors during this time were Illumina and Affymetrix. Very conveniently, though, in cancer this really allows for inference of segmental aneuploidies, focal copy number changes as we discussed, and loss of heterozygosity and allele-specific copy number changes, as I mentioned earlier. So although these arrays were developed for studying hereditary diseases in medical applications, the application to cancer has been quite fortuitous and has led to a lot of insights into the architecture of cancer genomes. Okay, so that being said, there are considerable challenges in statistical inference of biological events from cancer samples.
And I'm not sure how much has been discussed, but I don't think it can be stressed enough that studying a cancer genome is very, very different from studying a normal human sample. And these are the major reasons. First of all, it's almost impossible when extracting DNA from a cancer sample to get an entirely pure set of cells, unless you're working with cell lines. Mixed in with the cancer cells are normal cells that come from vasculature. There are lymphocytes in there, there's stromal contamination, and so necessarily what you're seeing whenever you do a genomics experiment with a cancer sample is a mixture of the normal genome and the cancer genomes. And I say genomes because in many, many different cancers, especially epithelial malignancies, there's what we call intratumoral heterogeneity. A cancer sample is not represented by one genome, it's represented by many, many genomes that have diverged over time in an evolutionary, Darwinian process. And so there are clonal populations of cells with different genomes. So not only are we seeing a mixture of normal and tumor cells, we're seeing a mixture of normal and several different tumor genomes. And really most experimental designs don't deal with this at all and consist of a single sample from the tumor. So just be aware that it's a great leap of faith, when you're inferring something, to assume that you're actually capturing all of the variation that exists within the cancer cells that you're trying to study. Okay. I'll mention it again tomorrow just in case. So the other thing that I want to mention, and we've discussed this already, is what we're really after in many contexts, because to study hereditary populations really requires vast numbers. You need large cohorts to actually get the statistics right.
And so often when we're studying tumor samples, which are often hard to come by, we're really interested in somatic aberrations, and so we have to think about which of the variations that we see are somatic and which ones are germline. We have to distinguish the ones that are tumor-cell specific from those that are just normal human variation that we're born with. That's a very important point that is unique to the cancer setting and doesn't come up in hereditary studies. Finally, many tumor genomes are polyploid, or at least are polyploid at certain chromosomes, so they have more than two copies of the whole structure, and that obviously complicates the interpretation of alleles. Okay. So there are three major points that I really want to get across in this lecture, and one is that the assumptions in most statistical software packages ignore at least one of these issues. At least one, and most often all. Okay. Software and statistical models designed for analysis of normal genomes do not generalize to the cancer setting. Okay. So pulling something off the shelf that's designed for the 1000 Genomes Project does not translate into an effective piece of analytical software for cancer genomes. And so specialized tools for cancer are needed. All right. So with that in mind, there's a very nice review paper that's come out recently that I would encourage you to read, which talks about these statistical considerations, some of them, not all of them, in a very nice, easy-to-read way. There's a bit of math and a bit of notation in there, but I highly encourage you to read this paper in the context of how statistical models have been developed for cancer with genotyping arrays. Okay. And we'll touch on some of this, but really these are quite advanced topics for the non-analytically inclined to grasp in actual hands-on work. Okay. So let's just look at a general workflow for how to process high-density genotyping array data in cancer.
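As a rough illustration of why normal contamination and copy number state matter for interpretation, here is a minimal Python sketch. It assumes a simple linear mixing model in which a fraction `purity` of the cells are tumor cells and the rest are normal diploid cells. The function names, the parameters, and the mixing model itself are illustrative assumptions, not something taken from the lecture slides or from any particular package.

```python
import math


def expected_log2_ratio(tumor_cn, purity, normal_cn=2):
    """Expected log2 ratio versus a diploid reference under linear mixing.

    Assumption: the measured signal is proportional to the average copy
    number across the tumor and normal cells in the sample.
    """
    mixed_cn = purity * tumor_cn + (1.0 - purity) * normal_cn
    if mixed_cn <= 0:
        return float("-inf")  # pure homozygous deletion: no signal left
    return math.log2(mixed_cn / normal_cn)


def expected_baf(tumor_b, tumor_cn, purity, normal_b=1, normal_cn=2):
    """Expected B allele fraction at a germline-heterozygous SNP.

    tumor_b is the number of B-allele copies in the tumor cells.
    """
    b_copies = purity * tumor_b + (1.0 - purity) * normal_b
    total = purity * tumor_cn + (1.0 - purity) * normal_cn
    if total == 0:
        return float("nan")  # no DNA at this locus, fraction undefined
    return b_copies / total


if __name__ == "__main__":
    for purity in (1.0, 0.7, 0.4):
        loss = expected_log2_ratio(1, purity)   # single-copy loss
        gain = expected_log2_ratio(3, purity)   # single-copy gain
        loh_baf = expected_baf(0, 1, purity)    # B allele deleted in tumor
        print(f"purity={purity:.1f} loss={loss:+.2f} gain={gain:+.2f} BAF={loh_baf:.2f}")
```

Under this assumed model, the same single-copy loss gives a weaker log ratio and a less extreme B allele fraction as purity drops, which is one reason thresholds tuned on pure, diploid samples tend not to carry over to tumor samples.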
So coming off the machine, either Illumina or Affy or whatever, you get some sort of file. For Affy it's a CEL file. Yes. Well, it depends. Yeah. So that's an excellent question. I'll touch on that a little bit. They are becoming more and more known. Projects like the 1000 Genomes Project and also the HapMap project have revealed that quite a large percentage of the genome, far more than was originally thought, is under the influence of copy number variation. That is just germline. It's just what makes you and me different from each other. These are naturally occurring regions of human variation. So that raises the question in the medical community: well, some of these must be susceptibility loci as well. So now what's being done here in Toronto and all around the world are large-scale studies to try to look at hereditary factors of particular diseases associated with copy number variations. So to answer your question, there is a large amount of data resources and databases that are growing. Often when we look at a somatic change, we can put that into the context of what's known in the databases of normal human variation. So say you see a signal in a particular gene and get all excited, and then you go look it up in the database and it's been reported in 30% of HapMap. In that case that's likely not a somatic mutation but rather something that's just normal human variation. So whenever I hear about a copy number variation study, I should expect that it's done on the germline DNA? Well, it depends on the context. Is it a cancer study? Then no. If it's a susceptibility study, then yes, it should be done on germline. If you're looking for hereditary factors, then it certainly should be done on germline DNA. If you're looking for somatic changes, to define acquired changes that exist only in tumor cells, it necessarily has to be from the tumor genome. Okay? Most of the studies that I've referenced so far are tumor specific. Isn't it better in most cases to look at both?
Absolutely. So that's one of the experimental designs, if one can afford it, and this is by far the more desirable experimental design: to do paired analysis. So we extract blood from the patient, or some sort of germline DNA, not in leukemic patients obviously, but if you're looking at epithelial cancers you can get a sample of blood or skin or something like that, hybridize that DNA, and compare it to the tumor from the same patient. That's called a matched tumor-normal experimental design, and certainly in sequencing it's necessary. In the sequencing context, which I'll talk about tomorrow, it's the only experimental design that I would advocate. Okay? All right, so back to the workflow. We have our file that comes off the machine, and then, very similar to gene expression microarrays, we really have to go through the step of preprocessing and normalization in order to deconvolute machine noise from biological signal. And then the analysis forks at this point. With genotyping arrays you have the ability to look at just the B allele, or minor allele, fraction and then look for loss of heterozygosity, or we can do total copy number extraction and then look for copy number alterations. And then finally these two things can come together, and we can start to do interpretation based on genes or pathways or clinical correlations, which I think we're going to do after the session tomorrow, and then Friday we'll get into clinical correlations. So let's look at the specs of the Affymetrix SNP 6. These are essentially 25-mer oligonucleotide probes, and there are 900,000 SNP probes. These are probes whereby two alleles are measured, and they are 25-mers that differ at the center base, at the polymorphic site. And then we have 900,000 CNV probes, and these are not measuring two different alleles, they're just measuring the total copy number at that location.
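The two branches of the workflow, total copy number and B allele fraction, can be sketched from a pair of normalized allele intensities. This is a minimal Python illustration, assuming the intensities are already preprocessed and using the common convention of scaling the tumor-to-reference ratio by 2. The function names and the example numbers are hypothetical.

```python
import math


def total_copy_number(y_a, y_b, y_ref, scale=2.0):
    """Relative total copy number: scale * (yA + yB) / reference total."""
    return scale * (y_a + y_b) / y_ref


def log2_ratio(y_a, y_b, y_ref):
    """Log2 of the sample's total intensity over the reference total."""
    return math.log2((y_a + y_b) / y_ref)


def b_allele_fraction(y_a, y_b):
    """Fraction of the total signal that comes from allele B."""
    return y_b / (y_a + y_b)


if __name__ == "__main__":
    # Hypothetical normalized intensities: (allele A, allele B, reference total).
    probes = [
        (1000.0, 1000.0, 2000.0),  # balanced heterozygous, two copies
        (2000.0, 1000.0, 2000.0),  # one extra copy of allele A
        (1000.0, 0.0, 2000.0),     # one copy lost, allele B gone
    ]
    for y_a, y_b, y_ref in probes:
        cn = total_copy_number(y_a, y_b, y_ref)
        lr = log2_ratio(y_a, y_b, y_ref)
        baf = b_allele_fraction(y_a, y_b)
        print(f"CN={cn:.1f} log2={lr:+.2f} BAF={baf:.2f}")
```

A CNV probe, which has no allele pair, would only feed the total copy number branch. The B allele fraction is defined only for SNP probes.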
Essentially the way it works is that you measure hybridization intensities, similar to gene expression microarrays, and you can learn a bit more about this at the chip definition file site there. So as I mentioned, normalization is required to remove platform-induced artifacts. My method of choice, and I think it's the best-developed piece of software, is called aroma.affymetrix. There are others, but this is the one that I advocate. Unfortunately it can be a little bit difficult to use for naive users, and that's why we're doing the lab this afternoon. You're going to get your hands dirty with this suite of tools. So this vignette here describes how to use what's called CRMA version 2, and really this is the suite of functions that does a lot of the preprocessing for this, and I'll talk about some of them in detail. In my opinion it really outperforms the commercial software, and it has the great benefit of being transparent and open source, so it has a user community and you can look at the source code, and basically you can output allele-specific and total copy number real-valued data. The other thing is that it was really developed in Terry Speed's lab, and he's one of the international academic leaders in dealing with microarrays and has worked with Affymetrix as well. So there's really quite a lot going for this package that makes me recommend it. Okay, so let's talk about something called allelic crosstalk. Allelic crosstalk occurs when the probe for the major allele mishybridizes to DNA from the minor allele, and vice versa.
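To make the idea of crosstalk concrete, here is a minimal Python sketch of one way such a correction can work. It assumes a simple linear model in which each probe picks up a fixed fraction of the other allele's signal plus a shared background offset, and it assumes those parameters are already known. This is an illustrative simplification, not the actual CRMA v2 procedure, which estimates its calibration from the data. All names and numbers are hypothetical.

```python
def apply_crosstalk(x_a, x_b, offset, leak_a_to_b, leak_b_to_a):
    """Simulate observed intensities from crosstalk-free signals.

    Assumed model (illustrative only):
        y_a = offset + x_a + leak_b_to_a * x_b
        y_b = offset + x_b + leak_a_to_b * x_a
    """
    y_a = offset + x_a + leak_b_to_a * x_b
    y_b = offset + x_b + leak_a_to_b * x_a
    return y_a, y_b


def correct_crosstalk(y_a, y_b, offset, leak_a_to_b, leak_b_to_a):
    """Invert the assumed 2x2 linear model to recover (x_a, x_b)."""
    a = y_a - offset
    b = y_b - offset
    det = 1.0 - leak_a_to_b * leak_b_to_a
    if det == 0:
        raise ValueError("crosstalk matrix is singular")
    x_a = (a - leak_b_to_a * b) / det
    x_b = (b - leak_a_to_b * a) / det
    return x_a, x_b


if __name__ == "__main__":
    # A homozygous AA SNP: all true signal is on allele A.
    observed = apply_crosstalk(1000.0, 0.0, offset=100.0,
                               leak_a_to_b=0.2, leak_b_to_a=0.15)
    print("observed:", observed)  # allele B probe shows signal it should not
    corrected = correct_crosstalk(*observed, offset=100.0,
                                  leak_a_to_b=0.2, leak_b_to_a=0.15)
    print("corrected:", corrected)
```

In this toy model, the homozygous points sit off the axes before correction and return to them afterwards.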
Okay, so this is where that single nucleotide polymorphism distinction isn't enough, and you get a piece of DNA going to the wrong probe. What's plotted here are the intensities of the two different alleles, and really what you should see is a line kind of going across here that represents homozygous regions, a line going across here that represents the other homozygous regions, and then a diagonal that represents the heterozygous regions. You can see that this is just basically the data as it feeds in, and it's not quite vertical, and in this case it's not horizontal. So this paper describes how to correct for this so that the signal is more accurate, more representative of what it should be. There are other types of artifacts, such as sequence-specific artifacts. If you have high GC content, the hybridization affinities of those probes are going to be different, and so that needs to be corrected for. And then there's a restriction fragment length step in this process, and that also induces some artifacts. The aroma CRMA package basically adjusts for all of these factors and accounts for them. So this is the representation of the intensities before normalization and after. You can see here there's a sort of distinct bimodal distribution, and this is really due to allelic crosstalk, and moreover the means of all these modes are not quite comparable. So if you want to compare arrays between individuals, then they really need to be normalized, and this is what the picture looks like afterwards. This is essentially a fancy histogram. It's a density plot, and it shows the shape of the total data set, and you really want the shape of that density plot to be consistent across the different individuals that you have in order to compare them properly. You saw something similar to this in the gene expression section. So once we've pre-processed, we can go on to inference. So what we
want to do is look for these three things: total copy number, the CNAs, loss of heterozygosity, and then allele-specific copy number changes as well. Just by way of notation, we can represent each one of these with some notation. So y sub j A is the intensity for allele A at position j. The intensity comes from image analysis, j is just the genomic position, and A is the allele, and we get one for the other allele as well. The total intensity is just the sum of those, and that's indicated by y sub j. And then this one's important: this is the total copy number at that position. What we usually do is compare the total intensity of the two alleles to the total intensity from some reference, and that can be a matched normal, a pooled reference, or the average measurement in a population of tumors. So y R is this reference here, and we multiply the ratio by some constant factor, usually 2, and we'll do that in the lab. Then often people take the log of that, so that deletions are comparable in magnitude to amplifications. And yeah, so the design of the array usually takes the two most frequent nucleotides. The simple answer is it's binary in terms of major and minor, and often there's actually a restriction on the number of nucleotides that can be varying in a population, just due to evolutionary constraints. So I think that's why the convention of two has been adopted, because that's been the most frequent number of bases seen at a given polymorphic site. It can be more, but most often it's two. Most often it's one, actually, but then at one in ten thousand sites you get two, and then it tails off after that. Okay. So how do we go from signal to copy number? Imagine for a second that these dots are not colored. What we want to try to distinguish are these events, where the copy number changes. You can
see that it's not as straightforward as just drawing a line and saying all points above a certain line are going to be copy number changes, because there's spatial correlation, and the points next to each other often need to go together in the same copy number call, if you will. So we have these continuous signals that are really quite noisy, and what we want to do is try to make those points discrete. So this whole thing here is one event, and all of these points can be represented with one single event, and this would be almost like a chromosome arm amplification of this region. Here's the banding pattern, here's the centromere, and so this is the chromosome arm-level event. We have a little neutral region here, and then we've got some copy number amplifications here, and then the green here represents deletions. Okay, now in this case we do have a matched normal, and so here's an example of something you need to watch out for. Here's an event that's very obvious. It's a huge deletion. You might jump up and down and say, look, I found a tumor suppressor, it's all gone, it's very focal, it's great. But then you look in the normal and say, okay, it's there as well. That means that it's just part of that normal human variation and not to be considered pathological in the context of cancer. It could be in a hereditary study, but for the most part in somatic studies what we're more after is these ones. You can see in comparison to the normal here that these changes are quite evident in the tumor and just not there at all in the normal. Okay, so in high-density genotyping arrays we usually talk about six different states. You can talk about more states than that, but generally in practice what we've found is that these six states represent most of the variation. So we have a neutral state, where there is no change compared to the reference, we have a hemizygous deletion, which is a single
copy loss, a homozygous deletion, which is a two-copy loss, and then three levels of amplification. This is interpreted with a hidden Markov model that adjusts according to the actual sample. What we'll do in the lab is actually something much simpler than that, which is to arbitrarily draw cutoffs on the segment means. The segment mean is basically the mean of all of the data points that are within the same segment. So you can look at that and say, okay, if that is within some number of standard deviations or some number of median absolute deviations, then I'm going to call it neutral, but if it goes beyond that acceptable neutral band, then I'm going to call it a gain, and if it goes way beyond it, I'll call it a high-level amplification. That's a bit of an arbitrary thresholding, but it's actually often what people do. Yeah. The steps between gain, amplification, and high level, is there anything that you would suggest for all of that? Yeah, so it's somewhat arbitrary. Usually large-scale chromosome arm events are single copy, and those are readily apparent, and so we call those gains. If it's anything more than that, then we'll call it an amplification, and if it's super high and focal, then we'll call it a high-level amplification. Yeah, it's a bit of an arbitrary distinction, so that's why model-based hidden Markov modeling can help interpret that, because there's a principled way of going through it. All right, so I have 10 minutes to the break. Okay, let me keep going, there's a lot more to come. Okay, so that was an example of total copy number, and here's what the B allele fraction looks like. Here we have the same tumor genome that's been looked at from a total copy number perspective. You can see it has a little amplification here. Again, you remember that in a neutral position you get three possible genotypes, and that's quite evident here. You've got the
nice three bands, and you can call these AA, AB, and BB. Here we have our little one-copy gain, which has additional genotypes, and then this one is really quite interesting in the sense that it is copy neutral, in that it has two copies here, okay, but there's a distinct loss of heterozygosity. That middle band that represents heterozygosity is gone, and there's a skewing away from heterozygosity. So this is what loss of heterozygosity looks like. This is copy-neutral loss of heterozygosity. Here you have a distinct deletion in this region, and that also eliminates one allele, so you get loss of heterozygosity here. So: copy-neutral loss of heterozygosity, and deletion-induced loss of heterozygosity. Yes. So this is the whole chromosome, and you can't have a whole chromosome be homozygous to start with, that's unlikely. Of course we're looking at the places where we'd expect there to be heterozygosity in the normal, and then it's lost in the tumor. You do have to have heterozygosity to start with in order to lose it, correct. Yeah. Not necessarily. It's really the intensity of the major and minor alleles. It's actually difficult to deconvolute which parent it comes from, because in one case the major allele can come from the mother, and in a different case, at a different SNP, it may not. So when you have LOH but you don't have a copy number change, does that mean there's a duplication of that?
So it means that there would have had to have been a loss followed by a duplication, or some missegregation in mitosis whereby both copies from one parent came through, versus proper segregation. I guess that happens often in tumors? It does, because they have abnormal replication machinery. That's usually what initiates tumorigenesis in the first place. In fact DNA repair is what's often compromised in tumors. This is happening in our cells a lot, but we have the repair machinery to fix things, and tumors have lost that. Okay, so very quickly now, how do you find copy number changes? There are really a few main algorithmic approaches. One is smoothing, and this is where we try to just fit a curve to the data and then post-process that curve. There's noise in the data, and we want to try to represent all these little dots that are all over the place with some sort of mean. Regression techniques are often used for this, or wavelet-type smoothing, and so you can get a curve that represents the data. The real disadvantage of this is that you end up with a curve, but you still don't have discrete biology. You still have to post-process that to interpret what you have. Another approach is segmentation. This is a very popular approach, and it basically segments the data into an arbitrary number of discrete levels. This works quite well to find what we call breakpoints. As I've indicated here, these arrows represent breakpoints, so here's one segment, here's another segment, here's another segment, and here's another segment. However, because it's an arbitrary number of levels, we still have to post-process, and this gets to the question that came up earlier, to actually infer biology and make a call. Then we have what we call independent and identically distributed mixture models, which have been proposed in the literature, and essentially what this does is it treats each probe independently
but fits a principled sort of mixture model to the data. So what it would do is call each probe independently, basically based on where it sits in the spectrum. What you can see is that it's very susceptible to these single-point outliers. These are likely just machine noise and don't carry biological signal. And moreover, here you'd expect that this whole region should be called together, but again, because it kind of amounts to thresholding, it would miscall these ones. So this doesn't include spatial correlation, and that is a serious disadvantage. Then finally we get to Markov models, which in recent years have been increasingly adopted in this type of analysis. They have the advantage of actually classifying the probes into discrete states, so we go from continuous signal to discrete biology, and they can model spatial correlation. The output is more or less directly interpretable. However, it does require parametric modeling, so you have to set parameters, and that can be a bit of a tricky and arbitrary process as well. So in the lab we're actually going to use this method for segmentation, and then a post-processing method to call events. There are some examples of review papers that go over these different methods. I have recently started working on this, and there's something strange in this data that I'm not quite used to. Previously, with microarrays, gene expression microarrays I mean, and SNP arrays, the definition of each probe was fixed, but here we are looking for regions of gain or deletion, and it's different from sample to sample. That's correct. So how can we compare sample to sample to do a kind of association study? So what you can do is examine the genes within those segments. Let's say that in one case you have a copy number change that encompasses 10 genes, so it's still small, and in another case you have an overlapping copy number change that
may have 15 genes, or may only have 5 of those genes, so it may be fully contained within the segment that you found in the first sample. That allows you to narrow down the space, and there are tools that summarize multiple different segmentation profiles to find what we call the minimal commonly altered region, and that usually helps to refine what genes might be targeted. Then when you want to go to your association, you do things at a gene level, if that's what you're interested in, and you can associate the gene itself and not the segment. They're everywhere, so some are in genes, some are in non-coding regions as well. They're not equidistant, they're governed by polymorphic sites, so they're not equally distributed across the genome. Okay. All right, I'm getting the cane, so I think we'll stop there and we'll take a little break and we'll come back. So in the interest of time we'll keep going. All right, so I come from the school of thought that says that to do analysis, to interpret results, you should have some sort of understanding of what processes give rise to the results in the first place. Okay, so what we're going to go over now, in a bit of detail, is how two approaches work for segmentation of these datasets. We'll discuss in detail this non-parametric approach called DNAcopy, or circular binary segmentation. The original paper was published in 2004, and there's an update, I think a 2007 update. There's a nice software package available in Bioconductor, and basically it requires R and integrates well with Bioconductor. This is actually the approach that we're going to use in the lab. The lab is fully contained within R, and that's basically why I chose this package, for ease of use of the software. There are parametric approaches using HMMs, some of which I've developed, and there are a number of other tools as well. Generally speaking these HMMs are not as user-friendly as some of the other packages. However, we are working
actively on a Bioconductor package for one of the HMMs that I've worked on. So let's get right into the DNAcopy algorithm. The key idea here is that, given this continuous data, we want to find change points in the data, so points where, as you're scanning across, there's a change in where the points are falling. So here is quite an obvious change point, and then there's really not an obvious change point within here. What we want to do is minimize the within-segment variation, so we want to keep this at a minimum so there aren't a lot of jumps, and maximize the between-segment variation. The between-segment variation is based on the means of the segments. This green line represents the mean of all the data points in here, and we want to try to maximize the distance between this mean and this mean. It employs a concept called circular binary segmentation, which is a variant of standard binary segmentation. So let's look at how it works. Let's just look at a particular chromosome, and we're going to let y sub t be the value for a particular probe, where t can go from one to n, and n is the number of probes in a given chromosome. Essentially this is just a fancy way of saying that from position i to j the mean is represented by mu sub i j, so that's just the mean of the segment. Okay, so now for the circular part: we're going to take this linear chromosome, wrap it around, and make a circle. In that circle we still have our positions, and basically we bring position 1 and position n together. What we want to find are the positions i and j that maximize a value, and we call that value Z i j. Essentially it takes exactly what we discussed earlier: it looks at the difference between the means of the segments and tries to maximize that, and then it penalizes very short segments. So this is like a penalty term that says, okay, very short segments are undesirable because we know that there should be
correlation, and then this term is just the difference between the means, and so we want to try to find the i and j that maximize this quantity here. Yeah, so it's the other bit. Here's one segment, and this is the other segment. And then essentially what we do is look at permutations of that and see how often that change point actually occurs, and judge significance based on that. So now we may have found two breakpoints, and that may take our chromosome and divide it into three sections. So we're going to have this section, this section, and this section, and then what we're going to do on all of these new segments is just the same process over and over again, basically until we get no more significant segmentations. What the algorithm outputs is segments that have some arbitrary number of levels. These red arrows indicate segments, and when you see a change in the mean here, that ends one segment, and the next one starts right away. Okay, so post-processing the output, as I mentioned, is still required in order to determine copy numbers. This gives us where the breakpoints are, but we still don't know which ones are gains and which ones are losses. There are a number of different approaches, such as MergeLevels and median absolute deviation factors, which we'll do in the lab, that can take the segment means and turn those segments into actual copy number calls. The other approach, or complementary approach, is using hidden Markov models that simultaneously segment and classify the segments. The breakpoints are found in the context of classification of those segments, so the segmentation helps with the classification, meaning you want to call a segment as a loss, neutral, or gain state, and vice versa as well. Classification is done at the probe level into a fixed number of states. There is, as I mentioned, a six-state model, and there are also three
state models and four-state models, and that's really the tricky part with these parametric models, actually choosing the number of states. So the segmentation that you showed in the last slide, is it based on the SNP probes or the copy number probes or both? In the case of copy number you use all the probes. In the case of LOH analysis you can only use the SNP probes. I mean, this is really just schematic. In fact this data comes from low-tech array CGH, but the concepts are very similar. Okay, so basically we want to output the probability that each probe is a loss, neutral, or gain. What HMMs do is use a model-based parametric approach. I won't go into the details of what a Markov assumption is, but essentially it allows the call for a probe to be correlated with its neighbor, and so neighboring probes end up being the same, and you get this nice spatial correlation that gives contiguous segments of neighboring probes that ultimately have the same copy number call. There are a number of tools that use HMMs, especially for SNP arrays, and this is really a two-dimensional analysis of B allele fraction and total copy number. For all four of these methods, I would highly recommend that you use these types of methods. They're very sophisticated in their approach and try to deal with some of the effects that come up in the cancer setting, in particular OncoSNP and this PICNIC algorithm. Okay, so this is a table from that review paper that I mentioned, and it gives you a really nice summary of the various methods that are out there and the experimental designs under which they are appropriate. It describes the detection method, and then whether they consider these two things that I've already mentioned, which are aneuploidy and impurity, which is normal cell contamination. Some of these methods have now started to address this issue of normal cell contamination, and so the field is moving into this area where cancer-specific
problems are being addressed. So it's kind of a simpler version of this one here. So originally, actually, this algorithm was not designed for SNP arrays, but it's equally applicable to SNP arrays, and it's been used in many different contexts, including the papers that I mentioned, like the Nature papers that describe the somatic landscapes of cancer.
Yeah, right, yes. So in the HMM setting, basically, the inference of the breakpoints and the classes themselves is done simultaneously. It's not done in a two-step process. The approach we're going to take in the lab is a two-step process. First we're going to segment the data to find the segments, and then we're going to classify those segments as being gain, loss, or levels of gain. And that can be a lossy process. So let's say the segmentation actually has errors in it. In that case, you can't go back and correct that segmentation, whereas the HMM actually takes both things into account iteratively and uses each one to help the inference of the other, if you will. So the segmentation is informed by the copy number states, and the copy number states are informed by the segmentation. So the HMMs in general are usually parameterized, and so the user has some flexibility in terms of setting what the means of the different levels should be. And so one can take that approach, and if you think that it's under-calling or over-calling one way or another, then usually there's the flexibility to adjust parameters to account for that.
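To make the HMM idea above a bit more concrete, here is a minimal Python sketch. It is not the implementation of any of the tools named in this lecture. It assumes three states (loss, neutral, gain) with Gaussian emissions on the log ratio, and it decodes the single most likely state path with the Viterbi algorithm. The function name `viterbi_copy_number`, the state means, the shared standard deviation `sd`, and the `stay_prob` parameter are all illustrative assumptions, not values from the talk. Per-probe probabilities, as opposed to one best path, would need the forward-backward algorithm instead.

```python
import math

STATES = ("loss", "neutral", "gain")


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_copy_number(log_ratios, means=(-0.5, 0.0, 0.5), sd=0.2, stay_prob=0.99):
    """Return the most likely loss/neutral/gain call for each probe.

    means, sd and stay_prob are hypothetical, user-adjustable parameters.
    A high stay_prob makes neighboring probes prefer the same state.
    """
    if not log_ratios:
        return []
    n = len(STATES)
    switch_prob = (1.0 - stay_prob) / (n - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n)]
        for i in range(n)
    ]
    # Uniform prior over the state of the first probe.
    score = [
        math.log(1.0 / n) + gaussian_logpdf(log_ratios[0], means[s], sd)
        for s in range(n)
    ]
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best previous state to arrive in state j from.
            best_prev = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans[best_prev][j]
                + gaussian_logpdf(x, means[j], sd)
            )
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the last probe.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Made-up log ratios: a gained region containing one noisy probe (0.2).
example = [0.0, 0.05, -0.05, 0.0, 0.5, 0.55, 0.2, 0.45, 0.5, 0.6, 0.0, -0.05, 0.05]
print(viterbi_copy_number(example))
```

With these made-up values, the probe at 0.2 sits closer to the neutral mean than to the gain mean, so a probe-by-probe threshold would call it neutral. The HMM keeps it inside the surrounding gain segment because switching state twice costs more than one poorly fitting probe. Changing `means` is the kind of parameter adjustment mentioned above for correcting under-calling or over-calling.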
Okay, so here's an example of compensating for segmental aneuploidies in SNP arrays. This is an algorithm called PICNIC, and changes in ploidy, as I mentioned, induce new genotype state spaces. And so here's a very similar example to what I showed before. This is from the PICNIC paper itself. What this algorithm does is take into account both the B allele fraction and the total copy number simultaneously, and it basically implements an HMM that can segment the data into these different states.
For compensating for normal contamination, there's a nice method called TumorBoost, in the tumor-normal pair setting. So this is where you've extracted some sort of normal reference from the same patient and done two hybridizations of the DNA, one from the tumor and one from the normal. So this is what the B allele fraction looks like before adjustment for contamination, and this is what the signal looks like afterwards. So basically what this does is it takes the SNP intensities from the normal sample and uses them to help inform the inference of the tumor sample itself. And so this is, I think, open source and also part of the aroma.affymetrix suite of tools.
So how do you say how much, how many normal cells you have? I guess this is based on some level of inference that you have 10 to 40%?
That's right. So that's actually what it tries to infer, the level of contamination, because you can see how skewed the alleles are in the tumor, and you know which alleles are actually present in the normal, and so then you can use expected rates and adjust that way.
So how about, you know, with the published papers, with like tumor-associated macrophages and all that, that might carry some of these?
Yeah, so it gets complex, because, well, certainly with the stromal contamination in the surrounding stroma, there's evidence that there are somatic signals in those stromal cells as well, and it gets pretty complex. So everything that I'm saying is a gross oversimplification of biology. We're
trying to get at the major concepts, and we'll never get precise modeling. Yeah.
Okay, so that's just an overview of some of the approaches. I would suggest that, if people are interested, you do some further reading. Look at the papers that are referenced in this lecture, and in the lab there's also a list of PDFs, so you can look at some of these papers that are there for you.
So one of the tools for visualization of this data is IGV. I believe we're going to get a tutorial. Is it now, or whenever? Yeah, so in about 5 minutes maybe we can do that. So here's an example from this population of a thousand breast cancers. This is the ERBB2 locus, and so you can see just this small region of chromosome 17 that's being displayed. IGV essentially represents copy number profiles in a linear track and colors amplifications as red segments and deletions as blue segments. And so you can see here that here's the ERBB2 locus and