So, as I was saying, the type of work that I do in my lab is really at the interface of computer science, statistical modeling, algorithm development, and data analysis, all with the view of furthering our understanding of cancer disease types, the processes governing tumor evolution, and essentially how the genome is remodeled in the transition from normal cells to neoplastic cells. And so really all of this activity is directed at understanding questions on this side of the equation. And so, you've met Gavin, and Gavin's going to be featured quite a bit in this particular lecture. He's done a lot of the work that I'm going to present, so I'd like to acknowledge his invaluable contribution to some of the work that I'm going to show. So what we're going to go through today is quite a long list, and I hope we're going to get to it all. We may not, but these are really the major topics in copy number analysis that I was hoping to cover today. And so we'll discuss a little bit about the biological relevance of this type of analysis and its impact, how we might measure copy number changes, the different technologies, and how to handle that data from arrays through to sequencing data. And at the end we'll get into some more advanced topics that have really come to light through recent discoveries in the literature in the last year or two. So this is a picture of a normal karyotype; it's a spectral karyotype. And it shows how DNA in our cells is essentially arranged into 23 pairs of chromosomes. You have one set from your mother, one set from your father. And so essentially all the genetic material is tightly packed and nicely arranged into these 23 pairs of chromosomes: 22 pairs of autosomes and the sex chromosomes. So this is by definition a female, with two X chromosomes. And if this were a male, there would be one X and one Y.
So the properties that govern the transformation of a normal cell to a cell that's undergone uncontrolled growth: the role of the genome in that has been suspected for almost 100 years. Boveri, in his experiments with sea urchins, noticed that some of the cells that started to grow with uncontrolled proliferation were aneuploid in some chromosomes. So they had an extra copy of one of the chromosomes. So he suspected long ago that that must have something to do with the acquisition of a phenotype of uncontrolled growth. And so he described these observations of multipolar mitoses in the sea urchins, and that led him to believe that abnormal distributions of chromosomes in cells were the culprit behind the initiation of malignancies. And it wasn't until about 50 years later that he was proven correct, with the discovery of the Philadelphia chromosome in chronic myelogenous leukemia. This was discovered by Nowell and Hungerford and published in Science in 1960. So this was the first real example that abnormal chromosomes, or the abnormal organization of genetic material in the cell, are actually responsible for malignancy. This has now been shown to be a property of nearly all cancer cells. Not all, but nearly all. And in some diseases it's particularly pronounced. And so in high-grade serous ovarian cancers, which you'll probably hear a bit more about tomorrow from David Huntsman, he's giving a talk, I think, is that tomorrow? So he runs the ovarian cancer research program at the BC Cancer Agency in Vancouver. But this shows another spectral karyotype of what the organization of genetic material looks like in different tumor cells. These are extracted from different patients. And so you can see that there are genome-wide duplications. So in some cases, the whole genome is copied. There are exchanges of genetic material between chromosomes. And that's shown where you have a hybrid chromosome that's colored with two different colors.
That means that the material has come from two different chromosomes. And you have loss of genetic material, and you have gains of genetic material. And so these genomes are essentially unrecognizable now. The deck has been completely reshuffled and scrambled. So let's look at that in detail and see what the consequences of that are. So copy number variations, or you might hear the terms copy number aberrations or copy number alterations: they all mean that there is a loss or a gain of genetic material. And so by way of example, here we have a particular locus that has three copies. And here you have a loss of a region, and then here, this is actually from a germline perspective, you have a de novo gain of material. And then sometimes you have deletion followed by duplication, and we'll get into all these different scenarios. So to look at what this looks like in a tumor type, what's arrayed across the x-axis is essentially the genomic locations, organized by chromosome. So this is essentially putting the genome onto one linear chain. Losing my pointer here. And what's shown on the y-axis is the frequency in the population of that particular locus being gained, in red, or deleted, in blue. And you can see that nearly the whole genome in a large breast cancer population is affected at some level of frequency. And so this is a prominent feature of the genomic landscape of tumors, and it's very important to consider. This is 1,000. Well, 997 tumors. So this is the frequency in the population of this region of the genome having extra copies. So it's a copy number gain. So nearly half of all the breast cancer patients will have a gain of 8q. So this is the q-arm, and this is the p-arm here. And nearly half have loss of 8p. OK. That makes sense.
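As a rough sketch of how a frequency landscape like the one just described could be tabulated, the snippet below counts, for each locus, the fraction of tumors called as gained or lost. The call encoding (-1 loss, 0 neutral, +1 gain), the function name `alteration_frequency`, and the toy cohort are all assumptions made for illustration; they are not the data or the pipeline behind the slide.

```python
from typing import Dict, List, Tuple

def alteration_frequency(
    calls_by_locus: Dict[str, List[int]]
) -> Dict[str, Tuple[float, float]]:
    """Return (gain_fraction, loss_fraction) for each locus.

    Assumed encoding (hypothetical): one call per tumor per locus,
    -1 = loss, 0 = copy number neutral, +1 = gain.
    """
    freqs = {}
    for locus, calls in calls_by_locus.items():
        if not calls:
            raise ValueError(f"no calls for locus {locus}")
        n = len(calls)
        gain = sum(1 for c in calls if c > 0) / n
        loss = sum(1 for c in calls if c < 0) / n
        freqs[locus] = (gain, loss)
    return freqs

# Toy cohort of eight tumors, two chromosome arms (made-up values).
cohort = {
    "8p": [-1, -1, 0, -1, 0, 1, -1, 0],
    "8q": [1, 1, 0, 1, 0, 1, 0, 0],
}
freq = alteration_frequency(cohort)
for locus, (gain, loss) in freq.items():
    print(f"{locus}: gain={gain:.3f} loss={loss:.3f}")
```

Plotting the gain fraction upward in red and the loss fraction downward in blue, locus by locus along the genome, would give a picture of the same general shape as the landscape plot.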
So the reason why this is important is that copy number alterations disrupt normal cellular behavior. So you can imagine, schematically, let's say we have a region of the chromosome that harbors three different genes. And you can have different mechanisms that lead to copy number changes. So here's the deletion. In the wild type, in the normal cells, you would have genes A, B and C in this particular locus. In tumor cells, you may have a deletion. And that would be manifest by this locus just not harboring the B gene at all. OK. And then one gene might be copied many, many times, and that would result in something that looks like this. Or the whole region could be segmentally duplicated. So you could have this whole region copied over twice. In this particular slide, yes. So each data point is a gene, but you could easily do it by genomic location as well. So what that looks like is, let's say in normal cells we have this segment here, A, B, and we have C, D. Amplification results in extra copies of that, and deletion would result in removal of that particular locus. And so copy number alterations are segments of a chromosome, and this is a really loose definition, of approximately 1 kb or larger. We use that in the field just because it's become a standard definition. Structural variations that are smaller than that, which you've probably already heard about, are considered insertions or deletions at the sequence level. But really that's an arbitrary definition. Essentially what it means is that you have a region where genetic material is lost or gained. And our focus is tumor genomes, as I said. And so you can imagine that CNAs can lead to adverse expression changes of targeted genes. This is because of a gene dosage effect.
So this becomes particularly important when, for example, you have an oncogene that's harbored within here and its role in the function of the cell is to promote growth; then extra copies of that will lead to a growth type of phenotype. Or you have a tumor suppressor that's contained within a deleted region, and that material is lost and that protein never gets expressed. And so the role of that gene as a guardian or suppressor of growth is removed, and the cell can achieve a growth-based phenotype. Yes. What would happen in the opposite case, where the suppressor gene is amplified? That's a good question. I mean, maybe the cell wouldn't be viable. So it would be like immune to cancer? Yes. So the cells may never be selected for. So really, many studies are now published and are ongoing to find copy number alterations for diagnostics, prognostics, gene-disease associations, and targets for therapeutics. So I thought I'd just go over types of CNVs that you may encounter. So how many people work on congenital disorders in the room? Autism? No? Okay. I thought I'd probe the room. Certainly in Toronto this is an active area. These are abnormalities that people are born with. A classic example is trisomy 21 in Down syndrome, and mental retardation or cognitive disorders have been associated with CNVs. Somatic alterations, which is what we're really focused on in this workshop, yes. I don't know if you started to mention what is the real difference between CNA and CNV? So I'm just about to say that. So these are variations that people are born with. These are acquired in the lineage-specific development of the cell, as it transforms from a normal cell to a neoplastic cell, then to cancer. That's right. So I don't actually know whether I'm going to cancer or work on the stem-spin use people and stuff like that. Okay. And there was an issue with the column of CNA and CNA.
So in the literature you may see CNVs refer both to pathogenic mutations and to normal human variation that makes two people different. A large percentage of the genome is actually subject to copy number variation that has essentially no deleterious phenotypic effect and just makes two people different. And these are cancer-specific, so by definition they would not be found in the normal cells; they are really acquired and would be specific to the cancer tissues. And as I said, most, if not all, cancers harbor some form of somatic copy number alterations. And then what I was referring to earlier are essentially benign variations. These are polymorphisms that are naturally occurring in the human population, and the point mutation analog to this is a single nucleotide polymorphism. So things like the HapMap project and the 1000 Genomes Project are all designed to understand this from a basic human variation point of view. So in terms of alterations in cancer, we have several different types that could be considered copy number changes. The first is segmental aneuploidies, and these are often large scale. They would contain a large percentage of a chromosome, for example a chromosome arm. The structure of chromosomes is such that sometimes a whole arm of a chromosome will be replicated or deleted, and we'll see examples of that. And then you can have focal CNAs, and these are deletions or amplifications of very high amplitude, so you get many, many copies in an amplification, or complete removal in a deletion, and they tend to target one or just a few genes. And because they're so focused, if they hit the right gene, often that is a good sign that the event is being selected for in the development and evolution of the cancer.
And so we're really quite interested in these. When you're doing copy number alteration studies, these are the things that we're really after, because they're very strong signals in the data and they can be good indicators of so-called driver events. So you've covered drivers and passengers, have we covered that already? Okay. And evolution as a concept has been covered? Okay. So essentially a driver event is something that you could consider as an event that confers a phenotype that is selected for in the evolutionary progression of a cancer. And so what we're really after when we do these genomic studies is to find driver events in the genome. And then you've heard about rearrangements, you've heard about translocations and gene fusions yesterday, I think. So a passenger is essentially an event that comes along for the ride. You can imagine that cancer has an evolutionary process where there's a stochastic or random element of genome shuffling that goes on. And at the same time that the genome is shuffled you may get a driver event and a passenger event, but it's actually the driver event that confers the phenotype; the passenger event doesn't actually do anything, doesn't alter the function. And the passenger will also be your response? Well, yeah, you mean a compensatory type of effect? So I think if it alters the phenotype of the cell, then by definition it would not be a passenger. So, yeah. So if you were to just say that now the driver event can happen, and I don't know if it's the same before, but can you lose that driver event that's not going to happen? Well, so the classic case where a driver might be lost is that a tumor is often composed of different populations of cells.
So one population can harbor a driver that's really creating the malignancy, and hiding underneath that is a different population of cells that may not actually harbor that driver but has a different driver. And drugs often try to target these drivers. So, well, might as well just move to this slide then. So this is the classic case. This is ERBB2, which is also called HER2 (the protein's called HER2), on chromosome 17 in breast cancer. And this is essentially the presence of this amplification. So what's arrayed along the x-axis is the genomic position on chromosome 17, and each dot essentially represents the amount of genetic material at each locus. It's a noisy measurement, so you don't see it right on the line, and we'll get into how we deal with that. But essentially the blue areas are copy number neutral areas of the genome. And in this particular tumor there's a massive amplification of this gene called ERBB2, and the rest of the chromosome is essentially diploid. And so there's a targeted therapy that was developed in the 90s called Herceptin, or trastuzumab as the generic name, that specifically targets this protein and will inhibit its expression. And so this is an example of a targeted therapy that really goes after a specific copy number lesion. Women who had this type of cancer would basically have a worse prognosis and would succumb to their disease very quickly, within a couple of years. And the advent of Herceptin has essentially transformed that disease into a situation where most women do quite well. There is acquired resistance, however, in some cases. And that's because you can kill the cells that harbor this abnormality, but that creates a selection pressure that then drives a different genetic reprogramming of the cells.
So that's a situation where a driver can be lost in the presence of the selective pressure of chemotherapy. Yeah. Yes. This is Affymetrix SNP6 data. So 1.5 kb. Potentially. So there can be many driver events that we don't have targeted therapy for. Right. That was the interventional. That's right. So we're going to get to that in a minute. But no, I don't think all events can be classified as drivers or passengers. I would say that there are backseat drivers as well. They come along for the ride, but they do some yapping from the back and so can also drive the phenotype of the cell. Okay. So let's just look at the effect. So why is this important? The most important copy number alterations, essentially through gene dosage effects, drive the expression of those genes. So you learned about gene expression yesterday, probably, right? More than you care to know. But what causes changes in gene expression? One mechanism is copy number changes. And so what's plotted in these two-dimensional plots is the copy number on the x-axis and the gene expression on the y-axis. This is again a set of 1,000 breast tumors and different genes. So here's ERBB2. ERBB2 is interesting. What's colored here are the different copy number states. So green is a deletion, blue is neutral, and then increasing brightness is increasing copy number levels. And so you can see up here, these are the cases with ERBB2 high-level amplifications, and they have the highest expression in the population. So there's an association here between what's happening in the genome and what's happening in the transcriptome. So that's a gene dosage effect. Let's dive in. Here are just some other examples. These are the most highly correlated profiles. And you can think of these as kind of two distributions.
So the part of the distribution that actually harbors the alterations is really sort of to the right of zero. And in those cases, you essentially get a response in expression. The more copies of the gene you have, the more highly expressed it is. Okay. Yes. So they can be. So the pathogenic variations that we're really concerned with in this lab, and most of the driver events that we're talking about, are acquired throughout the life history of the tumor. These are not events that one is born with. If you're born with a HER2 amplification like this, you probably would never make it as a normal human. Yeah. Embryonic lethal. Okay. So here's the other end of the spectrum. Those were amplifications, and here are specific examples of deletions. Again, it's the same color scheme. The x-axis represents the position on the chromosome. Each dot represents essentially a 1.5 kb segment across the genome. And what's plotted are just individual tumors that we looked at. And what you can see within these dotted lines are the borders of a gene called PTEN. This is a tumor suppressor gene. The bright green represents a situation where essentially both copies have been lost. So PTEN is just not there at all in these tumors. And this is a well-known driver. So this is how it looks when you have a deletion of PTEN. And the interesting thing about this is that different tumors do it in different ways. Each event is not exactly the same, so we don't have the same boundaries. For example, this one is a much broader deletion. And this one here is really quite focal; it basically just contains PTEN and not much else. This one is even subgenic. So this is happening within the boundary of the gene, and it only takes out a couple of exons. And this is the classic pattern of a tumor suppressor.
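To make the broad, focal and subgenic distinction concrete, here is a minimal sketch that labels a deleted segment by how it sits relative to a gene's boundaries. Everything in it is an illustrative assumption: the `Segment` class, the `classify_deletion` function, the 3 Mb cutoff for calling a deletion focal, and the toy coordinates (which are not the real PTEN coordinates).

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int   # segment start, in base pairs
    end: int     # segment end, in base pairs
    copies: int  # total copy number; 2 is neutral in a diploid genome

def classify_deletion(seg: Segment, gene_start: int, gene_end: int,
                      focal_max_bp: int = 3_000_000) -> str:
    """Label a segment relative to one gene.

    focal_max_bp is a hypothetical cutoff chosen for this sketch,
    not a standard value from the lecture.
    """
    if seg.copies >= 2:
        return "not a deletion"
    if seg.end < gene_start or seg.start > gene_end:
        return "misses gene"
    # Zero copies means both alleles are gone (homozygous deletion).
    depth = "homozygous" if seg.copies == 0 else "hemizygous"
    if seg.start >= gene_start and seg.end <= gene_end:
        extent = "subgenic"   # deletion lies entirely inside the gene
    elif seg.end - seg.start <= focal_max_bp:
        extent = "focal"      # small event that overlaps the gene
    else:
        extent = "broad"      # large event that happens to include the gene
    return f"{depth} {extent}"

# Toy gene boundaries and three tumors (made-up coordinates).
GENE_START, GENE_END = 1_000_000, 1_100_000
broad = classify_deletion(Segment(200_000, 9_000_000, 0), GENE_START, GENE_END)
focal = classify_deletion(Segment(950_000, 1_150_000, 0), GENE_START, GENE_END)
subgenic = classify_deletion(Segment(1_020_000, 1_040_000, 0), GENE_START, GENE_END)
print(broad, focal, subgenic, sep="\n")
```

All three toy tumors lose the gene's function in this picture even though the deleted segments have different boundaries, which is the pattern described above for a tumor suppressor.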
So as long as the gene is inactivated somehow, that's all that really matters. And that's the phenotype that gets selected for. And PTEN is also subject to mutations as well. So you can have truncating mutations and different types of mutations that essentially inactivate the protein. Okay. Hey, Ian, how are you? Yes? If we weren't looking for the next and other tumor suppressor, do you think it would be better? Yeah, so it depends. So p53, for example, which is probably the most famous tumor suppressor gene, actually rarely shows this phenotype. It's almost always, not always, but almost always inactivated by mutation. So by a truncating mutation or an insertion or deletion at the nucleotide level. I'm talking about a one-base insertion or a frame-shifting insertion or deletion. However, with something like CDKN2A, or p16, which is probably the second most famous tumor suppressor, you'd see this pattern as well. And RB shows this pattern as well. RB1. Yes? No, no, so there's quite a distinction there between a missense mutation and a copy number change like this. So this would basically result in no translation of the protein because the gene is not there. It's never actually expressed, so you don't get a message. With a missense mutation, a message would be made and a protein would likely be made. It would just maybe have an altered function. In the case of a truncating mutation or a nonsense mutation, the message would be made and the protein would be made, but it would be truncated and therefore degraded quickly, and so it wouldn't have the chance to function. And sometimes the message itself is subject to nonsense-mediated decay, so the levels of protein would be reduced in that case. How often is there not a correlation between expression level and copy number variation? Quite often. Yeah, quite often actually. So in this study, which just came out last month, we found that there is a correlation in approximately 30% of genes.
And so if you think that at baseline maybe 50% of genes are expressed in any given tissue type, then that leaves about 20% of genes that are probably subject to different types of regulation. And that could be epigenetic regulation, or regulation through mutation or other factors. Is this for drivers and passengers? Yeah, so that's right. This is just for all genes. And so it could be that in fact some alterations just don't have any effect at all on the expression levels. In this particular study we actually used the idea of a strong association between copy number change and expression level as a different definition for a driver, in the sense that that gene is driven by its copy number level, whereas other genes could be regulated by epigenetic regulation. Yes? If the condition is less than 1, could it be? That's what we do sequencing for. Okay. So these are just some genes that are known to be affected by somatic copy number alterations. We already mentioned ERBB2. This is a highly related gene called EGFR, the MYC oncogene, PI3 kinase, IGF1R, FGFR2, KRAS, CDK4, CDK6. These are all genes that have been shown in different cancer types to be subject to these focal and high-level amplifications. Deletions are known to affect genes like RB1. This is essentially the first tumor suppressor that was discovered. PTEN, CDKN2A and B, which I've mentioned already, ARID1A, NF1. And this list is growing. As large-scale studies make it out into the public domain, this list is going to get longer and longer. We're going to learn more and more about which genes are subject to this type of alteration in the genome. And so some examples of this that have appeared in the literature recently are these papers.
These were a couple of papers that appeared in Nature, which are essentially pan-cancer interrogations of literally thousands of tumors and subtypes using high-density genotyping arrays to try to determine the copy number landscape of different diseases, and the patterns of copy number alterations are nicely described in here. There are specific tumor type papers that are starting to emerge. Large-scale studies, 500 cancers, 300 cancers in here, 200 cancers in here, but we're going to see very soon breast cancer papers, one from our group and one from the TCGA, that describe the landscape of copy number alterations in large populations of tumors. And all these studies have revealed new genes of interest that are potentially targetable and that are driving the phenotype of the malignancies. And so you should read these papers to see an example of this type of alteration. So what I was asked about before is actionable alterations. The classic and biggest success story in our field is really ERBB2 and trastuzumab, but there are others. So there are targeted therapies against EGFR, platelet-derived growth factor receptor, drugs that target PI3 kinase, and actually I would say the other classic success story is BCR-ABL, which is targetable by imatinib. So this is the Philadelphia chromosome. It essentially defines the disease, all CMLs harbor this alteration, and the drug Gleevec, or imatinib, was developed to specifically target that protein. And so what was again once a very difficult disease to manage has a targeted therapy based on the discovery of a genomic abnormality. And tomorrow or later this afternoon we'll cover some mutations that are targetable. So I just thought I'd go over because yes, oh yes, yes. So here's why, just as an example you have an interemptivation where you have a bigger point mutation and also the therapeutic mutations are pretty much the same so what's happening with the point mutation?
Yeah, so the point mutations are hitting the kinase domain, and what that does is result in a downstream signaling cascade that drives a growth pathway, in the same way that a gene dosage effect drives the pathway. These point mutations are usually very specific, and I'll get into that this afternoon: what hotspot mutations look like when they activate a protein, which has a pretty unique pattern. And we're actually finding mutations in ERBB2 as well, as we sequence breast cancers. And it's the same type of thing. They're kinase domain mutations that drive the pathway. The result is the same, and so that's why they can be targeted with the same agents. Yes? You get this large chunk of DNA or it's shut down or it's disappearing. How do you know that? Yeah, so that's good. So cancer follows an evolutionary path, and drivers can accrue at different points in the evolution. So an initial event may be that you remove a guardian of genomic stability or DNA repair or homologous recombination. If you remove the capacity of a cell to repair itself, then these CNAs can accumulate. And eventually a CNA can hit a gene in such a way that it becomes a new driver. And so there's a temporal series of events that essentially leads to the cancer as we observe it, and drivers can happen at any time along that history. Yes? So if increased copy number is also associated with polysomy, I'm thinking with EGFR because we can probe for EGFR, it's increased copy number and then we probe some cells also we'll see. And it's not considered I guess a true amplification for us because we have to have the ratio. So I guess I'm wondering if the polysomy accompanies that, if that's not considered a driver. Yeah.
That's generally true, and I may have a slide on EGFR somewhere that I can show you, but essentially when EGFR is amplified it looks like this. It is just unmistakable. There are hundreds of copies, as you probably know, and that's very different from just a single-copy gain of all of chromosome 7. That would have a very minor gene dosage effect, whereas this has a major effect, and these types of alterations tend to be restricted to just the gene or maybe the surrounding partners, whereas that's more benign. It's actually more difficult to interpret a chromosome arm level shift. That may just be a result of a lack of being able to repair, so it's a telomerase type of abnormality where the telomeres can't correctly be repaired in replication. Yes. Sorry, I don't understand the question. So you mean is the whole set tandem duplications? Well, I think the jury is still out on that. What's interesting is what we're learning from sequencing: how does this happen, where do these chunks go, are they all sequentially arrayed side by side, or, when there's a replication loop like that, do those bits of DNA get incorporated somewhere else, maybe because of nuclear proximity in the 3D nuclear architecture, for example? And some of the sequencing studies are really showing that these types of events may get deposited at multiple places in the genome, and so each event might actually be quite different. At the end of the lecture I'll talk about a phenomenon called chromothripsis, which is essentially chromosome shattering: the chromosome just gets blown up into bits and then reassembled, and that creates a whole other type of phenotype that we can measure now with sequencing. So I think we're still gaining an understanding of what the structure of these is, and that's something the sequencing technology can actually give us. Okay. So, just briefly now: this is a project that I worked on for quite
some time, about 4 or 5 years, and what we set out to do was to use high-density arrays, both genotyping arrays and expression arrays, to explore the genomic and transcriptomic architecture of a large population of breast cancers. And so what I wanted to show here is, you recall that first landscape plot that I showed, and you can see that these types of alterations were really quite broad. When you lay the expression landscape over top of that copy number landscape, and we just focus on the high-level events that I showed and the homozygous deletions, where two copies are deleted, we look at the expression in those particular tumors and see that they have some sort of outlying distribution, so they're way different from the rest of the population. The assumption is that because we have an extreme copy number event and an extreme measurement of expression, they may be associated. And when we plot the frequency of that, we see that the landscape gets very sharply focused into these nice peaks. So here, for example, is ERBB2, and in a large percentage of the cases where we have copy number alterations we also have outlying expression, and so this is the sharpest peak in the whole landscape. We have another peak like that at 8p12, and this harbors the FGFR1 locus and also harbors ZNF703, which some colleagues published as being a new driver event in breast cancer as a result of this data analysis. And then here we have the second most frequent amplicon. It's actually two peaks, so this was really quite interesting. This is 11q13, and we have the CCND1 locus here, and then we have a locus here that has about 13 different genes in it, and this was really a defining feature of a subtype that I'm going to talk about in a minute. And then sort of smattered through the whole landscape are relatively infrequent but very significant drivers in
the sense that they're not frequent in the population, in terms of being 50% or even 20% or 10%, but nonetheless these would be important targets to pursue in a rare number of cancers. So here's IGF1R, here's CCND1. And then in the tumor suppressor landscape, these are genes that are homozygously deleted and have consequent loss of expression, we've identified PPP2R2A as a potential novel target and also MAP2K4, which is really confirmed in breast cancer in this study as being a novel tumor suppressor. So this has the same type of structure as the PTEN locus that I showed. That's obviously a known and very famous tumor suppressor, and this is a novel one that we identified. Okay. So this is frequency in the patients? That's right. So how do you find whether it's high copy number or not? Yeah, so you'll do that today in the lab. You'll learn all about that. Here I see on chromosome 8 there seem to be lots of gene candidates that seem to be amplified in many, many patients, so how do we know that we want to ignore gene A and gene B?
Oh yeah, so I think all these peaks are probably of interest. We've denoted a couple, and there's no question these broader regions are much more difficult to interpret. The focal regions are the ones that are easy to interpret because they only harbor a couple of genes, and looking at the biology of the gene can often tell you something about whether that's likely the driver or not. These broader regions are still difficult to interpret. Yes? Do amplification events usually happen in tandem with deletion events? If I were to look at this one way I would say, well, most of the time there's amplification, but then I realize perhaps these things aren't single events, because it does look like there are very few deletion events, relatively. Well, you can imagine that it's easy to copy something many, many times, and there's actually no upper bound in terms of amplification, but of course there's a lower bound with respect to deletions: if you're diploid you can only lose two copies. And actually the cell can probably tolerate amplifications much better than it can deletions. If you really whack out too much, then you might whack out a housekeeping gene, and that would just be lethal to the cell and it would never get selected. So you're right in the sense that amplifications are probably much more common, and those are some of the reasons why. Yes? Alright, we have a lot of material still to cover. Okay, well, let's plow ahead then. So one of the major results of this is that, with a population of 2,000 tumors, we really wanted to look at whether breast cancer could be further subdivided into specific groups. Before we started this project it would have been kind of accepted in the field that there are five subtypes that were discovered through expression profiling, but a lot of those studies were based on relatively
small sample sizes, so 200 tumors, 300 tumors, that type of thing, and there was really no high-resolution look at the genome in tandem with the transcriptome. And so we were asking: can the population of breast cancers be further stratified into more refined groups, based on a large sample size and a high-resolution, simultaneous look at the genome and the transcriptome? And what we found is that there are essentially 10 reliable subgroups. It's not the definitive answer. It's not like there are 10 and that's it. There are likely others as well. But that's what we found in this data set by splitting it into two groups, a discovery set and a validation set. We did the discovery work on the first thousand and then tried to validate that in the second thousand over here. And essentially what's shown here, and these plots were made by Gavin, is the frequency of alteration in each one of these 10 groups. And then what's shown at the bottom of each track is essentially the subtype specificity of those alterations in the group. So where we see black, that means that the distribution of these regions is different from the rest of the population. So the black really identifies the subtype-specific abnormalities that essentially define the group. So here's a group that's essentially defined by the 8p alteration, and then some other things are going on in the genome as well. This group here is the 8p12 group. This is that very focal region that I showed that has that ZNF703 amplification. This one here is the ERBB2 group, so this is a very strong signal in the data. I'm sorry if it's hard to see, but there's a very focal spike right at the ERBB2 gene, so that's what defines that group. And then over here, I mentioned that we have this 11q13 amplification that harbors CCND1 and some other genes in that amplicon, and this was a new discovery, because actually this was composed exclusively of ER
positive tumors. And the reason why this is important: so ER is the estrogen receptor. I should back up a little bit and say that in clinical practice there are really three subtypes. There's ER positive, ER negative and HER2 positive. And so that's the clinical assay that breast cancer patients get when they get diagnosed: we measure expression levels of estrogen receptor and HER2. If they're HER2 positive they get Herceptin-based therapy, if they're ER positive they get hormone-based therapy, and if they're ER negative then we don't really have a good therapy for them. Yes? Did I understand correctly that the HER2 positive patients are all in subtype five? That's right. But why are they not overlapping, because people can be HER2 positive and ER positive? Let me just see if I can show you that. So what's shown here is these are just defined by essentially the HER2 amplicon and overexpression of HER2. And what's shown over here, I didn't really go into detail here, but these are the expression-based subtypes. So these are luminal A and luminal B, so these are ER positive cases here. Okay, so just one more thing to mention here. This group here is actually the largest group, and what's interesting about this group, and this is really about 17% of the population, is, what do you notice here? It's flat, right? There's nothing going on. So this is a really curious finding, in the sense that 17% of breast cancers actually don't really harbor alterations from a copy number perspective. They're relatively quiet genomes. So the next hypothesis is that maybe these are driven by mutations, maybe driven by epigenetic changes. That's a follow-up question we're pursuing. Okay, so the other aspect of this project, and this was really a massive tour de force to collect this. My colleague Sam Aparicio and Carlos Caldas in Cambridge led this project in accruing the sample set, and they basically went to five different
centers in the UK and Canada to accrue a population of this size. One of the criteria for inclusion was that we had long-term follow-up in terms of clinical data. So in some cases we had up to 15 years of follow-up on these tumors, and a minimum of five years was the criterion. So we can then ask, what does this all mean from the clinical perspective? It's fine to say that we found interesting patterns in the genomics, but does this have any impact on prognostication? And so what's shown here is a Kaplan-Meier plot, and I think you'll do these types of plots on day five in this workshop. And so the first thing I wanted to just point out is that this group here, so how many people have seen Kaplan-Meier plots before? So essentially what they measure is the proportion of survivors in a particular group as a function of time. So the x-axis is time, and this is basically time after diagnosis, and then this shows essentially the proportion of patients in a particular group that are still alive after some amount of time. I won't dwell on that because you're going to learn that in great detail on Friday, but essentially what this means is that if the curve stays high then that is a good prognosis, and if the curve goes low then that's a very poor prognosis. So a lot of these patients were accrued before the advent of Herceptin, and so this group five, which has a very poor prognosis, is the HER2 positive group here. So you can see that before Herceptin this was really a very high-morbidity subtype with a very steep morbidity trajectory. These days that curve would probably be somewhere around here; you would see it shift up towards the top right. What I wanted to just point out is that we also have this group up here, which is essentially a group with very good prognosis. And what was very curious is that we have this green group. This was the group with the 11q13 amplification, so these are all ER positive
tumors, and before we did this study these would likely have been grouped in with this pink curve. And you can see that there's a large split, and that correlates strongly with an inferior prognosis. And so this identifies now a new subgroup. It's only about 4 to 5% of the population, a small percentage, but this is a group that really warrants major attention, and potentially our work here is to identify some targets to pursue, in the same way that we can with Herceptin. There may be a gene in there, because it's a high-level amplification, that may be targetable with a new therapy. I just noticed something interesting: the group in the last one, which had almost no alterations, that's actually the second best in those curves. This one here? What's protecting that population? So one possibility is that these genomes aren't that evolved, so the tumors actually haven't progressed to a point where they've acquired a really nasty, aggressive phenotype. So that might be one possibility, that we've just observed them relatively early in their clinical stage. The other thing that's associated with this group is that we found a strong signal of lymphocytic infiltration in these tumors. And so there's a strong immune response in that particular group, and that suggests that these may be tumors that are subject to being controlled by the immune system. Yes? Your 2,000 patients, were they...? That's a good question, that's a very good question. We tried to look at that. The answer is probably no. There's a bias there in different regions. So one of the hospitals is in a predominantly African community in England. Our hospitals in Vancouver probably have an overrepresentation of a Chinese population. So no, there would be some bias there for sure. We did try to look at that. We looked at the genotypes, because these are actually genotyping arrays, so we could actually look at that question, and there wasn't a strong signal there in terms of whether
there's an association or a confounding factor in terms of these outcomes. The other thing is that with genome-wide association you try to do this with tens of thousands of patients. This was not the point of the study, but yeah, it's a fair question for sure. Yes? Yes, so that was discovered basically through de novo clustering of the data. So we took the data and we looked at the most variable features from a gene perspective in the whole population, and then we clustered using a joint measure where both copy number features and expression features were included. And so the 10 groups emerged from that de novo, unsupervised clustering. Yeah, so 2,000. Okay, so the other interesting observation that emerged from this, and I think this is something that the field is really going to get rapidly excited about now, is the idea that it's not a one-to-one correspondence between a copy number change in a gene and the expression of that gene. So again, this is copy number location on the x-axis and this is the location of the expressed gene on the y-axis, and what's shown at each point in this matrix is whether we have a positive correlation in red, sorry, in green, or a negative correlation in red, where a copy number change in one location is correlated with expression of genes at many other locations in the genome. Okay, so you have a single copy number event that is associated with gene expression changes across the genome. And so you can imagine a situation where you could have a transcriptional activator, for example, so a gene whose job it is to go in and promote the overexpression of many other genes. If you get extra copies of that protein, it may have widespread effects across the whole genome. Another situation: you can imagine a biochemical pathway where, if you drive, let's say, the top level of a biological cascade, it's going to have wide-reaching effects
throughout that whole cascade, and that's essentially what we observed. And so here we have a situation where a deletion in 5q essentially resulted in a whole set of genes that were up-regulated. So we probably have some sort of cell cycle regulator in here. We don't exactly know what the target is yet. And the reason is that if you look at the pathways involved in these genes, essentially they're involved in the cell cycle: so M phase, Aurora kinase signaling, and then here you also have the FOXM1 transcription factor network, so there might be something to do with a transcription factor regulating many genes, and then here's the G1/S transition. So this is a pattern that we observed in this data, and in a different analysis ERBB2 shows up as one of these interesting targets as well, one that regulates a large number of expression changes. And so again, a single event that has wide-reaching effects across the genome. And so we just need to start thinking about the biology of cancer and aberrations in the genome in terms of how they drive transcription networks, and not just transcription of a particular gene but whole networks of genes. And that's probably a concept that Gary might cover in his lecture. Okay, so we'll plow on here. So that's the story of this data set, called METABRIC, and it is published and you can read about it in detail. It is in the EGA at different levels. Actually, I think the segmented data will be presented today in the lab, is that right?
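The copy number versus expression matrix described above, where a single copy number event correlates with expression changes at many other genes, can be sketched roughly as follows. This is only a minimal illustration on simulated data, not the analysis used in the study: the function name `cn_expression_correlation`, the array sizes and the simulated effect of one locus on several genes are all assumptions made for the example.

```python
import numpy as np

def cn_expression_correlation(copy_number, expression):
    """Pearson correlation between every copy number locus and every gene.

    copy_number: (n_samples, n_loci) array of copy number values
    expression:  (n_samples, n_genes) array of expression values
    Returns an (n_genes, n_loci) matrix: rows are expressed genes (y-axis),
    columns are copy number loci (x-axis).
    """
    cn = np.asarray(copy_number, dtype=float)
    ex = np.asarray(expression, dtype=float)
    # Standardize each column to mean 0 and standard deviation 1.
    cn = (cn - cn.mean(axis=0)) / cn.std(axis=0)
    ex = (ex - ex.mean(axis=0)) / ex.std(axis=0)
    # Average product of standardized values is the Pearson correlation.
    return ex.T @ cn / cn.shape[0]

# Simulated cohort (hypothetical): 200 tumors, 5 loci, 8 genes.
rng = np.random.default_rng(0)
n_samples = 200
copy_number = rng.normal(size=(n_samples, 5))
expression = rng.normal(size=(n_samples, 8))

# Assume locus 2 behaves like a regulator: it pushes genes 0, 3 and 6 up
# and gene 5 down, regardless of where those genes sit in the genome.
expression[:, [0, 3, 6]] += 2.0 * copy_number[:, [2]]
expression[:, 5] -= 2.0 * copy_number[:, 2]

corr = cn_expression_correlation(copy_number, expression)
print(np.round(corr[:, 2], 2))  # column for locus 2 across all 8 genes
```

In a matrix like this, a one-to-one effect would show up only where a gene sits at the altered locus, while a column with many strong entries corresponds to one copy number event associated with expression changes at many genes.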
So we'll work with the data today. Okay, so what I've talked about so far is really just about copy number in aggregate, so we haven't been concerned about the specific alleles up until this point. What I wanted to move to now is to think about genotype. I don't know, did John go over genotype at all, or what that means? Okay, so imagine you have two alleles. You get one from your mother, one from your father, and we're going to call these allele A and allele B. And in a balanced diploid case you could have three possibilities. So let's just look at this side here. You could have a particular locus that is AA. I said maternal and paternal, but this could also be described as major and minor, in the sense that through population studies we may have seen that one allele is dominant in the population and another allele is less frequent in the population. This is what consortia like the HapMap consortium have worked out for many, many positions in the genome. If you look at a database called dbSNP, all this is essentially well cataloged for a large number of polymorphisms in the genome. So if you consider this major and minor, this is a situation where both the maternal and paternal alleles are the major allele, here you have a situation where you have a major and a minor allele, and then here you have homozygous for the minor allele. And so that's the diploid situation, where you have two copies everywhere. What happens when you have an extra copy? If you have three copies, then clearly only one of the alleles must have been amplified or duplicated. So that induces a new genotype state space, the possibilities of what genotypes we can get. So you can have all A's, and that could be a situation where you started here and you just copied that one, or you started here and you ended up with an extra copy of A. So that's another state space that gets induced, and by the same
extension you could have the same type of extension of the genotype space with four copies or five copies. So now consider that the starting point is always diploid. Imagine a situation where in the normal cells we look at all positions that are diploid heterozygous, so all positions start here, and then we look at the tumor. Well, we get the induction of what we call different zygosity statuses. So you can imagine if you started out diploid AB, and in the tumor we observe that it has now become AA, we call that loss of heterozygosity. So this is a heterozygous position in the sense that you have two different alleles, and then the loss of that heterozygosity induces something like this. And that can go in both directions, so you can start here and you can become homozygous that way, and by extension the same thing applies to all these different copy number states. So what that looks like in actual data, oh, this is black and white. That's fine. Interesting. So somehow this got black and white, but look at your page instead of the screen. So this is actually sequencing data from a breast cancer, and I think you're actually going to work with this particular data set in the lab. So what's plotted here: what Gavin did is he tried to find all the heterozygous positions in the normal genome. So this is a situation where we sequenced the normal from blood and then we have the tumor DNA from the tumor. So these are all positions that were heterozygous in the normal, and the definition of that is really the allelic ratio. We haven't really got to what sequencing data looks like in terms of allelic counts, but you can imagine that half the reads that cover this particular position were from one allele and half the reads from the other, so it's centered somewhere around 0.5. And then in the tumor we can look at those same exact positions, and down here what's plotted is
the copy number, and this is also derived from the sequence data. We'll get into how we do that, but just take this as it is for now. So we have some diploid regions, we have some deleted regions and we have some gained regions here. And you can see that these deletions and these amplifications result in shifts away from that 0.5. So now you have this kind of noisy representation where the mean isn't really at 0.5 anymore, it's really shifted away from that. And in this situation we have a deletion that's induced loss of heterozygosity, and that's because one copy is essentially gone. Okay, here you have a situation where you have an allele-specific amplification. And the reason we know that is that it's not a balanced amplification, because then we'd still see a pattern that's clustered around 0.5. Instead what we have is a shifting away from that, and so we get a genotype space that is more like AAAAB or ABBBB. We still consider this heterozygous, but we call it an allele-specific copy number change. So you can see that in aggregate it's definitely been shifted, but it's really just one actual allele that is getting copied. The other one is unaffected. And then you have this really important region here, which is essentially copy number diploid. So this is two copies, it's neutral from the copy number perspective, but there's loss of heterozygosity here. So what do you think this is? Any ideas? So copy-neutral loss of heterozygosity: how do we get there? Right, so it's a stepwise process, and you must have had two events to get there. We would have had to have first had this event where we have a deletion, and then the remaining allele is what gets duplicated again. My pointer is really not working here. So you have a deletion followed by a duplication, and that gives rise to a signal that's very similar to this but
the copy number is neutral. Is that clear? All right. This is the same individual? So this is the normal, just the normal, one person. Yes. Sorry? Okay, yep. Only one SNP? Difficult, because the power of doing something like this is you get a whole block. So this may represent, what is this, probably about 10,000 SNPs in here, approximately. Let's say more than a thousand data points in here. Yeah, several megabases in here, so several thousand data points in here to get the signal. What's going on here is that you still get some positions that show 50-50, and those are probably just positions that are not well covered in the sequencing. If you just looked at this one data point right here, you would think that there's no LOH, there's no loss of heterozygosity. You just get unlucky. But when you look in aggregate across the whole space of a thousand data points, then the signal becomes very clear that this whole region is similarly affected, and this is probably just noise. Well, so you could do a very, very detailed analysis of one gene. I haven't gone into sequencing yet, so that's coming this afternoon, but basically one of the choices in experimental design for sequencing is how much sequence do I obtain for my whole genome or my gene of interest. If you're only looking at one gene, you can afford to sequence it in great depth, and so that may be enough then, because if you go deep on one particular locus or a small number of loci, that should give you a very clear indication of what's going on. Of course you can't afford to go very, very deep across the whole genome because it would just cost too much, but if you're just looking at one gene you can sequence very deeply. This is actually sequence data. This is from one patient. So what's your interpretation of these regions of loss of heterozygosity but not
complemented? Yeah, so what may happen, and we'll get into this this afternoon, is, why is it advantageous to have a second event, essentially? You can imagine that there's a sequence of events where you may have a deletion, there's only one allele left, and in that situation you may get a mutation, and so the only protein that gets made is the mutant protein. But it may be disadvantageous to have all the other genes in here have lost a copy, and so the cancer wants to up the gene dosage of the surrounding positions but keep that mutation. So that could be a situation that gives rise to that. So what we've been doing in my lab is really trying to integrate this type of data with mutation data. Again, that's another way to potentially identify important tumor suppressors or driver genes through mutation. I was just wondering, why is it two events, if it's just one allele that gets copied over the other, so the consequence could be that the other one just gets clipped out, essentially? That's true, that could be the case. Would high coverage and low coverage have that issue? It depends what you want to do. For things like looking at total copy number changes, people have shown that this can really be done with something around 5 to 10x. Actually, the signal is quite readily apparent. When you get into specific alleles and try to find SNVs or do this type of analysis, what we found is that 30x is not as clean as more than 30x, and 30x is actually probably woefully inadequate, I would say, for mutation analysis. And that's because a cancer sample is made up of a mixture of different cell populations. It's made up of cancer cells, and then admixed in there are normal cells that are just a part of the stroma or a part of the lymphocytic infiltration. So you could be measuring something 30 times, but for a given mutation that is maybe present in 10%
of the cells, that's going to be represented in 2 or 3 reads, so you may get unlucky and just never see it. So why are we interested in this? Well, how many people have heard of the Knudson two-hit hypothesis? Classic stuff, all right. So Knudson proposed a paradigm whereby important genes essentially need to be inactivated in both copies, both alleles. And so this is the number of alleles here, and this is the follow-up, I think this is the 30th anniversary follow-up of the Knudson two-hit hypothesis, so this is a nice paper that talks about this. So basically we have a model where if you lose one allele then essentially cells are vulnerable to becoming malignant, and you lose the second allele and then the cell transforms into a malignant state. In this situation, inactivation of just one allele already leads to a malignant phenotype, and then if you lose the other allele it gets even worse. And then you have essentially even small changes, an allele dosage effect, where we have severity of disease increasing. And then you have a situation where, if you lose all of it, that won't get selected for because that becomes lethal to the cell, and so that's called quasi-insufficiency. So the action is actually not in the total loss but in the partial loss. Okay, so we are going to resume here, and I'm actually going to go a lot faster now. If you have something that's really burning and you can't wait, go ahead and ask it, but I think we'll reserve questions, in the interest of time, until the end. Okay, so now let's talk about how to measure these events. We've talked about their biological consequences, we've talked about how they're observable in different tumor types, but what I wanted to do now is just briefly go over some of the technologies we can use to measure these, and they really go from very low resolution to high resolution, and
although resolution can be defined in different ways. So this is fluorescence in situ hybridization, and what this shows is essentially a design of fluorescently labeled probes that can actually be incorporated into the nucleus of individual cells, and then through microscopy one can look at the number of copies of a particular probe within individual cells. So while this is low resolution from the perspective that you can only really do this for a couple of loci, because it's labor intensive, it's very high resolution in terms of looking at individual cells, and ultimately, in terms of cancer evolution, the cell is the unit of selection. So this is actually really quite powerful. Fluorescence in situ hybridization, how many people have worked with that type of data before? A few. So this is widely used in pathology labs. So then the advent of the human genome led to the ability to design, generally speaking, BAC probes, bacterial artificial chromosomes, that can be used to probe literally 100 kb chunks of the genome. This really came to prominence in the early 2000s, and through the mid 2000s this was the method of choice: array comparative genomic hybridization. And so the resolution is somewhere between 30 and 100 kb. The advent of genotyping arrays started sort of in the mid 2000s, and this has been really the dominant platform, I would say, for the last 5 years. We can probe up to 2 million positions in the genome, and so the average resolution is somewhere around 1.5 kb, and all the data that I showed you, for the most part, was done on these types of arrays. And here's just what it looks like. This is Affymetrix, and Illumina has a version and Agilent has a version as well. And then finally we get to nucleotide resolution through whole genome shotgun sequencing, and this will basically supplant everything else, with the exception of maybe FISH, if we just want to do FISH. But as
sequencing technology dominates and becomes cheaper and more affordable, we're almost at that point. The gap here is still about an order of magnitude, so it costs 500 bucks to do a SNP6 array and it costs $5,000 to do a genome, but that gap is going to be closing very fast. Okay, so I think you've gone over array hybridization, probably with Paul, but I just wanted to show you how this manifests in the copy number space. You looked at this in expression space, and this is how it looks in the copy number space. So here you have a chromosome. This is low resolution, this is BAC array technology. We have a diploid region here, where relative to some control you have very little change, so the data points are clustered around zero, and each data point here represents the hybridization intensity of the sample over the control. And then here you have a region where essentially you have a loss, and that gets manifested as a negative number in terms of the log ratio. So any data points above the zero line are gains, neutral regions cluster around zero, and losses are represented by negative numbers. So this is essentially what each data point represents: it's the copy number of a particular probe, here called a clone, on a particular chromosome, over the reference or the normal control. So the high-density genotyping arrays brought us to more than 1 million positions, and the key advance here as well is that this technology allowed us to measure major and minor alleles separately. That's really a key distinction between the SNP arrays and array CGH. So with array CGH you could not measure copy-neutral loss of heterozygosity, for example, because all you would see is that it's copy neutral. The loss of heterozygosity part you couldn't measure because you weren't measuring individual alleles, and so that was the big advance with the genotyping technology. The original motivation for this was genome-wide association studies for inherited SNPs associated
with human disease, and this was a massive activity in the previous decade, where literally tens of thousands of patients were profiled using these technologies, and I think to some extent progress was made in terms of understanding the inherited basis of disease. So this was the original design, but then it quickly became apparent that this could be applied in cancer as well, to look at regions of copy number gain and loss and loss of heterozygosity, and so all this analysis is readily amenable to cancer-based studies. So I just want to spend a bit of time on this. You may have talked about this already, but I don't think this can be overstated: a lot of the technology that we bring to bear on cancer samples has been developed with normal genomes in mind. So in genome-wide association studies you expect essentially most of the genome to be diploid. For sequencing-based studies, a lot of the tools that have been generated are from the 1000 Genomes Project, where the assumption is that you have essentially one sample that you're looking at and it's quiescent, it's normal, and the genomes of all the cells that have given rise to that DNA pool are homogeneous. Well, that's just not the case in cancer. In our cancer samples the pool of DNA has significant normal contamination, because it's often impossible to isolate cancer cells exclusively, and so this will result in dilution of tumor-specific signals. And we have what's called intra-tumoral heterogeneity, or clonal heterogeneity: generally speaking, tumors are made up of clonal populations of cells with different genomes. And so essentially what you're sequencing is a mixture of cells with different genomes, and so you're getting one signal that's aggregated from a heterogeneous mixture. And really most experimental designs consist of a single sample from a tumor, an aggregate sample from a pool of DNA. So we have to just bear
this in mind, that this is a difficult problem. The signals that are generated from somatic changes are generally quite distinct from germline polymorphisms, and that's because germline polymorphisms will be present in every cell in the sample, whereas somatic aberrations will be manifested only in the tumor cells. And then we have the issue that sometimes we have genome-wide duplications or polyploidy in our tumor cells, so we have maybe three copies of the whole genome or four copies of the whole genome in these tumor cells. So all of these issues, and actually probably others, are what give rise to the measurement that we observe, and we tend to actually just gloss over this stuff and ignore it, and so everything that you do is essentially an approximation that ignores most of this. So we have to bear that in mind. In all the analysis that you're going to do in the lab today, you're ignoring all of these facts that we know to be true. So basically I think the point here is that specialized analytic tools for cancer are underrepresented and badly needed, and so generally what people do is they take what's available and try to apply it, but it's a bit like taking a hammer to a screw. Okay, so there's a nice review of statistical considerations for high-density single nucleotide polymorphism arrays in this paper here. It's a book chapter, I think, or a handbook. It's a very nice review of some of the issues and I recommend you read it. This is, generally speaking, the workflow for how we analyze high-density genotyping arrays, and this is really with Affymetrix. So if you take Affymetrix, coming off the machine you get a file called the CEL file, and that represents all the measurements of the major and minor alleles of the SNPs, and then actually half of the probes are just for CNV, so they don't have major and minor alleles, they just have total copy number. And so what we do is the following. So there's some pre-processing and normalization. Just due to
time, I don't think we're going to go into that in detail in the lab, but you need to know that this is an important step. Then what we do is total copy number extraction, we do the B allele extraction, and then we do what's called segmentation, so we look for breakpoints in the data, and I'll show that in detail. And we look for loss of heterozygosity and allele-specific copy number changes, which is what I showed earlier. And then what we want to do is try to consolidate all this data and try to make sense of it in terms of genes and pathways and clinical correlations. So that's probably a workflow that you may have seen before. How many people have done something like this, or want to do something like this with their data set? Okay, good. So let's just look at the specifics of Affymetrix SNP6. So the probes here are 25-mer oligonucleotide probes. There are about 900,000 SNP probes and 900,000 CNV probes, and what we get out of this is essentially hybridization intensities, so how much the probe lights up. The intensity of that lighting up represents, in some proportion, how much DNA is at that particular locus. Hopefully this link is still active. If it's not, you can probably just Google the chip definition file. This has all the gory details of the platform that you could ever want. Yeah, so the SNP probes have the major and minor alleles, and the CNV probes just look at the total copy number. Yeah? Why 900,000? Well, so it's just to add to the resolution. So these 900,000 SNP probes are really optimized for super unambiguous, very specific SNPs that would result in minimal cross-hybridization problems, and then there are other regions in the genome that don't have SNPs but are good for copy number analysis. So this is essentially to pad the data. Yeah? Does it cover all 21 thousand genes of the genome? That's a good question. Actually, Gavin, do you know how many genes are represented?
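The total copy number and B allele extraction steps in the workflow above can be illustrated with a small sketch. It assumes idealized, noise-free allele intensities and a matched reference intensity per probe; the function name `total_and_baf` and the four example probes are hypothetical and are not the output of any real array pipeline.

```python
import numpy as np

def total_and_baf(intensity_a, intensity_b, reference_total):
    """Per-probe log2 ratio of total intensity and B allele fraction.

    intensity_a, intensity_b: allele A and allele B intensities at SNP probes
    reference_total: total intensity expected for a normal diploid reference
    """
    a = np.asarray(intensity_a, dtype=float)
    b = np.asarray(intensity_b, dtype=float)
    ref = np.asarray(reference_total, dtype=float)
    total = a + b                      # total signal ignores which allele it came from
    log_ratio = np.log2(total / ref)   # 0 = neutral, > 0 = gain, < 0 = loss
    baf = b / total                    # 0.5 = balanced heterozygous
    return log_ratio, baf

# Four idealized probes, all heterozygous (AB) in the normal sample:
#   1. unchanged diploid AB
#   2. deletion of B (A only): loss plus loss of heterozygosity
#   3. allele-specific gain (AAB)
#   4. copy-neutral loss of heterozygosity (BB)
a = [500.0, 500.0, 1000.0, 0.0]
b = [500.0, 0.0, 500.0, 1000.0]
ref = [1000.0, 1000.0, 1000.0, 1000.0]

log_ratio, baf = total_and_baf(a, b, ref)
for lr, f in zip(log_ratio, baf):
    print(f"log2 ratio {lr:+.2f}, B allele fraction {f:.2f}")
```

The fourth probe is the case that total copy number alone cannot detect: its log ratio is neutral, and only the B allele fraction shows that heterozygosity has been lost. Probes without allele information would contribute only the total intensity.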
We get measurements for most of them. Yeah, but it's not evenly distributed. So actually there are some regions in the genome that are just repetitive; a lot of the genome is repetitive. But there are kind of holes in the design here, so there are parts of the genome that are just not represented at all, and if there are genes in there, then they won't be represented.
So what's the resolution of SNP6 compared to a platform like Agilent? Yeah, so generally I think Agilent has... actually, I'm really not sure about Agilent. Illumina has a 1M. Does anybody know the resolution of the latest Agilent? 60-mers? Yeah, but how many probes? The biggest is 1M. Yeah, so I think SNP6 is the most dense because of the CNV probes. Okay. This one is much better distributed across the genome; you can also do customs. I'm sorry? Yeah, well, the reproducibility of these arrays is really astonishingly good. So we're going to hold questions to the end, I think. Okay, so we've got a lot of material to get through in 20 minutes.
So, general preprocessing. Normalization is definitely required to reduce platform-induced artifacts. What we use in our lab is aroma.affymetrix, and we use a method called CRMA version 2. This, generally speaking, in our hands outperforms commercial software. It's transparent, it's open source, you know what they're doing, you have the code, you can even manipulate it if you want. It's all good. And so I would strongly recommend that if you're using Affy SNP6 you become familiar with this package. What it outputs is allele-specific and total copy number real-valued data.
So one of the issues that happens is we get what's called allele crosstalk, and that is when the probe for the major allele or the minor allele mis-hybridizes to the other. And what this shows is the correction of that. So here's the major and the minor, and what you should see is that the density of these signals should be really vertical or horizontal. And so if you are off of those axes
that means that there's cross-hybridization. So it means that if you're probing for the minor, then the major is what's sticking there. And imagine there's only a single nucleotide change in a 25-mer, so it would be easy to get crosstalk, and this does happen, but there are methods to correct for that. So after normalization, this is what that looks like.
There are other issues in terms of GC content, and actually the digestion of the DNA, through restriction fragment length, yields different hybridization intensities. So these are just properties of the genome that yield variation, and that's not desirable. All our variation should be contained within the biological signal that we're trying to measure; we don't want the signal to be diluted or confounded by other properties of the genome. And so there are normalization techniques to adjust for these.
And then here's just the density plot of what all the probes look like, where this is the log intensity here and this is just sort of like a histogram value, if you will, for intensity. And you can see that this is basically the same. This is a population of HapMap individuals, so these are all diploid normal individuals, and in order for them to be comparable you want them to have the same shape in their histogram. This is before normalization, and this is what it looks like after normalization, so these are much more comparable to each other than here. I'm sure Paul went over this with expression arrays as well.
So let's talk about how we infer genomic features. What we're interested in doing is finding total copy number, and we're interested in loss of heterozygosity and allele-specific copy number alterations. So just by way of notation, we let y_Aj be the intensity of the A allele at a given position j, and we let y_Bj be the intensity of allele B at that position, and then the total intensity is just the sum of those. And then, with some normalization constant, we look at the total copy
number at a given position, which is given by the total intensity over the total intensity of the reference, normalized by some value. And then finally we have the B allele fraction. This is important: this is basically the B allele intensity over the total intensity. Okay, so those are just terms that we'll use going forward.
Okay, so from signal processing to copy number. We did all that, and let's say we get some sort of representation of the total copy number for every position. So what do we do after that? Here's just an example of a tumor and a matched normal, and this is work from Gavin. What we have shown on the bottom is what the total copy number of the matched normal looks like, and then up here we have the profile of the tumor. And you can see that there are significant changes there that are present in the tumor and not present in the normal. Here we have a deletion, here we have an amplification shown in red.
And then one thing to watch out for: looking at this in the context of the normal is very important. So here what we have is what's called a copy number polymorphism, or a germline copy number variation. This is a region of the genome that is obviously probed, because in the general population this part of the genome should be present, but in this individual it's just not there, and that's just due to inherited variation. And you can see that's manifest in the tumor sample as well. And that's important because if we didn't have the matched normal and we looked at this signal, we would say, holy cow, that looks like a homozygous deletion, I wonder what gene is there, and then do all kinds of follow-up experiments and make mouse models and look at the function and say, oh, it didn't do anything. So that's not good; we want to avoid that. So trying to find these signals is extremely
important, because we don't want to be distracted by these. And so Gavin's done some interesting work in trying, just from a tumor sample, to distinguish these germline polymorphisms from the somatic changes that we're interested in. I think we're going to hold questions because we're really running behind, but you can ask me after. Just another question.
So now we can think about allelic imbalance. I've already gone over this with sequencing data, but here are just some examples from SNP6 data where we have very clear signals of loss of heterozygosity going on here and in both these regions here. So this is what it looks like when you have a diploid region: you have these three bands that represent homozygous major, homozygous minor, and heterozygous. And basically, in regions where you have loss of heterozygosity, you lose this nice heterozygous band and you get shifting away. So these are really the signals that we're trying to capture.
So some approaches that you will encounter are really different algorithmic paradigms for this. You can do smoothing, which is essentially to try to fit a smooth curve to the data to get rid of some of the noise that's inherent in each of the samples. You can do what's called segmentation. We can employ mixture models, and I won't dwell on this because this really doesn't work very well. And then we can use what's called a hidden Markov model approach, and this is really established as the gold standard in terms of algorithmic and statistical modeling in the field; this data is very amenable to being processed with hidden Markov models. There are a couple of nice review papers on array CGH that cover some of this that I've listed here.
So let's just look at two of the main algorithms in detail. A nonparametric approach, where essentially there are almost no free parameters, which is really quite attractive, is called DNAcopy
and this really comes from the work of Adam Olshen. He first published this algorithm way back in 2004, and there are some nice R packages that you can work with, and it's readily usable on SNP6 data. In my work we have an approach called HMM-Dosage, which works with SNP6 data, and it's an extension of an original algorithm I published in 2006. This is also readily available. Are they using this in the lab, Gavin? No? Okay, so they're not using this one, but there's yet another modified version that we adapted for sequencing data that you will use in the lab today.
Okay, so just briefly, the DNAcopy algorithm from Olshen et al. Essentially what it does is output change points in the data. So if you look across this data, sorry, it's kind of blurry and noisy, but it's really quite easy to see with the human eye that if you scan across here, at some point there's an abrupt change in the intensity of the data, and that's what we call a change point or a breakpoint. So from an algorithmic perspective, what we want to do is find these breakpoints that signify abrupt changes in the data. And conceptually, really, that's all you need to know. The computer scientists in the room may want to ask me more about that, but that's the real concept: where along my chromosome does the data intensity change abruptly, because that signifies a copy number change. Okay, does that make sense?
And the key concept here is that we want to minimize the within-segment variation and maximize the between-segment variation. So if you think about the mean of this segment, we want to make sure that it is as far away as possible from the mean of this segment, and then the variation, the standard deviation, within this segment should be as small as possible, and likewise here. So you can imagine that if you had this whole segment here, if you consider this whole region, the
standard deviation would be quite high, because there's a lot of variation between the probes in this segment and this segment. If you treat them as separate segments, then the standard deviation within each segment is minimized. Okay, so I'm actually not going to go into the details of this because I don't have time.
But to contrast that, the concept of a hidden Markov model is that the segmentation is important, but it requires some downstream interpretation. With the DNAcopy algorithm you can find the breakpoints, but that's all you can find. You don't know whether the segment is actually a copy number gain or loss, or how much of it is gained, and if it's only a small gain, is it actually a gain or is it just some subtle variation in the data stream? And that's where hidden Markov models come in, because what they do is simultaneously segment and classify the segments. And I think the main concept, and I want to make sure it's clear, is that the segmentation in this case helps with the classification and vice versa. It's really done in an iterative fashion, where we segment, then we classify, then we segment and classify, and we learn which characteristics of the data give better classifications at each iteration. We use a framework called expectation maximization to do that.
And essentially we can assign semantic meaning to the states, and this is the real advantage here: the output actually has these semantic meanings, so we have regions of loss, regions that are neutral, and regions of gain. The other advantage is that it's a probabilistic framework, such that we can get the probability that each probe is a loss, neutral, or gain, or even expand the state space so that we have homozygous loss, hemizygous loss, neutral, gain, and then maybe high-level amplification.
Okay, so there's a lot of literature out there, a lot of tools that have been developed over the last, I would say, 10 years now, or 8 years, that have
really evolved and advanced how we think about hidden Markov models in the context of looking at copy number changes, and I've just listed some of them here. In that paper from Terry Speed, I actually have included a summary of some of the methods that are employed. You can see it's really dominated by hidden Markov models; this is really the way that the field is going. Okay, so I think I'm just going to skip over this stuff.
So one of the things that we're going to do in the lab: let's say you've got your dataset, you've worked hard, you've accrued the samples, your really precious materials, you've run the assays on your precious material, you've run an algorithm to actually segment it. Then what do you do? Well, one thing, and I think it's very important, is you always want to visualize your data. You always want to try to plot and browse and get a feel for your data.
I was renovating my house once, and I had this guy come. He was this kind of old retired teacher, just a very good, wise gentleman, and he was doing all the work and I was being his lackey. So we're fixing the walls and putting putty in the walls where all the stuff is, and then we're sanding it down, and he says, okay, well, when you're done, just run your hand over the wall. Just close your eyes and do this with your hands, and you'll find the places that aren't perfect. And this is what you do; this is like browsing data. You get your hands on the data, you look at it, and you find out things about it, and this is essential. So think about renovations and painting walls when you've got your dataset.
So this is what the ERBB2 amplicon looks like in IGV in a thousand cases. You pull your thousand cases of data into your browser and you type your coordinates into the IGV browser, and you see that there is a lot of red around the ERBB2 locus. And what red means here is high levels of
copy number gain. And so you can spend time and just go through each chromosome; you can look for regions of red, or you can look for regions of blue that represent deletions, and that gives you an intuitive feel for what may be present in the data. This is the RB1 locus, where we have some very focal homozygous deletions affecting the RB1 locus.
Okay, so in the last few minutes I'll just talk briefly about some more advanced topics. So, analysis of next-generation sequencing data. This is really still in development. We've matured our code base with respect to this, and I think it works well enough in our hands now that we've even published some important results using a set of tools that we've developed, and we're working on publishing this method, but it's still relatively in development. A couple of things that we noticed: there's an extreme bias when doing copy number analysis in sequencing data that is correlated to GC content. That's because in the data generation process there is a PCR step, and PCR definitely has its biases with respect to GC. The other thing that we notice is that there's a bias with respect to repetitive and non-repetitive regions in the genome. When you align the reads to certain parts of the genome, you'll get what look like amplifications or deletions depending on how mappable it is.
So this is what it looks like if you've binned the data and you try to just assess what is my coverage in a given 1 kb window. It's a bit of a mess, but it looks something like this. If you account for GC content, it starts to get cleaned up a little bit; it looks something like this. And if you account for mappability, then it starts to become very clear what's going on. So you go from this very soupy mess to this very clear representation, and then you can take that pre-processed data and run it through an HMM and segment it into your copy number states. You'll be doing this in
the lab, and this is a tool called HMMcopy.
We can also look at allele-specific changes in next-gen sequencing data, we can try to account for normal contamination, and ultimately, hopefully, account for intratumoral heterogeneity. Gavin and I recently published a paper in Genome Research that essentially does the analysis of allele-specific events in NGS data, and this is from his paper. Basically what we showed is this: these are the results of doing that analysis from an array, for a tumor sample that we profiled with an array, and this is the result we get from taking that same tumor sample and using the sequence data that we generated to analyze it. Blue here represents copy-neutral LOH regions, green represents deletion-induced LOH regions, and red represents allele-specific copy number changes, and you can see that we recapitulate what's found in the array quite nicely. I think you're doing APOLLOH in the lab too, right? Okay, so this is a tool called APOLLOH.
Okay, so I just want to finish now with a couple of really advanced topics. I guess early last year, the first observation of something called chromothripsis was published by Peter Campbell and Mike Stratton in Cell. What they found is this crazy pattern of what's called shattering, followed by non-homologous end joining, and what it creates is this incredibly complex chromosome where you have losses and gains of genetic material in a very kind of sawtooth pattern, and then all kinds of rearrangements as well. It's like taking a jigsaw puzzle, smashing it to the ground, and then putting it back together in some other order. And so this was being touted as a mutational process, or a way by which cancers obtain abnormal karyotypes, and it really gained prominence with the publication of a neuroblastoma study. This group observed that chromothripsis is quite
prevalent in neuroblastoma, and then really found that there was a strong association with outcomes. Patients that exhibited chromothripsis had a very poor prognosis and didn't do well at all, whereas those that had no evidence of chromothripsis seemed to be much better off. And so this has a prognostic effect in some cancer types.
So the last topic I wanted to touch on is intratumoral heterogeneity. This paper was published last year in Nature, Navin et al., from Mike Wigler's group, and what they basically showed is that single cells from one tumor can exhibit really heterogeneous copy number profiles. And so this idea that we're taking a single tumor sample, running it through an array platform or sequencing platform, and coming up with one profile for that tumor is probably very much erroneous. That's just a concept that we'll probably have to get past. It's what we can do, generally speaking. Not everyone can do this single-cell analysis; maybe a couple of labs can, that's about it, and it's very expensive. You can imagine that if you do 100 profiles for each tumor, now suddenly the cost of your study is two orders of magnitude higher than it was before. So this is important but maybe impractical, but we must consider that this is the case: when we're dealing with a tumor sample, we're dealing with mixed populations of cells.
So the genome architecture is really a fundamental, important aspect of studying the cancer genome. Somatic copy number alterations change gene dosage and can drive expression of oncogenes and tumor suppressors. Copy number alterations can be measured using array-based hybridization or next-gen sequencing. And really, you all know this, but properties of the genome revealed through copy number profiles can indicate phenotypic characteristics of cancers, and so they're an extremely important part of the genomic landscape, and any study that involves the genomics of tumors must
consider the genomic architecture as defined by copy number changes. So I'll leave it there.
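To make the notation from the array part of the talk concrete, here is a minimal Python sketch of the three quantities described there: total intensity, total copy number relative to a reference, and B allele fraction. The function names, the example intensities, and the choice of 2.0 as the normalization constant (which assumes a diploid reference) are illustrative assumptions, not part of aroma.affymetrix or any other package mentioned in the lecture.

```python
def total_intensity(y_a, y_b):
    """Total intensity at one position: A allele plus B allele."""
    return y_a + y_b


def total_copy_number(y_a, y_b, ref_total, scale=2.0):
    """Total intensity over the reference total intensity, times a constant.

    scale=2.0 is an assumption: it maps a sample that matches a diploid
    reference to a copy number of 2.
    """
    return scale * total_intensity(y_a, y_b) / ref_total


def b_allele_fraction(y_a, y_b):
    """B allele intensity over the total intensity."""
    return y_b / total_intensity(y_a, y_b)


# Hypothetical positions: (A intensity, B intensity, reference total intensity)
examples = {
    "heterozygous, diploid": (1.0, 1.0, 2.0),
    "one copy lost (LOH)": (1.0, 0.0, 2.0),
    "one extra A copy": (2.0, 1.0, 2.0),
}
for label, (y_a, y_b, ref) in examples.items():
    cn = total_copy_number(y_a, y_b, ref)
    baf = b_allele_fraction(y_a, y_b)
    print(f"{label}: copy number {cn:.2f}, BAF {baf:.2f}")
```

In this toy setup a heterozygous diploid position sits at a B allele fraction of 0.5, and the fraction shifts away from 0.5 once one allele is lost or gained, which is the shift of the heterozygous band described for the LOH plots.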
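The change-point idea from the segmentation part of the talk, minimizing the within-segment variation, can be sketched as a search for a single breakpoint. This is not the DNAcopy algorithm itself (DNAcopy uses circular binary segmentation, tests whether a split is significant, and finds multiple breakpoints); it is a simplified illustration in Python, and the function names and log-ratio values are made up for the example.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean (within-segment variation)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_breakpoint(values, min_size=2):
    """Find the split that minimizes total within-segment variation.

    Returns (index, cost), where the segments are values[:index] and
    values[index:]. The index is None if no split improves on treating
    the whole region as one segment.
    """
    best_index = None
    best_cost = sum_sq_dev(values)  # cost of a single, unsplit segment
    for i in range(min_size, len(values) - min_size + 1):
        cost = sum_sq_dev(values[:i]) + sum_sq_dev(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


# Hypothetical log ratios: five neutral probes followed by five gained probes
log_ratios = [0.02, -0.05, 0.01, 0.03, -0.02, 0.61, 0.55, 0.58, 0.63, 0.57]
index, cost = best_single_breakpoint(log_ratios)
print("breakpoint index:", index)
print("within-segment cost after split:", round(cost, 4))
print("cost as one segment:", round(sum_sq_dev(log_ratios), 4))
```

Treating the whole region as one segment gives a large sum of squared deviations, while splitting at the abrupt change gives two tight segments. Note that the output is only a breakpoint: it does not say whether either segment is a gain or a loss, which is the gap the hidden Markov model approach is meant to fill.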