 Great, so welcome everybody and to day three, this is day three. Hopefully you've had a good workshop so far. So my name is Rob Shaw and my lab is in Vancouver at the BC Cancer Agency. So I'm a scientist at the BC Cancer Agency and cross-appointed to the Department of Pathology and Lab Medicine at the University of British Columbia. And I also have a cross-appointment to the Computer Science Department at UBC as well. And my research program really sits at the interface of computational and statistical methods development, but really with a view that all that activity underpins activities in cancer genomics is studying tumor biology. So we develop using state-of-the-art methodology and statistics. We try to model datasets that you'll be working with over the course of the week. And really biology has changed to a point where we now have the capacity to generate incredible vast amounts of measurements and it's become a quantitative science as a result. And I'm sure that's why you're all here and you understand that. But often what happens is technology to generate data really comes first and methods to analyze the data come behind. That's usually how it works. And so I try to focus on really trying to extract the most out of datasets that are generated from things like next generation sequencing devices to maximize the biological output of that data. So there's a huge amount of investment going into actually generating the data. And so to optimize that investment, one has to use and develop the appropriate analytical tools to learn the most biological value and gain the most biological value out of that data. So over the course of today we'll be talking about two types of features we can extract from interrogating the genomes of cancers, copy number variations and single nucleotide variations. And so we'll go over some methodological aspects and also some conceptual points that I hope you'll take home with you. 
And then to accompany the material that I'll deliver in the lectures, Andy, who works with me in the lab, is going to run you through two practical exercises to really get your hands dirty with actually working with the data and predicting the features that I'll be talking about today. Okay, so this is the outline for this morning's lecture, copy number variations. I think I'll just jump right into it. So we have until, Michelle, we have until what time for this? Okay, good. Okay, so let's just start out with this picture. So how many people have seen pictures like this? Yes, no? Yes? Okay, so this is just a normal human karyotype. And what this shows is that the DNA in our cells is nicely organized into 23 pairs of chromosomes, 22 pairs of autosomes and a pair of sex chromosomes. And so you really inherit one copy of your genome from your mother and one copy from your father. Okay, so this is just the level playing field from which we all start. And all of our normal cells, with the exception of some blood cells in the immune system, essentially look like this. So I want to take you back to nearly 100 years ago to Theodor Boveri, who was an eminent biologist at the time. And so he had a hypothesis while studying sea urchins. He noticed that some of the cells became essentially malignant and had uncontrolled growth properties associated with them. So he says that we start with the assumption that the qualities of malignant cells have their origin in a defect that exists within them. And the thing about sea urchin cells is that they have very large nuclei. And so he was able to really look at these nuclei without very high-powered microscopy. So he noticed that when cells achieved this growth and proliferation phenotype, they had aberrant chromosomes. They had an extra copy of one of the chromosomes. So that led him to this hypothesis. So the main point about this is that he started to think that the culprit behind human malignancy maybe originated in the genetic material of human cells. 
And of course he was proven right. In 1960 we had the discovery of the Philadelphia chromosome in CML by Peter Nowell. And so that was really the first time that a human malignancy had been associated with a change in the structure of the genome. So here's just an extreme example then of one of the most highly disrupted diseases, with the most disrupted karyotypes. It's high-grade serous ovarian carcinoma. And this is a slide given to me by David Huntsman. And so you can see here that instead of this nice organization of the genome where we have two copies, one from maternal and one from paternal sources, we have some chromosomes with extra copies. This is a really bad pointer. Is there another one? Oh, there's another one here. Okay, that's better. Yes. So we have some that have extra copies. We have some chromosomes that are entirely deleted. And then we have chromosomes that are mixed from two different original chromosomes that have come together through translocations. And so we're going to focus today on the types of events that yield extra copies of certain regions and also parts of the chromosomes that have been deleted. And these are called copy number variations. So these are really losses or gains of genetic material. They're also known as segmental aneuploidies. And what that means is, if you think of the genome as a linear object, there are segments of the genome that are either duplicated or deleted. And we have really several different classes, and these can be germline or they can be somatic. Okay. And so when we look at the profile of populations of tumors, this is the genomic landscape of breast cancer. The x-axis is just a linear ordering of the genome here by chromosome. And the y-axis is the percentage of patients that exhibit an abnormality in that particular region of the genome. 
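As a rough illustration of how a landscape plot like the one just described could be computed, here is a minimal Python sketch. It assumes each patient's genome has already been reduced to a per-bin call of loss, neutral, or gain; the -1/0/+1 coding, the function name, and the toy cohort are hypothetical and are not taken from the study shown on the slide.

```python
def alteration_frequency(calls):
    """Percentage of patients with a gain or a loss in each genomic bin.

    calls: list of per-patient lists, one entry per genomic bin in
    chromosome order, coded -1 = loss, 0 = neutral, +1 = gain
    (a hypothetical coding chosen for this sketch).
    """
    n_patients = len(calls)
    n_bins = len(calls[0])
    gain_pct = [100.0 * sum(1 for row in calls if row[j] > 0) / n_patients
                for j in range(n_bins)]
    loss_pct = [100.0 * sum(1 for row in calls if row[j] < 0) / n_patients
                for j in range(n_bins)]
    return gain_pct, loss_pct


# Toy cohort: 4 patients x 5 bins (made-up values).
toy_calls = [
    [1, 0, -1, 0, 1],
    [1, 0, -1, 0, 0],
    [0, 0, -1, 0, 1],
    [1, 0, 0, 0, 1],
]
gain_pct, loss_pct = alteration_frequency(toy_calls)
print(gain_pct)  # y-axis values for the gain track
print(loss_pct)  # y-axis values for the loss track
```

Plotting the two lists against bin position, with gains above the axis and losses below, would give the kind of frequency landscape shown on the slide.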
We can see that the majority of the genome has some level of disruption in at least some minimum number of patients in the population, dominated by, for example, amplifications of chromosome 1q, deletions of 8p, et cetera. And if you look over here on 17, this is the classic HER2 amplicon right here. Okay. And so you can see that this is really just a major feature of this disease, that the genome is highly disrupted. And if you were to look at a normal population, you might see very small regions of the genome that have variation that basically just leads to humans being different from each other. But these are really a feature of the disease state. So let's just simplify that for a minute. And we can think of a few different types of events. And so let's just say we have a region of the chromosome here. And contained within this region are genes A, B, and C. We could have a deletion, where the part of the genome that contains gene B becomes deleted and is no longer there during a cell replication cycle. We could have a duplication of just one gene. So we could have a localized duplication of gene A here. Or we could have a whole section, called a segmental duplication, that generates an additional copy of a set of genes that are essentially in tandem. And so the consequence of that, as you can imagine, is that in a deleted region, if there's a tumor suppressor in that region, where the role of that gene is to guard against growth and proliferation or to execute DNA repair, we can imagine how the loss of that material would result in no translation of that protein, and therefore that function is lost in the cell. And so that cell can gain a new phenotype. 
And then conversely, if you have copies of the material that are gained, so we have extra copies of a gene whose job it is to promote growth and proliferation, then there'll be more copies of that protein around, and that can trigger a phenotypic change that would result in growth and proliferation. And so in cancer, looking for copy number variations can lead us to understand the biology of the disease state, and ultimately to identify targets for diagnostics, prognostics, and therapeutics, and to really understand the association of biological phenotypes with genetic abnormalities. So CNVs can be broken down into three major categories. There are congenital abnormalities, which are usually germline mutations, and really the classic event that everyone can identify with is trisomy 21. So this is three copies of chromosome 21, which leads to Down syndrome and also intellectual disability. And then we have somatic alterations, and these are associated with post-germline changes. These are acquired mutations that are tissue specific, and it's really a hallmark of most if not all cancers that they exhibit some level of disruption in the genome. And then we have benign variations which are, as I said, just polymorphisms that are naturally occurring in the human population. And it really wasn't until about five or six years ago that we appreciated the level to which structural changes exist as a source of variation in the human genome. It wasn't until high-density arrays came on the scene that we were able to profile large populations of humans and realize that upwards of 10 to 15% of the genome is subject to variation at this level. 
And so through the populations accumulated in the HapMap project, for example, where there are populations that were ethnically isolated, we were able to really characterize that there is a significant amount of the genome that's subject to polymorphism at this level, where it was previously only appreciated that this was happening at the level of the single nucleotide polymorphism. That was well known, but the structural variation component was relatively new. So then in cancer, which is what we're all interested in, there are several different classes of events. So we can have segmental aneuploidies. We usually call these large scale. They're often low-copy gains or deletions, and they tend to be broad. And so they can span whole chromosome arms, generally speaking, or they can even be whole-chromosome events as well. So, for example, chromosome 17, which harbors p53, is often a chromosome that gets entirely deleted, and occasionally will then duplicate again. So that's a feature of a lot of the epithelial cancers, a whole-chromosome loss of 17. Then we can have focal copy number alterations, and these are deletions or amplifications of really high amplitude. So usually deletions of this nature are what we call homozygous deletions, where both maternal and paternal copies are deleted, and they tend to target just one or a few genes. And the reason for that is that there's very little tolerance for eliminating both copies of the genome in terms of evolutionary selection. So the viability of cells that have large chunks of their genome where there are no copies left is relatively small, so these tend to get selected against in cancer progression. But targeting very specific genes can result in homozygous deletions or high-level amplifications. And these really can be very good indicators of driver events, and we'll talk about what driver events are later on. And, have you talked about rearrangements? 
You probably did a little bit about that yesterday, yes? Okay, so rearrangements are part and parcel of this mechanism to disrupt the karyotype of the genome. We tend to view these events quite simply because it's very linear from a copy number perspective, but many are actually also translocated, so it's a result of translocations shuffling the genomic deck. And so this is an important aspect, which we won't really talk about in the context of these two labs, but I think you'll be talking about translocations and gene fusions in other parts of the workshop. Okay, so this is really the quintessential copy number amplification in cancer. And the reason why it's so important is that this is the ERBB2 gene, which encodes the HER2 protein. And so this is chromosome 17 of a breast cancer. And you can see there's a very localized amplification shown in red here, where the y-axis is essentially the number of copies that are predicted to exist at this location. And each data point here represents a probe on a DNA microarray. This is a high-density genotyping array. We have a probe approximately every 1.5 kb in the genome. And so this is a fairly high-resolution array, and you can see that there's just a very localized event with a massive amplification of this region. And contained within this region is the gene for this HER2 protein, which is a growth factor receptor that sits on the cell surface. And this is really quite important because breast cancers that were characterized by overexpression of HER2 typically have a very aggressive disease course, and up until recently had the worst outcomes and the highest morbidity. And in the 90s, there was a drug, an antibody, that was developed against this protein, and essentially now this is the poster child for targeted therapy in cancer. So patients come in, they all get tested for HER2, and then patients that exhibit high levels of HER2 will be administered a drug called Herceptin. 
And now the five-year survival rate for patients of this class is much improved due to targeted therapy. So whenever you hear about personalized medicine, this is personalized medicine, and this is the biggest and most successful example that we have today of personalized medicine. There are other examples as well, but this is one that's in routine practice and probably affects the most people. So let's just look at the effect of these types of events on expression levels. What's plotted on the x-axis is the magnitude of the copy number of a particular gene, and then on the y-axis is the expression level, the mRNA level, of that gene in the same patients. So this is from a data set that was published last year from our group, which performed simultaneous measurement of copy number and gene expression on 2,000 breast tumors. And so what's plotted here is a subset of 1,000 of those tumors. And you can see that the colored dots here represent deletions in green, neutral regions in blue, and then gains and amplifications in red and orange. And so there's a fairly nice correlation. This is really two distributions. Here you have the neutral regions that have this kind of variable expression, and then you start to see a pretty tight correlation between the copy number of that particular region and its expression. So this is something that we would assume, that the expression of that particular gene is associated with its copy number. So the copy number in the genome is having some effect downstream on the transcriptional levels of that gene. And so here's another example here, and this is a different region, 11q13. We found some other genes that are highly correlated with the copy number state. So that's a concept that we'll revisit a little bit later as well. So those were amplification examples. This is a deletion example. And now just a zoomed-in look at PTEN on chromosome 10. 
And so here you can see, hopefully you can see this, each one of these rows is a patient, or a specific tumor, and then the profile, where each dot is one of these 1.5 kb probes, is shown arrayed across the chromosome. And this is a zoomed-in region around PTEN. And so what's shown here in dark green are what we call one-copy or hemizygous deletions. And then in the brighter green an additional copy is lost, and these are the homozygous deletions. The dotted lines here represent the boundaries of PTEN. And you can see that these homozygous deletions target PTEN very specifically. And sometimes we have subgenic deletions of PTEN where only a couple of exons are affected. And so this is a mechanism to inactivate PTEN that's well known in breast cancer, and this is just what it looks like when measuring copy number changes at this high resolution. Any questions so far? This, yeah. Okay, so over the last, let's say, several decades, the community has accumulated a fairly large body of knowledge that's ever expanding, and this really has resulted in a set of genes that we know, really across cancers, are affected by copy number amplifications and deletions. So for amplifications, I mentioned ERBB2, and it has a cousin, EGFR. It's part of the same family, on chromosome 7. Lung cancers are characterized by amplifications in EGFR, and some breast cancers as well. The McDonald protein is well known, PI3 kinase, IGF1R, et cetera, and CDK4 and CDK6, which are part of the RB pathway. So these are just a few examples that I have picked out of well-known genes, some of which are targetable by selective therapies and can be inhibited. And then in deletion space, there is a set of tumor suppressors such as RB1, PTEN, CDKN2A or p16, MAP2K4, NF1, et cetera. Of course, the BRCA genes as well. And so there has been, in the last I would say five or six years, a fairly extensive effort to try to profile and discover new genes that are affected by these types of alterations. 
And so a lot of these studies have employed high-resolution copy number arrays to really interrogate the somatic copy number landscape. And I've just listed a few of the papers here. There are many others that have been part of this effort. So if you're interested in this, these are really population-level studies of hundreds of tumors and cell lines to really characterize the landscape of these events in cancer. And I would encourage you to read these. Okay, so I've talked a little bit about oncogenes and tumor suppressors that are targeted by these events. And then here's just a list of events for which we know there are drugs that are approved and in use that can target them. So here's the ERBB2 amplicon that's listed here, mainly associated with breast cancer and a subset of ovarian cancers. And trastuzumab is the scientific name for the Herceptin drug that I was talking about that selectively targets this protein. And there is a class of PI3 kinase inhibitors that will act against amplifications of PIK3CA. Okay, so this is just to illustrate that these events are important to characterize because they're specific to tumor cells. And so inhibiting these amplification events should have high specificity and act in just the tumor cells. Okay, good. Okay, so what I will do now is just go through a study that was published last year, for which we profiled 2,000 breast tumors and used the data to stratify the population into molecular subgroups. So let's just look at that landscape. So if you think about the earlier landscape plot I showed, we had broad changes across the genome. And not all of those events will affect mRNA expression. So I showed you the classic examples on the scatter plots, where we have genes that we were able to isolate that really seem to be highly correlated with expression. And when we overlay just those genes, what you can see is that the landscape gets sharply focused. 
And so we can really look at these peaks. And this is, again, the frequency in the population on the y-axis. And then the genes are just ordered according to how they appear in the genome on the x-axis here. And so this amplicon here is 8p12, and this is well known and has FGFR1. And in fact, what our work led to was the discovery and characterization of ZNF703, which we now show, if you overexpress it, will definitely lead to malignant properties. And so this is a new driver gene in cancer. Here we have a region on 11q13 that's characterized by CCND1, and also a second peak; we can see the two peaks. So this amplicon here had traditionally been viewed as a single event. It looked fairly broad because low-resolution technology couldn't resolve it. In fact, that's actually two separate events that are mutually exclusive. And they contain different genes and actually cluster patients differently. And so this led to the identification of a new subgroup of breast cancer that's about 4.5% of the population. And I'll show you in the next slide how that impacts outcome. So that's the amplification landscape. And then on the deletion side, we're able to identify genes like PPP2R2A. Now this takes place in the context of a very broad deletion of the 8p arm. And so, like I said earlier, deletion of 8p is a major feature of breast cancer. It can be anywhere between 30 and 50%. And so identifying a gene that might be targeted by that event is really quite difficult because there are literally thousands of candidates. But what we're able to show is that when you overlay expression, the gene that was most affected is this PPP2R2A gene. And Melissa knows a lot about this pathway, so you could ask her about that. I'll put you on the spot there. And then also we identified what I would call a backseat driver. So what this means is that this is the locus of CDKN2A and CDKN2B. And this is a known tumor suppressor, also known as p16. 
And what we noticed is that this gene MTAP, which is a metabolic gene involved in purine biosynthesis, is almost always co-deleted. And so the implications of that are not yet known. This one comes along for the ride, but its expression and the downstream expression of related genes in the pathway are also affected. And so although the conventional wisdom says that CDKN2A is the target, there's collateral damage here with MTAP, and that suggests that maybe selection is operating on the co-deletion of those two genes. And then MAP2K4 is a relatively newly identified tumor suppressor that's now popping up in lots of different epithelial malignancies. It's essentially a new breast cancer gene that is seen across the subtypes and is seen in different cancers as well. And so this was amongst the first studies to really identify and home in on MAP2K4. Okay, so this is the global view of the whole population. Of course, as I mentioned at the beginning, breast cancer had typically been classified, since it was discovered in the late 90s, into five reproducible gene expression subtypes, which you probably know as the doctrine today. But what we noticed is that even within those subtypes there's heterogeneous response to therapy, and the subtype is not always predictive. And so we tried to take this high-resolution data set, with copy number and gene expression and lots of patients with clinical outcome data, and really try to substratify this population. So we take these measurements of copy number and gene expression together. We found that the data segregated quite nicely into about 10 different subgroups. And so they're shown here in the discovery set; we basically split the data into half discovery and half validation. And these were the reproducible groups that are characterized here. And so here's the HER2 group here. And what's shown here is basically the specificity of the profile across the whole genome. 
And so when you see black dots that are high on this axis that means that that's very specific to that group. And so here's the HER2 group and you can see it's characterized by this very focal amplification on chromosome 17. This is the 11Q13 group that I was talking about that has this is characterized by the amplifications in CCND1. And then we have, for example, this profile here is number 10. This profile here is classically associated with the basal subtype if you know about the gene expression subtypes in breast cancer. And this is the set of tumors that really has the worst prognosis and the most aggressive disease. So then when we overlaid the clinical data and you've seen curves like this yesterday, hopefully, did Anna actually talk about this paper? No, okay, good. All right, so then what we then did is we took these groups that we were able to reproduce from just from a clustering perspective and then we actually ran the survival curves that we learned how to do yesterday. And this is really a unique resource because we had more than 10 years follow-up on most of these patients and so it was an international effort to accumulate resource that large. And so we had fresh frozen tissue and we're able to do nucleic acid extraction for the DNA and RNA as well as have the outcome data to see how prognostic the stratification of the groups actually were. And so I mentioned to you that the worst group... So a lot of these patients were prior to Herceptin, so these were acquired before the Herceptin era. And so the group with the worst prognosis here is the HER2 group. So you can see here that this group has the worst survival and if you were to plot this today I think the curve looks something like this. So there's been a really dramatic improvement in the trajectory of these patients. And then the thing to remark here is that it's really this group here. And this was a surprise to us. This is a group that is... 
would have originally been classed as Luminal Bs which has a fairly... the ER positive tumors which have generally characterized in the populations having fairly favorable prognosis. However, this subset of these patients when taken out to 150 months, for example, have a median survival rate of less than .4. And so this is a... this is a novel group that has a pretty severe prognosis that really splits this... one of the five subgroups, the Luminal B subgroups, into this 11Q13 containing and then the rest. Okay, so looking at the... going back to the importance of copy number alterations, so we likely would not have found these groups had we not looked at the genome. So a lot of the molecular subclassification of breast cancers had been focused on the transcriptome, so gene expression profiling using microarrays. And there's been a rich literature since the mid-90s that profiled this for diseases according to expression profiles. But a higher resolution and a much better stratification can be achieved when looking at the genome and transcriptome simultaneously. And so that's the importance of looking at the copy number space. So then in addition to what we call cis effects, so where the gene of interest is affected by copy number changes and then we look at the expression pattern of that same gene, we can also look at the expression patterns of other genes as well. So is there an association with affected expression at other loci of the genome? And so you can imagine, for example, if you have an amplification of a transcription factor whose job it is to drive a set of genes, then that might have a cascading effect across the whole landscape of that transcriptome. And so what we've plotted then was the correlation of the copy number on the x-axis and the gene expression on the y-axis. You can see that there are regions of the genome that really have a dramatic effect. 
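The cis and trans associations described here can be sketched as a small correlation matrix. This is only an illustrative Python sketch using plain Pearson correlation on made-up values; the gene labels `TF`, `TARGET`, and `OTHER`, the function names, and the choice to report 0.0 when a gene's values are constant are all assumptions for the example, and it is not the statistical method used in the published study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant vector; this sketch
        # reports 0.0 in that case (an assumption, not a standard rule).
        return 0.0
    return sxy / sqrt(sxx * syy)


def cn_expression_correlation(copy_number, expression):
    """corr[i][j] = correlation between the copy number of gene i and the
    expression of gene j across patients. i == j is a cis effect; a gene i
    with strong entries for many other genes j suggests a trans effect."""
    genes = list(copy_number)
    return {i: {j: pearson(copy_number[i], expression[j]) for j in genes}
            for i in genes}


# Toy data for 5 patients (hypothetical genes and values).
copy_number = {
    "TF": [2, 2, 3, 4, 6],       # amplified transcription factor
    "TARGET": [2, 2, 2, 2, 2],   # copy-neutral gene driven by TF
    "OTHER": [2, 1, 2, 3, 2],
}
expression = {
    "TF": [1.0, 1.1, 1.6, 2.0, 3.1],
    "TARGET": [0.5, 0.6, 1.0, 1.4, 2.2],
    "OTHER": [1.0, 0.4, 1.1, 1.5, 0.9],
}
corr = cn_expression_correlation(copy_number, expression)
print(round(corr["TF"]["TF"], 3))      # cis: TF copy number vs TF expression
print(round(corr["TF"]["TARGET"], 3))  # trans: TF copy number vs TARGET expression
```

In the toy data the `TARGET` gene never changes copy number, yet its expression tracks the copy number of `TF`, which is the trans pattern described in the lecture.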
So here a red dot in this matrix represents a positive correlation and a green dot represents a negative correlation. And so where you see... sorry, it's the other way around. A green dot represents a positive correlation. So you'd expect on the diagonal that a large number of genes will have copy number correlated with gene expression in cis. But trans effects, where you see an effect across different regions, will show up as vertical stripes in this diagram. And so when we look at some of these regions, we can then look at the pathways of these genes. And they turn out to be classic tumorigenic pathways like cell cycle and regulation of DNA replication, et cetera. And so what this reveals is that not only does the copy number affect the gene that it's targeting, but it has a downstream effect on entire pathways. And so this is an important concept to really pull out the full biology of these events when taking these high-dimensional measurements. Any questions on this? This may be a fairly new concept. Okay, so up until this point, we've been talking about the copy number as a whole. So we take what we call total copy number, which is the sum of the maternal and paternal alleles in the genome. But of course, we know that these copy number changes can be allele-specific. And so let's just talk a little bit about what that means from the perspective of genotype. So in a normal heterozygous state, you would have, again, a maternal and a paternal copy, or a major and a minor allele. And that would be characterized by AB. And so this is the classic nomenclature where we have one letter per allele. The A represents the maternal and the B represents the paternal, for the sake of argument. And so at each locus in the genome that you're looking at, you can be homozygous for the A allele, you can be homozygous for the B allele, or you can be heterozygous. And so that's with copy number two. When you move to copy number three, the genotype state space increases. 
So one can imagine that, starting with the AB state, there's an allele-specific amplification of the A allele. And that results in the genotype AAB. Or one can imagine that there's a gain of the B allele, in which case the genotype would be ABB. Is everyone following that? Good. And then we can have loss of heterozygosity. So if we were starting with the AB state, then you can imagine that there's a loss of A, and then with three copies, you'd have to have two extra copies of B, and that results in a BBB genotype. And that just continues basically as we go up in copy number. And so typically what we do is we try to class these events by zygosity status. One can be in a heterozygous, diploid region. You can have a deletion-induced loss of heterozygosity, or copy-neutral loss of heterozygosity, which is two copies with loss of heterozygosity. And then as we go up, we have these different classes, and when we get into copy number four or greater, you can have what we call allele-specific copy number alterations. So here just the A allele is amplified many times, but the B allele is still intact. So this has some different consequences. One can imagine that as long as the B allele is still expressed, that biology is still there. And so this is still classed as heterozygous. And the biology might change significantly if you had extra copies but the B allele is actually gone. So this would be amplified LOH. So what does that look like in the data sets that we're interested in? So here what's shown is a figure published in Genome Research last year. And this is actually sequence data. And we'll get into how to process this. But what's shown here is the normal DNA for this particular individual. And what's profiled is every heterozygous polymorphism in that person's genome, found just by sequencing that person's normal DNA and identifying the polymorphisms. 
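To make the genotype states and zygosity classes described above concrete, here is a small Python sketch. It assumes the position was heterozygous (AB) in the matched normal, so that a tumor genotype with only one allele type counts as loss of heterozygosity; the function names and the exact class labels are choices made for this illustration.

```python
def genotypes(copy_number):
    """All genotypes with the given total copy number, written as runs of
    A and B, for example copy number 3 gives AAA, AAB, ABB, BBB."""
    return ["A" * (copy_number - k) + "B" * k for k in range(copy_number + 1)]


def zygosity(genotype):
    """Class of a tumor genotype at a position assumed to be AB in the
    matched normal (labels are illustrative)."""
    n_a, n_b = genotype.count("A"), genotype.count("B")
    total = n_a + n_b
    if total == 0:
        return "homozygous deletion"
    if n_a == 0 or n_b == 0:
        # Only one allele type remains: loss of heterozygosity.
        if total == 1:
            return "deletion LOH"
        if total == 2:
            return "copy-neutral LOH"
        return "amplified LOH"
    if n_a == n_b:
        return "heterozygous"
    # Both alleles present but unbalanced, e.g. AAB or AAAB.
    return "allele-specific copy number alteration (heterozygous)"


for cn in range(1, 5):
    for g in genotypes(cn):
        print(cn, g, zygosity(g))
```

The number of genotypes grows as copy number plus one, which is the expanding state space mentioned when moving from copy number two to copy number three.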
And what you'd expect is that 50% of the reads might suggest the wild-type allele, or the A allele, and 50% of the reads might represent the variant allele. And so the average allele ratio is centered around 0.5. So these are all the loci just on this particular chromosome that are heterozygous polymorphisms. Okay, makes sense. And then in the tumor, at those exact same positions, we profile the tumor and we look at the allele ratio of those positions. And you can see that this centering around 0.5 is totally gone in certain regions. So here's an example of a region that we'll class as having loss of heterozygosity. And so there are some explanations when you look at the total copy number. So let's just focus in on this region here. So here's a region where you have this deviation away from 0.5. And you look in the copy number and, lo and behold, there's an accompanying deletion there. So this is a one-copy loss of one of the chromosomes, and you can see that it has an effect on the heterozygosity of the polymorphisms in that region. Okay, all right. So this can really be viewed as the symptom of this event here. So why is this important? So here's a region where, if we just look at the copy number, you would look at this region and say, ah, okay, well, it's just neutral. It's unaffected. It's got the same number of copies as in normal, and here it's just classed as blue, and this is a neutral region. So this has two copies by our prediction from total copy number. But then when we look at the alleles in that region, you can see that there's also this split away from heterozygosity. So what could have happened here? Any ideas? Right, right, exactly. So there are a number of terms for this. Uniparental disomy, copy-neutral LOH. But it's exactly that. So it really requires two events for this pattern to occur. There has to have been a deletion of at least one allele to make it homozygous, and then that remaining allele has been duplicated again. 
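The allele-ratio patterns just described can be worked through numerically. The following Python sketch computes the expected fraction of reads carrying the B allele at a position that is heterozygous in the normal. The `purity` parameter, which mixes in reads from normal AB cells, is an assumption added for illustration and is not discussed in the lecture; the function name is likewise hypothetical.

```python
def expected_b_ratio(n_a, n_b, purity=1.0):
    """Expected fraction of reads showing the B allele at a position that
    is AB in the normal, for a tumor genotype with n_a copies of A and
    n_b copies of B. purity is the assumed fraction of tumor cells in the
    sample; the remaining cells contribute one A and one B copy each."""
    b_copies = purity * n_b + (1.0 - purity) * 1
    total_copies = purity * (n_a + n_b) + (1.0 - purity) * 2
    if total_copies == 0:
        raise ValueError("no copies present, allele ratio is undefined")
    return b_copies / total_copies


print(expected_b_ratio(1, 1))  # AB, heterozygous diploid
print(expected_b_ratio(0, 1))  # B, one-copy deletion with LOH
print(expected_b_ratio(0, 2))  # BB, copy-neutral LOH
print(expected_b_ratio(1, 3))  # ABBB, allele-specific amplification
print(expected_b_ratio(0, 2, purity=0.5))  # BB diluted by normal cells
```

In a pure tumor sample the one-copy deletion and the copy-neutral LOH give the same allele ratio, so only the total copy number separates them, which is why the two signals have to be read together.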
And so this is copy neutral LOH and really you can only see this by looking at the total copy number and the allele specific copy number simultaneously. And that's essentially what this tool called APOLLOH, which we published in this paper, tries to profile. So there's a third region, yeah. So that's coming in the next slide. Good question. So the third region I want to just talk about here is this one here. So this also has a signal that is not like this signal. It's not centered around 0.5. But it's not quite homozygous either in terms of its spread. And so the explanation here is that if you look at the copy number there's an amplification there, and what's likely happening is that this is an allele specific amplification where both alleles are still intact. So you don't have this kind of extreme profile where the data are centered on the extremes, but there's definitely skewing away, and so this is a symptom of an allele specific amplification. Does everyone see that? So this would be something of the AAAB variety or ABBB. All right. So why do we need to be concerned about modeling alleles in cancer? And this gets to Andy's question here. So if we think about this concept of haploinsufficiency, this is really rooted in the idea from Knudson's two-hit hypothesis in the, I want to say 50s, but I think it's 50s. Is that right? So he established in retinoblastoma, by studying these very rare families, that there was a susceptibility to acquiring retinoblastoma that could be isolated to a genetic abnormality in RB, and then a second hit, which would render that gene homozygously inactivated, would eventually lead to cancer. So one hit leads to cancer susceptibility and then the second hit would actually lead to the full blown phenotype. So then there are other genes where it's enough to just inactivate one copy.
So if we go back to this region, this would be an example where you have a deletion, but we know just by the copy number levels here that one copy of the gene is probably still intact. There are classes of genes that we call haploinsufficient genes, and p53 is one of them, where the loss of one copy is sufficient to induce the malignant phenotype, and so what selection is operating on here is the loss of that copy. And you can imagine that if the other copy is also deleted then the severity of the disease goes up even further. So there's an aggregate effect of losing two alleles but it's sufficient to just lose one allele for that class of molecule. And then in this paper they introduce this notion of quasi-insufficiency, where just a small reduction in expression level starts to induce the malignant phenotype and selection. However, if all of the protein is lost then the phenotype is restored again, so this is what's called obligate haploinsufficiency: there must be at least some of the wild type present for that to work, and PTEN is an example of that. So the only way we can really get at this is by looking at the alleles to really understand the nature of what alleles are still present in the tumor. Doing this copy number analysis can help us do that, and looking at the actual alleles in the context of heterozygosity can start to get at these different classes of molecule that are selected for in different ways. Okay, so 10:30? Okay, I think we're in decent shape so maybe this is a good place to pause and take some questions if you have any.
So what I want to do for the rest of this session is to really go over some of the measurement technologies for how we profile these events in cancer, and they really range from very low resolution up to the highest level of resolution. So fluorescence in situ hybridization, for example, is a way to look at a very small number of loci, where one can actually design probes, and they're usually BAC probes, that can be illuminated inside the nucleus of individual cells. These are fluorescently labeled probes that hybridize to the part of the genome that you're probing, and then just by counting the dots as they light up under the different fluorophores one can see whether there's an amplification or deletion of that region of interest. So here, let's just look at this cell here. The control probe is green and the experimental probe is red, and you can see there are four copies in this cell of that particular locus relative to two copies of the green, and so the inference here is that this is an amplification of that particular region. So this is very nice technology but it's very low throughput, it's very labor intensive. One of the big advantages of this, though, is that one can actually look at individual cells and gain some measure of the heterogeneity that exists within the tumor at the single cell level. So this is classic cytogenetics that is low throughput, old technology, but it has a really, really important advantage in the sense that we can look for specific targets and see what the distribution of that event in a population of cells actually looks like. So one can identify different clonal populations by this technology, and we'll talk about that in the next session. So then array CGH started coming on the scene in, say, the late 90s and early 2000s with microarray technology, where we array in parallel, for example, between 30,000 and 100,000
probes across the genome, and really this was not possible without the human genome scaffold already in hand. And so as the human genome project was completing, these types of technology started emerging where we know where regions of the genome fit, put them into array technology, and start to profile the genome across many different loci in parallel. So the shift from this to this is a giant leap. Here this is really labor intensive, to the point where it's not really practical to do much more than, let's say, 10 loci, but now with array technology that gains orders of magnitude. And then in the early 2000s towards the mid 2000s we started to see the emergence of much higher density arrays, and this was really driven by the idea that one can profile the SNPs in the genome to gain some notion of human genetic variation. What really drove this process was to take individual polymorphisms and design oligos around those polymorphisms, and they tended to be about 25 base pairs, and try to profile as many common SNPs in the human genome as possible. And so the vendors like Agilent and Illumina started making these arrays that could really profile 100,000 up to a million SNPs at once. So really this was very much driven by the idea that we could look at polymorphisms in the genome to study human variation and also associate germline genetic variation with disease, so you've probably heard of GWAS studies, and this drove a huge amount of technology development. But for the cancer community what's very nice about this hybridization technology, of course, is that the quantity of DNA is reflected in these arrays, and so we could start to leverage these genotype arrays to study cancer, and that was really quite nice. And so cancer became quite a big application that maybe people hadn't thought of originally when designing these arrays, but it became a very popular use, to profile tumors in a similar way to what I showed with the big
breast cancer study. And then these days, by the way, I should just give some idea of cost. So to do these genotype arrays is somewhere between, let's say, five to eight hundred dollars a run per sample. And now we're achieving full genomes at nucleotide resolution, where we have literally the three billion base pairs that we can profile, and of course the cost of this, to do a tumor normal pair, is still around ten to twelve thousand dollars. So it's more than an order of magnitude more to look at the whole genome at 30x to 50x coverage than it is to profile a genome with an array. So if you have a large population of tumors that one wants to study for copy number changes, the genotype arrays are probably much more cost effective than to do the genome, because it's just so prohibitively expensive. But of course the advantage of this is that we have nucleotide level resolution of the breakpoints, some of which you may have covered yesterday. Okay, so let's just think about how this works schematically. So we have a probe that we've arrayed on a glass slide, and the fluorescence intensity of the hybridization can be measured just with laser excitation and measurement of that intensity. We go through some image processing step, and then, since we know where each one of these probes should map onto the genome or the chromosome, we can plot the intensity as a function of where these probes sit on the genome and start to measure essentially what the quantity of DNA is at that particular locus. And so again, the more DNA at a particular locus, the higher the hybridization intensity signal, and the less DNA at a particular locus, the lower the intensity signal. So we can really measure the amount of DNA that exists at each one of these loci in parallel, such that we get something that looks like this. Okay, and then if we zoom in, here's an example of a segmental deletion, just localized in this little area here, and this
is what it looks like when zoomed in. These dots again represent the hybridization intensity of the tumor relative to normal, and then here's our deletion. That's pretty clear: this region of the genome, it's gone. Not having very good luck with laser pointers today, but maybe I'll use my mouse instead. Okay, so then the way that we actually quantify this is through ratiometric measurements. We just basically take the copy number, or the intensity, of the clone at a particular locus on the chromosome and divide it by either a matched normal reference that we've hybridized as well, so we can take the ratio against the actual matched normal DNA from the patient of interest, or we can think about a pooled reference, where we might expect the average copy number across the human genome is 2 and so just assume that you have copy number 2. So the matched normal paradigm becomes very much essential for whole exome or whole genome interrogation. Typically for arrays it hasn't been necessary, to the point where the somatic changes can be deconvolved from the germline changes, but that's much harder to do for sequencing. Okay, so that was array CGH, and so then we moved to high density genotyping arrays, where the measurement is usually at about a million loci or more. And again the major and minor alleles are measured separately, and this really offers a key distinction from array CGH, which just measures total copy number, because then we can start to look at regions of loss of heterozygosity as I showed you before. So I covered this already in terms of how these technologies gained prominence: it was really through GWAS, genome wide association studies, for associating inherited SNPs with human disease, and it became a nice application to cancer to profile segmental and the other types of events covered this morning. So let's just talk a little bit about some of the challenges of statistical inference in cancer samples. And I don't know if you've covered this material already. Have you done
this in terms of material processing and normal contamination, that type of thing? You have? Okay, okay, good, alright. So basically in tumor tissue, of course, through increased vasculature and through lymphocyte infiltration and through stromal integration, you have a number of cells in the sample that are probably not malignant, and that can be quite severe in epithelial cancers, to the point where most of the cells in fact won't be malignant, or in diseases like Hodgkin lymphoma where the malignant cells are 1 in 100. So when taking a biopsy one has to be very aware that one is not just profiling or measuring tumor cells. The other aspect is that even within the tumor cells, let's say one can assume that you can isolate all the normal cells away, we're left with populations of tumor cells that are different and they vary, and we'll talk a little bit about that in the next section. So one has to be very aware, when studying epithelial cancers in particular but also liquid biopsies as well, that there will be clonal populations of cells with different genomes. So what you're actually profiling is a mixture of populations, and the signal that comes out is an aggregated mixture from different cells. Most experimental designs really consist of a single sample from a tumor, and so that can have some drawbacks. So then the other really important concept is that we want to be able to distinguish somatic aberrations, that exist only in the cancer cells, thank you, from aberrations that may exist in our germline DNA. This is a really, really important concept for understanding the nature of somatic genetics in cancer, and for sporadic disease, not hereditary disease, these are the types of changes that we want to be able to profile. And then finally we have this notion of ploidy. Ploidy is essentially a measure of the number of copies of the genome, and often tumor cells will acquire triploid or
tetraploid genomes, as I showed in the very early slides, where we had multiple copies of the genome that had been replicated. This can be through whole genome endoreduplication, where during mitosis there's an event that doesn't allow for proper segregation of the chromosomes, and so we end up with nuclei with extra copies across the whole genome. And in fact it can also happen the other way around: there are examples of cell lines that have been cultured that are haploid, where an entire copy of the genome is wiped out and somehow these cells are viable, and so the malignant cells are actually completely homozygous with one copy. There's an example of some CLL lymphocytic leukemia cell lines that are really wonderful tools for genetic manipulation, because one only has to induce the mutation and that will have its homozygous effect. And so there are ploidy influences that will yield specific signals. So the concept here that I just want to stress is that the assumptions in most statistical software packages ignore at least one and often most of these issues. A lot of the time we'll have tools out there that are really designed for normal human genetics, studying blood cells, for example, from healthy individuals, and these are often repurposed for the cancer domain, and that will fail to account for all of these different properties that we know exist in the measurements that we're taking. What hopefully we'll illustrate over the next day is that we can build some of this into statistical modeling. So the point is that when studying cancer, really specialized analytical tools are needed, and one should not adopt the practice of repurposing a tool that's designed for normal human genetics into the cancer domain. There's a very nice review of these statistical considerations in this paper here. I would encourage you to read it. It comes out of Terry Speed's group and it's about the particular properties of cancer cells
and how they're manifested in high density genotyping arrays. So let's now look at the workflow for high density genotyping array analysis. This is with respect to Affymetrix SNP arrays, and this has really been a dominant platform, I would say, in use for the last, say, 5 to 7 years now, and it has been the platform of choice for a number of the studies that I put up earlier in the slides. So the first file that comes off the machine is called a CEL file. How many people have worked with CEL files before, maybe with gene expression arrays? A few of you. There's some level of preprocessing and normalization that's required in these workflows, and so I'll just walk you through some of the tools that can be used. And then we really go through two parallel tracks, where we have total copy number extraction and allelic extraction. I mentioned what those two things are and we'll go into a little bit of detail. Then what we do is a process called segmentation, where we want to separate the genome into discrete segments that exhibit different copy number levels, and I'll explain how to do that. And finally we can take this and do gene and pathway analysis and clinical correlations. So I'm going to make a radical proposal here and suggest a quick bio break. Can we do that? Five minutes? Is that alright, people? And then we'll get into some of the nitty-gritty of SNP6 analysis. So five minutes? Yeah, is that okay? Okay, we can resume now. So any questions so far? It's rather quiet, so it's all very clear. Yeah? Isn't Illumina the dominant one?
Yes, that's right. Yeah, that's a good question. So the question is about the other vendor that is in the business of making these high density genotyping arrays, which is Illumina, and generally speaking tools tend to get developed on one platform, basically. That's the short answer. There's some theoretical translation over to a different platform but it doesn't always work out, and so I think the only way to really assess that is to do head-to-head comparisons, and that often doesn't happen. The developers of methods typically work on the platform that they're interested in or the data they have at hand, and there are very few methods, I think, at least for high density genotyping arrays, that can translate very well and generalize across different platforms. So it's just something to be quite aware of, which is a good point. There's another question. So, especially for nucleotide level analysis of sequencing data, that's basically a flawed process, and the reason is that it ignores the fact that the two data sets are highly correlated with each other. They're from the same genetic background. The somatic changes may represent one in a thousand of the changes that one might see across the whole genome, and so 999 out of a thousand events will actually be shared between the tumor and the normal. So one can leverage that, and we'll work on that later this afternoon. Andy will explain how to do that with real data analysis, and one of Andy's papers has shown very nicely that when we jointly analyze the two data sets the result is much more accurate. Okay, so let's just look at the structure of Affymetrix SNP6 arrays. We have 25-mer oligonucleotide probes. These are highly optimized, unique regions of the genome, 25-mers, and there are approximately 900,000 SNP probes where both the major and minor allele are probed, so these would be 25-mers that differ at
just the one locus, the one nucleotide that's in the middle of that 25-mer. And then also part of this platform are 900,000 copy number (CN) probes, which don't have a polymorphic locus. They're just for the purposes of copy number variation, and they work on the notion of hybridization intensities, where again the more DNA at a locus, the higher the intensity of the signal. And then there's a chip definition file which has all the gory details of this design at this URL here, and I apologize if this is out of date, I haven't looked at that actual URL since last year, but you'll be able to find the chip definition file. So going back to the workflow, the first step is pre-processing, and there is quite a bit of normalization required to remove the platform induced artifacts. The method of choice that I've really come to like is this aroma.affymetrix package, again out of Terry Speed's group, here's the URL here. Generally speaking it outperforms commercial software, it's transparent, it's open source, one knows what you're getting, and it's by probably the world's leading group in statistical analysis of microarrays. So I trust this package wholeheartedly. It's not without its faults, but it's the best package out there, and what this outputs is allele specific and total copy number real valued data. So we can go through the different steps. One of the flaws, or one of the unintended consequences, of this design is that we have allelic crosstalk, and what that means is that the major allele probe may mis-hybridize DNA from the minor allele and vice versa. Remember that these probes only differ at one nucleotide, and so they can capture the DNA from the unintended allele. What that looks like is, if you were to plot the B allele, sorry, the minor allele against the major allele, and it's very difficult to see on this plot, you should really see three groups. You should see the homozygous
data points line up against the Y axis here for the B allele, and the homozygous for the major allele should line up on the X axis, and then there should be a cluster that's in the middle. You can see how really these are data clouds, and the clouds represent the level of noise in the system, so these are not discrete measurements where one would get very accurate and completely faithful results. And the point about allelic crosstalk is that one does not get these flat lines along the axes that one would expect theoretically. The aroma.affymetrix package has a way to correct for this, and so adjusts the intensities to account for allelic crosstalk. Then there are other artifacts that are really ubiquitous across any kind of measurement in the genome. We know that there are non-uniform properties of the genome, such that there are GC rich regions, and these perform quite differently across the whole genome with respect to the probes, and so one needs to adjust for that. A signal that one obtains may simply be due to the fact that the region that's being probed is highly GC rich, for example, because the hybridization properties of GC rich regions are different than AT rich regions, etc. Hopefully that makes sense. And then the fragment length from the digestion also has some effect as well. So without going into the really nitty gritty details here, this package essentially tries to account for these different properties and adjusts the data to make all the probes comparable to each other. And you can imagine that if you were to do several arrays, one might look at the profile of each one of the probes, or a smoothed histogram of the intensities that come out, and you can see that they don't all line up, it's not completely reproducible. And so to make these arrays comparable to each other there's normalization that can then make all the experiments relatively comparable to
each other after normalization. So we do normalization, and once we have normalization we start to look at the genomic features that we're interested in, and these consist of total copy number and loss of heterozygosity, and really allele specific copy number as we talked about. So just by way of notation, you don't need to take this away, but I might refer to some of these terms. We can have y sub j,A, which is the intensity for allele A at position j, and this position j would be one of the either 900,000 or 1.8 million probes on the array, and then similarly y sub j,B for allele B. The total intensity at that position is just the sum of those two intensities: you have the intensity for the maternal allele, let's call it, and the paternal allele, and to get the total intensity we just sum the two. Then the total copy number at that position is that quantity essentially divided by what we might expect from the reference, so the same quantity in the reference, and this can again be the matched normal or it can be a pooled reference. And then the B allele fraction is the B allele quantity over the total. So you can imagine if one is homozygous for the reference allele then this will be 0, and if it's homozygous for the B allele this will be 1, so all of the signal will come from the B allele. Okay, so how do we go from the signal processing step to actually inferring copy number? Here's an example of a tumor normal pair. This is a matched normal, and that's shown on the bottom here, and this is total copy number that's being shown, so this is the sum of the two alleles, and then up top here is the tumor. Red represents an amplification that's been segmented and green represents deletions. You can imagine that when the data comes out, of course, they don't have this nice color coding associated with them. They're just all black dots, and we use some sort of algorithm to map these dots to these nice discrete
biological categories. And what you can see when we do that is that we can identify these very nice events here that suggest that this is material in the genome that has been amplified relative to the normal. So the normal shows no sign of these events; these are events that are specific to the tumor, and so these are the areas of interest that we want to zoom in on. What's shown here is an event that's actually shared between the tumor and the normal. So this is the real advantage of doing matched normal experiments: if we were to look at this in isolation, we might look at this event here and say, this looks like a focal homozygous deletion, the gene in here is completely inactivated and therefore that must be related to cancer progression, so I'm going to go make a functional knockout and put it into a mouse and study its biology, and I'm going to spend five years of a postdoc's life studying this. And then you see, okay, wait a minute, it's also a polymorphism in the normal. And so this would be one that you would probably not want to continue with for the purposes of pathogenesis in cancer, because it exists in the matched normal. Yes? Oh sure, absolutely. So the BRCA genes, for example, are ones where you have frameshifting deletions, and they've gained widespread prominence with movie stars getting double mastectomies based on the presence of a genetic abnormality in those genes. And so without a doubt the single nucleotide types of events are much better characterized than the copy number events, but now, with whole genome sequencing becoming relatively cheap, and even the genotype arrays, there are large scale population studies that are looking for a hereditary basis of these events in cancer. Now that really requires large scale population level studies to really hone in, because of the statistics: it's a curse of dimensionality problem, you're looking at many, many loci, so you need many, many patients
to actually be able to hone in on a statistical signal that may associate, but it's certainly an area of high activity. So the normalization process we were talking about before, obviously these are still showing up, so that normalization was not done in terms of the matched normal in this case? So that's right. In this case these would have been done to a pooled reference, a standard reference, and so it would take the tumor to the standard reference and the normal to the standard reference and then do some sort of subtractive analysis. It's much better, however, in this case I agree, to normalize the tumor to the normal, and that would have hopefully have been made... Yes? So this is an excellent question, and there are tools that can account for that, and that's where looking at the actual alleles really makes a difference. With total copy number maybe one couldn't tell if there's tetraploidy going on, but when one examines the alleles that pattern starts to develop, and one can look at that and actually infer it. One of the tools that I've listed, in particular OncoSNP, is a tool that can account for that, useful even if it's a germline event. Yes? You sort of touched on this, but a lot of times I find I don't have the matched normals, and so what I've done in the past, and I don't know if this is okay, is I've downloaded, let's say, SNP6 arrays from HapMap or 8-dimension and kind of used that as a pool. But I think you also mentioned, do you just make a dummy kind of file that everything gets normalized to, or is it better to use the pool?
It's definitely better to use the pool, because the pooled normal will actually have just the platform specific variation encoded in it, and so it's much better to do that than to use a dummy reference. But you wouldn't want to sequence a tumor without a matched normal, so don't do that, that's like burning money, it's uninterpretable data. I should mention that 1500 of the 2000 breast tumors that we did in that study published last year did not have matched normals. We were still able to pull out the somatic genetics from that, but that's looking at large segments, at a resolution of a few hundred kb, and that's a very different prospect than single nucleotide analysis. Okay, so let's talk about allelic imbalance. So here again, just illustrating this concept, you have the total copy number illustrated here, and this is from a paper from Terry Speed's group. The total copy number is centered around two, with a little amplification event here. And so again, here's a profile of a region of the genome that's essentially unaffected, so you have essentially these three different classes. You have your homozygous polymorphisms that center around zero or one, depending on whether it's homozygous for the maternal or paternal allele, or the major or minor allele, and here we have a cloud of data that's centered around 0.5, and these represent the heterozygous loci in that person's genome. So here's a neutral region where you see it's completely unaffected. Here's another copy neutral region that has this concept of copy neutral loss of heterozygosity: you see a sharp departure from this pattern over here and we end up with this pattern here. So this suggests that this is one of those events where you have a deletion followed by a duplication, and what's left over are the homozygous loci. So can anybody try to estimate or guess why it is that we have these bands that are sort of close to the edge but not quite at
the edge? If they're truly homozygous, the data clouds should just line up right on top of each other. So what's happening here? Normal tissue. Right, normal tissue. So what's mixed in with this is still some number of residual cells that do exhibit heterozygosity. Remember that this is an average signal across the whole mixture, and so the loss of heterozygosity in the tumor cells shifts the data, but what's holding this back from being completely at the extremes of the distribution is the presence of normal cells that we're still measuring, so that contributes to the signal that we see. So you could deconvolve this into the normal component, which would look like this, and the tumor component, which would go to the edge. So we have an aggregate signal, and we can actually use this to our advantage to try to estimate the amount of normal contamination that might exist in our sample. One can deconvolve this signal into two components and estimate the contribution of each component to the underlying signal, and we do this quite frequently in analysis of this type of data. What's happening there, this one here?
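As a rough illustration of the deconvolution idea, here is a small Python sketch. It is not the estimator used by any of the packages mentioned in the lecture. It assumes that the signal is proportional to the number of allele copies, that the sample is a mixture of a single tumor population and diploid normal cells, and that the locus is heterozygous (AB) in the normal. The function names `expected_b_fraction` and `tumor_fraction_from_cnloh` are hypothetical.

```python
def expected_b_fraction(tumor_fraction, tumor_b_copies, tumor_total_copies):
    """Expected B allele fraction at a germline-heterozygous (AB) locus when
    tumor cells are mixed with diploid normal cells.

    Assumes signal proportional to allele copies and a single tumor clone.
    """
    normal_fraction = 1.0 - tumor_fraction
    # Normal cells contribute one B copy out of two total copies.
    b_signal = tumor_fraction * tumor_b_copies + normal_fraction * 1
    total_signal = tumor_fraction * tumor_total_copies + normal_fraction * 2
    return b_signal / total_signal


def tumor_fraction_from_cnloh(upper_band):
    """Invert the mixture for copy neutral LOH (tumor BB, normal AB).

    The upper band sits at (1 + t) / 2 for tumor fraction t,
    so t = 2 * upper_band - 1 and the normal fraction is 1 - t.
    """
    return 2.0 * upper_band - 1.0


# With 60% tumor cells, a copy neutral LOH band sits short of the edge at 1.0.
band = expected_b_fraction(0.6, tumor_b_copies=2, tumor_total_copies=2)
print(band, tumor_fraction_from_cnloh(band))
```

The same forward function also covers the allele specific cases: with a pure tumor sample, an AAAB segment gives 1 B copy out of 4, and normal contamination pulls that value back toward 0.5.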
So this is probably an allele-specific copy number change. Here you have a copy number change, and there's this one here. All right, so 15 minutes; I'm going to go through this part fairly quickly.
Here's just a simple diagram of total copy number, and we're going to go through the underlying principles for how one would segment this data. Here you have a deletion, here you have an amplification; the question is how we actually infer these regions, these features in the genome. There have been a number of review papers on this, and I'm going to walk you through two different approaches that have been widely used in the literature. Some of these tools are fairly old and there have been newer examples, but the concepts are fairly similar and apply to the newer tools as well. A very popular approach is a non-parametric approach called DNAcopy, originally developed by Adam Olshen, who was at Sloan Kettering. There's a nice Bioconductor package for this. Have you worked with Bioconductor yet in the workshop? No? Yes? No?
I think you will be. So you can just download that; it's in Bioconductor and it works within R. And then there are hidden Markov model approaches, which are parametric approaches, and I'll walk you through that.
Okay, so let's look at this DNAcopy algorithm. The key idea here is that it outputs change points in the data: it tries to find regions across the genome where there's a sharp change in the profile. Here you have a cluster of data points that is centered around this mean line, and you can see at this point there's a sharp transition, an abrupt change that signifies that there probably is a copy number change and a breakpoint in this region. The major concept is that this algorithm tries to minimize the within-segment variation and maximize the between-segment variation. What they introduce is this idea of circular binary segmentation. So how does that work? It's a similar notation to what I was talking about before; the new notation here is just to compute the mean of a segment from position i to position j. We can look at that pictorially: we take the full chromosome, essentially splice it together into a circle, and then look for the i and j that maximize this score. So what is that score? The score is essentially a trade-off between the within-segment variation and the between-segment variation. We try to find the points where the difference between this part of the circle and that part of the circle is at its maximum, and the algorithm inserts breakpoints at those positions. It's an exhaustive search across all possible points to identify breakpoints, and again the major concept is to maximize between-segment variation and minimize within-segment variation. Is that clear? Okay. And
then one computes this for all possible breakpoints and assesses statistically, under permutation, how likely that particular event is. So the first step is the segmentation that identifies change points, and then it's just done recursively until there are no more changes: we take the new segment, make a new circle out of it, and repeat the process. Sorry, this is really bad resolution here, but essentially what this green line represents are the means of the segments, so here's a segment, here's a segment, here's a segment, et cetera.
The issue is that this algorithm can identify regions of change, but the segments it outputs have an arbitrary number of levels, so it still requires some sort of post-processing to interpret the results. This is typically done by thresholding, and sometimes something more sophisticated. There's a nice tool called MergeLevels. For example, you see that there's a level here and a level here, so there's some sort of change point, but this may just be due to noise. The idea is that one would post-process and maybe join these two segments so that we don't over-segment the data and misinterpret the results. Ultimately, what we want out of this process is to take these black dots and classify them into something that's biologically interpretable: what are the regions of loss, what are the regions of gain, and what are the regions that are neutral? That requires a whole other step after segmentation, and these are some of the tools that have been used to do that. The advantage of this approach is that it really doesn't require any parameters, so you can just run the algorithm, find the breakpoints, and then impose an almost user-based thresholding in the post-processing to do the interpretation. Another
approach, an alternative way, is to try to segment and classify the data simultaneously. What do I mean by that? Here the classification helps with the segmentation and vice versa. The classification is done at the probe level into a fixed number of states, and the state space in the simplest notion is basically loss, neutral, and gain. There can be an increasing number of states as well: one could, for example, separate losses into homozygous and hemizygous deletions, or gains into multiple levels of gain, to try to classify the super high-level events that we are probably most interested in. So how does this work? There are a number of methods. Oh, wait a minute, I seem to be missing some slides. Oh well, that's okay.
The way this works is that we have parametric distributions that define the different classes. Let me just go back to this. The idea would be that we have a distribution governing losses, a distribution governing gains, and a distribution governing neutral, and we assign the likelihood that each point belongs to one of these distributions. The model then assumes, just due to the nature of the data, that biological segments are likely to span a large number of consecutive probes, so a probe should be essentially the same class as its neighbor. If your neighbor is a loss, then chances are the next one will be a loss, but there's a transition matrix: there's a probability that one can transition from one state to another. Using an algorithm called expectation maximization, one estimates the parameters of the model and does a segmentation, then based on that segmentation re-estimates the parameters, and that process iterates back and forth. There's a vast literature on hidden Markov models for copy number arrays; I've pointed you
to some of the papers, but essentially that's the way it works. At the end of the day, in contrast to DNAcopy, which just produces the change points, one gets both the change points and the classifications of the different segments. That has some advantages, but it can be restrictive in the sense that we usually operate on a fixed state space, where we have to specify the number of states that we think are in the data. Typically we might use a six-state space: homozygous loss, hemizygous loss, and neutral, so those are three, and then three levels of amplification. That's usually enough to cover the space, but in sequencing data, since the resolution is higher, one might even want to expand to 10 or 20 states, or estimate exact copy number, which is difficult to do. Collapsing down to six states is an approximation and does not estimate the true copy number, but we've found in practice that it can identify the regions that are under homozygous change and those that are under super high-level amplification, which are ultimately the most interpretable results that we want to get. I apologize that for some reason some slides were deleted, but I can try to get those to you anyway.
Like I said, there's been a rich literature leveraging hidden Markov models for these SNP arrays. These are some of the tools, and there have been more since. In this paper that I've already pointed out to you there's a nice table that compares the different methods and approaches, so I urge you to look at that.
Okay, so 10:30 is coming up. I just wanted to quickly go over some of the concepts that may be covered in the lab. One nice way to visualize these events, especially when looking at a population of tumors, is with IGV. One can go into the TCGA portal, for example. Have you covered TCGA? Yeah? Okay, so you can just
download the TCGA data that is segmented for the different tumor types and load it into IGV natively, without any manipulation whatsoever. Let's say you have a gene of interest that you want to study, and in the population of TCGA tumors that have been studied there might be 500 or close to 1,000. You want to know how often that particular gene is deleted in TCGA in endometrial cancer versus ovarian cancer, for example. You can pull in these data sets and just look at it. Are you going to cover this in the lab? Yeah? Okay, so we're going to go over this and I won't spend too much time on it.
Just to show you, this is the ERBB2 amplicon in the METABRIC breast cancer data set. Red here means amplification. You can see that there's a very focal and localized region; these are the tumors that exhibit HER2 amplification. The key point is that at a population level one can look at the overlapping region of all these segments, and that can give some idea of what selection is actually operating on. One can hone in on just the interval between these two bars, and it turns out there may be five genes in that region, and one of them of course is ERBB2. Here's what a homozygous deletion looks like, this very focal deletion. I think this is probably RB1, so here's RB1 here, homozygous deletion.
In the remaining couple of minutes we're going to talk about analysis of next generation sequencing data, so whole genome sequencing data. One thing that we and others have noticed is that GC bias is a real phenomenon in sequencing data as well: GC-rich regions of the genome will, in the bridge amplification step, amplify differently from the AT-rich regions, and that creates some noise in the data. For example, if we take 1 kb bins and we try to plot
the number of reads that align to each 1 kb bin across the genome. This concept is rooted in the fact that, again, we're sequencing a mixture of cells, and the number of fragments from a particular region will be proportional to the quantity of DNA present in that region. So if there's an amplification, the number of fragments that we sequence for that particular locus will be higher than if there's a deletion. Makes sense? Here's what it looks like after just correcting for GC content: we start to see a much smoother and easier to interpret profile. Then we also correct for properties of the genome that allow reads to map there. Of course most of the genome, more than 50%, is highly repetitive and has low mappability, so that will also influence how many reads align to a particular part of the genome, and we can correct for that as well. Once we do that, we can really start to see where the biology is in the genome. If you were to just look at this, it would be highly uninterpretable; once we go through the normalization steps, we can start to see the regions of interest that we want to focus in on. When we do this in practice, for a genome that's been sequenced with whole genome sequencing and subsequent processing of the data, this is what the copy number profile looks like. We start to get really nice discrete blocks where we can estimate where the change points are and where the actual biological events of interest are. I mentioned that we can do loss of heterozygosity analysis; I'll just skip over this, skip over that.
So, some new concepts that have been illuminated by whole genome sequencing, one of which is chromothripsis. This is from a paper published by Peter Campbell and company in Cell in 2011, and what they describe is this concept of chromosome shattering followed by non-homologous end joining. This is actually visible in the
data and becomes quite a measurable property in whole genome sequencing. What this basically shows is that there's been a catastrophic event that has essentially blown the chromosome apart, and the repair mechanism sticks it all back together, but it's all been shuffled around. Somehow this event is selected for: the cells are viable, and this clone still expands and exists in the tumor. What these arcs represent are essentially exchanges of information across the chromosome. Where you see an arc from one point to another, there's a read that spans that breakpoint, so part of the read aligns here and part of the read aligns over there. This is really an extreme example of a genome that's been completely rearranged. It doesn't resemble anything like a normal cell, but it started from a normal cell and it's been completely obliterated. So what is the significance of this? It may be due to compromised homologous recombination, for example, or other DNA repair mechanisms that are compromised. In a nice paper published on neuroblastoma, they looked at a cohort of approximately 90 tumors and found that cases with genomes that look like this, cases predicted to have this particular phenomenon, had a much worse prognosis than tumors that didn't have that phenomenon. You've seen these Circos plots? Yeah? Okay, so this is just one chromosome with this many rearrangements happening on it; I don't know how many it is, but it's probably thousands. The reason why this is important is that neuroblastoma is a childhood cancer that typically doesn't have very many somatic mutations in the point mutation space. When people originally started sequencing neuroblastomas it looked barren, with nothing to pin your hat on; there was no low-hanging fruit in the coding space. And then this group came
along and did whole genome sequencing and found that, look, there's a subclass of tumors that have undergone these really dramatic changes in their genome architecture, and this is probably what's leading to the malignant phenotype in those tumors.
So, some more advanced topics: complex rearrangements. Andrew McPherson in my group has been working hard on this problem, looking at events that involve more than two regions of the genome. We can think about translocations, where we have exchange of information between two parts of the genome. He's published a nice paper that profiles cases where three or more regions of the genome are involved and create viable transcripts. With gene fusions, we fuse gene A with gene B and that ends up as some sort of oncoprotein; this would be many different genes coming together to create a chimeric transcript. This is something that is gaining prominence when looking at genomes like this. They tend to create entirely new proteins that wouldn't have existed before in nature, and those get selected for.
And then finally I just wanted to end with this concept of intratumoral heterogeneity. This is from a paper published by Nick Navin from Mike Wigler's group in 2011. What they did is single cell sequencing of a population of cells from breast cancer. They flow sorted the cells into discrete populations that had been characterized by different cell surface markers, then sequenced the individual nuclei and estimated copy number profiles from them. What they noticed is that one can relate the genetics of those cells by a phylogenetic tree, and just by clustering they fell into three discrete categories. So within a tumor, the copy number architecture of the different cells is quite different in these particular tumors that they sequenced. And so, yeah?
Are you saying that every cell...
Yes, that's what this shows. Yes. Yeah, that's exactly what this says. But these are very subtle variations within these groupings, and there are these three very distinct groups that suggest dramatic, punctuated changes in the evolutionary history of this tumor to select for these three populations. Okay. So just to summarize this section, the genome architecture is... Yeah?
That's right. So that this population, or this representation, is about a 50 40 in a different type of substance? Like the left hand would be half? That's right. And so that's the general population? Correct, so the structure would be that half the cells belong to this group, maybe 25% belong to this group, and another 25% belong to that group. No, this is all from one tumor. This is a population of cells from one tumor, and within the tumor these populations are different. Typically what we do is measure the aggregate signal from all of these cells, so that is just something to bear in mind.
We showed in the METABRIC example how there can be vast differences across the population of individual tumors, but even within tumors there can be incredible stratification of the cells. There's a beautiful paper in PNAS from Simon Tavaré's group from just a couple of months ago. A few years ago Roel Verhaak et al. published the subtypes of glioblastoma: these are brain cancers that can be stratified into four discrete classes by expression profiling. Beautiful paper, a big advance in terms of understanding the difference in outcomes in these different kinds of brain cancers. What Simon's group did is go into individual tumors and analyze specific fragments of the tumors within one patient, and they were able to identify a subset of patients that had, within one tumor, examples of all four expression subtypes. What that shows is that a single sample will likely not represent the spectrum of changes in the
whole tumor, and that maybe the subclassification across the population, the interpatient classification, may actually be due to just the sampling error that exists within the tumor. So, something to bear in mind.
I should probably wrap up, because Michelle told me to be on time, so I'm going to be on time; you know, I'm ten minutes late. What I hope I've convinced you of is that the genome architecture, in particular in copy number space, is a fundamentally important aspect of studying cancer genomes. Any experiment that looks at mutations or expression in isolation, without considering the copy number landscape, is an incomplete representation, without a doubt. Copy number alterations can change gene dosage and therefore drive expression of oncogenes and tumor suppressors, as we saw in the METABRIC example, both in cis and in trans, and the proteins that are affected by copy number alterations are often the ones that we want to hone in on to understand the properties of those tumor cells. Copy number alterations can be measured using array-based hybridization and, increasingly, next generation sequencing. And I think, are you going to do both in the lab?
Yeah? Okay, so you're going to look at both arrays and sequencing data in the lab. It's been said, and I believe it's true, that arrays will eventually become obsolete. However, in the present day, as I said, it's still an order of magnitude more expensive to look at a whole genome sequence of a tumor versus an array. So if one is only interested in profiling the copy number architecture of a tumor, it's still much more cost effective to do that by array; the technology has been proven, it's reliable, and we have all the analytic machinery for it. Whole genome sequencing is more expensive and gives you much more information, but if you want to take a restricted view of just the copy number architecture, or even a preview of a tumor before sending it to sequencing, an array is a very cost effective way to look at it. Sometimes the results of that preview can inform your experimental design for the expensive sequencing experiments: you spend a little bit of money, look at the monster that you're actually going to examine with sequencing, and design the experiment appropriately. So I hope I've shown, again, that the properties of the genome that are revealed through copy number profiling can indicate important phenotypic characteristics of cancers, and so it's an important thing to look at.
Here in your slide deck are a number of tools that I may have mentioned in passing. All the URLs are there, so you can look them up on your own time and start to explore this landscape. These are all tools that are available for download; there aren't very many restrictions on them, and that's part of the reason why they're there. They all have papers associated with them too; these are all published methods. So I'll just leave it there and take some questions.
I'll let you carry on with your questions before I introduce what it's going
to be. Okay.
It's kind of interesting about the intratumor heterogeneity, about a hundred breast cells. Do you think, if you sequence another hundred breast tumor cells, but from another patient, and you cluster them together, some of those cells will cluster in the same group?
So it's possible, but I think one has to think of a tumor as being an individual evolutionary process. Each cancer that exists in a patient population will have undergone its own distinct evolutionary path. One might see, through the concept of convergent evolution, that there will be certain properties in common. The microenvironment of the location of the tumor, for example, may select for certain features, so maybe p53 loss gets highly selected for in breast cancer. Those types of commonalities are certainly there, and we also see commonalities even across different cancer types. But at the end of the day, looking at the whole genotype, there will never be two cancers that are identical. Yeah?
So for whole genome sequencing? Yeah, okay, so I glossed over that. Essentially what we can do is fit a distribution. We can look at the properties of this GC bias: we take the read count as a function of GC content, and you can see that it's not uniformly distributed, right? There's some pattern associated with that. So we can fit a model to this and adjust the data points based on that model, and end up with a profile that looks like this, which is much more what we'd expect.
Is that universal?
Yes. Yeah, so there are tools: the HMMcopy tool that I put in the slides essentially implements this method, and there are other tools in Bioconductor and elsewhere that have taken this concept and implemented it. A lot of this data is high dimensional and very noisy, so advanced statistical consideration is really quite important. It tends to get glossed over, but it's a fundamental step in maximizing the biological output of these very expensive datasets. One has to treat the data very carefully and understand its biases and its warts. Sequencing a tumor is not going to just automatically give you all the biology that you need; there's a significant amount of analytical processing that needs to take place, and even after that process it's probably still quite an incomplete picture.
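To make the GC correction step from the last answer concrete, here is a minimal sketch of the idea of modeling read count as a function of GC content and then adjusting each bin by that model. It is not the HMMcopy implementation: in place of a fitted curve it uses a much cruder stand-in, the median read count within each GC stratum, and it does not handle mappability. The function name `gc_correct` and the `n_strata` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(read_counts, gc_fractions, n_strata=20):
    """Rescale per-bin read counts so every GC stratum has the same median.

    read_counts:  reads aligned to each fixed-width bin (e.g. 1 kb bins)
    gc_fractions: GC fraction of each bin, between 0 and 1
    n_strata:     number of GC strata used as a crude stand-in for a fitted
                  curve (assumption, not a value from the lecture)
    """
    if len(read_counts) != len(gc_fractions):
        raise ValueError("read_counts and gc_fractions must have equal length")

    def stratum(gc):
        # Map a GC fraction to a stratum index, keeping gc == 1.0 in range.
        return min(int(gc * n_strata), n_strata - 1)

    # "Fit the model": expected read count for each GC stratum.
    by_stratum = defaultdict(list)
    for count, gc in zip(read_counts, gc_fractions):
        by_stratum[stratum(gc)].append(count)
    expected = {key: median(values) for key, values in by_stratum.items()}

    # "Adjust the data points": divide out the GC-specific expectation and
    # put the result back on the scale of the overall median.
    overall = median(read_counts)
    corrected = []
    for count, gc in zip(read_counts, gc_fractions):
        m = expected[stratum(gc)]
        corrected.append(count * overall / m if m > 0 else float("nan"))
    return corrected


# Toy data: bins at 50% GC get twice the reads of bins at 30% GC, and the
# last bin is additionally amplified.
counts = [100, 100, 100, 200, 200, 400]
gc = [0.3, 0.3, 0.3, 0.5, 0.5, 0.5]
corrected = gc_correct(counts, gc)
```

In the toy data the GC effect is removed from the first five bins, which all end up at the same level, while the last bin stays at twice that level, so the copy number signal survives the correction. One design caveat: a stratum median assumes most bins in a stratum are copy-neutral, which may not hold in a heavily rearranged tumor genome.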