Good morning, everyone. Is everybody here? It looks like all the seats are filled, which is good. So I'm Sohrab Shah from the BC Cancer Agency in Vancouver. I have appointments in the Department of Pathology in the Faculty of Medicine at UBC, and I'm also cross-appointed to the Computer Science Department there. I also have a presence in the Genome Sciences Centre, which is Vancouver's equivalent of the OICR genome centre. So I'll be talking to you today about a couple of different features of cancer genomes that I'm sure you've all been exposed to by now: copy number alterations and somatic point mutations. And we'll be going through various detection tools, measurement technologies, and interpretation capacities for these tools. Before we get going, though, I thought I'd give you a little bit of background about my group and what we do. I'm interested in how cancers evolve in the presence of different selective pressures due to, for example, drug intervention or the microenvironment, and how that shapes different clonal populations in tumors. To execute that type of research, we take large-scale measurements of cancer genomes through sequencing technology and use mutations as markers of evolution. And because it's such a high-dimensional analysis space, and errors are actually rampant in the measurement technology, we use pretty sophisticated machine learning tools, statistical models, and different data analysis workflows to get us to those conclusions. So we really sit at the interface of cancer biology and computer science, and my lab is completely dry lab. I don't have a wet lab presence, but we do generate a lot of data through collaborators and so forth. Okay, so the outline for today will go over the relevance and impact of copy number alterations. This is in your notes, so I don't need to read it out. I'll just skip over it.
So let's start with this picture, though. This is an important starting point. It represents a normal human karyotype. How many people have seen pictures like this before? Okay, great. So it's a spectral karyogram with chromosome painting, and it depicts how the genome is organized into 23 pairs of chromosomes. So at any one locus, you have two copies of the genome, one inherited from your mother and one from your father, and under normal circumstances, this is the type of picture that we'd want to see. Over 100 years ago now, Theodor Boveri made the leap that malignant cells in humans could be attributed to changes in chromosome content. This observation was made primarily through the study of sea urchin cells, which have extremely large nuclei. He was a biologist interested in studying the organization of cells, and he realized that there was a certain phenotype, meaning high proliferation and growth, that was associated with the presence of an additional chromosome in the sea urchin nuclei. And so he made the link that this may, in fact, be the cause of human cancers. In 1960, this was proven correct with the discovery of the Philadelphia chromosome in CML by Nowell and Hungerford, and this really began a revolution in studying the genome as a primary cause of human malignancy. So here's a picture of several different high-grade serous ovarian carcinomas, and you can see it looks nothing like the original picture I showed, with the nice diploid organization of a normal situation. These particular tumors are probably the most copy number replete genomes, with the most copy number changes across the major human cancers. You can see that there are some chromosomes that have extra copies, some chromosomes that are missing material, and other chromosomes that are made up of fusions of multiple different chromosomes, and that's presumably what you learned about yesterday.
And so this is a major feature of cancers, and it makes a lot of sense to study the copy number profiles in detail to get insight into the biology of these cancers. Another view of this can come from aggregating data across a population. So this is a summary of a thousand breast cancers where high-density, high-resolution copy number arrays were applied, and what this plot shows is the frequency, or the prevalence, in the population of a particular copy number alteration. Red means amplifications, extra copies of that locus, and blue means deletions. That's arrayed over the genome, where the position on the x-axis represents a particular locus, and the y-axis shows that frequency or prevalence. You can see that across the population of breast cancers, nearly the whole genome is affected in some way by copy number alterations in at least a few percent of cases. Okay, so any questions on this so far? Do you have a similar graph for normal people? Yeah, so... Of course there are recurrent germline CNVs as well, which are inherited variations that are generally thought to be benign. And it doesn't look anything like that. So let's see if I can find an example of... Here's one here, actually. If you look at this feature right here, it may be hard to see, but there's a very sharp spike here, and actually it's bidirectional, so some people have losses and some people have gains of that particular locus. We tried hard in this study to filter out all the germline polymorphisms, and some get through. I think that's an example of one that leaked through. And so if you looked at this plot across a normal human population, it would be mostly flat but interjected with very sharp spikes throughout. And the germline CNVs tend to be quite narrow. Yes? What is frequency, and how do you do the copy number aggregation for this? Okay, so this is an aggregation over 1,000 different breast cancers.
So what we did is... So she's asking essentially what the y-axis is, right? Yeah, what does it mean? So this is highly processed data. This would be the result of taking 1,000 individual breast cancers, measuring the copy number, and summarizing the result through aggregation. So it's essentially summing, at each locus, the number of amplification events and the number of deletion events. And that's what the y-axis is. It's normalized here, but essentially it's equivalent to counts. So if we had 1,000 cases, then 500 of them would have amplifications in chromosome 1q here. Okay, and an amplification could be any number of copies gained, and likewise for deletions? Correct, correct. We'll get into amplitude in later parts, but this is just looking at at least a one-copy gain. Yes? But those losses that are benign... Would that be a kind of heterozygous loss or a complete loss of the... So in some cases it's actually a biallelic loss, so the locus just isn't even there. It's similar to how we're all walking around with some loss-of-function truncating mutations. So there are parts of the genome that are just not there in certain individuals. Okay, very good. Okay, and so for a very high-level conceptual framework, you can imagine that there are maybe three or four different classes of copy number alterations and how they might appear. If we zoom in on this particular locus, we can see that there may be three genes there, and a deletion would result, for example, in the loss of gene B. These are typically in the range of 1 kb to entire chromosome arms, and so deletion of an entire gene is a frequently observed event. You can imagine that there's some insertion of material that creates a potentially new gene. You can have inversions, which I think you probably went over yesterday as well, which reorder genes. Or you can have duplications of a particular locus, or segmental duplications that repeat the whole section. Yes.
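As a minimal sketch of the aggregation behind the frequency plot described above: count, at each locus, how many samples carry a gain and how many carry a loss, then divide by the cohort size. The function name and the encoding of calls as +1 (gain), -1 (loss), and 0 (neutral) are assumptions made for illustration, not the pipeline used in the study.

```python
def alteration_frequency(calls):
    """Per-locus fraction of samples with a gain and with a loss.

    calls: list of samples; each sample is a list with one call per locus,
    encoded (hypothetically) as +1 for gain, -1 for loss, 0 for neutral.
    Amplitude is ignored: any gain counts once, as in the talk.
    """
    n_samples = len(calls)
    n_loci = len(calls[0])
    gains = [0] * n_loci
    losses = [0] * n_loci
    for sample in calls:
        for locus, call in enumerate(sample):
            if call > 0:
                gains[locus] += 1
            elif call < 0:
                losses[locus] += 1
    gain_freq = [g / n_samples for g in gains]
    loss_freq = [l / n_samples for l in losses]
    return gain_freq, loss_freq


# Toy cohort: 4 samples, 3 loci.
toy_calls = [
    [1, 0, -1],
    [1, 0, 0],
    [0, 0, -1],
    [1, 1, -1],
]
gain_freq, loss_freq = alteration_frequency(toy_calls)
```

In the toy cohort, three of the four samples gain the first locus, so its gain frequency is 0.75. That is the same kind of number plotted on the y-axis, red upward for gains and blue for losses.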
Sure. It makes it look like they're all connected to each other. Is it really like that? No, so typically tandem duplications tend to be like that, but for things like A, where you see multiple high-level copies, typically what happens is that breakage-fusion-bridge cycles will distribute that material throughout the genome. So it's just a schematic representation. It's not actually the way it looks. But when we actually read copies out, we typically have a scaffold of the genome, so everything's related to the reference genome. And so we can count the number of copies of each segment of the reference genome, for example. You'll see a picture of it in a second. So in general, these types of copy number alterations are a hallmark of tumor genomes, and maybe all tumor genomes harbor such events. And you can imagine, of course, that the loss of a key gene like, for example, BRCA or p53 would have an impact on the regular function of a cell. And so a deletion of a tumor suppressor gene can result in a malignant phenotype, and amplification of a growth factor or a proliferative gene can result in a malignant phenotype as well. And so there's been a huge amount of effort to look in cancer genomes for copy number alterations that could be diagnostic, have molecular function, or have prognostic ability as well. And importantly, some of these events are targets for therapeutic agents. So there are a few classes of CNVs, or copy number variation as a whole. You'll see different terms in the literature. Copy number variation usually encompasses both germline and somatic, and it's not a hard and fast rule, but typically copy number alterations refer to somatic changes. So these are changes in tumor cells, not in the germline cells. We know there's a lot of work happening right here in Toronto on congenital abnormalities that affect intellectual disability, autism spectrum disorder, etc.
These are typically found in the germline, and then somatic alterations are tissue-specific changes that happen in most, if not all, cancers. And then there are benign variations, which tend to confound analysis in the same way that germline SNPs might confound the analysis of somatic point mutations. There are analogous features in the genome, benign variations that are just naturally occurring in the human population. And of course these get measured along with the somatic changes, and part of the challenge is to separate the germline variation from the somatic variation. Okay, good. And so in cancer, we can think of a few terms that you might encounter. Segmental abnormalities are often large-scale or arm-length events. And you can have whole-genome endoreduplication events that alter the entire ploidy of a cell, for example. So many cancer cells are polyploid, triploid or tetraploid. Even octoploid is sometimes observed. And then often the more interpretable changes are focal changes that impact very few genes. These focal changes, if they're functional, tend to involve maybe one gene and have really extreme amplitudes. So these would be homozygous deletions, where both copies are deleted at a particular location, so the gene is just not present in those cells. On the other hand, we have high-level amplification events that end up being the targets of many drugs. And then rearrangements, translocations and fusions, you heard about yesterday. So this is a brief list. I just thought that I'd give you some concrete examples of actionable changes in cancer, and these are really markers of treatment and targets of drugs. So measuring, for example, the copy number of EGFR or ERBB2, which we'll discuss in a little more detail in a minute, can be an indicator for therapeutic guidance.
And so in breast cancer in particular, every clinical case would get tested for three receptors, ERBB2 (HER2) being one of them. And in the case of HER2 positivity, it indicates potential treatment with Herceptin. That particular event in breast cancer is the poster child of personalized medicine. And I'll show you an example of what those look like. Okay, so in addition to guiding treatment, the nature of the genome in cancers can be used to stratify different patients. There's a nice synthesis study of the TCGA performed by Chris Sander's group, which showed that there's a spectrum of cancers. At one end are cancers that are considered point mutators. They typically harbor a lot of point mutations but very few copy number alterations. And then at the other end of the spectrum are cancers that harbor a lot of copy number alterations but relatively few point mutations. The logic is that there is negative selection against having a deficiency both in the DNA repair that repairs double-strand breaks, for example, and in mismatch repair, which repairs single-base changes. So the presence of both DNA repair mechanisms being altered is essentially selected against. And so we typically end up with our cancers at either this end of the spectrum or that end of the spectrum. Today we'll focus on cancers that are at this end of the spectrum here. And amongst those is the ovarian cancer group that I talked about. The high-grade serous subtype in particular is considered the cancer most strongly associated with copy number changes. Okay. So here's the promised example of ERBB2. This is chromosome 17 of a breast cancer patient. What's shown along the x-axis is just the genomic position on chromosome 17. And on the y-axis is a measure of the number of copies at that particular locus.
In this case, it's normalized to zero, so this is relative to the reference genome, where the expectation is that there would be two copies. Each dot here represents a particular locus on a high-density genotyping array, which we'll discuss a little later on. But the way to interpret this is that the higher you go up the y-axis, the more copies of that particular locus there are. We often call these skyscraper-type alterations, which really just jump out of the data. This is showing a relative copy number of about 20, but often we see 50 to 100 copies of a particular locus. And in this case, it's a growth factor receptor, which then drives growth and proliferation with the additional number of copies. And as I said, this is a target for a drug called Herceptin, or trastuzumab, which is used in clinical practice, and the amplification affects approximately 15% of all breast cancers. Okay. So the question is, what do the colors represent? When we get the measurements, we don't have this color coding. This is the result of analyzing the data with an algorithm to try to separate out copy-neutral regions from those that are amplified and deleted. And you'll be doing some of that work in the lab, actually processing data to output those categories of changes and segments across the genome. So a technique that's often used in clinical practice is fluorescence in situ hybridization, where a probe is designed to hybridize to a particular part of the genome, and it lights up when that part of the genome hybridizes to the probe. And so essentially what's shown here is that the control probe is the green probe, each one of these blobs is a nucleus from a cancer tissue, and most cells have two copies of the reference, or the control, and this is a good example here.
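The "normalized to zero" scale described for the chromosome 17 plot can be sketched as a log ratio against the two copies expected in a normal genome. This is an assumption for illustration: the talk does not say which transform was used for that plot, but a base-2 log ratio is a common convention for array data. The function names are hypothetical, and the sketch ignores normal-cell contamination.

```python
import math


def log2_ratio(copies, reference_copies=2):
    """Relative copy number on a log2 scale; 0 means unchanged from the reference."""
    if copies == 0:
        # A homozygous deletion has no finite log ratio.
        return float("-inf")
    return math.log2(copies / reference_copies)


def copies_from_log2_ratio(ratio, reference_copies=2):
    """Invert the transform: absolute copies implied by a log2 ratio."""
    return reference_copies * 2 ** ratio


neutral = log2_ratio(2)      # two copies: sits on the zero line
one_gain = log2_ratio(3)     # a single-copy gain: slightly above zero
one_loss = log2_ratio(1)     # a single-copy loss: below zero
```

On this scale a one-copy loss sits at -1 and a doubling at +1, so a skyscraper-type amplification with dozens of copies stands far above everything else on the chromosome.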
And then, of course, the probe that we're testing here is ERBB2, and you can see that in some cases there are just literally hundreds of copies of that particular locus in these cells. So this is a typical result from the FISH assay of ERBB2, and this is also used in clinical practice. The alternative is to actually measure the protein level through immunohistochemistry, and that's often done as well. So typically these types of changes, the ERBB2-type changes, will drive gene expression. In a particular study where we examined the population-level impact of copy number alterations on gene expression, you can see here several examples, and this is ERBB2 here. The copy number is shown on the x-axis, and the expression is shown on the y-axis, so you can see that as we go up in copy number, that has an impact on gene expression. And this is really like a multimodal distribution, where these are all cases that are essentially copy number neutral, and there's a bit of spread in expression, but once we get to the high-level amplifications of ERBB2, you can see that there's a nice correlation between expression and copy number. And so these are changes in the genome that are having an impact on the expression program. Yes? And the different colors? Yeah, so each data point here is a patient, and the color of the dot represents the copy number in that patient. Green means there's a deletion, blue means it's neutral, red means a one- or two-copy gain, and orange means a high-level gain. Does this plot include non-patient data, or is it all patient data? This is just all tumors, yeah. These are tumor measurements. Do you see an effect where, after a certain number of copy number variations, the expression doesn't change? Yeah, well, you can start to see that here, actually. So there seems to be a set.
So the question is, does the expression eventually saturate? You can imagine that at some point there would probably be some pressure against too much expression of a gene, and that might actually be deleterious to the cell. And I don't have functional data to support this at all, but it does appear to be the case that there's a bit of a plateau in this distribution, and we noticed that as well. In other cases, though, there's an almost linear relationship, and it keeps going up. Does the expression have a direct correlation with protein synthesis? Well, sometimes. In this case, yes. Most of the time, though, it's unpredictable, and that's usually because of the stability of proteins, and, of course, it depends on what protein product you're actually looking at. So there are different isoforms, modifications of proteins, et cetera. There have been phosphoprotein arrays now incorporated into TCGA studies, and they often don't correlate with expression data. So there's a whole layer of regulation happening that doesn't lead to a nice linear correlation. There are some that have increased levels of protein but not increased copy number, these ones here, like that one here. Yeah, so that's a great question. They're definitely outliers, and one potential explanation is that the expression array just didn't work very well in that case. We didn't validate this at the protein level, so it's not clear that it is a real result. It could be a technical result. But the interesting thing is that we also found some patients, like these ones here, that didn't have super high-level copy number changes but had high expression levels. So maybe there are other mechanisms for up-regulating ERBB2 in those cases. Very good. Okay, so on the other end of the spectrum, this is what a typical homozygous deletion looks like. So again, I'll explain these two tracks now.
This is actually from whole genome sequencing data. Again, the chromosome position is on the x-axis here. Each dot here represents a 1 kb window in the genome. The y-axis is the copy number level. And so on the zero line, we see there are segments like this one here that are diploid. It's unchanged from the normal. This is a low-level gain. This is a one-copy deletion. And this is a two-copy deletion here. This is probably the cleanest data that you'll ever see. This is just the clearest example that I could find to illustrate the point. So this is a case that has very high tumor content. The signal is very strong, and we can pull these out. And here's a little segmental amplification of this locus here. The idea is that these are the types of events that are usually the first low-hanging fruit to try to identify. Because they're really focal, they tend to target one or two genes. And often this is a signal that you might see, for example, in a PTEN deletion. It will be something just like this. PTEN is a very potent tumor suppressor. Or cyclin-dependent kinase inhibitor 2A. So CDKN2A will often show a profile like this, and it's characteristic of lung cancers and breast cancers. So that's the top plot. The bottom plot shows something different. Each dot here represents a single nucleotide polymorphism that we know is heterozygous in the normal. And so that means that the allele ratios at those particular loci should be on the order of 0.5 in an unchanged scenario. And that's this situation here. So here's a segment that's copy-neutral, so it's diploid. And most of the data points here line up around the 0.5 ratio. So it means that about half the reads are coming from one allele and the other half are coming from the other allele. So it's a heterozygous locus. You can see here that this results in a skew away from that 0.5.
That's because there are two copies of one allele and one copy of the other allele. And so we start to get this drift away from 0.5. In this situation there's only one copy remaining. And so we either get one allele or the other, and it starts to become a really bimodal distribution away from 0.5. And in this case it becomes even clearer, because we've lost one allele and then the remaining allele has duplicated itself. And so we get two copies of just one allele. So when you see a plot like this, here is what we call copy-neutral loss of heterozygosity. LOH is a feature of cancer genomes, and these are potent examples of that here. The implication is that one can imagine there might be a loss-of-function mutation in this region here, so that's one allele that is disrupted. And the other allele can be disrupted through loss of heterozygosity like this. So it's the classic two-hit idea, where one hit is a mutation and the other hit is the copy number deletion. And so that gene is no longer functional, with no wild-type copies left. And that can be a mechanism for driving a malignant phenotype. Any questions on that? This is quite an important slide. If you take away anything from the day, this is it. These measurements represent each of the different types of features. It's got all the things that one would look for: a focal deletion here, a focal amplification, and loss of heterozygosity relative to diploid here. Okay. So the gene content of these types of features in the genome will likely be familiar to you, because we've talked about epidermal growth factor receptor, the MYC oncogene, PI3 kinase, KRAS, et cetera. Deletions are the typical tumor suppressors. You may find deletions in genes like RB1, PTEN, CDKN2A, et cetera. So there's a list of tumor suppressors that are known to be altered by homozygous deletion in different papers.
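The allele-ratio track described above can be made concrete with a small sketch: given how many copies of each parental allele the tumor carries at a heterozygous SNP, compute the fraction of reads expected from each allele. The function name and the `purity` parameter (the fraction of cells in the sample that are tumor, with normal cells contributing one copy of each allele) are assumptions added for illustration. The talk only notes that the example case has very high tumor content.

```python
def expected_allele_fractions(major, minor, purity=1.0):
    """Expected read fractions (minor, major) at a germline-heterozygous SNP.

    major, minor: tumor copies of the more and less abundant allele.
    purity: assumed tumor cell fraction; normal cells carry one copy of each.
    """
    minor_signal = purity * minor + (1.0 - purity) * 1
    major_signal = purity * major + (1.0 - purity) * 1
    total = minor_signal + major_signal
    if total == 0:
        raise ValueError("no copies of the locus remain, so no reads are expected")
    return minor_signal / total, major_signal / total


diploid = expected_allele_fractions(1, 1)           # balanced, both near 0.5
low_gain = expected_allele_fractions(2, 1)          # two copies of one allele, one of the other
one_copy_loss = expected_allele_fractions(1, 0)     # only one allele remains
copy_neutral_loh = expected_allele_fractions(2, 0)  # remaining allele duplicated

# With some normal contamination the two LOH states separate:
loss_impure = expected_allele_fractions(1, 0, purity=0.8)
loh_impure = expected_allele_fractions(2, 0, purity=0.8)
```

In a pure tumor, the one-copy deletion and copy-neutral LOH both give fractions of 0 and 1. Under the assumed 80% purity, the lost allele is expected at about 0.17 after a one-copy deletion but at 0.10 under copy-neutral LOH, which is consistent with the bands looking even more widely separated in the duplicated case.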
And so these papers, this is a relatively incomplete list at this point, but it gives you a sense of some of the work that's been done to look at thousands of cases and really understand the landscape of these events, both within single tumor types like glioblastoma or ovarian, and across multiple cancers, as described by Beroukhim et al. and Ciriello et al. This is the pan-cancer analysis. And so the field is poised now for the results of the ICGC project, which is about 3,000 cases from numerous different tumor types, but this time measured using whole genome sequencing technology. And so there'll be analogous landscape papers emerging that show the copy number landscape as measured by whole genome sequencing. So I thought I'd give you an overview of this paper that I was involved in, which was a significant effort in profiling the genomic landscape of breast cancers. We examined 2,000 breast cancers together with Sam Aparicio and Carlos Caldas in Cambridge. We set out to look at essentially what the landscape of copy number changes is and how that landscape modulates gene expression. So from the same tissues, we extracted DNA and RNA and measured the copy number profiles of the genome and the gene expression profiles of the transcriptome. And the results ended up looking something like this. So this is across different cases. What's now shown on the y-axis are copy number changes that alter expression. And you can see it looks quite different from the original landscape plot I showed you at the beginning. This is the same patient set, but now looking at which changes actually modulate expression. And the picture gets quite sharp. So here, for example, is our friend ERBB2, and it's the most potent signal in the whole genome. So we didn't discover a new ERBB2, but we did really capture that signal there. And you can see how sharp it is. It's really only affecting a couple of genes here.
Some of these callouts show the number of genes that might be in a particular locus. And across the landscape, we identified about 45 regions that have copy number changes that modulate gene expression. So it narrows down the number of particular driver events that could be associated with driving malignancy in breast cancer. Here's an example of a deletion event, here's PTEN. This one here, MAP2K4, is now a bona fide tumor suppressor gene. But at the time that we uncovered this, it was really a new feature of the breast cancer landscape. And then we have other genes that are actually encapsulated in larger regions. So for example, this 8p deletion harbors a phosphatase subunit called PPP2R2A. And this is the subject of much research both in Sam's lab and in other places around the world. So that's the overall high-level description. When we took these profiles and asked if we could stratify the population into patients that had similar copy number and expression profiles, we were able to cluster the population into about 10 different subgroups that had group-specific patterns that were consistent within a group but different between groups. And so what this shows is essentially what those profiles look like. There are approximately 10 stable groups, and there are a few things to highlight here. This is a discovery cohort of the first thousand cases, and then we recapitulated this in the second thousand cases to show the validity and stability of these profiles. And essentially this transcends what had been known about breast cancer just from the gene expression level. So you've probably all heard about the five major expression subgroups in breast cancer, commonly known as the PAM50. And what this work did is essentially show that once you include the genome, we can actually transcend PAM50. In some cases it subgroups PAM50 subtypes.
In other cases we found groups that are composed of multiple PAM50 subgroups, and so it's a different way of stratifying the population, and, as I'll show you in a minute, it actually has impact in terms of prognostics. So just to zoom in on a couple of these different profiles. What's shown here is again that plot across the population, and the bottom shows a statistic that tells us whether that part of the genome is subgroup specific. It's just the p-value of a chi-square test. And you can see that a large part of this profile is actually subgroup specific. This is what's commonly known as the basal subgroup. It is characterized by p53 mutations and has really large-scale copy number changes across the whole genome, and this group has very poor outcome and is one of the more difficult clinical groups to treat. Interestingly, we also found a group representing about 16 to 17 percent of cases that were actually copy number devoid. These had really barren copy number landscapes, very reminiscent of what you'd find in a normal genome. Now, this wasn't due to, for example, low cellularity or lack of tumor cells in the sample. It's just a feature of these cancers, and they are characterized by high T cell infiltrates and really show a different side of breast cancer. This was one of the most surprising findings, that there's this group of cases that don't harbor a lot of copy number alterations. And what this bar plot shows is the relative proportion of PAM50 subtypes, which is what the different colors here show, and so this group is actually composed of really all five PAM50 groups. Go ahead. Did this include epigenetic data?
No, we didn't actually measure the epigenome in this study. So this is really a new view, and it shows that the power of doing high-resolution copy number profiling in a large population can reveal something new about patient stratification. And so if we then take the next step, and one of the major features of this cohort... Yes, go ahead. In the previous graph, do you see some ethnic groups that cluster together? So the question is, do we see ethnic groups clustering together? We looked at that. We only had normal DNA in this study for about a quarter of the patients, so we weren't really powered to say much about ethnic groups. I think you need very large case-control studies in the tens of thousands. What's the ethnic composition? So interestingly, it was a collaboration between the UK and Canada. In Canada it was mostly Vancouver, Alberta, and Manitoba, and in the UK it was Guy's and Addenbrooke's. Guy's is quite interesting because there's a large African population in that part of London, and Vancouver is interesting because there's a large Asian population. So the question can be asked, but I'm not sure we're in a great position to answer it, though we did have a range of ethnic composition. In terms of the classification of the 10 groups, was it automatic clustering? Yeah. So the question is essentially how the groups were clustered. We used feature selection followed by a joint clustering method that used both copy number and expression features. For the number of groups, we ran over a large number of potential clusters, from 0 to 20, and through cluster stability analysis determined that around 10 was the right number. That was then validated with the extension cohort of an additional 1,000 cases, on which we repeated the procedure, and the proportions were roughly the same. So a very important result is that one of the really special things about this cohort is that many cases had 10-plus years of clinical follow-up and
we knew the outcomes of these patients, and that's what made the study really quite unique. For example, people often compare the study to the TCGA, but what's distinct about this study is the quality of the clinical data that's available. One of the criteria for inclusion was that we had at least 10 years of follow-up, so the median follow-up was somewhere around 12 years in these cases. And you can see that makes a difference, because often people use around the 5-year mark as the censoring limit for outcome analysis. We'll do more outcome analysis on Friday, but essentially this is a Kaplan-Meier plot, which shows the proportion of patients surviving after some period of time. One thing I just want to point out is this group 4 here. The cluster group whose copy number was relatively flat actually wasn't the best-performing group. It's this one here, the turquoise one. So there isn't a strict correlation between lack of copy number alteration and outcomes. And so the groups that we discovered actually had some clinical relevance to them. They're not just a set of clustered data points. They actually had some impact on clinical outcomes. So overall, we found that recurrent copy number profiles can be used to stratify these patients and identify novel molecular subgroups. They had clinically meaningful relevance, since they were associated with distinct prognostic profiles. And then we found essentially a landscape of about 45 regions of driver alterations that modulated expression. So this is just an example study of the power of using high-density, high-resolution genomics and looking at copy number profiles in a particular cancer. Any questions?
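For readers who have not met it before, the Kaplan-Meier curve mentioned above can be sketched in a few lines. This is the textbook product-limit estimator on made-up numbers, not the study's analysis code, and the function name and input encoding are assumptions. Each patient has a follow-up time and a flag saying whether death was observed or the patient was censored (still alive at last follow-up).

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time for each patient.
    events: True if death was observed at that time, False if censored.
    Returns a list of (time, estimated survival) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Handle every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Toy data in years: deaths at 2, 3, and 5; censoring at 3 and 8.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```

Censored patients lower the number at risk without lowering the curve, which is why long, complete follow-up matters: with a 5-year censoring limit, everything that happens after year 5 is invisible to the estimate.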
You have a group that is flat, basically without much copy number change. Does that mean that the onset of breast cancer can be attributed to other genetic alterations? Right, so I think the question is, if copy number alteration is not driving it, then what is driving it? It could be somatic point mutations. We did sequence p53 in these cases, and some of those cases indeed harbored p53 mutations. In other cases... so we didn't do epigenetic profiling, and that's the obvious place to look, I think, in those cancers: epigenetic deregulation. So that would be a logical follow-up study. Yes? How do the clusters correlate with stage? With stage, good question. I don't have the answer at my fingertips for you. Is there a reason why you grouped them the way you grouped them? They're not in the same order. Is this four the same as the one that looks like this one here? The indicator is actually here. So the cluster groups are not sorted by cluster number. I think it was sorted by maybe the proportion of the genome altered or something like that. Well, actually that can't be the case either. I'm not sure how they're sorted, I can't remember, but these indicators here correspond to the indicators over here, so the index here maps to the index that's shown here. So there's nothing meaningful in this one? Yeah, that's exactly the point. So here's group four, and the finding is that there's this group of approximately 16% of breast cancers that actually don't have very much copy number alteration. So is the 164 on the next page the number of patients? Correct, that's what's shown here, so a couple of them dropped out. This is survival data, so this is the Kaplan-Meier plot showing the overall survival. Yes? Did you guys do any multivariate analysis?
Do you mean with clinical correlates? So lymph node involvement certainly would predict for poor prognosis. Not systematically, I would say, because it was a huge cohort and not all the clinical data was uniformly collected. What we did have is essentially overall survival and disease-specific survival on all the patients; that was really the only variable that we had that was complete, and that was the requirement. But subgroups could be done: we could do hazard ratios for various clinical covariates on a subset of patients. Yes? Since BRCA2 is involved, is there a correlation between the group that had low copy number alterations and those with BRCA2 mutations? So for most of these cases we didn't have germline testing on BRCA2 or BRCA1, and that wasn't a criterion for inclusion. In fact, most of these cases would actually be non-hereditary, so these would be sporadic cancers. We noticed a couple of copy number alterations in BRCA2, but it wasn't a pervasive enough signal to do any statistical analysis on. Yes? Yeah, so the measurement technology used here was actually just genotyping arrays, so we didn't sequence these genomes. There's a project that's ongoing right now, that you know about, where of course we're sequencing a large number of triple negatives, where we ought to be able to make that association. Yes? Why does this curve flatten out at the end? You see it drops really quickly, and then at the end... So what that says is that after some period of time, if a patient has survived that long, then they tend to just be cured. It's a signal of cure. And that's through treatment, or through other...? Sure, so with various surgery and treatment. I'm being told to move on, so I'll move on. Okay, so we've talked about total copy number mostly, but I wanted to now break this down into what we call genotypes, and so
genotypes can have many different meanings in different contexts. In this case, you can imagine in the diploid case, as I mentioned before, we have two alleles, and we can class them as A and B, as major and minor alleles, or these can be maternal and paternal, or they can be variant and reference. So there are many different ways in which one can describe two different alleles at a locus. When you have two copies, there are really three genotypes that can arise: it can be homozygous for the major allele, that's AA; it can be heterozygous, where we have one copy of each, that's AB; or it can be homozygous for the minor allele or the variant allele, and that would be BB. In the case where it's AB, we can ascribe what we call a zygosity status to that, and that would be heterozygous, and in the case where only one allele is present, in one or two copies, that's loss of heterozygosity, or LOH. And so for all copy numbers, essentially the number of genotypes is the number of copies plus one. That's the rule, and so for each copy number you can have this many genotype states. So here's copy number three: you can have AAA, AAB, ABB, or BBB, and that's the number of ways in which one can choose two different alleles across three copies. That trend continues as we expand the number of copies, so by the time we get up to five copies we actually have six potential genotypes. It's of interest to try to ascertain what the genotype is, because of the implication of a loss of heterozygosity. For example, this would be amplified loss of heterozygosity: the implication here is that the A allele is actually completely lost and what's remaining is multiple copies of the B allele, whereas one can have a more balanced copy number change, where both alleles are amplified, and that might have different implications altogether. And so these measurement technologies will actually allow us to decipher these different cases. Here's an example, and this is
from a paper from Gavin Ha from my group from a couple of years ago, and this is whole genome sequencing data, where again each data point here represents a single nucleotide polymorphism that's been called in the normal sample and is expected to be around a 0.5 allele ratio, which suggests that there's one copy of A and one copy of B. We can see in the cancer how that distribution changes. So once again, here's an example of a neutral region: it has two copies, but actually it's a BB or AA locus, so there's loss of heterozygosity here. We can contrast that with a region that is still heterozygous but is skewed away from that 0.5, so here both alleles are still present, but there's an excess of one over the other. Here you have a loss, and that's represented by this distribution of alleles here. And so again, this has interpretive capacity for mutations: if you see a loss-of-function mutation in a region like this, then one can start to infer that perhaps it's a two-copy loss of that allele and there's no wild type present anymore. So it really helps interpret mutations across the genome. And you'll be using a tool in the lab that investigates how to measure this using high-density genotype arrays, called OncoSNP; it was developed out of Oxford. And so why should we model that, why should we care? I've really sort of gone over this, but the two-hit paradigm is the classic Knudson two-hit hypothesis, whereby essentially if you lose one allele there's no phenotype, and losing both alleles will result in a malignant phenotype. Then we can have what's called haploinsufficiency, where really just a loss of one allele is sufficient to generate a phenotype. Often we see mutations in p53 that are actually haploinsufficient; often we see the other copy deleted as well, and so the severity increases when the other copy is deleted. In some cases we have what's termed quasi-sufficiency, whereby generally a small change in the expression of a particular allele will
start to change the phenotype, but complete loss is actually deleterious to the cell, so we don't see complete loss. And sometimes we see interaction between the wild-type and mutant allele, as in the case of EZH2 mutations in lymphoma: it's actually required that the mutation be heterozygous to be functional, because it interacts with the wild-type allele as well. So these are just examples of why it's important to look at the zygosity of particular events. Do you ever see an alternate allele suppressing the expression of the wild type in cancer cells? So, do you ever see the alternate allele inhibiting the expression of the wild type? I don't know any examples, but other people in the room might. Here's a spectrum of measurement technologies for copy number changes. They range from very low resolution but high accuracy to very high resolution and, actually, I would say high accuracy now, whereas there's a middle ground here where we had fairly high resolution but fairly low accuracy. Starting with fluorescence in situ hybridization, one can look at one, two, a handful of loci across the genome. Sometimes you can do multi-color fluorescence in situ hybridization to look at maybe two or three or four loci simultaneously, and this gives a fairly accurate picture, at single-cell resolution, of what's going on in the cancer. This is used clinically, and it's really an accepted gold standard, although it's a very low-throughput and old technology. The advent of array comparative genomic hybridization throughout the early 2000s started to expand the capacity to look at copy number across the genome, and so platforms were developed for approximately 30,000 to 100,000 loci across the genome, where essentially hybridization products could be spotted on an array, and then total DNA extracted and washed over the platform, and then we get outputs that look similar to the things I've shown you already. And then in the mid-2000s, very high-density genotyping arrays, with up to two
million loci across the genome, started to appear on the market. Generally Illumina and Affymetrix were the dominant providers of these platforms, and they really drove analysis of copy number in cancer for, I would say, a good five years, and it's maybe still the dominant platform today, with TCGA having profiled on the order of 5,000 cases with Affymetrix SNP6 arrays. That data is all available to download; you're free to pull down that data, which crosses about 20 different subtypes. ICGC is not doing it anyway, is that correct? Yeah, this is all whole genome sequencing. Right. And now we're going to move into... the field is already there. I do a lot of whole genome sequencing in my own work, and the larger consortia have now generated, I would say, probably on the order of 10,000 whole genome sequences for cancers across the world. So knowing how to process next-generation sequencing, the whole genome sequencing platform, for copy number is quite important, and we'll investigate that in the lab as well. Okay, so why is this problem hard? I mean, we should just be able to run these different platforms and pull out the copy number. But there's a computational layer in there that is really quite important, and that's for several reasons. One is that in the cancer sample, I'm sure you've covered this ad nauseam now, there's always a mixture of normal cells that have either infiltrated the cancer or are part of the tumor microenvironment. Those normal cells of course don't harbor the signals that we're looking for, so the signal can get diluted down, and that affects the sensitivity for the signals. Then there's intratumoral heterogeneity. I don't know how much you've covered that, clonal evolution?
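The dilution effect of normal cells mentioned above can be sketched with a simple mixture calculation. This is a back-of-the-envelope illustration, assuming a diploid normal baseline and a single tumor population; the function name `expected_log_ratio` and its parameters are hypothetical, and real tools model this far more carefully.

```python
import math


def expected_log_ratio(tumor_copies, purity, normal_copies=2):
    """Expected log2 copy ratio for a tumor/normal cell mixture.

    tumor_copies:  copy number of the locus in the tumor cells
    purity:        fraction of cells in the sample that are tumor (0..1)
    normal_copies: copy number in the contaminating normal cells
                   (assumed diploid, which is an assumption of this sketch)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Average copies per cell across the whole mixed sample.
    mixed = purity * tumor_copies + (1.0 - purity) * normal_copies
    if mixed == 0:
        return float("-inf")  # homozygous deletion in a pure tumor
    return math.log2(mixed / normal_copies)


# A one-copy deletion looks weaker and weaker as purity drops.
for purity in (1.0, 0.7, 0.4):
    print(purity, round(expected_log_ratio(1, purity), 3))
```

The point of the sketch is only that the same one-copy loss gives a smaller and smaller shift as the normal fraction grows, which is the loss of sensitivity described above.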
No, not at all? Okay, so we're going to spend a fair amount of time on that today, but essentially tumors are often comprised of populations of cells that have different genomes. A tumor is not a homogeneous entity; in fact, through evolutionary processes, one can often see at diagnosis multiple populations that exist, and those populations can have different phenotypes and certainly have different genotypes. So the copy number profiles of different populations can be different even within a tumor, and that also creates quite a lot of what we call biological noise when we're trying to infer signals. Andrew does a lot of work on trying to deconvolve these mixtures of cells and has made some important advances in that area, so you should ask him about that while he's here. The other confounding factor is that we're looking for these signals in the presence of germline alterations. Germline changes are often the strongest signals that we see, and that's because they're present in all the cells: these are changes that are part of the initial zygote and so are carried forward and propagated. So we have to interpret these somatic changes in the presence of these germline polymorphisms. And then of course we have the problem of polyploidy: it's not straightforward to know in advance what the baseline number of copies of the genome is, and the changes are of course relative to that baseline. The original set of algorithms that were designed to pull out copy number variation from these arrays were actually designed for population studies of normal human variation. These arrays became quite popular because they're a way to measure human variation in different ethnic groups; for example, the HapMap consortium made potent use of the platform, and really Illumina and Affymetrix, I think, started engineering the technology for measuring normal human variation, but then it became quite apparent that
this is a handy tool for cancer as well. Unfortunately, the statistical methods and the algorithms hadn't caught up to these biological factors that would confound signals, and so really the point is that one cannot use off-the-shelf methods that are designed for normal genome analysis for cancer; it's a different problem. I think the first time we did this lecture, about four or five years ago, there was really a dearth of methods for cancer-specific analysis, and since then there have actually been incredible advances in the algorithmic space to develop specific methods that take into account all these factors. I'm not that well versed in some of that work. Okay, so this is a nice reference that goes over these different considerations, and it comes from Terry Speed's lab. Terry is sort of one of the godfathers of genetics, and I think he's retired now, or is just about to retire, but he has spent a lot of time thinking about statistical considerations in genomics data and also in cancer genomics data, so he's got a nice review there that you can read. All right, so what you're going to do in the lab today is essentially take high-density genotyping array data, from the SNP6 platform, and analyze it. So what are the cases that we're looking at in the lab?
Okay, so when you look at a breast cancer cell line, it's replete with copy number changes, lots of fun changes to look at. And so essentially this is the workflow. We start with a CEL file; this is what comes off the machine. We go through pre-processing and normalization steps, and I think in this case in the lab you'll use PennCNV. Often people use a tool called aroma.affymetrix. Then it essentially goes through a couple of different extraction techniques: we take the measurements at each locus of total copy number, we take the measurements of the minor allele, and then we take those measurements and process them with a statistical model that can infer where the copy number changes are across the genome. Once you have those segments, and which segments are actually altered, we can project what genes are present in those particular alterations and start to do things like pathway analysis across a large cohort, for example, to understand if there's dysregulated biology across the population. And so this latter part is going to be covered when?
Thursday, Friday. So this is the total workflow. You can really apply this workflow to almost any platform, but in the lab today we'll be working specifically on the SNP6 platform. So just a few words about how the array is constructed. These are 25-mer oligonucleotide probes, so they're regions of the genome. There are 900,000 regions that contain a polymorphism in the middle, so there are 900,000 SNP probes, and then there are 900,000 probes in regions that are known to be regions of copy number variation across the genome. What we get out of that is essentially hybridization intensity: it uses fluorescence as a continuous measure of the degree of hybridization to a particular probe, so the more copies, the brighter the signal. It's a continuous signal that is output from these measurements, and it's essentially measured by taking a picture after excitation with lasers, so we get a picture of the brightness of each of the probes on the array, and that's what we're actually reading. And then, because we know the positions of those probes on the genome, we can plot those intensities along the genome, as I've shown you with those chromosome plots. Yes? So in these probes, if there's a SNP in the probe that no one knows about, does that affect the hybridization? Do these methods take care of that in the intensities?
Absolutely not. Good question. All right, so really, and again I think these statements apply in general, including to whole genome sequencing, normalization is absolutely required to remove platform-induced artifacts. The probes are often non-specific, so they'll actually hybridize to parts of the genome that they aren't intended to hybridize to. The SNP probes in particular differ at only one position, so one can get cross-hybridization of the wrong allele. There are different effects according to when DNA is extracted and fragmented: there's variation in the fragment size when the fragments are applied to these probes, and that can affect the degree of hybridization. The aroma.affymetrix package handles a lot of these artifacts and at least normalizes the data so that samples are comparable to each other, and so hopefully it outputs copy number and allele fractions that are representative of the biology and not artifacts of the platform. Another point here is that in many cases commercial providers will provide a solution, and it's often the case that the commercial software is a bit of a marketing tool. The salespeople will say, we have great software, but often you can't know what assumptions underlie that package, and in many cases leading academic labs will be producing much higher quality software than what comes with the package. So that's just a word of caution: make sure you evaluate what software accompanies a commercial package when using these types of platforms, because signals can be missed, data can be misnormalized or not normalized, et cetera. Having a reasonable understanding of the range of methods out there for a platform is worthwhile in many cases. Yes? How much standardization is there between platforms? If you were to use an Affy system versus another system, can you actually compare data from one to the other?
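As a rough illustration of the two quantities the normalization step is meant to deliver, a total copy number signal and an allele fraction, here is a simplified sketch. It assumes already-normalized A and B allele intensities and a reference total intensity for the same probe; the function name `log_ratio_and_allele_fraction` is hypothetical, and packages such as aroma.affymetrix do considerably more than this.

```python
import math


def log_ratio_and_allele_fraction(a_intensity, b_intensity, reference_total):
    """Summarize one SNP probe as (log2 ratio, B-allele fraction).

    a_intensity, b_intensity: normalized intensities for the A and B alleles
    reference_total: expected total intensity for two copies, for example from
                     a matched normal or a pooled reference (an assumption here)
    """
    total = a_intensity + b_intensity
    if total <= 0 or reference_total <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(total / reference_total)  # total copy number signal
    b_fraction = b_intensity / total                # allele-specific signal
    return log_ratio, b_fraction


# Hypothetical probes: balanced two-copy, copy-neutral LOH, and a balanced gain.
print(log_ratio_and_allele_fraction(1000, 1000, 2000))
print(log_ratio_and_allele_fraction(2000, 0, 2000))
print(log_ratio_and_allele_fraction(1500, 1500, 2000))
```

The second probe shows why both numbers are needed: the log ratio says nothing has changed, while the allele fraction says one allele is gone.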
So the question is, can you compare data across platforms? That tends to be very difficult, because often the variation that you see is just due to the specific quirks of the platform. You know, maybe 1% of the data is something odd that is shown by one platform versus another, but 1% of the data tends to be a lot of things when you look at the size of the genome. Yes? I mean, with a normal you could reduce the profile and get a more accurate picture? That's an excellent point. That is the preferred design, having a matched normal sample and a tumor, for the reasons that you just said. There are a number of additional reasons that I think are important. In such cases the amount of germline signal is dramatically reduced, because of course that's what the normal will contain, and so that can be subtracted out very efficiently. And certainly in whole genome sequencing it's absolutely essential for somatic mutation detection, because the polymorphism rate is about three times higher than the mutation rate in cancers, so one absolutely has to have a matched normal in doing that. As he said, similar concepts apply here, although in cancer, in copy number space in particular, the somatic mutations will typically cover a broader range of the genome than the germline variation, so it's a slightly different problem in single-nucleotide space versus copy number space. But the preferred method is absolutely to have both. What if you don't have any good matched normal samples? Don't sequence the tumor. The question is, what if you don't have a matched normal? That's equivalent to taking $5,000 and lighting it on fire, so don't do that. You can get away with doing things like RNA-seq on the tumor for gene fusions; depending on the cancer type that's often quite useful, or even just for gene expression patterns. And these copy number arrays are actually a reasonable thing to do as well, because you can at least get a sense for how deranged the genome architecture is, and that
doesn't really require a matched normal on a global scale. And things like homozygous deletions of RB1 or PTEN, those are uncontroversial; you don't need a matched normal to interpret those. High-level amplifications of ERBB2: there's no way anybody's born with that. Good. So once we have normalized data, then we can get to the business of inferring these patterns of total copy number alterations, loss of heterozygosity, and allele-specific changes. This is an example of what it might look like. This is a case where we have a paired normal, and you can see here that if we only had the tumor, here's a region of deletion that is really a potent signal; it's a very focal change. If you were to look at just the tumor plot, one might get quite interested in this region: oh, this looks like a homozygous deletion, it looks just like the one that I showed you before, where you have this very deep deletion in a very focused region of the genome. However, in this case we actually ran the same platform on the normal sample, and you can see that it's there in the normal. So this is one of the strong advantages of having a matched normal sample, in that one can start to subtract out the germline signal. So this is an example of that, and then here are some somatic events that are really quite visible. The interesting thing that we showed in this paper, actually this is wrong, this is old, but this is the same paper that I referenced earlier from Gavin, is that there are different distributions in the germline signals than in the somatic signals, and that can actually be leveraged to distinguish the two types of events. Gavin did a lot of work in that area and developed the method called HMM-Dosage that can distinguish germline polymorphisms from somatic changes. Yes? Just because you detect the abnormality in both the normal and the tumor, how can you rule out that it's involved in tumorigenesis? Maybe it's inherited and it predisposes the
patient to the cancer? I don't think you can rule it out, absolutely right. The number of patients required for such analysis is often very, very high, so case-control studies, or families and probands for example, are a better design for that type of analysis, where we're looking for events that predispose. Having families with a strong family history and multiple affected and unaffected members, for example, is the key to uncovering that type of signal. Now, if you were to see, for example, as you mentioned before, a BRCA2 or a BRCA1, that would raise some eyebrows, and you should get interested in that, obviously. A deletion of PTEN in the germline, same thing. P53 deletions or mutations are associated with Li-Fraumeni syndrome, PTEN losses are associated with Cowden syndrome, etc., so you'd certainly want to pay attention to that. There are certain genes that you pay attention to. Most germline events are benign, and one can look at large sets of data from normal human variation studies and actually create a mask, a polymorphism mask, for example from HapMap or the 1000 Genomes type of data, and that helps interpret those. So typically you wouldn't find pathogenic germline variants in those samples, although I'm sure they're there. Yes? Is this matched normal a combination of several different patients? Okay, so in this case it's just the patient of interest: it's the DNA from the blood of the patient, and then the top plot is the cancer. Now, in the absence of such a matched normal, one can also use a pooled reference, for example from the HapMap project public data, and that can be a powerful way to provide a reference. It will eliminate a fair amount of germline variation, not all of it, but a fair amount, and that's what a lot of people do: they aggregate a pooled reference to do the study in the absence of matched normal DNA. Okay, all right, so let's see, we actually covered this already, so this is a similar concept to what I
showed. This slide here shows a number of methods for high-density genotype arrays. I've actually added one here, apologies for that, it's not on your printed slides, but this is the one you're actually using in the lab, and I would say it's actually the best one out there. It just wasn't included in this review paper, but I've added it, so you can scribble this down: this is Yau et al., Genome Biology 2010, if you want to really learn about the method. So this is the package that you'll be using in the lab. It's decent software, we use it in our own work quite often, and it's a reasonable method. Okay, so let's see, what time is it? How's everyone feeling? Good? Keep going? I've got a lot to cover here, so I'm going to speed up. All right, so it's interesting: when I started this series of lectures a few years ago, the field was dominated by genotyping arrays, and we had only just really done one genome with whole genome sequencing, I think, when we first started. My group, together with Sam Aparicio's, published the first whole genome sequence of an epithelial cancer in 2009, and for a long time that was one of a handful of whole cancer genomes around. But now the field has really moved along. Sequencing has become cheaper, and I shudder to think of the cost of that first genome. I think it was in the range of something like... well, it was a lot, but now we can do many cases for the same price. And so the difference here is that, as I mentioned, for arrays we are measuring hybridization intensity from fluorescence, so it's a continuous measurement that has a noise distribution associated with it, and there's a dynamic range that's really limited by the ability of the camera and the laser to capture the intensity of the signal. What next-generation sequencing offers is really going from a continuous measurement to a digital representation of copy number, so we go from analog to digital. And so what's shown in this
picture here are reads piling up across the genome. And here what you see is a region of the genome where there are actually no reads covering it at all, and so that might be an indicator of a homozygous deletion: there's no gene content there, there's no genomic content there, both alleles have been removed. In reality we never really see a signal quite like this; there's always some smattering of reads here, some contribution from normal cells, but you get the picture. And then schematically, a one-copy deletion might look something like this, where you have the reads, relative to normal, piling up here, and then we might have some regions of the genome with extra reads piling up, and that can be reflective of a gain. And then in the case of translocations, which I'm sure you went over in great detail yesterday, we're seeing two different parts of the genome coming together, so you might have one read of a fragment aligned to one part of the genome and the other read aligned to another part of the genome. So we're really making a conceptual shift from using fluorescence intensities to actually counting reads that pile up on a particular part of the genome. Unfortunately, it's not foolproof; it's also subject to different biases. In library construction for whole genome sequencing, GC content is often a contributor to the number of reads that pile up on a particular locus. So what's shown here in the top left is that there's a really strong correlation between the percent GC content and the coverage. The number of reads piling up on a particular locus is also called the coverage, so the other way of saying this is that the coverage is a function of GC content, and that really needs to be corrected for. We can use regression techniques, for example, to correct that bias, and we can remove the GC content bias because it's generally predictable: the genome is a stable entity
and so we know where the high GC content regions are, and we can subtract out the contribution of the GC content to the signal. Yes? So high GC, fewer reads? I think it's the other way around. And then the other interesting phenomenon is that of course the genome has repetitive regions, so not all regions are equivalently mappable, and that plays a role in the number of reads that can align to a particular locus, especially unambiguously, and so that can also be accounted for. And so the take-home is that if we don't do any processing, so this is again just binned data, we take 1 kb windows and we count the number of reads that align into these 1 kb windows and we plot that, we might see something like this. If we take into account GC content, we subtract the GC content, it starts to look like this, and when we subtract the mappability, it starts to look like this, and you can see this is a very much cleaner representation of the data. Without normalizing we get this mess, and once we normalize we can start to see the biological signal over the noise. So you're covering normalization in the lab as well? Right, okay. So I think essentially you'll be processing at this stage here, but the tool that you'll be using, called Titan, actually has this preprocessing rolled into it as part of the package. If you get the data straight off the Illumina, will this be...? No. The data that comes off the Illumina machine would be FASTQ files, essentially. You've probably covered that; you've gone over alignment. So this is assuming that you have aligned data, and then what you're going to do is process those BAM files and count the number of reads across a window, so that's part of the tools, essentially. These tools all work from BAM files. So in the lab we're not doing the... I was just asking. We're not doing the preprocessing; the scripts have been provided for what you would be doing from the BAM, so we're going to start at a different point
but the scripts are there for you to actually do that. Right now there's a separate page called Data Processing that explains how all of that was done. I mean, all of this work, once you get into it, is computationally intensive, so, you know, I would say don't sequence without a matched normal, and don't sequence without access to computing either. It'd be foolish. What if there is an uneven distribution of reads? Well, that's exactly what it is. There is an uneven distribution, and that's what this plot here shows: the coverage is a function of GC content, so regions of high GC will have lots of reads piling up, and vice versa for low GC. That's what this tries to account for, that there actually is an unequal distribution. So after this, it corrects for them as best as possible? I mean, elimination, I would say no, but it reduces the impact. You can imagine trying to work with this would be very difficult, but once we get down here it's much more tractable. Oh, right, so this is median absolute deviation. This is a measure of variation; it's actually a measure of unequal coverage across the genome. So the maximum absolute deviation, the median absolute deviation, sorry, is reduced with each step. I'm being urged to go faster, so I will. Okay, so we've seen pictures like this; I'm going to go over this. All right, so I've shown this genome already. This is an example of processing whole genome sequencing data using a tool called APOLLOH, which is really a precursor to the tool that you're using in the lab, called Titan. What's shown is that after normalization we can very clearly see copy number alterations across the genome from whole genome sequencing. This is really quite powerful if you think about it. Often the goal of whole genome sequencing is of course to get the sequence-level features, so you want to find point mutations. With the exact same experiment, you don't repeat the experiment, with the same data set
one can actually pull out the copy number architecture as well. OK, so whole genome sequencing is incredibly versatile in that collecting the data will give you access to point mutations, which we'll go over this afternoon, copy number alterations, and the translocations and rearrangements that Andrew showed. So actually in my lab what we do is a lot of whole genome sequencing, and then we try to pull out as many biological features of the data as possible. Often we'll put the same data file through about six or seven different analytical tools to extract the different biological features of interest. And so I think that's a really important point to make: in the old days what you might do is run an array, a SNP6 array, to get the copy number architecture, and then you'd have to maybe run an exome, for example, to look at the sequence level of the coding space. And essentially whole genome sequencing is arriving now at a cost threshold, both in material, so the amount of DNA that's needed, and also in dollars. And so whole genome sequencing is arriving at a place where it's exceeding, I would say, both of those platforms in terms of what it can deliver with very little input material. And so the field in just a few years' time will be completely dominated by whole genome sequencing; any investigation of a cancer's genome will be done with whole genome sequencing. It's pretty much there now. OK, so I often get asked if we can do this type of analysis on exome data. Exome capture is still quite a popular method. It's a little bit cheaper than doing whole genome sequencing, and it only gives you a picture of about 1-2% of the genome, but a lot of people work with exomes, and TCGA, for example, is essentially dominated by exome capture data. So there's some strong interest. It'll be a narrow window in time where exomes are of use, but generally speaking, to the question of whether we can get copy number from exomes, the answer is yes.
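The read-depth normalization described earlier (count reads per window, remove the GC-content trend, then check the median absolute deviation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Titan or APOLLOH implement it: the function names `gc_correct` and `mad` are hypothetical, the per-window counts and GC fractions are assumed to have been computed already from a BAM file, and the correction divides by the median count of windows with similar GC rather than fitting a curve. Mappability correction is left out.

```python
from statistics import median


def gc_correct(counts, gc, n_bins=10):
    """Scale each window's read count by the median count of windows
    that fall in the same GC-content bin (hypothetical helper).

    counts: reads per fixed-size window (e.g. 1 kb)
    gc:     GC fraction of each window, between 0 and 1
    """
    def bin_of(g):
        # Map a GC fraction to a bin index; 1.0 goes in the last bin.
        return min(int(g * n_bins), n_bins - 1)

    # Collect the counts observed in each GC bin.
    by_bin = {}
    for c, g in zip(counts, gc):
        by_bin.setdefault(bin_of(g), []).append(c)
    bin_median = {b: median(v) for b, v in by_bin.items()}

    # A corrected value near 1.0 means "typical depth for this GC level".
    corrected = []
    for c, g in zip(counts, gc):
        m = bin_median[bin_of(g)]
        corrected.append(c / m if m > 0 else float("nan"))
    return corrected


def mad(values):
    """Median absolute deviation, a robust measure of spread."""
    values = list(values)
    med = median(values)
    return median([abs(v - med) for v in values])


# Toy data: high-GC windows pile up more reads; the last window also
# carries a real copy number gain on top of the GC effect.
counts = [100, 100, 100, 200, 200, 300]
gc = [0.35, 0.35, 0.35, 0.65, 0.65, 0.65]
corrected = gc_correct(counts, gc)
print(corrected)
print(mad(counts), mad(corrected))
```

In this toy case the GC-driven difference between the two groups of windows disappears after correction, while the one window with a genuine gain still stands out, which is the point made above about seeing the biological signal over the noise.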
There are a couple of papers that have been published. I'm being told that this tool, ExomeCNV, is actually no longer available on Bioconductor, so we'll ignore that one and talk about this tool called Control-FREEC, and also Titan can work with exomes, and you'll be working with Titan in the lab this afternoon. OK, so, good. What I've shown you so far are actually genomes of relatively benign complexity. Some of the karyotypes I've shown you are really quite complex, but it can get really significantly complex. And so this is an example of a chromosome that's undergone a process called chromothripsis, which is essentially a shattering of the chromosome followed by non-homologous end joining to repair it and stitch it back together. What this results in is a substantially rearranged chromosome with copy number alterations that are essentially replete throughout. These are incredibly complex signals, and interestingly enough, in the METABRIC project we noticed a couple of cases with this type of pattern, and we called it the sawtooth pattern. Actually the resolution wasn't high enough to really untangle this, and we cast it off as, it must be just a failed array or something like that, it didn't work very well. But then there are about one or two percent of cases that actually exhibit this chromosome shattering, and so this is as complex as it can get. What's interesting about this is in certain diseases like neuroblastoma. This is a pediatric disorder, and essentially these tumors are mostly devoid of mutations, and so early sequencing efforts on these types of cancers didn't turn up very many recurrent mutations. That was the goal, to find recurrent mutations that were driving the disease. But what people did notice is that some of these cases exhibited this phenomenon of chromothripsis, and the very interesting thing in this paper published in 2012 is that in fact the cases with chromothripsis had markedly poorer survival than other cases. So, as
a feature, the genome properties themselves can be prognostic. It would be very difficult to disentangle what genes are being altered here, because essentially whole chromosomes are being altered, but just as a general genomic feature, in this study they show that it can actually be useful in itself as a prognostic marker. So that's how complex the genomes can get, but the material that one is generally sequencing from tumors is itself very complex as well. As I mentioned before, tumors are comprised of distinct clonal populations that almost always have some degree of variation in their copy number genotypes. And so this is just a schematic that shows that there are different populations of cells, as shown by colors, and these colors represent genotypes that are the same within a group but differ between groups. So as we get into this part of the lecture I want to introduce some terminology. This particular interest of mine is in analyzing tumors for clonal population structure and understanding the dynamics of that structure over time, but it's important to have some terms so that we can discuss this on the same playing field. So the one definition that you might hear is allele prevalence, and this, for a particular mutation or particular feature, is the proportion of measurements at a given locus supporting a mutant allele. So the proportion of reads at a particular locus that harbor a mutation is a good example. The cellular prevalence is the proportion of cells in the sample that harbor a mutant allele. So if we go back to this one, there will be some mutations that are specific to the pink cells, and if the pink cells are somewhere around 20%, then the cellular prevalence of the mutations that are specific to the pink cells might be 20%. But if we think about how clonal evolution actually works, the initial set of mutations that create a malignancy will be present in all the cells and be carried forward, so
there will be a set of mutations that are shared in the entire population that will have a cellular prevalence of 100%. The clonal genotype is the set of mutations harbored by a cell population, so it's the full set of mutations that co-occur in individual cells, and that's what we call the clonal genotype. And then the clonal prevalence is the proportion of cells in a sample with the clonal genotype. So here one can count up the number of pink cells: that's the clonal genotype, and the clonal prevalence is the proportion of pink cells in the population. So copy number, as it turns out, can be used to mark these clonal populations, and so this is a landmark paper published by Nick Navin. He's now at MD Anderson; at that time he was at Cold Spring Harbor. What this group did is isolate individual nuclei from breast cancer, and they sequenced the genomes of these individual nuclei and noted that the cells essentially clustered into groups according to copy number. So they computed the copy number and then found that there are really three major tumor populations, characterized by the red, yellow and blue. I think these are normal cells here, if I'm not mistaken. And so this was a really very nice high-resolution look at how different cells exhibit different copy number profiles within the same tumor, and you can see that they cluster nicely together. You can think about this as a bit of a phylogenetic tree, if you will. They didn't actually do phylogenetic inference, but this is analogous to a phylogenetic tree where this is essentially the root node, the normal cells, and then we have a branching process that generates the observations that can be clustered together according to shared genotype. So this was an important advance. Single cell sequencing, however, is difficult and is challenging in even the most advanced labs, and so this is a technical breakthrough that really shows a picture of what we might see, but in reality what we are typically doing is sequencing the mixture, so we don't
sequence each individual cell; we sequence the aggregate mixture of all these cells. And so how do we actually leverage the principles of clonal evolution to start understanding what the composition of our population is? And so this is, sorry, I omitted the reference here, this is a paper that appeared in Cell in 2012. The first author is Nik-Zainal, I can write this down, and this is from the Sanger. This actually remains one of the deepest genomes ever sequenced; this is a 188x genome. What it showed, really with quite nice resolution, is that there are signals that vary according to the dominance of a particular feature. What that means is that here is an example of an event that is present in all cells, so this is part of what we call the ancestral clone, and I'll go over more of this concept in the afternoon. It's the same type of plot that we showed earlier: it's a deletion, and it skews the alleles away from 0.5. You can see with this high level of coverage how clean the signal becomes, much cleaner than the plots I showed you earlier, and that's because those plots were generated from around 30x coverage and this is 188x coverage. And so if we just contrast this event to this event, you can see two things. One is that the amplitude of the deletion is lower here than here, and also the degree of skew of the B allele is different in this feature versus this feature. So we can start to leverage this and learn something about these events. If the cellular prevalence of an event is very, very high, then nearly all cells will have that event and the signal will be very strong. If the cellular prevalence is low, then the signal will be a little bit muted, or not as strong. So we can try to quantify this, and if we believe this is true, then we can start to profile the events as being part of events that are shared by all cells and contrast those to events that are probably specific to
a subpopulation of cells. And once we do that we can start to get some insight into what the population structure of the input material was that we sequenced. OK, and so we developed a method called Titan, and this was developed by Gavin Ha in the group. The conceptual framework here is that we have a normal genome and we have a tumor genome, and what we do is extract the heterozygous SNPs, as mentioned before, and that determines the positions of interest that we look at. This is typically around 2-3 million per genome, so that's about how many polymorphisms any individual might have. And then what we do is count the alleles at each of those positions in the tumor, and so that's just shown schematically here. Really the input is these two vectors here, where we have the allele counts from the tumor at the loci specified by the normal genome. And then we developed a statistical model that has its foundations in a hidden Markov model, and it can take this as input in trying to learn two important factors: one is where the loss of heterozygosity and copy number events are, and number two is the prevalence of those events. And so schematically it might look like this. Let's say we have two clones in a mixture that look like this: here you have a deletion here and an amplification here, and the other clone may lack that amplification. When one mixes those two clones together, then the signal might look like this. So this amplification, because it's not present in all cells, actually shows a lower amplitude, and so the signal gets compressed, whereas the deletion maintains its same level because it's present in both. And so the goal here is to try to decouple this and determine the events and their cellular prevalence. So I'm going to skip over this; this is just some benchmarking to say that yes, we can do this, and that considering the population structure increases the sensitivity to events of low cellular prevalence. There's been... well, I lost my... it's
telling me to take a break. Not sure what happened there. Here, there. So this is work by Ville Mustonen and colleagues at the Sanger, and essentially what their work does is actually try to infer what those clonal genotypes are. So from a single sample it tries to deconvolute what the genotypes are and then infer their clonal prevalence. And I should say also that Andrew has done a lot of work in this area, and the work is not yet published, but he's developed, I would say, probably the best method that exists for this work, and we're working on getting that out there soon. OK, so then I just wanted to finish with a view of emerging technologies to try to deconvolute mixtures and understand, number one, what is the clonal population structure, and number two, what are the dynamics of those populations, especially in the context of clinical care. And so advances in single cell sequencing have brought us to the point now where, I wouldn't say it's routine yet, but it's becoming much more prevalent in multiple labs to sequence individual nuclei. The first paper I showed you from Nick Navin's group was 2011, and then they published something at the end of last year showing some advances there, and I know that they are essentially ramping up now to be able to sequence hundreds and thousands of nuclei from individual samples. And in our work at the BC Cancer Agency we're also making really important improvements in single cell sequencing, and we're now at the point where we're applying single cell sequencing devices to literally look at thousands of nuclei from individual cancers in the sampling area, because in some ways it's a direct measure of the composition without having to deconvolute it from the mixture. That is an algorithmically very challenging problem, and often we can extract these signals, but the results will always be predictions and somewhat ambiguous. By directly
looking at single cells, you can essentially read off what the mixture is without having to deploy sophisticated algorithmic statistical models. Now that's not to say that the data are easy to work with or easy to generate. As you can imagine, in each individual nucleus we're working with very small quantities of DNA, and so they need to go through amplification processes that introduce biases and measurement errors, such as allele dropout, for example. Sometimes both alleles won't amplify equally, and the result will be that we only see one allele, and in the case of single point mutations the absence of a mutation doesn't necessarily mean it's not there. So that's a bit of a problem that the field is struggling with, and that's why we don't see a wide range of adoption yet; there are still some issues. And so right now the edge of the field is somewhere where we really need both: we need to look at bulk and single cell simultaneously to make inferences. Eventually, for this type of analysis where we're trying to understand clonal population structure and dynamics, I think the field will move over to single cell technologies as a potent way to measure clonal dynamics. All right, so I think I'll start wrapping up now. I hope I've shown you that the genome architecture, the copy number architecture, is really a fundamentally important aspect of studying the cancer genome, and that these copy number changes alter gene dosage, which drives expression of oncogenes and tumor suppressors. The measurement technologies we've covered are array hybridization type methods and also next-gen sequencing, and I think it's clear that the copy number profile can indicate important phenotypic characteristics of cancers. And finally, the copy number profiles can be used as clonal marks to reveal the population structure of a tumor. So there are a number of tools that I've listed here that are relevant. I suppose that this should be updated with
Titan; I guess we can get that on there on the wiki. You'll learn about that in the lab. And I think I'll wrap up there and take any questions.
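The point made in the Titan part of the talk, that an event present in only a fraction of cells gives a compressed signal, can be illustrated with a small calculation. This is a simplified sketch and not Titan's actual model, which is built on a hidden Markov model: the function name `expected_signal` and its parameters are hypothetical, cells without the event are assumed to be diploid with one copy of each allele, and only a single event at a single locus is considered.

```python
from math import log2


def expected_signal(total_cn, major_cn, cell_prev, normal_frac=0.0):
    """Expected bulk signal at one locus (hypothetical helper).

    total_cn:    total copy number in cells carrying the event
    major_cn:    copies of the more abundant allele in those cells
    cell_prev:   fraction of tumor cells carrying the event (0 to 1)
    normal_frac: fraction of the sample made up of normal cells

    Returns (log2 depth ratio versus diploid, major-allele fraction).
    """
    # Fraction of all sequenced cells that carry the event.
    with_event = (1.0 - normal_frac) * cell_prev
    without_event = 1.0 - with_event

    # Everything without the event is assumed diploid: 2 copies, 1 major.
    avg_total = with_event * total_cn + without_event * 2
    avg_major = with_event * major_cn + without_event * 1
    return log2(avg_total / 2), avg_major / avg_total


# A one-copy deletion present in every cell versus in half of the cells.
clonal = expected_signal(total_cn=1, major_cn=1, cell_prev=1.0)
subclonal = expected_signal(total_cn=1, major_cn=1, cell_prev=0.5)
absent = expected_signal(total_cn=1, major_cn=1, cell_prev=0.0)
print(clonal, subclonal, absent)
```

Under these assumptions the deletion shared by all cells shows the full drop in depth and a fully skewed allele fraction, while the same deletion at 50% cellular prevalence shows both a shallower drop and a weaker skew away from 0.5. That difference is what lets a method work backwards from the observed signal to an estimate of cellular prevalence.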