Everyone, it's good to be here. I'm proud to be probably one of the longest-serving faculty members in this workshop, outside of Francis. I've been involved in this series of workshops since, I think, the year 2000, so that's a long time. And I keep coming back every year because I enjoy it. It's a great way to meet and interact with people across the country who are interested in this field, so I'm very, very pleased to be here. So today, we're talking about somatic copy number alterations. Before we get into that, I thought I'd give you a little bit of background.
So my work centers around understanding properties of the cancer genome and how tumor cell populations evolve in different contexts, and really making use of these wonderful new measurement devices that we call next-gen sequencing devices to study cancer genomes at nucleotide resolution. That involves quite a diverse array of expertise and knowledge bases, and it's really a multidisciplinary science, so it's very collaborative work. And it is becoming increasingly computational in nature. I would just put that into context in the sense that we do a lot of algorithm and statistical modeling development in the group. My PhD is actually in computer science; I did a PhD with Kevin Murphy a few years ago. And I think you've probably heard from a couple of people who have formal training and whose PhDs are in the computational sciences. But the way I run things in my lab is to stay focused on certain biological questions. So we have a biological question, we take measurements using these devices, and then we ask: OK, where is the computational gap? Where is the statistical modeling gap? And we try to fill that so we can answer the biological question of interest. So it all comes from that scientific perspective, either clinical or biological.
The other aspect is that we do a lot of method development, and we publish a lot of papers in methods journals. But I always have the view that a method is best developed if there's going to be an application data set on the other side of it. So in parallel, we've published a lot of papers that apply those methods to new data sets to gain new biological insights. And that's really when the magic happens. If you innovate from a computational perspective and you learn something biological that you couldn't have seen without that innovation, that's really when we get quite excited. So this is my group. I have a group of about 25 people in Vancouver, including five postdocs and a number of students. But I also have a number of core staff who do a lot of the day-to-day data analysis. So we build a lot of computational workflows that are reproducible. We have a lot of collaborations, and we also generate our own data and analyze those data through pipelines in high-performance computing environments. So that's the background of where I'm coming from.
All right. So today, we're going to cover a fair number of topics with respect to copy number alterations. This is the outline. We'll talk about biological relevance and impact. We'll talk about some of the measurement technologies that are available, in particular high-density genotyping arrays and next-gen sequencing. And then we'll get into some advanced topics at the end, given some time. We've got a two-hour block and I probably have just over an hour's worth of prepared material, so let's have a lively discussion. Feel free to interrupt at any time. Usually if I prepare half an hour of material, that's enough to elicit a nice discussion, so I think we're still going to be pressed for time.
So don't be shy at all about interrupting, and we can discuss anything that comes to mind. OK. So how many people know what this is? Have you ever seen something like this before? Shout it out. What is it? How is this generated? Anybody know? Yeah. So these are metaphase spreads and a chromosome painting technique called spectral karyotyping. What this is essentially showing is how DNA is organized into 23 pairs of chromosomes.
I should also ask, actually, before we really get into this: how many people come from a pure computational, engineering, or statistics background? And the rest are biological? How about basic science and biology? And clinical? And the clinical people are pathologists or oncology specialists? Yeah. And more hematology. OK. Great. Excellent. So if we all put our heads together, we could solve cancer genomics. This is great. Fantastic.
OK. So this is a human karyotype that really illustrates that there are two copies of the genome in every cell. One comes from your mother and one comes from your father in the initial formation of the zygote. So that's what your normal cells should look like. As far back as 100 years ago, Theodor Boveri made a hypothesis while looking at sea urchin nuclei. Sea urchin nuclei are very large, so they can be looked at with low-power microscopy. He was studying these nuclei and looking at their replication rates and growth rates. And he noticed that occasionally there would be a clone, a group of cells, that started dividing more rapidly than the others. He associated that with the acquisition of an extra chromosome; I think it was chromosome 6, they called it. And those cells went on to divide much faster than the cells that didn't have that extra copy. So this is an incredible leap, if you think about it. But he actually made this association.
He said, well, maybe this is the origin of human malignancy: there is a change in chromosomes that leads to this additional growth rate. So this was really quite prescient. And he was proven correct in 1960 with the discovery of the Philadelphia chromosome in CML. I believe you would have covered that yesterday in the gene fusion lab. So it's an endogenous factor. Our chromosomes change, our genomes change, in order to drive malignancy and acquire a new phenotype.
So contrast the picture I showed you before with this one. This is high-grade serous ovarian cancer, and this is probably the cancer type that has the most genomic disruption of all human malignancies, at least of the common types. This is six different tumors, and you can see these tumors look nothing like the picture I showed you before. We have acquisitions of extra copies of certain chromosomes. We have exchange of information between different chromosomes. Let me just see if I can use this pointer here. So here's, for example, an exchange of information, where two different chromosomes have come together. Here's one that's even more complex than that. And so on and so on. Some chromosomes have a reduced amount of material as well. So you can have acquisition of extra copies of material, and you can have removal of material. This is actually a hallmark of most cancers. Some cancers have fairly diploid genomes, but for a large number of the solid tumors that are very aggressive malignancies, this is a common phenomenon.
So from a very basic perspective, you can imagine that these copy number variations are just losses or gains of genetic material. At its simplest, you can have a deletion. So here's a mark, or a locus, that's lost in one chromosome. Here's a duplication in the chromosome. Or you can even have deletion first, followed by duplication.
We'll go over those examples. So this is a summary of 1,000 breast cancers that shows the proportion of cases with copy number amplifications in red going up, and the proportion of cases with copy number deletions in blue going down. The chromosomes are laid out on the x-axis. This is like taking the genome, laying it flat, stringing it out, and then looking at how many cases out of 1,000 have a copy number amplification or deletion at that locus. You can see that almost the whole genome has some frequency of alteration. And this is across all types of breast cancers, so ER positives, ER negatives, et cetera. In some cases, you have events like this, where 50% of all breast cancers have an amplification of chromosome 1q. It's a major feature of the disease. And if you really zoom in, there's a little spike here. You can see that right here. That's the ERBB2 locus; 15% of breast cancers have an amplification of ERBB2. We'll talk about that in a little detail. And you can see that there are deletions throughout as well. So here you have a deletion on chromosome 17. This happens to be where TP53 sits, so there's often mutation of p53 followed by deletion of the wild-type locus. Here you have a pattern of deletion and amplification of the two arms of a particular chromosome; that's called an isochromosome. That's a common feature of breast cancers as well. So we can study this from a very high-resolution perspective. This is about 1 kb, 1.5 kb resolution in terms of where those marks sit, and we can really start to zero in on where copy number changes are actually occurring in the genome.
So why are copy number changes important? Well, the major concept here is that, of course, we have genes that are encoded in the genome.
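The cohort frequency plot described above can be sketched in a few lines. This is a minimal illustration, not the method used for the 1,000-case study: it assumes each tumor has already been reduced to one call per locus, coded here as -1 (loss), 0 (neutral), or +1 (gain). The function name `alteration_frequency` and the toy data are hypothetical.

```python
def alteration_frequency(calls):
    """Per-locus fraction of samples with a gain and with a loss.

    calls: list of samples, each a list of per-locus calls coded
    -1 (loss), 0 (neutral), +1 (gain). The coding is an assumption
    made for this sketch.
    """
    n_samples = len(calls)
    n_loci = len(calls[0])
    gains = [0] * n_loci
    losses = [0] * n_loci
    for sample in calls:
        for i, call in enumerate(sample):
            if call > 0:
                gains[i] += 1
            elif call < 0:
                losses[i] += 1
    gain_freq = [g / n_samples for g in gains]
    loss_freq = [l / n_samples for l in losses]
    return gain_freq, loss_freq


# Toy cohort: 4 tumors, 4 loci ordered by genomic coordinate.
toy_calls = [
    [1, 0, -1, 0],
    [1, 0, 0, 0],
    [0, 0, -1, 1],
    [1, 0, -1, 0],
]
gain_freq, loss_freq = alteration_frequency(toy_calls)
```

Plotting `gain_freq` upward and `loss_freq` downward against genomic position gives the kind of red-up, blue-down summary shown on the slide.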
So if you have a deletion of a particular locus that harbors a gene whose function is, for example, to keep genomic integrity intact or to suppress growth under normal conditions, and we remove that locus, that gene can no longer be expressed. That protein can no longer act in the cell. And that allows cells to evade the normal checkpoints that keep a cell functioning as it should. On the other side, if you have amplification of material that contains a gene whose role is to promote growth and proliferation, and we have extra copies, then through gene dosage you're going to have lots of protein in the cell driving those processes. Those are typically associated with biochemical cascades. They drive entire pathways and overexpress pathways, and that's how the cells acquire that particular phenotype. So the most important association here is that copy number changes can lead to changes in gene expression. And that, in turn, of course leads to changes in the abundance of a protein in the cell. Many copy number changes have been associated with mechanisms of tumorigenesis, but they can also be used for prognostic and diagnostic purposes.
OK, any questions so far? Back at the genomic landscape of breast cancer: so a normal sample would look pretty much flat? It would be very flat, yeah. But don't forget, this is not one sample. This is a synthesis of 1,000 cases. What's shown on the y-axis is the proportion of patients, of tumors, that have a particular alteration at that locus. Is there something similar just for one? Yes, we'll see that. We'll see what that looks like. Yes? So the changing of copy number can lead to changes in gene expression. For clinical application, how common is it for people to run both copy number analysis and gene expression analysis on the same subject? Yeah, that's a great question.
So gene expression analysis in general turns out to be not the most reliable measurement. People typically use protein expression through immunohistochemistry in a clinical environment. Or they use a technique called fluorescence in situ hybridization, which is a very old-school technique that's been around for decades. But it's very reliable in the sense that we can take a fluorescent probe and it gets incorporated into the nuclei of the cells, and for every copy that exists in that cell's nucleus, we see a spot light up. Usually experienced technologists look at it manually and do some scoring through SOP-type algorithms, or now it's shifting over to automated counting of these markers in the clinical environment. So in the clinical environment we're still in a world where it's either fluorescence in situ hybridization or immunohistochemistry. There are some gene expression profiling assays that are coming online now. You may have heard of Oncotype DX, for example, for breast cancer, or the competing product, Prosigna, which is based on the PAM50 classifier. These use RNA extracted from paraffin. But they're really meant to do breast cancer subtyping and a risk-of-recurrence score, so it's a slightly different thing that they're measuring. So the clinical applications for RNA are typically in these gene expression profiling cases, whereas if you're looking for amplification of ERBB2, for example, if you're looking for HER2 positive, it's done either by immunohistochemistry or by FISH. We found using arrays that both can be problematic and that arrays are probably more accurate, but it's going to take time to work that into the clinic. So you can't find novel things with FISH, and you can't find smaller [inaudible]. Correct. No, but he's asking about a clinical environment.
So you have to know what you're testing, obviously, right? It's not a discovery platform by any means. You're looking at one locus and doing a test: is this locus amplified? And you get your answer. It's not asking which loci are amplified in the whole genome. That's a different question. Okay, good.
Okay, so there are several different classes of CNVs that are associated with different abnormalities. And I just realized that we probably have very outdated terminology here. I'm going to fix this right now. I think "intellectual disability" is better. There we go. So you can imagine that we can have de novo acquisition of changes from two parents, where the child is born with a particular marker in the genome, present in the zygote, and that can lead to congenital abnormalities, usually associated with intellectual disability or impaired motor function. Then we have somatic alterations, which are acquired mutations in specific tissues. These are usually associated with malignancies and are found in most, if not all, cancers. And then of course we have many benign variations in our genomes, which end up distinguishing normal human traits, or even have no effect on phenotype whatsoever. But copy number variations in the human population were really underappreciated, and it wasn't until about 2006 or 2007 that we realized that in fact more than 10% of the genome is affected by copy number variations, which leads to variation in human populations. A lot of that work was done here in Toronto. So that was something fairly new on the landscape. And in many respects, you can imagine that because copy number variations are large chunks that may be a kb or more, in aggregate they actually affect more of the genome than single nucleotide polymorphisms in many cases. Okay, so moving on then.
So in cancer, we can have several classes of changes. What we'll call segmental aneuploidies are often large scale. When I say large scale, I mean occupying a large part of the genome; it's usually a chromosome-arm-level event or a whole-chromosome event. These are common, but they're often low amplitude. So you may just get a single-copy gain of a chromosome, for example, or a single-copy deletion, because it may be deleterious to the cell to wipe out two copies of an entire chromosome, or to gain many, many copies of an entire chromosome. To contrast that, we have focal copy number alterations. These are deletions and amplifications of high amplitude that typically target just one or a few genes. A classic example of this would be the ERBB2 locus. These can be very good indicators of driver events, so events that are likely to be changing phenotype or driving a signaling cascade in a cell.
And then you heard yesterday about rearrangements. Typically a rearrangement will lead to deletion of material or amplification of material, so actually they're the same event. But until very recently, with next-gen technology, we haven't been able to read off rearrangements and copy number simultaneously in a high-throughput way. We're just able to do that now; it's in the last two or three years that that's become a possibility. So we see changes in the genome, and we think about the genome as this linear construct and look at the number of copies at each locus. But in fact, you may have changes in copy number that also have scrambling of the genome associated with them. I'll show you some pictures of that later on. Okay. So this is a list of known copy number alterations in cancer that are what we call actionable.
So actionable really suggests that there's a drug that can be administered in the presence of that particular event, one that can target it and have some effect on the treatment outcome of the patient. And the classic example here is ERBB2. So here it is. Who knows what ERBB2 is? Go ahead, shout it out. Yeah, it's a gene, yeah. What's its significance? It's very significant. Yeah, okay. [inaudible audience response] Okay, good. So I call it the poster child for personalized medicine. We go on and on about personalized medicine, and what is personalized medicine? Well, nobody really knows what that is, so don't believe anybody when they say they know what personalized medicine is. But nonetheless, it is the marker that I think we all point to. This is the kind of thing we're all looking for, because this is an amplification that happens in the genome in 15% of breast cancer tumors. It used to be that women who had that type of breast cancer were really sentenced to death. It was a very highly aggressive phenotype. And then in the 90s, there was an antibody developed against it. I think Dennis Slamon originally did that, and then Genentech took it over and developed this drug called Herceptin. And I think the first patient was administered something in the late 80s, and she's still alive today. It's pretty amazing. So it's taken a very highly aggressive disease and changed its outcome trajectory dramatically. The five-year survival rate for HER2-positive breast cancer is actually not bad relative to what it was. It's still not perfect, it's not a hundred percent, but it's much, much better than it was. So in cancer research, this is what we strive for: find a target, develop a drug, change the outcome trajectory of patients.
There are other examples as well, but that is by far the biggest success story in the last few decades. So how do these things occur? It's usually through a compromised DNA repair pathway. In high-grade serous ovarian cancer, for example, it's homologous recombination, which repairs double-strand breaks in the genome on cell division. Usually through disruption of BRCA1 or BRCA2, whether germline, somatic, or by methylation, those cells have a compromised ability to repair double-strand breaks. And so copy number changes can accumulate, to the point where you get the spectral karyograms that I showed you earlier. And in this study, which is a synthesis of the TCGA. Surely you've been introduced to TCGA by now? Yes, okay, good. So if I say TCGA, you know what I'm talking about? Okay, great. Perhaps you've even seen this figure. This is a study of 12 different tumor types after the first phase of the TCGA, and it shows, on a spectrum, whether cancers are affected by point mutations versus copy number changes. So there are two classes: the point mutation phenotype and the copy number alteration phenotype. You can see that ovarian cancer here is by far the tumor type associated with the most copy number changes. And the interesting thing about this study is that it showed that tumors can acquire a DNA repair deficiency in mismatch repair, so they accumulate single nucleotide changes, or they can have changes in DNA repair pathways like homologous recombination, which repairs double-strand breaks. But you rarely see tumors with both, because that's going to be essentially synthetically lethal to the cell; those cells just won't be able to cope with having two DNA repair mechanisms altered. So you often see one or the other, and there's this mutual exclusion. So today we're talking about tumors on this end of the spectrum.
And this afternoon's material would be more applicable to tumors on this end of the spectrum. Okay. So the bottom one is SNP changes and the top one is... Sorry, no. The top one is SNVs, so single nucleotide changes, and the bottom is copy number changes. I'll put the reference here. It's actually worthwhile reading. If you want a high-level overview and synthesis of the TCGA dataset, this is a decent paper to look at. So the pattern of copy number change can then potentially be used to stratify different cancers into these different groups. Even though it's something in the genome, the pattern of copy number change, or the abundance of copy number change, can essentially be used as a phenotypic classifier, because it tells us which DNA repair pathways are compromised in the cell.
Okay, so now coming back to your question about what it looks like in a single tumor. This is a single tumor now. You'll often see pictures like this when studying copy number changes. What's shown here is the karyogram, so the chromosomal banding pattern, along the x-axis, and shown here is the position in megabases across the x-axis. The y-axis tells you the estimated number of copies of that particular locus. And each dot here comes from an Affymetrix SNP6 array. That's probably still the most used platform for copy number measurement. Yeah. So zero is actually no difference? So zero means no difference, yeah, that's right. This is copy number relative to normal rather than absolute copy number. Each dot here represents a particular probe on the array, and those probes are ordered according to where they sit on the genome, just literally spatially ordered according to their chromosomal coordinate. And then the y-axis is essentially a reflection of the hybridization intensity of DNA to that particular probe.
So you can take a picture with a camera and measure the fluorescence intensity at that particular locus. And then we can measure that relative to a control, or sometimes a pooled normal. And so this is an unambiguous signal. You don't need any fancy machinery to look at this and say that locus is amplified. This is the ERBB2 locus. And you can see that it only constitutes a small part of the genome. You can imagine that if the whole chromosome had this many copies, it would very likely be deleterious to the cell, and the clone that acquired that would never really survive. So that's what an ERBB2 amplification looks like. This one here? Okay, so that's a low-level amplification. It's probably a duplication of the telomeric end there, and very likely that's a translocation as well. At the centromere, there are also these red spots in that region? Yeah, so that's very likely noise. The centromeric regions are very, very difficult to study because they're highly repetitive. So typically, in order to cope with that, people ignore the centromeric regions because they're essentially uninterpretable. That's not to say that there aren't associations of phenotype with changes in genetic material at the centromere, but they're very, very difficult to study. It's a signal-to-noise problem.
When you run a SNP6 chip, do you typically run it on, say, one tumor-derived sample, or do you usually use a tumor-normal pair? Yeah, it's a great question. In many situations, if you're doing a retrospective study with tumor banks, sometimes the normal isn't available. With respect to copy number changes, this type of thing is unambiguous and it's always going to be somatic. So you don't actually need the normal to interpret this particular change.
But I'll show you other examples where having the normal is quite important. If you know what you're looking for and you see an amplification like this, there's no way that would result in a viable zygote if it were in the germline. Those cells just wouldn't be able to proliferate and make a normal human being. But in certain cases, because of germline copy number variation, you see changes that look like they could be quite interesting, but in fact they're in the germline and are likely not driving the malignancy. So generally speaking, it's better to have a tumor-normal pair, but it's not absolutely necessary. In the study that I'll talk to you about, only about a quarter of the cases out of 2,000 actually had a matched normal, and we were able to pull a lot of interesting biology out of that data set. And so did you just use a standard sample that has no known changes? Yeah, so what we actually did is we created a pooled reference out of the 450 cases for which we had normals, and then ran the tumors against that pooled reference. You can download these pooled references as well if you're using Affy. So if you have a tumor sample that you want to examine and you don't have the matched normal, then you can use these pooled references. You just pull down the data set and use it as part of the analysis. You can use HapMap for that. HapMap did something like 1,000 normal cases. That gives you a reasonable pool of normals. Yes? Is this the pool of 1,000? This is one tumor here. This is just one tumor. [inaudible audience question] Yeah, that's right. And this is a vast underestimate, very likely. So this says 20 copies, but that's because hybridization intensity has an upper bound in terms of how much resolution you can actually get.
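The pooled-reference idea above can be sketched as follows. This is a simplified illustration, assuming probe intensities have already been normalized and are stored in genomic order; real array pipelines apply further corrections. The function names `pooled_reference` and `log2_ratio`, and the per-probe median as the pooling rule, are assumptions made for this sketch.

```python
import math
from statistics import median


def pooled_reference(normal_intensities):
    """Per-probe median intensity across a set of normal samples."""
    return [median(probe_values) for probe_values in zip(*normal_intensities)]


def log2_ratio(tumor_intensities, reference):
    """Relative copy number signal: log2(tumor / reference) per probe.

    A value of 0 means no difference from the reference.
    """
    return [math.log2(t / r) for t, r in zip(tumor_intensities, reference)]


# Toy data: 3 normal samples and 1 tumor, 4 probes each.
normals = [
    [100.0, 110.0, 90.0, 100.0],
    [90.0, 100.0, 100.0, 105.0],
    [110.0, 90.0, 110.0, 95.0],
]
tumor = [100.0, 400.0, 50.0, 100.0]

reference = pooled_reference(normals)
ratios = log2_ratio(tumor, reference)
```

In this toy example the second probe has four times the reference intensity and the third has half. Because hybridization intensity saturates, a ratio like this gives a lower bound on the true number of copies at highly amplified loci.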
And in many cases, you'll see, for example, 100 or 200 copies of the ERBB2 locus. Yes? So the copy number changes would be tissue specific? Correct. Only in the cancer? Right. These are only in the breast epithelium that's malignant. So when you're using a pool of normals, you would want to use the same kind of tissue as the tissue you're looking at? Oh, no, not necessarily the same. I mean, okay, so you do have somatic changes in different parts of the anatomy, certainly, but it's not like gene expression. Most of it is quite conserved. You do have, for example, in the brain, tandem duplications of certain regions, and it's thought that that allows the brain to acquire certain plasticity. Are there any neuro people here in the room? Nervous system folks, no? There was a very nice paper maybe two years ago that really showed that there can be genomic copy number changes in brain tissues and in different tissues, and it's thought that that gives rise to differences in plasticity. There are also, of course, the classic example of T cells and B cells, which rearrange their genomes to allow for immune surveillance and response to antigens. So that can happen. We make this assumption that the genome is stable in all of our different tissues. Very likely that's a false assumption, but it is the assumption that's generally made, yeah.
Okay, any more questions about this? This is fundamental. This is what we're after. We're trying to take a tumor and produce something like this. Yeah. Sorry, when you said relative copy number, for those DNA-dense regions, is it the same number of probes? Right, so the probes are not actually uniformly spaced across the genome, as much as they tried to do that in the design for the Affy SNP6. But we tend to make the assumption that they are.
Some of the statistical models can adjust for that. Some of the tools you'll use today in the lab adjust for the variation in distance between adjacent probes. But on average, it's about 1.5 kb. Okay, yeah. For copy number, you're looking at the intensity of your probes, and then you get a relative value from there, and then you compare it to a normal, or how are you doing it? Yeah, that's right. We'll get into that.
Okay, so this is now fluorescence in situ hybridization. This is a FISH probe, a clinical FISH probe, and this is an actual tumor that I've worked with. Each one of these blue blobs is essentially a nucleus. You can see that there's a control probe in green, and the HER2 probe is in red. In each cell, you'd expect to see two green dots and two red dots if it were diploid. And you can see here that there are just hundreds of copies of HER2 in these particular cells. Yeah. I was curious, what control did you use? I think it's the centromere in this case. I can't remember. This is actually the clinical assay, so I didn't even look into what the control is, because I didn't have to design it. Okay. All right, so you can see that for clinical purposes, this is unambiguous. You just look at it, and it's easy to decide whether something is amplified or not.
Okay, so then you can imagine what the impact of that is on expression. We have hundreds of copies, and what we did here, in a large series, the METABRIC series that I'll talk about in a little detail, is look at the correlation of copy number with gene expression. Here are a few examples of genes that are dramatically, really quite stunningly, affected by copy number change. Copy number is on the x-axis and gene expression is on the y-axis. And the dots are colored according to their predicted copy number states.
So the greens are deletions, the blues are neutral, and the reds are increasing copy number. You can see that for the neutrals, you have this spread across the vertical that really suggests there's no correlation at all. The correlation doesn't really kick in until we start to get into the red points, where there's a very measurable response of expression to copy number. And you can see this in each case. So GRB7 is this gene here. This is just adjacent to ERBB2 on chromosome 17, so it often comes along for the ride. People studied GRB7 for a long time as a potential oncogene, but there is some debate as to whether it's functional in breast cancer or not. It is definitely co-amplified almost always with ERBB2; you never see GRB7 amplified on its own. So that tells you that maybe it's just a passenger. Here's the 11q13 locus. This is another of these focal high-level amplifications in breast cancer. This locus is associated with ER-positive breast cancers, typically not HER2-positive breast cancers; it's mutually exclusive with HER2. And you can see here that there are a couple of genes in this locus that have almost exactly the same pattern as ERBB2 and GRB7. Once you get into the amplification range of the x-axis, you see a really dramatic response in expression. So we can use this as a potential guide to tell us which genes are likely to be impacting the behavior of the cell.
Any questions on this? Yeah. I'm sorry, is that mutually exclusive with HER2 or not exclusive? Mutually exclusive with HER2, yeah. And sorry, I should mention that this is just a gene expression array. It's the Illumina BeadArrays. So the correlation is calculated for all the points? There's no correlation value calculated here. I'm just showing the raw data, just the data points. But there's a Spearman. Oh, I see, you're right. I'll take that back.
Yeah, so it's over everything. It would be better, though, to calculate that as a mixture, where we calculate each state independently. And it'd be a much stronger correlation in the red points than, for example, the blue points. Is that the practice, or is it done over all the points? Well, so it would be better to do it that way. Sorry, I realize that that is there, but I think it's just there out of convenience more than anything. I wouldn't interpret that correlation, because I think it doesn't make sense to treat the copy-neutral cases the same as the altered ones. Yeah, I would separate. If you have a population of 1,000 cases, then I would separate them out for sure. Yeah, yes. So the amplification is measured with microarray, right? So how do you classify something as being amplified? We'll get to that. That's coming. Good questions. OK. So OK, we talked a lot about amplification. So this is what a homozygous deletion looks like. So this is an actual tumor. This is an ovarian cancer. And actually, this is maybe a slightly more representative slide of all the types of things that you might expect to see. And I literally just plucked this out of a study that we're doing right now. And so this is real-time teaching here. And this is actually copy number estimated from whole genome sequencing. So what I showed you before was Affy SNP6. And this is whole genome sequencing. And what you can see is that it looks quite similar, right? That's quite nice. So we can use a lot of the same conceptual algorithmic tools to estimate copy number from whole genome sequencing as we can for Affy SNP6. And there's a lot of statistical machinery that was developed for Affy SNP6 over the course of 10 years. And we tend to borrow some of that when looking at whole genome. So what's shown here, you have probably been over read depth as a concept in terms of sequencing data. And so this is read depth. Consider this normalized read depth.
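The point above about computing the correlation within each copy number state, rather than over all points pooled together, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the analysis used in the study; the function names, the state labels and the minimum group size of three are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_by_state(copy_number, expression, state, min_points=3):
    """Compute one Spearman correlation per predicted copy number state."""
    groups = defaultdict(lambda: ([], []))
    for c, e, s in zip(copy_number, expression, state):
        groups[s][0].append(c)
        groups[s][1].append(e)
    return {
        s: spearman(xs, ys)
        for s, (xs, ys) in groups.items()
        if len(xs) >= min_points
    }
```

Called with one copy number value, one expression value and one state label per tumor, `spearman_by_state` returns a dictionary keyed by state, so the amplified cases can be read separately from the copy-neutral ones.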
In this case, whenever doing whole genome sequencing, I would always advocate that you have a normal, a matched normal. Never sequence a tumor without a matched normal. Interpreting single nucleotide variants without a matched normal is very, very difficult. And so here we show copy number relative to the matched normal. And so here you have a red. So of course when we get the data out and we normalize the read depth according to various factors, which I'll talk about later, none of these points would have a color associated with them. They would all be black. And then we actually process it with a hidden Markov model and try to classify each segment according to whether it's amplified in red, neutral in blue, or deleted in green. And here you have a very, very clear homozygous deletion. And so again, you would rarely expect to see a homozygous deletion that's much bigger than a locus of about that size, because both copies are gone. And so if you remove both copies of a gene in a cell, that very likely is going to have a deleterious effect on the cell if it's an important gene in that cell's function. But occasionally, selection will operate to select for cells that have deletions of very tight loci. Again, usually they're tumor suppressors or growth regulators of some kind. And so this is probably a locus that contains maybe three or four genes at the most, very likely just one. OK, and here's a little amplification here, just for good measure. Yeah, so that's the bottom plot there. So I'm just about to get to that. Yes. Is this an evenly spaced window, or is it a variable window size with the same number of reads in the normal? So these are 1 kb windows. And you have to treat that with some caution, because the GC content of that 1 kb window actually has a dramatic effect in library construction for next-gen sequencing. There tends to be a bias at high GC content.
And so I'll show at the end of the lecture, actually, when we get into the technical part of copy number analysis, that we really have to normalize for GC content. And the other concept is this concept of mappability. I don't know if we've talked about mappability. Yeah, no? Yes? No? No? Yeah. OK, some people weren't listening, maybe. So mappability is essentially the ability to uniquely map reads to a particular part of the genome. And so that actually has an impact on the readout, as you can imagine. So if you have highly mappable regions, you'll get lots of reads there. If you have regions of low mappability, there will actually be lower read counts there. So that needs to be adjusted for as well. So this is a highly post-processed read count, if you will, in 1 kb windows. This is whole genome. Can you do this with exome? Yeah, so yes. But the whole genome is better. Whole genome is best. Affy SNP6 is second best. Exome is third best. What about array CGH? Array CGH is too old school. Just don't do it. No, that's not true. So in some cases, especially in paraffin-embedded tissues, where you have highly fragmented DNA, one has to then resort to lower-resolution technology. And sometimes that's array CGH. And then Affymetrix also has a platform that's designed for tissues that are fixed in formalin. And that's going to work better than their Affy SNP6 platform. Essentially, you have to have frozen material for Affy SNP6. The other platform, I think, is called OncoScan. Yes? So can the read count actually be used reliably, given that there's varying coverage throughout the genome? Yeah, so that's what this is. So it looks pretty good. And there's a homozygous deletion that you could not miss. So that's a pretty tight locus. So what are the typical sensitivities and specificities of that? Yeah, so we've done a lot of work in that area.
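The two adjustments described above, for GC content and for mappability, can be sketched for fixed-size windows as follows. This is a simplified illustration and not the method of any particular tool: the median-per-GC-bin correction, the mappability cutoff of 0.9, the bin count and all function names are assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fraction, mappability=None,
               min_mappability=0.9, n_bins=20):
    """Scale each window's read count by the median count of windows with
    similar GC content. Windows below the mappability cutoff are excluded
    from the medians and returned as None."""
    if mappability is None:
        mappability = [1.0] * len(counts)
    bins = []
    by_bin = defaultdict(list)
    for c, g, m in zip(counts, gc_fraction, mappability):
        b = min(int(g * n_bins), n_bins - 1)  # GC bin index for this window
        bins.append(b)
        if m >= min_mappability:
            by_bin[b].append(c)
    med = {b: median(v) for b, v in by_bin.items()}
    corrected = []
    for c, b, m in zip(counts, bins, mappability):
        if m < min_mappability or med.get(b, 0) <= 0:
            corrected.append(None)
        else:
            corrected.append(c / med[b])
    return corrected


def log2_ratio(tumor, normal):
    """Per-window log2 ratio of corrected tumor depth to matched normal depth.
    Windows that are missing or zero in either sample are returned as None."""
    return [
        math.log2(t / n) if t and n else None
        for t, n in zip(tumor, normal)
    ]
```

In this sketch the tumor and the matched normal would each be passed through `gc_correct` with the same windows, and `log2_ratio` then gives the relative copy number signal that a segmentation step, such as a hidden Markov model, would take as input.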
I'll reference a couple of papers that show head-to-head comparisons. So the problem is you often have to use an inferior technology as ground truth and then measure the sensitivity and specificity of a new technology against that. And that's not always ideal. But there are a lot of groups who have done that. And there are a few papers that have compared the ability to detect copy number changes from whole genome sequencing and, for example, SNP6, from the same DNA extraction. So you start with the same DNA extraction and do a head-to-head analysis. There are a lot of variables at work there, though, because the algorithms for SNP6, for example, are designed for SNP6. And then often you have algorithms that really don't work out of the box on whole genome data. And so you have at least two variables. You have the platform and the algorithm. And that starts to get really complicated in terms of deconvoluting what's actually constituting the change that you might see. But by and large, I've done enough of this now to say that unambiguously, whole genome sequencing, especially with the new PCR-free tagmentation libraries, is the best way to look at copy number changes. It's also, by far, the most expensive way. So that often comes into the decision-making process. So why, in theory, would it be better than an exome when you're looking at regions which are actually... Yeah, so, well, the exome, if you use the 50 megabase platform, does not comprehensively cover the transcribed regions of the genome, first of all. And second is that you have introduced into the... So there is this variation. We talked about library construction and GC content and mappability. With exomes, you introduce a third source of variation, which is the exome design and the uniformity of hybridization across the exome. So the hybridization component is variable across the exome.
Even in a diploid genome, you'll see really massive changes in hybridization. And so that impacts the ability to resolve copy number changes, obviously. And in SNP6 data, there's much more uniformity of coverage across the genome. And you have 1.8 million data points to work with. Whereas with the exome, if you're binning by 1 kb bins, for example, then the resolution starts to get pretty small. You get about 50,000. Is it possible to do this for a targeted panel? Yeah, that's a good question. I mean, is it possible to do it for a targeted panel? I think you really do need probes or regions that are on the borders to be able to calibrate exactly what's going on. It's not ideal. And I think people have come up with different ad hoc solutions for that and are reporting reliable results. The problem there is that, let's say you see a very low signal in a particular locus. If it's a PCR-based panel, you don't really know if it's because PCR just didn't work in that assay or whether there's actually a deletion there, without having the flanking parts of the genome to be able to look at that. So it's a tricky question, I think. But a lot of people will tell you, especially the people that are selling panels, that yes, you can find copy number variations using our panel. And often, it's probably true. But there are some major caveats with that. Would you consider people who do a commercial hotspot panel, or something like that? But still with all those caveats? Yeah, definitely. Is there still a normal sample with the panel? And do you normalize all the signals using this normal sample? So yeah, OK. So it depends what the panel is designed to do. So if you're looking for, for example, mutational hotspots, which we'll talk about this afternoon, if you have a KRAS codon 12 mutation, you don't need a normal to say that that's going to be important. In fact, often you're working with paraffin-embedded tissues, and you don't have a normal anyway.
If it's in a clinical context, often you just don't have blood to work with in a clinical assay. So with a panel, a clinical panel, you really want to be able to minimally affect the normal workflow of looking at a tumor specimen. And so in pathology, there's a formalin-fixed biopsy, and that should be the material that one works with. In the vast majority of cases, there will not be a blood draw available to do that type of work. So for panels, especially hotspot mutations, BRAF V600, KRAS codon 12, PI3-kinase 1047, et cetera, it's sufficient to have the tumor, because that mutation is interpretable. If that mutation were in the germline, that person wouldn't be a person, I don't think. OK, OK, so now the bottom plot. So the other important aspect of this is, of course, we do have these two copies of our genome. And this is a nice example, because it shows a couple of patterns. So in what's plotted on the y-axis here, each one of these dots is a SNP. It's a heterozygous SNP that's called in the normal sample. So the workflow here is you first take the normal sample, and you call all of the heterozygous polymorphisms in the normal sample. And then we can actually look at what happens in the tumor. So what's plotted here, at those same loci, is the allele ratio of those two alleles, because we know that they're heterozygous, and we collapse that down to whether reads match the reference or not. And we count that in the tumor. So we count what proportion of reads are matching the reference. And for regions that have maintained heterozygosity in the tumor, that should center somewhere around 0.5. So half the reads should show the variant here. And so that's this region here. And you can see this is a diploid region. And the allele ratio here is centered around 0.5. There's some noise in the system, but it's essentially that. Then let's look at this green region here. This is a deletion.
And you can see that that has shifted the pattern away from heterozygosity. So now we have either one allele or the other. And you see this pattern. This is often called the B allele frequency or the allele ratio. It's called the B allele frequency in SNP6, because SNP6 is designed to probe the major and minor allele. And those are often called A and B. And so this would be called the minor allele frequency in SNP6 data. And so this is really quite important, because you can imagine if you had a mutation in one of the alleles, and then this deletion is actually removing the wild-type allele. Then that mutation becomes homozygous, and the only thing that's left in that cell is the mutant allele. And so that's often a very, very good clue, especially if you have truncating stop codon mutations or you have frameshifting insertions and deletions. That is a very good clue that that's a tumor suppressor gene. Almost all the tumor suppressors have this type of pattern, especially p53, where you have a loss-of-function mutation. We'll get into that this afternoon. That then is followed by or accompanied by a loss of the wild-type allele. OK, so that's what this pattern looks like. So if you delete one copy, then you get something like this. Then what's shown here is what's happening in this region here. Can anybody take a guess? Yeah, so what's the sequence of events here? No, I wouldn't say that. So you had a loss and then you had a mutation. Right, right. OK, so you probably had this whole chromosome arm loss. And then it was followed by a duplication of this part of the chromosome here. And so what's quite interesting is that you see that that results in an even further split. And that's because essentially there are more reads here, because there's more material there, so you get a better signal. And so we call that copy-neutral loss of heterozygosity. Or a more technical term is uniparental disomy.
There are a number of different terms for this. But copy-neutral loss of heterozygosity is a reasonable one. So what's important about this is that you can see that here you have a diploid region. It's blue. It's right on the zero line. But it's maintained its heterozygosity. Whereas here you have a diploid region and it's homozygous. So these have very different consequences if you had a mutation in those two different regions. OK. Can you use this type of thing to calculate tumor purity? That's a great question. Yes, you can indeed. So all right. So I wasn't going to get into that until later, but I might as well talk about it now. So look at the degree of spread here. If it's actually homozygous, and it's homozygous in all tumor cells, then why do we still see data points down here? It's often because of normal contamination or infiltration. This can be due to stroma or infiltrating lymphocytes that are, of course, non-malignant cells. Those cells don't have these changes. But they're admixed with the population of cells from which we do the DNA extraction. So you do the DNA extraction. It's a soup of DNA that is contributed from many different sources of cells. And we'll really get into the details of that this afternoon. But by way of introduction, this is why we don't see data points right at the extremes. And the degree of shift away from heterozygosity is often a clue to how many cells harbor that particular abnormality. It's proportional to the number of cells that harbor that abnormality. So if you have contaminating normal cells, it will actually dampen that signal in predictable ways. And so one can actually infer the proportion of normal cells in the sample. So it's often highly beneficial to have a pathologist review the material prior to sequencing. Because often, you can have materials that are so contaminated that you may not see any signal at all.
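The relationship described above, where contaminating normal cells pull the allele ratio back toward 0.5 in a predictable way, can be written down for a single region. This is a minimal sketch under stated assumptions: the normal cells are diploid and heterozygous (AB) at the SNP, the tumor alteration is present in every tumor cell, and there is no allele-specific mapping bias. The function names and parameters are invented for the example.

```python
def expected_b_fraction(purity, tumor_b_copies, tumor_total_copies):
    """Expected fraction of reads carrying the B allele at a SNP that is
    heterozygous (AB) in the normal, for a mixture of tumor and normal cells.

    purity: fraction of cells in the sample that are tumor cells (0..1).
    tumor_b_copies / tumor_total_copies: tumor genotype at this locus.
    """
    b_reads = purity * tumor_b_copies + (1 - purity) * 1
    all_reads = purity * tumor_total_copies + (1 - purity) * 2
    return b_reads / all_reads


def purity_from_b_fraction(b_fraction, tumor_b_copies, tumor_total_copies):
    """Invert expected_b_fraction: solve for purity given the observed
    B-allele fraction and an assumed tumor genotype in that region."""
    denom = (b_fraction * tumor_total_copies
             - 2 * b_fraction - tumor_b_copies + 1)
    if denom == 0:
        raise ValueError("this genotype carries no purity information")
    return (1 - 2 * b_fraction) / denom
```

For copy-neutral loss of heterozygosity (tumor genotype BB) this reduces to purity = 2b - 1, and for a single-copy deletion (tumor genotype B) to purity = 2 - 1/b, where b is the center of the upper band of the allele ratio. The lower band is the mirror image at 1 - b.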
And that's happened. Even with a pathologist review, it can still happen, because sometimes you're not always looking at exactly the same material. But these are expensive assays. And so you want to make sure that when you do a DNA extraction, you're actually sequencing tumor cells and not just normal cells. So in the whole genome, you're seeing two parts. One is homozygous deletion. One is homozygous duplication. But in the homozygous duplication, if I don't see anything under the homozygous deletion, the pattern looks like duplication. If you look at the light green one I'm seeing there. This is the same data. This is the same assay, both from whole genome sequencing. There's no SNP array here. Both are derived from whole genome sequencing. Except we use the SNPs in the bottom here to infer something about whether the locus is heterozygous or homozygous. So why are we not seeing the homozygous duplication in the blue part of the region? So, are you talking about the difference between the top and the bottom? There is one dot at the right in the part that we are seeing. There are two blue parts, the one on the right. That's one homozygous duplication point. This one? Yes. Do you see the homozygous duplication? So homozygous duplication. So the way I would think about this is that first there's a deletion of this entire region from here all the way to the telomere. And there's a second event that results in a duplication from this material to this material here. And there is one region there which is red. Is this what you're talking about? So we don't know when that happened. We can start to try to trace it with various different techniques. But that could very well have happened as the third event. But I'm asking why I'm not seeing it in the data from the SNPs. Ah, okay.
Well, if you look very, very carefully, in fact, if you just line it up, you can see that there's an even greater shift towards the top and bottom that corresponds with that red amplification. The one which is shown with green? Yeah, that's right. So it's green because the algorithm has predicted that it's likely to be homozygous, but it's amplified homozygous. Whereas the blue is copy-neutral homozygous. Yeah. And one more thing, you're saying that this allele ratio is the same as the B allele frequency. It's analogous to that, yeah. It's analogous, but it's not the same. So this here is the ratio of minor to major, but the other one is the minor. So in sequencing data, we typically look at SNPs in terms of reference versus alternate, because we align all the reads to the reference. So in sequencing data, it's typically relative to the reference. Whereas in the SNP design, major and minor are classed as population-based alleles. So the reason that those probes were chosen in the first place is because they have some sort of frequency in the human population, and that's major and minor. So A and B are the major and minor. Whereas in sequencing data, it's reference and alternate. Okay. So we're still okay, Michelle. Okay, that's all right. Yeah, we have until 10:30, it's all good. All right, so let's just look into what genes are known to be affected by these alterations. We have the classical genes that are associated with high-level amplifications. ERBB2 in breast cancer, EGFR in lung cancers and also brain cancers. MYC is associated with many different malignancies, including lymphomas, PI3-kinase in breast cancer, and the list goes on, CDK4, CDK6 as well. And then deletions are a collection of the well-known tumor suppressor genes. So there's the retinoblastoma protein, CDKN2A/B, MAP2K4, NF1, et cetera, et cetera. So these are genes that might be typically associated with homozygous deletion.
And there's actually a really incredible amount of literature that's been generated in the last five years on high-resolution views of large collections of tumors. And this is just a small sampling of those papers. But if you really want to educate yourself in terms of somatic copy number changes in cancer, read at least these papers, and that will give you a good overview of the landscape. Okay, so this is work that I'll present now that describes the genomic and transcriptomic architecture of 2,000 breast cancers. This is work I did in collaboration with Sam Aparicio at the BC Cancer Agency and also Carlos Caldas in Cambridge, where we collected this set of 2,000 breast tumors. And to date, it's still actually the largest exploration of copy number landscapes in the literature. It came out a couple of years ago. So the TCGA study set was about 600 cases. And so this came out before the TCGA and still is the largest collection. So I showed you a while ago, at the very beginning, the frequency of alterations in the patient population. And you can think back to that and remember that almost the whole genome was affected in some way. And then we talked about how copy number changes affect the gene expression measurements from the same tumor. And so when we overlay gene expression onto that landscape, we actually get a much sharper view of where the hotspots or the important regions of the genome are. And so what's shown here is where you have high-level amplifications or homozygous deletions, which I've already introduced, that have essentially a concomitant change in expression. And so we identified a number of loci across the genome. For example, here you have cyclin D1. And this is known, but right next to it, very close by, is actually a separate and distinct amplicon on the 11q13 locus. And I'll get back to why that was important to identify. And the genes in that locus are PAK1, RSF1, and S4.
And so we see a number of these regions, and a lot of these loci were already known. So here's ERBB2. This is the boss locus. You can see that this is the locus in the genome that has the most frequent high-level amplifications in association with expression. But we also see, for example, cyclin E1 and these other loci spread throughout, AIM1, for example. And these were new regions to study in the breast cancer landscape. So by overlaying expression, one gets an interpretive advantage, because if we were to look at that plot that I showed at the very beginning, we'd have to study the whole genome to really tease out what's likely to be driving these cancers. But when you start to overlay expression, it gets reduced dramatically into a manageable number of regions. And we now have ongoing research that stemmed from this where we're trying to mechanistically identify many of these targets as pathogenic drivers. And in fact, we've already published a paper on ZNF703 as a result of this work, and that's now confirmed as a driver gene in breast cancer. So the other important result from this paper is that we used copy number profiles to stratify the population. So at the time when we started this work, breast cancer essentially was divided into five different gene expression-based subtypes. And really in practice, we still have three clinical subtypes of breast cancer. And so the gene expression work from the late 90s and early 2000s has yielded a little bit more resolution in terms of five different subtypes. But we wanted to analyze a large series in great detail because even though these nice classifiers existed, there's massive heterogeneity in response to treatment within each of these classes. And so we tried to explain this heterogeneity with a much higher resolution classification.
So this is quite a complicated slide, but remember the population-level plots that I showed you at the beginning. Well, this is now a representation of that population, but broken into ten different subgroups, which we actually found with an unsupervised clustering approach. So we took the copy number profiles and gene expression profiles and then clustered the data according to those markers. And we reproducibly found that the population could be split into about ten different groups, according to both the copy number profiles and expression profiles. And so here is the discovery set of the first thousand cases, and we validated that in a secondary set of another thousand cases using the same approach. And so you can see that there are really these reproducible patterns. So the patterns of note that I want to point out are, so here's the ERBB2 locus. And what's shown in black here is the subtype specificity of that particular region. So when you see a high black line, that suggests that that locus is only affected in that subgroup. And so this group is essentially characterized almost uniquely by amplification of ERBB2. And then we see another group, this group 2 down here, which is characterized by an amplification on 11q13. And I'll talk about the significance of that in a minute. And so each of these groups has a distinct copy number profile. And so one can ask, well, okay, so what? Who cares? And so the really important part of this study is that we were very careful to collect tumors that had at least ten years of clinical follow-up in different registries. So we actually had the outcome data from all 2,000 cases. And to get to 2,000, we actually had to look at tumors from five different centers in the UK and in Canada. And so we were able to project what the outcome distributions of these different groups were. And a lot of these tumors, interestingly enough, were collected in the pre-Herceptin era.
So these are quite old samples that we were able to collect. So it's quite rare to have these frozen samples, where you can do Affymetrix SNP6 and gene expression data, together with clinical outcomes. And so is everyone familiar with Kaplan-Meier plots? Anybody not familiar? Okay, we're okay with that? We're gonna look at this Friday? Okay, well, essentially what it shows is, over time, what the proportion of surviving patients is in each group. And you can see here that this pink group here has a pretty decent outcome spectrum. So after five years, about 80% of the population is still alive. Whereas in this group here, these are the really bad actors, and this is actually the HER2-amplified group. So remember, it's pre-Herceptin. Only 40% of cases were alive five years after diagnosis. So if we were to do this today, that brown curve would very likely be closer to the top, and it'd be more like in the 75% range. Yeah? That would be the triple negatives. So okay, so the triple negatives are, I think it's group 10. There are purple ones here. Yeah, these are the basals here. So of note is this group 2 here. I just wanted to point this out. So this group 2 is actually composed entirely of ER-positive cancers. And in breast cancer, the estrogen receptor positive cases tend to respond very well to hormone-based therapy. And so the outcomes for these patients are typically very, very good. And so this pink group is also predominantly ER-positive. But you can see here that there's a subset of patients. It's only 5% of the population. It has very, very poor outcome, almost as bad after five years as the HER2 treatment-naive group. And so this is actually quite important. If we go back to this group here, this is the group down here. We would not have found this without, first of all, high-resolution technology, and second of all, a large series of cases, because it's only 5% of the population.
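The Kaplan-Meier curve described above, the proportion of patients still surviving over time, can be computed with the standard product-limit estimator. This is a minimal sketch with invented function and parameter names, not the code used for the plots shown. It assumes one follow-up time per patient and a flag saying whether death was observed or the patient was censored (still alive at last follow-up).

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival curve.

    times: follow-up time for each patient.
    events: True if death was observed at that time, False if censored.
    Returns a list of (time, surviving_fraction) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surviving = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            surviving *= 1 - deaths / at_risk
            curve.append((t, surviving))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve
```

Running this once per subgroup gives one step curve per group. Censored patients never cause a drop in the curve, but they shrink the number at risk, which is why later steps are larger.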
But this represents, I think, a real opportunity now to develop a therapeutic against that 11q13 amplicon because it is a high-level amplification. It's very much like the HER2 locus. And the genes in there are probably really good targets to start developing inhibitors against because they have overexpression. And hopefully, in 10 years' time, we can look at this curve and maybe push it up further towards where these pink guys are. So this is where, I think, high-resolution examination of genomes and transcriptomes together really has great power in identifying particular phenotypes in a large series of cases. And so the n here is really quite important. So 2,000 cases, not easy to come by. It's the largest collection in the world, but it illustrates the power of doing this work on a large scale. So the major conclusion here is essentially that these recurrent copy number profiles can be used to stratify patients and identify novel molecular subgroups. And you can see this coming out in the TCGA, and other large-scale studies like the ICGC will undoubtedly reveal this as well. The subgroups are clinically meaningful in METABRIC, since they co-segregated with these prognostic profiles. And then just to illustrate that, typically driver alterations will be focal and high amplitude. Okay. So maybe this is a good time for a pause. Any other questions? Yeah, that's a good question. So one way to do it is, you can imagine that tumor suppressor genes will be affected in a number of different ways. A tumor suppressor gene is one that will often be lost. The protein will be lost. And that's what leads to malignancy. So you can imagine that if you were to change the genome, you can lose a protein in numerous different ways. You can have mutations, stop codon mutations, frameshifting insertions and deletions, that result in either degradation of the protein or the protein never getting made because the transcript is sent to nonsense-mediated decay.
And then those can also be accompanied by homozygous deletions and/or deletions of one copy that render that mutation homozygous. So that's an opportunity right there. So you can imagine that there's a profile of a tumor suppressor that has copy number changes of a certain kind, focal homozygous deletions, loss of heterozygosity accompanied by loss-of-function mutations, and then mutations spread throughout the gene, because it doesn't matter where you have a stop codon mutation just as long as there is one before the end of the protein. So we'll talk a little bit about that pattern in the afternoon. So that's one way. And then the other is that you can imagine that, again, you can reduce the whole copy number landscape into these focal changes. So it's very difficult to interpret a chromosome arm level gain of one copy. That's really hard because you've got thousands of genes in there. But if you see a high-level amplification of just two or three genes that's in the 20, 30, 40 copy number range, that's a really, really good indication that that's going to be important. So those genes could be put into a gene set that is also accompanied by mutation lists. And I think you're going to really explore that on Thursday, I believe, or tomorrow, where you have gene sets to work with, and then you try to understand, is there some sort of biological pattern that's associated with those genes? So it's a pathway analysis. It's a really nice way to integrate multiple molecular views of a particular data set. Okay. Yeah. So you were saying before that we would see a pattern of single nucleotide variation, or we see a copy number alteration. Can you just see both? Yes. So what I was referring to is, you will often see both events in a tumor, but the tumor type will typically associate with either large-scale copy number changes or lots of point mutation-driven events. But you rarely see a tumor type in aggregate,
so when you're actually looking at populations, that will associate with both. Because it would probably be deleterious to the cell to have two types of DNA repair mechanisms that are aberrant. So is the DNA repair mechanism that's compromised in copy number alterations involved with homologous recombination? Often that's the case, yeah. Are there different players for focal alterations versus large-scale alterations? Oh, that's a good question. Yeah, so certainly the BRCA cases have this genomic disruption associated with massive amounts of rearrangement, large-scale copy number changes. The focal changes can arise through many different ways. And one of the ways is through breakage-fusion-bridge cycles. And we have this really cyclical pattern of telomeric loss that leads to different ways in which the same locus can then get duplicated and stitched together. And so it creates this kind of very complex structure that manifests itself on these readouts when you have many, many copies of a locus. And often these focal amplifications are driven through these breakage-fusion-bridge cycles. It's not known exactly what the real mechanism for that is. Why is that allowed to occur? It may be that lack of double-strand break repair leads to that stochastically, in a way. And I'm not sure how to really attribute that to the mechanism. So what's interesting to think about is that the new class of drugs called PARP inhibitors, for example, that target cells with homologous recombination deficiency actually operate to inhibit the other DNA repair pathway. So a different, complementary DNA repair pathway that operates on single-strand breaks. So that's like a synthetic lethal type of approach, where the cancer cells have this one vulnerability and then the drug creates a new vulnerability and then the cells can't cope.
So if you think about it from that perspective, evolution would not select for both things to operate, because the cells would have no capacity to keep any kind of genetic integrity after that. Yeah. Okay. Okay, so let's move on. We should push this here. Okay, let's do a time check. All right.
So I just want to spend a bit of time on genotypes and what that means. So this is just a very basic table. For example, the most basic genotype that we should always expect is AB. And AB means that there's one allele that's maternal and one allele that's paternal. And again, that can have different contexts depending on the platform. So on a SNP6 array that will be major/minor, and in the sequencing data that will be ref/non-ref. And you can imagine that has a zygosity status, if you will, of heterozygous. But then you can have two copies that are BB, and that would be LOH. And then that pattern continues as you have more and more copy number states. So here's a copy number state where you have three copies. That yields four different possible genotypes, because you can have three copies of just the A allele, you can have two and one, and one and two, and then three of the B allele. And so this is an important concept, because the pattern of spread of the B allele actually corresponds to these different discrete genotypes. And a lot of the algorithms take this into account when segmenting the data. So this is an important concept to carry forward as we look at the data. So this is just an example. I've gone over this already, so I won't spend too much time on this. But this work was one of the first examples of how to process sequencing data to infer loss of heterozygosity. So here's a deletion, again, that results in the spread of the B alleles. This region here is an amplification. This also has a spread of the B alleles. It's not tightly centered around 0.5.
But the inference here is that actually both alleles are still present, but it's just been skewed away because the amplification is probably just amplifying one allele. So if we go back to this, we'll probably have a situation like this where you have AAAAAB. And so that's a very different inference than, for example, here, where only one allele is present. So to be able to distinguish these is very important, and to do that, you can think about simultaneous analysis of these B alleles and the copy number. And that's what most of the approaches that are out there now actually do. And so, actually, Andrew, are you going to use OncoSNP in the lab? Yeah, OncoSNP. So OncoSNP is an algorithm that simultaneously infers the copy number and the loss of heterozygosity pattern. So the output is actually not just copy number here, but it's actually the genotype at each locus. Okay? Any questions on that? Okay. Yeah, so this is also from sequencing data, but it's analogous in SNP6 data. You see the same picture.
Okay, so studying alleles is actually quite important because of the notion of haploinsufficiency. So often a phenotype can be induced with just a single-copy loss of a particular gene. And that's borne out in, for example, p53 levels. Whereas in other cases, it's required to have two hits at a particular locus to induce the phenotype. So the classic example is the RB locus, the retinoblastoma locus, which was discovered by studying the rare incidence of retinoblastoma in certain families. And so there was a strong suggestion that there's a genetic component to this. But the mechanism wasn't revealed until looking at the somatic genetics, whereby there's a mutation in one copy that's in the germline and that's inherited. And then some cells somewhere along in the development of these children acquired a secondary loss of the wild-type allele.
And so it wasn't until we get that secondary loss that the malignant phenotype is induced. And so that's called the classic Knudson two-hit hypothesis. And there's a couple of classic papers you can read on that. And then there's this notion of quasi-insufficiency, where just small changes in the number of copies start to induce a phenotype. And it gets even more complicated in that in some cases mutations, for example, actually need to be heterozygous to operate. So it becomes a cooperation between the mutant allele and the wild-type allele in order for that to be borne out. And an example of this is the EZH2 Y641 mutation in lymphomas, where the mutant allele operates by cooperating with the wild-type allele. This was actually a mutation that was discovered in our center a few years ago.
Okay, so let's look at measurement technologies. It's going to be mostly conceptual here because you're actually going to look at this in the lab. But we talked a little bit about resolution. So the lowest resolution technology for looking at copy number changes is of course fluorescence in situ hybridization. I guess one could actually say that those spectral karyotypes are really the lowest resolution of them all. But then we can start to look at around 100 kb. And there are FISH probes. Actually, you can do multicolor FISH with 16 colors now fairly reliably. So you can look at 16 loci across the genome in the same assay. And then in the early 2000s, we started to see the emergence of taking this concept of looking at multiple probes and actually being able to scale that, starting with 10,000 and then 30,000 probes across the genome, using array comparative genomic hybridization. Then we saw the emergence of genotyping arrays, really actually designed for studying variation in the normal human population. But it became apparent that they were going to be very powerful technologies to study cancers as well.
And so they were widely adopted by the cancer community. And in fact, I think Affymetrix has probably benefited from projects like the TCGA much more so than from the original design purpose, which was for the 1000 Genomes Project and other HapMap-type projects. And then finally we get to 3G resolution, as I like to call it, which is whole genome sequencing at nucleotide resolution. So this is the kind of history, I guess, that spans several decades, and really we picked up here about 10 years ago. So all this stuff is fairly new. We really haven't had the capacity to measure copy number changes at high resolution across the genome for very long at all. And so it's quite exciting because it opens up a whole new opportunity for research.
So let's just look at how genotyping arrays work. And this is very schematic, okay? So this is not from Affymetrix itself. This is just a schematic to illustrate the point. So the first thing you do is design probes, and these are typically designed so that there's unambiguous sequence in the genome. So you have specificity of hybridization. We've minimized cross-hybridization or off-target hybridization to that region. And with BAC arrays, they're typically 100 kb probes. So the specificity is usually pretty good. The Affymetrix SNP6 arrays are 25-mers, but they did a lot of engineering work to make sure that those 25-mers were highly specific, although it's not perfect. But they managed to do this with 25-mers. So here's the array. And then you put this on a glass slide or any kind of slide. And we know, in the array coordinates, where on the genome those probes come from, okay? So then we can plot the readout according to the chromosomal position. And again, these dots are proportional to the hybridization intensity. And that's read off by digital photography and then image processing to get something like this. So we go from hybridization intensities through image processing to a readout that looks something like this.
And if you zoom in on this, you can see that just these probes here have a lower intensity. And so then we try to infer that this is probably a segmental deletion. So these are arrays. So this is just DNA extraction. And then actually, these are BAC arrays that I'm showing here. I'm actually not sure about that. But for Affymetrix, it's a standard kit. I mean, it's really, really easy. Yeah, so there's an SOP that you just follow, and it's standard, and with a box that almost any lab... I mean, these boxes are distributed everywhere. They're really straightforward to use. Okay, and then each data point here represents, in loose terms, the copy number of a particular clone or probe relative to the reference. And that reference can be a matched normal or a pooled reference, as I discussed earlier. So the difference then for high-density genotyping arrays is that for SNP6, we have measurement of two alleles at approximately... actually, it's not quite true that it's more than a million. It's 900,000 of the 1.8 million probes that are SNP-based. So we have actually two different 25-mers that differ at one nucleotide. And that's probing the major and minor alleles in the population. And that's actually a key distinction between the SNP genotyping arrays and the array CGH situation. I've already shown you what the power of that is and why that's important, because measuring alleles really gives you interpretive capacity for what those copy number changes are doing to the heterozygosity profile.
Okay, good. So now we're going to get into some of the topics around statistical inference. And the topic has already come up about normal contamination. A lot of these platforms, whole genome sequencing, SNP6 arrays, et cetera, are actually designed with normal genomes in mind.
And so it took a long time for the computational community and statistical community to catch up with the fact that, when you study a cancer population, there are many different properties of the cancer genomes that are not taken into account when studying a normal genome. So the major concepts are the fact that we have normal contamination. We have an admixture of stromal and lymphocytic cells with an epithelial component as well. And also, we have intratumoral heterogeneity. So I've done a lot of work in this area, where we've been able to start to measure and model the clonal populations of cells with different genomes. So we'll talk more about that this afternoon. And so there are two aspects of this for most experimental designs. One is that you have heterogeneous populations within a single sample, where maybe only a minor population might harbor a particular deletion. And so the sensitivity to that is directly proportional to the number of cells that actually harbor that change in the first place. And the second is that if we look at spatially separated deposits, metastases for example, we might find actually quite different results. So the tumors are likely to be clonally related, but just due to microenvironmental selective pressures, there may be in one sample a set of abnormalities that don't exist in the other sample at all. So we're subject to spatial sampling bias as well. So that's something to keep in mind. So then we have the issue of ploidy. So endoreduplication that induces tetraploidy, for example, or even octaploidy, is a common occurrence in tumors. So in ovarian cancer, probably at least half the cases at the time of diagnosis are tetraploid. They've got four copies of their genomes; they've undergone some whole genome reduplication. So that has some bearing on how we interpret alleles as well.
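To make the effect of normal contamination on the allele signal concrete, here is a minimal sketch that is not from the slides. It enumerates the possible genotypes for a given copy number and computes the log ratio and B-allele fraction one would expect at a germline-heterozygous SNP. The function names, the two-component mixture (tumor cells plus diploid AB normal cells), and the diploid baseline are all assumptions made for illustration; real tools also model ploidy and subclones.

```python
import math

def genotypes(copy_number):
    """All allele combinations for a copy number, e.g. 3 -> AAA, AAB, ABB, BBB."""
    return ["A" * (copy_number - b) + "B" * b for b in range(copy_number + 1)]

def expected_signal(genotype, purity):
    """Expected (log2 ratio, B-allele fraction) at a germline-heterozygous SNP.

    Assumption: the sample is a mixture of tumor cells carrying `genotype`
    (fraction `purity`) and normal cells that are diploid and AB.
    """
    n_total = len(genotype)          # tumor copy number at this locus
    n_b = genotype.count("B")        # tumor copies of the B allele
    mixed_total = purity * n_total + (1 - purity) * 2
    mixed_b = purity * n_b + (1 - purity) * 1
    if mixed_total == 0:             # pure tumor with a homozygous deletion
        return float("-inf"), float("nan")
    return math.log2(mixed_total / 2), mixed_b / mixed_total

if __name__ == "__main__":
    for purity in (1.0, 0.5):
        for g in genotypes(3):
            log_ratio, baf = expected_signal(g, purity)
            print(f"purity={purity:.1f} genotype={g} "
                  f"log2 ratio={log_ratio:.2f} BAF={baf:.2f}")
```

With lower purity, the expected values for every genotype are pulled toward a log ratio of 0 and a B-allele fraction of 0.5, which is one way to see why sensitivity drops when few cells carry the change.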
Okay, and so, you know, up until a couple of years ago, there were basically no tools available to take all these factors into account. There's really been an explosion of activity in the last three, four years. And so, just to make the point, taking off-the-shelf tools that are designed for normal genomes is not going to work very well. And so specialized tools for cancer are really needed. And this workshop is here to expose you to some of those tools. And you'll have some practice with them in the lab this afternoon. So here's a very nice overview of these concepts from Terry Speed's group. And he talks about the statistical analysis of these SNP arrays in cancer studies and how to deal with a lot of these different phenomena. So in general, we have a workflow that looks something like this, starting from the data generation, straight from the machine. This is for SNP6 analysis. We get something called a CEL file. And you'll be starting from a CEL file in the lab, I believe, and then going from there. And so we do a couple of really important steps. There's some pre-processing and normalization. Any time you do an array, there are some aspects that often need to be cleaned up. And the same is true of whole genome sequencing as well. So we have to go through some level of pre-processing and normalization. And we do the total copy number extraction from there, and then the B-allele extraction. So now we're dealing with these SNP6 arrays. We're going to infer these two different quantities. And then we're going to learn how to take that raw data and look for breakpoints for the copy number changes, segment into loss of heterozygosity and allele-specific copy number changes, and then ultimately we do some sort of gene and pathway analysis and/or a clinical correlation. So this would be, for example, a workflow that we applied to the METABRIC study. Okay.
So just some detailed specifications about SNP6. So I said there are 25-mer oligonucleotide probes, with 900,000 SNP probes and 900,000 CNV probes. And so there are a number of tools that have been developed for pre-processing. The tool that we like to use is called aroma. There are other very nice ones, PennCNV, for example. Andrew, what's going to be discussed in the lab? Which one? We're using the PennCNV process for normalization. Okay. So we'll talk about PennCNV in the lab. And both of these methods will output allele-specific and total copy number real-valued data. And so just by illustration, for Affymetrix, with the aroma.affymetrix package, this is just a histogram of the intensities across the genome for, I think, 10 different samples here. And you can see that they're quite variable and so not comparable to each other. And these are just normal samples. But then after aroma normalization, we get a much more comparable set. So this is very analogous to gene expression microarray normalization, which I think you're going to cover as well. So the important concept here is that there needs to be an adjustment of the data so that they're comparable to each other, and to remove the possibility of batch effects or something that's specific to a run. Okay. And then we have these different features.
Okay, so now I'm going to throw in some notation. So this is something that may scare some people, but don't be scared of it. These just help to unambiguously define what we're talking about. So we have the intensity for a particular allele at a position, A and B, and that can just be denoted by Y. And then the total intensity is just the sum of these two alleles. And then you have, for example, the total copy number at a given position. So it's just Yj over the reference. And there can be some standard normalization constant here, this gamma here. And then the B allele fraction is, of course, just the intensity of the B over the total.
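The notation just described can be written out as a small sketch. This is an illustration only, not the slide's exact formulas: the function names are made up, and placing gamma as a multiplier with a default of 2 (so that a tumor signal equal to a diploid reference maps to two copies) is an assumption.

```python
import math

def total_intensity(y_a, y_b):
    """Total intensity at a locus: the sum of the A and B allele intensities."""
    return y_a + y_b

def copy_number_estimate(y_a, y_b, ref_total, gamma=2.0):
    """Total signal relative to the reference, scaled by gamma.

    gamma=2.0 is an assumed constant so that tumor == reference gives 2 copies.
    """
    return gamma * total_intensity(y_a, y_b) / ref_total

def log_ratio(y_a, y_b, ref_total):
    """log2 of tumor total intensity over reference total intensity."""
    return math.log2(total_intensity(y_a, y_b) / ref_total)

def b_allele_fraction(y_a, y_b):
    """Intensity of the B allele over the total intensity."""
    return y_b / total_intensity(y_a, y_b)

if __name__ == "__main__":
    # Hypothetical intensities for one SNP probe pair and its reference.
    y_a, y_b, ref = 300.0, 100.0, 400.0
    print(copy_number_estimate(y_a, y_b, ref))
    print(log_ratio(y_a, y_b, ref))
    print(b_allele_fraction(y_a, y_b))
```

The reference total can come from a matched normal or from a pooled set of normals; the arithmetic is the same either way.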
So if you really want to dig into this, these are the types of notation that you might see. And then we go from signal processing to actual copy number. So what's shown here, if now we think about the notation, is the Y, or the total copy number, at each locus. And of course they're not colored to begin with, so we try to take these black dots and put colors on them by segmenting the data into these contiguous blocks and into different states. So we project the continuous data onto discrete values. And this question came up earlier about normal CNVs, so here's an advantage of having a matched normal. So this is a signal that is very focal. It looks like it's low amplitude, and so it has all the hallmarks of what you think would be a tumor suppressor gene. And so, oh, there's a homozygous deletion here. So I'm going to look at what that gene is, and then I'm going to do all kinds of functional assays and create mouse models and then write a Cell paper. But actually, this is in the germline, and so it's probably not affecting the malignancy at all. So this is where having a matched normal is a very big advantage. We've been developing algorithms, and this is work from Gavin Ha in my lab, asking: if you didn't have the matched normal, could one actually recognize this signal as being germline? And in fact, this analysis has shown just that. So he's developed a method whereby, just from a tumor sample, one can use the statistical properties that one would expect from normal germline CNVs to try to classify these events as being germline and not somatic. Okay, yeah? So is this a germline CNV, or is it a population type of minor allele? So there are two possible ways that something like this could happen.
So the reference human genome actually turns out to be predominantly, I think, from an African-American individual. I think there were five DNA samples in the original Human Genome Project, and this person's genome just made for a really nice library, and so it ended up getting used disproportionately to everyone else. And so the reference human genome could very well have tandem duplications at this locus, but that could be almost specific to that individual. So most of us probably don't have that tandem duplication, so it manifests as a deletion when we look at it. So that's likely what's going on here. Yeah? On your slides with all of the math, when you're translating to copy number, you have this reference. So when you talk about reference there, you're talking about a matched normal reference, but I think it just has to be a reference of, say, a pooled normal sample. So it's interchangeable, right? So a matched reference is best, but often you don't have one, so then you can replace that with just a pooled reference. So you think you'd get a library of these and then you can do it? Yeah, definitely. So just look at, this is, I'll put this example here. Yeah, because if you just look at this tumor example, we've done some algorithm development there, and the algorithm is generally good, but it's not perfect. And so the best way to rule out a germline polymorphism is by having a matched normal in the first place.
Okay, so I think I've talked about this, so I'm going to skip over this. Skip over this. So there are a number of approaches that have appeared in the literature. They range from smoothing-type models, where we try to actually fit a spline-type curve through the data. That's probably not recommended, because ultimately you go from a continuous-valued measurement to another continuous-valued measurement, and you still have to then interpret that.
And then we have segmentation approaches that try to infer where there is a change in level. And so this has been a very, very popular approach, to segment the data, or to just determine where the breaks are in the genome. That has some disadvantages in the sense that we still have these segments, and those segments are continuous-valued measurements as well. So you still have to then post-process that data to classify those segments into neutral, gain, loss, et cetera. There have been some suggestions of models that treat each data point independently, and that's really not desirable because we know that these events actually span multiple adjacent data points. And so here, for example, if you were to just look at IID classification, this set of data points is likely the same event here. But this classification essentially ranks the data points according to their intensity levels. And so that's going to be undesirable as well. So a nice way to approach this is with hidden Markov models, and we've seen huge amounts of activity in the literature. This is an old slide that I really need to update. But there have probably been 25 different approaches now published in the last, I would say, eight years on hidden Markov models. And there are still, I think, two camps. Generally, people fall into two classes. So there are people that like parametric models, where you can actually get interpretable output at the end. So you make some assumptions, but then the actual data that comes out is interpretable. So here you have a segmentation of the data that at the same time finds the breakpoints and also takes those segments and classifies them as loss, neutral, or gain. So in one algorithmic step, we get these two properties of knowing where the breaks are and actually classifying these segments as being a loss, neutral, or gain. And that also allows us to take advantage of the fact that adjacent data points are very likely to be in the same class.
So for example, here you have a loss, a deletion, and all of these data points are adjacent. So you're more likely to be similar to your neighbor than to something else. And so one can enforce some modeling assumptions on that. The problem there is that, of course, we do require parametric modeling. It forces us to make some assumptions. And for example, HMMs are often limited by the state space. So what a hidden Markov model does is essentially take these raw data and try to classify each point into a fixed number of states. And so, for example, you can have homozygous deletion, hemizygous deletion, neutral, and maybe three or four levels of amplification. But we know from, for example, HER2 that in fact you can actually have 100 copies. So if you want to actually infer the exact copy number, then maybe a hidden Markov model doesn't scale well, because it doesn't have the state space to be able to accurately represent that. So Andrew's been working on some models that can actually overcome this, and it's probably too new to talk about in an educational capacity, but you can ask him about that on the break, maybe. And so these hidden Markov models do have a lot of advantages, but they have some limitations as well. So here are some examples of this, and you can read these papers, for which there's actually software available as well: DNAcopy and, from our group, HMM-Dosage. The key idea of the DNAcopy algorithm, and you'll see this in the literature a lot, there are a lot of people that like this approach, is that we try to output change points in the data, and these change points are inferred by minimizing the within-segment variation. So here you have a segment. And we try to minimize the variation within that segment and then maximize the variation between segments, or the distance between segments.
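As a rough illustration of the objective just described, here is a minimal sketch that places a single change point where the within-segment variation is smallest. This is not the DNAcopy algorithm itself; it is a brute-force toy with made-up function names and made-up log ratios, meant only to show the idea. For a fixed data set, minimizing the within-segment sum of squares is the same as maximizing the between-segment separation.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the segment mean (0.0 for an empty segment)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_change_point(log_ratios):
    """Return (index, cost) of the single split [:index] / [index:] that
    minimizes the total within-segment sum of squares."""
    best_index, best_cost = None, float("inf")
    for i in range(1, len(log_ratios)):
        cost = sum_of_squares(log_ratios[:i]) + sum_of_squares(log_ratios[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost

if __name__ == "__main__":
    # Hypothetical probe-level log ratios: four neutral probes, then four lost.
    data = [0.0, 0.1, -0.1, 0.05, -0.6, -0.5, -0.55, -0.65]
    index, cost = best_change_point(data)
    print("change point before probe", index, "cost", round(cost, 4))
```

Note that the output is still two continuous segment means; deciding whether a segment is a loss, neutral, or a gain is a separate step in this style of approach.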
And so this algorithm works like this: you could put a change point at any point along the axis, and then essentially, through regression-type modeling, this algorithm finds the best place to put those segments to optimize these two factors, which is to minimize within-segment variation and maximize between-segment variation. So then the difference between that and hidden Markov models is that it's kind of an iterative process in hidden Markov models. So first we start with segmentation and then we alternate that with classification. We put the breakpoints on and then we determine what the class of the data points within each segment should be. Then we update some parameters and then reiterate that process. And the big advantage is, and this is very simplistic, that we can really assign semantic meaning to the states. So rather than just saying, here we have a segment that has a log ratio of 0.02, we say that that segment is a neutral segment, or where we have a segment that has a log ratio of minus 0.32, we say that's a loss. So in the output, you get the probability that each probe or each locus is a loss, neutral, or gain. Then we go from that into this two-dimensional analysis. So DNAcopy does that. You don't have to do that in the HMM because it's actually a Bayesian model and just uses the parameters of the model to maximize the likelihood of that particular state space given the data. So it's a different approach. There's no multiple testing correction. There's no p-value calculation. It's a parametric model whereby you can calculate a likelihood of the data given the model. That's for a deeper discussion. Two minutes? How much time? How much do I have left here? I have a lot left. Oh, okay. Okay, that's all right. So good. So that means we've had a good discussion. So just a couple of tools here. So the stuff that I just skipped through is just a listing of tools and websites. So you can look at that. You can ask me at the break, whatever.
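To show how a hidden Markov model both finds the breaks and names the states in one pass, here is a minimal sketch of Viterbi decoding with three states. It is not any published tool: the state means, the noise level, and the probability of staying in the same state are all assumed values, and the parameters are held fixed rather than re-estimated iteratively as described above. A per-probe nearest-mean classifier is included for contrast.

```python
import math

STATES = ["loss", "neutral", "gain"]
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # assumed log-ratio means
SIGMA = 0.2   # assumed measurement noise (standard deviation)
STAY = 0.9    # assumed probability that a probe shares its neighbor's state

def log_emission(x, state):
    """Log density of log ratio x under a Gaussian centered on the state mean."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))

def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))

def viterbi(log_ratios):
    """Most probable state path for a sequence of adjacent probes."""
    score = {s: math.log(1 / len(STATES)) + log_emission(log_ratios[0], s)
             for s in STATES}
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = prev
            new_score[cur] = (score[prev] + log_transition(prev, cur)
                              + log_emission(x, cur))
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def independent_calls(log_ratios):
    """Contrast: classify each probe on its own by nearest state mean."""
    return [min(STATES, key=lambda s: abs(x - MEANS[s])) for x in log_ratios]

if __name__ == "__main__":
    data = [0.02, -0.05, 0.1, -0.45, -0.1, -0.55, -0.5, 0.03, 0.0]
    print(independent_calls(data))
    print(viterbi(data))
```

In this toy data the fifth probe (-0.1) sits inside a run of losses. Classified on its own it looks neutral, but the HMM keeps it in the loss segment because switching states twice costs more than one noisy measurement.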
All right, so one nice tool to visualize data is IGV, the Integrative Genomics Viewer. I'm sure you've already been exposed to that. Andrew's going to give you an example of that, and show how to do that. This is just showing the ERBB2 locus here. So these are in red. The high-level amplifications of ERBB2 are shown here. And then there are literally a thousand cases here. You can see that about 15% of the population has these high-level amplifications of ERBB2. These are deletions, and I'm going to go over those as well. I'm just going to skip over this stuff. This is the actual GC content and mappability normalization. So you can see that there really is an association between read counts and GC content. And so with read counts shown on the y-axis here and GC content shown on the x, there's a strong relationship that we can normalize out, and we can do something similar for mappability as well. And there are several tools that take this into account. And this is the effect of that. So at the very top are these 1 kb bins that are unnormalized. We normalize by GC content, and we get a better readout. We normalize by mappability, and we get a much more interpretable readout here. And this is what it might look like. We get very nice data from NGS.
Okay, so let's just finish up with some advanced topics. So somebody asked about copy number from exomes. Here's an example from ExomeCNV. And so here's just a Circos plot showing how one can infer copy number from exomes. Again, I would say that the reliability of this is still yet to be determined. But sometimes people generate exomes and then want to just take advantage of the fact that they have this data. Can I get more information out? So the primary goal of exomes is maybe sequencing and looking for mutations, but then we have the data, so let's try to take better advantage of it. And so one can use a tool like ExomeCNV to pull out copy number changes. And it uses all the same concepts that I've talked about today.
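The GC-content correction mentioned in this part of the talk can be illustrated with a very simple sketch: group genomic windows by GC content and divide each read count by the median count of its group. This is a simplification chosen for illustration, with made-up names and data; published tools generally fit a smoother curve to the relationship, and a mappability correction could follow the same pattern.

```python
from collections import defaultdict
from statistics import median

def gc_normalize(read_counts, gc_fractions, bin_width=0.1):
    """Divide each window's read count by the median count of windows
    that fall in the same GC-content bin (bin_width is an assumed setting)."""
    groups = defaultdict(list)
    for count, gc in zip(read_counts, gc_fractions):
        groups[int(gc / bin_width)].append(count)
    medians = {key: median(values) for key, values in groups.items()}

    normalized = []
    for count, gc in zip(read_counts, gc_fractions):
        m = medians[int(gc / bin_width)]
        normalized.append(count / m if m > 0 else float("nan"))
    return normalized

if __name__ == "__main__":
    # Hypothetical windows: higher GC windows attract more reads overall.
    counts = [80, 100, 120, 200, 250, 300]
    gc = [0.32, 0.34, 0.36, 0.52, 0.54, 0.56]
    print(gc_normalize(counts, gc))
```

After the correction, windows with very different raw depth but the same relative signal end up on a common scale, which is what makes the downstream segmentation interpretable.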
So that's the read counts, B-allele ratios, et cetera. So here's another example, and these plots should look familiar to you. So this is controlling for GC content, for example. And this is a package called Control-FREEC, by Emmanuel Barillot's group in France. And this is a nice package as well that we've used, and it works. It's nice. So here's an example of how complex it can get. So this is an example of chromothripsis. And this was a phenomenon that was reported a few years ago. And it really shows that in a single cell division event we can get a chromosome that shatters essentially into hundreds of pieces. And then through non-homologous end joining and through other repair mechanisms it gets stitched back together, but in a scrambled way. So all these arcs here at the top represent rearrangements where you have a shuffling of that genomic material. And you have copy number losses alternating with copy number gains in this kind of sawtooth pattern. And so this can get extremely complex. And again, none of the methods that I've talked to you about today would be able to find this. But Andrew and others out there have been developing methods to really try to profile and identify these complex rearrangements. Only in the cancer cells, yeah. So here's an example of when that's useful. So neuroblastoma is, of course, a childhood tumor. These actually tend to be devoid of mutations. So a lot of people ended up trying to sequence these childhood cancers and found that they were barren mutational landscapes. So they went through this process of sequencing these genomes and didn't find any mutations, and it was just incredibly frustrating. And this group actually looked at chromothripsis and found that, in fact, there was a subpopulation of these tumors that exhibited this pattern of chromothripsis, and they were actually able to associate that with inferior outcome.
And so they were able to stratify this patient population into a group of 10 patients with chromothripsis. You can see that the outcome here for these kids is really, really bad. And those patients without chromothripsis had a much better survival. So this is a property of the genome that is completely independent of gene content. This is just, does this phenomenon happen? That can be indicative of a phenotype. Yes? So what's the celerity in which... This is the Complete Genomics paper from a couple of years ago. So wouldn't there be, probably, a certain period of time when things have broken, and wouldn't that sort of show up, because things weren't pairing up properly here? So, yeah, often there are insertions of nucleotides right at the edges of these... Yeah. So that's mediated by non-homologous end joining. Basically, did that show up as a mutation? I was trying to figure out how come you didn't see any mutations. Yeah, so... I'd have to look into the paper again. The main point of the paper is that this is an event, a characteristic of the genome, that's independent of gene content. It's the whole pattern across a genome that matters here. And that was prognostic in a major way. Okay, yes? Like, to try to decode the grades... Oh, in chromothripsis? Yeah. I don't... That said, it'd be very, very difficult. I mean, it's like trying to interpret a hypermutator case, where you have a mismatch repair protein that needs to be... You have orders of magnitude more mutations than what you'd typically expect. So you get, literally, 10,000 total mutations. And so, how does one interpret that? Oh, that's better. Passengers sort of always try to... That's exactly what you see. So, we'll talk about passengers versus drivers this afternoon, but that's kind of irrelevant here. So, chromothripsis is like the hypermutator problem. Okay, so Michelle wants me to stop. So, I'm going to stop. So, we're going to do a group picture now before the coffee break.
Can I just say one thing here? Yes. Okay, and then we'll... So, this is a nice study by Nick Navin a few years ago. He actually looked at single cells of breast cancers. So, he isolated different populations of cells, and was actually able to profile the copy number of those different cells. And so, what he found is that there were three distinct populations that segregated according to their copy number profile. So, this is within one tumor. We have these populations of cells that exhibit really quite dramatically different copy number profiles. So, just to make the point that copy number profiles can actually be used to study clonal evolution and intratumoral heterogeneity. And so, they become really important markers of clones and can then be used to characterize a mixture of populations that exist in a tumor. And we'll talk in more detail about that this afternoon.
So, let's just wrap up then to say that the genome architecture is a fundamentally important aspect of studying the cancer genome. Somatic copy number alterations change the gene dosage and can drive expression of oncogenes and tumor suppressors. There are multiple different platforms: genotyping arrays, whole genome sequencing. And the properties of the genome revealed through copy number profiles can actually indicate important phenotypic characteristics of cancers. So, hopefully I've convinced you of these things, but you can also ask me throughout the day if there's something that I wasn't clear about. So, here's the list of tools, and now we can take a break.