Right, so: introduction to biomarkers. What we're going to do in this section is look at a brief history of clinical biomarkers in cancer, the use of biomarkers in clinical practice, and maybe just talk about the future of biomarker discovery. Then, after the break, we'll look at how you actually measure these features, and then we'll have a lecture on alternative splicing. And then just before lunch, we'll go over this example study that we keep mentioning.
So what is a biomarker? Does anybody have a good definition of a biomarker? Don't be shy. There's no right answer. Sure. The biomarker vary. It depends on if you feel a little bit more sensitive. I think that's an important part, and it's something you can measure. Any other comments?
Okay, so I'm just going to flash up some examples now. You can tell me, does anybody know what this story is? This is two genes coming together. Yeah, that's right. So this is a biomarker for prostate cancer, which is a gene fusion. You've probably seen plots like this. Where's the person working on a GWAS study? Right, truly. You've probably seen pictures like this. So this is just a plot of a SNP associated with a particular disease. But you might know what this is. Why don't you tell the group what this is? Great. Okay, and this one is, I think, one of the most classic cases of biomarker use in clinical practice right now. This is just showing a copy number amplification of HER2, or ERBB2, on chromosome 17 in a breast cancer tumor. What's shown there is Affymetrix SNP 6 profiling of this tumor, and the red spike shows the locus that contains this gene, HER2. You can see that there are many, many more copies of this gene in this particular tumor than would be expected normally.
And so this is a targetable biomarker when it's overexpressed. You can make a test for that and actually prescribe the therapy accordingly. This one, then, is a mutation in the gene called PTEN in an ovarian cancer. This is a real result from data that I work with; mutations in PTEN in certain types of ovarian cancer are well described and well known. And so there it is. So this just gives you a flavor of the variety of biomarkers that are out there and the different measurement technologies that go into their discovery and their analysis.
So I just wanted to go over some definitions from some places around the web. Here's the Wikipedia definition: a biomarker is a substance used as an indicator of a biological state. It is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. Here's something from the Huntington's Outreach Project at Stanford: a biomarker is a specific biological trait, such as the level of a certain molecule in the body, that can be measured to indicate the progression of a disease or condition. This is the Human Proteome Organization. Their definition is: a biomarker is used to indicate or measure a biologic process, for instance levels of a specific protein in blood or spinal fluid, genetic mutations, or brain abnormalities observed in a PET scan or other imaging test. So biomarkers don't have to be molecular. They can be from images or other sources of data. Detecting biomarkers specific to a disease can aid in the identification, diagnosis, and treatment of affected individuals and of people who may be at risk but do not yet exhibit symptoms. Okay, so this is from Lilly Trials: a biomarker is a measurement of a variable related to a disease that may serve as an indicator or predictor of that disease.
And biomarkers are parameters from which the presence or risk of a disease can be inferred, rather than being a measure of the disease itself. So what we sometimes use is biomarkers as a proxy: it doesn't actually indicate the disease itself, but we might have some sort of association. The Biomarkers Consortium from the NIH defines biomarkers as characteristics that are objectively measured and evaluated as indicators of normal biologic processes, pathogenic processes, or pharmacological responses to an intervention. That sounds like what we've seen before. So at the end of the day, a biomarker needs to be an objectively measurable quantity, and we need to be able to infer something about disease from it. From a clinical perspective, these are indicators for management of care, and we often want to be able to use biomarkers in diagnostics, in prognostics, and as therapy targets. In basic science, biomarkers help us to better understand mechanisms of disease. I think that's an important element that may not have been mentioned before: it actually helps us to understand the disease better.
So here are just some types of biomarkers in cancer. We've already heard these terms before. A diagnostic biomarker is used to detect and identify a given type of cancer in an individual. So we can use a diagnostic biomarker to help us subtype different types of cancer, and sometimes the subtypes may respond preferentially to different therapies. So these help to diagnose the tumors, and that helps indicate management of care. A prognostic marker would help predict the probable course of disease. So how aggressive is the disease, and what is the likely outcome for the patient? Given elevated levels of some biomarker, what is the prognosis for the particular patient?
And then we have predictive markers, which help us to determine response to therapy. So here's just a table that I extracted from this paper, which was actually written by a group here in Toronto. Does anybody know these people? I actually found this to be quite a nice review paper. This is just a short list, but one of the things to note is that the latest entry on this list comes from the 80s. So the number of new markers that have recently come into effective use in clinical practice is remarkably small. What's listed here, and I'll just again highlight the canonical examples, in breast cancer are HER2, also known as ERBB2, and the estrogen receptor. These are used to actually direct chemotherapy in these patients.
So why are so few new biomarkers in clinical use? I'll actually put this out to the class. Can anybody give some insight into that, or have some thoughts about why so few? We're generating lots of data. We're actually making a lot of associations with clinical outcome in these data: genome-wide association studies, gene expression profiling, huge copy number data sets, and now we're into sequencing. So we're generating lots of data and actually finding markers that are associated with outcome, but we have very few markers in clinical practice. So, please. Are at the lower expression levels of, I mean, it's basically limited by the technology, whether we have the resolution to find it, be it proteomic or expression profiling, we just. Great. Thanks for that. Any other comments? Please. Yeah, so that's an excellent point, and we'll be talking a little bit about that. We have a very large number of features that we're measuring, and very few patients that we're usually measuring. So the ratio of the number of features to the number of patients is really quite large. And so there may be markers that are simply just present in our one cohort.
But when you actually extrapolate to a larger cohort, it's just not reproducible. So that's a major issue. Any other thoughts on this? Sure, please. Yeah. Yeah. Yeah. Do any of the clinicians have any perspective? I think it's time. When you find something that might be useful clinically, it takes time to get into the mainstream. First, we have to validate the grouping across several types of population. And then you have to show that it changes the probability when you use it or when you don't use it. So you get to clinical trials, randomized controlled trials; you have to see if it's really valuable or not. And among all the markers that are generated, only a few will pass all those filters at the end and get into clinical management. The main thing is you are generating all those data right now, but from now, to get into the clinic, it usually takes 10 to 15 years. Yeah. That's my reason why I think markers are not getting there. Great, great. Any other perspective, please?
So essentially, just to summarize what was said: first of all, this process is very difficult. It's a really hard process to try to find a biomarker for clinical use. Second, the signal-to-noise ratio for most markers is actually really small, so to find a signal out of the noise is really quite difficult, and often it's not reproducible in larger cohorts. Another item is that it really does have to go through a lengthy process. So all those things combined make this quite a difficult field in which to really gain success. So here's just an example of the actual process. Here are some barriers and challenges to adoption. We start out in basic research, and we can be generating lots of measurements for various biological entities, either proteins or transcripts or DNA.
And we might find some sort of association with outcome. But then we need to go through this process, and this is really where the difficult part comes in. So here's just a figure that I pulled out from this paper here, and it's this part here: a validation study on samples from patients with disease and healthy controls. Controlled clinical trials take a long time and are difficult to run, and these are on large numbers of patients. This is actually a major problem right here.
So here's just a flow chart of what might actually have to take place. We start with a patient sample, and we may have some sort of biomarker identification. Well, the first problem, and one of the reasons for lack of reproducibility, is that there's a lack of standards for how you prepare samples. A lot of these are actually retrospective studies, where the samples might have been fixed in various different ways when they were originally resected from the patient. So there's a lot of variability there in terms of how the samples are prepared. Then we might get to the point where we have biomarker application. This might help us determine the tumor type, stage, or grade, for example; it may have an association with prediction of survival or response to treatment. So we can maybe make those associations in our discovery cohort. But often that cohort is just too small, and we have problems with what we call overfitting. This is a term that everybody in this field, all the people using high-dimensional data sets, should really become familiar with. The problem of having many, many features and few patients often means that we could be spuriously finding things that are just present in our actual discovery data set, and they just don't generalize to larger cohorts.
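The overfitting problem described here can be illustrated with a small simulation. This is a hypothetical sketch, not an analysis from the talk: the feature count, the cohort sizes, and the function names (`pearson`, `simulate`) are assumptions chosen for illustration. Every feature is pure random noise, yet the best of 2000 noise features tends to look strongly associated with outcome in a 30-patient discovery cohort, while the same noise "marker" is not expected to show that association in a larger validation cohort.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_features=2000, n_discovery=30, n_validation=300, seed=0):
    """Return (best discovery correlation, validation correlation).

    All numbers are random noise, so any association found is spurious.
    The default sizes are illustrative assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    # Outcomes are random and unrelated to every feature.
    disc_outcome = [rng.gauss(0, 1) for _ in range(n_discovery)]
    val_outcome = [rng.gauss(0, 1) for _ in range(n_validation)]

    # Screen many noise features and keep the strongest correlation.
    best_r = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_discovery)]
        r = pearson(feature, disc_outcome)
        if abs(r) > abs(best_r):
            best_r = r

    # The "winning" feature is noise, so in new patients it is fresh noise.
    val_feature = [rng.gauss(0, 1) for _ in range(n_validation)]
    val_r = pearson(val_feature, val_outcome)
    return best_r, val_r


if __name__ == "__main__":
    best_r, val_r = simulate()
    print(f"best |r| in discovery cohort: {abs(best_r):.2f}")
    print(f"|r| of that marker in validation cohort: {abs(val_r):.2f}")
```

Selecting the maximum over thousands of features is what inflates the discovery result; shrinking `n_features` or growing `n_discovery` should narrow the gap between the two numbers.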
And I think that's probably the single biggest problem with this field: the original discovery cohort sizes. We've all read a million papers on this, but the biggest problem is that the sample sizes are just too small. I think what we're seeing in the literature up until now, in terms of high-dimensional data sets, has been reaching the 100 to 200 range. Certainly the paper that we're going to be looking at looked at 145 DNA copy number profiles and 118 gene expression profiles, and I think there was an overlap of about 106. That paper was published in Cancer Cell, but it's still a small cohort size. So one of the goals of the breast cancer project that I talked to you about earlier is to really ratchet this up by an order of magnitude. We're looking at profiling 2000 tumors, and hoping that that will really help solve this problem. It's still probably not enough, though.
Yeah. No, I don't think so. Yeah. So I think it depends what you're looking for, and it depends on the kind of study design. So is 2000 enough? Probably not. But what we're surmising is that this may help identify rare markers that would show up as maybe singletons in a set of hundreds. And you can't really say much about singletons. Something occurs once, and what can you say? But it may recur 10 to 12 to 15 times in a sample size of 2000, and then you may have some indication that that might be something relevant. Yeah, right. That's right. Correct. Yeah, absolutely. So you have multiple, then you have the top problem, right? Then you have variability in terms of preparation. So I have a very rare example: we found a biomarker with four patients, and that was published just recently. So I'll talk a little bit about that.
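The arithmetic behind the singleton argument can be sketched as follows. The 0.5% marker frequency is an assumption picked so that the expected counts match the numbers quoted above (about one case in 200 tumors, about ten in 2000). The helper names are hypothetical, and tumors are treated as independent draws, which is a simplification.

```python
from math import comb


def expected_count(n_tumors, freq):
    """Expected number of tumors carrying a marker of frequency `freq`."""
    return n_tumors * freq


def prob_at_least(k, n_tumors, freq):
    """Binomial probability that at least k of n_tumors carry the marker."""
    p_fewer = sum(
        comb(n_tumors, i) * freq ** i * (1 - freq) ** (n_tumors - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    freq = 0.005  # assumed marker frequency, for illustration only
    for n in (200, 2000):
        print(
            f"n={n}: expected carriers = {expected_count(n, freq):.1f}, "
            f"P(at least 5 carriers) = {prob_at_least(5, n, freq):.3f}"
        )
```

Under this assumed frequency, a cohort of a few hundred will rarely show the marker more than once or twice, while a cohort of 2000 will almost always show it several times, which is the point at which recurrence starts to be interpretable.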
So I'm a bit hypocritical there, with that four-patient biomarker. But at the same time, if the signal is very strong, then you need fewer patients. When we're talking about discovery, though, this is a bit of an issue. Okay, so we have problems with overfitting and reproducibility. What we'll show in the lab as well, and there are details on this in the slides, is that breast cancer since about the year 2000 has been spoken about in terms of five gene expression subtypes. What a group of individuals did is build classifiers based on these subtypes. So what we're going to do in the lab is actually try to rediscover those subtypes from the data in the Chin paper. And you'll see that you get similar results, but it's not crystal clear. So if you use the classifier, it is somewhat reproducible, but to actually rediscover the classifier from another cohort, it's not. It's just something to bear in mind.
Okay, so then finally, the need for prospective trials. I don't personally have a lot of experience in this, but this is actually required for clinical use, and it is a lengthy and expensive process, as has already been mentioned. Okay, so this is all sounding very negative. But you're here because there's some optimism. So have all the important markers been found? Is it just limited to that 12 that I showed in the table? Well, hopefully not. What's happening now is that measurements are getting more and more precise. We're moving from hybridization techniques, which give really kind of noisy signals, to sequencing. And sequencing is the highest resolution that we can possibly look at from a molecular standpoint.
So there's a lot of activity in sequencing cancer genomes. Francis is involved, and more and more of you, I'm sure, will start to be involved in this activity. I think if I give this workshop a year from now, the focus will be much more on this than on what we're going to talk about, because it's really at a point where, so far, there's only been one full tumor genome published. And then we have our own paper on four transcriptomes. But that's it for now. I think what we're going to see is that we'll probably have tens of papers describing tumor genomes by this time next year. So this is where the field is moving.
So right now we have about 100,000 mutations in cancer described in a database called COSMIC, which is a repository of mutations curated from the literature. The main point here is that these are mostly obtained through targeted studies. Investigators have a hunch, or have legitimate reasons, for chasing a certain gene and sequencing through it, and, lo and behold, they find mutations in that gene. So these are very directed studies. But what next-generation sequencing offers us is the ability to do mutation discovery in what I would call an unbiased way. We can look at the whole genome now in a relatively cost-effective way to do mutation discovery, where we don't really know ahead of time where the mutations might be or where they're hiding. And so, what we're going to move towards: this is a 1500s-era map of the world, from when all the European explorers were trying to chart out the globe, and you can see it's not quite accurate, it's kind of fuzzy. What we will move towards is hopefully a much more precise landscape of mutations in cancer.
And so this is really what biomarker discovery is all about. The other reason for optimism is that new biology is being discovered. Here's an example that was published in Nature just a couple of months ago, about large non-coding RNAs in mammals. These are new species of molecules that we didn't really know existed until just a few months ago. Almost every new molecule has some variability in humans, and the question is: is there variability that is actually associated with clinical outcome or phenotypes? And we just don't know. So this is an area of investigation: we want to measure these new biological markers in human tissues and in our case-control studies, and they may actually have some relevance to the clinic. The classic recent example is microRNAs, which have only really been on the scene for a decade or so. All kinds of studies are coming out saying that mutations in microRNAs, or the differential expression of microRNAs, have some clinical impact. Another type of entity is highly conserved non-coding elements. These are elements that are highly conserved across mammals, but we just don't know what they do yet, and people are working furiously to try to understand what they do. But if we start to see mutations in these types of elements in these cancer genomes, for example, we might want to try to associate that with cancer. So there are new molecules, as well as higher-dimensional and more precise measurements, coming out. And here are just two examples of very large-scale, high-throughput projects that I think also hold a lot of optimism.
These are studies that are very ambitious, and are really going past this small cohort problem that I talked about: expanding cohort size, and doing standardized data collection and data analysis. So here's a paper from The Cancer Genome Atlas project, which Anna was actually involved in a little bit, with Joe Gray's lab, as part of this organization. They published this paper just recently on a comprehensive genomic characterization of glioblastoma genes. And then another organization that's come up, whose inception has been in the last two years (two years now, Francis? Yeah), is the International Cancer Genome Consortium. This is a group of investigators and labs from around the world, involving many, many countries. So these projects really hold a lot of optimism because they are trying to look at large cohorts. It's still 500. Is 500 enough? Is 500 too much? It probably depends on the cancer type that you're looking at. But nonetheless, these will be incredibly rich data sets that will need to be appropriately mined and analyzed in order to actually pull out biomarkers from them.
So this is exactly where bioinformatics comes in. A lot of you may not be directly involved in developing methods for doing this. But nonetheless, what we will have is large cohorts and high-dimensional data, and what we will need to make sense of this is robust algorithmic and statistical tools to bring knowledge from data. Data generation is one thing, but ultimately what we want is knowledge from that data. So I think there's a lot of reason to be optimistic, and a lot of reason to learn about bioinformatics, given these types of projects that are ongoing. So, here's my hypocritical case report. This is a study I was involved in at the BC Cancer Agency.
And we actually profiled, starting with 11 tumors, four of which were granulosa cell tumors of the ovary. And we discovered a mutation that was present only in granulosa cell tumors, and not in the other subtypes of ovarian cancer. This was published in the New England Journal of Medicine just a couple of weeks ago. This is what the mutation looks like, and this is one of the examples where the signal was so strong that it was hard to ignore. This is what's called a sequence logo. What it represents is the purity of the alignment: if you stack up sequence reads and look at this position, all the reads would tell us that there's a G there, and at this position all the reads tell us that there's an A there. And you can just move along. Let me just illustrate with this particular case. Here you have a tiny bit of noise in the data, so we have a couple of Ts here. But then you get to this position here, and we saw a heterozygous mutation, C to G, here.
The way we actually found this was essentially by applying a series of filters. We look at all the coding positions in the transcriptome. Then we look for all variants that actually cause an amino acid change. And then we also ensure that none of these positions are already present in dbSNP. We're really after somatic mutations that are specific to tumor subtypes, and if something is present in a database of known polymorphisms, chances are that it is just a germline polymorphism that may not be related to cancer. So we had some very specific criteria for isolating these events. And then the next thing: I mentioned that we sequenced four of these cases, and then we had tumors from other subtypes. We required that a mutation be present only in one particular subtype and not in the others.
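The filtering steps described above can be sketched as a chain of simple predicates. This is an illustrative sketch only: the `Variant` record, its field names, the gene labels, and the toy data are assumptions made up for the example, not the study's actual pipeline or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one candidate variant call."""
    gene: str
    position: int
    is_coding: bool            # falls within a coding position
    changes_amino_acid: bool   # non-synonymous change
    in_dbsnp: bool             # already listed as a known polymorphism
    subtypes_seen: frozenset   # tumor subtypes in which it was observed


def candidate_mutations(variants, target_subtype):
    """Keep variants that pass every filter described in the talk."""
    only_target = frozenset({target_subtype})
    return [
        v for v in variants
        if v.is_coding                      # 1. coding positions only
        and v.changes_amino_acid            # 2. causes an amino acid change
        and not v.in_dbsnp                  # 3. not a known polymorphism
        and v.subtypes_seen == only_target  # 4. seen in one subtype only
    ]


# Toy data with made-up genes and positions, for illustration only.
TOY_VARIANTS = [
    Variant("GENE_A", 101, True, True, False, frozenset({"granulosa"})),
    Variant("GENE_B", 202, True, True, True, frozenset({"granulosa"})),
    Variant("GENE_C", 303, True, False, False, frozenset({"granulosa"})),
    Variant("GENE_D", 404, True, True, False,
            frozenset({"granulosa", "serous"})),
    Variant("GENE_E", 505, False, False, False, frozenset({"granulosa"})),
]

if __name__ == "__main__":
    for v in candidate_mutations(TOY_VARIANTS, "granulosa"):
        print(v.gene, v.position)
```

A separate tumor-versus-normal comparison would still be needed to confirm that a surviving candidate is somatic; excluding known polymorphisms only lowers the chance that it is germline.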
That a mutation is specific to one subtype would give us some clue that it has something to do with the tumor biology of that subtype. What we found is that this mutation was in all four index granulosa cell cases. Then we took this further to a larger panel. We collected 89 granulosa cell tumors from around the world. These are actually quite rare tumors, so the lead of the study, Dr. David Huntsman, called in lots of favors from around the world, and people were sending us additional samples. And we found this mutation at exactly the same position, again in this gene, in 86 of 89 additional granulosa cell tumors. Because it was so specific, at the exact same position, we were able to design an assay that just looked at that position in all the other cases that we received. So then we used this assay and profiled 800 other types of cancers, including breast cancer and other ovarian cancers and lung cancer. And it wasn't present there.
So this is a disease, and maybe some of the pathologists can comment on this; Blaise probably knows it best. Maybe you can just talk about the histology and the diagnosis of this. So Blaise was a collaborator on this study as well. So this helps a great deal. So this provides now a very, yes, please. Yes, so that's an important, yeah, thank you for that. One thing we needed to do up front is determine that this was a somatic mutation, so it's only in the tumor cells and not in the normal. We had normal DNA for these cases as well, and we looked at the normal and at the tumor to make sure that it's only present in the tumor. If it was a germline mutation, something that you're born with, then it would be present in both. We think we can explain away those three, actually.
So there is some ambiguity around two of them, and one of them is actually really quite diffuse. The signal probably just wasn't strong enough in the sample itself, so we couldn't really report that. I mean, that's kind of cheating retrospectively. But in truth, we could explain away those three. So what this finding provides is now a diagnostic and potentially a target for a novel therapeutic. It just so happens that this is a transcription factor, and these are notoriously difficult to target from the therapy perspective. But nonetheless, we've identified what we think is the important driver mutation in this disease. Okay, yeah, please. Absolutely. So I think there are two important things that need to happen from this point on. One is that it's not functionally characterized yet. We don't really know from a mechanistic standpoint what this mutation is actually doing. There's a lot of follow-up research going on right now to try to characterize that; we have potentially created a model system using cell lines, but it's still ongoing. So that's one thing that needs to happen. And the second thing, you're right, is that I think it needs to be expanded into prospective trials, to see if it actually helps in a much larger cohort. The problem is that assembling a large cohort of these patients is difficult; it's a rare tumor.
Okay, so I just wanted to draw your attention to this part. This is a nice paper that talks about proteomics-based biomarker discovery. What it does is illustrate quite nicely, I think, the process of biomarker discovery. The important thing to note is that when we start out in these studies, we look at many, many, many markers.
So we look at thousands of genes, we look at 3 billion nucleotides, we look at tens of chemical markers, you name it. And we usually do this in a small number of samples, because these types of profiling experiments are usually quite expensive. Then, as we find associations in the small cohort, we need to go through a series of steps whereby we grow the cohort but hopefully also reduce the number of markers that we're looking at, until at the very end we're dealing with a small number of markers and a large number of patients. From an experimental design and discovery point of view, I think this is the paradigm that needs to be followed. And in fact, in probably a coarser way, this is what we did in our study: we really did profile the whole transcriptome and looked at 60 million coding positions in a small number of cases, and then we zeroed in on one, in a larger number of cases. I think that's really an important aspect of the process of biomarker discovery. The main issue is that one needs to validate and revalidate and revalidate using larger and larger cohorts in order to make sense of our discoveries. This is a quote from David Huntsman that I'm stealing here: he says that an ideal genomic study becomes genetic. So we start by looking at the whole thing, and we narrow it down to very few.
Okay, so just to recap: we've gone over, hopefully, what a biomarker is. We've learned that few biomarkers are currently in use in cancer. We have some reason for optimism that new technologies are actually providing results. So, you know, we've added one mutation to that 100,000 in COSMIC. But one can imagine more as these projects like the ICGC and the TCGA ramp up, and it doesn't even need to be huge studies like that.
I mean, there are probably very directed studies going on in various labs here, and new technologies are showing promising results. And we've gone over the process of biomarker discovery. So that sort of concludes this part of the lecture. But I do want to ask if there are any questions or comments, and maybe we can have a little discussion for the remaining 10 minutes before we go on a break. So, any comments or questions on what we've seen so far?
So that's the big million-dollar question: what is all this data actually going to get us? Personalized medicine, which is this buzzword that's been around for half a decade now, maybe longer: are we going to realize that? Is that something we're actually going to be able to get to? The days are still very early for that, I think. But maybe we'll take slow steps towards that. And I hope that Dr. Scherer will give us a perspective on that at the end of tomorrow as well. But if there are any other comments on that particular issue, if some of the clinicians can comment on that, please. So I'll talk a little bit about that in the next set, but you want to know the better than the older one. And that's our goal, right? Well, yeah. And theoretically, yes, because it's more and more objective. Now, because we've proven that that's going to be the person. Yeah, I mean, I come from the somewhat narrow perspective of just ovarian cancer as well. David and other colleagues have basically redefined what ovarian cancer is: it's not one disease, it's multiple different diseases, and it needs to be considered as such. And that's done through markers. Yeah. Any other comments? Anna, do you have anything to add? Okay, great. Well, in that case we'll break, and we'll come back here at 10:50. And we will hear from Anna about alternative splicing in clinical genomics. So coffee's just outside.
Okay, so we have quite a few things to cover before lunch, so I think we'll get started. We've gone over now what a biomarker is, and some definitions for biomarker. Now we're just going to talk about measurement technologies that are in current use in genomics and proteomics. These are really cursory overviews, but it's just to make you aware that these are technologies that are being used. So we're going to cover gene expression microarrays, and genomic microarrays for SNP and copy number detection. We're going to talk about next-generation sequencing, and talk a little bit about immunohistochemistry as well.
So, gene expression microarrays. The biology of this is to quantitate transcripts, so mRNA transcripts. And the technology that's used is hybridization and fluorescence intensity. Many people in this room understand that quite well, so it's not necessary to go into detail about that. One of the limitations of this technology is that essentially what we're probing for is things we already know about. We have some idea of the annotations of the genes in the human genome, and we're going to probe those for transcript quantitation. So basically the limitation is that we can't discover new transcripts, for example, using microarrays. Here are some of the biological questions that one would ask of a microarray experiment, and feel free to jump in with other ones. For example: which genes are differentially expressed in my samples versus control, or in one subtype versus the other subtype? What subgroups can be identified in my population based on their gene expression profiles? We've seen hundreds and hundreds of papers with clustering of gene expression. And then finally, can a gene expression signature actually be used to classify a new sample?
So if I have a gene expression signature that's associated with a certain phenotypic class, and I have a new case that comes in, and I don't know what phenotypic class it belongs to, can I just use gene expression to classify my sample? So here's just an example of the data. Essentially what we get out, and this is really after many steps of normalization, which need to be done correctly and are a field unto themselves, is a data matrix. We have n rows and p columns here, and n is often much, much larger than p. We've talked about that already. In this matrix X, the entries represent the relative quantity of transcript i in sample j. Okay. As for the types of analysis that we do: we do normalization, which is a critical first step in gene expression analysis, as I'm sure you're all aware. We want to look at differential expression. We can do what we call unsupervised clustering, and I'll talk about what that is this afternoon. We can do classification. We can do longitudinal studies, so often in model organisms we want to take measurements at different time points and see which genes get turned on at what time points. And then, as a number of you who participated in the workshop just last week will know, we do things like network reconstruction, where we try to find genes that are co-expressed and that actually belong to the same type of biological network. There's tons of software for gene expression. A place to go looking, and it's really an amazing resource, is called Bioconductor. This is a set of modules and libraries written in the R statistical language. It contains 320 software packages, more than half of which, I think, are devoted to microarray gene expression analysis.
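As an aside on the data matrix described above, here is a minimal sketch of the differential expression question: for each transcript, compare the mean log2 expression between two groups of samples. The gene names, values, and group labels are made up for illustration, and the Welch t statistic is just one simple choice; real analyses would typically use dedicated Bioconductor packages after proper normalization.

```python
from math import sqrt
from statistics import mean, variance

# Toy matrix X: rows are transcripts (i), columns are samples (j).
# Values are invented log2 intensities, assumed to be already normalized.
X = {
    "geneA": [8.1, 8.3, 7.9, 10.2, 10.5, 10.1],
    "geneB": [6.0, 6.2, 5.9, 6.1, 6.0, 6.3],
}
groups = ["control"] * 3 + ["case"] * 3  # one label per column


def welch_t(a, b):
    """Welch's t statistic for two groups of log2 expression values."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def differential_expression(matrix, labels, case="case", control="control"):
    """Return {transcript: (log2 fold change, t statistic)}."""
    results = {}
    for transcript, values in matrix.items():
        case_vals = [v for v, g in zip(values, labels) if g == case]
        ctrl_vals = [v for v, g in zip(values, labels) if g == control]
        log_fc = mean(case_vals) - mean(ctrl_vals)  # difference of log2 means
        results[transcript] = (log_fc, welch_t(case_vals, ctrl_vals))
    return results


results = differential_expression(X, groups)
for transcript, (log_fc, t_stat) in results.items():
    print(transcript, round(log_fc, 2), round(t_stat, 2))
```

With thousands of transcripts and few samples, the per-transcript statistics would also need a multiple-testing correction, which this sketch leaves out.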
There are 400 annotation packages, and there are books and tutorials available on all this. You just go to bioconductor.org and that'll point you to all the resources. Just another example piece of software is called GenePattern. This is from the Broad Institute, and it's much more of a point-and-click type of package that you can install. It's freely available, and it has many of the types of analyses that I mentioned to you. Another way in which we measure these biomarkers is high-density genotyping arrays. Now we're talking about DNA. The biology we're looking for is, one, single nucleotide polymorphisms, and the other is DNA copy number changes. There are arrays that can do genotyping for one million SNPs. Does anybody not know what a SNP is? Everyone knows what a SNP is. Okay, all right, good. If there's anything that I say that you don't know, put your hand up, because I'm assuming most people are in genetics or biology. The other thing these arrays now give us is that we can actually measure allele-specific copy number changes. In cancer, sometimes one allele is preferentially amplified or deleted, and it might be of use to know which allele. DNA copy number changes are a major source of human variation. And actually, Dr. Scherer, who's going to give us our talk tomorrow, played a major role in discovering that structural variants in DNA are a major source of human variation. Most of the copy number changes that we'll talk about in the context of this workshop are somatic alterations in tumor genomes. But there are congenital abnormalities, whereby people are born with these alterations in their genomes, and that has implications for mental retardation and autism. And again, I mentioned somatic alterations in cancer.
So this is the type of biology that can be measured with high-density genotyping arrays. Here's just an example of the data, and some example questions. So, which regions in the genome are recurrently altered in my cohort? Here I've just plotted the genome, and these are the chromosomes 1, 2, 3, all the way to X. This should be X here; it's labeled 1, but this is 1 through X. One can actually measure the frequency of alteration. This is a heat map representation of copy number changes. The red indicates amplification, so extra copies of DNA, and the green indicates loss. You can summarize across this patient cohort, where each row represents a patient. And you can see that chromosome arm 1q is recurrently amplified in almost 40 to 50% of the cases. This is a breast cancer data set, by the way. And we have 8q as well. These are known patterns, and actually what we're going to do in the lab is something just like this. We're going to process array CGH data, we're going to find out where the copy number changes are, and then we're going to look at the frequency of alteration in different subgroups of cancer. So the logical next question is: not only do we want to look at frequency across the whole population, but we want to see whether there are subgroups that can be discovered from the data. We're going to do that as well. So here is an example of what the data looks like for array CGH. Here is just chromosome 1, and this is an example from a mantle cell lymphoma cell line. On the x axis is the physical location along the chromosome, and on the y axis is the relative hybridization intensity of the DNA in a tumor sample versus a normal sample.
This is a log2 ratio, so basically what that means is that negative numbers mean there's probably a deletion of that region of the chromosome, and positive numbers mean there's probably an amplification. It's pretty hard to see on this slide, but over here I've shown what you get with SNP genotyping in addition to copy number change. There's a little copy number change here, but we can also see the allele-specific hybridization intensity, because both versions of the SNP are hybridized on these arrays. And you can see that there's a differential copy number change in the allele-specific sense. So, the analysis for high-density genotyping arrays. Normalization for the SNP arrays usually entails a few things. You're probing two alleles, and most of the probe sequence is identical except for one position, so sometimes what you get is what's called allele crosstalk: you get the wrong allele hybridizing, and you can try to correct for that. The DNA fragment length and the GC content of the probes also make quite a difference in the signal that one gets from these arrays. So it's important to do this step, and again, I'll point to some software that can do this for genotyping arrays. The next step is to do what's called segmentation. What that means is that you want to find the breakpoints. Here one would assume that there's a biological event that encompasses this deletion, and one wants to determine exactly where those breakpoints are. Then you can look within those breakpoints for what genes might be in there, for example. If there's a tumor suppressor gene in this deletion, then one has pretty high confidence that maybe that's a targeted deletion by this cancer, and that should be followed up.
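To make the idea of breakpoints in log2 ratios concrete, here is a deliberately naive sketch: each probe is labeled as loss, neutral, or gain by fixed thresholds, and consecutive probes with the same label are merged into a segment. The thresholds (plus or minus 0.3) and the toy data are assumptions for illustration only. This is not the state space approach used by real tools, which model noise explicitly rather than thresholding each probe.

```python
def call_state(log2_ratio, loss=-0.3, gain=0.3):
    """Label one probe by thresholding its tumor/normal log2 ratio."""
    if log2_ratio <= loss:
        return "loss"
    if log2_ratio >= gain:
        return "gain"
    return "neutral"


def segment(positions, log2_ratios, loss=-0.3, gain=0.3):
    """Merge consecutive probes with the same label.

    Returns a list of (start, end, state, mean_log2_ratio) tuples; the
    boundaries between neighbouring tuples are the inferred breakpoints.
    """
    segments = []
    for pos, ratio in zip(positions, log2_ratios):
        state = call_state(ratio, loss, gain)
        if segments and segments[-1]["state"] == state:
            segments[-1]["end"] = pos
            segments[-1]["values"].append(ratio)
        else:
            segments.append(
                {"start": pos, "end": pos, "state": state, "values": [ratio]}
            )
    return [
        (s["start"], s["end"], s["state"], sum(s["values"]) / len(s["values"]))
        for s in segments
    ]


# Toy probes along one chromosome (positions in arbitrary units).
positions = [1, 2, 3, 4, 5, 6, 7, 8, 9]
log2_ratios = [0.05, -0.02, 0.10, -0.80, -0.90, -0.75, 0.00, 0.04, 0.60]

segments = segment(positions, log2_ratios)
for seg in segments:
    print(seg)
```

On noisy real data a single outlier probe would split a segment here, which is one reason model-based segmentation is preferred in practice.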
In addition to determining breakpoints, one also wants to classify these segments as being, for example, unchanged, deleted, or gained. There are models, for example what are called state space models, that find breakpoints and classify the segments as well. So you start with this as input, which is just a group of noisy black dots, and one wants to classify biologically what these black dots actually mean. That's what I've done here. So here are just a few tools for high-density genotyping arrays. Again, Bioconductor has many, many packages. There's my own software called CNA-HMMer. Unfortunately, I haven't quite finished this yet, but I am compiling a whole list of references and resources that I'll put on the wiki for module one, so you can refer back to that. I'll post all the links to software and all the references that I've talked about in this module so far. For SNP array normalization, for example for Affymetrix SNP 6.0, there's a package in R called aroma.affymetrix, and this does all of the allele crosstalk correction, GC content normalization, et cetera. Then there are some other allele-specific copy number algorithms, one called QuantiSNP, the other PennCNV. One also wants to be able to genotype, and there are a number of algorithms for that: CRLMM, BRLMM, and Birdseed. Again, I'll point you to these references. For visualization, there's the Integrative Genomics Viewer from the Broad Institute and the SIGMA2 package from the BCCRC in Vancouver. So let's just talk a little bit about immunohistochemical staining. The biology that we're measuring here is protein levels, and also localization of proteins. The technology is based on having a labeled antibody binding to an antigen. And this can be done in relatively high numbers of cases on what we call tissue microarrays.
The limitation of this is that you have to have an antibody available for the protein that you want to quantify. So, some example questions. Is my protein of interest expressed in my sample? Which part of the cell does my protein of interest localize to? Is it the membrane? Is it the nucleus? How abundantly expressed is my protein? And immunohistochemistry, and again, others in the room can probably comment on this more than I can, can be used in diagnosis, so we can get a better feel for subtypes of cancer, in prognosis, and it can actually be predictive as well. So here's what the data look like. This is an immunohistochemical stain of a gene called beta-catenin. When there's a mutation in beta-catenin in ovarian cancer, the protein localizes to the nucleus. Hopefully what you can see there, and maybe it's difficult to see, are concentrated brown blobs in these slides. What that indicates is that this protein has localized to the nucleus. In a case that does not have the mutation, the staining is restricted to the membrane of the cell. And so this is used as a diagnostic for a particular subtype of ovarian cancer. Just a comment about this technology: it's fairly low throughput, but it's highly specific. You can't have your cake and eat it too. It works very well, but it's relatively low throughput. Anything else? Yeah, image analysis software. So that, I think, is not really widely employed. Usually you have human beings with expertise looking at these slides who can readily tell the outcomes of these stains. So here's just an example of the use of immunohistochemical markers to subtype ovarian cancer. This is a paper written by my colleague, Martin Köbel. Basically, these are the immunohistochemical markers here, and these are the subtypes of ovarian cancer.
And you can see that they have really wildly different expression levels. The Z axis here in this three-dimensional plot is the level of expression of these particular proteins. It's really quite striking that the subtypes have such different profiles in terms of protein expression of these markers. Whereas in the past, and actually even in the present, ovarian cancer is really treated as one disease, it's pretty clear that the subtypes have very different histological features and should be considered as different diseases. So this is just an application of immunohistochemistry to the clinical interpretation of ovarian cancer. Let's move on now to next-generation sequencing. The biology involved here is almost everything. These assays can get you so many different features of a genome, including single nucleotide variants. We're getting down to the resolution of nucleotides, where we can detect point mutations and insertions and deletions. Using paired-end sequencing, we can detect genome rearrangements. We can detect copy number changes, similar to array CGH and SNP arrays but at much more exquisite resolution. We can detect sequence inversions. In RNA-seq libraries, which is just sequencing the mRNA, we can detect transcript expression. And again, I already mentioned insertions and deletions, which are of relatively small length but can be readily detected with this technology. A colleague of mine and I are engaged in a project sequencing Hodgkin's lymphoma. Just using short reads, we sequenced a Hodgkin's lymphoma cell line, and using 50 base pair reads we were able to detect insertions and deletions of size 11 and 15, matching exactly what's reported in the literature for that cell line. So the resolution of these technologies is exquisite. The potential challenge is that it produces literally millions and millions of data points per case.
I was recently involved in a project in which we sequenced a breast cancer genome, and we literally produced 120 billion data points for one case. So it's a lot of data. Here are some example questions. What does an individual tumor, person, or animal actually look like at nucleotide resolution? This is a view of these tumors and these organisms that we weren't able to get before. I actually liken this to van Leeuwenhoek looking down the microscope for the first time and seeing that there are bugs on the slide, that there are bacteria down there. I think we're really in this kind of age of discovery, where we're seeing things that we've never been able to see before. So it's really quite exciting. What is the genome architecture of my sample? What single nucleotide variants exist in my sample? What transcripts are expressed, and at what quantity? What are the recurrent aberrations in my set of samples? And what pathways are potentially dysregulated by mutations? This latter question, again, is something that you may have visited last week in Gary's workshop. So this technology gives you in one assay what you'd otherwise have to do using multiple different technologies, and that's why it has great appeal. It's still quite expensive to sequence an entire tumor genome. Sequencing the transcriptome using RNA-seq is much more efficient, since you're just looking at the expressed sequence. But nonetheless, I think this is certainly the way of the future, and quite possibly the way of the present. So, just a brief overview of what the process is here, and this is just a schematic cartoon. When the data comes off the sequencer, we get these unaligned reads. Here I've shown paired-end reads. These little blue bars represent a read, and this can be, for example, 50 to 75 nucleotides long.
And they're paired in the sense that they're separated by what we call an insert sequence that we don't actually sequence, but we have a general idea of how far apart the reads are because the fragments are size-selected. How big are the inserts? Usually about 200 nucleotides, so we size-select for about 250 to 300. Yeah. Then we take the paired-end reads and we align them to the genome. The really important part about this is that you can't do it without the human genome reference in the first place. The Human Genome Project invested a huge amount of algorithmic and computational time in assembling longer fragments, and a major part of the project was to try to assemble it. You can't do assembly with short fragments, so you have to have a genome that's already there in order to align the fragments to. Okay, so once we have aligned reads, we can do some inference on the variants that are in the data. We can, for example, predict single nucleotide variants. Once we do that, certainly in tumor biology, we need to do some validation. We want to confirm whether a variant is somatic. What's required here is that the samples should be chosen very carefully, so that you have matched normal DNA with which to validate the mutations that you find. If you find a variant in the tumor, you want to be able to see if it's also present in the germline, for example in DNA from blood. So we can confirm some that are somatic and confirm some that are germline. Due to the noise, the short reads, and the misalignment that happens, we get false positives. So when you get predictions out of this data, one has to validate them using other techniques, like capillary-based sequencing or other sequencing, in order to confirm that they're true.
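The somatic versus germline comparison described above can be sketched as a lookup against the matched normal. This is a minimal illustration, not a variant caller: the variant tuples, coordinates, and the minimum normal depth of 10 are hypothetical, and any call coming out of such a comparison would still need orthogonal validation as the lecture stresses.

```python
def classify_variant(variant, normal_calls, normal_depth, min_depth=10):
    """Classify a tumor variant using calls and coverage from matched normal DNA.

    variant      -- (chromosome, position, ref_base, alt_base)
    normal_calls -- set of variants also called in the matched normal
    normal_depth -- {(chromosome, position): read depth in the normal}
    min_depth    -- assumed minimum normal depth needed to trust an absence
    """
    if variant in normal_calls:
        return "germline"
    site = variant[:2]
    if normal_depth.get(site, 0) < min_depth:
        # Too few normal reads to say the variant is really absent there.
        return "undetermined"
    return "somatic candidate"


# Hypothetical calls; the coordinates are made up for illustration.
tumor_calls = [
    ("chr3", 1000, "C", "G"),
    ("chr7", 2500, "A", "T"),
    ("chr17", 4200, "G", "A"),
]
normal_calls = {("chr7", 2500, "A", "T")}
normal_depth = {("chr3", 1000): 35, ("chr7", 2500): 40, ("chr17", 4200): 3}

labels = {v: classify_variant(v, normal_calls, normal_depth) for v in tumor_calls}
for variant, label in labels.items():
    print(variant, label)
```

The "undetermined" label reflects the point that a variant missing from the normal is only informative when the normal was sequenced deeply enough at that position.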
So validation is critically important. And then, like I said, what we did in the FOXL2 case in granulosa cell tumors was to take a handful of the variants and characterize them for recurrence and functional significance in larger cohorts. So here's what the data actually looks like. Here I've just shown the sequence reads as they would be aligned to the genome, cut off at a certain position. You have the reads like this, where the dashed lines just indicate the end of the sequence, and they're piled up on top of each other. What one can do is represent this alignment as a set of what I call allele counts. Here you have a vector that shows how many times you have a read that matches the reference sequence. Again, this is the reference sequence, the human genome hg18 or NCBI 36, which you can download from UCSC or Ensembl or your favorite genome browser. Then you can literally just do a counting exercise: how many reads match my reference? Here you have three reads that match the reference and zero that don't. And here's one that maybe looks like a variant: you have six reads, which I've highlighted, that don't match your reference, and only one that matches. Okay, so here you potentially have a variant. And here's another type of variant. I've done a lot of work, and some of my colleagues have done a lot of work, in modeling these data. Yeah. So that's a very good question. Theoretically, let's say for a heterozygous position, you want to see both alleles, with two reads representing each allele.
By a theoretical calculation using what's called the binomial distribution, one needs a depth of 11 in order to see two alleles 99% of the time. Now, by what's called a Poisson distribution in terms of coverage, in order to get a depth of at least 11 at 99% of the positions in the genome, one would have to sequence to about 27-fold coverage of the genome. So that's the theoretical argument for a normal genome. Everything goes haywire when you're dealing with tumor genomes, because you get copy number alterations and you're dealing with sample heterogeneity as well. Sometimes you get different clonal populations of cells, and you have normal cells mixed in as well, so your signal gets diluted. Yeah. Yeah, so that can be done. What we did for this breast cancer case that I mentioned a little while ago is we went to 43x, which, by our initial calculations, is expensive. And actually, what we showed in our study is that if you sequence the transcriptome as well, you can do moderate genome coverage, and combined with the transcriptome you cover most of the exons that are expressed adequately enough. So the genome is incredibly expensive to sequence, but as I mentioned, the transcriptome is much more efficient. You can get away with hundreds of millions of reads rather than billions of reads with the transcriptome. Yeah, yeah. Correct. Yeah. So it's still an expensive proposition. I think it's reasonable to say that, given today's throughput, sequencing a genome to 40x coverage is still over 100,000 dollars. Okay, so an incredible amount of activity in software development and analytical tools has sprung up as a result of next-gen sequencing. Here's just a list of tools for alignment, for detecting single nucleotide variants, for insertions and deletions, copy number changes, and expression.
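The depth and coverage argument above can be checked with a few lines of arithmetic. This sketch rests on stated assumptions: at a heterozygous site each read shows the non-reference allele with probability 0.5, "seeing" an allele means at least two reads carry it, and depth across the genome follows a Poisson distribution. Under that reading a depth of 11 is the smallest that reaches 99%. Requiring two reads of each allele at once is slightly stricter and gives about 98.8% at depth 11. The function names are hypothetical.

```python
from math import comb, exp, factorial


def prob_at_least_k_variant_reads(depth, k=2, p=0.5):
    """P(at least k of `depth` reads carry the variant allele), binomial model."""
    prob_fewer = sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i) for i in range(k)
    )
    return 1.0 - prob_fewer


def min_depth(target=0.99, k=2, p=0.5):
    """Smallest depth at which the variant allele is seen k times with P >= target."""
    depth = k
    while prob_at_least_k_variant_reads(depth, k, p) < target:
        depth += 1
    return depth


def fraction_covered(mean_coverage, min_reads):
    """Fraction of positions with depth >= min_reads, Poisson coverage model."""
    prob_below = sum(
        exp(-mean_coverage) * mean_coverage**i / factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - prob_below


required_depth = min_depth()
covered_at_27x = fraction_covered(27, required_depth)
print(required_depth, covered_at_27x)
```

The simple Poisson model ignores GC bias, mappability, and tumor-specific effects such as copy number change and normal cell contamination, so real projects tend to need more coverage than this idealized calculation suggests.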
And of course, there's a workshop devoted to this. If some of you are already registered, then you're good to go. If not, you can register for next year's, led by Francis and colleagues. Okay, so just a point about validation. High-throughput measurement technologies are noisy. We know that. So predictions must be validated using lower-throughput but more accurate experimental assays. That's just a statement that is true, and you need to do it. Okay, so here's an example. In this breast cancer genome that I keep referring to, we found an amplification of the insulin receptor amplicon by doing copy number analysis. I hope you can see this; maybe you can. This is a fluorescence in situ hybridization validation of the amplicon. You can see that there are many more red dots here than green dots. The red dots are the insulin receptor probe and the green dots are the control. This is in the metastatic tumor, and we also found the amplicon in the primary tumor, so this was selected for in the evolution of this tumor. Then here's an example of a very high-level amplification of the MAP2K3 locus. Again, this is validated using FISH. This is a case that was not present in the primary tumor, and so this was what we'd call a progression event. So this is just to give you a pictorial example that you can make predictions. These two happened to be true. Of the single nucleotide variants that we found in this tumor, about 30% turned out to be false, meaning they were not reproducible by other assays. So with high-throughput measurement technology, one needs to validate. So let's just go over things then. We talked about what gene expression is. We talked about DNA copy number and allelic variation, we talked about protein quantitation, and we talked about next-generation sequencing.
The lab this afternoon focuses on gene expression and copy number, and then tomorrow we're going to link that to clinical data. Just a last couple of points: the measurement technologies are getting denser, which means more data, and again, validation is critical for drawing conclusions from these assays. Okay. So, any questions on this material? Yeah. So people are working on that. People call it exome sequencing or exon capture, whereby through hybridization you probe for exons, for example. Then you can extract the DNA from those hybridizations, make a library out of that, and sequence it. My understanding is that that's still under technology development. It's starting to work, but the quantities of DNA needed are very high, for one, so that's actually a severe limitation. It's still technology that's maturing. But Francis might have a perspective on that. Yeah, so another example of that: for this tumor we had a matched primary tumor that we could look at. We took all the variants that we discovered and validated in the metastatic tumor, designed PCR primers across those variants, amplified those up in the primary tumor, made a library out of that, and put it back on the Illumina machine. So we discovered the variants in the metastatic tumor, and then we sequenced very deeply in the primary tumor to see at what frequency, or in what proportion of cells, those mutations were already present. You can get up to 10,000- or 100,000-fold redundancy in sequencing in that targeted way. Does that make sense? Yeah. So we were actually able to characterize the evolution of the tumor in terms of these somatic variants by looking at mutations that just weren't present at all, even in 100,000 copies, some that were present in between 1 and 2% of cells, and then the rest that were above that threshold as well. So yeah.
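The targeted deep sequencing idea above amounts to counting variant reads at each site in the primary tumor and grouping sites by how common the variant is. The sketch below is only an illustration: the read counts are invented, the bins are expressed as read fractions rather than cell fractions (converting between the two depends on zygosity and copy number), and the noise floor of 0.1% standing in for sequencing error is an assumption.

```python
def variant_read_fraction(variant_reads, total_reads):
    """Fraction of reads at a site that carry the variant allele."""
    return variant_reads / total_reads if total_reads else 0.0


def prevalence_bin(variant_reads, total_reads, noise_floor=0.001, rare_max=0.02):
    """Group a site by how common the variant is in the deeply sequenced sample.

    noise_floor -- assumed fraction below which reads are treated as error
    rare_max    -- assumed upper bound of the 'rare' group (2% of reads)
    """
    fraction = variant_read_fraction(variant_reads, total_reads)
    if fraction < noise_floor:
        return "not detected"
    if fraction <= rare_max:
        return "rare"
    return "prevalent"


# Hypothetical counts at three sites sequenced to roughly 100,000-fold depth.
sites = {
    "variant_1": (0, 100_000),
    "variant_2": (1_500, 100_000),
    "variant_3": (42_000, 100_000),
}

bins = {name: prevalence_bin(v, n) for name, (v, n) in sites.items()}
for name, label in bins.items():
    print(name, label)
```

Variants in the "not detected" group are candidates for having arisen after the primary tumor, while the "prevalent" group was likely already established in it.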
Could you use a normal microarray technique in which you just put your targets there, and then count them to see the number of sequences that are normal, converting the color signal into a number of sequences? I think what people are doing for that purpose, for what you're talking about, is barcoding. You can attach a barcode that identifies the sample, so when you sequence it, you can pull out the reads that have that barcode as belonging to one sample and separate them out that way. That technology, the barcoding and multiplexing technique, is actually just coming online now. Yeah. Okay, so I think we're ready to switch gears now, and I'll invite Anna to come up and talk about alternative splicing.
Okay, so as I mentioned before, for the last few years I've been focusing on the research of splicing aberrations in cancer. When we started this program in 2004, back at the National Laboratory, nobody really knew what the situation with splicing was overall, on the global scale in cancer, how to measure it, or how to interpret the results that come out of microarray experiments measuring the levels of splice isoforms. Nor was there a clear understanding of the value of this knowledge for cancer research in general, and clinical applications in particular. Back then, we started a collaboration with Affymetrix, and we tried to use their research platform, the Human Exon Array. We've done quite a lot of research using that platform, and we've made great progress, having gone through a lot of obstacles over those years. By now there has been a great number of publications devoted to the microarray analysis of splicing in cancer, particularly with the introduction of some new statistical methods and so on and so forth.
So it is pretty much clear right now how to do this for cancer in particular, and how to measure the global profiles of splicing. But at the same time, splicing is still very much underappreciated, and it's pretty much neglected, especially within the clinical applications field. I think that the first slide that Sohrab showed you, with the examples of different biomarkers, actually speaks for itself, right? There was no mention of biomarkers that came from splicing isoforms. And that's actually the general trend. For some reason, splicing is still perceived as such a weird thing that it can basically be neglected, whereas it is really a very important layer of information and a layer of opportunity for the discovery of novel biomarkers. There's been a precedent of using splice isoforms for clinical applications. For example, in Alzheimer's disease, the ratio of a certain enzyme's splice isoforms is used to predict the treatment outcome of this disease. There are other examples where antibodies against the alternative region in protein isoforms have been used for improving diagnosis, and that happened in human glioblastomas. Then there's another example of how we can manipulate splicing and control the behavior of cells. That precedent took place in one study published for pancreatic cancer, where they manipulated the level of expression of one of the regulators of splicing, and they showed that the cellular phenotype changed and basically reverted to a more normal phenotype. So there is an endless number of opportunities here within splicing. I know I've become a preacher of this, and I'm trying to bring this layer of information into clinical and cancer research. Within the following 20 minutes or so, I'm going to convince you that this is really something to go for.
This slide is just to tell you that indeed the majority, if not all, of the human genes undergo alternative splicing in a time- and condition-dependent manner. Some recent deep sequencing studies have shown that probably about 98% of all human genes undergo splicing. Of course, this is a very tightly regulated process with many factors involved, and so of course it is an excellent target for disruption. Alternative splicing has indeed been shown to be implicated in many human diseases, including cancer. This slide just shows you how complex the process of splicing of pre-mRNA is. It is undertaken by a huge protein complex comprised of some 200 proteins. You can imagine that there are different exons in the gene. Some of them are constitutive exons, which are included in all of the transcript isoforms of that gene, and there are alternative exons that are included or excluded or spliced in some other way in a particular tissue or stage of development. This is the very simple case of a cassette exon, where you have just one exon that is either included or excluded. For this very simple situation, you see that there are flanking splice sites around this alternative exon, and some other regulatory signals, which are this polypyrimidine tract and this branch point. These are all regulatory sites, which are called cis regulators. The components of the spliceosome, that's what it's called, are trans regulators. These are both RNAs and proteins, which regulate splicing and do it in a very precise manner. These sites are actually spread throughout the genome, and there are a number of cryptic splice sites. It is very important to recognize the true splice sites and not the cryptic ones, and that's what the spliceosome machinery does here.
You can certainly see that all aspects of this spliceosome machinery can be targeted by mutations or other types of aberrations and can be disrupted. And that's what happens in cancer in particular. You may have mutations in cis regulators, in those regulatory sequences, or you can alter the components of the splicing machinery, such as knocking out one of the splicing regulatory proteins. This results in aberrations of splicing, and it can have a massive effect. For example, if you change the level of one of the splice factors that targets many genes, you may have a real chain reaction and a change in splicing of many, many downstream genes. Of course, this leads to the deregulation of many cellular processes, which leads to the different outcomes that are characteristic of cancer. There's been a growing body of evidence in the literature that many cancer genes actually have cancer-specific isoforms. You can find very simple cases, such as a single cassette exon, where two isoforms are produced, one excluding and one including the alternative exon. These are just a few examples of such events taking place in cancer cells. For FGFR1, for instance, there is one exon that is excluded in cancer, and it's related to poor prognosis. For the WISP1 gene, there is a cancer-specific short isoform excluding the alternative exon, which has different biological properties. Then there may be situations where, in cancer, you have inclusion of an alternative exon. The classical example of this is the BCL-X gene, where there are two isoforms: the short isoform is normal and it's pro-apoptotic, and the long isoform, including the alternative exon, is anti-apoptotic and correlates with resistance to chemotherapy. There's a whole number of different types of splicing events.
This is probably the simplest, and this is probably the most complex, where there is a tandem of alternative exons someplace in the middle of the gene. And these exons can be spliced in different combinations and in different numbers. And that's what happens in the case of the tenascin C gene, in which there is a big chunk of alternative region included. And basically the antibody against this region has been used to improve the diagnosis of glioblastomas. Another example is the CD44 gene, which is notorious for its complex splicing. It has 10 internal alternative exons. And as I said, they are spliced in different numbers and in different combinations. And there was a precedent of using an antibody against a specific splice variant, including variant exon six, to treat head and neck carcinoma. Unfortunately, it didn't pass the phase one trial because of high toxicity. But still there is a precedent for using splice-isoform-specific treatments. So as I mentioned before, back at LBNL we've been studying the splicing patterns in breast cancer, and we had an excellent model system comprised of a panel of breast cancer cell lines, which really recapitulate the aberrations, both on the genomic and the expression level, that take place in tumors. And so the breast cancer cell lines' expression patterns show distinct clusters. And those clusters actually correspond to basal B, basal A and luminal cells, which are cells of slightly different phenotype, with basal B cells being the most aggressive, stem-cell-like cells. And so when we tried to detect the splicing repertoire of the CD44 gene, which is actually a stem cell marker, within those cells, we could clearly see the agreement between the RT-PCR and our predictions from microarrays, and we saw a completely different pattern of splicing within basal B's versus basal A's and luminal cell lines.
What was really attractive about this gene, basically, is that this is a transmembrane protein, and the alternative region happens to be located on the cell surface, outside of the cell. So it is very much possible to use it as both a diagnostic and a therapeutic target, using splice-isoform-specific antibodies, for example. So how is it possible to explore splicing on a global scale, on the whole-genome, whole-transcriptome scale? And the answer is using microarray platforms and, more so, using whole transcriptome shotgun sequencing, or RNA-seq, technologies. So I will just briefly mention the microarray platforms that are around for this type of study. There are two types of platforms that can be used for interrogation of splicing. Both of them are sub-gene level. One type is exon-level based and the other is junction based. So there are advantages and disadvantages to both platforms. So for example, this platform by Affymetrix, the Human Exon chip 1.0 ST, which was introduced a few years ago, covers a great chunk of the human transcriptome, both known and predicted. And mostly the probes are targeting exonic regions. And so with regard to the coverage of the human transcriptome, this is a really very valuable tool for discovery of novel splice events that have not been detected anywhere else before. And that is the advantage of this platform. The disadvantage of this platform is a great deal of noise, because of the really dense and really huge content that is present on the array. And moreover, it is really a challenge to reconstruct the splice isoform structures, because you are basically measuring the exon-level intensities, or exon-level expression, for every gene in the human genome. But at the same time, you don't know which exon is connected to which exon. And this is a little bit of a challenge here.
But I still believe that this is a very good discovery tool, because it just gives you an opportunity to explore any combination of any exons that might exist out there. The other type is the exon junction based platforms. And those actually probe both exonic regions and junction regions. And so they give you a much clearer signal, much less noise, and they give you more information as to what particular transcript structure you are dealing with in a particular sample. But at the same time, the design is limited to splice events that are already known or have been observed in some sources. So I think this is a little bit of a downside of this platform. Do you have any questions on this part? No? So you're saying that the junction arrays don't take into account a junction between exon 1 and exon 4? If it hasn't been known before, then no. So I would say that this is more of a discovery tool, and this is more of a validation tool. Are there data sets that are available? Well, I think there are for the mouse. And the data set for the breast cancer cell lines is going to be available pretty soon. So as I mentioned, a number of statistical approaches have been published to explore and to infer splicing patterns from microarray data. And this is an excellent review that gives you the state of affairs in this field. I'm not going to focus on different methodologies right now. I'm just going to briefly mention the method that we developed back at LBNL, where we participated in the TCGA project, profiling multiple cancer types with sub-gene-level platforms. And basically, this approach allows you to see differential exon expression rather than differential gene expression. So just to give you an overview of the whole pipeline of this computational approach: you measure the exon-level expression and then you derive from it the gene-level expression.
And then you apply some sort of filtering, which was really essential, as our experience showed. Then you apply two separate detection algorithms, the splicing index and FIRMA, in combination. And that gives you a list of highly likely candidates for differential splicing in this case. So what it shows you here is the expression profile of a single gene across different samples. Along the x axis, you see the different probe sets. And this is the expression level. And the same layout is used for this heat map. But this one is a completely different measurement that comes out of FIRMA. And I can give you the reference if you're interested in this. So basically, what it shows you here is that the expression of the entire gene is pretty tight. When you correct for the overall gene expression differences, you see that all the probes actually follow the same trend within the main part of the gene. But within some narrow region of the gene, you see a great variance in expression of particular probe sets that probe particular junctions or exons. And that is also reflected in this method too. And so these are the candidates for differential exon expression. So in this case, we measured the expression of two isoforms, including and excluding a single cassette exon. So this is just one of the methods. And if you look at the literature, it's been a bit of a challenge to infer splicing using microarray data because of the high level of noise, and the validation rates of different methods range from some 30% up to 80%. And with this method, we were able to achieve a validation success rate of over 80%. So we've been really happy with it. So splicing, as I said, is a very tightly regulated process. And the good thing about it is that we see not only tumor-specific splice signatures, but also tumor-subtype-specific splicing signatures. And this is a picture from our study of breast cancer cell lines.
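The exon-level step described above (measure probe set signal, derive a gene-level value, correct each probe set for it, then compare groups) can be sketched in a few lines. This is only a simplified splicing index under stated assumptions: the gene-level value is taken as the plain mean of the gene's probe sets, the data are toy log2 intensities, and the function name `splicing_index` is made up for illustration. It is not the speaker's pipeline and it does not implement FIRMA or the filtering step.

```python
from statistics import mean

def splicing_index(exon_log2, group_a, group_b):
    """Per probe set, difference in gene-normalized log2 signal between groups.

    exon_log2: dict mapping probe set name -> list of log2 intensities,
               one value per sample, all belonging to a single gene.
    group_a, group_b: lists of sample indices (hypothetical grouping).
    """
    n_samples = len(next(iter(exon_log2.values())))
    # Assumption: gene-level signal per sample is the mean over probe sets.
    gene_level = [mean(vals[i] for vals in exon_log2.values())
                  for i in range(n_samples)]
    index = {}
    for probe_set, vals in exon_log2.items():
        # Correct each probe set for the overall expression of the gene.
        normalized = [v - g for v, g in zip(vals, gene_level)]
        index[probe_set] = (mean(normalized[i] for i in group_a)
                            - mean(normalized[i] for i in group_b))
    return index

# Toy gene with three probe sets; exon "e2" is skipped in samples 2 and 3.
toy = {
    "e1": [8.0, 8.0, 8.0, 8.0],
    "e2": [8.0, 8.0, 5.0, 5.0],
    "e3": [8.0, 8.0, 8.0, 8.0],
}
si = splicing_index(toy, group_a=[0, 1], group_b=[2, 3])
print(max(si, key=lambda k: abs(si[k])))  # probe set with the largest shift
```

In the toy data the gene-level value drops in the two samples that skip the exon, so the constitutive probe sets shift slightly in the opposite direction while the cassette exon shows the largest difference.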
And you can see that the three major subclasses of breast cancer show specific splicing patterns. And as I said, these are excellent candidates for novel biomarkers. This comes from a study of breast cancer using the breast cancer cell line panel. This is the junction array? This is the junction array, yeah. And so just to give you a little bit of biology here with regard to splicing and transcription regulation. It's been noted several times in the literature that transcription regulation and splicing regulation basically act in parallel, meaning that in the same cells, transcription regulation and splicing regulation may target, say, the same pathways, but through totally different genes. So these are two parallel mechanisms of regulation of cell processes. And that's what we were able to see in the breast cancer study. You see the enrichment of certain pathways with either alternatively spliced genes, in red, or differentially expressed genes. And you see that the pathways enriched with these two groups of genes are totally different. And there was virtually zero overlap between those groups of genes. So that was very comforting to see in this case. And this is not only a question of fundamental interest, but it can also guide your efforts with regard to the search for novel biomarkers. So for example, if you are interested in these pathways, you would rather use platforms that allow you to measure the overall gene expression differences. And on the contrary, if you are interested in these pathways, you would rather focus on splicing than on differential gene expression. And I didn't mention it before, but if you are using just the platforms that measure overall gene expression differences, you basically risk overlooking the alternative splice events, and you miss a great chunk of information that can be used for development of novel biomarkers. So what is it?
Well, the strategy that I would take is that I would look at both, because very often it happens that you observe both a splicing change and a change in overall expression. But you can use a single platform for both purposes. So as I mentioned before, in that pipeline, you can use the sub-gene-level platform. And you would get an exon- or junction-level signal. And you can basically derive the overall gene expression levels from that platform by averaging across all the probe sets of a particular gene. And so you just work with the data from the same platform. But I would concentrate on these two different analyses separately. So I would look for the differentially expressed genes. This is one part. And I would concentrate on the splicing differences. This is another part. Because, you understand, it's pretty straightforward: you look for the differentially expressed genes and there you go, right? But at the same time, if you don't see differential expression, you tend to throw those genes away. But that's not necessarily the right thing to do. Because if you look into the splicing patterns of the genes that are more or less constant in expression over your samples, you may discover something really new and important. Yeah. So just to tell you a little bit more: what is the importance and value of alternative splicing in clinical applications? So imagine that you have transcript isoforms coming out of this gene with a cassette exon right there. And there are two isoforms, excluding and including the alternative exon. And then two protein isoforms are translated from these products. And you can imagine having an antibody against the common part of those proteins. And then you are able to pull out all isoforms of that gene, all protein isoforms. But then you can imagine raising an antibody against the alternative region. And in that case, you would be targeting a specific splice isoform.
And as I mentioned, there are isoforms that are cancer promoting. And there are isoforms that are characteristic of the normal state of cells. And you would rather target the ones that are the culprit here. And so this whole arsenal can be used in clinical applications for both diagnostic and therapeutic purposes. So hopefully with these few slides, I have convinced you that this is really important. Now I can take questions. Oh, yeah, you can. Yeah. So as was mentioned, with this new technology it's basically possible to do all sorts of studies with regard to the whole genome and whole transcriptome, including splicing as well. Yeah. There are going to be different approaches there with regard to the detection, of course. Yeah, exactly. So basically, when you deal with read count information, you derive read counts, right? And this is after mapping onto some reference. So your reference can be genomic, as mentioned, right? And your reference can also be a collection of different junctions. So for example, you can create in silico a database of all possible junctions, or of the junctions that you're interested in. And you can map your sequencing data onto that collection. And so that's how you profile the expression level of those junctions. And at this point, it reduces to the same set of analyses that was here with microarrays. So the important thing is, for what initially people do, they take the learning and speaking data into mind, you know, and the problem there is that, you know, you're reading all the prospects on them. I think this is actually another reason for optimism, that this is another type of feature that hasn't traditionally been probed. And so this is, it's not new biology, but it's probably new ways of looking at cases that we already have interest in.
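The in silico junction database mentioned in the answer above can be illustrated with a toy example. This is a rough sketch under stated assumptions: exon sequences and reads are invented, the functions `junction_library` and `count_junction_reads` are hypothetical names, and exact substring matching stands in for a real read aligner, which a real analysis would use instead.

```python
from itertools import combinations

def junction_library(exons, flank=4):
    """Build every upstream-downstream exon junction sequence.

    exons: list of (name, sequence) tuples in genomic order.
    flank: bases taken from each side of the junction (toy value).
    """
    library = {}
    for (up_name, up_seq), (down_name, down_seq) in combinations(exons, 2):
        # End of the upstream exon joined to the start of the downstream exon.
        library[f"{up_name}-{down_name}"] = up_seq[-flank:] + down_seq[:flank]
    return library

def count_junction_reads(reads, library):
    """Count reads containing each junction sequence (exact match only)."""
    counts = {name: 0 for name in library}
    for read in reads:
        for name, junction_seq in library.items():
            if junction_seq in read:
                counts[name] += 1
    return counts

# Invented three-exon gene and four invented reads.
exons = [("e1", "AAAACCCC"), ("e2", "GGGGTTTT"), ("e3", "ACGTACGT")]
reads = ["ACCCCGGGGT", "CCCCACGTA", "CCCCACGTAC", "GGGG"]
library = junction_library(exons)
counts = count_junction_reads(reads, library)
print(counts)  # {'e1-e2': 1, 'e1-e3': 2, 'e2-e3': 0}
```

Here the e1-e3 junction, which skips the middle exon, collects the most reads, which is the kind of junction-level count that then feeds the same downstream analysis as the array data.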
And so this is potentially a very, very lucrative avenue for biomarker discovery. So thank you for taking us through that. Okay, so not lunch yet. Here we go. Okay, so how many people did their homework? Did you read this? How many people read the paper? Okay, so a few that didn't. Okay, that's okay. If you can get a chance to look at it during lunch, then that would be useful for you, I think. And certainly during the lab, you know, you could take a little bit of time to look at the paper itself. So this is a paper published by Joe Gray's group in Cancer Cell in 2006. And basically what I want to do in this remaining half hour is give you an overview of the study. What were the goals and the biological questions? Then introduce the breast cancer expression subtypes, introduce the data types and data sets that were generated in the study, and then talk about how those are related to clinical outcome. Okay, so just some background: when we were designing this course, I had this idea that it would be good to take a model paper of some kind that really did integrate clinical data and genomic data, and one that would allow us to illustrate the concepts that we wanted to illustrate. So this paper was published at the end of 2006, and it already has 206 citations. So it has reasonably high impact. People are reading it and citing it. The really nice feature of this paper is that it contains the concepts that are nearly always encountered in large-scale clinical genomic studies. A really nice bonus feature is that it has an integrated analysis of copy number and expression. So a hallmark of tumors is copy number alteration. It's not always associated with expression, but copy number is actually more often than not associated with the expression of the genes that are contained. And this paper kind of describes how they navigate through that landscape.
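One simple way to ask whether a gene's copy number is associated with its expression across tumors is a per-gene correlation. The sketch below is an illustration only: the values are invented, the helper `pearson` is written out by hand, and the paper's actual integration method is not described in this talk, so nothing here should be read as what the authors did.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Invented values for one gene across four tumors:
# copy number as log2 ratio, expression as log2 intensity.
copy_number = [0.0, 0.1, 1.5, 2.0]
expression = [5.0, 5.2, 8.0, 9.1]
r = pearson(copy_number, expression)
print(round(r, 2))  # close to 1 for this amplification-driven toy gene
```

A gene whose expression tracks its copy number this closely would be a candidate for being driven by the amplification; a gene with high copy number but flat expression would not.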
A very important part of this is that the data and the clinical phenotypes were publicly and freely available. This is rare. This is hard to find. And so this is a beautiful data set for that. So there are some limitations to this paper. Of course, the quintessential paper probably doesn't exist. One is that the data were generated on older and, I would say, obsolete platforms. But that doesn't mean that the concepts aren't similar. The concepts are the same. It's just that the data you may be generating today will probably be denser and higher dimensional than the data sets in this lab. And nonetheless, I think it's worthwhile pursuing this. So the goals of the study were, first, to identify genomic events in breast cancer that can be used to better stratify patients according to clinical behavior. The second goal was to develop insights into how molecular aberrations contribute to breast cancer pathogenesis. And finally, it was to discover genes that may be therapeutic targets in patients that do not respond well to current therapies. So any questions before we jump into this? Okay, so here is a figure showing the expression subtypes in breast cancer. So here what you have is samples of patients clustered along the top here, and you have genes ordered along the side here. This is from the supplementary data. I don't know if anyone was really keen and went to the supplementary data, but this is where this is from. Okay, and so what I wanted to point out here is that there are four breast cancer subtypes shown here, the fifth being normal-like, so that's excluded here. But here we have the luminal A subtype. Here's the luminal B subtype. Here's the basal subtype. And then here's the ERBB2 subtype. And the important thing to note is that these genes here, in the luminal A's, for these particular cases over here, these genes are highly expressed.
And over here, sorry, these are lowly expressed and they're highly expressed over here. Okay. Here you have a pattern of low expression of these genes in these patients here. And then finally, you have a big block over here, where these genes are highly expressed in the basal subtype and not in the rest. And finally, for ERBB2, it's much more limited. So here you have many, many features that contribute to the differences between cases. For this one, it's actually a relatively small number of features, a small number of genes, that contribute to the difference. And actually, in reality, it's probably just one, which is ERBB2 itself, or HER2 itself. So one of the things that we're going to do in the lab is use this package called genefilter. By the way, these annotations are not on the original slide, so you can scribble these in if you want. So we're going to figure out how to take a list of 22,000 features in our original data set, and we're going to learn how to collapse those down to about 100 or fewer, using the genefilter package. And then we're going to try to reproduce this plot using hierarchical clustering after feature selection. And this is with the cluster package. And these are all packages in R and Bioconductor. So the other thing I wanted to mention here is that what's also annotated here is the ER status. So this is estrogen receptor, and either ER positive or negative. So one of the things that is really nice about the study is that it has this matched data set for copy number. Okay, so what the investigators did is first subclassify all the patients according to these gene expression subtypes, so basal, ERBB2, luminal A, luminal B. And then based on that, they were able to look within those subtypes and see if there were copy number patterns that were potentially contributing to those subtypes. So here, this is a figure from the paper. This is looking at all cases.
And it's looking at the frequency of gain, or amplification, which goes above this line, and the frequency of loss, or deletion, which goes below the line. So this gives you kind of an overall portrait of the cohort. It gives you a kind of summary view of what this population looks like from a copy number perspective. And this is very similar to the plot I showed earlier in the day, based on our larger cohort. So this is certainly a reproducible frequency plot that is kind of accepted now in the breast cancer research community. So the other thing to note here is that what they've done is plot here the high-level amplifications. So there are some genes, like ERBB2, that are targeted by many, many copy number amplifications. So you get a huge amplification. And so here is a frequency plot of high-level amplifications. And essentially, they're well known and well characterized; this is ERBB2 here. Okay. All right, so this is, again, the whole population. We take the subtypes, we look at the basal subtype. We can then plot the frequency diagram for the basal subtype. And it does have some slightly different characteristics. And in particular, some of the high-level copy number alterations are more specific to basal than others. Here you have the ERBB2 subtype, and what's striking is that, of course, well, here's your ERBB2 and it's present in almost 80% of those cases. Okay, and then the rest of the high-level alterations are relatively low frequency. And then you have the luminal A and B. And so what we will do is actually, we're going to take this array CGH data, we're going to load it in, we're going to analyze it for copy number changes. And we're going to use the phenotype database that we have to subgroup the cases, and we're going to plot these frequency diagrams. So we're going to try to reproduce this figure in the lab.
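The frequency diagram described here boils down to counting, at each genomic position, the fraction of cases with a gain and the fraction with a loss. A minimal sketch follows, assuming per-sample log2 ratios and arbitrary cutoffs of +0.3 and -0.3; those thresholds, the toy values, and the function name `alteration_frequency` are illustrative assumptions, not the paper's or the lab's settings. The lab itself uses R and Bioconductor; Python is used here only for illustration.

```python
def alteration_frequency(log2_ratios, gain=0.3, loss=-0.3):
    """Fraction of samples with gain or loss at each position.

    log2_ratios: list of samples, each a list of log2 ratios, one per
                 genomic position (same order in every sample).
    Returns (gain_freq, loss_freq); loss is negated so that it plots
    below the zero line, as in the frequency diagram.
    """
    n_samples = len(log2_ratios)
    n_positions = len(log2_ratios[0])
    gain_freq = [sum(s[i] > gain for s in log2_ratios) / n_samples
                 for i in range(n_positions)]
    loss_freq = [-sum(s[i] < loss for s in log2_ratios) / n_samples
                 for i in range(n_positions)]
    return gain_freq, loss_freq

# Four invented tumors measured at three positions.
cohort = [
    [0.5, 0.0, -0.6],
    [0.4, 0.1, -0.5],
    [0.0, 0.0, 0.0],
    [0.9, -0.4, 0.1],
]
gain_freq, loss_freq = alteration_frequency(cohort)
print(gain_freq)  # [0.75, 0.0, 0.0]
```

To get a per-subtype diagram, the same function can be applied to just the samples of one expression subtype, which is the subgrouping step described above.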
So the other important thing that they did in this paper was to perform unsupervised clustering based on the copy number. And essentially what they found is that the samples fall into three groups, the first of which they define as 1q/16q. Then you have this kind of amplifying group, which contains a lot of samples with high-level amplifications. So what this figure shows is, in green, and there's a lot of color swapping that happens, unfortunately; sometimes in copy number, green means loss. In this case, it means gain. So here you have gains of 1q, and then in this group here you have these yellow dots, and what the yellow dots mean is high-level amplifications. So these are kind of extra annotations put onto the plots to show you where the high-level amplifications are. And so we're going to use a package that can actually reproduce this plot as well. And so you'll learn how to superimpose the high-level amplifications for visual analysis. And then also we're going to cluster these samples as well, using the aCGH package. And so again, they fall into these three groups. And one thing just to watch out for in the lab is, you're going to get a result, and I want you to try to think about how closely it matches this result. It's the same data. It's a similar analysis. But just be aware that what you're going to get might look different. It might look the same. But just be cognizant of the fact that we want to try to compare the results. Okay. So that's that section of the paper. Any questions so far? So does everyone understand this plot? So here we have basically the chromosomal position here. And this is chromosome one, two, three, four, five, et cetera, to X. Okay.
And the tools that you'll use will actually plot the data like this. So you're going to have to either turn your paper or turn your computer or turn your head. One of the three. Okay, so just be a kind of application. Yeah, that's right. Yeah, a lot of gain. So one, you have several lines. That's right. So here you have the patients on the top here. Okay. And so all these patients have a gain of 1q. Okay. So everyone understand that? And then here, this group has chromosome 17 gain, and here's high-level amplification of ERBB2. So all these patients have ERBB2 amplification, and they get grouped together. Okay. Okay, so then, yeah, that's right. So what's here is the expression subtype. Okay, so you do unsupervised clustering of the copy number, and then they superimposed the expression subtypes on top of it. And well, some of it's clean, some of it's not, right? So here you have the basals that are pretty much grouped together. And, you know, you can make an argument that the luminal A's are grouped together. But then you get kind of a confused mix in the middle. And I think actually the authors overstate this in the paper; my feeling is that they say they recover the subtypes, but it doesn't seem to be so clear. I mean, you can do correlation analysis of the subgroups of the copy number to the subgroups of the, pardon me? No, I don't think they did that. They used, I think, sort of visual interpretation. Okay, so here we have survival analysis. Okay, so this is actually going to be work that you're going to do tomorrow. And she's going to lead you through this. She was actually involved in this study, which is another real big bonus of using this study: she can provide insight that most wouldn't be able to. So here you have the expression subtypes that show differential survival.
So that's quite striking. I believe these are the basals here. So they have much inferior survival rates. These are Kaplan-Meier plots. And so you will learn how to do these, and actually, ahead of the lab, you'll learn all the concepts that go into producing these plots in the first place, and the statistics that go into determining whether one survival curve is different from another survival curve. So that's the clinical component of our clinical genomics workshop, and that'll be covered tomorrow. So here you have those three copy number subtypes. And so they show that there's differential survival; I believe that these are the complex and amplifying, and then these are the others. So then we have cases that showed recurrent high-level amplifications, and those that didn't. So here you have an inferior survival of those that showed recurrent amplification versus those that didn't. And then here, I think this is one of the really neat findings of this paper: you have these expression subtypes that are kind of accepted almost as dogma in the breast cancer world. And what this data shows, albeit on a small number of cases, is that these subtypes can be further split in terms of outcome data. And so here you have those that are luminal A's but also have high-level amplifications, and those that are luminal A's that don't have high-level amplifications. And so it shows that you can split these subtypes. And so again, this is looking into the perspective on biomarker discovery. So this gives us some indication that even within a subgroup tightly defined by one method, if you use a different, orthogonal assay that measures something different, one can potentially split a subtype. So this is a really nice finding, I think, in this paper.
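For readers who have not seen how a Kaplan-Meier curve is built, the estimator itself is short: at each time where a death is observed, the running survival probability is multiplied by the fraction of at-risk patients who survived that time, and censored patients simply leave the at-risk set. The sketch below uses invented follow-up times and a made-up function name `kaplan_meier`; it is not the workshop's lab code and it does not include the test for comparing two curves.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)
        if deaths:
            # Patients censored at time t still count as at risk at t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve

# Five invented patients: deaths at 2, 3 and 5; censored at 3 and 8.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
print([(t, round(s, 2)) for t, s in curve])  # [(2, 0.8), (3, 0.6), (5, 0.3)]
```

Running this separately on two groups, for example luminal A cases with and without high-level amplifications, gives the two step curves that a plot like the one described would compare.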
And then you have the two remaining survival curves, related to amplification of AP. Do you have anything to add so far, Anna? Okay, okay. You can split it. Yeah. In fact, you wouldn't get a correlation. Very good. Okay. So then finally, there's this plot here, which shows that if you take the copy number out, so you take the non-copy-number-induced expression, the subtypes are still somewhat preserved. So basically, adding copy number helps to split, but even without it, we still recover the subtypes. Okay. So any comments on the paper? What were people's impressions of the paper? I'd be interested to hear from those who had a chance to read it. Was it something that you thought was interesting? Was it something that you thought was well done? Any caveats, limitations? Any comments? People are reading it now. Yeah, please. Yeah, I mean, it seems that way. Again, it's a small cohort. So that needs to be reproduced. And that's something that we're going to try to reproduce in our own research. Certainly, we look to that. Yeah, I mean, it certainly points to that. Yeah, right. So I think they listed a large number of genes that they think are potential targets based on association of amplification with expression. And so presumably there's a lot of follow-up going on for that. The problem with biomarkers is that you always believe that some of the evidence can lead to positive predictive value and negative predictive value, and those are the clinical ones, the interesting ones. Predictive value. Those change so much with the background of the population. So it's never going to be the perfect thing, because there will never be anything with 100% positive predictive value or 100% negative predictive value. Absolutely. And I think that finding the target for therapy on those places that they find that have those opportunities are the most important of time.
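The point raised about predictive values changing with the background population can be made concrete with Bayes' rule. In this sketch the sensitivity, specificity, and prevalence numbers are invented for illustration, and `predictive_values` is a hypothetical helper name.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value for a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# The same hypothetical test (90% sensitive, 90% specific) in two populations.
ppv_common, npv_common = predictive_values(0.9, 0.9, prevalence=0.5)
ppv_rare, npv_rare = predictive_values(0.9, 0.9, prevalence=0.01)
print(round(ppv_common, 2), round(ppv_rare, 2))  # 0.9 0.08
```

The test itself is unchanged, yet a positive result is right about nine times in ten when half the population has the condition and less than one time in ten when only 1% does.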
Yeah, so I think I actually agree there. The issue is that again, you're dealing with maybe 10, some 10 to 15 cases that have the high-level amplifications. So your case numbers are so small. Of course, then you need to just build up case numbers. So you filter down the biomarker set. But now you need to do the other side, which is to expand the patient cohort. Please. Sure. So let's talk about limitations. So certainly, I think that's an issue. And also, this speaks to why I think these clustering and biomarker discovery studies are often not reproducible: it's because tumor samples are difficult to study. Thanks for bringing up those points. I mean, you have normal cells mixed in, you have heterogeneous clonal populations that are mixed in. And so there's a lot of biological noise in the system. And that often contributes to variation and lack of reproducibility in these studies. And so there are techniques like laser capture microdissection and laser microdissection. And I think the better the sample preparation, the better your study is going to be. That's the general rule, I think: if you invest time up front, the data that comes out will be better. And there's the saying that we have in computer science, garbage in, garbage out. And so you want to try to really purify your samples as much as possible. I think that's clinical; you lose clinical validity. Yeah. Yeah, that's right. That's right. Sure. But if you get closer to the biology, though, right? So if you have pure samples and cleaner data, you get closer to the biology, and then maybe you have a better chance for clinical utility. Yeah. So there's one thing that's missing, and nobody's picked up on. Yes, it may identify a signal, right?
So I think the point is that laser capture, or cleaner samples in general, may identify the signal. You can still then try to find the signal in noisy samples. Yeah. Right. Yeah, exactly. Yeah, that's right. More specific techniques. Yeah, correct. So you do much more targeted, lower-throughput assays once you've found the signal. And okay, so there's one other point, before we take lunch, that I really want to discuss about this paper. I wonder if anybody has any other comments that will hit on it. So what about validation? So they used these high-throughput techniques to determine high-level copy number alterations, but didn't validate using tried and true techniques like FISH, for example. So does that mean that it's less believable? Well, I think the power of having two data sets is really quite nice, because in some way that's actually a validation in itself. Because you're seeing concomitant increased levels of DNA and expression, that gives you a pretty good indication and can cut down noise considerably. But at the end of the day, I think one would still need to look at at least a few of these cases using, for example, fluorescence in situ hybridization of the copy number amplifications, to make sure that in fact they're there. So that's just a limitation that I noticed. But right, so orthogonal data sets are also certainly a very reasonable way, because it's an external cohort of patients, and if you can validate it in external cohorts that are produced in different labs, using maybe a different microarray platform, then that's also a form of validation. Okay, so just a preview of the lab before we go for lunch. So we're going to look at these expression subtypes, and we're going to do feature selection and clustering methods. We're going to determine copy number profiles based on the arrays, and we're going to try to find copy number subtypes.
And then tomorrow we're going to do association with outcome by plotting survival curves, and look at how we can do that type of analysis and look at parameters that are associated with outcome. So basically, you know, this is just a review of the paper. And again, if you haven't looked at it already, I encourage you to do so.
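As a taste of the feature-selection step previewed for the lab (collapsing roughly 22,000 features to about 100 before clustering), one common unsupervised criterion is to keep the most variable features. Whether the lab's genefilter step uses variance or some other criterion is not stated in the talk, so treat this purely as an assumption; the expression values and the function name `top_variable_features` are invented, and the lab itself runs in R and Bioconductor rather than Python.

```python
from statistics import pvariance

def top_variable_features(expression, k):
    """Return the k feature names with the highest variance across samples.

    expression: dict mapping feature name -> list of log2 expression
                values, one per sample.
    """
    ranked = sorted(expression,
                    key=lambda name: pvariance(expression[name]),
                    reverse=True)
    return ranked[:k]

# Invented log2 expression for four genes across four tumors.
toy_expression = {
    "ERBB2": [2.0, 2.1, 9.0, 9.2],
    "ESR1": [8.0, 7.5, 1.0, 1.2],
    "ACTB": [10.0, 10.1, 10.0, 9.9],
    "GAPDH": [11.0, 11.0, 11.1, 11.0],
}
selected = top_variable_features(toy_expression, k=2)
print(selected)  # ['ERBB2', 'ESR1']
```

Genes that barely vary across tumors cannot separate subtypes, so dropping them first leaves a much smaller matrix for the hierarchical clustering step.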