Welcome to the second day of our clinical biomarkers workshop. By way of a short introduction to the material that will be taught today, I'm just going to say that this is the second course that I'm teaching, and the material keeps evolving. So at the end of the course you will be prompted to give your feedback, and I would much appreciate your feedback on my day in terms of the content depth, the breadth of topics in the lectures, and the structure of the lab, which is going to be a bit different from yesterday.
Module three is called clinical omics, and it is composed of two parts. The first part is basically a bit of an extension of the topics that were covered yesterday by SOARAPP: I will talk quite briefly about transcriptome profiling using both microarrays and next-gen sequencing technologies. There is a next-gen sequencing workshop here that you can take, so I will just give you a very small glimpse of the situation on this front, and then I will illustrate the application of these technologies to the profiling of alternative splicing in cancer. So this will be part one. After a short break between the two parts, we will follow with part two, where I will introduce you to the clinical omics basics; then we will talk about the procedures for building classifiers and the classification and discrimination problems in biomarker discovery, and I will illustrate the application of these approaches to drug response studies.
So now let's start with the first part: transcriptome profiling using high-resolution technologies. As you saw yesterday, and as you probably knew pretty well before, there is a vast number of different statistical methods that have been developed by the research community for the analysis of microarrays.
For instance, this paper is devoted to the evaluation of different statistical methods using cDNA microarray technology, and it was published in 2000. Now, for next-gen sequencing technology, what people have mostly been doing for the last few years is trying to work out the profiling of any given sample and to come up with appropriate methods of measuring copy number or expression. Now that they have succeeded with this, we are faced with the question of how we are going to analyze a cohort of samples profiled with next-gen sequencing technology. It's becoming clear that we can basically apply the same statistical methods as in microarrays to next-gen sequencing profiles. This is a paper by the same authors, but now from 2010, so a very recent paper, where they evaluated pretty much the same statistical methods for the analysis of differential expression from next-gen sequencing profiling.
So just to remind you of the conventional steps that are usually taken during microarray profiling of expression, say. You have your fabricated array with physical DNA sequences printed on it, and you have your sample material, which you hybridize onto that array, and then you get a series of images, scans of those arrays. You then usually go through three major steps. First is background correction, which allows you to tell real signal from background noise. Normalization is the next step, the goal of which is to make the expression profiles comparable across chips or across samples. And in the end you do a summarization, such as a probe-set-level expression summary or a gene-level expression summary; it just depends on the array design and your study design. And in the end (sorry, it's not really clearly seen there) you get a log2 expression measurement for your microarray platform.
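To make the three microarray steps above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method from the papers mentioned: the function name `preprocess`, the simple background subtraction, the median scaling, and the median summarization are all choices made for this example.

```python
import numpy as np

def preprocess(intensities, background, probe_to_set):
    """Toy microarray pipeline (hypothetical helper, for illustration only).

    intensities:  (n_probes, n_samples) raw foreground intensities
    background:   (n_probes, n_samples) background estimates
    probe_to_set: length n_probes list of probe-set (or gene) labels
    """
    intensities = np.asarray(intensities, dtype=float)
    background = np.asarray(background, dtype=float)

    # 1. Background correction: subtract background, floor at 1 so log2 is defined.
    corrected = np.maximum(intensities - background, 1.0)

    # 2. Normalization: rescale each sample (column) to a common median
    #    so that profiles are comparable across chips.
    medians = np.median(corrected, axis=0)
    normalized = corrected * (medians.mean() / medians)

    # 3. Summarization: median of log2 values over the probes of each probe set.
    log2_values = np.log2(normalized)
    sets = np.array(probe_to_set)
    labels = sorted(set(probe_to_set))
    summary = np.vstack([np.median(log2_values[sets == s], axis=0) for s in labels])
    return labels, summary  # summary has shape (n_sets, n_samples), in log2 units
```

Real pipelines use more careful background models and normalization schemes; the point here is only the order of the steps and the log2 output.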
Now, how do we get from next-gen sequencing to expression profiles? The idea is very simple. First, say we have a gene model like that, which includes a number of exons. This may be an alternative exon, which may be excluded or included. And then you may have a transcript like that; for simplicity, I put just one transcript structure here as an isoform. These two things are known a priori from the gene annotation databases. Then you take a collection of RNA-seq reads and you map them onto your transcript sequences, so you cover both exonic regions and junction regions. Quantifying the number of reads that are mapped to this particular sequence of a transcript isoform gives you a measure of expression. So basically the idea is: the more reads in a sample library, the higher the expression.
But now, is it really true that more reads mean higher expression? The answer is no. There are a number of considerations in RNA-seq data that always have to be kept in mind. For instance, longer genes naturally get more reads, right? The longer the sequence, the more reads you can map to it. Does it mean high expression? Not necessarily. Then, the greater the sequencing depth, the more million reads you get out of your sample library, the more reads you get mapped to the same sequence. Is it an indication of high expression? Not necessarily, and certainly not in comparison with other libraries, with other samples. Then there can be a situation where you have more mapped reads in a given region for reasons that are not explained by these two other points. These can possibly be non-unique regions. Is it necessarily high expression? And this is considering that you have masked the repeats in the genome, and yet you get a higher density of reads across certain regions. So is it high expression? No, not necessarily.
This can be a region that has high similarity to some other regions in the genome, so you have to correct for that, too. And then maybe some other... yes?
How would you correct for each one of these?
Okay. Yes. So basically there are certain metrics that were introduced a couple of years ago, specifically by Barbara Wold's group, and that metric is called RPKM. I'll briefly mention it. Basically, you normalize by the length of the sequence that you're mapping to. This is a correction that is usually done these days. And yet with the Illumina platform, for instance, there is a propensity for longer and more highly expressed genes to be preferentially sequenced. This is recognized to be a big problem, and people are trying to develop algorithms to correct this bias. Now, greater sequencing depth: the same metric, RPKM, which is the simplest one and probably one of the first metrics that was available out there, does take the sequencing depth into account. RPKM means read counts per kilobase of exon per million reads. So if you have libraries sequenced to different depths, say 10 million reads in one and 20 million in another, you just normalize by the total number of reads from that library, and you get this kind of normalization.
Now, more mapped reads in a given region. There is a notion called the mappability of each and every nucleotide in the human genome. It means: if you take any given nucleotide, how likely is it that, say, a 50-mer from RNA-seq that spans this particular nucleotide would be mapped to more than one location in the genome, so not only to this one, but to some other locations as well. This assessment is done for each and every nucleotide in the human genome, so each nucleotide has a mappability score that you can use to correct for these things.
Presumably that mappability score depends on the size of the oligo that you're...
Yes. And it's pretty much library-specific, platform-specific, right? In the previous generation we had 36-mers, so producing mappability scores across the genome would involve taking 36-mers. Now we have 50-mers, so you need to repeat the whole procedure and get new mappability scores. The next step is 75-mers, and it will need to be repeated again. There are a number of approaches taken by different groups to compute mappability scores for the human genome. You can go to the UCSC browser and browse the mappability tracks, and you'll see a number of different mappability scores.
And do those scores take into account what's known about copy number variation, or not yet?
That's a good question. I think they should, because... Now, let's see, eventually they will, but... I think they should, because the entire procedure, basically, is that you take all 50-mers, and there will be 50 of them that cover a particular nucleotide, each shifted by one nucleotide, right? So you have 50 50-mers that span a particular nucleotide. Then you map them all back to the genome, and you count how many of those 50 were uniquely mapped back to the same position. Yeah, but I think the reference genome does contain these, right? So, yeah.
So there are other considerations, such as material quality and degradation. Very often people are faced with the problem of profiling archived material, such as formalin-fixed, paraffin-embedded specimens. That kind of confounding factor also has to be taken into account through certain normalization and correction procedures. And there are yet other biases in next-gen sequencing profiling. One is the non-uniform distribution of reads along the transcript. It seems that, specifically for the Illumina platform, the nucleotide frequency within the individual reads is not uniform.
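The mappability procedure described in the answer above (take every k-mer that spans a nucleotide, map them back, count how many map uniquely) can be sketched on a toy sequence. This is a simplified illustration, not how any published mappability track is computed: it assumes exact string matching on one strand, and the function name `mappability` is hypothetical.

```python
from collections import Counter

def mappability(genome, k):
    """Per-position score: fraction of the k-mers covering a position
    that occur exactly once in the genome (toy version, exact matches only)."""
    if len(genome) < k:
        raise ValueError("genome must be at least k bases long")

    # All k-mers, indexed by their start position.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = [counts[kmer] == 1 for kmer in kmers]

    scores = []
    for pos in range(len(genome)):
        # k-mers covering `pos` start between pos-k+1 and pos, clipped to valid starts.
        first = max(0, pos - k + 1)
        last = min(pos, len(genome) - k)
        covering = unique[first:last + 1]
        scores.append(sum(covering) / len(covering))
    return scores
```

As noted in the discussion, the scores depend on k: changing the read length from 36 to 50 to 75 means recomputing them.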
The very first few, the first dozen nucleotides of any given read have a certain bias in nucleotide frequency, and that results in an uneven distribution of reads along the transcript. Now, if you think about it, it will not give you a straight line for the expression level of any given transcript; it would be peaks and valleys, peaks and valleys, right? A similar effect is intrinsic to the microarray platform itself, where it is caused by the probe affinity issue: each and every probe on a microarray does not necessarily give you an equal signal in terms of magnitude. This is not a problem when you compare multiple samples. You just see a curve that is followed by pretty much all of the samples, and then you're fine. But when it comes to the analysis of, say, differential splicing within a given gene in a given sample, then it may be a problem, because you may have a lower signal for one exon, and you may infer that this exon is underrepresented in the transcript pool of this particular sample. And this is not necessarily the case. People are recognizing this problem and, again, they're trying to come up with a weighting scheme to correct these biases. What I was trying to say is that microarray and next-gen sequencing technologies both have their own biases that still should be taken into consideration for proper analysis.
So here's a specific example. This is from my current study of prostate cancer samples that have been profiled with next-gen sequencing, and these are FFPE samples. I'm assessing the splicing profiles of these samples, and this is the distribution of a certain splicing metric for each and every sample. I have 11 samples here, including a couple of cell lines. And what I suddenly see is that one of the samples has a very unusual distribution. It looks weird.
And I knew that I did not have a problem of variable sequencing depth. I did not have variation in the nature of my samples: they were all FFPE, the same amount of material, the same quality of RNA. It was corrected for gene length, et cetera. Now, when I go back to the expression distributions (this is a typical distribution for RNA-seq data: this is the expression measure, and this is the distribution), I see clearly that for this particular sample the fraction of modestly expressed species is much smaller compared to the other samples. This can mean greater degradation of this particular sample. To me it did make sense, because this sample was from a lymph node, and from lymph node FFPE it's a bit challenging to get a good-quality RNA isolation. So this tells me that I do need a normalization step, even for data that are pretty much equally distributed with regard to other factors.
Just a question about the FFPE samples. Do you have any quality control metrics for the sample before you do the sequencing, to identify highly degraded samples?
Well, the first check is the check on the RNA isolation itself. And then, because we are from a microarray facility, what we prefer to do is microarray profiling, and we see how the sample performs and how interesting it is to us. Sequencing is still way more expensive than microarrays, and because we have all that technology in-house, we can afford it. So we do the initial screening with microarrays and see the quality.
The whole transcriptome?
The whole transcriptome. And then we see the quality and the expression profile. If it's really flat, then we just don't bother sequencing it. So, and now, yes?
So can you use samples that are years and years old?
Yes.
And you're trying to get RNA. How do you know the integrity of your specimens?
Well, you can't really tell until you start doing something with it, right? In our experience, we can still do more or less fine with samples up to 10 years old. If it's over 10 years old, the degradation is very significant, and it's pretty hard to profile.
Can you do anything? You're asking me, do you know anything? So if you want to know the quality of the RNA before you do the microarray, there are things that can be done with the RNA to look at the quality of the samples. Do you have some PCR assays?
Well, yeah, you basically take a look at the smear that you have. You can think of some specific PCR for some of your interesting genes, but we're not interested in any specific genes; we're rather interested in whole-transcriptome profiling. So you just see the spectrum of fragment sizes that you have in your isolate. It's very indirect, so you can't really tell upfront how well the profiling will go. If you see a reasonable smear in your gel, then you just go ahead and analyze it. And then, certainly, microarrays are a good check.
It's just an expensive check.
Yeah, it's an expensive check, but, I mean, $300 versus about 10 grand. It's a bit of a difference.
It's a big issue.
It's a big issue. And, for instance, in our institution we're also faced with the problem of comparing different formats of samples, fresh frozen versus FFPE, and this is not a simple batch effect, for sure. It's a problem, and people are trying to come up with an algorithm that would correct for these differences. So it's not simple.
What is this FFPE?
This is formalin-fixed, paraffin-embedded tissue. Yes, it's a pathology sample. The advantage of having FFPE is that you can do histology and pathology on the entire block, and then you can microdissect the tumor itself and leave the benign tissue, stroma, say, outside, right?
And then you can take, say, this stroma and do a comparative analysis to see the effect of the microenvironment, the field effect, et cetera. With fresh frozen, the quality of the RNA and DNA isolates is way higher, but at the same time you never know the tumor content. It's pretty much a black box.
And fresh frozen hasn't been the most traditional way of preserving samples.
Yes.
So mostly FFPE samples there.
Yeah, most banks are full of FFPE.
I have seen estimates of millions of these FFPE samples.
Yeah, yeah. And very often, for certain cancer types, there is another challenge: the availability of only a small amount of material. Prostate cancer, for instance, is detected very early, and the tumor volume is really tiny. So we are faced with two challenges here: FFPE in prostate cancer, and a small amount of material.
Okay. Oh, you know, this is a hard question. In terms of cells, I can't give you the number, but in terms of the FFPE homogenate, it would be something like a couple of hundred microliters.
It should be on the order of a microgram. Is that right? About a microgram?
Yeah, it depends. But on average, for any given assay that you're doing, you have about one microgram of total RNA.
Okay. So now it is clear that for next-gen sequencing technology, when it comes to the analysis of multiple samples, we still need to go through the same pipeline, which includes this set of steps. These are the three familiar steps from microarray analysis, preceded by the mapping of the RNA-seq data onto the feature sequence database. In this case we have a feature sequence database, the features being genes or transcripts or exons and junctions, as you wish. It can be any database of your choice, depending on what study you're doing.
Then you have your library and the reads coming out of the Illumina sequencer, and you map them onto this feature sequence database and get your read counts. This is done using different algorithms, MAQ and Bowtie being popular ones. Then you still need to do the mappability and background correction. How do we do the background correction, for instance? We take a look at the distribution of read counts in intronic and intergenic regions, and we take that distribution as our background level for a given library. Then we do a summarization, which is basically a matter of what kind of expression measurements you want to get: exon-wise, transcript-wise, or gene-wise. It's dependent on your feature sequence database. And then you do a normalization, as I mentioned before. RPKM is one simple global normalization method, which simply divides by a single value per library. In the paper from 2010 that I showed you in the beginning, it seems that this kind of global normalization is not quite adequate, because it is very much driven by a minority of genes that account for the majority of the read population: some 5% of human genes give you 50% of the reads. That's why this kind of global normalization introduces a bias. What people do now instead is something similar to microarray normalization, something like an upper-quartile normalization, and that group showed that it's quite adequate. In the end, you get the same log2 expression measurements, which are directly comparable with your microarray experiments.
What I've been telling you so far with regard to next-gen sequencing requires a feature sequence database as a prerequisite, the a priori coding content of the human genome that you need to have. But next-gen sequencing technology provides much more than that.
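The two normalizations mentioned above, RPKM and upper-quartile scaling, can be compared with a minimal sketch. The function names are hypothetical, the sum of the counts in a column stands in for the library's total read count, and the upper-quartile variant shown here (75th percentile of the nonzero counts) is one common formulation rather than necessarily the exact one used in the 2010 paper.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million reads.

    counts:     (n_features, n_libraries) read counts
    lengths_bp: (n_features,) feature lengths in base pairs
    Assumption: each column sum stands in for that library's total read count.
    """
    counts = np.asarray(counts, dtype=float)
    per_million = counts.sum(axis=0) / 1e6                         # one value per library
    per_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3    # one value per feature
    return counts / per_kb / per_million

def upper_quartile_scale(counts):
    """Scale each library by the 75th percentile of its nonzero counts,
    then rescale by the mean of those percentiles across libraries."""
    counts = np.asarray(counts, dtype=float)
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    return counts / uq * uq.mean()
```

The design point is the one made in the lecture: a total-count divisor is pulled around by a few very highly expressed genes, whereas a quantile of the count distribution is much less sensitive to them.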
Basically, you can use next-gen sequencing to reconstruct the transcriptome and genome de novo using paired-end sequences. This is a highly novel and actively developing field, and these are just a couple of recent papers that I can refer you to that have developed algorithms for de novo reconstruction of transcriptomes. I'm not going to touch on that. It's a bit of an advanced topic and not really relevant to the course here, but you're welcome to take a look.
Okay. So this was the first part of the first part. Now I'm moving on to illustrating the application of these technologies to the profiling of expression, and in particular of splice variants, alternative splicing, with the illustration of splicing repertoire profiling in breast and prostate cancers. Before I do that, I will give you a very short introduction, just as a refresher, on what alternative splicing is. It's ubiquitous in the genome: probably almost 100% of human genes undergo alternative splicing in a tissue- and condition-specific manner. It is a very mobile system, and of course it's implicated in human disease, including cancer. Did you raise your hand?
Yes. So the question about that first statement: did you consider all tissues at all times?
Yes.
So for any given tissue at any given time, you may only have...
That's an excellent question. So let me ask you the following question; it's relevant. See, this is an expression distribution for all of the transcripts in the human genome for prostate cancers. One tissue type. Well, no. No. You see that these samples pretty much follow the same bimodal distribution. Why is it bimodal? Yes? What are the two different distributions?
Incompletely spliced... hnRNA?
Close, and it's closely related to your question. In any given cell, in any given tissue, in any given condition, not all of the genes and not all of the transcripts are expressed.
In my experience, only about two-thirds of all of the transcripts are expressed in any given cell. So this fraction is the features that are not expressed: exons, junctions, and ultimately transcript variants. And this fraction of transcripts is expressed at a certain level; a log2 of 5 is a very reasonable expression level. And then this is basically background. So yes, this number comes from the comparison of multiple tissues and multiple conditions, but mostly multiple tissues. Yeah?
Can you tell whether the same proportion of genes would be expressed and not expressed? I guess from your Illumina sequence data you can tell whether it's the same subset of genes that are being used.
That would be a very interesting question to ask. I didn't ask that question. But because this is a pretty homogeneous set (it's prostate cancer, prostate epithelial cells), I would expect a great overlap between samples in terms of which genes are expressed and which ones are not.
Okay. So alternative splicing is a tightly regulated process, and it involves a number of cis regulators and trans regulators. The trans regulator is the spliceosome, which is a huge ribonucleoprotein complex composed of some 200 different proteins and RNAs. These are different factors, parts of the spliceosome. The cis-regulatory sequences are RNA sequences, the splicing signals: the 5-prime and 3-prime splice sites, the branch site, and then intronic splicing enhancers and silencers and exonic splicing enhancers and silencers. So this is a very tightly and very finely regulated process. You can imagine that any aberration that takes place within either the cis regulators or the trans regulators can lead to the alteration of splicing patterns. And that can result in the deregulation of cell processes such as death, motility, invasion, differentiation, and proliferation.
So are there any examples of cancer-specific splice isoforms? The answer is yes; there are a great number of different examples. Just to show you a few. This is the simplest scenario, where you have just one internal alternative exon. It's called a cassette exon, and it can produce a short isoform and a long isoform. For this type, we have the example of FGFR1, which has a cancer-specific isoform that lacks the alternative exon; it's correlated with poor prognosis in breast cancer and with malignancy of astrocytomas. The short isoform is again the cancer-specific one in WISP1 and VEGF. The cancer-specific short isoform of WISP1 has different biological properties compared to the normal inclusion isoform and causes cells to invade. In VEGF, the cancer-specific short isoform lacks a functional domain of the protein. One of the notorious examples is the BCL-X gene, which has long and short isoforms: the long isoform is the anti-apoptotic cancer isoform, and the short isoform is the apoptosis-promoting form.
There are other examples of splicing. Splicing can be very complex, and this is probably one of the ultimate examples: a number of tandem alternative exons, maybe ten exons, that can be spliced in different numbers and different combinations to give a really big spectrum of isoforms. There are examples of such genes. For instance, tenascin C has an 8-kilobase alternative region in the middle of the gene, and that isoform facilitates cell migration by inducing loss of adhesion. It's present at much higher levels in invasive breast cancers compared to non-invasive breast cancers, and an antibody has been raised against this alternative region to detect glioblastomas in brain tissue. CD44 is another gene notorious for very complex alternative splicing.
It has ten alternative exons that are spliced in different combinations, and there was a precedent of using an antibody against a specific alternative exon within that gene, against the splice variant v6, for head and neck carcinoma. Unfortunately it didn't go very far because of the high toxicity of that antibody.
But still, why do we care about splicing itself, in particular with regard to clinical applications? If you have a gene model like that, with an alternative region in the middle, you can have two transcript isoforms, excluding and including the alternative exon, and you can imagine that two protein isoforms are translated from these transcript isoforms. Now suppose you raise an antibody against the part that is common to these protein isoforms, and it turns out that one of the isoforms plays a cancer-beneficial role compared to the normal isoform. If your antibody is a therapeutic agent, it would block the function of both isoforms, and you would have toxicity for normal cells that express the normal isoform. This is a scenario where the protein is, say, on the cell surface. If you instead imagine raising an antibody against only the alternative region, then you can specifically target the cancer-specific protein isoform and spare the normal isoform, which can be very beneficial for the normal functioning of the cell. Thus you reduce the toxicity of a new therapy. So all this knowledge can be used effectively in clinical applications.
Now, how do we interrogate alternative splicing at the whole-transcriptome level? There are two ways: using microarrays and using next-gen sequencing technology. Using microarrays, it is a matter of choosing a sub-gene-level microarray platform that has multiple probes per gene, say, a probe for each and every exon. There are specific microarray platforms that are more effective in terms of the design: they include junction sequences between corresponding exons.
These junction sequences come from transcript variants that have been observed somewhere and are annotated in the database. An example of such a platform, which would be called a splice-sensitive or junction array, is the Affymetrix research Human Junction Array. We see in the summary below that it has some half a million features and it interrogates 250,000 exons in the human genome, representing 33,000 transcripts. And you have this number of exonic probes and this number of junction probes.
So does that mean that on the platform you have multiple probes that cover different splice variants of the same gene?
Yes. You have multiple probes per gene: you have probes for exons, PSRs as they are called, and then you have probes for the junctions between them. The junctions are highly specific. You can imagine that if you have only exonic probes and you have a pool of different isoforms in any given sample, then it would be really hard to infer the transcript structure just from the expression level of each and every exon. You do not know which exon is connected to which. That's why I personally prefer platforms that have junction probes. They provide additional information for the inference of the transcript structures.
It's quite the same level as RNA-seq. Is it the same level?
It is at the same level. I mean, it's telling you the same information that we can get from RNA-seq.
Exactly. It doesn't draw on everything.
Yes. So the right panel shows a similar idea behind profiling splicing using next-gen sequencing. You have your feature sequence database, which is comprised of exonic regions and junctions. Then you map RNA-seq reads onto that feature database, you do all of the normalization and background correction steps, as I described before, and you get a profile similar to these microarrays.
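The idea of a feature database made of exonic regions and junctions, with reads counted per feature, can be sketched on toy sequences. Everything here is an assumption made for illustration: the function names, exact substring matching in place of a real aligner, and the convention of taking read length minus one base from each side of a junction so that any read matching a junction feature must span the junction.

```python
from collections import Counter

def build_features(exons, transcripts, read_length):
    """exons: dict exon_name -> sequence
    transcripts: dict transcript_name -> ordered list of exon names
    Returns dict feature_name -> sequence (one per exon, one per annotated junction)."""
    flank = read_length - 1  # a read of read_length cannot fit on one side only
    features = dict(exons)
    for exon_names in transcripts.values():
        for left, right in zip(exon_names, exon_names[1:]):
            features[f"{left}-{right}"] = exons[left][-flank:] + exons[right][:flank]
    return features

def count_reads(reads, features):
    """Count reads per feature, keeping only reads that match exactly one feature."""
    counts = Counter()
    for read in reads:
        hits = [name for name, seq in features.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

# Toy gene with a cassette exon E2: a long isoform (E1-E2-E3) and a short one (E1-E3).
exons = {"E1": "ACGTTGCA", "E2": "GGATCCAT", "E3": "CTTAGGCA"}
transcripts = {"long": ["E1", "E2", "E3"], "short": ["E1", "E3"]}
features = build_features(exons, transcripts, read_length=4)
counts = count_reads(["CGTT", "CAGG", "CACT", "GCAC", "ATCC"], features)
```

Reads that land on the `E1-E3` feature support the exon-skipping isoform, while reads on `E1-E2` or `E2-E3` support inclusion, which is why junction features are more informative about transcript structure than exon features alone.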
As I mentioned, this is all based on the prior annotation of the human genome. You have to know which exons you are interrogating, you have to have sequences for them, and you have to have sequences for the junctions. This is one example of such a database, which was created in Vancouver at the Genome Sciences Centre by Marco Marra's group, where they have many more features represented compared to, say, microarrays. It's a much more comprehensive database. They made an attempt to represent not only all known junctions, but also all possible junctions in the human genome, which resulted in this huge increase from 300,000 to 1.2 million different junctions. Yes?
For each sample, there are multiple possible junctions, but it's only one type, right? You designed multiple probes for the different types, and for each sample it's only possible to have one type, right?
What do you mean by one type?
You've got alternative splicing, possibly resulting in different types. For each of the different types, you designed a probe. Therefore, each sample could only have one specific...
Of course. It can happen that within any given sample only one isoform is expressed.
It's possible to have multiple of them in one sample?
Oh, yeah, for sure. Yeah, it's rare. You know, as I mentioned, the rate of alternative splicing in the human genome is high. Now, if you're dealing with a specific tissue...
But you use PCR to replicate one specific type, a lot of copies, right? So when you probe for expression of that one, it's only specific for one type, right?
So maybe this one will answer your question.
For each sample, it's only possible to have one variant? One transcript variant?
Yes, it is possible. But more often...
It's possible to have two variants?
And it's more often.
It's more often that for any given gene, especially a gene that is of cancer interest, there are a number of distinct splice isoforms that are expressed at a certain ratio in any given sample.
Now, when you say all possible junctions, could you go over that again? It's all possible junctions in trans, right?
It's all possible junctions in cis, within an existing transcript. Yes, within a given gene. Yes, certainly.
A million two seems to be a small number for all possible junctions.
Yeah, so if you...
BCR-ABL. I'm sure BCR-ABL is in there, but that's a special case.
So there are roughly 200,000 exons in the human genome, right? If you imagine all possible combinations, that would be two to the power of that. It's a daunting number. But in principle, if at some point in the future computing capacity matches those tasks, it will be possible to search for fusions, for instance, in this manner.
And could you explain the difference between a PSR and an exon?
Consider them about the same. The exonic region is a PSR; it's a probe selection region, that's what it's called.
In the analysis...
I'm sorry, but it's borrowed from microarrays, so it's used mostly in the microarray field. In the next-gen field, it would be called a feature.
Well, in the analysis, you have to treat it as a group, right? We cannot treat it as a probe set as usual. You have to take account of all the variants, a bunch of different probes for the different variants. But when we analyze it, we try to treat it as one group.
For inferring alternative splicing?
For splicing, yeah.
For differential splicing, yes. The way the algorithm goes, at least the one that I've developed, it takes into consideration all of the probes within a given gene. Within a given gene, so all of the transcripts.
And it looks at the profile of expression for each and every feature across samples. Then it does a certain normalization procedure that takes out the differential gene expression. And then I'm left with differential exon expression, which is an indication of splicing. I don't have that particular slide; I didn't really think that it would be of interest to you. But I can show you during the break, if you wish, how it can be done.
Yeah, and so this slide summarizes the strategy for inferring differential splicing from next-gen sequencing. You have your gene model at the top, with the alternative exon 2, which is included in splice variant 1 but excluded in splice variant 2. And then we have reads that are mapped to unique junctions. In this case, these junctions are unique and distinct from these junctions. That's how you can tell these two splice isoforms apart. Then you catalog all of the observed exons and all possible junctions, maybe, if you wish, 2 to the power of 200,000. And then you map your RNA-seq reads onto this feature database. That's how you get the expression level and the ratio of splice variants for all human genes in a given sample. So now, yes?
Well, in this kind of design I don't usually use the paired-end information. So it's kind of unpaired. Yeah, so the paired-end... Yeah, you can use it basically for fusion detection and breakpoint detection, for sure. But this would be a different algorithm. Yes, for sure. And that's what they do: they use this information in the de novo reconstruction. So it is crucial for those kinds of algorithms. But this is kind of simple, so it does not really require the paired-end information.
Okay, so this slide shows molecular profiling of the subtypes of breast cancer using microarrays, and that was a splice-sensitive microarray platform. This was a study that I was doing back in California at LBNL, and it was recently published in Molecular Cancer Research.
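The normalization step described above, taking out gene-level expression so that what remains is exon-level variation, can be illustrated with a small sketch. This is not the speaker's algorithm, which is not shown in the talk. It assumes log2 feature-level values, uses the per-sample mean over a gene's features as a stand-in for the gene-level estimate, and every name and number below is made up for illustration.

```python
from statistics import mean


def gene_normalized_signal(feature_log2):
    """Subtract a per-sample gene-level estimate from every feature of one gene.

    feature_log2 maps a feature name (exon or junction) to a list of log2
    values, one per sample. The gene-level estimate is assumed here to be the
    mean over the gene's features in that sample.
    """
    features = list(feature_log2)
    n_samples = len(feature_log2[features[0]])
    gene_level = [
        mean(feature_log2[f][s] for f in features) for s in range(n_samples)
    ]
    return {
        f: [feature_log2[f][s] - gene_level[s] for s in range(n_samples)]
        for f in features
    }


def group_difference(residuals, group_a, group_b):
    """Mean residual in group A minus mean residual in group B, per feature."""
    return {
        f: mean(values[i] for i in group_a) - mean(values[i] for i in group_b)
        for f, values in residuals.items()
    }


# Hypothetical gene with three exons profiled in four samples.
# Samples 2 and 3 express the gene more highly overall but lose exon2.
toy_gene = {
    "exon1": [8.0, 8.0, 10.0, 10.0],
    "exon2": [8.0, 8.0, 7.0, 7.0],
    "exon3": [8.0, 8.0, 10.0, 10.0],
}

residuals = gene_normalized_signal(toy_gene)
diff = group_difference(residuals, group_a=[0, 1], group_b=[2, 3])

for feature, value in diff.items():
    print(feature, round(value, 2))
```

In this toy case the gene as a whole goes up in samples 2 and 3, yet after normalization only exon2 stands out with a positive difference, which is the kind of exon-level signal that would point to splicing rather than to a change in overall gene expression.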
So in breast cancer we have a number of subtypes, which are called ERBB2, basal, luminal A and luminal B, and which show different survival experience. We know that these subtypes are very well defined through their expression profiles, so they differ very much from each other in terms of expression. And now this is a splicing pattern. This is not gene expression. This is splicing profiling of the same breast cancer cell lines. What we saw there is that each and every subtype of breast cancer had a specific splice signature. It was an unsupervised clustering, and it did not reveal any new splicing-pattern subgroups; the splicing-derived clusters pretty much followed the expression-derived subtypes. But it was good to see that these subtypes also differed in splicing patterns. What was also interesting to see is that the alternatively, or rather differentially, spliced genes across the breast cancer subtypes had zero overlap with the differentially expressed genes in the same cell lines. So there was zero overlap between differentially expressed genes and alternatively spliced genes in the same cells. And this is the result of a pathway enrichment analysis, where you see the enrichment with alternatively spliced genes in red and the enrichment with differentially expressed genes in blue. You can see that different pathways are enriched with alternatively spliced genes compared to the differentially expressed genes. For splicing you would see axonal guidance signaling, ephrin receptor signaling, integrin signaling and actin cytoskeleton signaling, which was pretty much in line with the underlying morphological and phenotypic differences of our breast cancer subtypes, because some of the subtypes are much more aggressive and their invasive potential is much higher compared to the other subtypes. So that was... Yes? Say it again, please.
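The two comparisons on this slide, the overlap between the two gene lists and the pathway enrichment of each list, can be sketched as follows. This is a minimal illustration, not the analysis from the study: it assumes a one-sided hypergeometric test for enrichment (one common choice; the talk does not say which test was used), and the gene identifiers and pathway membership are hypothetical placeholders.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_hits):
    """Upper-tail hypergeometric probability P(X >= n_hits).

    X is the number of pathway genes found when n_selected genes are drawn
    without replacement from n_background genes, n_pathway of which belong
    to the pathway.
    """
    max_hits = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical background of 20 genes and a 5-gene pathway.
background = [f"g{i}" for i in range(1, 21)]
pathway = {"g1", "g2", "g3", "g4", "g5"}

# Hypothetical result lists for the same samples.
spliced = {"g1", "g2", "g3", "g11"}      # differentially spliced genes
expressed = {"g6", "g12", "g13", "g14"}  # differentially expressed genes

overlap = spliced & expressed
p_spliced = enrichment_p_value(
    len(background), len(pathway), len(spliced), len(spliced & pathway)
)
p_expressed = enrichment_p_value(
    len(background), len(pathway), len(expressed), len(expressed & pathway)
)

print("overlap:", sorted(overlap))
print("pathway enrichment, spliced list:", round(p_spliced, 4))
print("pathway enrichment, expressed list:", round(p_expressed, 4))
```

With these made-up lists the overlap is empty and only the spliced list is enriched for the pathway, which mirrors the pattern described on the slide. A real analysis would also correct for testing many pathways at once.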
No, I don't need to, because what I'm looking at is just... So when you compare, say, two subgroups and you want to get some insight into the biology that drives the differences, what you do is look for differentially expressed genes or splice variants between the two groups. Then you do the GO enrichment analysis or pathway enrichment analysis, and you see that a certain pathway is enriched with the genes that are differentially expressed between the two groups. That's how you can infer that this pathway may be important in shaping the differences between these two subgroups of your samples, and that's what is done here.
Is that... would that be different within each subgroup?
Yeah, I see what you're asking. So in that study, actually, the sentence at the top there, that differentially expressed genes do not have... So if you take the entire set of my samples, the breast cancer cell lines, and you look for differentially expressed genes across all of them, you get a list, and you take the top ones with the greatest variance across samples. This is your list of differentially expressed genes. Now you do the same with differential splicing: you look for splicing differences, a different rate of inclusion or exclusion of any particular exon across the same samples. So there would be two splice variants for any given gene in this one, so, yes, in a similar scenario. And so, say, for this particular exon, these are all the cell lines. So you do the same thing, and you get the same kind of top list of the most variable exons, and then you compare the genes: are they the same?
No. And this suggests that transcriptional regulation acts in parallel with the alternative splicing regulation of gene expression. Yes? And so what these arrows actually tell you... This is actually very informative for others in terms of how to go about the study design. For instance, say you come to the microarray facility and say, okay, I'm working with breast cancer and I'm very much interested in the ephrin receptor signaling pathway genes, so what platform would you recommend for profiling my samples? The answer would be splice-sensitive, because it seems that this pathway is very much regulated by splicing rather than by differential expression. And vice versa.
Do you need samples genetic?
Mm-hmm. Just copy them. Yeah, that's a different story.
Yeah, and so it actually does suggest that it might be a good idea to develop new metrics as well. Not that bioinformatics needs more metrics, it doesn't, but, you know, the relative proportion of pathology contributed to a process by each of these levels of regulation. So the proportion of metabolic badness that's happening because of alternative splicing, compared to the proportion of metabolic badness due to differential expression, which is not what would arise from epigenetic changes, certainly being relatively upregulated or downregulated, versus the proportion of differences from normal homeostasis that's happening because of point mutations or other types of variants.
Yeah, it's actually a very good point.
The control here?
Well, the control here was normal... well, HMECs. Yeah, unfortunately there was a flaw in the design because of the lack of normal tissues. So the goal in that study was to look at the differences between subtypes, and certainly it can reflect the differences in the origin of these subtypes of cancer, for sure. So you can't really say that these are particularly cancer-specific; this can be specific to the cells of origin. And now
this is actually a very good point that you brought up. My belief is that for different diseases, for different cancer types, it may be a specific mechanism of regulation that plays the major role. For instance, in breast cancer what I was seeing is that there is a great deal of both transcriptional regulation and splicing regulation between subtypes. When I'm looking at prostate cancer right now, it seems that splicing plays a much bigger role than transcriptional regulation. And in the recent paper from MSKCC by Charles Sawyers on the integration of high-resolution data for prostate cancer, it was clear that, for instance, mutations in oncogenes do not play a big role in prostate cancer; they are very rare compared to other cancer types. So yeah, it seems that for any given subject of your interest it may be a specific mechanism that plays the major role. Well, that's at least my opinion.
So is it time for a fellowship, in a bargain with the cosmetic surgeons in Hollywood, where you could get lots of breast tissue from otherwise healthy women? Then you'd know the actual splicing patterns of your control tissue, where the surgery was not being done for some other pathology.
How would people like to spare a healthy prostate?
Exactly, especially from that organ.
You see, 50 of them are done every day for reduction of the breast, so you don't have to worry about it.
We had this common discussion about collecting normals. The solution was to drive around with a cylinder of liquid nitrogen and look for bicycle accidents, so that tissue would be great. Like, the second one dies you have to jump in.
Anyway, yeah, it's the same problem for prostate cancer, where normal samples would normally come from African-Americans, yeah, who would lead a certain lifestyle.
But I thought that there was an issue with all of this work: if you have a solid tissue, you can contaminate it with more...
Yeah, but it is a separate issue.
These are the ones that are using the normal tissue?
No, you can't really... The adjacent normal tissue, which is more or less benign by the pathology, you can't really call it normal normal, because of the tumor-microenvironment interaction. Nobody knows these days how far tumor cells actually travel through the microenvironment. It can actually be very far, and this is a separate field of research, in prostate in particular, because prostate cancer is a multifocal disease: it can arise in multiple foci in the prostate itself.
And so it's a very important issue, getting controls. Never mind transcripts, which is very complicated, also DNA and just tissue: it's not easy to get the normal DNA. But you know...
Yeah, exactly the same.
CTCs, and why not... and we'll be there. They use other tissue, for which... other tissue doesn't have blood. And so it's really a tough problem in general in genomics, and even more so in prostate.
Oh yeah, and for splicing it's a huge issue. For instance, there are a great number of differences in splicing between cells grown in 2D and in 3D, even. So it's a very, very mobile system. Okay, so this was a very interesting discussion, thank you for that. So this slide just shows another example of the benefits of profiling splicing in cancer. This is a paper published in
Bioinformatics in 2006, where they profiled a number of FFPE samples from prostate cancer. They had normals and cancers, and the goal was to build a classifier to stratify normal and cancer samples. Based on the 128 isoform-specific classifiers they were able to reach an accuracy of 92%, whereas using the gene expression classifiers they achieved an accuracy that was 5% lower. It doesn't seem to be much, but still it is an improvement, so it does seem to bring additional knowledge and additional power into the tumor classification problem.
Now this is an example from my current study, which I am leading at the Vancouver Prostate Centre, on profiling high-risk prostate cancer patients using next-gen sequencing technology. This is brand new data. You see there are only a few samples; it's just an illustration that using next-gen sequencing technology we can do the same thing and profile, in a high-resolution manner, the alternative splicing profiles of tumor samples. There were just a few, one, two, three, four, five, six samples there, including two prostate cancer cell lines. On the left you see the unsupervised hierarchical clustering of the gene expression profiles, and on the right you see the clustering of differentially spliced genes in the same samples. At first glance, what you see is that the clustering is significantly different: there was a much tighter clustering of the splicing profiles compared to the gene expression profiles. And then again I saw very little overlap between differentially spliced genes and differentially expressed genes. So this again suggested to me, very similar to the breast cancer case, that transcriptional regulation and splicing regulation mechanisms act in parallel to modulate cell processes in prostate cancer progression.
Does one of those two approaches correlate more closely with a histological or clinical classification of disease?
So that's a very interesting question. So what
you see here... let's just take a look. I don't know yet; this is an ongoing study right now, and this is what I'm interested in answering. At first glance, what I see here is that I have two cell lines that are pretty much derivatives of each other. They differ in certain aspects, but one of them is parental and the other is a subline. What else I have is one patient with primary disease and a lymph node; in expression they cluster together, and these cell lines also cluster together.
I also had a second...
No, no, no, these are established cell lines that have been around for, I don't know, a decade now. So this is one patient. Here's another patient, which is a very interesting case. He had metastatic disease with two sites of metastasis: a urethral metastasis, which is quite common in prostate cancer, and also a penile metastasis. Histologically these looked very different from the usual urethral metastasis, and yet the expression profiles showed that these two samples cluster together. Now, if I look at the splicing profiling, what I see is that my urethral metastasis sample clusters together with the cell lines, and the penile metastasis clusters together with the primary and lymph node of a different patient. I cannot explain right now whether it is sensible, because I have very few samples, and my goal is to answer this particular question, to shed light on how much effect this splicing regulation has on the metastasis process in prostate cancer, into the urethra and into the penis. But I must tell you that the genes that were differentially spliced between the different types of metastasis and their corresponding primary tumors were completely different, and they were very much related to the organ of metastasis, which is good. So the urethra is very close to the prostate itself. What happens very often during surgery is that they do the transurethral resection of the prostate, which is called TURP, and when they do this there may be residual disease
left in that particular spot, and it is thought by clinicians that those remnants of tissue may give rise to a urethral metastasis. So from one standpoint it can be considered a local disease, because the urethra is very close to the prostate. But on the other hand, I'm sorry, the urethral met can be classified as a slightly different tumor; sometimes they just characterize it as a bladder tumor. It is still a question whether there is indeed a prostate tumor that keeps growing there, or whether that tumor has a different origin, say, because of the wound-healing processes that take place there certain pathways get turned on and then you get a malignant transformation at that particular site. So it is still an open question, but at this point urethral metastases are considered to be distant metastases, so it's not the same organ anymore. It's complicated, especially in prostate cancer. There are a number of surgical procedures that they do on patients, and there are a lot of definitions that differ from clinician to clinician and from center to center. It's a bit complicated.
So this represents differences between the samples themselves that are within the dendrogram?
Yes, it can be. So you can't infer cancer-specific splicing here; you can infer the tumor-sample-specific, either metastasis-specific or primary-tumor-specific, splicing. But you are absolutely right, and that's what we are doing right now. We were able to get hold of normal tissue, which is a cell line of finite lifespan that was established from a completely normal and healthy young male, which is as normal as we can get in the prostate field.
My other question is: is that a measure of the proportion of transcript that has an alternatively spliced event? So the red would be, I guess red is high, so red would be a greater proportion, because this is...
No, you can't really tell this out of this, unfortunately. The kind of inference that you are trying to make right now requires an
additional step of reconstruction of transcript structures. What this shows, rather, is just a feature-based signal. So yes, red is up and green is down, and what it can say is that, say, this junction is up in this group, but this may be a junction which is specific to the skipping of the alternative exon. This is what I am working on right now in collaboration with the SFU group from the computer science department; they are trying to reconstruct the transcript structure from this data.
Those are actually genes?
These are features, yes. Okay, so I think we are going to take a short break, and then we will continue with part 2 of module 3.