 Okay, if we could queue up the slides. So my name is Marco Marra, and I'm from the British Columbia Cancer Agency Genome Sciences Center. I'd like to thank Elaine and Linda for the opportunity to address you today on one of my favorite topics, sequence-based RNA profiling. And so in an apparently unusual move, what I'd like to do is start the presentation with the acknowledgments, and I'm not advancing the slides here. So maybe I can have some, oh, assistance from the guys in the back. The advance is not advancing. Yeah, I'm pressing the button. There we go. Oh, whoa, okay. That's number one. If we could go forward. Number two. All right. So this slide is a very high-level overview of the organization. This slide is a different slide. This is hands-free, guys. We're in the matrix. So this slide is a high-level organization of TCGA, if you like. And I haven't seen a depiction of this slide. Thank you. Okay. So I thought I'd throw it up there basically for two reasons. One, to emphasize the area that I'm going to talk to that's in the upper left of the slide. But also to point out that TCGA, I think, is remarkable for a bunch of reasons. One of the reasons I think it's remarkable is the focus, the emphasis that it's brought to large-scale analyses of cancer. And I think all of you sitting in this room are evidence that there's a very powerful and compelling attractive force to the availability of such data. So here's my acknowledgment slide. It's one of the more important slides I'm going to show you today. And my reason for putting it right up front is to acknowledge and thank all the individuals that have provided slides and concepts for this particular presentation. I want to apologize in advance if I've butchered any of your slides or their interpretation for those of you that have provided them. 
I am speaking today on behalf not only of the group in BC, but of a larger group that includes Chuck Perou and Neil Hayes at UNC and a large cohort of individuals at WashU with whom we've been working on AML fairly intensively. So thanks to you all. And most importantly, thanks to the patients, because without them we wouldn't have a project and we wouldn't have a motivation. So on to the business of the discussion. So RNA-seq is being done primarily in the United States, with some smaller contributions from ourselves, and we in BC are primarily generating, and to a more limited extent analyzing, microRNA sequencing data. And I've listed here just a few applications for those data modalities. I think some particularly exciting opportunities exist around novel analyses using these very sensitive measures. And I think in particular the opportunity exists for the first time to produce a map that wires together the exons that are expressed in various malignancies, and to use this information in novel ways. So we're excited to see those kinds of analyses emerge. So there is a large and growing cohort of RNA-seq data produced by the group at UNC. And thanks to Chuck Perou for the slide where he's showing the availability of more than 1,500 RNA-seq lanes and samples at the DCC. This is growing very rapidly. The other point to take away from this slide is the extent to which these RNA-seq data can be used to define cancer types and subtypes. It's a very beautiful and striking picture. Along with the evolving data sets that you'll find at the DCC, and indeed elsewhere, is an evolving cohort of tools. And this slide is too small to read, and that's probably the point: there are numerous tools that are available now and coming on board for analyses of RNA-seq data, both at the level of the entire gene and at the level of individual exons and exon-exon junctions.
Today I'm going to highlight some examples taken from two of the tools that we've had some focus on. That would be Trans-ABySS, which is a de novo assembly tool for RNA-seq data, and ALEXA-seq, which is used primarily for exon-level expression analysis. But before we go there, I want to take the opportunity to talk about how technology changes are going to mean that the amount, and hopefully the quality, of the RNA-seq data will increase and improve over time. So this is a very old slide that shows read numbers on the x-axis and the coverage of exons on the y-axis. This was done in a colorectal cancer cell line. And the purpose of the slide is simply to show the trajectory of gene discovery. So there's an initial rapid rise in the rate of gene discovery with increasing numbers of reads, and then the tail of the distribution indicates a relatively slow accumulation of new exons and speaks instead to an increasing coverage of the exons already found. And so when we started with the AML data, if we were to plot the number of reads on this trajectory, that's where we were. And to get to that point, about 125 million reads, required something like two lanes of information per case. With the evolution of at least the Illumina sequencing platform, this is about where we are today for TCGA RNA-seq data, where you're no longer at two lanes per sample, but two samples per lane. And even when multiplexing these samples, we have an increasing amount of data. So we expect the coverage of exons to be incrementally increased as well.
Okay, so I have enormous enthusiasm for the concept of these exon wiring maps, as I'm calling them. This is, I think, a primary strength of RNA-seq data: the ability to look at exon-level expression and also to look at novel structures that emerge from assembly of the RNA-seq data.
And so another very dense slide, whose only real purpose is to make the case that there are a large number of ways in which transcripts can be altered during splicing, and these alterations can in turn lead to effects on the protein product. And so in theory, one could believe that positive selection might be acting on many of these mechanisms of splicing to produce effects giving cancer cells growth advantages. So it behooves us, in the interest of comprehensive data generation, to start looking at this in great detail. So if one considers exon-level expression across a number of cell lines, one can see in the top panel the gene model of actin, excuse me. And below it, one sees a graph with a series of lines that represent expression levels at exon and intron, if you like, locations throughout that gene model. These are RNA-seq data where we've plotted the expression of these features across that gene model. And so here you can see some data from embryonic stem cell lines, some breast cancer lines, and for this gene there's no action. All the expression appears on this plot to be pretty equivalent. If you look at the expression in drug-sensitive versus drug-resistant colon cancer cell lines, you can find many examples where exons are dropped or retained. This is one example in which there is an exon skipping event, and it shows up quite nicely in this ALEXA-seq plot. So what you can see is an exon missing in one of the samples relative to the other, and that's indicated on the gene model diagram with the lines indicating the splicing events that this signifies. That's a relatively straightforward example. More complex examples are easily found. Here's a case where there's a whole lot of action going on around exon 9 in this particular gene, CA12, where there are exon skipping events that actually skip two exons, some that skip one exon, and so on and so forth, and these are predicted, for the most part, to have an influence on the nature of the protein product.
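As a rough illustration of the exon-level comparison just described, the sketch below flags exons whose change between two samples departs sharply from the change seen across the rest of the gene. It is not the ALEXA-seq method; the function name `skipped_exon_candidates`, the pseudocount, and the threshold are all assumptions chosen for illustration.

```python
import math
import statistics


def skipped_exon_candidates(sample_a, sample_b, min_log2_diff=2.0, pseudocount=1.0):
    """Return indices of exons whose change differs from the gene-wide change.

    sample_a, sample_b: per-exon expression values for one gene, in the same
    exon order. The threshold and pseudocount are illustrative assumptions.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must cover the same exons")
    if not sample_a:
        return []
    # Per-exon log2 ratio; the pseudocount keeps zero coverage well defined.
    log_ratios = [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(sample_a, sample_b)
    ]
    # The median ratio stands in for the overall gene-level change.
    gene_level = statistics.median(log_ratios)
    return [
        i for i, r in enumerate(log_ratios)
        if abs(r - gene_level) >= min_log2_diff
    ]


# Hypothetical values: the third exon is nearly absent in the second sample.
sensitive = [100, 95, 110, 105, 98]
resistant = [102, 97, 2, 108, 101]
print(skipped_exon_candidates(sensitive, resistant))
```

Comparing each exon against the gene-wide median, rather than against a fixed cutoff, keeps a gene that is simply up or down overall from being reported as a splicing change.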
And then finally, one other example, which shows that in this case the expression of particular exons is not necessarily reflected in the overall expression of the entire gene. So here's a gene-level expression analysis focused on TPM2 across two different subtypes of diffuse large B-cell lymphoma, and what you can see in these two subtypes, indicated in the different colors, are approximately equivalent levels of overall gene expression. But if you look in more detail at the exon level, what you can see is a subtype-associated differential expression in which an exon is either skipped or retained. So this makes possible some analysis of the details of the gene expression data at the level of these exons as opposed to the entire gene. Now these kinds of concepts in TCGA are being expanded upon and refined, and what I'm showing here is an elegant slide provided by Chad Creighton from Baylor, in which he's showing expression levels across 629 differentially expressed exons between two different stages of colorectal cancer, and showing that for these exons there is a linkage between splicing patterns and overall levels of gene expression. So this is where I think we are going.
So one of the things that we've focused on pretty heavily, at least in our group, is trying to find evidence of expressed structures that are not easily found using alignment-based approaches. And so some of these structures can be quite complicated and defy identification solely by alignment. And so we use de novo assembly pretty heavily to try and find such things. We use Trans-ABySS for this. There are other assembly tools available, of course, and shown here is simply a cartoon that emphasizes that for this particular assembly approach we like to have paired-end reads. We like to assemble those reads using an approach that you can read about in that reference. And we like to find reads that map to that novel structure and that actually span the fusion gene breakpoint.
We like then to align these contigs, as we call them, back to the genome as a verification of the fusion event. So this has been done pretty exhaustively, we think, for the AML RNA-seq data that we've been analyzing in collaboration with the folks at WashU. And shown here is simply a distribution of the different fusion events that have been detected using this assembly approach. Tim Ley showed this yesterday. And so we find both the things that we expect to find in AML data, those are the known things in blue, and, at relatively low frequency, a distribution of events that tail off. That includes some genes that we think might be interesting to consider in the context of AML. I should make the point, now that I'm thinking about it, that the Trans-ABySS pipeline that we offer is not automated from front to rear. There's a fair amount of manual interrogation that goes into this. But as a consequence, the verification rates tend to be very high. So if people are interested, there are individuals in this room, notably Gordon Robertson, who can give you clues and pointers as to the use of the tool. The other thing that assembly offers is the opportunity to detect more complex things than fusions between two genes. These would fall into two broad categories, for example partial tandem duplications and internal tandem duplications. And I won't take you through all the topography. The nucleating theme here is de novo assembly of the reads followed by alignment to aid in interpretation. And this too seems capable of finding known events as well as novel events in AML. We have yet to verify the novel events for PTDs and ITDs.
All right, so as we transit along the shopping list of things that you can do with the data, I'd like to spend some time discussing expressed mutations. And so what I mean by that are transcripts that seem to be encoded from loci that are somatically mutated.
So there have been a number of groups that have looked for mutations in RNA-seq data, and we're guilty of this transgression. As you might imagine, there is a fairly significant proportion of false negatives in relying solely on RNA-seq for mutation identification. The flip side, though, is that if you find such things, genes that contain a mutation and that are expressed, you have some knowledge of the context in which these mutations are expressed. And that can be useful in helping to think about the function that the mutations might impart. I think a good example is that of EZH2, where in the expressed data for diffuse large B-cell lymphoma we see over and over and over again transcripts that seem to be affected at a particular tyrosine residue within the SET domain of this important methylator of histone H3K27. So they're there and you can find them, albeit with some false negatives, perhaps approaching 50% depending on the sample that you're looking at. This is being, I think, aggressively pursued at the moment and is the topic of some hot conversation. I remember a discussion with Gatti earlier in the week where it was being argued that you can use RNA-seq for sequence verification of candidate somatic mutations from tumor-normal genome pairs. And this nice slide from Matt Wilkerson, I think, makes the point quite well. And so if you look at the text, what you'll see are some statistics based on medians. So in lung, you can find that about 66% or so of the candidate somatic mutations have RNA-seq data that map to that locus. Of those, about three quarters are detected, for an overall yield of about 50% in the RNA-seq data. So one can imagine, as these large projects go forward, one would use the RNA-seq data in combination with the genome data to verify the existence of mutations in a perhaps rather rapid fashion.
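The yield arithmetic quoted above (roughly 66% of candidates covered, about three quarters of those detected, so about 0.66 × 0.75 ≈ 0.50 overall) can be sketched as a small summary over per-site read counts. This is an illustrative sketch only: the function name `verification_summary`, the `(depth, alt_reads)` record layout, and the read-count thresholds are assumptions, not the procedure behind the slide.

```python
def verification_summary(sites, min_depth=1, min_alt_reads=1):
    """Summarize RNA-seq support for candidate somatic mutations.

    sites: list of (rna_depth, rna_alt_reads) tuples, one per candidate
    mutation called from the tumor-normal genome pair. Both thresholds
    are illustrative assumptions.
    """
    if not sites:
        raise ValueError("need at least one candidate site")
    # A site is "covered" when RNA-seq reads map to the locus at all.
    covered = [s for s in sites if s[0] >= min_depth]
    # A covered site is "detected" when reads carry the mutant allele.
    detected = [s for s in covered if s[1] >= min_alt_reads]
    return {
        "covered": len(covered) / len(sites),
        "detected_given_covered": len(detected) / len(covered) if covered else 0.0,
        "overall": len(detected) / len(sites),
    }


# Hypothetical cohort: 12 candidates, 8 covered, 6 of those detected.
example = [(20, 5)] * 6 + [(20, 0)] * 2 + [(0, 0)] * 4
print(verification_summary(example))
```

The overall yield is simply the product of the two conditional fractions, which is why a locus that is not expressed can never be verified this way, however good the caller is.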
Going even further, Angela in the Harvard group has been using RNA-seq data to confirm, if you like, fusions that are detected in low-pass coverage sequencing of colorectal tumors. And these just show some examples that Angela has provided in which it's possible to use this in a mode that's confirmatory not only at the level of the individual bases but for much larger structural effects. And so we can imagine that this validation produces some independent evidence of the existence of the fusion, but also provides some information as to how the fusion is wired together in the context of an expressed transcript.
Okay, so covering a lot of ground. I apologize for that. Shifting gears yet again, I want to have the opportunity just very briefly to make a few comments on microRNA sequencing, which is our principal contribution at this moment to TCGA. So there are now on the order of 3,000 or so microRNA-seq profiles at the DCC, representing something like 18 different cancer types, and the clustering diagram below simply shows the tumor sites for which these are available and the ability to use these data to correlate with disease pathology. So one of the main rationales for us when we entered into this was the opportunity to think about the interplay between messenger RNAs and microRNAs. That's this little diagram down here, and so we imagine that microRNAs can play regulatory roles that act at various cell biological levels, and so we were anxious to generate these data to enable analyses that would look at the correlation between microRNA and messenger RNA sequences. I'm not going to talk about any of that. You can speak to Gordon Robertson or Andy Chu, who are both here and can tell you the status of that.
Instead, I think what I want to point out is that, in addition to having the potential to look at these regulatory relationships, we think that we have some evidence that suggests that these poor old star microRNA guys sitting up there, that have been relegated to the degradation bin upstream of the gene regulatory process, may not actually be degraded in a way that prohibits their function in a regulatory context. There are also, as we're learning, all sorts of interesting features about microRNAs revealed by the sequence, including the addition of non-templated bases, which can expand the target range of these things, so clearly this is a fairly rich resource that I hope people out there are tapping into and analyzing aggressively. For AML, just as an example, there were on the order of 190 libraries sequenced, with an average yield of about a million mapped reads, detecting between 270 and 400-odd known microRNAs. Not that many novel microRNAs are found using this depth of sequencing and the approaches that we are using to process, but even so, what we have noticed is a very interesting distribution of the so-called star strand versus mature strand expression. And so there's a transcript up at the top, you see it there, with the indications of what will become processed star and mature strand microRNAs. If the star strand is indeed relegated to the bin, as it were, and chewed up in an upstream degradation, you wouldn't expect many of them to be around. And what we show here is the ratio, from the microRNA sequencing, of star strand to mature strand expression, and the box is around the case where the star strand is expressed at about seven times the level of the mature. And so clearly these are abundant, and there are probably regulatory roles that can be inferred from these, as well as different targets for their action. So again, this is an invitation, if you like, to get interested in this area and take it forward.
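The star-to-mature comparison described above amounts to a per-microRNA ratio of read counts. A minimal sketch, assuming counts are already tallied per strand, is shown here; the function names, the minimum-read filter, and the placeholder microRNA names are hypothetical and are not taken from the talk's pipeline.

```python
def star_to_mature_ratios(counts, min_mature_reads=10):
    """Compute star/mature read ratios per microRNA.

    counts: dict mapping a microRNA name to (star_reads, mature_reads).
    microRNAs with too few mature-strand reads are skipped so that the
    ratio is not dominated by noise (the cutoff is an assumption).
    """
    ratios = {}
    for name, (star, mature) in counts.items():
        if mature < min_mature_reads:
            continue
        ratios[name] = star / mature
    return ratios


def star_dominant(ratios, threshold=1.0):
    """Names of microRNAs whose star strand exceeds the mature strand."""
    return sorted(name for name, ratio in ratios.items() if ratio > threshold)


# Hypothetical counts; the names are placeholders, not real annotations.
counts = {
    "mir-A": (700, 100),  # star strand about seven times the mature strand
    "mir-B": (5, 500),    # the conventional picture: star strand is rare
    "mir-C": (3, 2),      # too few reads to trust the ratio
}
ratios = star_to_mature_ratios(counts)
print(ratios)
print(star_dominant(ratios))
```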
You can of course use the microRNA sequencing for clustering purposes. And so shown here on the right is a microRNA clustering experiment using the AML microRNAs. And on the left, RNA-seq clustering using the AML data. So these are the same cases. And I'm just pointing out areas in which the RNA-seq data agree with the microRNA data. That's the top line, I mean for all of you that aren't challenged with color. So we see here that for the M3 subtype cases the microRNA and messenger RNA clustering data yield about the same result. So they're concordant in this respect. Less concordance is seen in cases that have an NPM1 insertion, and that's shown below, where the microRNAs seem to be quite sensitive for such cases and the messenger RNA is less so. All right. So that was a quick tour through some of the things that you can use the data for.
Now I want to go into an area that I think represents a forward direction for TCGA. And this area really is sense and antisense gene expression and the consequences, or the potential regulatory consequences, of overlapping expression, in particular on the plus and minus strands. And so what do I mean by that? Well, I'm showing here a cartoon of sense and antisense genes, and just for convention purposes we'll call the top gene the sense gene. So these transcripts overlap, and what this diagram says is that when the gene on the lower strand is expressed there's a difference in splicing pattern, in the frequency of spliced transcripts, that is then found, as depicted in the lower panel. So when expression of the lower guy is on, you tend to see an increase in expression in alpha one. So there's actually a few examples of this, but not loads and loads and loads, in the literature. So what might this do? Well, splicing perhaps is influenced by this kind of expression. The other thing that has been linked to this kind of expression is silencing through methylation, either of chromatin or directly onto DNA.
So there's a linkage here that we can exploit in the context of a project like TCGA, where cases have both expression data and methylation data, as you heard from Peter Laird earlier in this meeting. A more comprehensive example, if you like, using Affymetrix exon arrays, is shown here. What I'm showing is the correlated and anti-correlated expression of two exons for the gene PARP9. Those are in blue and red, compared to the expression of an antisense gene, DTX3L, that's in black. And what you can see is that when DTX3L goes up, the expression of one of the exons in PARP9 goes down; that's the red line. And so there's this anti-correlation effect. And this is seen in a rather pleasing pattern across many different tissues. And so one speculates that this kind of effect might be general for this particular gene. And if you look in TCGA Affymetrix exon data, you can see on the bottom panel that there are quite a significant number of genes with sense-antisense correlated apparent splicing going on. And so a deeper analysis then is before us to address the question as to what the function of this expression might be.
So I think in order for TCGA to go there, and go there hard, we should be considering the use of strand-specific RNA-seq. This is not something that we've done. So strand-specific RNA-seq is a modification of the RNA-seq protocol that allows knowledge of the strands to which the individual reads map. The method that we're quite enamored with at the moment is the one published originally in NAR in 2009. And the bottom reference there is a rather detailed evaluation of the performance of a number of different protocols. So I encourage you, if you're interested, to have a look at these. So what I'm showing in the black track is RNA-seq data analyzed in a strand-insensitive way. So this is maybe what you would see if you didn't know about strandedness. You can see all the peaks where the reads are lining up across the gene models on the lower part of the diagram.
With strand-specific information, which is shown in the center in the orange and the red, you can now start to attribute the expression of these individual exons to the annotations below. This gives you, I think, a rather more detailed view of the pattern of expression of these closely linked and even overlapping loci. So there's information here. You might be able to figure out how the reads go if you have good annotation, but there are cases where some of the annotation isn't yet available, and that's one of the things I think this project is going to do. And so shown here on the top, again, is a strand-insensitive analysis of the orange peaks, but the only gene model available is the one at the bottom. And so if you resolve this into a strand-sensitive analysis, what you can see is that the gene model at the bottom is represented by the orange peaks and the big thing in the middle is something else entirely. So this has two meanings, I think. One is that you could misinterpret the expression of this gene by lumping all the reads in together. The other is that you've discovered a novel and potentially differentially expressed transcript in your analysis of cancer samples. This happens to be a non-coding RNA.
Okay. So just wrapping up, I'd like to emphasize what I said right at the beginning of the talk, which is that we are very pleased to be part of this project. I think it's an amazing union of intellectual energies, and thank you all for being interested enough to come to this meeting. It's fantastic. I'd like to acknowledge the foresight and vision of the funders, particularly NCI and NHGRI, and of course we are supported by the BC Cancer Foundation. So with that I'm going to stop, and we have 25 seconds for a question.
I have a question.
For the RNA-seq mutation confirmation, do you have to use a specific nucleotide as the verification, or are you looking at the whole gene of the
So when we're verifying the existence of mutations, I suppose different folks do it rather differently. In the examples that I showed, what we typically do is we match the RNA-seq data against the genome data, and if the two are in accordance at a particular locus then we would argue that that's a verification result. So it follows on the nature of the actual mutation in the DNA.
Because for using the RNA-seq for screening purposes, the false positive rate seems to be much higher if you're looking at samples that don't have the known mutation but you're suspecting it in either an exon or somewhere along the gene.
If you don't have a matched normal sample and you find something in the RNA-seq data, then you're going to have to concoct a rationale that doesn't involve somatic mutation, that's for sure. So you need some source of normal if you want to claim it's a somatic event. We have, I think, used the RNA-seq data fairly successfully in looking for evidence of recurrent expressed mutations, and in the absence of normal this doesn't prove that that recurrent event is a somatic mutation, because it could be a polymorphism, for example, but it does focus our attention on the things that we would take forward through to validation in tumor and normal. There's a lot of mutational noise in the RNA-seq data alone, so you have to be careful how you use it.
From our experience it seems that the false positive rate is much higher than what seems to be presented.
I think that's right. All of these false positive issues are related to the exact algorithm that you employ and the deductive process that you employ, but there are many more false positives if using RNA-seq solely for mutation detection is on the agenda. You have to be careful. I think with that, I want to talk to you later, but I want to keep the session moving forward here.
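The screening use mentioned in the answer above, looking for recurrent expressed variants when no matched normal is available, can be sketched as a simple tally across samples. This is an assumption-laden illustration: the function name, the `(chrom, pos, ref, alt)` tuple layout, and the coordinates are hypothetical, and, as the speaker notes, recurrence alone does not show that a variant is somatic rather than a polymorphism.

```python
from collections import defaultdict


def recurrent_variants(calls_by_sample, min_samples=2):
    """Find variants observed in the RNA-seq data of several samples.

    calls_by_sample: dict mapping a sample ID to an iterable of
    (chrom, pos, ref, alt) tuples. Returns a dict mapping each variant
    seen in at least min_samples samples to the sorted sample IDs.
    Recurrence only prioritizes candidates for tumor-normal validation.
    """
    samples_with = defaultdict(set)
    for sample_id, calls in calls_by_sample.items():
        # De-duplicate within a sample so each sample counts once.
        for variant in set(calls):
            samples_with[variant].add(sample_id)
    return {
        variant: sorted(ids)
        for variant, ids in samples_with.items()
        if len(ids) >= min_samples
    }


# Hypothetical calls; coordinates are made up for illustration.
calls = {
    "case1": [("chr7", 1000, "A", "T"), ("chr2", 500, "G", "C")],
    "case2": [("chr7", 1000, "A", "T")],
    "case3": [("chr7", 1000, "A", "T"), ("chr9", 42, "C", "G")],
}
print(recurrent_variants(calls))
```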