So this first slide is just the title slide introducing me and the topic, with this picture that I like to use as an introduction to RNA sequencing. It was created by a colleague of mine, Rodrigo Goya in Vancouver, and shows a pre-mRNA molecule being spliced into a mature mRNA molecule, then short reads being generated from that RNA by some kind of next-generation approach, and then those reads being mapped back to the RNA to show what a split-read alignment looks like when it spans an exon-exon boundary. We're going to talk a lot more about the details of that process throughout this presentation, and please do feel free to ask questions as we go. So just briefly to outline the goals of this module: basically we'll try to introduce you to the theory and practice of RNA sequencing. For some of you, presumably, this will be something of a review, but we'll go over the rationale and some of the challenges and common questions that come up in RNA-seq experiments. The idea of this whole tutorial and lecture is to act as a practical resource for those who are new to RNA-seq analysis. With that goal in mind, I provide a working example of an analysis pipeline, and it's meant to be generally self-contained so that you should be able to take it away and run it again on your own time. There are many different types of analyses we can do with RNA-seq data, but our example workflow is going to be based on gene expression and differential expression. So just as a quick review of the biology, for most of you I'm sure this will be well understood, but just so that we're all on the same page, I usually show this cartoon of gene expression starting with a double-stranded genomic DNA template, which has various features that we'll talk about in terms of gene annotation. So in this example you have three exons and two introns, and the gene has features such as a promoter region and so on. 
This region of the genome gets transcribed into a single-stranded pre-mRNA molecule which still has the introns intact. There are additional features on that pre-mRNA molecule that allow the splicing machinery to recognize where the exons start and end and where the introns need to be removed. Once this RNA processing occurs you have a mature mRNA that's been capped and polyadenylated, the exons have been assembled, and this molecule will get exported from the nucleus into the cytoplasm to be translated into protein, which will then be folded. So really the main point of showing this is to note that we're focusing on this type of molecule. For the most part, RNA-seq libraries, at least most of the ones I've seen, are generated from polyadenylated mRNA: someone has isolated total RNA, enriched for poly-A RNA, made cDNA from those RNAs, and fragmented them. But of course variations of that strategy are possible. So this slide shows an overview of the whole process of RNA sequencing, where the analysis is just kind of a black box, starting with some samples of interest. In our tutorial we're going to use two samples. They're both from colon: one is a colon tumor and the other is a matched normal piece of colon tissue from which RNA was isolated, and that RNA was used to generate cDNA. The cDNAs were then fragmented into smaller pieces, and those cDNA fragments had linkers attached and were selected to be approximately a particular size, usually in the range of about 250 base pairs. 
Those fragments would then be flowed across an Illumina flow cell, which is shown here with eight lanes, and this is an image from the flow cell being analyzed. The Illumina instrument generated hundreds of millions of paired-end reads, which comes out to tens of billions of bases of sequence. What's being depicted here is a cartoon version of a paired-end read, where you've got a blue and a red box connected by a dot. That's meant to represent read one and read two of a read pair, with some space in between them which is variably called the insert size, or the mate inner distance, or a variety of other terms, but basically you have these two reads that correspond to the ends of the cDNA fragments that we started with. This is going to be the raw data we're starting with, so we're going to start with files that contain millions and millions of lines, each of which represents either read one or read two of a paired-end read. We're going to take those reads, map them to a combination of the genome, the transcriptome, and exon junctions, and then do downstream analysis on those mapped reads. Any questions on that? Sorry? [Audience question about why reads come in pairs.] Initially it was because the read lengths generated by next-gen sequencing instruments are not long enough: if the read can only be 40 base pairs long you don't really want to be sequencing 80-base cDNA fragments because they're kind of too small, and you get more information by having short reads that correspond to the ends of a larger fragment. We're starting to get back to the point where the read lengths are approaching the lengths where we can actually sequence from one end and the other end and the sequences might actually meet in the middle, and then in a sense they are still paired-end reads, but in a sense they aren't. 
Effectively you can have one read being generated, and we're sort of at the transition where people are starting to think about whether we should make our fragments larger so that we still get paired-end reads, and what the advantages and disadvantages of that are, but yeah, it's basically a technological limitation. The reads initiate from the linkers inwards. They each have a common linker, yes: a read-one linker and a read-two linker. So read one is from one end and read two is from the other end, so they're facing each other. It depends on the sequencing technology, so there are paired reads that are in a different orientation than the orientation being depicted here. For example, PacBio has an instrument that does a concept of strobe reading, where they read, then stop reading, then continue reading, and those reads wind up facing the same direction, and if you have two of them you might kind of think of them as paired reads, and you can do multiple strobes. You could have reads one, two, three, four, and they would all be in the same direction. And instead of making cDNA fragments and putting reads on either end, if you make a mate-pair library instead, which involves circularizing DNA and then cutting a piece out, you can wind up with the effective orientation of the reads being different again, so it does depend. It's something to think about when you first get some new data: understand how that data was generated and whether the orientation of the two reads is what you think it is. So you'll have two sets of reads: one file will correspond to all of the read ones and one file will correspond to all of the read twos, and in each case those will match up, so you'll have one read for the first part of a fragment and one read for the second part of that fragment, for every fragment. 
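That file pairing can be sketched in a few lines of code. This is a minimal illustration, not any particular tool's implementation: it walks two FASTQ files in lockstep, assuming (as described above) that record N in the read-1 file is the mate of record N in the read-2 file. The file paths are hypothetical.

```python
from itertools import zip_longest

def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()                 # the '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def read_pairs(r1_path, r2_path):
    """Walk two FASTQ files in lockstep, yielding (read1, read2) mates.
    Assumes record N in the read-1 file matches record N in the read-2 file."""
    for r1, r2 in zip_longest(read_fastq(r1_path), read_fastq(r2_path)):
        if r1 is None or r2 is None:
            raise ValueError("read 1 and read 2 files have different record counts")
        yield r1, r2
```

Because real aligners rely on that positional correspondence, any filtering you do before alignment has to drop both mates of a pair together, or the files fall out of sync.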
I mean, this is just an abstraction; maybe you're getting at strandedness. There's no strandedness to a lot of these libraries; you just randomly sequence from one end and the other end. Okay, so I put this slide up because often I'm in a crowd where there are actually not that many people interested in RNA-seq analysis, and a lot of people doing mutation calling or other things, copy number analysis, structural variant detection in DNA, and usually there are like three or four out of 20 people interested in RNA-seq. So I have some rationale slides for why you would sequence RNA. It sounds like I don't really need to convince this audience, but I'll go through it quickly anyway because it's interesting to think about. Basically, from a biological perspective, why would you sequence RNA? For functional reasons: there are experiments you could easily design where the genome may be constant but some experimental condition is having an effect at the transcriptome level. You could imagine a cell line where you have a drug-treated versus untreated state; effectively the genome of that cell line is static, but the transcriptome may be changing radically in response to the drug, and similarly with many other model systems you could think of. There are definitely some molecular features that can only be observed at the RNA level: for example, alternative isoforms, fusion transcripts, and RNA editing are really only observable at the level of the RNA, so you can sequence DNA all day long and not really learn that much about them. 
In particular, predicting the transcript sequences from the genome is very difficult, and there was a field of bioinformatics that attempted to do this for a long time: sequencing genomes, analyzing the features of the genome, and trying to predict what the transcripts would look like. For a variety of reasons this was deemed an efficient approach, or something to do when you didn't have an alternative, but it's very difficult. It's much easier, if you can, to just sequence the transcriptome and align it back to a reference genome or assemble it de novo (not to say that those things are easy, but easier). There are also circumstances where it's very useful to combine RNA-seq data and whole genome or exome data, for example interpreting mutations that might not have an obvious effect on the protein sequence. Take regulatory mutations: you may identify some mutation in a tumor that doesn't appear to occur in an exon, and therefore you're not quite sure what its function is. If you sequence the transcriptome you may find that it affects the expression level of a nearby transcript, and then you've been able to assign a putative function to that mutation. There's a variety of categories of regulatory mutations that would fit into that scenario. Another scenario is prioritizing protein-coding somatic mutations. This is probably one of the most straightforward things people are doing when combining transcriptome and whole genome data: identify mutations in the genome and then determine which of those mutations are actually expressed in the transcriptome, and to what level, to help you prioritize a list of mutations. So you can, in theory, bin your mutations into those that are in genes that are perhaps not expressed at all in your tissue, and therefore might not be as interesting. 
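That binning step is conceptually just a filter. Here's a minimal sketch of the idea, with entirely made-up gene names, variants, FPKM values, and an arbitrary expression cutoff; it is an illustration of the prioritization logic, not anyone's published method.

```python
def bin_mutations_by_expression(mutations, fpkm, min_fpkm=1.0):
    """Split (gene, variant) mutations into those landing in expressed
    vs. apparently unexpressed genes, using a simple expression cutoff."""
    expressed, not_expressed = [], []
    for gene, variant in mutations:
        if fpkm.get(gene, 0.0) >= min_fpkm:
            expressed.append((gene, variant))
        else:
            not_expressed.append((gene, variant))
    return expressed, not_expressed

# Illustrative inputs, not real data.
mutations = [("TP53", "R175H"), ("OR4F5", "A12V")]
fpkm = {"TP53": 42.7, "OR4F5": 0.0}
expressed, not_expressed = bin_mutations_by_expression(mutations, fpkm)
```

In practice the expression estimates would come from your RNA-seq quantification and the mutation list from your genome or exome pipeline; the cutoff is a judgment call that depends on the tissue and depth.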
Then you have this notion of allele-specific expression, where the gene may be expressed but perhaps only the wild-type allele is expressed, and that might suggest that the mutation is a loss of function, or haploinsufficiency. If the mutant allele is expressed preferentially, on the other hand, that might be a good candidate for a drug target, or it might be a higher-priority mutation for further study. There are definitely papers that have been written about it, and there are alignment strategies that are tailored towards it. I'm not aware of any tool that specifically takes a combination of whole genome and transcriptome data and gives you a push-button "here are your allele-specific expression results." I would be interested to know if someone does have such a tool; it's probably only a matter of time. I guess I'm supposed to be repeating these questions, right? Okay, so let's move on to the challenges. Some of the challenges of RNA-seq relate to the sample itself: purity, quantity, and quality seem to always be an issue when you're dealing with RNA. Another feature of genomes is that genes consist of small exons which, in species like human and other mammals, tend to be separated by very large introns. We're going to do all of our tutorials based on human, and you'll see that mapping the reads can be quite challenging. The reason is that you have these short reads and short exons and you're trying to match them up, but you have these huge introns in between, so it can be hard to correctly identify when a read spans across an exon-exon boundary if you've got a couple of small exons and 50,000 bases of intronic sequence in between. Another issue that doesn't really come up when you're sequencing DNA is the relative abundance of the species that you're sequencing. 
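Circling back to the allele-specific expression idea for a moment, the core comparison is simple enough to sketch: count reads supporting each allele at a variant site in the DNA and in the RNA, and look for a skew. All the counts and the 0.25 skew threshold below are made up for illustration; real analyses would use a proper statistical test rather than a fixed cutoff.

```python
def allele_fraction(ref_count, alt_count):
    """Fraction of reads supporting the mutant (alt) allele at one site."""
    total = ref_count + alt_count
    return alt_count / total if total else None

# Illustrative counts, not real data: a site that is heterozygous in the
# tumor DNA (~50% alt) but where the mutant allele is depleted in the RNA.
dna_vaf = allele_fraction(48, 52)   # roughly 0.5 in the genome
rna_vaf = allele_fraction(90, 10)   # 0.1 in the transcriptome

# A big drop from DNA to RNA suggests the wild-type allele is
# preferentially expressed; the 0.25 threshold is an arbitrary example.
wild_type_skew = rna_vaf is not None and dna_vaf - rna_vaf > 0.25
```

A skew in the other direction, toward the mutant allele, would flag the higher-priority candidates mentioned above.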
So in DNA there is some variation because of copy number variation, but basically you expect approximately uniform coverage of reads across the genome; this was mentioned in Michael Prudino's talk yesterday. In RNA you don't expect that to be the case at all: a lowly expressed RNA and a highly expressed RNA are in very different categories, so the range of relative abundances of RNAs varies wildly, and you'll hear estimates anywhere from 10 to the 5 to 10 to the 7. I think in this paper here Trapnell actually quotes 10 to the 10, so from 1 to 10 to the 10. You can have a biologically functional transcript that's present in a few copies per cell, while for the next gene over to be biologically functional you need tens of thousands of copies in that cell. This creates a challenge for sequencing, because RNA sequencing works by random sampling. We're going to basically be grabbing reads randomly from these fragments that were generated from a pool of RNA, but the highly expressed RNAs are really going to consume a lot of our reads: when we randomly pull these fragments out we tend to pull the same fragments over and over and over again, and we may wind up having to sequence much deeper than we would hope in order to accurately profile lowly expressed transcripts. [Question: is there a reason we only target mature mRNAs and not unprocessed RNAs?] I guess the first answer is that of course we don't just target mature mRNAs; we enrich for them to simplify the search space, but the unprocessed RNA is still there in some amount, so we'll definitely be sequencing some of it. If for some reason you had a particular interest in the processing of RNA, then you might make your library in a different way that didn't enrich for polyadenylated mature mRNAs, and that's just a choice of the experimental design. But there can be challenges associated with that, because one way to deal with this situation is to remove some of the most highly expressed transcripts, and a lot of those really highly expressed transcripts are not polyadenylated. So by doing the poly-A selection you remove a lot of redundancy in your sample, and it helps with this problem, so you don't have to sequence quite as deeply; but there's no reason why you can't make your library in a different way that would be suitable to that kind of experiment. Similar to the relative abundance issue, RNAs also come in a wide range of sizes. If you're sequencing human chromosomes, they're generally all just very large and you don't think much about their relative sizes, but when you're profiling RNAs you do tend to think about what size they are, for a variety of reasons. One is that there are some functional RNAs that are very small, and because of the way we make the libraries they may not be captured effectively. As far as I know there's not really a great library preparation method that will cover all types of RNAs; generally what people are doing now is creating one library for mRNA and a second, separate library (or a series of libraries in some cases) for small RNAs. MicroRNAs in particular tend to be treated as a completely separate problem. Also related to size: for very large RNAs, because RNA is fairly fragile, if you do a poly-A selection you may lose the five prime ends of the larger RNAs. If you think about a bunch of RNAs floating around in solution, if some of those RNAs have been broken in one or more places and you grab onto the three prime end of them and wash everything else away, you will lose the five prime ends of some species. The overall result in your libraries is that you'll have bias towards the three prime end of your transcripts, and this is very common in RNA libraries: you'll see increased coverage towards the three prime end of transcripts, in particular if you've done a poly-A selection or if your library construction somehow involves priming off of a poly-A tail. 
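The random-sampling point above is easy to see with a toy simulation. This is a deliberately simplified sketch (two transcripts, made-up copy numbers, no length or bias effects): each read is drawn from a transcript with probability proportional to its abundance, and the highly expressed transcript swallows nearly all of the reads.

```python
import random

def sample_reads(copies_per_cell, n_reads, seed=0):
    """Simulate shotgun sampling: each read is drawn from a transcript
    with probability proportional to that transcript's abundance."""
    rng = random.Random(seed)
    names = list(copies_per_cell)
    weights = [copies_per_cell[n] for n in names]
    counts = dict.fromkeys(names, 0)
    for name in rng.choices(names, weights=weights, k=n_reads):
        counts[name] += 1
    return counts

# Toy transcriptome: one transcript at 10,000 copies per cell, one at 10.
counts = sample_reads({"high": 10_000, "low": 10}, n_reads=100_000)
```

With a 1000:1 abundance ratio, the lowly expressed transcript gets only on the order of 100 of the 100,000 simulated reads, which is exactly why you have to sequence much deeper than naive coverage arithmetic would suggest to profile rare transcripts accurately.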
[Question:] So given all these challenges, if there's a sequencing center that wants to try RNA-seq, and they want to run a control to see how good they are at it, are there published data sets that you could judge your results against, and how well would you expect to do in terms of reproducibility? There are a lot of RNA-seq data sets that have been published, but they're very heterogeneous in the way they were made: the library construction techniques that were used, the sequencing technology that was applied, the read lengths, the fragment sizes, so many variables that one of the challenges will be finding something similar to what you intend to do with the current state of the art. [What about a control set that's very well described?] Yeah, I think what a lot of people do there is get RNA from a cell line that's been well characterized and that a lot of other centers are perhaps also sequencing. In our lab we use LNCaP, which is a prostate cancer cell line, and we've just sequenced it to death. Every time we try a different poly-A selection strategy, or we get a different kit for making cDNA, or we want to experiment with some significantly different analytical approach, we go back to that RNA or that data, and I think a lot of places will take that strategy. The nice thing about working with a cell line is that you have a basically unlimited supply of high-quality RNA, or you can purchase RNA in large quantities as well. [Question about degradation.] If you do a poly-A selection you may have washed away five prime ends in a degraded library; if the library is not degraded then it should not be an issue. But say you have a 20,000-nucleotide-long transcript and some amount, even a low level, of degradation in your RNA: the probability that a break occurred is higher in a long sequence than in a short one, so for very small RNAs you don't worry as much about RNA degradation because they're smaller. But you'll notice it, because the breaks will be random, so the five prime ends of different copies will be different; yes, they will have different random ends, but you're always grabbing onto the three prime end, so you will tend to have gradually lower coverage on average as you get out to the five prime end. It's a common QC measure in RNA-seq labs to produce an end-bias profile, or something with a similar name, where you basically plot the coverage you observe in your reads across transcripts of different sizes, from the five prime end to the three prime end, and you'll see a sort of hump-shaped curve where you don't get very good coverage at the ends. In a really degraded library you'll see a big peak or spike at the three prime end and a kind of tail going out to the five prime end, where you're getting progressively lower coverage on average. [Question: if you get end bias, does it affect the mapping or your expression statistics?] It definitely has the potential to. It won't affect your ability to map the reads; the reads you get will still be mappable, and you will probably still be able to get good gene expression estimates. You can get great gene expression estimates just by specifically profiling the three prime end of transcripts; that's the basis of a lot of the Affymetrix microarray designs, and SAGE libraries basically focus on the three prime end of transcripts. So you can get gene expression estimates, but you will not be able to effectively profile the status of mutations that occur across the length of the transcript, you might not be able to assemble full-length transcripts if you're trying to determine what your transcript structures actually look like, and you may miss alternative splicing events. And in terms of comparing two libraries, if the 
degree of end bias is variable, and this tends to happen: say you have 20 patient samples, all from a clinical setting, and they're all degraded to different extents, so they each have this effect to a different level. That introduces bias into your differential expression analysis, so it's definitely a concern; if you do have end bias, you at least want it to be consistent across the libraries you intend to compare. Why don't I go to the next slide, where I'm showing an example of an Agilent QC trace. The question is whether there's a cutoff in terms of purity or quality in your RNA below which you wouldn't even want to go forward with RNA sequencing. Unfortunately it's not a simple question to answer, but I'll start by showing this example of an Agilent trace. This is the output from an assay called the Agilent 2100 RNA Nano, or something like that; it's got a big long name. Basically you're running RNA samples through a device, an Agilent lab-on-a-chip, and this is a very common way that people QC RNA samples. How many people here are familiar with this assay? Okay, quite a few of you. Basically the idea is that you're effectively running a gel. In the good old days people would get an RNA sample, run a gel, and look at how smeared the RNA was to get a sense of how degraded it was. If it's not degraded you'll get two really bright bands near the sizes you would expect for the ribosomal species; shown here are the 18S and 28S peaks for human. So this is an example of an RNA, from a cell line, that had very high quality: if this were a gel you'd have two very bright bands and everything else would be black, so basically no smearing. But instead of running a gel we're running this sample through a capillary, and every time you would get a band on your gel you effectively get a peak instead, so it's like a way of running a gel in a more quantitative way. On the x-axis you basically have time, and on the y-axis you have fluorescence units, how bright the effective band was, and because you've run a ladder with known fragment sizes you can convert time into nucleotide sizes. You can then analyze this trace and come up with an empirical notion of quality and assign a score, and the software that comes with the machine will assign an RNA integrity number, or RIN; these are commonly reported in RNA-seq experiments. A perfect score is 10 and a very bad score would be 0, and the numbers in between give you a crude sense of how degraded your sample is, though there are various caveats associated with that. What's being shown here is a sample that's degraded: instead of just your two ribosomal peaks, you're seeing those peaks but also a lot of peaks that are smaller. Basically this is degradation; you've got RNA of those two sizes being broken into smaller and smaller pieces and creating a smear, which is interpreted as a series of non-distinct spiky peaks, and the more of this is going on relative to the two peaks that would correspond to intact RNA, the lower your RIN score gets. Some labs will pick, somewhat arbitrarily, a RIN score that they feel is a minimum, so they'll say if it's less than 8, which is a common number, we shouldn't try to make an RNA-seq library out of the sample. And then the next week they'll break that rule because they have some collaborator who really wants the data, and some analyst will be stuck with the problem; not deliberately, but this happens accidentally all the time. And I can't think of an example where someone has, I mean, it would be interesting to take a sample, degrade it systematically, and try 
to identify samples that had RINs of 4, 5, 6, 7, 8, 9, 10, produce RNA-seq libraries from each of those, and see what effect it has on a variety of downstream applications. That would be really interesting; it's exactly the kind of experiment that should be done, but no one will pay for it. Again, if anyone knows of an experiment like that, please let me know. [Question about FFPE samples.] I mean, they're bad, and for FFPE you can easily have very low scores; these scores don't really mean much below five anyway, but I've seen RNA from FFPE that's anywhere from, you know, two point something up to maybe six or seven, which would be a really great RNA sample for FFPE, and generally the result is not that great; you tend to have a lot of issues with those libraries. [The question is basically whether you should worry about the RIN score as much if you're doing a microRNA library.] I haven't done that, but I have heard anecdotal accounts that libraries made from degraded RNA, even though, like you say, you shouldn't expect it to be as big of a problem, are still problematic. But that's pretty anecdotal. Ideally, even though you're dealing with these small species, you would still prefer to have intact RNA, but it definitely should be easier than if you were trying to profile full-length mRNAs that are 2 to 3 kb. It depends on the nature of the RNA degradation, among other things. Okay, so some design considerations. I mentioned this standards, guidelines, and best practices document. It's basically a list of practices relating to how you would make the RNA, but a lot of it has to do with analysis: what kind of metadata should you supply with data that you deposit to a public repository, how many replicates should you do, what kind of sequencing depth you need, what kind of control experiments you should consider, spike-ins, all of these kinds of things. It's basically reporting standards, a big list of things that you should do that nobody does, but it's still worth considering which of them might be practical to implement in your lab. I mean, none of them are bad ideas; some of them are just costly, time-consuming, increase the complexity of your pipeline, etc., and you may decide to do without them. Everyone is doing that to some degree, but it's definitely worth considering what the best practices are and how close you can get to them in practice while still meeting your financial and other constraints. So the first thing I mentioned was replicates, which of course are very useful. The data we're going to analyze today involves replicates; in many experimental designs you may not have them, but you should really consider, if possible, generating replicates, and if you do, what kind of replicates should you make? There are a variety of kinds; I think about three categories: technical, experimental, and biological. Technical replicates: it kind of depends on how you think about it, but what a lot of people mean when they say technical replicates in RNA-seq is taking the same library and running it twice, having two lanes of data generated for the same physical library. And the thing I would say about that is to probably not bother. I think it's pretty much been established at this point that if you compare one lane of RNA-seq data to the lane beside it, and even to a large extent to the next flow cell that runs on that machine, or even the next machine over, at least for Illumina, the results seem very consistent. You generally get very consistent results by comparing a couple of lanes; it's not that much different from taking one lane, dividing it into two bins, and treating those as technical replicates, unless something has really gone wrong. I mean, you still have to watch out for problems: there may be some kind of defect in the flow cell, the machine may have a power outage, of course things can go wrong. But generally, in a production environment when things are running smoothly, those kinds of replicates... in an ideal world you would always do technical replicates, but if you have limited resources they're probably not the most important thing. Biological replicates, of course, are always important: if you're working in a biological system, which has variability, which of course it does, you want biological replicates. Experimental replicates are a sort of third category, where I would say you take some RNA, or even start at the RNA isolation step, and you repeat the whole lab process: you take a sample, you isolate RNA, you make a library from it, and then you sequence that library. That whole process does have variability in it; you may have a different operator making the library, different technicians involved, different buffers. You can try to keep these things consistent, but it's a good idea, especially if you're just starting out on a series of RNA-seq experiments, to get some sense of how variable the results are. If you process the same cell line, say you take your LNCaP cell line and go through the whole process from tip to tail, then start back at the beginning and do it over again on a different day, and see how comparable the results are, and do that a couple of times, it may identify problems in your process. For biological replicates there's not much I can say other than: do them. It really depends on the nature of the biological question. Generally, for both technical and experimental replicates, you should be shooting for a Pearson correlation above 0.9; in my experience, if that isn't the case then something has probably gone wrong. So what are some of the common goals of RNA-seq analysis? The one we're going to talk about mostly today in the tutorial is gene expression and differential expression, but one 
But one of the reasons people are excited about RNA sequencing is all of the other types of questions you can ask of the data. You can look for alternative expression. You can use RNA sequencing as a means of transcript discovery and annotating a genome that previously has not been annotated; it's actually probably the cheapest, most efficient way to annotate a genome now. You can look at allele-specific expression, which could relate to common polymorphisms or to mutations you identified in the genome. You could use your RNA-seq data directly for mutation discovery, and some groups have been doing this; it's challenging, but it can be done. Fusion detection is sort of a hot topic involving RNA sequencing; it would be interesting to have a tutorial on that, but we just don't have time. RNA editing is also a hot topic lately in certain circles, and fairly controversial, so a lot of people are interested in that, and there are probably others I'm forgetting. Each of these things has a general theme when you convert it to an analysis workflow, so I thought I would go over that basic theme. They generally have some distinct requirements and challenges, but for the most part what you're going to be doing is getting some raw data off an instrument, and at this step there will often also be a requirement to convert from one file format into another. Then there will be an align-or-assemble step, where you take your reads and align them to a reference genome if you have one, and if you do not, you do some kind of de novo assembly of your reads into transcripts. Then you process these alignments with some tool that's specific to one of the goals I mentioned on the previous slide. That's generally how these tools get divided up: they take a BAM file that already has alignments and examine that BAM in a particular context, looking to generate expression estimates, or identify alternative isoforms, or call variants, or whatever, and usually there are separate tools for each of those things. So, for example, we're going to use Cufflinks for expression analysis; you could use deFuse or ChimeraScan or something else for fusion detection, and so forth. After that, well, these tools do not produce a paper for you, so there will be some kind of post-processing step. In actuality they generally produce a rather esoteric, complicated series of output files that only the author of the tool can fully understand, so usually there is some further massaging of the data required, and if not that, at the very least some kind of visualization or statistics. Usually the situation you'll face is trying to parse the output of these tools into something a bit more digestible, and then feeding that into another tool that will let you visualize the data, do some statistics, and basically make figures for your paper. There are some tools starting to be developed now that are specifically tailored to summarizing and visualizing the output from these other tools, but I'd say at this stage there are not a lot of great options for that purpose, so there's a lot of custom scripting going on. [Question] Yes, it can be. Actually, the tutorial, which you have already seen, was created to try to produce an output that is as microarray-like as possible, and that can be a conscious decision to allow you to feed into pipelines that have already been developed. A lot of things may be very common once you get the data into a format that looks like what you would get out of a microarray, so you could then feed it into pathway analysis or various other types of analyses. [Question] Yeah, so microarrays were never really used for fusion detection.
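Going back to the post-processing step for a moment: the "massaging" usually amounts to reducing a tool's output files to a simple table you can work with. A minimal sketch is below; the file layout is invented for illustration, and real tool output (for example Cufflinks' tracking files) has many more columns:

```python
import csv
import io

# A made-up, minimal stand-in for the kind of tab-delimited table an
# expression tool might emit; real output files are far more complicated.
raw_output = """gene_id\tlocus\tFPKM
TP53\tchr17:7565097-7590856\t42.7
KRAS\tchr12:25358180-25403854\t18.3
MYC\tchr8:128748315-128753680\t210.9
"""

def parse_expression_table(text):
    # Reduce the tool's output to a simple gene -> FPKM mapping
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["gene_id"]: float(row["FPKM"]) for row in reader}

expression = parse_expression_table(raw_output)
print(expression["MYC"])  # 210.9
```

Once you have a gene-by-value table like this across samples, it looks a lot like a microarray expression matrix, which is why downstream pathway tools can often be reused.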
They were also very difficult to use for alternative splicing analysis; the tools related to that were really just starting to be developed when next-generation sequencing took off, and they tend to be quite specific to the design of the microarray, so it can be very forced to get your data into a format that would allow you to use those tools. In a lot of circumstances it's more work than it's probably worth, and also, by thinking of an RNA-seq experiment as a microarray experiment, in a lot of ways you're limiting yourself and undoing the advantages of RNA-seq in the first place, which is that you have this ability to profile the transcriptome in a very comprehensive, unbiased fashion. From a bioinformatics perspective, one of the most annoying things about microarray analysis was being limited by the hypotheses framed by the designer of the microarray: you're limited by the genes they put on there, the exons they profiled, the exon-exon connections represented by the probes on the array, and once that's done, it's done. If you want to change the design, you have to go back, make a new chip, hybridize, scan, do everything over again, whereas RNA-seq doesn't have that problem; you're just randomly sampling the transcriptome, and if you want to ask whether a different gene is there, say you learn about some new gene in some other experiment, you can just go re-examine the data without going through a design step again. That's really one of the most powerful things about RNA sequencing. This next slide I threw in as a resource, listing a few tools for each of the different categories, in case someone is interested in one of those particular questions you can ask of RNA-seq data other than the expression and differential expression we're going to talk about. We're really going to focus on this Cufflinks/Cuffdiff pipeline, but there are some others here in case you're interested. And then I also want to plug these two forums, SEQanswers and Biostar, which are sort of complementary. SEQanswers is a forum that has a lot about bioinformatics, but also a lot of the lab details, so things about library construction or RNA quality, all those kinds of questions; pretty much any question you might ask today, probably someone has asked there, and there may be some useful answers. Biostar is a similar concept but really focused on the bioinformatics end of things, so there are endless people talking about which tools they like and problems with those tools. I'd highly recommend you check those out. [Question about de novo assemblers] Some of them include downstream steps, so they have a step that will assemble your raw reads into contigs, and then some subsequent step that maps the reads back to those contigs, or attempts to extract read counts from the assembly process and give you some notion of expression level. Which ones do I list here? I would say definitely Trinity and Trans-ABySS have some concept of that. How well it works, I think this is an area of very heavy, active development at this stage, the whole transcriptome assembly and characterization problem, so I'm kind of dodging the issue. But for a lot of people who ask me about doing analysis without having a reference genome, say you have a lab and you intend to embark on five years of experimentation on a particular model species without a reference genome, usually my first question is: why aren't you sequencing the genome first? Given the cost of genome sequencing at this stage, maybe you should make the reference genome and then proceed to the transcriptome analysis. There may be good reasons why you can't do that; it depends on the size of the genome, its complexity; there are lots of considerations.
But it's worth thinking about. Okay, so the next section is common questions. I don't think any of these have actually been asked so far, but if they have I'll just gloss over them; these are the questions that tend to crop up over and over again. One question people commonly ask, related to the early stages of analysis, is: should I remove duplicates for RNA-seq? In DNA next-gen analysis it's very common to remove duplicates; basically every SNP or mutation calling pipeline has a preliminary step that removes duplicates by identifying paired reads that appear to map to the same coordinates and then collapsing those down to a single representation. The idea there is that you don't expect these duplicates to occur by chance, and if they do occur, perhaps they represent PCR amplification artifacts, and we don't want those in the data set, so we remove them. A good whole-genome sequencing library will not have very many duplicates: you'll look at a BAM before and after duplicate removal and maybe 5% or 10% of your reads will be removed. Of course that depends on how deeply you sequence the genome, but assuming 30 to 40x coverage, a good library will not result in that much duplicate removal. In RNA this is not the case; in RNA this question is more complicated. The reason is that while duplicates may correspond to biased PCR amplification of particular fragments, for highly expressed genes, for example a highly expressed short gene, you expect duplicates to occur just by chance, even if there is no amplification bias. You can imagine a gene that's only 400 bases long and expressed at 10,000 copies per cell: when you start sequencing a library containing those transcripts deeply, just by chance you'll get read duplicates, because the species you're sequencing is so abundant. You're just hammering it; for highly expressed genes you'll get 10,000x coverage from a single lane of data. So you expect these duplicates, and if you remove them you're effectively collapsing the dynamic range of the expression estimates you can measure from your library, and if the goal of your experiment is to determine how abundant each species is, that may seem like an unsatisfying thing to do. For that reason, a lot of people don't remove duplicates from RNA-seq libraries unless they believe there's some good reason to do so. If you perform some analysis and determine that your library seems to have a lot of redundancy, that it's a very low-complexity library, then you may change your mind and decide to remove them, but that's a bit more sophisticated than just removing duplicates en masse. So my usual recommendation is to look at your libraries and try to decide whether you think you have a duplication problem, and definitely, if you do remove duplicates, you should be assessing them at the level of paired-end reads, not single-end reads. [Question] Yes, the question is: why do a PCR amplification and then sequence only a subset of it? It seems counterintuitive or wasteful. That's a really good question. In the case of whole-genome libraries, the level of amplification is pretty low; it would be common to do maybe 8 or 10 cycles of PCR, and in a lot of cases we actually do need all that material. For cancer genomes that we sequence at WashU, for example, we'll commonly make a library from DNA using the standard protocol, do what we believe is a reasonable amount of amplification (you don't want to do too much, because you'll introduce bias), and then start sequencing, and we will physically run out of material; then we have to make another library from the original DNA to continue sequencing and get up to 30 to 40x coverage of the whole genome.
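The earlier point, that duplicates arise by chance for a short, highly expressed gene, is easy to sketch numerically. If fragment start positions are drawn uniformly from a small number of possible positions, the expected number of distinct starts saturates well below the read count; this is a standard occupancy calculation, and the gene length and read numbers here are invented:

```python
def expected_unique_starts(num_reads, num_positions):
    # Occupancy formula: expected number of distinct start positions hit
    # when num_reads starts are drawn uniformly from num_positions sites.
    return num_positions * (1.0 - (1.0 - 1.0 / num_positions) ** num_reads)

# A short, highly expressed transcript of ~400 bp: single-end reads can
# only begin at a few hundred distinct positions.
positions = 350
for reads in (100, 1000, 10000, 100000):
    unique = expected_unique_starts(reads, positions)
    dup_fraction = 1.0 - unique / reads
    print(f"{reads:>6} reads -> ~{unique:.0f} unique starts, "
          f"{dup_fraction:.0%} flagged as duplicates")
```

At 10,000 reads essentially every start position has been seen, so nearly all further reads would be flagged as "duplicates" with no PCR bias at all. Assessing duplicates on paired-end reads helps, because the fragment end adds a second coordinate and greatly increases the number of distinguishable fragments.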
So in that case, it literally is to get enough material to run the machine, and my sense is that it basically has to do with the way the machine works: you're flowing these molecules across a surface and making clusters of them that can be detected by the imaging system, but that process is not perfect; not every fragment results in a sequenceable cluster, so you need a certain amount of raw input to the device to get signal out. But there's probably someone on the technical side of the instruments who could answer that question better than me, if I'm being honest. [Question] Yeah, so the question is: if you have transcripts expressed at different levels, or the same level, and you introduce a PCR amplification step, do you introduce bias such that when you analyze the data later you may have lost the actual difference in expression, or genes that were expressed at the same level now appear to be expressed at different levels, because the amplification was biased? The answer is, of course, that's a concern. If we could do all of this without any amplification, everyone would love that, for sure; you try to limit the amount of amplification as much as possible, but so far it does not seem to be avoidable. Illumina is not doing single-molecule sequencing; we can't take literally one RNA molecule in solution and sequence it. That would be great, but that technology does not exist. So you try to minimize it: you don't do any more amplification than you need, and you try to do things consistently, so that when this bias-introducing process happens, you hope the bias is consistent. Then when you do it across 50 libraries, you can still compare those libraries to each other, because the same biases were introduced to each of them, and it sort of comes out in the wash. But we don't really know that that's the case, so all we can do is make predictions and then validate the effectiveness of those predictions: take some other technology that is as orthogonal as possible and see whether it gives us the same answer as RNA-seq does, and I'll have some slides on that coming up. [Question] Yeah, so spike-in controls are featured prominently in the ENCODE best practices document as something everyone should do, which to my knowledge almost no one has ever done. There were maybe one or two papers where people describe doing spike-ins in RNA-seq; I don't understand why it's so rare, but I have encountered very few data sets that mention anything about spike-ins. There's no reason why you can't, and there are manufacturers that will provide a series of spike-ins generated under a high-quality-control process, so you can get them consistently over and over again and use them in every library. Every time you propose doing this, everyone says, yeah, that's a great idea, we should do that; it just adds to the complexity of the library construction procedure, and there's a cost associated with that. It's not just the cost of buying the controls; it's the cost of integrating them into the standard operating procedure of your lab. But I think it's definitely a good idea, and there's no good reason not to do it that I can think of. [Question about other platforms] As far as I know, Illumina is really the king of RNA sequencing right now, so the others are fairly fringe by comparison in terms of the number of people using them; I'm not really aware of anything. In the way that we're doing it, we actually have not just amplification but two different kinds of amplification going on: we use a kit for our cDNA synthesis, and during that step there's a linear amplification, and that comes before the PCR amplification.
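One thing spike-ins would enable, since the same known amounts go into every library, is a technical scaling factor that doesn't depend on the biology. Here is a minimal sketch of that idea, with entirely made-up spike-in counts and library names:

```python
# Minimal sketch of spike-in-based scaling, with invented numbers.
# Each library receives the same known amounts of a few synthetic control
# RNAs; differences in their observed counts reflect technical factors
# (input, amplification, depth) rather than biology.
spike_in_counts = {
    "library_A": {"spike1": 800, "spike2": 4100, "spike3": 19500},
    "library_B": {"spike1": 400, "spike2": 2050, "spike3": 9700},
}

def scale_factors(counts, reference):
    # Scale every library so its total spike-in signal matches the reference
    ref_total = sum(counts[reference].values())
    return {lib: ref_total / sum(vals.values()) for lib, vals in counts.items()}

factors = scale_factors(spike_in_counts, reference="library_A")
# library_B received roughly half the spike-in signal, so its gene counts
# would be scaled up by roughly 2x before comparing libraries.
print(factors)
```

Real spike-in designs span a range of concentrations so you can also check linearity, but the basic accounting looks like this.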
So we're actually doing two steps of amplification, to allow for low-input libraries where we can't otherwise get enough RNA. But yeah, I think for now we're kind of stuck with it; maybe nanopore will get us away from this somehow. [Question] So the question is: what is the lowest amount of RNA you can start with? It's kind of hard to get a sense of that. If you're submitting your sample to another lab or a commercial organization that's going to produce your library and sequence it for you, they're going to ask for probably 10 to 20 times as much as they really need, in my experience, and they're going to tell you, no, no, that's not the case, we need all of it, or we keep back half of it in case there's a failure in the lab. We are typically using at least a hundred nanograms of total RNA; if you start with less than that, you get into a pretty high failure rate, in my experience with the kits we're using. We're using this NuGEN Ovation cDNA synthesis kit that involves a linear amplification, and it seems to work quite well on small amounts of RNA, and sometimes it does work just fine below a hundred nanograms. I don't know the details of how low they've gone and been successful, and it would be too anecdotal at this stage anyway, because we may have done it a few times and it worked out, but it's hard to get a sense of how reliable it would be until you've tried it many times, until you've initiated a project where you have a hundred samples, each with only five nanograms of total RNA, and you try to work with it anyway. I suspect that right now that would be a big challenge, though it doesn't seem to come up that much; it depends what you're working with. I would say quality is more often an issue than quantity. [Question: the asker names their sequencing center] Okay, so it really depends on how they make the library. The concept of the NuGEN Ovation cDNA synthesis kit is that it has this linear amplification and you can take total RNA and go straight into cDNA synthesis without doing a poly-A selection, and that's what allows you to get away with such small inputs. The way they do that is not really clear; the methods are obfuscated, so we don't know exactly what molecular biology is going on, but it involves some amount of random hexamer priming and oligo-dT priming, and adding a linker that allows linear amplification with a polymerase, and those things together seem to facilitate somewhat robust library creation from small inputs. If you are not doing something like that, if you don't want oligo-dT priming anywhere in your process, which is desirable to avoid introducing bias, so it's all based on random hexamers, you definitely need more input material. The other protocol I can compare to is in Vancouver, where we would do just random hexamer cDNA synthesis with no linear amplification: they would ask for 10 micrograms of total RNA and use 5 micrograms as input for poly-A selection, then make the library from whatever they got out of that. They tinkered with that poly-A selection step quite a lot to optimize it for yield, so they were definitely using more than they needed, and it hadn't been optimized specifically for low input; it had been optimized for producing libraries robustly and for flexibility in downstream experiments. But there are definitely a lot of people working on protocols, making libraries in 96-well plate format, using microfluidics, things that will allow lower input, and I think that's an area that will improve in the coming years. More questions?
Yeah, so the rest of these common questions are all related to the bioinformatics analysis. How much library depth is needed for RNA sequencing? This is another very common question: how many reads do I need to generate? Of course the answer is again not simple; it depends on a large number of factors. Probably the most obvious is what you are asking of the data, which has a big influence on how much data you need. If you just want gene expression estimates, you don't need that much depth; if you want to completely profile and assemble your transcriptome, establish the status of any mutation in any gene, determine the structure of every isoform, and so forth, you will need a lot more depth. It also depends on more lab-based things: tissue type, the way you're preparing the RNA, the quality of the RNA, the way you're constructing the library (the things we were just talking about), the sequencing type to a certain degree (how long your reads are, whether the reads are paired or not), and to a certain degree the computational approach and resources you have at your disposal. The actual answer, if people make me pick a number, is that I usually say shoot for a hundred million paired-end reads, just if you have to have a number, but that's not a very good answer in my opinion. The better strategy is to think about what you want to get out of the data and, ideally, identify someone else who's done something similar, and don't do less than what they did. Basically, if you want to call mutations with RNA-seq data, find a paper where they called mutations with RNA-seq data and you are satisfied with the outcome they achieved, and don't do less than that; do more than that. Even better is to perform a pilot experiment: sequence very deeply on a small number of samples and try to figure out, in your hands, in your lab, with the analysis you're going to use, where the sweet spot is.
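One way to run that pilot analysis is a simple saturation check: subsample the reads at increasing fractions and count how many genes are detected at each depth; when the curve flattens, extra depth is buying you little. A toy sketch with simulated reads (the gene count, abundance model, and detection threshold are all invented):

```python
import random

random.seed(42)

# Simulate a pilot library: each read is assigned to one of 5000 "genes"
# with a skewed abundance distribution, as in a real transcriptome.
genes = list(range(5000))
weights = [1.0 / (i + 1) for i in genes]  # a few genes dominate
reads = random.choices(genes, weights=weights, k=200000)

def genes_detected(sample):
    # "Detected" here just means at least one read; a real analysis
    # would use a stricter threshold.
    return len(set(sample))

# Saturation curve: genes detected vs. subsampled depth
for fraction in (0.05, 0.25, 0.5, 1.0):
    n = int(len(reads) * fraction)
    subsample = random.sample(reads, n)
    print(f"{n:>7} reads -> {genes_detected(subsample)} genes detected")
```

With real data you would subsample the BAM rather than a simulated list, but the shape of the curve is what tells you whether more lanes are worth buying.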
Basically, do a cost-benefit analysis in your own lab. The good news in all of this is that you can get a lot of data pretty easily; one to two lanes of Illumina HiSeq data will generally be sufficient for most of the applications I described. What mapping strategy should I use? Another common question. The answer depends largely on read length. Thankfully we don't really have to deal with very short reads as much anymore, but you do still see some places generating them for cost reasons. If you have short reads, you may need to use an aligner like BWA against a genome plus a junction database; there are a number of publications that took this strategy. The reason is that when you have short reads, it's hard to map them across exon-exon boundaries where there's a large intron in between. If you have, say, a 40 base pair read that spans an exon-exon boundary, even if it hits the boundary perfectly centered you only have 20 bases on either side, and that's a small amount of sequence to map over a large space. For that reason you have to map to the genome plus some set of junctions that have already been assembled. But if you're getting data now, your reads will probably be longer than 50 bases, and in particular if they're 75 bases or longer, you can go straight to one of the spliced aligners. That gives you a much simpler output to interpret and feed into downstream analysis, it's not biased towards the junctions in your junction database, it's more comprehensive, and it allows you to identify isoforms that were never identified before. So here's a visualization of this in IGV; you should all be familiar with genome browsers. Did you use IGV yesterday? Okay, so this is a screenshot from IGV. There are three tracks of data represented here: normal whole-genome shotgun data, so a normal sample from a cancer patient; a tumor sample from the same patient, these both being DNA; and then RNA-seq data from the same tumor as the whole-genome data.
Each of these gray boxes is a read, and you can see that when you align DNA reads against the genome they all just line up contiguously, without being gapped, but when you align your transcriptome reads, some of them span across exon-exon junctions. For example, with this top read here, the first part of the read matches the end of this exon, and then the rest of the read continues in the next exon. Basically, the aligner determined: this read matches here, but at this point it stops matching, it doesn't match the intron, so where does the rest of the read go? It starts looking downstream and finds where it fits. Now, when we look at the data, we can see that it actually lines up with a known transcript, so the alignment is probably correct, but the aligner didn't know about this transcript when it was doing the alignment; it just figured out where the read should go. The sequences in the region of the intron can be several things. They could be genomic DNA contamination in your RNA sample: you always have some amount of genomic DNA in your RNA sample, from which reads can be generated that map anywhere, including in the exons or in the introns. They could be unprocessed or partially processed RNA: we enriched for polyadenylated and hopefully processed mRNA, but there's no guarantee our RNAs have actually been processed, so it could be that the intron hadn't been spliced out of some fraction of the transcripts in our sample, and that would give us reads inside the introns. It could be a misalignment. Or it could be a transcript we don't know about: there could be a real transcript that has this whole intron retained, or one that uses a different donor or acceptor site (we can't tell which, depending on which direction this transcript is transcribed), and those could be real, correctly aligned reads.
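A crude way to quantify how much of that intronic signal a library has is to compute the fraction of aligned reads falling inside annotated exons. The exon coordinates and read positions below are invented for illustration; a real version would work from a BAM and a GTF:

```python
# Crude QC sketch: what fraction of reads fall inside annotated exons?
exons = [(100, 250), (400, 520), (800, 950)]  # (start, end) for one gene

def in_exon(pos, exons):
    return any(start <= pos <= end for start, end in exons)

# invented alignment start positions for reads over this locus
read_positions = [110, 130, 260, 300, 410, 455, 600, 810, 820, 900]
exonic = sum(1 for p in read_positions if in_exon(p, exons))
fraction = exonic / len(read_positions)

# A high intronic fraction can point at genomic DNA contamination or
# abundant unprocessed pre-mRNA in the library.
print(f"exonic fraction: {fraction:.0%}")
```

Tracking this number across libraries is a quick way to spot the contamination and incomplete-processing scenarios just described.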
They just correspond to a transcript that isn't being depicted here. [Question] So, what is the minimum size of this spilled-over piece? It really depends on the details of the aligner; I would say below 10 bases it's pretty hit and miss, and you can tweak the parameters of TopHat, or whatever gapped aligner you're using, to play around with how much it will allow. If it's too short, generally it won't attempt it: if you have a read that hits this exon and just a few bases spill over, it's probably not going to attempt to place those bases; it's just going to say, I'm not quite sure where those last two bases go. [Question] Yeah, so these are just reads randomly starting at different positions in the transcript, and ideally you have reads peppered across the transcript, starting and ending all over the place, which overall gives you a nice, clean representation of what the transcript, or transcripts, are. In this case, look at the three reads marked by stars. Most reads that span across an exon-exon junction, which is what's indicated by this light blue line (it's hard to see at the resolution of this projector), match the known transcript: for example, the first three reads here span a junction that matches an exon-exon junction in the known transcript for this region of the genome. But the three starred reads do not match that junction; they seem to correspond to perfect skipping of this exon: you've got a read that starts at the end of this exon, skips over the next exon, and continues at the beginning of the one after. What that tells us is that there are likely at least two isoforms being expressed from this locus, one that includes this exon and one that skips it. [Question about how many supporting reads you need] I don't know if there's a particular number, but definitely more than one is good, and the software we're going to use, Cufflinks, tries to take that decision out of your hands.
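Counting the reads that support each junction is mechanical once you have spliced alignments. In SAM/BAM, a junction-spanning read carries an 'N' operation in its CIGAR string marking the skipped intron; the positions and CIGARs below are invented to mirror the slide, with three reads supporting the known junction and one supporting an exon skip:

```python
import re
from collections import Counter

# invented (alignment start, CIGAR) pairs for spliced reads over one locus
alignments = [
    (1000, "30M500N20M"),   # spans the known junction
    (1010, "20M500N30M"),
    (1015, "15M500N35M"),
    (995,  "35M1200N15M"),  # skips an exon entirely: a longer gap
]

def junctions(pos, cigar):
    # Walk the CIGAR; M/D/N/=/X consume reference bases, N marks an intron.
    found = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            found.append((ref, ref + length - 1))  # intron start and end
        if op in "MDN=X":
            ref += length
    return found

support = Counter(j for pos, cigar in alignments for j in junctions(pos, cigar))
for (start, end), count in sorted(support.items()):
    print(f"intron {start}-{end}: {count} supporting reads")
```

Tools like Cufflinks do something far more elaborate than raw counting, but junction-level read support of this kind is the underlying evidence they weigh.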
It builds a model, with a lot of Greek symbols and mathematical wizardry behind it, that attempts to assign confidence to its predictions about what transcripts are present in your sample, and to assign a p-value and, sort of, error bars to help you make that decision. And if you become comfortable with that pipeline and use it a lot, over time, or maybe at the outset of an experiment, you will perform a series of validations on its predictions to get your own sense in the lab of what you can validate, based on the p-values or the coverage values it gives you; you can get a sense of how reliable the estimates are for your data and your analysis. [Question] Yeah, I'm going to show you some slides on validations, the next slide actually. So this is a common question: how reliable are expression predictions from RNA-seq? For example, as in the last slide, how reliable are these novel exon-exon junctions we observe? In that case we had three reads suggesting an exon was being skipped, and in that case, though I didn't mention it, we had an a priori expectation to see that exon skipped, because I had identified a somatic mutation at a splice site in this gene, which would be expected to disrupt splicing and cause that exon to be skipped; then, lo and behold, we see reads that look like the exon is indeed being skipped. But in general, how reliable are these predictions that an exon is skipped? And are the differential expression changes we estimate between two tissues accurate, and how well do they compare to orthogonal techniques such as qPCR? I of course had this question when I started working with RNA-seq data. I wrote some software to do alternative expression and differential expression analysis of RNA-seq data, and I wanted to know how reliable the estimates coming out of that software were. So I did about 400 validations, which involved designing qPCR, RT-PCR, and sequencing experiments to validate predictions. You can read all about that in this publication, but I'll just show you a couple of slides from it. First, a qualitative validation: how reliable are the structure predictions, for example exon skipping? Imagine a scenario where we have a gene with three exons, and the RNA-seq data tells us we have some amount of exon two being skipped, so we're getting two isoforms, one containing exon two and one skipping it. We design PCR primers in exons one and three, spanning across the site where the exon is skipped, then do an RT-PCR and run the product on a gel, expecting to see two bands: one for the large isoform that includes exon two and one for the small isoform that skips it. In this case, that's exactly what we saw. Do those sizes match what we would expect? And then, if we cut those bands out of the gel, sequence them by a different technology, in this case Sanger sequencing, and align those reads back to the genome, do we reciprocate the original finding that the exon was skipped? Here we've got some Sanger reads, about 500 bases now instead of short 50 base pair reads, spanning from exon one to exon three with exon two skipped. We went through this process about 200 times; here's a little snapshot of some of those gels. Basically, we found an 85% validation rate in this assay: 85% of the time that the RNA-seq data told me an exon was being skipped, I was able to validate that by RT-PCR and Sanger sequencing. And that's a lower-bound estimate, because the validation strategy itself fails some fraction of the time; it's not perfect, it has its own limitations, and the technology has improved since then. The second part is a more quantitative validation, a similar idea: we identified a bunch of exons that, between two conditions, were considered differentially expressed.
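The core comparison in that kind of quantitative validation is just log2 expression differences computed from each technology and a check on how often they agree. A toy sketch follows; the expression values, pseudocount, and the simple same-direction agreement rule are all invented for illustration:

```python
import math

# Pseudocount guards against taking log2 of zero for unexpressed exons
PSEUDO = 0.1

def log2_fc(sensitive, resistant):
    return math.log2((resistant + PSEUDO) / (sensitive + PSEUDO))

# made-up values per exon:
# (RNA-seq sensitive, RNA-seq resistant, qPCR sensitive, qPCR resistant)
exons = [
    (12.0, 48.0, 10.5, 51.0),
    (80.0, 19.0, 95.0, 22.0),
    (5.0, 4.5, 4.8, 5.2),    # the two technologies disagree here
    (0.4, 9.0, 0.6, 7.5),
]

agree = 0
for rs, rr, qs, qr in exons:
    seq_fc, qpcr_fc = log2_fc(rs, rr), log2_fc(qs, qr)
    # count it as agreement when both technologies call the same direction
    if seq_fc * qpcr_fc > 0:
        agree += 1

print(f"direction agreement: {agree}/{len(exons)}")
```

The published analysis went further, applying significance tests on both platforms before scoring agreement, but the log-ratio scatter being described here is built from exactly these kinds of values.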
In this case we had a drug-sensitive and a drug-resistant cell line, and there were genes and isoforms being differentially expressed. We identified a bunch of exons predicted to be higher in resistant than in sensitive, or vice versa, and randomly selected about 200 of those to validate: we designed exon-specific qPCR primers, performed qPCR in a series of biological and technical replicates, produced differential expression values from the qPCR output, and compared them to the RNA-seq output by plotting them together. So on the y-axis you've got the log2 difference between drug-sensitive and drug-resistant by one technology, and on the x-axis the same log2 difference by the other, and you can see that you get a very good correlation between the two technologies. If you then apply a statistical test to both data sets, ask which of these things are statistically significantly differentially expressed, and see how often the two technologies agree, you get an 88% validation rate. So basically, this tells us that if we could do qPCR for every exon in the entire transcriptome, we would get a readout similar to what the RNA-seq data gives us in a one-shot deal, which is very nice. Okay, so now we're going to do a brief introduction to the tutorial, then take a break and start the tutorial when you come back. I just have this one slide, and I'm going to keep referring back to it so we know where we're at in the process. Basically, we're going to start with three general types of input data. We're going to have raw sequence data in the form of FASTQ files, just like the ones you were working with yesterday, except today they're going to be RNA reads instead of DNA reads. And we're going to have a reference genome, again just like yesterday.
In the tutorial today it's just going to be chromosome 22, so the tutorial runs quickly, but in a normal situation it would be a FASTA database or file representing the whole reference genome for whatever species you're analyzing. And then we're going to have some gene annotations as a GTF file, and I'll explain a bit more what the features of a GTF file are. Each of these things is going to be fed into a pipeline. It involves taking our raw reads and doing a read alignment; we're going to use Bowtie and TopHat to do this alignment to the genome, and it's actually going to be a combined genome-transcriptome alignment. Then there's a transcript compilation step, which involves Cufflinks examining all of the read alignments and trying to figure out what transcripts are present and how highly expressed each of those transcripts is; Cufflinks will also compare the transcripts it finds to the known transcripts we supplied in our gene annotation file. Then we're going to take the output from the Cufflinks process and do a comparison between two tissues, in our case colon tumor versus colon normal. We're not going to get to this last part, but the authors of this pipeline have also added a visualization step, where you can take the output of that final step and feed it into an R tool that will help you visualize the results.
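As a rough sketch of how those steps chain together on the command line, the snippet below just assembles illustrative commands rather than running them; the file names are placeholders, and the exact options depend on the tool versions you have installed:

```python
# Sketch of the tutorial pipeline as assembled command lines. File names
# are placeholders and the options shown are illustrative; check each
# tool's documentation for the versions you actually have.
samples = {
    "tumor": ("tumor_1.fastq", "tumor_2.fastq"),
    "normal": ("normal_1.fastq", "normal_2.fastq"),
}
genome_index = "chr22_index"   # Bowtie index of the reference
annotation = "genes.gtf"       # known transcripts as GTF

commands = []
for name, (r1, r2) in sorted(samples.items()):
    # spliced alignment of reads to the genome, guided by known junctions
    commands.append(
        f"tophat -o {name}_tophat -G {annotation} {genome_index} {r1} {r2}")
    # transcript assembly and expression estimation from the alignments
    commands.append(
        f"cufflinks -o {name}_cufflinks {name}_tophat/accepted_hits.bam")

# differential expression between the two conditions
commands.append(
    f"cuffdiff {annotation} "
    "normal_tophat/accepted_hits.bam tumor_tophat/accepted_hits.bam")

for cmd in commands:
    print(cmd)
```

In a real run you would execute these with a job scheduler or a shell script and check each tool's log output before moving to the next step.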