 Okay, so this is module one, the goal of this module is to be an introduction to RNA sequencing, sort of the background of the biology, and then some key points for the analysis. I'm going to keep showing this slide, or Obi and I are going to keep showing this slide as we go, just to review sort of where we're at. So there are five modules that are going to be covered over the next two days. Each of those modules has some practical component. So as I said, we're going to start with an introduction to RNA sequencing. So this is really going to be some fundamentals of what RNA sequencing is all about. And the tutorials are going to be much more hands-on. You're going to be executing commands on the cloud and doing actual analysis with real data. And the general goal of all of these tutorials is to provide a working example of an RNA-seq analysis pipeline. So if successful, you should be able to go back to your own lab and create a pipeline to analyze your own data that's based on a lot of the principles, commands, tools, installation instructions, et cetera, that are provided in these tutorials. Sort of another general goal of these is that they run in a reasonable amount of time with modest compute resources. In particular, in a teaching setting, we're always trying to optimize the data that we're using so that we can actually get through these things in a timely fashion and keep on Michelle's schedule so that she doesn't get mad at us. And as I said, we try to make these tutorials as self-contained as possible. So we provide explanations. Everything in them is meant to be portable. If there's a tool that's needed to do the analysis, instructions are provided on how that tool is installed and configured. So there shouldn't be very much, if any, of a sort of black box feeling to it. You should walk away with everything that you need. 
So that's our goal, and if we fail in that goal, we're always trying to improve and we're interested to hear where that doesn't work, if you find that's the case when you get back to your own labs. So just to go over some of the specific learning objectives of this module. So as I said, we're going to talk about some theory and practice of RNA sequencing analysis. We'll briefly go over the rationale for RNA sequencing. Probably most of you are already convinced since you're here at this course taking an interest, but just in case you need to rationalize it to someone else, we'll give you some pointers. And then we're going to go through some of the challenges that are specific to RNA-Seq. So there are some analysis challenges that are general to next generation sequencing of DNA, whole genome sequence, exome sequencing, and so on. We're going to talk about some of those general challenges, but we're really going to focus on challenges that are specific to RNA-Seq analysis. We'll talk about some of the general goals and themes of analysis workflows. So what do these pipelines look like? There's an infinite number of ways to plug various tools together, and we're going to talk about some of the themes that come into play. We're going to spend a lot of the time in this lecture in particular going over some of the common technical questions that are related to RNA-Seq analysis. So these are things that I get asked over and over. So rather than waiting to be asked, we've just created content that tries to answer these questions. But I would encourage you, especially in this first lecture, to ask questions as we go, as long as we don't start to burn up too much time. I think it's more interactive if we have a bit of back and forth. 
I'm going to go over how you can get help outside of this course when it's done, and then I'll give a brief introduction to the hands-on tutorial so that when you come back from the coffee break, you're primed with some of the ideas that you're going to be putting in place. So you probably know this almost as well as I do. Maybe some of you know it better than I do. But just so that we're on the same page, I thought I would start with the central dogma as a brief review. What's being shown here is a DNA template. This is genomic DNA in cartoon form, with exons depicted as colored boxes interspersed with introns. This example is based on human, or eukaryotic, gene expression, and these introns are not to scale. Of course, introns are very large compared to exons; they're just shown this way for display purposes. What happens is that an RNA polymerase comes along, binds upstream of a transcription start site, initiates transcription, and produces a single-stranded pre-mRNA molecule, which is depicted here with the introns still in place. Then, in addition to the transcriptional machinery, the splicing machinery comes in. It recognizes various features of this molecule that allow it to remove, or splice out, the introns and assemble the exons together without the introns in place. This is called a mature mRNA, and these are really the subject of RNA sequencing. This is why I show this: a lot of RNA sequencing is specifically focused on polyadenylated mature mRNA sequences. But it's really important to remember that RNA sequencing is not actually sequencing these molecules directly, in several senses. For one thing, we're not sequencing RNA itself; we're making cDNA from RNA and then sequencing that. And importantly for the analysis, we're not sequencing full-length transcripts.
Most of the time you're fragmenting your RNA into small pieces, usually about 200 to 300 bases long, and then you're sequencing the ends of those fragments. In human, for example, it's very common for a transcript to be 1,000 or 2,000 or 5,000 or 10,000 bases long. We are not sequencing anything even close to that length; we're sequencing little pieces that we've broken up, and we're going to try to assemble all that information back and infer what the full-length transcript looks like. It's important to remember that there's a lot of inference going on here. You should always be skeptical of predictions of what a full-length transcript looks like when they're based on RNA-seq, because fundamentally the data is made up of pieces that you're trying to assemble back together like a jigsaw puzzle. And then what a lot of people are interested in, of course, is protein sequences, which will be translated from these mature mRNAs after they're exported from the nucleus to the cytoplasm. Probably a lot of us, if we could, would prefer to sequence proteins directly, but unfortunately there is no high-throughput way of doing so; at least, nothing that's nearly as high-throughput as RNA sequencing. So for now we're stuck with yet another layer of inference, where we're sequencing RNAs and trying to predict what that might mean at the protein level. This is just a very high-level overview of the wet lab side of RNA sequencing. We're starting with some samples of interest. In our example data we're going to be looking at a tumor/normal comparison. I heard that at least one or two of you are interested in doing differential expression analysis in cancer, so this tutorial is going to be very relevant to you. But you have some number of conditions of interest. From each of those you isolate RNA, and the transcripts are of variable lengths, generally much longer than 250 bases.
Then we're going to generate cDNA, and we're going to fragment that cDNA into much smaller pieces. Often there's a size selection step, so you're either throwing away the really small stuff or you're selecting a very tight range of fragment sizes. And you're going to add Illumina adapters (linkers) onto the ends of those fragments, which is what's depicted here with the blue and red rectangles. Then we're going to sequence those on an Illumina flow cell, which is what's depicted here. It's basically the size of a slide that you would look at under a microscope. It has eight lanes on it, and each of those lanes consists of a very long, thin channel over which you flow all of these small fragments. Then an amplification occurs, and sequencing occurs directly on the flow cell. From this small piece of substrate, we're going to produce hundreds of millions of paired reads in a single run. On the HiSeq 2000, this takes about 10 days; on the HiSeq 2500, you can do it in a couple of days. What you get at the end are these paired reads: you have a small fragment with reads of some length coming in from either side. These lengths can vary quite a bit. In the integrated assignment, we're going to be using very short reads that were generated quite a while ago; I believe they're 36 bases long. But probably the most common read size right now is paired 100-mers: you have a 300-base fragment, and you've got 100 bases read in from one side and 100 bases read in from the other side, giving you 200 bases total with some amount of space in between. All of the analysis in RNA-seq generally follows this theme of mapping to some combination of the genome, the transcriptome, and predicted exon junctions. And that's exactly what we're going to do in our example analysis.
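To make the paired 100-mer arithmetic concrete, here is a minimal Python sketch (not part of the course materials, just an illustration) of the relationship between fragment length, read length, and the unsequenced gap between the two mates:

```python
def inner_distance(fragment_length, read_length):
    """Unsequenced gap between the two mates of a paired-end read.

    A negative value means the mates overlap (the fragment is shorter
    than the two reads combined)."""
    return fragment_length - 2 * read_length

# The paired 100-mer example from the lecture: a 300-base fragment
# sequenced with 100 bases from each end leaves a 100-base gap.
print(inner_distance(300, 100))  # -> 100

# The 36-base reads used in the integrated assignment would leave a
# much larger gap for the same fragment size.
print(inner_distance(300, 36))   # -> 228
```

This is why the fragment size distribution chosen during size selection matters downstream: aligners that place read pairs need to know roughly how far apart the mates should land.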
Then after mapping to the genome, and there are many different ways to do that, but assuming you've successfully mapped all of your reads, the downstream analysis starts. This can involve many different tools that give you different types of predictions: which genes are expressed, which genes are differentially expressed, RNA fusions, variants and mutations, and so forth. So why would we sequence RNA in the first place as opposed to DNA? This is a question that comes up in a genome-centered context where people are really focused on DNA, which maybe doesn't apply to you at all. But of course there are lots of reasons why we would be interested in RNA. One is that we may be interested in functional studies. The genome in our study may be constant, but some experimental condition can still have a pronounced effect on gene expression levels. For example, you might have a cell line that's being treated with a drug, and you're comparing it to untreated cells to try to understand the effects of that drug. You could have a mouse model with a wild-type versus a knockout version of that mouse. There are certain types of molecular features that can only be observed at the RNA level; alternative splicing is a good example of this. Fusion transcripts and RNA editing, which are base changes that happen at the RNA level and are not represented in the genome, cannot be observed by sequencing the genome. You might be able to predict them, but you can't observe them directly. Predicting transcript sequences from the genome sequence is also quite difficult. You can have a fully sequenced genome like the human genome, which is very well characterized, but no matter how good your representation of the genome sequence is, it's difficult to predict what the transcripts will look like when they're transcribed from that genome. In particular, splicing patterns are very difficult to predict from the genome sequence alone.
We can predict with pretty good accuracy what an exon will look like based on conservation and other features, but it's difficult to know for sure how it's going to be spliced together into a mature mRNA. Another reason to sequence RNA, in combination with DNA, is that it can be very useful for interpreting mutations. Someone mentioned allele-specific expression: for example, you may observe a mutation in the DNA, and you want to know if that mutation is actually expressed, and whether the mutant allele is expressed as much as, more than, or less than the wild-type allele. You may also be interested in regulatory mutations that affect which isoforms are expressed and how much expression is occurring. And as I mentioned, if you're really interested in somatic mutations, you can use RNA-seq to prioritize coding mutations. If you have a tumor sample, say, and you identify 100 mutations in the exons, and then you overlay your RNA-seq data on top of that, you might quickly determine that only 40% of those mutations are in genes that are actually expressed, and that might allow you to prioritize them in a useful way. For example, if a gene is not expressed, a mutation in that gene might be less interesting if you're interested in, say, targeted therapies. If a gene is expressed but only from the wild-type allele, this could suggest that a loss of function is happening; you may be triggering nonsense-mediated decay, for example. And if the mutant allele itself is expressed, that might be interesting because it may indicate that this is a druggable mutation, or an activating mutation, or it just makes the effect of the mutation more pronounced, because basically only the mutant form is being expressed. Okay, so some of the challenges of RNA sequencing. To start with, these are really focused on the lab side of things; some of you are probably quite familiar with them. One of the biggest challenges is the sample itself.
You may have issues with purity, quantity, and quality of your RNA. People that are used to working with DNA will usually be disappointed with how much harder it is to work with RNA in terms of quality. RNA is a lot more sensitive to degradation than DNA, so you'll very often have a sample that gives you decent genomic DNA but very degraded RNA, and that can have big consequences for library construction and ultimately for the analysis and the results that you get. So it's a good idea to start thinking about your sample quality right away when you're doing RNA analysis. Another challenge that I alluded to is that RNAs consist of small exons separated by very large introns. This creates a mapping challenge in RNA sequencing that does not exist when mapping DNA reads to the genome. If we have a 100-base DNA read and we want to map it to the genome, we're looking for 100 bases that align contiguously. If we have a 100-base read derived from RNA and we're trying to map it to the genome, it may correspond to a piece of RNA that spans an intron. So we may have to divide our alignment so that part of it hits one exon, then there's an intron, and the rest of it continues in the next exon. Computationally, that's much more challenging than aligning it as one single piece, so our read alignments are much more difficult. Another challenge that's specific to RNA, which you don't have nearly as much with genomic DNA, is that the relative abundance of reads you expect across transcripts varies wildly. When you're sequencing DNA, say a diploid human genome, you expect approximately uniform coverage across the genome when you map your reads back: all of the DNA is there in approximately the same amounts. With RNA, of course, this is not the case. The expression levels of RNAs vary over a huge range.
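To illustrate what a split alignment looks like in practice: spliced aligners typically record an intron-spanning alignment in the SAM format's CIGAR string using an "N" (skipped region) operation. This is a toy parser for illustration only, not part of the course pipeline:

```python
import re

def cigar_ops(cigar):
    """Parse a SAM CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def is_spliced(cigar):
    """A read that had to be split across an intron carries an 'N'
    (skipped region) operation in its CIGAR string."""
    return any(op == "N" for _, op in cigar_ops(cigar))

# A 100-base read aligning contiguously, as a DNA read would:
print(is_spliced("100M"))         # -> False

# The same read split across a 5,000-base intron: 60 bases in one
# exon, 40 bases in the next.
print(is_spliced("60M5000N40M"))  # -> True
```

The aligner's job is hard precisely because, for each read, it must consider not just contiguous placements but every plausible way of splitting the read across candidate exon junctions.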
A range of five to seven orders of magnitude (10^5 to 10^7) is the estimate that you'll hear. This is important because RNA sequencing works by random sampling: we take all of our cDNA fragments and randomly sample them. Because some RNAs are present at a very high level and some at a really low level, it can be difficult to capture information from the lowly expressed RNAs. Every time you randomly pull some reads out of your bin, the chances are you're just going to get reads coming from the transcripts that are most highly expressed, in particular ribosomal and mitochondrial genes and certain other housekeeping genes. You'll notice this as soon as you start looking at RNA-seq data: you have ridiculously high coverage over certain genes, and when you're really interested in a particular gene that's not very highly expressed, you may find that even with a lot of data you get very little or no coverage of that gene. A good example of this is telomerase. The human telomerase gene is expressed at just a few copies per cell, and at that level it's functional and has a very important function, and people want to study it. You hope to see some representation of this gene in your RNA-seq data, but because it's there at such a relatively low level compared to the highly expressed genes, you can sequence your library very deeply before you get decent coverage over genes like that. Another issue is that RNAs come in a wide range of sizes. Some RNAs are functional at a very small size; for example, microRNAs are just tens of bases, while other RNAs can be 100 kb. This creates analysis challenges as well as lab challenges in terms of how you make your libraries.
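The random-sampling problem can be seen in a toy simulation. The gene names and abundances below are made up purely for illustration; the point is just that sampling a million reads from a pool dominated by a few very highly expressed species leaves almost nothing for a transcript present at a few copies:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical relative abundances spanning several orders of magnitude,
# mimicking a library dominated by a few very highly expressed genes.
abundance = {"rRNA-like": 1_000_000, "housekeeping": 100_000,
             "typical": 1_000, "telomerase-like": 3}
pool = [gene for gene, n in abundance.items() for _ in range(n)]

# Sequencing is, to a first approximation, random sampling from this pool.
counts = Counter(random.choices(pool, k=1_000_000))

for gene in abundance:
    print(gene, counts[gene])
```

Even with a million sampled reads, the "telomerase-like" species ends up with only a handful of reads, which is why you can sequence very deeply and still have essentially no coverage over such a gene.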
One of the consequences of this is that you may have to divide and conquer: one analysis and library preparation strategy for very small RNAs, and a different strategy for larger RNAs. Another challenge is the issue of poly-A selection. People that are interested in RNAs that are polyadenylated often perform an enrichment for polyadenylated RNAs, and one of the problems this can introduce is three-prime end bias. You can imagine you have transcripts of varying length in a solution, and you're going to capture polyadenylated RNAs by grabbing onto their poly-A tails, usually with oligo-dT sequences. So you're holding onto these long RNAs by their poly-A tails, at the three-prime end, and if there's any break in an RNA, since you're holding onto it at one end, you're going to lose its five-prime end. This is always happening to some level, so as soon as you do a poly-A selection, you introduce some bias towards the three-prime ends of transcripts. When you start looking at coverage across the transcriptome and at specific transcripts, you'll see that the three-prime end tends to have a lot more reads piling up, and the five-prime end has far fewer; coverage generally tails off towards the five-prime end. If you're interested in the five-prime ends of transcripts, you may have very poor coverage there, and you may have to increase the total amount of data you're generating to get the information that you need. This speaks again to the issue of RNA being much more fragile than DNA. I usually show this slide just because this is a very common method for assessing the quality of RNAs. Oh, sorry, a couple of questions first. I wish I had a whiteboard. The question was how the three-prime end can have better coverage than the five-prime end if you're doing a poly-A selection.
First, do you know what I mean by poly-A selection? When you do a poly-A selection, you're holding onto the three-prime ends of all of your transcripts, and you're washing away everything that isn't polyadenylated; that's how the enrichment works. In the process of washing away all of the non-polyadenylated RNA, you're also washing away the five-prime ends of transcripts that have been broken off. Imagine you have a transcript that's 2,000 bases long and polyadenylated at its three-prime end, but because RNA is fragile, it's been broken in the middle. When I grab onto that three-prime end and wash everything else away, I've lost the part that was broken off. When you average all of that out and look at coverage across each transcript, you'll see a pattern where there's a high peak at the three-prime end and then a tail that falls off. And the bigger the transcript is, the more likely it is to have a break in it. Smaller transcripts won't have this problem as much, but really big transcripts have that much more opportunity to be broken in the middle. What's the alternative to poly-A selection, and why do you do it? You do it for some of the other reasons I described, in particular this issue of always sequencing highly expressed transcripts over and over again. There are certain very highly expressed transcripts that are not polyadenylated that you may not be interested in. By doing a poly-A selection, you enrich for transcripts that are mRNAs, and you can wash away a lot of the really highly expressed housekeeping RNAs that you're not interested in. This lets you not burn as many of your reads on these really highly expressed non-polyadenylated RNAs. It's basically a way of enriching for mRNA sequences.
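The length dependence of this effect is easy to quantify. Assuming, purely for illustration, that each base of an RNA has a small independent chance of carrying a strand break, the probability that a molecule survives poly-A capture with its five-prime end intact falls off geometrically with length:

```python
def intact_fraction(length, break_rate=1e-4):
    """Probability that a transcript of the given length has no break
    anywhere along it, assuming each base breaks independently with
    probability break_rate (an illustrative number, not a measured one).
    Only intact molecules keep their five-prime end after poly-A capture."""
    return (1 - break_rate) ** length

# Longer transcripts are far more likely to have lost their 5' end:
for length in (500, 2_000, 10_000):
    print(length, round(intact_fraction(length), 3))
# -> 500 0.951
#    2000 0.819
#    10000 0.368
```

Under these toy numbers, roughly two-thirds of 10 kb molecules have lost their five-prime ends while most 500-base molecules survive intact, which matches the intuition that the bias hits big transcripts hardest.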
People that are really interested in protein-coding genes want to focus on those types of transcripts, so they can enrich their pool towards them by doing a poly-A selection, and then spend fewer of their reads sequencing ribosomal RNAs, which are not polyadenylated. And there are alternatives. The alternative would be to take the reverse strategy: instead of positively enriching for things that have a poly-A tail, you pull away things that are ribosomal. There are ribo-reduction kits that deplete the ribosomal sequences, so that you can sequence things regardless of whether they're polyadenylated or not, while still reducing this problem of sequencing ribosomal RNAs to death. As for what sizes you can capture with the standard RNA-seq protocols: it's not that you get one specific size, but rather that you usually get everything above a certain size. Everything that's a few hundred bases or larger will usually be captured in an RNA-seq library, with some exceptions. I've seen sequences being assembled for transcripts that are 40 or 50 kb or even larger. But again, keep in mind that you're inferring that you've sequenced those full-length transcripts; you don't actually know what you sequenced on that scale, because you're just sequencing multiple pieces and assembling them together. It does appear that you're able to capture very, very large RNAs, though. I would say the smaller RNAs are more of a challenge. Sorry, he's been waiting for a while.
An audience member comments: Illumina is trying to get away from poly-A selection, and generally speaking, in our lab I've developed my own ribosomal RNA depletion protocol as an alternative to the commercial kits, and it works much better than poly-A selection, though as mentioned it also has some disadvantages. Generally you try to keep as much of the total RNA as you can, then fragment it and sequence the fragments. I think I actually have a slide of the ways that you can make an RNA-seq library, and we could probably talk all afternoon about the pros and cons of each of these different approaches, but there was one more question. Yeah. The question: earlier you talked about size selection prior to cDNA sequencing, and here you say that small RNAs must be captured separately. So if I want to study, say, microRNAs or other small non-coding RNAs, do I have to make a separate small-fragment library, and another one if I want to study coding RNA? Basically, yes. It's a very common strategy to have a library construction protocol for small RNAs that involves a small size selection; then for larger RNAs it's also common to just remove the small fraction, fragment after that, make your libraries from that, and then maybe do an additional size selection. A lot of places basically have two arms: a protocol they use for microRNA libraries, and then everything else. You can cover most of the transcriptome by dividing into those two strategies. A follow-up: but if I made a single library covering both the small RNAs and the coding RNAs, what bias would that introduce? You may use up a lot of your sequence if there are some really highly expressed short RNAs. I haven't actually looked at data from anyone that did that, so I can't really say.
It seems like everyone does either a size selection with a very tight range, or they remove the small molecules before making the library. I haven't seen an attempt to just do the entire transcriptome completely unbiased, and I think there's a reason for that: it probably causes problems. An audience member adds that this is more of a technical challenge: microRNAs are only 20 to 30 bases long once you account for adapters, while in a normal transcriptome library you try to get rid of fragments that small, so the two are not compatible and you have to pick the range you want to study. Some people also sequence really small RNAs like microRNAs by concatenating them into bigger pieces, because otherwise you're being wasteful: the sequencer produces reads that are 100 bases long, and if your fragment only has 20 bases of sequence that you're interested in, that's kind of wasteful. I think, yeah, it's going to be hard to do. Another question: along those lines, are you going to address what type of coverage metrics you should look at to decide whether the library selection you've made is appropriate, once you've picked the subset of the transcriptome you're interested in? Is there a way to check that you've zeroed in on the appropriate library? We're going to review some tools that do QC of RNA-seq data, and they provide a huge number of metrics, some of which can be used to get at some of these issues. I usually show this example of an Agilent Bioanalyzer nano assay of some RNAs, and I provide this link here, which you can check out later, that shows a wide range of examples of runs of this instrument with RNAs of varying qualities from different sources of material. I include this because you'll really commonly see this RIN number being discussed in RNA-seq library preparation and analysis.
It's really common for a sequencing core or lab to run their RNA on one of these instruments, produce one of these traces, and use them to interpret the quality of the RNA. A lot of sequencing centers will have some kind of minimum score that they allow, where they basically won't make an RNA-seq library if your RNA is too degraded, and that's what this assay tells you. This is normal RNA; the ribosomal peaks are shown here. This is a human sample, and the more secondary peaks you see, the more degraded the RNA is. This is basically a modern-day version of running your RNA on a gel and seeing how smeared it is. If there's a really uniform smear and you don't see distinct ribosomal bands, you would say your RNA is very degraded; if you see really distinct ribosomal bands and not much of a smear, you would say your RNA is very intact. This is just a digital version of that analog approach. The software interprets the pattern and assigns a score based on how clean it looks. Here it's detecting a lot of secondary peaks, basically a lot of smearing, and it's giving a score of 6, which is not that great; a perfect score is 10. Here's an example of RNA that I isolated from a cell line, and it's just absolutely perfect: there are no secondary peaks at all except for the marker, and it gets a score of 10. You can see everything in between, and as I said, if you go to this PDF, you can see probably 50 examples, everything from perfect to really degraded. A common cutoff that you'll hear people using is 8. The software comes with the Agilent Lab-on-a-Chip Bioanalyzer instrument, and there's a different program for each of the chips. There are RNA chips, low-input RNA chips, and DNA chips for different size ranges of DNA. It's a really commonly used instrument.
A question: do you consider the Bioanalyzer RNA chip to be enough to assess genomic DNA contamination? Not really. They will say no if you ask them, because the size range is so different; they generally adjust the gel concentration for different sizes, just as you would running a gel the old-school way, and that makes it hard to assess very large pieces of DNA. If your genomic DNA is very intact, it may not enter the column very well, or at all. That being said, I have seen examples of these traces where you could see some kind of smearing or peaking at higher molecular weight, and I've compared before and after DNase treatment and actually seen a difference, so I think it is possible to get at least a qualitative sense of it. As for quantifying the contamination, I'm not aware of a standardized protocol. Most people do a DNase treatment and just say "my genomic DNA must be gone," which of course is not strictly true. I would say it's probably easier to assess at the analysis level, which is maybe too late if you have a lot of genomic DNA contamination, but you should be able to tell how many intergenic reads you're getting from your RNA-seq library, and if there are a lot of them, you could infer that they may be coming from genomic DNA contamination. You don't actually know that, though. It's very difficult in the end analysis to know for sure whether a signal came from a true RNA, whether it came from biologically relevant transcription or just noisy transcription; basically every region of the genome is transcribed to some level in some kind of stochastic process, and then of course some of your signal will always come from genomic DNA. So just because you see a read over an exon of your gene, you don't know for sure that it came from an RNA expressed from that gene. But if you see very high peaks over your exons, very low coverage in your introns, and perhaps even lower coverage outside of genes entirely, you can feel comfortable that your genomic DNA contamination level is low. Generally, every RNA-seq library consists of some amount of true mature RNA, unprocessed RNA that still has introns in place, and genomic DNA, which will give you a low level of coverage everywhere, and there are methods for trying to measure the amount of each of those things to assess the quality of your library. And no, it won't prevent you from doing analysis; it just reduces your signal-to-noise ratio. If you're interested in a transcript that is very lowly expressed, you might not be able to tell the difference between true expression of that gene and genomic DNA contamination, but in practice, if you have a good library, you don't have that problem. Okay, so design considerations. This is something to go through on your own time, but the ENCODE Consortium published standard guidelines and best practices for RNA-seq. You can download the document from the course wiki or from this URL, or you can Google for it. It's a very in-depth discussion of all of the experimental controls, the way you should set up your replicates, the types of libraries you should make; it's basically a list of all of the things that people doing RNA-seq should and shouldn't do, including reporting standards, for example, which almost no one follows but which are of course an excellent idea. For those that have done RNA-seq projects: has anyone thought about using spike-ins? If you're at the beginning of a project, the reason a lot of people don't do these things is just that they haven't done them before, or they've already decided on something and want to be consistent. But if you're at the point where you're starting an RNA-seq project and you can still make these decisions one way or the other, go read these standards and think about which ones are practical and cost-effective to do, and you'll probably thank yourself later. Okay, so I alluded
to this earlier: there are many, many RNA-seq library construction strategies, and this is another challenge for analysis, one that propagates from the uncertainty and changing landscape on the lab side of things into the analysis side. It doesn't seem that people have really stabilized on one good way to make RNA-seq libraries; there are still new kits coming out all the time. Oh, sorry, this slide was a last-minute addition, so if anything is slightly different, you'll be able to get the updated version from the wiki. Yeah, RNA-seq library. OK, so an RNA-seq library is basically how this was done. You're starting with RNA, and then you're going to go through some series of molecular steps to make cDNA out of it. How do you make cDNA? It involves enzymes, different enzymes; there are different kits you can get to synthesize cDNA from RNA. How do you fragment it? There are different devices that can fragment cDNA into little pieces: you might use an enzymatic way of fragmenting, you might run it on a gel and cut a slice out, you might use a Caliper; there are various instruments to do size selection. And then linkers: you may add different types of linkers. All of this process together, including details that I'm probably not mentioning, makes up the library, and the library is the thing, whatever molecules you've made, that you're going to put on the flow cell and sequence. So alternative libraries are different ways of doing this. You might have a library that's made up of little fragments; you might have a library that's made of big fragments; you might have a library where the cDNA was synthesized with some kit and then amplified, or maybe it wasn't amplified. That's what this slide is talking about: all of those tweaks to the molecular biology that happen when you're going from RNA to sequence data. Does that make sense? This one we've talked about already, and it's already been mentioned by others: you can start with total RNA. This
is generally what you're isolating from your cells, and you can do a poly-A selection, or you can leave it as total RNA, and you can do a ribo-reduction. So you can try to enrich for poly-A+ species, or you can try to reduce the ribosomal RNA content, or you could do both, or neither. Size selection: you can do this before or after cDNA synthesis. So you might do a size selection on the RNA itself, or you might synthesize cDNA from your RNA, fragment it, and then do a size selection. Your library may be focused on small RNAs versus large RNAs, and that will of course influence your size-selection strategy. If you do select a size distribution, you basically run your cDNA on a gel and cut out a section of it to represent a certain size. You can do that different ways: you could select a very tight band, or a very broad band, or you might just cut away the stuff that's too small. And you might not do that on a gel; you might use some kind of instrument. There are many different strategies. There are some kits that do cDNA synthesis and then linear amplification with some kind of polymerase, for example a T7 polymerase, and this could be done if you have very small amounts of RNA, where you're starting with, say, a fine-needle biopsy from a tumor sample. You might just have a very tiny amount of material, and you need to amplify it to get enough material to make a library. You can have stranded versus unstranded libraries; we're going to talk a bit more about this. Up until now, most RNA-seq libraries have been made in such a way that you don't actually know which strand was being transcribed when you're analyzing the data. You're making double-stranded cDNAs and then sequencing those, so when you align them back to the genome, you get reads aligning as if they came from either strand. But of course each RNA is transcribed from only one strand, in the 5' to 3' direction, so the reads probably all came from the same strand; that information has just been lost. So it's one more thing that you're inferring
when you're doing the analysis: you're looking at the way your reads align to the genome and inferring that transcription was occurring in a certain direction, but you don't actually know that. You don't know which strand was being transcribed, and it could be a combination of the two, and you wouldn't necessarily be able to tell the difference. But there are some new kits to help you make libraries where this information is not lost, where the reads will only come from the strand that was transcribed, or always from the opposite strand, depending on how the library was created. You may also manipulate your library after you make it. You can take your library and hybridize it to an exome capture reagent to enrich for transcripts that actually contain exons; this is another way of sort of cleaning up a really problematic library. There are library normalization strategies. What I mean by normalization here: I talked about this idea that some RNAs are very, very highly expressed and some are very, very lowly expressed. Normalization is an attempt to collapse that a bit, getting rid of some of the really highly expressed things so they're not so overrepresented, and effectively allowing yourself to more easily sequence the things that are lowly expressed. All of these details can affect the analysis strategy, and certainly the interpretation of the analysis. Probably the simplest way this shows up is that if you're not doing things one way, you're going to have problems comparing between your different conditions. So you generally want to pick a strategy and stick with it, at least within a project. If you're comparing tumor and normal, you don't want your normal to be made in a different way from the tumor. That should be obvious, but invariably it happens, because different projects get merged, or someone, you know, some PI says, well, why don't you compare it to this data
we produced last year; it's RNA-seq data, it's all the same thing, right? It might not be. If the library was made in a significantly different way, you may see differences that are down to the way the library was made and don't have anything to do with biology. As for whether that normalization approach is commonly used: not that I'm aware of; there have been a few papers describing the idea of it. We use it for libraries from very challenging samples. For example, if you have RNA from a formalin-fixed, paraffin-embedded tumor sample, a block that came from some pathology lab in a hospital and was sitting on a shelf for five years, the RNA is really, really badly degraded and there's not very much of it. It's sort of a library cleanup strategy that can help recover some really problematic libraries, but it's an unusual thing; you would only do it in certain circumstances for certain applications. So the question: when you're comparing different methodologies for RNA-seq data sets, would something like spike-ins allow you to do that more easily, to normalize out the potential variations in purification methods? I mean, the assumption is that the spike-ins amplify linearly if we add them at the beginning. So if you know what the read counts are for those spike-ins, would that help you normalize that?
Oh, sorry, the spike-ins, I didn't catch that. Yeah, so this is one reason: if you're varying the way you're making libraries but you always include spike-ins, it does give you a sort of ground truth to compare to, and that is one of the many uses of those kinds of strategies. And even if your intent is to make libraries the same way, something can go wrong, and if you have the spike-ins there, it will really help you tell when something went wrong and when you really have to start worrying about your interpretation of the results. So the question is: for spike-ins included in an RNA-seq library, is there a standard for analyzing them? I wouldn't know, because I still haven't encountered a data set that had spike-ins in it. There may be some standards proposed in the document that I referenced, but they will probably be very high-level; it's not a practical, hands-on recommendation. But this is a common strategy from the microarray days; spike-ins are a very old idea, so I'm sure you could learn a lot about the way people use them by going back to microarray experiments. That's the other thing: the literature. You can definitely find groups that are doing this and analyzing the data, but very little on the RNA-seq analysis or lab side is really standardized, so I expect not really, unfortunately. OK, a quick discussion of replicates. There are many different types of replicates that you might consider: technical, experimental, biological. Technical replicates could be something like having multiple instances of sequence generation from the same physical library. So you make a library and you sequence it one day, and then you sequence it again the next day, or you put it on different flow cells or different instruments or different lanes of the same instrument, to get a sense of the variability on the sequencing side of things: how much does it matter which machine the sequencing happened
on, or was it in lane 1 or lane 2 or lane 3, or last week? Just to quickly address that: the Illumina technology is now very, very stable, and no one does these kinds of replicates anymore, except maybe periodically or when a new instrument is installed. For the most part, if you run a library on the Illumina HiSeq 2000 platform this week and run the same library again next week, the results are very, very similar. You'll commonly see a correlation of expression values that looks something like this, with an exceptionally high r-squared value. Basically, you don't need to worry when producing more data that it's on a different instrument or a different flow cell. If it's the same library, you can safely go back to that library, get another lane of data, and just add it into the data you already have. You can keep it separate as a replicate if you want, but you're really going to see that they're very, very similar. Of course, biological replicates are not the same story. As wonderful as modern next-generation sequencing technology is, it didn't eliminate the variability of biology. If someone comes up with an instrument that does that, I would be very interested to know. But basically, if you have multiple isolations of cells showing the same phenotype, stage, or experimental condition, these are always a good idea. Biological replicates are always a good idea. It doesn't matter if you're using microarrays or RNA sequencing or some other strategy; you want to control for things like environmental factors, growth conditions, time, etc. The correlation coefficient that's shown here for the biological replicates will really depend on how variable your system is. It depends what your experiment is, but it's definitely something to consider. It's very difficult to predict how many replicates you will need without knowing a lot more about the exact experiment you're doing.
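To make the technical-replicate point concrete, here is a minimal sketch (not from the lecture; the expression values are hypothetical) of the kind of correlation check you might run on two lanes of the same library:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical log2 expression values for the same library run in two lanes.
lane1 = [2.1, 5.3, 8.0, 3.2, 6.7, 1.1, 9.4]
lane2 = [2.0, 5.4, 8.1, 3.3, 6.6, 1.2, 9.5]
r = pearson_r(lane1, lane2)
print(round(r ** 2, 4))  # r-squared very close to 1 for technical replicates
```

For real data you would compute this over thousands of genes, typically on log-transformed expression estimates; biological replicates will show a lower, experiment-dependent r-squared.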
So just in case someone asks me that: sorry, you'll probably be disappointed with my answer. OK, so that's the lab side of things pretty much covered. Are there any more questions about that? It seems like we had quite a few so far. Yeah. I have also heard this very commonly, and I've heard of people doing it twice, trying to improve it. Even if you get, say, an 80% reduction, that still leaves a lot of ribosomal RNA, because in a species like human, 95 to 98% of all RNA is ribosomal RNA, and the tiny fraction that's left, the 2 to 5%, is the mRNAs and other RNAs that you're actually interested in. So even if you remove half, or two-thirds, of the ribosomal RNAs, you still have a ton of ribosomal RNA, and it may still completely dominate over the other transcripts. So you may still be really disappointed when you get your RNA-seq data back and you start aligning reads: you randomly pick 10 reads, and eight out of those 10 align to the same two ribosomal transcripts. This can seem very disappointing and frustrating, and I think it has been a very common experience for people doing ribo-reduction, and that's why they often will do a poly-A selection instead. But the ribo-reduction kits continue to come out, there continue to be new versions of them, and they are improving. So I think it's the kind of thing that eventually may become standard, because there are a lot of advantages to just reducing the ribosomal RNA and then having this more holistic view of the transcriptome without selecting for polyadenylated transcripts. Sorry, I cannot recommend any kits, because all of the ones we tried we were not satisfied with and don't use. I know that some of our technology development people continue to evaluate them, and I'm sure they're looking at one right now, but I'm not aware of the name of it.
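The arithmetic behind that point is worth writing out. This is a quick sketch (my numbers, following the 95-98% figure from the lecture) of how much ribosomal RNA survives a ribo-reduction step:

```python
def rrna_fraction_after_reduction(rrna_frac, removal_efficiency):
    """Fraction of the surviving RNA that is still ribosomal after a
    ribo-reduction step removes `removal_efficiency` of the rRNA."""
    rrna_left = rrna_frac * (1 - removal_efficiency)
    other = 1.0 - rrna_frac  # non-ribosomal RNA is untouched
    return rrna_left / (rrna_left + other)

# Start at 95% rRNA; an 80% reduction still leaves a library that is
# mostly ribosomal reads.
print(round(rrna_fraction_after_reduction(0.95, 0.80), 3))  # 0.792
# Even removing 95% of the rRNA leaves roughly half the reads ribosomal.
print(round(rrna_fraction_after_reduction(0.95, 0.95), 3))
```

So an "80% reduction" that sounds impressive on a kit datasheet still yields a library where about four reads in five are ribosomal, which is exactly the disappointment described above.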
I should check on that again, because it's something we periodically need to think about, whether we should switch. We periodically do comparisons of a variety of approaches. We have, numerous times, done poly-A versus total RNA, a kit that has a normalization strategy versus one that doesn't, one that has linear amplification versus one that doesn't. And we're always fighting this battle between wanting to compare all of the new technologies on the one hand, and on the other hand needing to produce data and wanting to be able to compare our data from this week to the data we generate a month from now; if we keep changing that variable, we can't do that. So we do have a standard protocol that we use, and we'll try to keep that for at least a year or so, and then on the side we'll think about what our next default is going to be. So there's a big lag, and we're a huge production sequencing facility, which reduces our agility to keep changing variables, because we need things to work in a standard-operating-procedure kind of way in a production environment. But we have been using a kit called the NuGEN Ovation version 2, which has one of these normalization strategies somewhat built into it and does a halfway decent job of reducing the ribosomal RNA content, whether you put total RNA into it or poly-A selected RNA; it works OK for either. We did one experiment maybe six months ago with duplex-specific nuclease normalization, and we were not very satisfied with the results, but I know that various companies are working on kits that try to optimize that. The idea of it seems promising, anyway. Yeah, we weren't very satisfied with it either. We are not using spike-in controls, but we should. Yes, I absolutely recommend them; I have recommended them to the people that make our libraries.
I mean, there's always this trade-off between the complexity of making the library, the cost, the number of steps, and actually getting things done. So it's one of those things where everyone can agree it's a good idea, and then somehow it doesn't happen; that's just the nature of bureaucracy. Other than cost, you can adjust the total molarity that you're adding so it's not that costly in terms of the number of reads. I think a low level of spike-in will still be useful and should not burn that many reads, but there's probably someone who thinks, oh, I don't want to waste my reads sequencing those same control sequences over and over again. You should try it. Come back next year and let us know how it goes. OK, so I think we should get into some analysis content now, since we've used a lot of time already. OK, so here are some of the analysis goals of RNA sequencing. We've talked about some of these already: gene expression and differential expression; we're going to do some practical examples of that. Alternative expression analysis, transcript discovery and annotation, allele-specific expression; we'll do a very brief example of looking at that. Mutation discovery, which unfortunately we don't have time to cover; it's actually quite challenging to do from RNA sequencing, but people have done it. Fusion detection; we're going to do a basic fusion detection strategy. And RNA editing, which we will not have a chance to cover either. But we're going to cover most of these things. Some general themes of RNA-seq workflows: each type of RNA-seq analysis has distinct requirements and challenges, but there are also common themes, and they generally follow this pattern. We're going to have some raw data, usually in the form of a FASTQ file. We're going to align and/or assemble the reads, and we'll explain what we mean by that. We're going to process the alignments with some tool that's specific to the goals.
So usually there's a tool specific to each of the goals I mentioned on the last slide. For example, there's a tool called Cufflinks for expression analysis, and there are tools like deFuse or TopHat-Fusion for fusion detection, and so forth. And then there will usually be some kind of post-processing, where we import the output from one of these tools into downstream software that helps us summarize, visualize, perform statistics, and so forth; that's what's described here. And then usually at the end we'll have gene lists or candidate genes for validation, similar to what you would get from a microarray expression experiment or other platforms. This is just for reference: some tool recommendations. These are tools that I've used and found useful. We're going to provide a whole bunch of links, figures, etc. that review a galaxy of tools you might use for different purposes. The details are not really important; this is just something for reference in your binder. So, the SEQanswers exercise. I had a couple of exercises in here; we're doing OK for time. So there are two forums that are really useful for asking questions about next-gen sequencing analysis, bioinformatics, etc. One of those is SEQanswers. So I thought we could take a quick break from the lecture and spend a few minutes just running through these four steps at SEQanswers. You're basically going to go to this website and go to the wiki link, and we're going to look at a list of software tools for RNA-seq analysis that these guys are aggregating. This is a way to keep on top of the latest developments. We try to update these slides to wherever the field is at as of today, but a month from now a new tool will come out, and you may want to hear about it; this is one of the ways you can learn about the latest developments in the field. So we'll just give you a few minutes to do that, and then we'll continue on.
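As a side note on the expression step: Cufflinks reports expression as FPKM, fragments per kilobase of transcript per million mapped fragments. Here is a minimal sketch of that normalization with hypothetical counts (not from the lecture), just to show what the number means:

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.
    Normalizes a raw fragment count for both transcript length and
    sequencing depth, so values are comparable across genes and libraries."""
    return (fragments * 1e9) / (transcript_length_bp * total_mapped_fragments)

# Hypothetical: 500 fragments on a 2 kb transcript, 20 million mapped total.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

Cufflinks estimates this in a more sophisticated way (resolving fragments shared between isoforms), but the unit it reports is the one computed here.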
Yeah, so we're going to look at another forum website as well in a few more slides, and it seems to have heftier servers, so maybe we can just move on, since this one isn't able to handle all of us at once. OK, so let's continue with a few more common questions about RNA-seq analysis. This one I get asked all the time: should I remove duplicates from RNA-seq data? A duplicate is a read that aligns to the same region of the genome as another read. So we have a read that aligns to the first 100 bases of the gene EGFR, and then another read we observe aligns in exactly the same way, so it looks like the exact same piece of information. The reason this question gets raised a lot is that in DNA analysis, it is extremely common to remove duplicates; people don't even think about it. You align all of your data against the genome, and any reads that align in the same place, you automatically mark and do not consider. You basically say, these are redundant pieces of information, I'm just going to count one occurrence of this alignment, and then when I do variant calling or other things with my genomic DNA, I'm not going to double-count reads that align to the same place. The reason they do that is that during library construction there's a PCR amplification step, and the worry is that if a read aligns to exactly the same region as another read, those reads were probably amplification artifacts; when we're sampling reads randomly, we don't expect to see reads that align exactly the same. Unfortunately for RNA, the question of whether to remove duplicates is much more complicated than for DNA. The concern is still that duplicates could reflect biased PCR amplification of particular fragments. But the problem is that for really highly expressed, short genes, we expect duplicates even if there's no amplification bias, because we just have a very different sequencing situation.
In whole-genome sequencing, we might be producing approximately 50X coverage of the genome, so at any particular site you have maybe 50 reads piling up, and they all start and end at slightly different places; they're not expected to be duplicates. But in RNA-seq data, we might have a transcript that's expressed at hundreds of thousands of copies per cell, and when we sequence, we're going to get tons and tons of reads for that transcript, because it's so highly expressed. Just because of the number of reads coming from that transcript, it's much more likely that we will have true duplicates, and they represent the high expression level of that gene. So if we remove the duplicates, we're actually removing the information that says this gene is really highly expressed and this other gene is really lowly expressed. So generally in RNA-seq, people do not remove duplicates. But then of course we're left with the problem that you could still have PCR amplification bias. So there isn't really an easy answer. I would say: don't remove them unless you think you have a good reason to, or maybe for a particular application, say if you're doing variant discovery in your RNA-seq data, you might remove them then. But I wouldn't generally remove them; we don't. So does it imply that for short transcripts, you could just be double-counting them all the time in your expression analysis? You could be, yeah. But if you do differential gene expression, hopefully the amount of duplication would be the same in your control. Exactly, so the effect would cancel out. Yes. So that works if you're doing differential comparisons across samples, but if you're doing a rank-ordered list within one data set, that would be a problem. Potentially, yes, and it's one of the many reasons why there are many caveats associated with creating a rank-ordered list within a data set.
So if you're doing something like class discovery within a cancer with different subtypes, you would need a rank ordering rather than a differential comparison. And the problem is that the number of duplicates differs per gene. They could differ depending on features like the nucleotide content in particular: you might have more amplification bias in highly GC-rich regions, and then you will get a bias towards more artifactual duplicates in those kinds of regions. Some genes have those regions more than other genes, so it will affect your rank-ordered list of genes, basically making it not truly reflect the order of expression levels of the genes in the cells. What would you consider a metric of too high or too low, and what's your standard for that? I don't think there is one number. I think the best you can do is assess it in each library and look for libraries that are outliers. It will depend on how you make your library; it will depend on so many factors that there's no way for me to really say what the number would be. You can do it in DNA because there are so many fewer variables, but for RNA it's really hard. It's just something to consider. How do you assess duplicates? You can observe how many duplicates there are when you're doing the analysis, and we'll talk about that. So do you think people could skip the PCR when starting material is not limiting, just not do PCR? Yes, generally you can do much less PCR. You cannot do no PCR, because you need to enrich for fragments that have linkers, to allow sequencing to occur on the instrument. But we definitely adjust the number of PCR amplification cycles according to how much input there was, and we lower it as much as possible. How much library depth is needed? Another question that's really commonly asked, and that unfortunately doesn't have a good answer: it depends what you intend to do with the data.
And a lot of other things. For example, gene expression tends to put a lot fewer requirements on depth, so you can get away with pooling multiple samples into a single lane, maybe three, four, five samples in a single lane of HiSeq data, and you'll probably get really good gene expression estimates. But for something like alternative expression or mutation calling, you need much more data to do well, so you'll have to adjust depending on what your goals are. It can also depend on things like the way the library was created, how long your reads are, and whether you have paired or unpaired reads. So the advice I give when people ask this is to try to find someone who did an experiment with similar goals to yours, see how well whatever level of depth they chose worked out for them, and use that as a starting point. Then perform a pilot experiment, see how it goes in your conditions, with the way you made your libraries, with your cells, your mouse, your whatever, and try to assess the outcome and see whether you feel like you need more depth. But the good news: one to two lanes of the most recent Illumina HiSeq data should be good enough for most of these purposes. I guess that's the worst case. Yeah, exactly. As I mentioned, if you have a library already made and you sequence it one week and find out you don't have enough data, you can go back to that same library, produce more data, and just pool it together; this is totally fine. Mapping strategies. We're going to talk a lot about this, so I'll just briefly introduce the topic here. What mapping strategy should I use? The simplest factor to consider is read length: if you have short reads, you might want to choose a different alignment strategy than for longer reads. For example, in the early days of Illumina, when you had reads shorter than 50 bases, it was common to use an aligner like BWA and align against a genome plus junction database.
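As a toy sketch of what such a junction database contains: you concatenate the tail of one exon with the head of the next, with enough flanking sequence that a read crossing the junction fits entirely inside it. The sequences and the tiny read length below are hypothetical, chosen only for readability; a real database would be built from annotated exons at genome scale, and would typically also include non-adjacent (exon-skipping) combinations.

```python
def junction_sequences(exons, read_length):
    """Build exon-exon junction sequences for an in-silico junction database.
    Each junction takes up to read_length - 1 bases from each flanking exon,
    so any read crossing the junction can align entirely within it."""
    flank = read_length - 1
    return [left[-flank:] + right[:flank]
            for left, right in zip(exons, exons[1:])]

# Toy exon sequences and an unrealistically short read length of 6 bases.
exons = ["AAAAAAAAGGTAC", "CCGGTTAACC", "TTTTGGGGCCCC"]
for junction in junction_sequences(exons, read_length=6):
    print(junction)
```

This is also why the database is tailored to read length, as discussed next: longer reads need longer flanks, and every supported read length implies rebuilding the junction set.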
And I'll explain what I mean by exon junctions in a bit. The junction database needs to be tailored to your read length. Basically, you're taking pieces of exons and splicing them together in silico, and then you're aligning reads against those, and they need to be long enough to accommodate the length of your reads. But generally, not that many people are producing reads that short anymore; really, we're getting a lot of reads that are 100 bases or even longer now. And for reads of that length, you should really do a spliced alignment using TopHat or another spliced aligner, where you're not aligning against just the transcriptome, you're aligning against the genome, and you're letting the reads tell you what the exon-intron structure really looks like. So here's an example of the output from a spliced read alignment. This is two samples here, in a genome viewer called IGV that we're going to be using a lot. There's a normal whole-genome sample, which is just whole-genome data, and this is RNA-seq data; it's a tumor-normal comparison that was done at the DNA level, and then we have an RNA-seq track at the bottom here. What I mean by spliced alignment is that we have reads, each of these gray bars is an individual read, aligning against the genome. And you can see, for example, that this read has aligned with several bases here, and then the rest of the alignment continues down here. If we compare to the gene track at the bottom, we can see that the gap in the alignment corresponds to an intron. So basically, we've got an exon here and an exon there, with an intron in between, and we've got reads spanning across that intron: an alignment of, say, 60 bases here and 40 bases there, spanning across the intron boundary. That's what I mean by spliced alignment. How reliable are expression predictions from RNA-seq?
So this is another common question. Things like: are novel exon junctions predicted from RNA-seq really real? What proportion would validate if you did, say, RT-PCR and Sanger sequencing? Is this all nonsense, or is it really reliable, or somewhere in between? Are differential or alternative expression changes that we observe between different tissues actually accurate? How well would they correlate with something some of you may be more familiar with, say, qPCR? To answer this, I actually did this in a publication, in response to some reviewers who basically asked these questions. We did about 400 validations using things like qPCR, RT-PCR, and Sanger sequencing, and that's what's depicted here, at least part of it, where we had predictions of various splicing events. For example, we have a gene depicted here with three exons, and we predicted that this exon was being skipped. If we design PCR primers to amplify across that skipping event, do we actually see these two versions of the transcript, one where exon one is connected to exon three, and one where it's exons one, two, and three? These would be expected to give different sizes of bands if we run them on a gel, and if we cut those bands out and sequence them by, say, Sanger sequencing, do we actually see evidence for skipping of the exon, which is what's depicted here? So we did this about 200 times, and then we asked: how often did we get what we expected? The validation rate was 85%, and that was with much shorter reads than what we'd be using today, 42-base reads, so the accuracy will only go up from there. And of course, the RT-PCR assay itself has a false-negative rate, so this would be a lower bound on the validation rate; you could easily repeat this experiment today and get a 95% validation rate, I would expect. Similar idea, but at a quantitative level, using qPCR.
So if we have predictions of, again, exons being skipped, or isoforms that utilize different exons, and we want to make exon-specific expression estimates and compare them to the exon expression estimates we're getting from the RNA-seq data, how well does RNA-seq correlate with qPCR? Again, we did about 200 of these and looked at the correlation and the validation rate, where RNA-seq told us an exon was differentially expressed and qPCR confirmed that that exon was differentially expressed between the two conditions; in this case, the conditions were drug-sensitive and drug-resistant cell lines. And again, it's very convincing: you get an 88% validation rate and a very nice correlation between the technologies. Would you still advise validating results, or do we just accept that the technology is good? A lot of people are not doing a systematic validation like this; this was a validation of the technology. I would say it's more common now that if someone is doing a genome-wide or transcriptome-wide screen and they identify something they find very interesting, and they're making some biological investigation based on it, they'll validate that specific event as part of the further functional studies they do to follow up on the observation. But I don't think you need to go back and show that the technology works; I would say you just focus on the things that matter to you. I mean, if you're making a figure panel for your paper and you're talking about an isoform you discovered in RNA-seq for, say, EGFR, you would validate that one example, perhaps, or that one gene-overexpression example; it might depend on how convincing it was. This is the last slide of the lecture proper, which is to do a similar exercise to what we did before with SEQanswers. For this one, I would like to encourage you to at least go and sign up for Biostars if you have not already. It's really simple.
If you have a Google account or a Yahoo account, you don't even need to create a new account; there's an OpenID sign-in. Most people have one of these, and I think there might even be Facebook or something. I can do it live, although I can't show what it looks like for someone who isn't registered, because I am registered. So if you go to Biostars.org and you are not signed in, you should be able to hit the sign-in button and you will see this. If you already have one of these four account types, you basically just click that button; it will automatically log you in and create an account for you. If you don't, you can still create a login; you can create an OpenID login if you don't have any of those accounts. So, OK, if I log in. For me, I use my Google account to sign in, so I just press the Google button and it knows what to do. So now I'm logged in. Basically, it's a question-and-answer site where people ask questions about bioinformatics analysis and they get answers. The layout is quite complicated, so it's a bit overwhelming at first, but basically the landing page shows the most active questions or discussions being considered right now. So for example, here someone is asking what the samtools flagstat results mean, in a question that's not written very well, which is also very common with forums. Each of these tabs will give you a different breakdown of questions. You can look at popular questions only, if you want to eliminate forum postings or discussions. If you're interested in answering other people's questions, you can go to the unanswered section. As you participate, you will be encouraged by earning little badges, to try to get you to help other people as well as just asking questions. And the search is not bad. So for example, if we search for TopHat, you will get various questions that are relevant to TopHat.
So for example, here's the top hit: "Is TopHat the only mapper to consider for RNA-seq data?" What a wonderful question. And you'll often see very detailed answers related to that question, ordered by the number of votes. It's similar to the Stack Exchange platform, if you're familiar with that. I strongly encourage you: when you have a random bioinformatics question related to RNA-seq, there's a pretty good chance that someone has asked it already and that there may be a good answer already on this site. The how-to section contains additional tutorials, some of which are similar to what we're going to be talking about today; many, many tutorials on different topics. The Planet section is a way to keep on top of up-to-date discussions of next-generation sequencing analysis and so forth. There are also job postings, including jobs from our center. So after you take this course and you're all experts in RNA-seq analysis, if you're looking for a job, please let us know; we would love to hear from you. OK, I think we'd better keep moving. OK, so hopefully none of you have seizures. I'm just going to do a quick introduction to the tutorial. I'm going to show this flow chart over and over again, so you should become pretty familiar with it; it shows the overall workflow of much of what we're going to go through. We're going to start with some data that was produced by a next-gen sequencing machine. As I mentioned before, we're using two-by-100 paired-end reads, because this is the most common format right now. We're going to go through a lot of detail about how we take those reads and align them to the genome and transcriptome, and how we compile those alignments into transcript predictions; we're going to use a tool called Cufflinks for that. We're going to merge all of our identified transcripts together and compare them back to known genes.
We're going to do some differential expression analysis, and then we're going to try to visualize these results. That's basically everything we're going to do today. And we're going to talk a lot about the inputs to this process. We're going to start with raw sequence data and look at the format of the files that RNA-seq data comes in. We're going to look at the reference genome, because all of this is based on having a reference genome; we're only briefly going to touch on the situation where you don't have a reference genome, because we don't have time to cover it. Another critical input for a lot of these analyses is gene annotation, so we're going to look at the gene annotation file formats that you need to know about. And the first thing that we're going to do, which is what we're going to move to next, is the module one tutorial, which is really focused on the file types: the input data, the stuff that we're going to need to get going with our alignment and expression analysis.
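As a small preview of those input file formats, here is a sketch of the four-line FASTQ record that raw RNA-seq reads come in, and how it parses. The record shown is made up for illustration; real files from the sequencer contain millions of these:

```python
# A FASTQ record is four lines: @read-id, sequence, '+', per-base qualities.
fastq = """@read1/1
ACGTACGTAC
+
IIIIHHGGFF
"""

def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from FASTQ-formatted text."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):
        read_id = lines[i][1:]              # drop the leading '@'
        seq, qual = lines[i + 1], lines[i + 3]
        yield read_id, seq, qual

for rid, seq, qual in parse_fastq(fastq):
    # Phred quality score = ASCII code - 33 (Sanger / Illumina 1.8+ encoding)
    quals = [ord(c) - 33 for c in qual]
    print(rid, seq, quals)
```

The `/1` suffix on the read id marks the first read of a two-by-100 pair; its mate would carry `/2` in the second file of the pair.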