 All right so this is the first of four lectures and we're gonna follow this general theme of having sort of a brief lecture. This is probably one of the longer ones and then there will be some slides introducing the tutorial and then we'll have to go through the tutorial and we'll kind of back and forth like that so that there's a mix of lecturing sort of live tutorial activities kind of interspersed. So these again the preamble slides with the creative commons license. Module one so this is really an introduction to RNA sequencing itself with not too much emphasis on analysis and then we're gonna get more and more into sort of the analysis aspects as we go. Learning objectives of this course overall so we're gonna have as I said four modules the first one will be an introduction the second one we're gonna really dig into the details of RNA sequence alignment and visualization of RNA-seq data. The third one will be expression differential expression and the fourth one will be isoform discovery and alternative expression. We're gonna do the first two today and three and four tomorrow and then each of these modules is gonna have a series of tutorials and the goal of these tutorials is to provide a working example of an RNA-seq analysis pipeline. The idea is you can take this home and use it to set up the pipeline that you're going to use to process your own data in your own lab. For the purposes of the educational instruction that you receive here we have the goal of that this should run in a reasonable amount of time with modest computer resources. You will probably need greater computer resources to process larger data sets when you get back to your own lab but the the data sets we're using here really sort of test-sized so that things kind of roll along quickly and the whole tutorial modules are meant to be self-contained and self-explanatory and portable so they don't there's nothing about them that means that they will only work here you should be able to take these home and run them on any Linux or Mac machine or Windows where you log into a Linux machine. We're gonna log into the into the Amazon cluster but you could just log into some other server that you have available at your university. Specific learning objectives for module ones we're going to do an introduction to the theory and practice of RNA sequencing analysis starting with sort of the background rationale for sequencing RNA in the first place. Some of the challenges that are really specific to RNA-seq analysis that might not be encountered when you're doing DNA sequence analysis. Some of the general goals and themes of RNA-seq analysis workflows there's a infinite number of ways you can do RNA-seq analysis but there are some general themes that are shared across these different approaches. We're gonna review some common technical questions that come up in RNA-seq experiments and you're welcome you're welcome to ask additional questions but we kind of just added in questions that get asked every year just to address them at the outset. We'll go over how to get help outside of this course and then we'll do a brief introduction to the first tutorial. So before we go any further two screens this is sort of the really high-level biological introduction to RNA sequencing that's kind of a good idea to start with. We could probably spend the next hour talking about sort of the details and implications of this slide but just briefly it's this is a summary of the central dogma basically describing how information is transmitted from the genomic DNA level to the protein level. So starting at the top here we have a depiction of a double-stranded genomic DNA template in very cartoonish and not to scale format showing a promoter region and a hypothetical gene with three exons and two introns and then there's various features annotated on this gene such as a translation initiation code on and a stop code on a polyadenylation site. This thing gets transcribed by a polymerase into an RNA molecule and it gets polyadenylated so this happens in the human anyways happens in the nucleus and you wind up with a single stranded pre-mRNA molecule which still has the introns in place in between the exons and then what's shown here are various features that allow splicing of this molecule to occur so you have donor sites and acceptor sites we're going to talk a bit more about those in module four. You got your introns and your exons and then there are various other regulatory features such as exon splicing enhancers and silencers and intron splicing enhancers and silencers and there's a very complicated splicing machinery that recognizes all of these features and very specifically removes the introns and splices the exons together to give you a mature mRNA molecule which is what's depicted here. This thing once it's polyadenylated gets exported to the cytoplasm where the translation into protein will occur if it's a protein coding transcript and then that will give you a protein sequence which will be folded and other post-translational modifications may occur. This often is what we're interested in in terms of biology if we could sequence these things directly and understand their structure in some kind of high throughput fashion we would probably just do that but the technology is really not available to do that in high throughput fashion most proteomics technology is still really low to medium throughput but we can study this thing in a very high throughput fashion by RNA seek and that's so that's really what we're going to be talking about over the next few days but it's important to remember that RNA seek isn't really sequencing these things directly so there's a few sort of nuances to what's really going on one is that we're not actually sequencing RNA all of the sequencing technologies that are available today actually sequence DNA so in order to sequence these things we need to convert them to cDNA and it might be double-stranded or single-stranded cDNA but it will be DNA and then we're not sequencing full length transcripts we're usually converting this into cDNA and fragmenting it into smaller pieces or fragmenting the RNA and then converting those pieces into cDNA but either way you're starting with something that is potentially quite large and then you're sequencing little pieces of it and then we're going to try to assemble those pieces back together and infer something about the structure of the RNA that was actually there but it's really important to remember that there's a potentially a lot of inference going on there and there is a whole field of people developing algorithms to make those inferences but it's a very difficult problem and you really have to keep in mind that even though you may be given a prediction of a 2000 base pair RNA with a particular structure that was really kind of assembled from pieces of information and it isn't guaranteed to be correct so the sort of the amount that this sort of caveat is important will depend on your species and how big the RNAs tend to be so in human we have the situation that's depicted here so for example so let's consider just a really simple experimental design and this is what we're going to the data that we're going to be analyzing the tutorials will be based on this scenario where we have just two samples of interest in this case a colon tumor sample and a match normal sample from adjacent colon tissue and from those two tissues we isolated RNA which is what's depicted here the scale here is shown as 250 basis you can see that these RNAs are on the order of 500 to two thousand three thousand five thousand bases in size which is representative of what they would be for human these from these RNAs we're going to generate CDNA fragments size select those often add linkers to them and then we have these small fragments of CDNA with sequencing adapters on either side there tend to be in the range of 250 to 400-500 base pairs in size these fragments all of these molecules are going to be flowed across an aluminum flow cell in this case the data came from the aluminum platform which is depicted here so this is a about the size of a microscope slide it has eight channels and each of those is sort of a freeform space that you flow molecules across and then the sequencing happens in cycles of ATCG where you image this flow cell and you observe the incorporation of each of the four bases and you keep cycling through and you build up RNA sequences from that the output of this machine is going to be hundreds of millions of paired end reads so really what we're going to get is a short sequence at the end of each fragment so we started with these fragments and we're going to read in from each end and the data we're going to look at today is actually a hundred bases in from the left and a hundred bases in from the right and then depending on how big the fragment is there's some unknown sequence in the middle and that's just one of the many inferences that we're going to have to make is sort of inferring that when we map these two reads and we identify them we're kind of guessing that the sequence in between is what we think it is based on the reference genome so this is really the currency of all RNA sequence analysis is these paired reads and what we're going to do is take those reads map them to a combination of the genome transcriptome and predicted exon-exon junctions from known transcript sequences and then all of this is going to be fed into a variety of downstream analyses and that little box is really what we're going to be this this process is really what we're going to spend the next two days really taking into yeah so that's a good question so the question was or the comment was that this is really so far focused on coding RNAs what about non-coding RNAs so it is a bit of an oversimplification the way I've depicted this these mature mRNA molecules in this case it's assumed that they're protein coding but depending on exactly how you isolate your RNA they don't necessarily need to be protein coding if you don't do a polyase selection they may or may not polyadenolate it really what you're going to get out of an RNA seek experiment the way most people do it is sequences for all of the coding and non-coding RNAs that are above a certain size for micro RNAs and other small RNA species usually you will have to do a separate library construction protocol so most labs to kind of take a divide in conquer strategy where they say do what is considered sort of classic RNA sequencing is everything that's say 300 bases or larger and then there's a variety of other sort of custom techniques for creating libraries to sequence the small RNAs and most people will do at least two they'll have a small RNA or a micro RNA pipeline and they'll have an everything else pipeline and we're gonna most of what's considered today will be the sort of classic RNA seek which is everything that's a few hundred bases or larger a lot of the analysis techniques will be the same but they're certainly for very small RNAs you will need probably a dedicated RNA micro RNA pipeline that does alignment slightly differently and does sort of summarization of the data in a different way yes yes although there are some sort of nuances to that that go through it in particular if they're not non-coding and they're not polyadenylated then you would want to make sure that you didn't do a polyase selection and there's a whole galaxy of RNA seek kits for making the library some of them include polyase selection some of them include priming off of oligo DT primers and some of them don't you want to really think about the details of the RNA seek kit that you choose to make sure that it's going to cover the types of RNAs that you're interested in but there are some sort of very holistic approaches that don't do polyase selection that use only random primers and then really the main limitation will just be the size of the RNA everything else should come through for really small micro RNAs you're probably gonna want to you might use the same a liner or the same underlying a liner you you might you might do okay with top hat probably you better off just to use bow tie which is what top hat uses for its alignment and in some ways the aligning alignment is sort of conceptually some simpler for micro RNAs because a lot of what a lot of the fanciness of the top hat alignment algorithm is trying to figure out where splice junctions are so where the exon start and end of what where the intron is interspersed and that doesn't really apply to micro RNA so it's kind of a wasted search when you're trying to do that with micro RNA so I would say you you better off to go to a different alignment strategy for micro RNAs any more questions yes we will not unfortunately have time for de novo assembly it's one of several topics that we would love to cover if we had more time maybe if this was like a three or four day workshop or we may have a separate workshop in the future we will make some recommendations of tools for assembly so there's a couple slides are sort of like what if my experimental situation doesn't match what we're doing here but it gets yet the assembly is transcriptome assembly is really a difficult problem and it's difficult to do justice in a you know in just a few minutes so we would need like a whole module dedicated to that yeah I mean good is a highly you know subjective term but the assumption for these tools is that you have some reference to work with yeah I mean the better the references the better your results will be but you definitely need some kind of reference to use the tool chain that we're going to use today I think we're gonna talk about that on an upcoming slide but there are basically two strategies one is to positively positively select for things that are polyadenylated and since ribosomal RNA is not polyadenylated it gets sort of washed away the other is to positively select for the ribosomal RNA and then to keep everything else and that both of those approaches are quite widely used and they both have their pluses and minuses and there are sort of other strategies that are a little bit more indirect so there are some library construction strategies that just attempt to remove very abundant species which ribosomal RNA's is are by far the most abundant in your RNA sample so I would say those are kind of the three the three strategies and there are kits that use each of those different strategies capture yes there's also you can do like an exome capture of your RNA library to enrich for actual genes and to try to wash away the ribosomal species and you can do combinations of these things as well so your question is how to eliminate genomic DNA contamination so your sample will be contaminated with genomic DNA there is no will it will it not the question is how much and it will also be contaminated with unprocessed RNA's if you're dealing with the eukaryotic species again there's not a question of will or will not it's just a matter of how much and one of the ways that people assess the level of that is to use one of the tools that we're gonna talk about so the card has RNA QC component that summarizes how many breeds align to exons across exon exon boundaries within introns within UTRs in non-genic space and in ribosomal or mitochondrial genes and you can use to sort of break down of those categories to assess how much your reads align to real transcripts versus to your places where you don't expect a gene or two introns or and you can get a sense of how much genomic DNA contamination you have by looking at the number of the proportion of reads that align to regions of the genome that don't have annotated transcripts all right so a couple quick sort of background rationale slides why would we sequence RNA in the first place versus genomic DNA and this is really for sort of the genomic crowd so probably none of you need to be convinced but just to quickly run through some of the rationals any anyways one of them is you study RNA because you're interested in doing functional studies so there are experimental scenarios where the genome may actually be constant but some experimental condition has some effect on gene expression so for example you might have cell lines where you have a drug treated and untreated condition you may have a model organism where you created a genetic variant of that model organism so you for example some gene knockout mouse for example some molecular features can really only be observed at the RNA level so you can try to predict things like alternative isoforms fusion transcripts and RNA editing from the genome but it's really really hard not very accurate you sequence the RNA directly you have a much better method for interrogating those things predicting transcript sequence that's actually expressed from the genome sequence is extremely difficult a lot of people used to develop methods to do this one genome sequencing was prevalent but before it was practical to do a lot of RNA sequencing so people would try to predict what will the alternative splicing or RNA editing pattern look like by studying the genome sequence now that we can sequence RNA we can bypass that very problematic and inaccurate method by just interrogating the RNA directly when you combine RNA sequencing with DNA sequencing sometimes this allows you to do interesting things like you can interpret mutations that you detected at the DNA level but you don't know what their effect might be so for example you might study regulatory mutations that affect what RNA isoforms are expressed and how much so these could be splice site mutations promoter mutations mutations in exon splicing signals and then you can also use it to prioritize protein coding somatic mutations this is something that's done in cancer a lot where you have mutations in genes that are expressed versus mutations in genes that are silent silent or actually silenced by the mutation so you might have a simple breakdown where you say if the gene is not expressed in a mutation in that gene might be less interesting if the gene is expressed but only the wild type allele this could suggest a loss of function mutation and conversely if the mutant allele is preferentially expressed this might be a good candidate drug target because you know you have a somatic mutation in the genome the gene is being expressed and for some reason it's being preferentially expressed and in cancer that those are good candidates for driver mutations so there's a lot of challenges that are particular to RNA sequencing some of which we've already come up in the questions but we'll just go through all these and if you have further questions to stop me so one is the sample so this is something that's always a question with any kind of sample you're interrogating is it pure do you have enough of that sample what is the quality so is it degraded and this is particularly a concern with RNA that can easily be degraded and a lot of our time in RNA-seeking analysis has spent assessing RNA quality and trying to deal with the problems of the instability of RNA compared to DNA another problem that already alluded to is that RNAs consist of small exons that are separated by large introns in many species especially in the mammals so you've got these huge introns and relatively small exons the reads are coming from the exons so this creates a mapping challenge we're trying to map our reads back to the genome and you have some sequence that hits an exon and then it spans across a huge intron into another exon so that creates sort of a search space problem where we have some of our sequence hitting an exon and then the remainder of the sequence is mapping potentially a hundred thousand bases downstream where the intron ends and the next exon begins another problem that you have an RNA that you don't have a DNA is that the relative abundance of RNAs very wildly so in contrast to the genome where you expect approximately even coverage across the genome of a diploid organism in RNA you do not have this expectation at all the relative abundance of one RNA compared to another RNA might very wildly say ten to the five or ten to the seven orders of magnitude and this is important because RNA sequencing works by random sampling so we're not directing where where we get these reads in any way we have this pool of RNA and we're basically shotgun sequencing it we have almost basically no control over what sequences get generated and the amount of sequence that we get from any particular RNA is really dependent on how highly expressed that RNA is so when we start sequencing we tend to see a lot of reads for highly expressed genes and these consume the majority of our reads for lowly expressed genes we might have to produce a ton of sequence before we start to see any sequences from those really lowly expressed genes so if we're interested in those genes that creates a problem and this is sort of related to the high expression of ribosomal and mitochondrial genes where you basically just want to sequencing these things to death and the gene that you're really interested in it takes a lot of data before you start to get reads hitting that gene sort of another related issues that RNAs come in a wide range of sizes so we've also alluded to this already we've got small RNAs that generally have to be treated separately and then we've got much much larger RNAs and this has an implication for estimating the expression level of the RNAs because a large RNA is going to be detected more simply by virtue of the fact that it's larger so there are more reads that are going to come from that RNA because it's a big thing a small gene will be harder to cover because it's a small there's less reads that can come out of it if you have large genes you can often encounter the problem where if you're doing a polyase selection you can introduce three prime end bias so if every gene has that you're interested in has a poly A tail and you grab on to that poly A tail during the preparation of your library and then you wash everything else away while you're holding on to the poly A tail which is at the three prime end if any of your RNAs have been broken or degraded you're going to wash away the five prime ends of those things while you're holding on to the three prime end and then you're going to sequence that stuff you're holding on to and you're going to lose information from the five prime end so if you look at the distribution of coverage across all of your transcripts from the five prime end to the three prime end you're going to see this bias towards the three prime end where you're really able to sequence the three prime end of your gene but the five prime end basically is sort of an information hole yeah and I mean you can compensate in the sense of dealing with maybe bias that introduces to gene expression levels but if you're losing that information it's just lost there's not a lot you can do to recover yeah I mean again you can do that kind of thing for gene expression purposes but if you're interested in what is the sequence content of the five prime end of the gene or how are exons connected to each other at the five prime end of the gene if you don't have that information there's just nothing you can do about that yes so the answer here is yes there is definitely so all of these sequencing technologies that are currently used on every one in the past as well are we're definitely defined and optimized for a particular GC content and this is just a choice that you have to make and you try to design the technology in such a way that it isn't biased towards high GC or low GC sequences but invariably there is some amount of bias and generally things that are very GC rich or very AT rich will not be sequenced as effectively and there are other features of the sequence that will influence that will introduce bias such as repeat content the propensity for those sequences to form secondary structures versus those that do not and so forth so it's definitely another caveat to keep in mind that you know there are a lot of sort of potential sources of bias in this whole process is there a way to know the quality there are many many ways to assess the possible quality of your RNA seek data and we're going to talk about some of those it's complicated to come up with sort of one simple answer yes this is a high quality library or no it is not to some degree it depends on what you what questions you hope to ask of the data but yes there are many ways many metrics that you can gather that will especially allow you to compare in a relative sense that this library was a failure for some reason compared to this other library which looks much nicer and so some examples of that would be the proportion of your reads that actually map to real known transcripts versus those that don't the proportion of reads that map to introns versus those that map to exons the proportion of reads that map across exon exon junctions these tend to be good estimates of a good RNA seek library that so we mentioned the end bias already so a lot of people will generate these end buys plots and a look for nice even coverage from the five prime in the three prime end of the transcript and there are many other things like this too yes we're running out of time we're going to take a break at 1030 okay something I also alluded to already is that RNA is very fragile compared to DNA it's easily degraded something that you'll see a lot of are these adjuvant traces or maybe you won't see the trace but you'll hear about RNA integrity numbers this is a very very common way to assess the quality of RNA before performing RNA sequencing on it was to run it on an adjuvant lab on a chip assay and basically what you're doing is effectively you're running a gel through a capillary and then you're getting a readout of the nucleic acid sizes and the proportion of things that are each size so this is a human RNA sample that's incredibly intact and when you run it through this asset you get two large peaks for the ribosomal RNA species and based on the size and cleanness of these peaks a score is assigned to this RNA sample so this is total RNA and that's why you see so much of these two ribosomal RNA peaks this sample is assigned a score of 10 which is perfect according to the algorithm as your RNA sample gets degraded these two ribosomal RNA species start to become broken into smaller pieces and then instead of getting two large peaks you get a series of smaller peaks and the more degraded the sample is the more of these smaller peaks you get and gradually your sample starts to become more and more shifted towards smaller and smaller RNAs and the RNA integrity number gets lower and lower to reflect that level of degradation and if your sample is heavily degraded that will definitely cause problems for your RNA analysis some design considerations just sort of overall for RNA sequencing the encode consortium a number of years ago came up with a basically formed a committee and came up with a bunch of recommendations for how to do RNA sequencing we provided this the report that they wrote with on the course wiki and basically it covers things like what kind of metadata should you supply how many replicates should you use what kind of sequencing depth should you do what control experiments should you include so that would be things like spike-ins what standards to use for reporting the data and so forth it's really a good read to sort of get you and give you an idea of what the ideal experiment that you should aim for would look like so I've sort of alluded to this as well we could probably you know talk a lot about the details of these points on this slide but just to quickly review some of them there are many RNA seek library construction strategies and there really isn't a lot of agreement in the field about how you should make an RNA seek library and to some degree it really depends on what your experimental goals are and that's why it's difficult to standardize but it's also not standardized because it's still being improved the early kits were really not that great and we've gradually they've been improved over time so some some of the major choices that are made by these kits are should you do sequencing of total RNA or should you do a polyase selection and then sequence that poly a enriched material or an alternate to that might be to do a rival reduction so you start with total RNA and you try to pull away or remove all the ribosomal RNA species size selection so you might I've already mentioned divide your RNA to really small RNA RNAs and then everything larger and then once you make your seeding CDNA you might decide decide to choose a very tight size fraction so that all of your fragments are very close to the same size some people cannot do that some people will just remove the small stuff and then they have sequence a very broad range of sizes so you have these paired 100 mirrors coming from your fragments and the fragments are anything from 100 bases in size to 500 600 700 or bigger and that both of those things can have both those strategies can have implications for your analysis some kits use a linear amplification to try to deal with small RNA amounts they'll add a linker to the RNA sequence and then they'll use some kind of polymerase to try to increase copy number so that you can start with a really small absolute amount of RNA as your input initially almost every RNA C-clarbrae was unstranded in that you couldn't tell where the sequence to which strand the sequence you're sequencing came from so RNAs are only expressed in a 5 prime to 3 prime direction on one strand at one position so it would generally it's nice to know which strand your RNA was expressed from but many of the kits basically create double stranded CDNA and that's what you sequence so you don't actually know for sure which strand was being expressed you just have to infer it based on what you know about the annotation of that genome but now there are libraries where this information is maintained where you do know what strand the sequence came from and if you can do that I would definitely recommend that obi mentioned already that some people do exome capture of their RNA seek libraries to enrich for sequences that come from real transcripts real known transcripts their library normalization strategies that attempt to deal with this problem that you have very very highly expressed genes that try to that tend to burn up all of your data and then you have really lowly expressed genes that are hard to sequence so library normalization tries to sort of flatten that out so that everything is a bit more even and all of these things can really influence both the analysis strategy you take and the interpretation of the results that you get out of your analysis this and there it's especially important to remember that if you're doing comparisons between libraries you don't want to vary any of these things it's just not a good idea so you don't want to like make your libraries for your you know condition a with one approach and then condition be decided I want to try this kid or this time I'm going to do poly a selection and then you're hoping to make some biological insight about comparing condition a to condition b and you're just going to wind up observing the differences between the way you treated your libraries and the biology will potentially be lost replicates a lot of questions come up about replicates how many replicates should I do what kind of replicates the good news is that the first category of replicates technical replicates this problem is largely been solved so people used to do multiple lanes from the same library or they would run the same library on two different flow cells or two different instruments if you're using the Illumina platform these platform is so consistent that you really do not need to bother doing that anymore you can basically treat them as identical so for example this is a plot of gene expression estimates that came from data generated on one flow cell one lane of one instrument and then the same library was just loaded on a different instrument on a different lane and you basically get the same answer out of those two things so it's very consistent this platform the Illumina platform of course biological replicates this is not the case you should of course do biological replicates if you can if they're available and these could be lots of different things depending on what the nature of your experiment is so some of the common analysis goals of RNA sequence analysis so there are many different things that you can ask of the data some of the more common things are listed here probably the most common is what we're going to start with which is gene expression and differential expression so this is conceptually very similar to micro analysis that's been done for years alternative expression analysis is sort of an extension of that so not just how much is a what is the gene expression output from a locus but what are the transcripts that are actually being expressed from that locus sort of related is transcript discovery and annotation so since we're sequencing in a relatively unbiased way we have the opportunity to discover entirely novel transcripts from potentially entirely novel genes you can look at the in diploid species at allele specific expression so you might see bias towards expression of paternal or maternal allele for example some people attempt to do mutation discovery and RNA sequencing RNA seek data it's very difficult but it is possible in cancer or fusion detection is a very hot topic and then RNA editing would be the attempt to find sequence differences at the RNA level that are not actually at the DNA level that occur after transcription of the RNA from the DNA some of the general themes of RNA seek workflows so each type of RNA seek analysis has really distinct requirements and challenges but there's generally also a common theme and they kind of follow this pattern so we're going to have some raw data a fast cue file a BAM file that comes off the instrument we're always going to align or assemble that data and then we're going to process these alignments with some with a series of tools and usually there's a tool specific to various goals so for example we're going to use cuff links for expression analysis and we might use a tool like diffuse or primary scan for fusion detection so each of the sort of goals that I listed in the last slide has usually many many tools for that goal and then usually they'll be each of these tools will dump out some potentially very complicated and obtuse file formats which you will then import into some downstream software to help you visualize summarize perform statistics really try to ask the biological question that you're initially interested in and then you will from that summary create figures gene list prioritize candidates for experimental validation and so forth the tool recommendation so the details of this aren't really important this is just a reference for you guys and I cover a few of the topic areas but if you have if you want particular recommendations for other areas then we can talk about that during one of the coffee breaks I would also recommend visiting these forums to keep on top of the latest developments in RNA-seq analysis or to look for a tool for your specific problem we're running low on time but usually we would do a seek answers exercise here but you can really just check this out on your own time it's a really useful form in particular seek answers is good for sort of more of the wet lab experimental questions related to RNA-seq there's incredible amount of information from people who have been really dealing with the problems of RNA-seq analysis over the years yeah so it's up to you we can you guys we can continue the last several slides of this after a break or we can just finish them off now and then we can go right into the tutorial after a break okay okay I will try not to take too long to finish it off so a few common questions that these really relate specifically to the analysis the first question that comes up for a lot of people that have done DNA-seq analysis that are moving into RNA-seq is should I remove duplicates and some people just assume yes I should because I would always do that if I was doing generating genome sequence data but this is really a much more complicated question for RNA than it is for DNA so the reason people remove duplicates and DNA is that the assumption is that they might be from PCR amplification artifacts so you're getting the same fragment being sequenced over and over again and that was introduced during some PCR amplification step when you were making the library and that really is a real concern and you want to remove those because the information is not really it's not telling you about unique fragments you basically just have one unique fragment and then you've got many copies of it and what you really care about are the unique fragments in RNA we can't really do this because because of the problems that I mentioned before that RNAs can be different sizes and they can be expressed at different levels so you can imagine a small gene that's very very highly expressed say you have a gene that's only 400 bases in size and it's very highly expressed there just isn't that many unique fragments that you can generate from that small RNA species so you expect duplicates even without PCR amplification artifacts you expect a lot of duplicates from that the output of that locus and there's nothing you can do about that so if you remove if you remove your duplicates you're basically going to remove your ability to tell that that small highly expressed RNA was really highly expressed so generally we don't remove or mark duplicates in RNA-seq analysis how much library depth is needed every year various people ask this question and unfortunately there isn't really a simple one-number answer it really depends on the number of factors so the first obvious one is the size of the genome of the species you're studying and the complexity of its transcriptome but also to a large degree what question you're asking of the data so if you just want gene expression estimates that are very much comparable to micro a output that you would get that places the least demand on sequencing depth so you can probably take four or five six samples and multiplex them on a single lane of Illumina data and get gene expression estimates that are just fine the problem comes when you start to do ask more elaborate and detailed questions of the data so not just how much how abundant is each gene but what is the structure of the isoforms of that gene and what mutations are expressed or not expressed or mutation calling all these things mean that you need to have a much more comprehensive view of the transcriptome to answer these questions well you can also really very depending on tissue type RNA preparation method the quality of your RNA the library construction method to some degree it can depend on what kind of sequencing you're doing so how long are your reads are you doing paired versus unpaired reads and then to a lesser degree what computational approach and resources you have at your disposal probably the best thing you can do if you you want to design your experiment decide how much depth is defined an experiment that's somehow similar to what you want to do and see what they did as a kind of starting point then conduct a pilot experiment and kind of overkill the data or what you think is an overkill and see how it turns out and try to get a sense from that where you should set the bar and of course there's things like budget that will put sort of hard constraints on the amount of data that you can generate yeah this is enograph sample human xenographed yeah I guess that's what it is so that you're sequencing the human transcriptome at that point I don't see I mean you can I have seen some xenographic experiments where you tend to get good RNA quality because it's it's a fairly controlled experiment you have a lot of control over when and how RNA is isolated it's not like a tumor coming from but an individual in the operating room where it sits you know our room temperature so your RNA quality might be higher than it would be and that is definitely in your favor but I would say in that case it's still probably the bigger factors are gonna be what do you hope to do with the data as opposed to where it came from in that case so you know we can maybe talk about what your experimental goals are in more detail if you want yeah that's an interesting point and that yeah so if you have a lot of mouse contamination then those are reads that you're in effect wasting because you know you're not you're interested in the xenograph not the post so you need 10% more data than you need so the good news in all of this is that the output of the high-seq instrument is so vast now that one to two lanes of data is a lot and you can do most of what you would want to do with that so that's sort of the simple answer if you really want one some common strategies or another common question is what mapping strategy should you use and this this has also really been simplified so it used to be that we had this problem where it really depended on read length so there was a lot of people generating data where you had really really short reads that were shorter than 50 bases 36 42 that kind of thing and those short reads are really difficult to align and they're difficult to align in a manner that identify splice sites so you would generally take a different approach with those small reads where you'd align to a genome and junction database or directly to the transcript dome or you might try an assembly strategy and then try to align contigs but now most people are generating a much larger reads than that so instead of 50 bases or small either usually a hundred or paired hundred mirrors and then you should really use a splice aware liner like bow tie top hat which is what we're going to use visualization of splice it's pretty comparable to BWA in terms of speed we mostly use it just because that's what top the top hat authors chose to use as their short read aligner so top hat is really like a heuristic or a series of heuristics that are built on top of bow tie aligns alignments and but you could in theory switch bow tie for BWA and it would probably work fairly well it's just you know the authors of top hat they wrote bow tie and so that that's what they used but that you can to some degrees you know swap them yeah there are a lot of different strats that's just one example of literally dozens of difference alignment isoform assembly and quantification pipelines we chose we choose this one because we have a lot of experience with it and it's the most popular and it's one of the better engineered in terms of the software but it's not necessarily the best or the only option yes the idea here is really to try to give you a good example of what one of these pipelines look like and get you comfortable with it and hopefully you can use that do it exactly the way we do it here but or you can you know use a different approach and but there will be very a lot of common themes so the concept will be almost exactly the same and you'll just be using different tools and different commands we can't cover every possibility but the hope is that this example is demonstrative enough that you can go around and play with an alternate strategy using the skills that you've learned here I would have to check yeah like I said there's a lot there are literally like 20 different I think we have some slides that show like all these different things but we'd have to dig into the details of it to really understand exactly how it's different so we're going to look at a lot of IGV screenshots and we're going to look at IGV sessions ourselves this is just an example of a visualization of some RNA-seq data actually this is whole genome sequence data at the top here so what's being shown here is as you can see nice even coverage each of these gray bars is a single read and the little colored lines are discrepancies between the alignment of that read and the reference genome which is shown at the bottom here and this is for a normal sample and this is from a tumor sample and you can see the coverage bar at the top here you see nice even coverage across in both cases you can see here that there's a polymorphism where you've got a heterozygous sniff and then you can see here an example of a mutation that's in the tumor but not the normal so this is a somatic mutation and it turns out that this somatic mutation is at the edge of the exon of a known gene and it's actually an acceptor site mutation and then this track is the RNA-seq read or the RNA-seq track so now notice how the coverage is much more blocky where you've got high coverage on the exons and then much lower coverage in the introns and then when you look at the individual reads the reads are spanning across exon intron boundaries so this for example this read covers the end of this exon and then it continues on at the beginning of this exon with the intron sequence being skipped and then these three reads that are marked with stars are actually showing the skipping of this middle exon and a join of this exon directly to that one which is exactly what we would predict to be the effect of a splice site mutation at the acceptor site of this gene so basically the splice and machinery is no longer able to recognize this site so it skips this exon and you get a transcript that's connecting exon 1 to exon 3 here. Makes sense? And then this is sort of the last common question which is probably asked less now than it used to be but basically how reliable are on a seek expression predictions things like are the novel exon exon junctions that I'm seeing my data real or are they just some kind of artifact of alignment or the sequence data production if we went to try to validate these things how many of them would validate and sort of similar question for the differential expression level so if we find it to be a gene or an exon or an exon junction to be differentially expressed between condition a and condition b is that real is it comparable to a micro a output or to QPCR something that you might be more familiar with so to try to address this question the number of years ago I did a series of validation experiments where we compared output from RNA seek to micro rays and also to QPCR RTPCR center sequencing and so forth and you can read all about the details of that in the paper that was cited there but basically what we did is to validation experiments in one case we looked at examples kind of like the one I just showed in the IDV screenshot where we had a prediction from RNA seek that said there's a novel exon skipping event going on here so I'm expecting to see exon 1 2 3 but my RNA seek data is telling me that exon 2 is actually skipped or alternatively spliced so we designed PCR primers to span across exon 1 and 3 and then based on whether the exon is included or excluded we expect different bands so we get a large band if exon 2 is there and a smaller band if exon 3 2 is skipped and we expect the size difference to be the size of exon 2 and that both of these will be at the expected sizes if we then cut these bands out of the out of the gel and purify the DNA sequence we could validate by Sanger sequencing that indeed the exon was skipped so that's what's being shown here we've got exon 1 2 3 and the second exon is being skipped so we did this about 200 times and we basically assessed how long how often did RNA seek tell us the right thing and we got a validation rate of about 85 percent which is probably a lower limit because sometimes it's the PCR experiment that fails and actually the RNA seek was right it's probably closer than 90 95% and you see a very similar story for the expression level or differential expression analysis where again we found cases where the RNA seek told us that a particular exon was being differentially expressed or differentially spliced or had a different boundary and we got on the Y axis here differential expression estimates between two conditions so drug treated and untreated for example and then we compared the output of that to a comparable output from QPCR experiment where we designed exon specific primers and did a very classic QPCR experiment and then we did this with replicates and we assessed whether QPCR also found that exon to be differentially expressed and we got a validation rate of 88% and again that is probably a lower estimate that's probably closer than 90 95% and then again there was another exercise that we can do here maybe after the break or later you can do this on your own where basically you do a quick tour of the BioStar websites this is a really great forum for asking and answering questions about bioinformatics and there's a lot of people in the RNA seek community that participate in this forum including myself and Obi how many people here use BioStar or BioStar members it's so about a third or a quarter of you so I highly recommend that the rest of you during this course sign up for accounts there and just do some quick searches for the kinds of analysis you're doing or experimental conditions you're interested in just to get a quick sense of how much how many of your questions might have already been answered there and when you have problems in the future you can direct questions there so that's the end of this lecture and we'll continue on after the break so the first thing I want to do is just show this really high level flow chart of everything we're going to be covering in the tutorials over the next two days so everything is going to start with some sort of fundamental inputs which are listed here on the bottom at the left so we've got raw sequence data and these are going to be fast Q files just sort of a very common file format for Illuminous sequence data and other types of sequence data and in our case these are paired 2 by 100 base pair RNA secrets and then we're also going to talk about some of the other inputs that are really important to this kind of analysis so one of which we've talked about already which is the reference genome and we've learned that having a reference genome versus not having it can have a pretty pronounced effect on the way you do your analysis and then a sort of related input is gene annotations so depending on your species you will hopefully have some notion of gene annotation for that species and it'll be more or less complete depending on how well and how long that species has been studied for things like human the gene annotation is exceptionally good so an incredible amount of resources have been poured into understanding where genes are on the human genome and what they look like and what their possible functions might be how they're similar or different from other genes what classes they belong to and so forth and each of these things have sort of common file formats I already mentioned fast Q reference genomes are often represented in fast up file format and then there are many formats for gene annotations but one of the more common is yeah we're going to review all these file formats so that you know what you're looking at when you look at the raw data the way our analysis is going to flow is basically we'll start with our raw data our raw data and we're going to do alignments of those reads to a reference genome using top hat which as I've already mentioned is kind of a pile of heuristics stacked on top of bowtie which is really the fundamental liner that's used once we have our alignments we're going to have BAM files which is another common file format we're going to dig into what that how that format works as well and this file is going to be fed into cuff links which is going to try to compile or assemble transcripts from our individual read alignments and then we're going to compare those transcripts to the known transcripts for human that are represented in our QTF file using a tool called cuff merge and then we're going to compare our two conditions remember we have a tumor and a normal sample from a colon tissue using tough death and then we're going to use a cummerbund to do some visualization and statistics and then there also be some sort of custom r scripts that we'll introduce you to as well so you can see by the naming of these things that they're all named after articles of clothing which is kind of the cutesy theme that the authors of this suite of tools chose and they call it the tuxedo suite so what we're going to start with is really talking about these input files that's really going to be the meat of module one so before we jump into actually running commands I have a few slides that actually introduce some of the terminology of the tutorial itself so each tutorial has this corresponding series of slides that try to provide some definitions and you can kind of follow the along with these as we're doing the actual tutorial running actual commands in your terminal but I'm just going to go through them quickly so that it's you get kind of a heads up and you're familiar so learning objectives of this tutorial we're going to go through installation installation is painful and tedious but it's really a part of bioinformatics and it's pretty hard to get around doing it so we're going to try to be familiar with how to install these tools we're going to talk about how to obtain a reference genome how to obtain gene transfer canotations for that genome we'll look at the GTF file format and how that works we're going to index our reference genome for use with the aligners this is a common theme of basically all the next gen aligners that you build an index which is sort of like it's basically like an index in a library where you have a card catalog that tells you where all the books go in the library this index is basically a way of organizing the information in the reference genome to allow us to search for things in it more rapidly and then we're going to look at some of the raw sequence data that we're going to be doing our alignments with we'll look at the fast a and fast Q files these are some of the most common problems that are encountered while working on the tutorials there's sort of a challenge that we have that some of these commands are really really long and elaborate and from a learning perspective it would all it would be better if everyone was able to type these out really think about everything that's going on but for my sort of time constraint perspective it's just not possible to get through all the analysis if we have to type these really really long elaborate commands so what we usually recommend is that for some of the shorter commands you can try typing them carefully if you like but make sure you're always doing things in order and that you're not skipping steps otherwise you can you know quickly get to a spot where it's you know it's like you're taking directions and you take a wrong turn early on and by the time you get to the end of the directions you're in a completely wrong place to where you're meant to be so watch out for that we're going to be doing for elite especially for the long or medium length commands we're going to do copying and pasting so get familiar with how you copy from the tutorial file into your terminal which will differ slightly depending on whether you're on a Mac or a PC or Linux but make sure that you got the hang of that it's a lot faster if you learn the shortcuts for that for a copy and pasting if you don't know them already make sure you copy the entire command so sometimes the commands are so long that they wouldn't fit on one line they wrap over to the next line you got to get all of all of that command pasted in together another problem that's common is being in the wrong directory at the wrong time so we're going to be doing things like create files, rename files, move files, concatenate files and a lot of the commands assume that you're in a particular in the directory you're supposed to be in so if you start moving around and then you try to continue on the commands might not work because you're not in the place that you're supposed to be so watch out for that if you do change directories just make sure you go back to where you were before continuing on with the rest of the tutorial we have this RNA home environment variable set and we'll go over that with you one on one but we always want to make sure that this thing is defined I think it might have been added to the bash RC files so that you don't have to worry about it is that correct nobie you know if this RNA home environment variable is added to the bash RC already so there's a number of tutorial files and they're all available on the wiki we're going to start with this first one called tutorial module one Linux txt and it's called Linux because we're doing this in a Linux environment so we're going to be on the cloud in the buntu instance which is a type of Linux we're going to do a really simple introduction to Linux and basic commands then we're going to install the tools obtain the reference genome create our index obtain and explore some gene annotations and then obtain and explore some raw RNA seek data each of these files contains much more complete instructions than what sort of briefly summarized here lines in the tutorial file that begin with the the hash sign are comments so you can paste these into the terminal nothing happens they're just there to help explain what's going on all of the other lines are things that should be pasted and executed in the Linux terminal or in the case of some of these tutorials are going to be done in R then it'll happen in an R tutorial instead of a Linux tutorial each command we try to annotate them with comments to explain what's going on but we do assume a basic familiarity with Linux so we assume that you know how to make a directory and change directories we're going to walk you through that process but all of this will go a lot more smoothly for those of you that are already familiar with Linux or that did the pre workshop exercise these are the tools we're going to install we don't really need to read those it's just for reference so when we obtain our reference genome there's various places you can obtain them from we're going to get them from the alumni genomes project and this is just a project of attempt to kind of like simplify obtaining these files to get you going on your analysis faster they've sort of done some organizing for us we are going to use only a single chromosome of the human genome and this the reason for this is just to reduce the runtime for the tutorial so in reality you would not do that if you were studying human RNA-seq you would want to align to the whole genome but it's just too takes too much time because it's such a huge search space so even with us a small toy sort of amount of reads it would take too long so we use a single chromosome but we provide instructions for how you would do it if you were using the whole genome similar idea with the transcript annotations so we're getting them again from the alumni genomes project the link is there there are lots of other places to get gene annotations there are various competing and collaborating entities that are annotating the human genome and it's a similar situation for many other genomes or species we're going to download GTF files in this case describing human transcripts so these contain things like exon coordinates gene identifiers gene symbols like BRCA1 and EGFR and we're going to talk about the GTF file but it's described in gory detail at this link here if you want to go and read more about that on your own time before so I mentioned this already before sequences can be mapped to the genome that gene reference genome has to be indexed and this has to be done in a way that's compatible with the aligner being used so each aligner will come with a little tool that helps you index the genome for use with that aligner and this is sort of sort of another common bioinformatics mistake that people commonly run into is that they index the reference genome and then they use a different aligner or they update their aligner and the index is no longer compatible and that can either cause errors which may or may not be clear or it can cause the alignments to not come out correctly we're going to use top hat but we're also going to use an alternate aligner called star and each of these require their own index of the reference genome so we have some test data that we're going to use for the purposes of the tutorial it's kind of been pre-filtered to make things come out faster and nicer so basically what we did is identified reads that match the transcripts on a single chromosome so we've sort of stacked the deck so that we're not wasting our time aligning reads to a part of the genome that we're not going to look at we're going to focus on chromosome 22 or whatever it was we chose and we've already picked reads that we know aligned to chromosome 22 again this is just for the purposes of making the tutorial run more quickly and it still takes time we're still going to be waiting for analysis to happen the test data consists of two RNAs so we've got human colon tumor and matching normal each RNA was used to create four different libraries so there's a cdna that was prepared one way that's called cdna1 and cdna that was prepared another way called cdna2 and then replicate libraries were made from each of these cdna so we've got library 1 and library 2 so again this is just to kind of simulate a very simple sort of biological experiment we've got condition a condition b replicates on the replicates 2 you might have more conditions you might have more replicates but it's just sort of to have the sort of familiar theme the input data is provided in fast q format which is an example as shown here so in the fast q format every record consists of four lines and they always start have the same pattern which is they start with the at symbol followed by the read name which is this very long complicated sounding thing and there's actually a fair amount of information in this this will often contain some kind of identifier of the instrument that did the sequencer sequencing some kind of unique number and then these numbers describe the position of the read so in this case the read came from lane 3 sector 61 x y coordinates of these two numbers so it's actually telling you where physically on the flow cell that read was generated from and then this is sort of optional but it'll be a slash one to tell you that this is read one of a read pair and then slash two would be read two of the same read pair so those will have exactly the same name but we'll just be slash one slash two this is the actual read sequence and then the third line is a plus which is just kind of a delimiter and then this fourth line are the quality values associated with each sequence in the read and they're represented in a kind of code where each letter corresponds to a number and that number tells you how confident you should be that that base was called correctly and you can convert these letters back to the numbers and the number it can be converted to a probability that the read was correct at that position and we'll talk more about the details of that as well so as I said each library is marked as cdna one or cdna two and either library replica one or library replica two cdna one in case you're interested was total rna library and cdna two was a library that was poly a selected but the starting with the same rna and then the two replicate libraries are kind of slightly different strategies one was sort of standard rna sequencing and the other one was where we did a cdna capture of the rna seek library using a an exome reagent to try to reach for reads that align to real genes i think i've covered all the rest of this so we have four samples in total so you're going to see a lot of places in the tutorial where we do a command on the first piece of data and then we do it three more times you have these series of four commands where we're going to do something to each of the four pieces of data that we're working with we're going to do a quick pre-alignment qc with this tool called fast qc and this is a tool that basically you load a fast a or fast q file and it produces a report describing features of that rna data and it can be used to identify problems with the data give you an idea of how good the quality of your rna seek library is before you get too deep into the analysis