We're going to do the first intro lecture. You'll notice as we go that the lectures get shorter and shorter and you spend more and more of your time doing the hands-on work. This is probably the lecture that tends to take the most time of all of them, usually because questions or discussion come up along the way, which is great. So if you have questions or comments, please feel free to shout them out as we go; if not, we'll get through the lecture faster. It's really up to you. This is just a general intro, so there isn't a lot of hardcore bioinformatics in this lecture; it's a basic introduction to RNA sequencing. We're going to work our way through each of these modules, and each one has a corresponding hands-on tutorial. The high-level goal of these tutorials is to provide one reasonable working example of an RNA-seq analysis pipeline. Some of you have already been working with RNA-seq tools, or perhaps have a draft version of a pipeline you've been working with; this one will undoubtedly be different in some ways, so you can compare against it. If you haven't started yet at all, then this pipeline could hopefully be the basis for the analysis you actually do on your own data, and you could modify it to your purposes using this as a starting point. For the purposes of the actual tutorial, everything needs to run in a reasonable amount of time with fairly modest compute resources, so that things can happen in an educational setting like this; for that reason we have down-sampled data sets that are smaller than the actual size of data you would be dealing with for human. But all of the commands are basically the same; they just don't take as long to run. 
And whenever there are discrepancies from what you would do with a real, large data set, we try to draw attention to those differences. Finally, we want this to be really self-contained, self-explanatory, and portable. All of this is going to go past you pretty quickly and you'll do your best to follow along, but inevitably some of the details will start to fade away pretty quickly after the course. So the goal is that you can refer back to the materials, and hopefully everything you need to understand what was going on in the classroom will be fairly obvious. Another goal is that nothing is hidden from you. This is a common failing of some workshops: some complex environment is set up, everything works, and then you go back to your own lab and try to do it and realize you don't know how to create that complex environment; the tools were all installed for you, and you don't know how to install them or what versions were used, and so forth. We try our best to avoid that by explaining all of the tools you use and showing you how to install them, so in theory you should be able to recreate the exact workflow that's here. Everything is published on an open wiki, so you should be able to run through the entire workshop without actually being here, and if you have problems with that you can submit a GitHub issue to the repo for the course wiki. We do get quite a few people who just come across it on the internet, work their way through the workshop on their own, and are able to get it to work, so we think we're succeeding in this goal; but please do let us know if you encounter any problems. The learning objectives for this first lecture, or first module, are really just a basic introduction to the theory and practice of RNA sequencing. 
We're going to go over the rationale for RNA sequencing, which I'm sure for many of you is pretty obvious. We're going to talk about some challenges specific to RNA sequencing that you wouldn't encounter in, say, DNA sequencing approaches; some general goals and themes of RNA-seq analysis workflows (there are a lot of different RNA-seq tools, and workflows that combine different combinations of those tools, but they have some fundamental themes that are useful to think about and learn); some of the common technical questions that relate to RNA-seq analysis; and some tips on getting help outside of the course (Anne's already done a great job of pointing you at resources to work through on your own if you get stuck on something). Then we'll do a brief introduction to the hands-on tutorial itself, just to prime you for what's going to happen at the command line. Now, this you're probably all quite familiar with, but just to make sure we're all on the same page: this is a simple cartoon diagram I created some years ago providing an overview of the central dogma. The basic idea is that you start with a double-stranded genomic DNA template, depicted at the top here, showing a very simple gene example where we have three exons and two introns. 
For human this is of course not to scale; in human the exons are much smaller relative to very large introns, so this almost looks more like a yeast gene. We have a promoter region here, a transcript initiation site, the 5' UTR, then the first exon, the first intron, exon 2, and so on, and finally a termination codon and polyadenylation signal. This thing gets transcribed from the DNA into a single-stranded pre-mRNA molecule where the introns are still in place, and now we have a different set of regulatory features that govern how splicing will remove those introns and stitch the exons together: donor sites, acceptor sites, exonic splicing enhancers and silencers, and intronic splicing enhancers and silencers. All these things control how the splicing machinery comes in, identifies the introns, and connects the exons together. This gives us a mature mRNA molecule, which is capped and polyadenylated and exported from the nucleus into the cytoplasm, where it gets translated into protein; the protein is folded, various post-translational modifications take place, and we wind up with this 3D structure, at least for protein-coding genes, which is the functional unit of the gene in most cases. These protein sequences are the things we're often ultimately concerned about. If we could sequence and quantify them in a massively parallel fashion with high accuracy, a lot of people would probably just do that and directly interrogate the protein, since that's often what we care about, but that's generally not possible with current technology. Proteomics has come a long way, but anything that's even medium throughput is still quite expensive. So RNA-seq is often a proxy for that, where we're hoping that by identifying differences in RNA expression patterns, those differences reveal something about what's happening at the protein level. And then of course there's a lot of stuff 
that happens specifically at the RNA level; there are many transcripts that are not actually protein-coding but are nevertheless very important and functional. The main reason I show this is just to remind ourselves what it is that we're primarily interrogating in an RNA-seq experiment. The goal of RNA-seq is really to characterize these mature mRNA molecules, but there are quite a few nuances and gotchas related to that. One of them is that in many species, including most of the species you all mentioned, the transcripts tend to be fairly long: a thousand to ten thousand bases might be a typical range for a transcript sequence when you put all the exons together. And RNA-seq is not sequencing those things full length and intact. What we're sequencing is usually cDNA, not RNA, first of all, and usually it's relatively small fragments: these things are broken into pieces in the range of 200 to 300 bases long, and those are what we're actually able to sequence. We're often not even sequencing those entire fragments; we're just sequencing the ends of them, and sometimes the two reads might meet in the middle, or you may only have single-end reads, in which case you're really just sequencing a little piece of the fragment. So a lot of the analysis is inferring from those little pieces of information what the transcripts actually looked like and how abundant they were. There's a lot of inference built into this system, so we should always keep that in mind and maintain a healthy level of skepticism about predictions of the full-length structure of RNAs that come out of this, and also about the degree to which this relatively biased way of looking at the transcript may result in bias in the output we get at the end of an RNA-seq experiment. Okay, any questions on that? Yes. This cDNA is just a reverse 
transcribed copy of the RNA, so it has the full RNA sequence including the UTRs. Yes, people talk about sequencing mature mRNA or polyadenylated RNA, where you capture or otherwise enrich for polyadenylated transcripts, but really it's the whole transcript; it's usually not focused on just the coding portion, although there are some techniques where people try to focus their data even more onto those regions by, say, capturing the exons, and we'll talk a little bit about that in a few slides. Anything else? Okay. So here's, again, a cartoon overview of what an RNA sequencing experiment actually looks like, which is going to mimic to some degree what we do in the hands-on exercise. We're going to start with some samples of interest. Say the simplest scenario is that we have condition one and condition two, and it's a tumor-normal comparison in this example. We're going to isolate RNA from both of those samples (we're showing here long RNA sequences that are polyadenylated), and then we're going to generate cDNA off of those RNAs and fragment it. The fragmentation can occur at different stages, but usually we do some kind of fragmentation; then there'll be a size selection, then we'll add linkers, and then these things will get sequenced. So these small fragments, made up of pieces of the RNA, are going to get put onto an Illumina flow cell, usually. Is there anyone working with a platform other than Illumina, data that came from, I don't know, Ion Torrent or something like that? Okay, yeah. So Oxford Nanopore is a company that produces nanopore sequencing instruments, and some people are starting to play around with sequencing RNA on them, and that raises a really interesting point that relates to what I was just talking about. The appeal of PacBio or Oxford Nanopore sequencing is that you could sequence longer, 
potentially full-length RNA sequences, and if you could do that in a high-throughput enough fashion, then a bunch of that inference, putting the jigsaw puzzle back together from these little pieces, would go away. You could just feed each RNA through the pore and sequence it from one end to the other, and you would see the complete structure of all the exons without having to stitch them together. Then you could just count how many times you saw this full-length molecule and how many times you saw that one, and you would get both your abundance and your structure information in one shot. It would really make the analysis much simpler, more accurate, and more powerful in terms of understanding alternative isoforms and so forth. But the technology is still at a fairly early stage of development. People are doing it, but it's not as high throughput as Illumina, the error rate is still high, and there are still biases. To produce a comparable amount of information for an RNA sample on the Oxford Nanopore, I think, still costs quite a lot because of the amount of time you have to run one of these sequencing instruments. But they are working on it; they have new instruments that increase the throughput of the system, so they're gradually reducing the cost, but it seems to be a very gradual improvement. I mean, I first read about nanopore sequencing in the early 90s, in a Scientific American article in maybe '92, and they were talking about how it was just around the corner that we were going to be sequencing full-length chromosomes and feeding all kinds of things through these pores, and here we are in 2018 and they're like, well, we're still working out the technical details; you can do it, but there are a lot of challenges to getting great data in a cost-effective way. Yes, that's another appeal, right. So think about these steps here. This is a kind of high-level 
simplification, where we have to generate cDNA and then fragment it, size-select it, and add linkers, and each of those steps has real bias built into it. How do you generate the cDNA? You have an enzyme like a reverse transcriptase and something that primes it, random hexamers or an oligo-dT primer, and that introduces bias, because for various reasons there isn't a good place for the oligo to bind, or those random hexamers aren't really random, or they don't sit down in a perfectly random way. The reverse transcriptase has differential processivity for different types of sequences, so it introduces bias into the output by doing some things better than others. The fragmentation affects different molecules differently, because they have different three-dimensional structures or different strengths and weaknesses. The size selection introduces a bias towards things that fit into the selected size: you may lose really small things or have different biases for really long transcripts. And any time you're adding linkers, it's an enzymatic process, so there's bias in the efficiency of where the linkers get added. So the appeal of the nanopore instrument is that if you could just take raw, unadulterated RNA, feed it through the instrument, and get the readout, it would be more pure and untouched. Yes, exactly: more like what the transcriptome really looks like. So you should actively watch those technologies and think about when you should try them for your particular experimental system. And yeah, you're right that everything has its pros and cons, so I'm sure nanopore sequencing will be no exception; it'll have its own caveats and biases associated with it, for sure. I haven't worked with enough of the data yet to really have a great sense of what those are, but 
you're absolutely right: we're gradually moving closer towards this more idealized scenario, but achieving the true ideal is elusive. Some things don't fit through the pore as easily, some things form secondary structures that are difficult to disrupt, some things are fragile and break when they're going through the pore; there's lots of potential for bias. And that's a thought that will probably come up multiple times over the next three days. I'm trying to express a certain skepticism about each of these analytical approaches and each of the data generation approaches, and one common strategy, when you're being skeptical, is to try orthogonal approaches and see where the agreement and disagreement is. The answers that seem to be robust across orthogonal approaches are one tool you can use to zero in on things that are platform-independent and therefore perhaps less likely to be a bias or artifact of the platform. Yeah, that's something I picked up in the readings, where they would say if you do it this way you might get this bias, and if you do it that way you'd get another, especially when it comes to identifying novel transcripts. It kept pinging in my head that cost is always an issue, and time as well, but I'm expecting that as we go there will be instances where it's feasible to actually do both. When you're designing your experiment you probably have to make a call about one thing or another, but in terms of the analysis, I imagine there are points where you can go in one direction, then try the other, and then compare. It would just be interesting to get your point of view, as we go through, on where you think the really good places are to do that 
kind of redundant analysis. Yeah, so we do this a lot; in all of our projects we're always trying different things and comparing different approaches, and there are a bunch of advantages to it. One is pragmatic: it's a sanity check. If you get really bad agreement, sometimes that reveals errors or mistakes you actually made in one of the analyses, and that helps you track things down. One of the exercises we'll do in the next couple of days is the primary expression analysis, the abundance estimation from raw RNA-seq data, done three somewhat orthogonal ways. We're going to do the raw-counts-based approach: there's a whole camp of people who are really into the statistics of quantifying RNA-seq and differential expression, and they really like the idea of raw counts as the input. Then there's the camp of tools that try to do some kind of baked-in normalization for transcript size and library depth, the FPKM or TPM camp. And then there are the alignment-based approaches versus the alignment-free approaches like Kallisto, Salmon, and Sailfish. So we're going to pick three example paths through those different expression abundance approaches and then compare the three to each other. Okay. We're not actually going to do the RNA sequencing; we're going to start with raw data. But it's important to keep in mind the upstream details of how your data was generated, because sometimes that can influence the interpretation of your results. So why would we sequence RNA? Probably none of you need to be talked into the value of sequencing RNA, but just to review some of the main arguments for it: functional studies, of course. This is the functional biology. The genome is constant, 
but an experimental condition may have a pronounced effect on gene expression, and we can see that effect only by sequencing the transcriptome; it won't be apparent in the genomes of the two conditions. Second, predicting transcript sequences from the genome sequence is difficult. There's a whole field of bioinformatics that spent a decade or more trying to predict what genes and transcripts would look like based on the genome sequence. It used to be that a lot of groups would sequence the genome of a model organism, or some species, and just look at the genome sequence and try to identify features that looked like exons and features that looked like introns, and think about how they would get stitched together into a transcript. This is really difficult to do, and a huge number of really smart people came up with some fairly decent ways of doing it, but now we can just sequence the RNA and align those RNA sequences back against the genome. It's an extremely powerful approach to resolving the structures of transcripts, with the added bonus that you get to see how abundant they are; gene annotation has really been revolutionized by the advent of RNA-seq. Also, some molecular features can only be observed at the RNA level: things like alternative isoforms, which are really hard to infer from the genome; fusion transcripts, where even when there's a rearrangement in the genome it's sometimes very difficult to know for sure what the RNA fusion will look like; and RNA editing events, which of course are only apparent in the RNA itself. There's some cancer-specific stuff too: interpreting mutations that don't have an obvious effect on the protein sequence, regulatory mutation analysis, and, something we use RNA-seq for quite a lot, prioritizing protein-coding mutations according to the expression of the alleles that contain the mutations. This is a common application in cancer. And then Obi 
mentioned that we're doing quite a lot of work with personalized cancer vaccine design; RNA-seq is very important to that as well, because when we're thinking about designing a vaccine for a patient, we really only care about epitopes that are actually expressed and could be made into a protein. Now, some of the challenges that are particular to RNA-seq; I'm sure you're aware of many of these issues. Purity: just like any sample, and this would be relevant for DNA or epigenetic studies as well, the purity of the sample matters. If you have a mixture of cells and you're interested in a subset of those cells, as in tumor biology, where the classic example is stroma mixed in with your tumor cells and diluting their signal, then with too few tumor cells it can be hard to interpret the expression patterns, because you're basically not sequencing the thing you care about, entirely or mostly. Quantity, of course, is a problem: if you can't get enough RNA, you'll have problems analyzing the transcriptome of those cells. Quality is a problem with RNA much more than with DNA: RNA is infamously fragile, it degrades easily, and that can be a real source of bias when you have differential RNA quality across your experiment. Also, RNA consists of small exons separated by large introns, as shown in the previous cartoons, and this creates a challenge for the mapping or alignment of reads back to the genome. If you're sequencing DNA, you get to, for the most part, just align your reads directly against the reference genome, and you don't have to worry about these big gaps caused by introns. In RNA-seq the alignment is considerably more challenging because of those gaps, and that places a greater emphasis on read length. We're talking about pretty short reads here: in the grand scheme of sequencing technologies, the NGS sequencing reads are really short; they're 
shorter even than the not-that-great EST sequencing we did for decades before NGS came along, and if the reads are too short it can be really hard to resolve those exon-exon junctions. It can be hard to tell where the intron starts and where the next exon begins when you're aligning reads back against the reference genome. So you want to think about that when you're designing your experiment and submitting your samples to the core. What kind of read lengths are you all dealing with? Who has reads that are, say, 50 bases or smaller? Does anyone have short reads like that? What about 75? Okay. What about 100, at least? What about 150? Okay, so there's quite a mix. This is one of the choices you have to make when you're designing an RNA-seq experiment: do I have single-end reads or paired-end reads, and how long should the reads be? One of the really important factors in determining the length of the reads is how much you care about accurately resolving the exon-intron structure of transcripts, and how much you want to be able to assemble full-length transcript sequences versus just getting basic abundance output, the transcriptional output from each gene locus. Depending on where you fall on that scale, you might shift towards shorter reads that are cheaper, allowing you to do more samples, or longer reads that cost more, with maybe fewer samples or fewer experimental conditions. Okay; so if I'm understanding your question correctly, you're interested in knowing what transcript species are there, but you do have prior knowledge about what the transcript structures look like: annotation has been done, possibly with some longer-read information used in the past, giving you a database of reference transcripts. So what's the read length you care about there? Definitely, if you have a good 
reference transcriptome, so not the reference genome but the reference transcriptome, that definitely helps a lot, and you can get away with shorter reads. I still think that 75 is probably about as low as you would want to go, but you will probably be able to do a pretty good job with 75. If you're more interested in de novo transcript discovery, then I would say definitely go longer; longer is better. 150 is a common sweet spot right now for Illumina sequencing. There are a lot of production centers that are just making huge amounts of paired-end 150-base data, so that's often a good price point, and going longer than that starts to get more expensive. Also, the base quality of reads still starts to tail off as you get out past 150 or 200 bases, so there are diminishing returns beyond that; but obviously, the longer the better if you can. Another challenge that's particular to RNA is the relative abundance problem. RNAs' relative abundances vary widely. There are genes that are transcribed at a very low level, just a few copies per cell, and that's the functionally important level of that gene; and there are other genes transcribed at a very high level, say tens of thousands of copies per cell, and that's how many copies are needed for that RNA and protein to be functional in that cell. You can imagine the difference between, say, a structural protein used to build the fundamental structure of the cell, where you need a lot of output, perhaps especially when cells are actively dividing, and a signaling molecule in a signaling cascade, where just a few copies can trigger a very nuanced signal. So you can have genes that are just really rare but still really biologically important relative to other 
genes. This creates a problem for sequencing, because RNA sequencing works by shotgun sequencing: we're not able to decide which transcripts we sequence; we're effectively pulling sequences out of a hat randomly, without much control over it. If you just randomly pull reads, you tend to get the things that are very abundant, and for the things that are rare you need a lot of data before you're able to sample them effectively. This is really the thing that fundamentally influences, or limits, your choices with regard to how much data you produce: how much do you care about having sensitive detection of transcripts present in the cell at relatively low copy number? If you need to be able to profile those things, then you're going to need whatever amount of data is required to get down to those low levels of expression. This is quite different from DNA sequencing, where you have a bunch of chromosomes, but they're all essentially present at whatever the normal ploidy status is for your system; in human they're all there in a diploid state, so you can expect approximately even random sampling across the whole reference genome, and you don't have to worry about differential abundance of chromosome 21 versus chromosome 3, for example. Similarly, RNAs come in a wide range of sizes. There are really small RNAs that are functional and important and very critical to whatever biology you're interested in, and there are other transcripts that could be 100 kb or even longer. That creates a problem in that we're trying to design one experimental library construction strategy that captures the information from all of those things, and that's really difficult to do. It's generally accepted that your view of the transcriptome in an RNA-seq experiment is somewhat biased towards a certain size of transcript. Usually the way people divide and conquer is that they go after everything that's, say, bigger than 
100 bases, which is considered classical RNA-seq, and then if you're interested in small RNAs you basically need a different experimental procedure: you design a separate library and sequencing approach for the really small RNAs. Some people divide it into small, intermediate, and large, and there are various schemes, but generally most RNA-seq data sets are quite biased towards transcripts above a certain size and are not good for small RNA species; you'll get more biased output for those. And just in terms of designing the length of your sequencing reads: if you're going after microRNAs, for example, it doesn't really make sense to do a 2 x 150 base sequencing strategy when the things you're interested in sequencing are very small compared to those read lengths. So you'll often see, depending on your interests, some variation in the way the libraries are constructed there. This, again, is particular to RNA; you don't have this problem with DNA. For all intents and purposes, size is kind of irrelevant in a genome sequencing experiment: all of the chromosomes are massive compared to the fragments you're sequencing, and they're all there in approximately equal copy number, so you just break all the pieces up and you get even representation of the whole genome, with a different set of caveats. And then, as I mentioned, RNA is quite fragile compared to DNA, so that's something you really have to watch out for. Who's familiar with the Agilent QC assays? Before you send your samples to wherever your sequencing is done, you'll often run one of these lab-on-a-chip assays on a really commonly used Agilent instrument, where effectively you're running your RNA on a gel, except instead of running it on a gel you're feeding it through a capillary 
electrophoresis system. As the RNA passes a detector in the capillary, you get a readout of abundance: the small stuff moves most quickly through the gel and the larger stuff takes more time, so the small stuff comes out fastest, and over time you get this readout, often called an electropherogram, that gives you spikes corresponding to the sizes of RNAs and how abundant they were. I'm showing two examples here. One is some RNA isolated from a cell line that was happily growing one minute, and the next minute the cells were broken open and the RNA isolated, so there was very little degradation. What you see in human is two big spikes for the main ribosomal RNA species and then a little hump down here, which you can barely see, where all the mRNA is; the sample is 98 or 99 percent ribosomal RNA, which is typical for human cells. And then if the RNA is really degraded, what we see is that all of these ribosomal RNAs are broken into pieces, so we start to see spikes corresponding to smaller and smaller RNA species; on a gel this would look like a smear, and I'll show a few examples of that. A lot of sequencing cores will have some kind of cutoff based on this score: the RIN score, or RNA integrity number, is a quantitative estimate of how much degradation is present, based on analyzing the intactness of these two peaks. You get a perfect score of 10 when it looks like this, and as things get more degraded the score goes down, all the way to zero in theory. A lot of cores will have a RIN cutoff of 8 or 7.5 as a typical threshold, where they say that if the RNA is more degraded than that, they'll probably still sequence it, but they'll waive whatever their warranty is; they'll basically say that if the data is no good, then that's on you. 
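To build some intuition for what a score like this is capturing: the actual RIN algorithm is Agilent's proprietary model and uses the whole trace, but the classic rough heuristic it grew out of is the ratio of the 28S to 18S ribosomal peak areas, which sits near 2 for intact mammalian total RNA and collapses as the RNA degrades. A toy sketch with invented trace values (the time windows and fluorescence numbers here are made up for illustration):

```python
# Toy illustration only; NOT the proprietary Agilent RIN algorithm.
# Intact total RNA shows a 28S/18S ribosomal peak-area ratio near 2.0;
# degraded RNA shows a much lower ratio as the peaks break down.

def peak_area(trace, start, end):
    """Sum the fluorescence signal between two migration-time points."""
    return sum(signal for t, signal in trace if start <= t <= end)

# Hypothetical (migration time, fluorescence) pairs for an intact sample;
# the smaller 18S peak elutes before the larger 28S peak.
trace = [(40, 5), (41, 80), (42, 120), (43, 60),    # 18S region
         (50, 10), (51, 150), (52, 260), (53, 100)]  # 28S region

ratio = peak_area(trace, 50, 53) / peak_area(trace, 40, 43)
print(round(ratio, 2))
```

With these invented numbers the ratio comes out near 2, consistent with intact RNA; a heavily degraded sample would show the 28S area shrinking and the ratio dropping well below 1.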
"But at the end of the day you're interested in mRNA, right?" Yeah, but the mRNA is presumably being degraded in a similar way, so you're using the ribosomal RNA as a kind of canary for the degradation that's happened in the sample. The link I have here is a PDF with a whole bunch of examples of these electropherograms from different types of RNA isolation, from FFPE to actively growing cells, so you can see the full spectrum from really beautiful intact RNA to really heavily degraded. Yes, question at the back. Yeah, I include a bunch of examples in that PDF I linked; I think these were colon cancer specimens from FFPE blocks that ranged in age from quite recent to many, many years old, sitting in a drawer somewhere for five years. You can see the level of degradation goes all the way down to RIN scores in the 2 range, and basically what it looks like is that these spikes keep moving over until eventually you just see a hump right about here, where everything has fragmented down to at most 150 to 200 bases or shorter. That's typical for an FFPE sample and is really the worst-case scenario: it's been sitting on a shelf, it wasn't a frozen block, it wasn't done recently. There are people who do FFPE and then cut scrolls off right away and do RNA-seq from that, and you'll see much better results, maybe a RIN of four, five, six, even seven in that scenario; the older and more fixed the sample is, the further it goes down, all the way to 1.5 to 2.
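To make that "everything winds up at 150 to 200 bases" intuition concrete, here's a small simulation (my own toy model, not from the course materials): if each position in an RNA breaks independently with some probability p, fragment lengths come out roughly geometric with mean about 1/p, so heavier damage translates directly into shorter fragments.

```python
import random

def simulate_fragment_lengths(total_bases, break_prob, seed=0):
    """Break a long RNA at each position independently with probability
    break_prob and return the resulting fragment lengths."""
    rng = random.Random(seed)
    lengths, current = [], 0
    for _ in range(total_bases):
        current += 1
        if rng.random() < break_prob:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths

for p in (0.0005, 0.002, 0.01):  # light -> heavy degradation
    frags = simulate_fragment_lengths(200_000, p)
    mean_len = sum(frags) / len(frags)
    print(f"break prob {p}: mean fragment ~{mean_len:.0f} bases")
```

Real degradation is not this uniform (as discussed below, it is enzyme- and structure-dependent), but the overall shift of the size distribution toward a small-fragment hump is the same effect you see in the electropherograms.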
We have done sequencing all the way down to completely degraded, and what we typically do in those scenarios is skip the fragmentation step: basically, degradation has fragmented it for you, so fragmentation is no longer required, and we'll usually do a cDNA capture of those samples. We don't usually do it for differential expression analysis; I think that is quite challenging, and you might need a lot of replicates and low biological variability to still get a useful experiment out of RNA that's that heavily degraded. But for other applications it can still be useful: you can still detect whether mutations are present and expressed, so we'll do this in the vaccine design scenario, for example, if we have a patient where all we have is a block. We'll still do the RNA sequencing with a cDNA capture, and that is pretty effective at recovering the quality of heavily degraded FFPE samples.

Yes, that's a good question, and ideally, yes: you don't want a wide range of levels of degradation across the samples that you intend to compare to each other. In particular, you don't want some kind of bias between condition A and condition B that you hope to compare, so it's something you would definitely want to think about all the way through to the end of your experiment and interpretation. At the end of the day you wind up with, say, a heat map of condition A versus condition B; plot the RIN values onto that and watch out for batch effects that may be related to the level of degradation and not to the biology you're interested in. If you see or suspect that this is a problem, you're probably going to want to spend more time thinking about careful normalization of the samples before you do your comparison, and always be asking: is what I'm seeing here potentially caused by differential RNA quality between my two comparison groups, rather than by the biology?

Why does degradation affect your sequencing result? I guess it influences some of the biases that are already in the system. You're applying the same size selection strategy and the same fragmentation approach to both samples, but you're starting from different starting points: one is already effectively partially degraded, so you're increasing the chance that in one sample things get degraded to such a small size that they're lost completely, and then the abundance of those things will appear to be lower, not because they were actually less abundant, but because they were pre-fragmented before you applied the same fragmentation protocol to both conditions. That's one simple way to imagine it, but there are probably others as well. Also, I think people tend to treat degradation as a somewhat random process, but only out of convenience; I imagine it's quite non-random. A lot of degradation is enzymatically driven: you have RNases actively degrading RNA, and they operate within certain sequence contexts, so they degrade certain spots more than others. Even the more mechanical degradation processes are non-random, because they're influenced by GC content, by the secondary structures the RNA forms, and so forth. So it's definitely not random, but it's something people tend to gloss over, and that's another reason to worry about differences in degradation level between your comparison groups.

Okay, so related to this discussion of experimental design considerations: there have been a few standards, guidelines, and best practices for RNA-seq published over the years, and I'm linking to a few of them here. They're a little bit out of date, but RNA-seq fundamentally hasn't changed that much in the last five or six years, and these are pretty fundamental experimental design considerations, so they're still very relevant. These documents talk about things like how many replicates you should include, how you should do your size selection, how much input material you should use and how consistent it should be, whether you should use spike-in control sequences in your library construction procedure, and so forth. They're quite idealized; if you already have an experiment and you go look at these guidelines, you probably don't have an experiment that followed all of them, but they're definitely things to aspire to.

We've talked a little bit about this; here are some other design considerations you should keep in mind. There are a bunch of forks in the road in terms of how your RNA-seq data was created. A lot of people doing analysis of RNA-seq data start with someone basically handing it to them: the experiment has already happened, the design considerations have already been taken into account and are out of your control (you weren't there from square one), or you're being asked to combine data sets from, say, three papers, or from a collaborating lab with your own lab's data. All of these things you should take into account to get a sense of how comparable the data sets are to each other and how much you need to worry about batch effects, possibly correcting for them or doing some kind of additional normalization. Common things that differ: some people do total RNA versus poly-A selection. Poly-A selection is where you actively try to enrich for polyadenylated species, to enrich for mRNAs; if you do a total-RNA-based method, then you're probably taking the other approach, where you try to remove the ribosomal RNA species, because otherwise you would just sequence
those ribosomal RNAs over and over again without ever getting to the rest of the genes in the genome. Size selection: there may be different size ranges that were selected; some labs like to go after a narrow fragment size distribution, others go for a fairly broad one. Linear amplification used to be a common strategy for small amounts of RNA, where you would basically try to increase the amount of material by amplifying it with a linear approach. Stranded versus unstranded libraries: we'll talk a little bit about that. Most libraries now are being created in a stranded way, where you know the strand of transcription that was used, but there's a lot of RNA-seq data out there where you're basically sequencing double-stranded cDNA; you can infer fairly accurately which strand was likely expressed or transcribed, but you're sequencing both, so you don't actually know for sure. Exome-captured versus uncaptured: I mentioned this idea that you can take your RNA library and hybridize it to some set of probes that correspond, for example, to the known exons of your species, to enrich for RNAs that actually correspond to known transcripts and genes, as a way to clean up data from problematic samples or to make your sequencing more efficient. Library normalization: this is where there's some step upstream where you try to deal with the problem that some genes are really highly expressed and some are really lowly expressed; there are various approaches to compress that range a little so that you can sequence more of the lowly expressed species and spend less time sequencing the really highly expressed ones. All of these details could in theory affect your analysis strategy, and especially the comparisons between libraries.

Just to make a few of these things more visual, what I'm showing here is an example of a few hypothetical RNA samples that, in this case, instead of the Agilent electropherogram analysis, have been run on a hypothetical gel (this is totally contrived), where we have really intact total RNA here, partially degraded RNA that's starting to become a bit of a smear, heavily degraded RNA, and then, after we've isolated the mRNA, this is what your gel will look like once we've gotten rid of a lot of the ribosomal RNA. Usually what happens in an RNA-seq experiment is that you start with some tissue or cells, you isolate RNA, and you do this kind of RNA quality assessment. Then often there'll be a DNase treatment, where you try to get rid of any DNA molecules that are in your RNA; then you do an RNA fragmentation (or sometimes that happens after cDNA synthesis); then a cDNA synthesis, usually with either an oligo-dT or a random hexamer approach; and then fragmentation and size selection. At each of these steps you could run the Agilent assay to get a sense of, in this case, the quality of the raw RNA prior to any manipulation, and at any of the subsequent steps, the status of your library. Ultimately you wind up with selected RNA molecules of a certain size (most of the small RNAs would be lost in this particular example), and then you add your sequencing adapters, which are the things you're going to prime the actual sequence generation off of when you run them on the sequencing instrument. We just had a quick question: if you're doing total RNA, are you still capturing some of the poly-A RNA? Yes, you're not excluding them. With poly-A selection it's just the opposite: you're only getting the poly-A species. And that is covered on the next slide. Are there any other questions about this before I move to that? Okay.
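As a toy illustration of that "most of the small RNAs would be lost" point (my own sketch, not part of the course materials; the pool composition and window bounds are made up): if you keep only fragments inside a size-selection window, any species shorter than the window's lower bound disappears from the library entirely.

```python
import random

def size_select(fragment_lengths, low=150, high=400):
    """Keep only fragments inside the size-selection window."""
    return [l for l in fragment_lengths if low <= l <= high]

rng = random.Random(1)
# Hypothetical pool: microRNAs (~22 nt), a small RNA class (~100 nt),
# and fragments from long mRNAs (broad distribution, 100-500 nt).
pool = (
    [22] * 1000
    + [rng.randint(80, 120) for _ in range(1000)]
    + [rng.randint(100, 500) for _ in range(1000)]
)
kept = size_select(pool)
print(f"kept {len(kept)} of {len(pool)} fragments")
print("any microRNAs left?", any(l <= 30 for l in kept))  # -> False
```

This is the same reason a 2 x 150 read strategy makes no sense for microRNAs: the standard library just never carries them through to the sequencer, which is why small-RNA work uses a separate library protocol.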
Yeah, so there are these different ways of enriching, and all of them are still in use; it depends a little bit on what your goals are. If you think about the basic RNA with no manipulation, you have total RNA, and if you think about how the reads will align against the reference genome, most of them are going to align to ribosomal RNA regions, because that's what most of the RNA in the cell is made up of. It's just really impractical to sequence RNA in this totally unadulterated fashion, so you basically have to do something to get rid of some amount of the ribosomal RNA, hopefully most of it. There are three general approaches. The first is rRNA reduction: you have in a tube a bunch of oligonucleotides that correspond in sequence to the ribosomal RNA transcripts for your species (you'll have a reagent like this designed for each species). Those probes are attached to beads; you mix your RNA or cDNA with the beads, they grab onto all the ribosomal RNA sequences, and then you elute everything that doesn't stick, which is heavily enriched for everything that isn't ribosomal RNA, including polyadenylated species. One of the selling points of this approach is that it gives you a more unbiased view of the transcriptome: you're grabbing onto the ribosomal RNAs but letting everything else come through, both polyadenylated and non-polyadenylated species, so you have the potential for coding and non-coding genes, and more potential to discover lincRNAs and other non-polyadenylated RNAs that may still be functionally important. Another common approach, at the bottom here, is poly-A selection, which is sort of the reverse: a similar situation with oligos attached to beads, but in this case the oligos are just oligo-dT, so you grab onto the poly-A tails of all your species and wash everything else away, including the ribosomal RNAs, but also all the other non-polyadenylated RNAs that could be important. One of the gotchas of this approach is that, since you're holding on to the 3' tail of your coding transcripts, any degradation that has happened will bias you towards the 3' end: you're holding the 3' end, and if the RNA is broken, the 5' end gets washed away and won't make it into your final sample. So a common QC step when you're dealing with poly-A RNA-seq data is to assess to what degree you're losing the 5' ends of transcripts; the longer the transcript, the more chance the 5' end has broken off and been lost, and you'll tend to see that play out pretty clearly in RNA that has anything more than a little bit of degradation. The final approach is to directly hybridize, again with oligos on beads, to some notion of the known transcriptome: you know what the exons are, so you design oligos for, say, all the human exons across the entire genome, you hybridize your cDNA library against that, and anything that doesn't correspond to a known exon gets washed away. You don't target the ribosomal RNA, so it gets washed away too. The caveat is that this relies on prior knowledge of the transcriptome: you need to know what those exon sequences are, and you're biasing yourself towards them. You're also introducing a bit of bias into the expression abundance estimation, in that the capture has a little bit of a normalization effect: it compresses the range of expression values that come out, so
lowly expressed things tend to get pulled up a bit and highly expressed things tend to get pushed down a bit, so that can be an advantage or a disadvantage depending on how you think about it. Yeah, many thousands, hundreds of thousands. There are a number of companies that have come up with very efficient ways of synthesizing oligonucleotides, and they basically do this once every few years: they come up with a design and then make it available off the shelf as a product you can buy, and they do it so efficiently that it's only a few hundred dollars to buy that reagent per sample. And yes, there are two big caveats. One is that if you're studying some random species that isn't human, mouse, or a few others, then they probably don't have one for you, so you might have to design it yourself, and some people do that, but being the first person to do it is quite expensive. If your species has enough market interest, sometimes you can partner with those companies: in exchange for your expertise and knowledge about what the design should be, you can come to an arrangement where they put it in their catalog and give you some free reagents for doing the design for them. We've done that a few times in different scenarios, and we sometimes do it for custom regions of the human genome that aren't in exon regions, where an exome reagent wouldn't make sense. It's expensive, but it's feasible, and there are a bunch of companies that specialize in this, each with their pros and cons; companies like IDT, Agilent, and NimbleGen are commonly used, and they've come up with different ways of efficiently synthesizing these massive numbers of unique probe sequences to whatever specification you use, but
you provide it. Yes, the question is, when would you do cDNA capture instead of poly-A selection? One way to think about it is that with cDNA capture you can decide to enrich for known transcripts whether they're polyadenylated or not, so if you have a mixture of genes you're interested in that aren't all polyadenylated, that would be one scenario. It also really does enrich your data for the regions you know about and care about, even more than poly-A selection, with the caveat that this bias is based on prior knowledge. And particularly for the heavily degraded samples I mentioned, poly-A selection doesn't work well at all, because everything is broken, so you'll just introduce a huge 3'-end bias; the advantage of cDNA capture is that you're directly capturing each of the regions, so it doesn't matter if the RNA was broken and you can no longer grab it by its 3' tail, since you're grabbing each piece independently. So those are some of the common ways people think about it. Any other questions?

All right, the last cartoon here is this idea of stranded versus unstranded. Just to make it a bit more visual: if you have unstranded data, when you align it against a reference genome it'll look something like this. You'll have reads that align and line up with exons, and in the genome viewers, encoded in the alignment file, there's information about which strand the sequence you produced aligned against, so you can put the reads into two bins, or color them in the visualizer according to their strand. For example, in IGV you can color them red and blue for the positive and negative strands, and in classic unstranded RNA-seq you'll see an even mixture: each read came from either the positive strand or the negative strand, and it doesn't matter which strand the gene was actually transcribed from. In this contrived example I'm showing two genes, one being transcribed from the top strand and one from the bottom strand, and you don't really see that in the unstranded RNA-seq data; but from where the reads align, and especially from the ones that span across splice sites, you can still infer with pretty good accuracy which strand the fragment was actually transcribed from, even though you don't know it directly. Now there are a bunch of methods that are able to encode this information right into the data, so you know which transcription strand was used, and when you do the same visualization in IGV you see that the reads all line up the way you expect. Here's an example of some real data, I believe some of the data we're going to look at, where we have two genes arranged in, I guess, tail-to-tail fashion, both ending in the middle here, and you can see that for the most part the coloring of the reads lines up with the positive and negative strands. But if you look closely you can see the odd exception, where the strand information either was somehow wrong or there actually was a very low level of transcription from the opposite strand. Of course this information is really useful for people studying regions of the genome where genes legitimately do overlap on opposite strands, or where you're looking at sense-antisense regulation; you really need this information to get a sense of which strand your reads came from. And you could imagine that there are some genes where you have an RNA being transcribed from within an intron, or even spanning across an exon of a gene on one strand, while there's another transcript with several exons on the other strand; if you're not able to separate those two, then one can influence the expression abundance estimate of the other, which can create weird patterns or make it hard to interpret what's going on. Yes, it could result in inflated abundance estimates for genes where antisense transcription overlaps the exons of the gene whose expression you're trying to estimate. Right, and if you don't have that situation, you should get a fairly consistent readout.

What's the quick-and-dirty explanation of how they preserve the strand? Oh yeah, this is some really in-the-weeds molecular biology. There are some pretty nice diagrams of it; the companies that do this have different approaches, and they're a little bit cagey about explaining exactly how they do it, because that's their secret sauce, but most of them involve dUTP incorporation and then some kind of enzymatic degradation that selectively degrades one strand. I'm doing a horrible job of explaining this, but there are some great figures; it's been a while since I looked it up. I've had this question many times, and I go look it up, think "okay, that makes sense," and then an hour later I forget the details. Is it possible to share things to our Google Classroom? Yeah, we can share it there; did you explain how the classroom code works? Fantastic.

Okay, now the last section here is just to go through some common questions. Replicates are something that comes up a lot. During the early days of Illumina sequencing people would do technical replicates; no one does that anymore. To cut to the chase: if you have data from two lanes on the same instrument, or from one flow cell and then the next day they ran additional data for that sample on another
flow cell, or some of your conditions were spread across two flow cells, the platform has become so consistent at the technical level that, as long as it passes the QC specs of your core, it's probably fine to compare across those sorts of differences in data generation without really worrying about batch effects from one flow cell to the next, for the most part. What's being shown here is an example of sequential flow cells with the same sample and the correlation in the expression estimates that come out, and it's incredibly good. Biological replicates, of course, are a completely different story: you still need biological and experimental replicates to the degree that the biology you're studying requires. RNA-seq is not magic; RNA expression has a lot of variability, so you may need a substantial number of biological replicates to identify patterns, and there's really not much we can do about that.

Common analysis goals: what can we ask of the data? We're going to go through some of these over the next two days, and then Brian's going to cover additional ones on the third day. We're going to focus most on gene expression and differential expression, because that's probably the most common use case, and a lot of the analysis principles will apply to other, more specific types of analysis, like alternative expression analysis, which Brian's going to cover a fair amount; we have two versions of alternative expression analysis built into the workshop. Transcript discovery and annotation I mentioned already. Allele-specific expression: if you want to look for, say, regulatory variants, you might be able to see a signature of those using ASE. Mutation discovery: some people directly call mutations from RNA-seq data. Fusion detection is a common cancer-specific application. RNA editing. And there are more as well. All of these analysis goals have very similar themes to their RNA-seq workflows: each of them has a pattern that goes start with your raw data, align or assemble your reads, then process the alignments with a tool specific to each goal. You'll tend to see that there's a fusion caller that takes alignments as input, an expression estimator that takes alignments as input, a mutation caller that takes alignments as input, so you have this fork in the road where you start with a BAM file and then do analysis X, Y, or Z with different tools. Then there's usually some kind of post-processing, because almost every tool out there produces some kind of weird, hard-to-understand output file that's not very standardized. For the upstream steps, your raw data will almost always come in FASTQ format and the alignments will generally be stored in BAM format; once you get past that, it's a complete wild west. There's very little file format standardization, and every tool does something different, crazy, and mostly undocumented, so there's almost always a post-processing step where you create some custom analysis to visualize, interpret, clean, filter, etc. your data, and we're going to spend a fair amount of time on those kinds of concepts. Then, of course, summarizing and visualizing: creating the figures you use to communicate information to others in publications and presentations. You're pretty much always doing some version of those steps. I mentioned Biostars: if you haven't signed up for an account, I'd encourage you to do so and just poke around. There's an RNA-seq tag that pre-filters questions and answers related to RNA-seq, and there's an incredible wealth of interesting discussion in there about RNA sequencing. Some common questions: should I remove duplicates for RNA-seq? Sometimes people ask this if they're familiar with DNA sequencing.
The short answer is no, you shouldn't remove duplicates for RNA-seq; you can mark them, and downstream tools will generally just ignore the marking. Is everyone familiar with this concept of marking duplicates? The basic idea is that in DNA sequencing, if you have two reads that look like they correspond to exactly the same fragment, starting and ending at the same place, we generally assume that's likely a PCR amplification artifact: it wasn't two unique observations of the same fragment, but two sequenced copies of one unique fragment. So we generally mark all of those situations and pick one as the representative, so that we're not double-counting something that wasn't a unique DNA fragment but rather a copy of the same one. You could imagine applying the same reasoning to RNA, but it's problematic because of some features of RNA we've already discussed: RNAs can be quite short, and their abundance varies very widely. You could have a relatively short transcript that's really highly expressed, and then you actually expect to see exactly the same fragments multiple times, just because there aren't that many places to sample on that transcript; there just aren't that many ways to make fragments from that short a sequence. If you marked duplicates and kept just one of each, it would really mess up your abundance estimates: you'd basically be underestimating the abundance of short, highly expressed transcripts. So it's generally best practice not to mark duplicates in RNA-seq data, except in certain use cases; for example, some people still do it for mutation calling.

Yeah, I would be fairly worried. Sorry, so one of the main interpretations of a really high duplication rate is that your library has low complexity, and usually that's caused by low input: you didn't have enough material, so there just weren't enough unique molecules in there. The way these library construction workflows work, the less material you have, the more amplification you wind up with. A certain amount of material is needed to load onto the flow cell, so they work it out so that they can always put the same number of molecules onto the flow cell, but if there was a small amount of input, the relative proportion of molecules that are just PCR duplicates of each other goes up and up until it's quite bad. That's one reason you'd see a high duplication rate. Another potential reason is low complexity in a different sense: for example, if your ribosomal RNA reduction step didn't work well, you could still have a really high proportion of ribosomal RNA in your sample, and then the reason you're seeing a lot of duplicates is that you're basically sequencing the same few highly abundant transcripts over and over again. Both of those are pretty problematic for downstream analysis: you're not going to get a robust view of the transcriptome, and if there's a big difference between some of your samples it's going to be hard to compare them and see differential gene expression, for example. So it's definitely something to keep in mind throughout the analysis and interpretation: are these samples really outliers in a way that means they need to be excluded, or that you need to redo them? What you can do is try to make it at least as consistent as possible from sample to sample, which helps, and this is a situation where you may want to increase, perhaps substantially, the number of replicates to compensate.

Yeah, so in the low-complexity scenario where you have small numbers of cells, sometimes your biology just is what it is, right? The way to deal with that is effectively to sequence more cells, but because you can't get more cells from a single mouse, or from whatever procedure, you get more cells by increasing the number of experimental replicates, and then you can aggregate; the information will hopefully still resolve out of that if you have enough replicates. Yeah, that's a good question: do more replicates, or do pooling? People have really strong opinions about this, statisticians especially, and I'm not a statistician. I think there are some good papers, and I think we reference some of those in the supplementary materials, but I can't remember the details of the arguments for one versus the other. You see people doing different things, and some do a combination: each sample is a pool of, say, nine runs of isolating those cells, and then they do that whole thing three or five times. There's a lot of variability, but you definitely see the pooling strategy a lot. Sometimes you have to pool just to get enough material to do the molecular biology; a lot of these molecular biology steps simply don't work when the molarity is below some threshold, and you need enough material to pellet something at some point, so there's nothing you can do other than pool samples to the point that it becomes feasible. And that's an important consideration, because if that step is not robust to failure then you're introducing a lot of technical variability, since sometimes the molecular biology is just failing
So I think you want to pool to the point where you can make a library robustly and consistently, and then do that whole thing enough times to get the biological replicates, and the statistical power, to see the patterns you're going after. Other questions?

That's interesting, yeah. Tools usually use both the beginning and the end of the fragment to decide whether something is a duplicate. In RNA-seq you wouldn't remove duplicates, so they'll still be there, but you could go and look for them. I can't think of a specific example, but I'm sure there are tools specializing in transcription start site identification that take that kind of information into account, although the polymerase does tend to initiate within a window rather than at a single base.

Spike-ins are hard to answer about in an idealized way. They're a great idea and a really useful tool, and I think you're hitting on the point that a lot of people come to, which is that they're an extra step and an extra cost. So a lot of people use them at the beginning of working up a technology and then drop them, and that's pretty much what we do, even though we teach this course and tell you to use best practices. The data you're going to use has the spike-ins, and we're going to show you how to analyze and interpret them and use them to gain confidence that your data is good, but in our own practice, being truthful, we don't use them routinely. I would probably argue that we should, but our core says it increases cost, they want to charge us a lot for it, and they like things to be simple, so you wind up, as you commonly do, with the less ideal approach being taken. But definitely when you're starting with a new isolation protocol, or some kind of major change where the upfront biology is uncertain, spike-ins are a great idea.
They're a great way of telling you how much variability is in your process and where it's coming from: is it from the way I'm making my libraries, from the RNA isolation, from the analysis, and so forth.

Is there a formal protocol for working in the spike-ins? Yes; pretty detailed guidelines come with the spike-in reagents themselves, and there's a prescribed way to do it. You put a certain amount into each sample as early as possible in the process, basically as soon as you isolate RNA, so that the spike-ins go through the whole workflow. You're not spiking in a finished library at the end; you're spiking in material from which a library will be created, just as it's being created from the sample you care about. That way you get a sense of whether something failed: if you don't get a good readout on your spike-ins, something has gone wrong. At my university we often send material to the NRC facility, and I'd guess the thing to do is liaise with your facility and figure out whether they're used to handling spike-ins. A lot of cores will have it as an option on their menu, where for an added cost they'll include the spike-in, and if they don't, you could probably just do it on your end. In some ways that's desirable, because it's a control on the core itself: you're putting positive controls into your experiment before you even send the sample off, so you have an experiment baked in. If you don't get back a reasonable range of abundances that fit the expected linear relationship for the spike-ins, and if you don't see the expected differential between your two conditions, then you know something has gone wrong.
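That linear-relationship check is straightforward to sketch. The ERCC identifiers below use the real naming format, but the concentrations and counts are invented for illustration; with real data you would use the concentration table shipped with the spike-in mix:

```python
import math

# Hypothetical observed counts for a few ERCC spike-in transcripts, paired
# with their (invented) known input concentrations: {id: (conc, count)}.
spike_ins = {
    "ERCC-00002": (15000.0, 52000),
    "ERCC-00003": (937.5,   3300),
    "ERCC-00004": (7500.0,  26500),
    "ERCC-00009": (234.4,   800),
    "ERCC-00042": (58.6,    210),
    "ERCC-00136": (14.6,    48),
}

def log_log_r(pairs):
    """Pearson correlation of log2(known concentration) vs log2(observed count)."""
    xs = [math.log2(c) for c, _ in pairs]
    ys = [math.log2(n) for _, n in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = log_log_r(list(spike_ins.values()))
print(f"spike-in log-log correlation r = {r:.3f}")
# A correlation well below ~0.95 would suggest something went wrong upstream.
```

The exact threshold for "something went wrong" is a judgment call, but a tight log-log fit across the full concentration range is the signature of a healthy library prep.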
And are there recommended suppliers for spike-ins, or is it an open question? I don't know how many options there are. The one that we use and provide documentation for, the ERCC spike-in mix, is quite popular and widely used, so it will be just fine for most purposes. I'm sure there are competing options that are probably also fine, but there's nothing wrong with the one we recommend; it's the one we use when we do use them. Any more questions?

That's a really big question. If I understand correctly, the question is: by doing the ribosomal RNA reduction, are you throwing away potentially important biology relating to ribosomal RNAs? And yes, that's probably sometimes true. Some people spend their whole lives studying ribosomal RNAs; they're really important genes. This is one of the common, fundamental biases in the whole concept of most RNA-seq experimentation: it starts with a far-from-ideal manipulation of the transcriptome of your cells, a step that says throw away a huge chunk of all the RNA that's there. If you could skip that step and still have the data production and analysis be feasible, I think it would be desirable for sure, but we're just not at the stage where it's feasible. It's already going to cost you several hundred dollars to produce each RNA-seq library sequencing just the two to three percent of species that aren't ribosomal RNA; multiply that by roughly 50 to get the unbiased picture, and the file sizes, which are already big, go up too.

And this has become fairly consistent: if you sent all of your samples to the same core and they processed them in a consistent, reliable way, then even though they did this ribosomal reduction, it's not absolute; it's a partial enrichment for species that aren't ribosomal RNA.
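The cost argument above, that only ~2-3% of total RNA is non-ribosomal, so skipping depletion multiplies the required depth by roughly 30-50x, is simple back-of-the-envelope arithmetic; the percentages are the rough figures quoted above, not precise measurements:

```python
def reads_needed_without_depletion(target_non_rrna_reads, non_rrna_fraction):
    """Total reads required when rRNA is NOT depleted, so that the
    non-ribosomal portion of the data still reaches the target depth."""
    return target_non_rrna_reads / non_rrna_fraction

# Aiming for 25M informative (non-rRNA) reads when only ~2% of
# total RNA is non-ribosomal:
total = reads_needed_without_depletion(25e6, 0.02)
print(f"{total / 1e9:.2f} billion total reads needed")  # ~50x the depleted case
```

At roughly 1.25 billion reads per sample, both the sequencing cost and the file sizes become prohibitive, which is why depletion is used despite its biases.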
So as long as the ribosomal reduction is done fairly consistently, you can still look at the ribosomal RNA readout, because lots and lots of ribosomal RNA still comes through, and you can probably still get decent differential expression estimates, as long as you used the same reduction procedure for each sample. I don't know enough about ribosomal RNA biology to comment with any strong opinions, but there are actually a lot of ribosomal RNA species, and as you say, we're mostly just depleting the couple that are really highly abundant. Even that is somewhat of an oversimplification: they have a lot of homology to each other, so what's actually happening during that depletion probably affects quite a few transcripts, and it varies from species to species. I suspect a ribosomal RNA biologist would be somewhat horrified to see us throw away that information or brush it off, but I'm not one of those, so I guess I'll just brush it off.

All right, how much depth? I'm sure at least five of you are going to ask this question about your own experiments in the next couple of days, since that always happens, and we can totally do that. It really depends on your experimental conditions, but probably the biggest factor is what kind of analysis you're going to do. If you're really just looking for a gene expression readout, comparable to what an expression microarray would give you, that places the least demands on the amount of data you need to produce, and there are some pretty good papers out there saying that something in the range of 25 million reads is sufficient to get a fairly robust gene expression signature out of your samples.
In a situation where you have finite resources, which is everyone, you have to choose between how many biological replicates to include and how much money to spend sequencing each of those replicates. There's a trade-off: if you just want a gene expression signature, you might be statistically better off doing a smaller number of reads on a larger number of samples; that can be much better value. But if you want to really resolve the structures of all the transcripts, do mutation calling, or verify the expression status of SNPs or point mutations in your RNA, that places a much greater demand on the coverage achieved for each sample, and in that scenario you're probably looking at more like 50 to 75 to 100 million reads per sample to get a really robust picture of the transcriptome.

Of course, all of these very off-the-cuff recommendations depend on a lot of the factors we've been talking about: how well your ribosomal RNA reduction step worked, how much input material you had, how complex the transcriptome in your sample is, and so forth. So probably the more pragmatic way to approach the question is to identify publications that already did something in a similar system, involving your species or the kinds of comparisons or cell types you're using, and use those as a starting point. Even better is a pilot experiment, where you spend a little more money on a small number of samples, spend some time analyzing the data, maybe do some down-sampling experiments, and try to dial in what amount of data gives a robust readout for your actual experimental conditions, in your hands, with your samples.
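A pilot down-sampling experiment like the one just described can also be mimicked in silico: sample reads from a highly skewed expression distribution at increasing depths and watch where gene detection saturates. Everything here (the gene count, the Zipf-like weights) is an invented toy, not a model of any real transcriptome:

```python
import random

random.seed(42)

# Simulate a transcriptome where a handful of genes dominate, as in real
# RNA-seq: gene i gets weight proportional to 1/(i+1) (Zipf-like skew).
n_genes = 20000
weights = [1.0 / (i + 1) for i in range(n_genes)]

def genes_detected(n_reads, min_count=1):
    """Sample n_reads reads according to the expression weights and count
    how many genes are seen at least `min_count` times."""
    counts = {}
    for g in random.choices(range(n_genes), weights=weights, k=n_reads):
        counts[g] = counts.get(g, 0) + 1
    return sum(1 for c in counts.values() if c >= min_count)

# The detection curve rises quickly and then flattens with depth:
for depth in (10_000, 100_000, 1_000_000):
    print(f"{depth:>9} reads -> {genes_detected(depth)} genes detected")
```

With real data you would subsample the actual FASTQ or BAM files instead of simulating, but the shape of the curve, and where it flattens for your samples, is the quantity of interest either way.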
The good news is that the amount of data that comes off an Illumina instrument just keeps going up and up, so it's getting cheaper and cheaper to produce a large amount of data. That hundred million reads, if you want to go for the gold level of RNA-seq, is getting cheaper and is within reach of many labs, as long as they don't need to sequence a large number of samples.

Yes, the species makes a huge difference. I don't know that I've ever actually dug into this question, but it seems like there would be at least a back-of-the-envelope approach comparing genome sizes or numbers of transcripts. My initial instinct is: if a hundred million reads suits a human genome, with three billion bases and roughly 30,000 genes, or some given number of transcribed bases, you could extrapolate from there as a starting point, and that would probably get you into the right ballpark, ish. Then again, do a bit of a pilot experiment to dial it in further. Yeah, complexity and size.

I haven't done any analysis like this, but the capacity of the Illumina instruments seems so massive compared to what you would need that you may wind up being limited by your ability to index broadly. Because the amount of data you need is so small relative to what the instrument can produce, even within one lane, you need to be able to multiplex hundreds of samples into that one lane to get the cost effectiveness of the platform, and a lot of cores are limited in the amount of indexing they support: some support 96 indexes, some 192, and some are getting up into the 300 range. So you might also have to come up with your own indexing strategy.
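The indexing point is again back-of-the-envelope arithmetic; the lane yield below is an assumed round number, not the spec of any particular instrument:

```python
def samples_per_lane(lane_yield_reads, reads_per_sample):
    """How many multiplexed samples fit in one lane at a target per-sample depth."""
    return lane_yield_reads // reads_per_sample

# Assume a high-output lane yielding ~10 billion reads and a modest
# 25M-reads-per-sample gene-expression experiment:
n = samples_per_lane(10_000_000_000, 25_000_000)
print(f"{n} samples per lane")  # well beyond what a 96- or 192-index kit supports
```

At these assumed numbers, filling a lane takes several hundred samples, which is exactly why the available index kits, rather than sequencing capacity, can become the limiting factor.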
Any other questions on that? Usually sequencing cores will make more library material than you actually need to sequence, and mostly they will either store it for some reasonable amount of time or give it back to you. It's a cDNA library; if you store it properly, it's good for years and years, and the sequencing platforms are now so consistent that you can come back and add more data later. We've done this many times, adding data months or even years later, so it's definitely an option.

Mapping strategy: we talked a bit about this. Long story short, if your reads are really short, you'll want a different aligner than the ones we're going to use, but no one raised their hand for reads shorter than 50 base pairs, so we don't need to worry about that. You should make your reads long enough to use a splice-aware aligner such as the ones we're going to use, HISAT2 and STAR, probably the two most popular right now.

What if you don't have a reference for your species? We've talked about this a little. Number one, consider sequencing the genome of your species; data production is so cheap now that it's becoming feasible for even small labs to basically make their own reference genome. If that's not practical, a lot of people can get their hands on a transcriptome reference, which can be really useful. For the first two days we're not going to talk about de novo transcriptome analysis, but Brian is going to cover de novo transcript analysis on the third day, so even without a reference transcriptome, you could assemble one from the data and use that as a basis for downstream analysis.

That was just a sampling of some of the common questions we've encountered over the years. On the wiki, which we'll show you in a second, there are a lot more of these example questions and answers.