Right, and I don't know if you mentioned the license thing: all of this content is available under a Creative Commons license and is shared, as mentioned, on that GitHub site and also on rnabio.org. You're free to use it going forward and check back for updates. We've actually had a surprising number of people decide to take the materials and teach their own course on this. So one day you may be so expert at this that you're teaching others, and you're welcome to reuse this content if you find it useful for that purpose. The purpose of this lecture is really just to get us on the same page. Yeah, okay. Okay, sweet. I'll try to go through it relatively quickly, although I do encourage you to ask questions along the way. This is kind of the only lecture where we're going to talk about the background of RNA biology and how RNA data gets generated. You guys are all coming from really diverse experimental systems. There's probably a lot that's common between us in terms of the way data gets generated, but this is a good time to raise anything that you've always wondered about: the nuances of the way RNA-seq data gets generated and how that can influence downstream analysis and interpretation. So please do feel free to interrupt with questions. We're going to talk through the overall structure of the course. We're breaking it into these four components, but there's a lot of modularity within those. We'll start with an introduction. We'll spend quite a lot of time going through the reference genomes, the reference transcriptome, data file formats, the format of the raw sequence data, and data QC, so the kinds of things you do at the beginning of an RNA-seq experiment.
When you first get your raw data, say. And then a fair amount of time on alignment and the visualization of alignments, which also doubles as a mini course on the Integrative Genomics Viewer (IGV), which is really useful for transcriptome analysis to help you interpret and visualize what's going on with your data. Probably the biggest section, in terms of the amount of content, the complexity, and the time it takes, is expression and differential expression. We'll generate expression estimates in a few different ways, and then we're going to feed those into differential expression analysis, also with several different approaches. And we're going to talk about things like batch correction and pathway analysis of the gene lists that result. And then, as I mentioned, at the end we'll do an alignment-free expression estimation section. The vast majority of the time is going to be spent doing stuff at the command line. So there are relatively few lectures like this one. This will probably be the longest lecture, and then there are a few mini lectures sprinkled in where we introduce the background concepts, but this course is really meant to be pretty hands-on, so it's very applied. There's a lot of theoretical math and statistics behind the tools that we're using. We're not going to dive super deep into that; we can refer you to materials and lectures if you want to go deeper on that side of things. The principal aim of this course is, in as short a time as possible, to get you comfortable installing and running these tools and working through a relatively complicated pipeline of bioinformatics tools, from raw data to a result that could be interpretable in terms of a biological experiment.
And so quite a lot of work over the years has gone into refining the exercises to work in a real education setting. This has required us to create data sets that are a little bit contrived, which is a downside because they're not quite what you would get with a full-size, diverse data set. But the trade-off is that it allows us to run through all of the commands exactly as you would run them on a large data set, in a short amount of time. So there should be very few times where you're launching a command and then waiting for the computer to run for hours before the result comes back; everything happens in seconds or minutes at most. To achieve that we did have to create things just so. But basically everything that you see, you should be able to run exactly the same way on a full data set, and it just takes longer. So it's really set up with that in mind. And then there's rnabio.org, which is an online course that accompanies this course that you're taking in person. We're going to be walking through that step by step. The idea of that online course is to be self-contained, self-explanatory, and portable. You're going to drink from the fire hose for the next three days, and there will probably be times when it seems like it's going really fast and it's overwhelming. But that content is always there; you can always go back and review it. And there's a lot of commentary and written explanation that mirrors a lot of what we're saying out loud here, so the hope is that you'll be able to go back to that and be like, okay, now I remember what was going on here. If you find any sections of that online content that are unclear, vague, or feel like a black box, please do let us know, because we're always trying to improve it. Okay, so this is module one.
We're going to talk very briefly about the background, the molecular biology of RNA sequencing. We're going to talk about some challenges specific to RNA-seq, and some general goals and themes of RNA-seq analysis workflows. We're actually going to run through several parallel analysis workflows, and hopefully you'll start to see the theme of them. Then some common technical questions related to RNA-seq analysis, and then I'll introduce the hands-on tutorial and we'll get into the practical stuff from that point. I think a lot of you are biologists and already have a strong biology background. It sounded like there were a few people coming a little bit more from the computer science side and moving into interesting biology questions. But I think it's really useful to start with a brief review of the central dogma, which is what's depicted here. This is a classic. This is a cartoon that I created during my PhD. It shows the flow of information from a double-stranded genomic DNA template, which gets transcribed in a five prime to three prime direction. I'm depicting an example gene here that in some species might look like this, where you have three exons and two introns and the introns are really small, but in most eukaryotes the introns would be much bigger. They're shown small here just for display purposes. And then there are regulatory elements that control the transcription of this thing. You have a promoter region. You have a place where transcription is initiated. It goes to a certain point and then it terminates, and there's a polyadenylation site. This results in a single-stranded pre-mRNA molecule, so now we've gone from DNA to RNA, where the introns are still in place. We have a five prime cap and a poly(A) tail on the three prime end.
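For those coming from the computer science side, the gene structure on the slide can be written down as a small data sketch. This is only a toy illustration: the exon and intron sequences below are made up, the regulatory elements, cap, and poly(A) tail are left out, and the function names are hypothetical. It shows the pre-mRNA with introns still in place next to the spliced form the lecture turns to next.

```python
# Toy gene on the coding (sense) strand, written five prime to three prime.
# Exons are uppercase and introns lowercase purely for readability.
# All sequences are invented for illustration.
EXONS = ["ATGGCC", "GATTAC", "AAGTGA"]
INTRONS = ["gtaagtttctaacag", "gtatgtcactaacag"]


def transcribe(dna: str) -> str:
    """Return RNA with the same sequence as the coding strand (T becomes U)."""
    return dna.upper().replace("T", "U")


def gene_sequence(exons, introns):
    """Interleave exons and introns: exon1, intron1, exon2, intron2, exon3."""
    parts = []
    for i, exon in enumerate(exons):
        parts.append(exon)
        if i < len(introns):
            parts.append(introns[i])
    return "".join(parts)


genomic = gene_sequence(EXONS, INTRONS)
pre_mrna = transcribe(genomic)            # introns still in place
mature_mrna = transcribe("".join(EXONS))  # exons joined, introns removed

print("pre-mRNA length:   ", len(pre_mrna))
print("mature mRNA length:", len(mature_mrna))
print("mature mRNA:       ", mature_mrna)
```

In a real eukaryotic gene the introns would usually be far longer than the exons, so the length difference between the two forms would be much larger than in this toy.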
And then there's a second set of regulatory elements that govern how this thing gets spliced into a mature mRNA molecule. We have donor splice sites and acceptor splice sites, and a branch site. And then we have other, more subtle elements that may influence the splicing of this pre-mRNA in different tissue contexts, say: exonic and intronic splicing enhancers and silencers. All of these things work together to control the behavior of the splicing machinery, which is a complex of proteins that come along, remove the introns, and splice together the exons to give a mature mRNA molecule. It's still capped and polyadenylated, and now our exons are right together. And this thing contains our open reading frame, so a start codon and a stop codon that govern how it gets translated into protein. The protein is depicted here in a linear fashion with an N terminus and a C terminus. But of course that's not what proteins really look like; they ultimately wind up being folded, and then often have sometimes numerous post-translational modifications. All this to say that many of us are here to take an RNA-seq course or a single-cell RNA-seq course to study what genes are doing at that very basic, fundamental level. Many of us are actually interested in protein-coding genes, the proteins that make up the cell and have some impact on the phenotype or function that we're interested in. If there was a way to somehow just take pictures of these proteins, see what they were, and quantify their abundance in a high-throughput, cost-effective, easy-to-interpret way, probably many of us would do that. Some people are interested in non-coding RNAs, and in that case you would have to directly interrogate the RNA sequences, but a lot of research is at the end of the day focused on protein-coding sequences. But the technology really hasn't emerged there.
There is of course a whole field of proteomics, and it is gradually improving. But to this day there isn't really a cost-effective way to just get a snapshot of every protein that's being expressed and their relative abundances, to really accurately determine their identities and quantities. So in many cases, that's why we're doing RNA sequencing: as a proxy for quantifying proteins. And probably many of us have heard that that is not a perfect way to do it; there are sometimes discrepancies between what's happening at the RNA level and what's happening at the protein level. But it's an incredibly useful thing to be able to do. And it has other advantages for people that are studying, as I mentioned, genomes from species that are less well studied. RNA-seq has really revolutionized the ability to take a new species for which we don't have good gene predictions, where we don't know where the genes and exons and introns are. There used to be a whole field of bioinformatics focused on just looking at the DNA sequence and trying to anticipate what the genes were and where the exons and introns were. But now we can shotgun sequence RNA molecules, align them back to a reference genome, and get a view of what is actually happening. It has really revolutionized that field. So what is actually the subject of an RNA-seq experiment here, though? Which thing that we're showing on this slide is the closest to what we're actually sequencing in an RNA-seq experiment? There are five things being shown here; which one do you think it is? Yeah, the mature mRNA is probably the closest. There's a bunch of nuances in the way libraries are being made, but really what we're doing is taking these molecules, converting them to cDNA, and then sequencing those.
But how else is what we're doing in RNA-seq different from what's being depicted here? Okay, so say we're sequencing this, but is this really what we're sequencing? It's cDNA instead, yeah. What about the size of it? It's smaller. Yeah, smaller, right. So we fragment, or break into pieces, the cDNA. In your species the median gene length or transcript length could be 2,000 or 3,000 bases. If we could just sequence those things directly, again efficiently and cheaply, that would be nice, because breaking them into fragments creates another complexity for interpretation: now we're not sequencing full length, we're sequencing pretty small pieces of them. And then we have to think about how those relate back to the full-length transcript sequences. In recent years there has been quite a lot of advancement in long-read RNA sequencing. So hopefully in a few years this course will become redundant in this form and we'll have to actually redo it; I would love to be remaking this course to use purely long-read RNA data from either PacBio or Oxford Nanopore or maybe another one. But as amazing as the long-read sequencing technology is, and as much as it continues to advance, it's advancing very iteratively. Every year we hear that long-read sequencing that's high throughput and cheap is just around the corner, and it's been just around the corner for a while. I first read about nanopore sequencing in a Scientific American article in high school, in the 80s, 90s kind of thing. And here we are, 20 or 30 years later, still tweaking these pores, trying to feed molecules through them and get an accurate readout. They're getting pretty close, so it could actually be that in a couple of years there'll be a big shift where we transition a lot of this type of analysis to long-read sequencing platforms.
But right now it's still pretty hard to compete with the cost-effectiveness and throughput of Illumina short-read sequencing. You can generate an RNA-seq library and very comprehensively sequence the molecules in that cDNA library for a pretty affordable price. To achieve the same level of sensitivity and depth on a nanopore sequencing platform is still very, very expensive; you have to run a PromethION multiple times to get the equivalent of a single bulk RNA-seq experiment done on Illumina. Yes. Yeah, so I guess I was oversimplifying a little bit. This molecule here is the closest to what we're sequencing, but then there's a bunch of nuances, because there are different library construction approaches that either enrich for the mRNA molecules by priming off of the poly(A) tail, or decide not to do that because they are interested in RNAs that are both polyadenylated and non-polyadenylated, and there are other reasons to do it too. I'm going to have a slide that talks through some of those differences in library construction approach. So this is a really simplified overview of what an RNA sequencing experiment looks like, and it mirrors a little bit what we're going to do in our hands-on exercises that we'll walk you through, and also the stuff that you'll work on on your own. You start with some samples of interest: tissue, cells that you culture, something that you isolate from a growing organism. Someone said they were doing ancient RNA, so I assume that's not from a living thing. That's amazing. So you get a sample from somewhere. And you isolate RNA from it, and you try to isolate it in such a way that you get RNA that's as high quality and intact as possible, and your mileage may vary depending on the source of your RNA. And then you generate cDNA from it and, as was mentioned, fragment it.
Sometimes you size select; some people do size selection, some people don't, but you wind up with a range of sizes, say 200 to 400 bases. And then you create a library by adding Illumina linkers to either end of these fragments, and then you flow those fragments across an Illumina flow cell and use that for this massively parallel sequencing approach. What you get back are these fragments that are often depicted like this. You'll see many different tools and figures that depict an RNA-seq or other NGS sequencing read like this, as two little boxes connected by a line. In this case the dark blue and the dark red pieces are the adapters on the ends of the cDNA sequence, which initiate the sequencing reaction. And the sequencing is often from both ends towards the center. If the fragment is small enough relative to how long your reads are, the reads might join each other in the middle. If the fragment is a little bit bigger and your read lengths are a little bit shorter, there might be a part in the middle of the fragment that you didn't quite sequence, so sometimes that's called the insert, or, I guess, the unsequenced portion. And then you get a range of these situations: for some fragments the two reads overlap completely, for some they don't overlap at all, and for some they overlap just a little bit. And the analysis tries to take all that into account. The analysis really involves taking all these sequenced fragments, aligning both ends to a reference genome, and then feeding those alignments into quite a variety of downstream analyses for different purposes. There are some general challenges to this process. In any biological experiment you have challenges associated with the samples and how those are obtained.
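The read and fragment geometry described above comes down to simple arithmetic. Here is a minimal sketch, assuming adapters are ignored and both reads have the same length. The function name and the term "inner distance" for the unsequenced middle are choices made for this sketch, not something taken from the slides.

```python
def paired_end_layout(fragment_len: int, read_len: int) -> dict:
    """Describe how two reads of read_len cover a fragment of fragment_len.

    Lengths are in bases; adapters are ignored in this sketch.
    """
    # A read cannot cover more than the fragment itself.
    covered_per_read = min(read_len, fragment_len)
    # Positive: unsequenced gap in the middle. Negative: the reads overlap.
    inner_distance = fragment_len - 2 * covered_per_read
    if inner_distance > 0:
        layout = "gap"
    elif inner_distance == 0:
        layout = "reads meet"
    elif covered_per_read == fragment_len:
        layout = "complete overlap"
    else:
        layout = "partial overlap"
    return {"inner_distance": inner_distance, "layout": layout}


# Hypothetical fragment sizes with 100-base reads.
for fragment in (400, 200, 150, 100):
    print(fragment, paired_end_layout(fragment, 100))
```

With a 200 to 400 base size range and 100-base reads, most fragments would fall in the "gap" or "reads meet" cases, while shorter fragments slide into partial and then complete overlap.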
In some contexts you may have purity considerations. That's important here because we're talking about bulk sequencing, and maybe the cells that you're interested in don't represent all of the cells in your sample, and that complicates your analysis and interpretation. Then quantity: you may be working with a system where it's hard to obtain a large quantity of RNA. This is not super common but does happen. And then quality: RNA is famously fragile. It can be degraded relatively easily, and so there's a lot of concern about RNA being degraded and whether that can influence your analysis and interpretation. This is one of the areas where short-read fragment sequencing actually works out in our favor, because we're expecting to sequence short reads anyway. It's actually not really a problem if our RNA is degraded a little bit, because we're going to break it into pieces anyway. But it can become a problem when the RNA gets so degraded that the pieces are smaller and smaller, and once they get really small that can create a problem. And of course the degradation may not be random; it could be biased. So that can be a problem. When you're doing experiments comparing conditions or perturbations, if you have some kind of systematic bias, like all of your tumor samples are really degraded and all of your healthy normal comparison samples are really intact, that can create batch effects or things like that, which can introduce complexities. And then, as I showed on the central dogma figure, a lot of the RNAs that we're actually trying to profile consist of small exons that, before being spliced, were separated by large introns. But now we're sequencing the mature mRNA, where the exons are spliced together.
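To make the small exons and large introns point concrete, here is a toy sketch of where a read from the mature mRNA would sit on the genome. The exon coordinates and the function name are hypothetical, and this is not how a real aligner works; it only shows that a read crossing an exon-exon junction lands on the genome as separate blocks with an intron-sized gap between them.

```python
# Hypothetical gene: exon coordinates on the genome (0-based, end-exclusive).
EXONS = [(1000, 1100), (5000, 5150), (9000, 9200)]


def transcript_to_genome(tx_start: int, length: int, exons=EXONS):
    """Map a read starting at transcript offset tx_start to genomic blocks.

    Returns a list of (genome_start, genome_end) blocks. More than one
    block means the read crosses at least one exon-exon junction.
    """
    blocks = []
    offset = 0          # transcript position where the current exon starts
    remaining = length  # bases of the read still to be placed
    pos = tx_start      # current transcript position of the read
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        if remaining > 0 and offset <= pos < offset + exon_len:
            start = g_start + (pos - offset)
            take = min(remaining, g_end - start)
            blocks.append((start, start + take))
            remaining -= take
            pos += take
        offset += exon_len
    if remaining > 0:
        raise ValueError("read does not fit inside the transcript")
    return blocks


# A 100-base read starting 50 bases into the transcript spans exons 1 and 2.
blocks = transcript_to_genome(50, 100)
print(blocks)
print("genomic gap between blocks:", blocks[1][0] - blocks[0][1])
```

A read that is contiguous in the transcript ends up split on the genome, which is the extra work a splice-aware aligner has to do compared with aligning genomic DNA reads.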
So we're usually going to align those sequences back against a reference genome, where you have exons and introns, and that creates a mapping challenge. That's a much harder alignment algorithm problem than DNA sequencing and alignment back to a reference genome, and it can create some complexities in the analysis. Another thing that's quite different from, say, DNA analysis is that the relative abundance of RNAs varies widely. This is one of the reasons why bulk RNA sequencing on, say, the Illumina platform remains so popular: because of the amount of data you can get for a relatively low cost, it can overcome the problem that there's this huge range in the expected abundance of different RNA molecules just based on their function. There are some RNAs that are functional at very low copy number. Telomerase is an example I like to use. Telomerase is a very, very important protein. It helps to maintain the ends of chromosomes. It does that with only a few copies per cell; it's very, very sparse, and you don't need a ton of this protein around. When you measure its expression level, you need a fairly sensitive technology to measure it in many systems, and bulk RNA-seq works great for that. Other RNAs in the same sample, at their normal expression, are part of the machinery, the engine of the cell, or they are part of the structural components of the cell, and they might be present in tens of thousands or even hundreds of thousands of copies or more per cell. And we're just sequencing randomly, so we just get what we get. So you have this problem in RNA-seq analysis that you're essentially randomly sampling reads from a pool, and you just tend to get a lot of copies of the things that are abundant. When you're interested in many genes that are not the most highly abundant genes or transcripts in the cell, you have to sequence more deeply to get past all of this really abundant stuff.
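The random sampling argument above can be put into numbers with a short sketch. The copy numbers below are hypothetical and chosen only to span a wide range, and the model deliberately ignores transcript length and library biases: it assumes each read is drawn in proportion to the number of copies of a molecule in the pool.

```python
def expected_reads(copies_per_cell: dict, total_reads: int) -> dict:
    """Expected read count per transcript under purely proportional sampling.

    Simplification: transcript length and library biases are ignored.
    """
    total_copies = sum(copies_per_cell.values())
    return {
        name: total_reads * copies / total_copies
        for name, copies in copies_per_cell.items()
    }


# Hypothetical copy numbers, not measurements.
pool = {"rare_gene": 5, "typical_gene": 500, "abundant_gene": 100_000}

for depth in (1_000_000, 20_000_000):
    reads = expected_reads(pool, depth)
    print(depth, {name: round(count) for name, count in reads.items()})
```

Under these assumptions the rare transcript's expected count scales linearly with depth, which is why deeper sequencing is the usual way to get past the very abundant species.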
And this is not a problem that you have in other types of sequencing experiments. This is due to things like ribosomal and mitochondrial genes that are just very, very highly expressed. RNAs also come in a wide range of sizes, and that can influence interpretation as well. To some degree, when you're estimating the abundance of something, you have to account for the fact that if it's a really large thing, you'll sample it more easily than you'll sample a really small thing, all other things being equal. And then for really small RNAs, just because of the way the libraries are constructed, you tend to lose really small molecules with your typical bulk RNA-seq library preparation strategy. I think someone mentioned small RNA research. The analysis we're going to do would probably require some tweaks to work for small RNA, although many of the concepts would be the same. And then I already mentioned that RNA is really fragile. Many of you are familiar with the Agilent assays, something that's still widely used, or something similar, some kind of electropherogram. You run your RNA essentially on a gel, but through a capillary. And then you get a readout that looks something like this. You get spikes over time: the smallest molecules come through the capillary first and larger molecules take longer, just like on a gel. You read out abundance from fluorescence as the RNA molecules pass a detector, and you get peaks or spikes of detection. This example is from human. In a really intact sample, when you have total RNA that's not degraded at all, what you see generally is two peaks that correspond to the sizes of the ribosomal RNAs in that species, and each species has a characteristic pattern, which shows up in the height, proportion, and cleanliness of these peaks.
Tools like the Agilent assay will give you a score of intactness or integrity based on those peaks. A score of 10 on this system is perfect intactness: the RNA is thought to be not degraded at all. As it starts to be broken into smaller pieces, you see something more like on the left here, where you still see those two ribosomal peaks, but you also see a lot of other smaller peaks where the RNA has been broken into smaller pieces. I provide a link here as a reference to a series of runs of this instrument with RNAs isolated under different circumstances: everything from a cell line sample, where the cells were growing and happy one second and the next second they were in TRIzol and RNA was being isolated, and the RNA that comes out of that is super intact, to an FFPE block that's been sitting on a shelf for 10 years, where we scraped off some dust and tried to isolate RNA from that, and a bunch of examples in between. It just turned out that during my PhD I worked on all these different projects where I experienced that full range, from really great to really terrible. So I provide a bunch of examples just as reference points that you can visually compare. This is another reference slide. I won't really talk through this; it's actually really old now. When RNA sequencing first started, consortium efforts launched that said, we're going to use this amazing technology to study the transcriptome in some systematic way across a bunch of thematically connected perturbations or different cell types and so forth. And some of these consortia said, well, we're starting this big initiative, so let's step back, write some guidelines, assemble a team of experts, talk this through, and decide what the characteristics of an ideal RNA-seq experiment are.
What kinds of things do you think about when you're planning and preparing for the analysis? How should you generate your data? What kinds of controls should you use? And so forth. I link to these documents here. They're about 10 years old now, but it's really fundamental stuff, so I think it's still useful to review the ideas that they cover before starting a new RNA-seq experiment. Does anyone already have RNA-seq data in their hands that they're waiting to analyze, or starting to analyze already? Okay. And then maybe some others for whom it's on the horizon, where there's a plan to do some RNA sequencing in the next six months or a year or something? Yeah, okay, so a few more. Okay. So for some of you it might make sense to take a look at some of these documents that talk about RNA-seq experimental design. It was mentioned earlier, the distinction between sequencing mRNA molecules versus non-polyadenylated molecules. That is one of several tweaks to the way that RNA-seq data gets generated. And I would say that to this day there still remains quite a lot of diversity across different sequencing centers or cores. I'm guessing many of you isolate RNA, or send tissues or RNA, to some kind of service that generates the RNA-seq data for you. And they may have a menu of choices; doing it two or three ways is still pretty common. Those two or three choices may differ between the core at this university and the core at the university in the next province over, or even within your same university there might be different people doing it different ways. These are the major examples of where the variation is. One of the things that you still continue to see is that they'll offer an option, or they just prefer, to do a poly(A) selection of the RNA before the sequencing.
And that's a really important distinction, whether that's being done or not. If it is being done, it means it's more important that your RNA be really intact, because you're going to be priming off the poly(A) end of all of your RNA molecules. If your RNA is degraded, that means you might be missing all of the five prime end past the point where the breaks happen. So you can wind up with data that's quite biased towards the three prime end of your transcripts. And then of course you're not getting the non-polyadenylated RNAs, some of which may be of interest to you. But the advantage is that it really enriches for the coding transcripts, and it means that if you are interested primarily in polyadenylated species you get more bang for your buck: you get deeper sequencing of those molecules with less total data generated. Pretty much every RNA-seq experiment involves ribo reduction, so some kind of attempt to get rid of the hugely abundant ribosomal RNA molecules that, in most of your species, in most cases really dominate. In human, people say 95 to 98% of all the molecules correspond to the ribosomal species. So you can't just sequence the total RNA, or all you would be sequencing is these few ribosomal species. You need some strategy that gets rid of them, and I'll talk through the competing strategies that people use for that in a couple of slides. For size selection, there is quite a lot of variability in how size selection is done and also in how fragmentation is done. Some people do the fragmentation on the RNA, some people do it on the cDNA. Some people will not do fragmentation if they expect the sample might already be fragmented. Some people try to do it uniformly. There's enzymatic approaches.
There's sonication approaches. And then you might decide to do size selection after the fact, or tolerate a broader range of sizes, and that can depend a little bit on just the way your core operates. They might be combining your RNA-seq data generation with a lot of other experiments, some of which are not RNA based. In that case it might be problematic for them to have some libraries with a much wider distribution of fragments than what they're typically seeing from, say, their DNA or ATAC-seq or all the other things that they're doing. For people that are working with really small samples, there are some amplification strategies that are still pretty widely used that involve a linear amplification. There still remains, too, quite a lot of variety in whether your library construction retains the strand information or not. This gives you a qualitative difference in the output. You get your sequence reads, and in some libraries you can't tell which strand was being transcribed. You can infer it by the way it aligns to the gene: oh, it aligned perfectly to this gene, inside of this exon, and I know that that gene is transcribed in this direction, so it's probably from that gene and it probably came from that strand. Or if it aligns across an exon-intron boundary, you can say, well, there's only one way that really makes sense, because I have an exon-intron boundary. But there are library construction strategies where the strand information is explicitly maintained, so that right in the sequence information itself you can say, this came from this strand.
And that allows you to disambiguate areas where there's actually transcription happening in both directions at the same position. In some species you'll have genes that are arranged on top of each other, at least at the ends, and you'll have parts that overlap. You could have reads that align in that chunk where, if you didn't have the strand information, you couldn't be sure in which direction transcription was actually happening, because you're ultimately sequencing double-stranded cDNA in that instance. Exome capture is one of the ways that some people enrich for sequences that correspond to mature or known transcripts. I'll talk a little bit about that as well. And then library normalization: there are some molecular biology approaches other than ribo reduction that are sometimes used to try to account for the differences in abundance between really highly expressed stuff and really lowly expressed stuff. So there are a few strategies that people use on top of the ribo reduction to try to get rid of really, really abundant species in, say, their particular type of tissue. For all of these things, I don't recommend or not recommend any of them, because they really are done depending on what's needed in your experimental system. The main thing to keep in mind is that if these things are varying between the samples that you're hoping to analyze in your experiment, you need to be really aware of the possibility of batch effects. If some of your libraries were created with selection and some of them were not, and you're hoping to do differential expression analysis, you definitely have a potential batch effect that you need to try to account for.
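One practical way to act on that warning is to cross-check the sample sheet before any analysis. The sketch below is only an illustration: the field names (`condition`, `library_prep`), the sample values, and the function name are all hypothetical. It lists, for each condition, which library preparation methods never appear in it; if a condition is missing a method that other conditions use, condition and method are at least partly confounded.

```python
from collections import defaultdict


def missing_batches(samples, condition_key="condition", batch_key="library_prep"):
    """For each condition, list batch values used elsewhere but never in it.

    An empty result means every condition was prepared with every method.
    """
    seen = defaultdict(set)
    for sample in samples:
        seen[sample[condition_key]].add(sample[batch_key])
    all_batches = set().union(*seen.values())
    return {
        condition: sorted(all_batches - batches)
        for condition, batches in seen.items()
        if all_batches - batches
    }


# Hypothetical sample sheet where prep method tracks condition exactly.
confounded = [
    {"id": "T1", "condition": "tumor", "library_prep": "polyA"},
    {"id": "T2", "condition": "tumor", "library_prep": "polyA"},
    {"id": "N1", "condition": "normal", "library_prep": "ribo_depletion"},
    {"id": "N2", "condition": "normal", "library_prep": "ribo_depletion"},
]
print(missing_batches(confounded))
```

When the result is non-empty, as it would be for this sample sheet, a difference between conditions cannot be cleanly separated from a difference between library preparations, which is exactly the situation the lecture warns about.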
Okay, so just to walk through a little bit more of the molecular biology, here's a really cartoon depiction of what the generation of data looks like, starting with tissue at the top. You have some source of tissue that you take a chunk of, and you isolate total RNA from it. Then there's usually a point where you send some of the sample down a side path for a QC check. I'm depicting a gel electrophoresis. I don't think many people actually do this anymore; they used to actually take RNA, run it on a gel, and visually look at it to see the abundance and quality of the RNA. More common is to use some kind of quantitative gel electrophoresis type assay, like I talked about here, where we are looking for a particular pattern of fragment sizes to assess quality. So we do this and we decide that we have enough RNA, and it's of sufficient quality, that we're going to make a library from it. And so we go back through this workflow, where we'll do a DNase treatment. That's to get rid of genomic DNA. It's not 100% perfect, but it will get rid of the bulk of the genomic DNA that's still in your sample, because every piece of DNA there is going to come through and potentially be sequenceable. Once you convert your RNA to cDNA, there's not really a way for the technology to distinguish whether a fragment came from genomic DNA or cDNA. The next step is to do cDNA synthesis, so all of your RNAs get converted to cDNA. It may have already been fragmented, or you might fragment at this point; it doesn't really matter. So wherever your core likes to do their fragmentation. And then there may be a size selection where they just pick out a range of sizes.
And at this point, there's usually some kind of at least basic cleanup that gets rid of really small stuff, or an explicit size selection that gets rid of small stuff, so at this point you're probably losing small RNAs and RNAs that are particularly fragile, or maybe some RNAs that form certain secondary structures. The result is that this is pretty much an unbiased sampling of the transcriptome, minus really small RNAs. So it's a pretty holistic view of the transcriptome. Then you add your sequencing adapters to the end of each fragment and feed them into a sequencing experiment. Usually relatively near the beginning, or, well, it can be done at different steps, but at some point during this workflow there's the enrichment step. And the three most popular enrichment strategies are depicted here. In the top left, we're just depicting something that's not enriched at all. So this is imagining just the total RNA, where you basically would just be sequencing ribosomal RNAs like crazy and not getting a ton of useful information out of your RNA-seq. Nobody does that. They pretty much all do one of these three options depicted to the right here. I would say probably the most popular overall is the ribosomal RNA reduction. So this is where you're selecting for ribosomal RNAs. You're basically using a series of probes that hybridize to ribosomal RNA molecules, you're holding on to those, and then you're washing everything else through, and it's what washes through that's basically the RNA molecules that you care about. And sometimes you might do that twice, or you might do it once, check how well it worked, and then do it again. And then the competing approach to that is flipping the selection the other way, which is the poly(A) selection approach, where now you're holding on to, with a hybrid capture, the molecules that you do care about.
And you're washing through all the ribosomal RNAs and you're keeping the stuff that was on the column. Another approach is to do a cDNA capture where, instead of capturing the poly(A) tail, which biases you towards polyadenylated species, you're directly capturing with an exome reagent, which is basically hybridizing to every known exon that was designed into your exome reagent, holding on to those things and washing everything else through. And sometimes a combination of these things will actually be done; you might do ribosomal RNA reduction and then do a cDNA capture. The cDNA capture is really great if you're really interested in coding transcripts that are already known and your species already has one of these designs available. It's pretty easy to do this capture, and then you really enrich for the RNA molecules that you care about, and it surprisingly doesn't introduce much bias in terms of their abundance. The things that are more abundant still come through and they're still by far more abundant, and the things that are not as abundant still come through as being not as abundant. It does bring the low stuff up a little bit and bring the high stuff down a bit, so it compresses the range of expression values, which is also kind of an advantage. So you spend a bit more money prepping your library, but then you're able to sequence it really efficiently, and every time you pull a new read it's quite likely to be something you haven't seen before. I mean, it's kind of a hard thing they're doing in terms of molecular biology, because it is a bit of a needle-in-a-haystack problem. Even though you can synthesize oligos very efficiently and get huge molar amounts of them, you're still trying to hold on to a ton of molecules, and it's just hard to capture them all when you do the wash-through.
It's hard to really hold on to all the ribosomal RNAs. Some of them kind of get washed through again, and the more of them there are, the harder it is to really hold on to all of them. But a second round will usually solve that problem, because the second time through the ratio is way down: instead of being like 98% to 2%, in the second round it's more like 50/50, so it's much easier to get the last few of them. And it's never perfect, even with two rounds. If you want to sequence ribosomal RNAs, you'll get those no matter what you do, because they're just so abundant. Yeah. Any other questions on any of the library construction or data generation stuff that we've talked about to this point? Yes. Yeah, so they do come through and they do tend to be very abundant. I think for them, if you really are still finding that they're dominating your counts too much, then you have to go to one of these: there are a number of products available now for library normalization that are keyed towards different types of tissues. Some of them will remove hemoglobin-related RNAs, because people studying blood conditions sometimes have just tons and tons of these molecules, and I believe there are also mitochondrial ones. And they use kind of a similar... well, they use different strategies, actually. There are some that use this kind of approach, a hybrid capture, and instead of using oligos that match the ribosomal sequences, they match other things. It could be anything that you think is too abundant that you're trying to get rid of. But there are also now CRISPR-based depletion strategies, where you basically use targeted CRISPR to chew up things that are too abundant, to get rid of those things and bring everything else up, relatively speaking. Yeah.
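The two-round ribosomal depletion arithmetic can be sketched in a few lines of Python. This is a simplification under stated assumptions: the 98:2 starting ratio comes from the lecture, but the 98% capture efficiency per round is a hypothetical value chosen so that the first round lands near the 50/50 mix described above, and the model assumes that no non-ribosomal RNA is lost and that efficiency is the same in both rounds.

```python
def rrna_fraction_after_rounds(rrna, other, capture_efficiency, rounds):
    """Return the rRNA fraction of the library after each depletion round.

    capture_efficiency is the (assumed) fraction of remaining rRNA that the
    probes hold on to in one round; non-rRNA is assumed to pass through intact.
    """
    fractions = []
    for _ in range(rounds):
        rrna *= (1.0 - capture_efficiency)  # rRNA that escaped capture
        fractions.append(rrna / (rrna + other))
    return fractions


# Start at 98 parts rRNA to 2 parts everything else.
fractions = rrna_fraction_after_rounds(98.0, 2.0, capture_efficiency=0.98, rounds=2)
for i, fraction in enumerate(fractions, start=1):
    print(f"after round {i}: {fraction:.1%} rRNA")
# after round 1: 49.5% rRNA
# after round 2: 1.9% rRNA
```

Even with a capture step that removes 98% of the ribosomal RNA, the first round only gets the mix to about half and half, which is why a second round helps so much and why some ribosomal reads always remain.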
In the analysis, can you remove them? Yes, once you get to the analysis stage you have total power: you can remove them or you can ignore them, and those can kind of be equivalent. I think a lot of people will just allow everything to come through and then choose to ignore transcripts that were really abundant that they're not interested in. They generally don't cause a problem for the analysis, as long as you're able to run your computational workflow with the resources you have and in a reasonable amount of time. Ambient RNA, I think, is usually meant in the context of single-cell RNA-seq. So the distinction with single-cell RNA-seq, of course, is that you're sequencing single cells. There's a kind of assumption that you form a droplet of oil or whatever, and the RNA that's in that droplet came from the cell that was in the droplet, which was then exploded, and you make a cDNA library inside of that droplet, in essence. But these droplets are all in a solution, and you can have cells that had already broken apart, where there's just RNA everywhere, and that RNA just kind of gets into these droplets, and it isn't from a cell that was inside the droplet; it's just from the soup. And that's why we call it ambient RNA. In a bulk RNA experiment, essentially all of what we're sequencing is ambient RNA, because the very first thing you do is break open every single cell. It all gets mixed together. And that has the advantage that you can have cells in a mouse, or in the freezer, or in tissue culture, and then you can directly go to a state where you have RNA molecules that are protected from degradation. And you can make a very robust sample quite easily.
But it's bulk right from the first step. Everything's in a big pile, and you have to pull things out of the pile and try to figure out how they corresponded to different cells. What about the RNAs outside of the cell, extracellular RNA? So you're interested in RNAs that are extracellular. I think in bulk RNA-seq experiments you can profile them, because the way the RNA is isolated shouldn't remove them, right? Every cell is broken open, but RNAs that were outside the cell are still there too. So the bulk RNA-seq experiment should be able to detect them. Single-cell exosome analysis, I'm not sure if that's for the both, you know, just for everyone. As much as I know, you know, they just wrap whatever they have, you know, the cell media, and they try to, you know... That's fine. If you want to enrich for them specifically, then I guess that would be an earlier step where you try to actually separate your cells from the non-cellular RNA. Yeah, yeah. Yeah, they could either be part of the bulk or you could try to separate them. And I think either path should feed into this kind of library construction relatively well. Any other? Yes. Do any of these options affect the price, the cost of the... Oh, that's a good question. A little bit. I mean, if you do an exome capture, that adds a pretty good chunk of cost, because you have to pay for an aliquot of the exome reagent and you have to pay for an exome capture step, which takes time and technician labor. But then the sequencing is slightly more efficient, so you could produce a little bit less sequence data. Overall it still costs you more, though, because at this point the per-base sequencing cost has become so low that in many cases the library manipulation costs can actually dominate your costs. So that is something you wouldn't do lightly.
I think if you're trying to maximize the number of samples you can process, you just want as many replicates or as many conditions, as many critters or plants or whatever it is, as possible. In that scenario I would recommend probably going with the ribo-reduction, because you have to do something to reduce the amount of ribosomal RNAs, and then going straight into bulk RNA-seq library construction and then sequencing at a relatively low depth, where your emphasis is more on sample count than it is on depth of individual libraries. But yeah, long story short, the answer is yes, but not huge amounts. It's like an extra couple hundred dollars here or there. Here's a visual depiction of the strand information that I talked through, showing that sometimes you have libraries where you align your reads, and the red and blue indicate the strand, and there's just kind of a random mix of the strands, so you're not being told what strand the RNA was transcribed from. And then the bottom is where you have the strand information encoded in the data in some way. We're going to show an example in IGV of what it looks like during the hands-on part. Replicates: technical replicates are not something people really do. You don't need to worry about two runs of the instrument being different, or two different flow cells or lanes. You're probably even mostly okay to sequence some of your samples at one core and then later some of them at another core; as long as the sequencing parameters are the same, like sequence read length and things, even that is probably fine. But biological replicates, of course: in biology you need replicates, and RNA-seq does not make that go away. There are a lot of different types of analysis you can do with RNA-seq, and some of the questions that you want to ask might influence how deeply you sequence.
So if you're just interested in gene expression and differential expression, which is what we're going to focus on mostly in this course, you can get away with relatively sparse sequencing of each sample. And you may get better value in your statistics from doing more samples at a shallower depth than from doing a few samples at a deep depth. But then if you're doing other things, like alternative expression analysis, or if you're trying to discover new transcripts, or looking at allele-specific expression, detecting mutations or fusions, or studying RNA editing, all of those things will require you to sequence your library more comprehensively. And as I said, there are some themes in RNA-seq workflows. They generally follow this pattern: you obtain your RNA data, you align or assemble the reads, and you then have an alignment file that feeds into a lot of different downstream tools, which all take the same alignment file but ask different questions of the alignments for different purposes. And then each tool will output some often almost incoherent, hard-to-understand, crazy custom-formatted output files that you then feed into some kind of post-processing to visualize, understand, sanity check, and produce figures. And then you get to the point where you're really trying to synthesize and make an interpretation from the data. This is from a review in 2019, so it's a little bit dated but still pretty accurate, showing some example workflows. These are some of the really popular RNA-seq workflows. And we're going to go through several of these in detail, doing a reference-genome-based approach and a reference-free approach, and then some different branches within the reference-genome-based approach.
So the final slide I had was this. We've already talked a little bit about the difference between bulk and single cell. Some of you are here because you're primarily interested in single-cell RNA sequencing, some primarily bulk, some both. I would say that these things are really quite different still. There are experiments where single cell brings a lot of value, but I would not say that single-cell RNA sequencing has displaced bulk RNA sequencing, as if it's somehow the same thing but you know everything about individual cells. They're actually very different in what you get, in a bunch of different ways. One way is cost. Single-cell experiments are still very expensive, because in order to produce a decent amount of data per cell you just need a large amount of data in total. The total number of cells is really important to think about when you're interpreting these experiments. The total number of cells that you're profiling in a single-cell experiment is something like 10,000, or 15,000 if you're lucky. They're on different planets: with bulk RNA-seq it could be millions or tens of millions of cells making up the milieu that you isolated your RNA from, versus 10,000, so orders of magnitude different. If you're interested in genes that are rarely expressed, you're just not going to see them at all in single cell. The single-cell data you get from each cell is extremely sparse. It's a super, super patchy picture of the transcriptome with tons of dropout, whereas with bulk RNA-seq you don't know what's happening in individual cells, but you get a very, very high-resolution image of that bulk transcriptome. So I think there are a lot of cases where it makes sense to do both. This is just a reference slide, again pointing to a supplementary table that has a whole bunch of common questions with answers.
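The sparsity point about single-cell versus bulk data can be illustrated with a rough Python sketch. It treats sequencing as Poisson sampling of molecules, which is a modeling assumption and not something from the lecture, and every number in it (5,000 molecules captured per cell, 30 million reads for a bulk library, a transcript making up one in a million molecules) is hypothetical and chosen only for illustration.

```python
import math


def detection_probability(sampled_molecules, transcript_fraction):
    """Probability of seeing at least one molecule of a transcript.

    Assumes the observed count is Poisson with mean
    sampled_molecules * transcript_fraction (a simplifying assumption).
    """
    expected = sampled_molecules * transcript_fraction
    return -math.expm1(-expected)  # equals 1 - exp(-expected)


rare_fraction = 1e-6  # hypothetical: one in a million molecules

per_cell = detection_probability(5_000, rare_fraction)    # one single cell
bulk = detection_probability(30_000_000, rare_fraction)   # one bulk library

print(f"single cell: {per_cell:.4f}")  # single cell: 0.0050
print(f"bulk:        {bulk:.4f}")      # bulk:        1.0000
```

Under these assumptions the rare transcript shows up in roughly one cell in two hundred, so the per-cell profile is mostly zeros for it, while a single bulk library detects it essentially every time.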
So that's just there for you to refer to. And yeah, so that's it for the lecture. We're going to transition now into really doing the hands-on content. This is a really high-level overview of the pipeline that we're going to walk through. We're going to start with raw sequence data. We're going to align it and do a transcript compilation, where we essentially assemble and estimate the expression of individual transcripts. And then we're going to feed those into downstream differential expression, pathway analysis, and visualization modules. But before we get to that, we're really going to study the fundamentals of the input data files from a bioinformatics perspective.