So you guys must be totally overloaded with information by now, and I'm afraid it's going to get worse before it gets better. I guess you've seen this a bunch of times now, so I don't think we need to spend a lot of time on it: you can use the materials under the Creative Commons license. This is module nine, and we're going to cover RNA sequencing and analysis. I should point out that this module was developed primarily by my brother, Malachi, who has done this session in this workshop for the last several years, I think. At least last year. And we've also done this workshop elsewhere. So I'm going to try to put my brain into Malachi's way of thinking for delivering his slides, and then we'll go through the lab as well.

So just to go over the learning objectives. Basically, we're going to give you an introduction to the theory and practice of RNA sequence analysis. We're going to go over the rationale, some of the challenges, and the common questions. And then the lab portion, the tutorial, we really hope will act as a practical resource for those who are probably mostly new to the topic of RNA-seq analysis. We hope it'll provide a working example of an analysis pipeline. We're using gene expression and differential expression as an example task, but there are other kinds of RNA sequencing analysis that would proceed in much the same way, and this pipeline will hopefully give you some clues about how to tackle those problems as well. The idea is that even if you don't understand everything fully here, what you're taking home is, hopefully, self-explanatory, complete, and runnable. So when you get back to your labs, if you want to try doing some RNA-seq analysis, you have a pretty good starting point to plug your data into and see if you can get some useful results out.

So, gene expression primer. I guess I probably missed the part early in the workshop when people said what their backgrounds are.
But I'm guessing that if you're in a bioinformatics workshop, your background is more likely to be wet biology, so you probably know as much about this as I do, or more. But just to review, the whole concept of RNA sequencing is based on the central dogma, where you have gene loci at the DNA level in the genome, which are transcribed to produce single-stranded pre-mRNAs that are polyadenylated and have a 5' cap and other features. Those are further processed to make mature mRNA, which is spliced, removing the introns and leaving just the exons. And then that can be exported to the cytoplasm and translated into protein. It's really this part that we're looking at with RNA-seq. Depending on the protocol you use, you might have primarily polyadenylated, mature mRNA. But you might also, again depending on the protocol, be pulling down some of these unprocessed mRNAs or other kinds of RNAs as well.

So, RNA sequencing. This is a typical, very high-level workflow of what you're usually doing. You're starting with some samples of interest, of course; the RNA has to come from somewhere. And this is a common use case: maybe you're comparing some normal cells to some tumor cells, and that's actually what we're going to be doing in the lab. You would isolate RNA from those samples. That is usually converted to a cDNA library, fragmented, size-selected, and some kind of linker is added. Probably just about everyone now is using what is pictured here, which is an Illumina flow cell, so they're probably using an Illumina protocol for constructing libraries. And from that, you're going to get basically hundreds of millions of paired reads and billions of bases of sequence. We typically recommend paired-end reads, not single-end reads, for RNA-seq. And then the challenge, of course, is to figure out what this huge pile of reads means. And the way that is typically done is by mapping to a genome.
Hopefully you have a reference genome to map to. And if it's annotated, you can align to its transcriptome. But you may also be looking at a new species and doing de novo assemblies or other kinds of things with the RNA reads. And from that, you can predict exon junctions, and you can try to infer the complete transcript isoforms. And there are countless kinds of downstream analysis you can do. We're going to look at just a few of those and talk about some others.

So why sequence RNA versus DNA? One reason is basically for functional studies. The genome, in some cases, might be constant, but different experimental conditions can have pronounced effects on gene expression. So for example, maybe you want to know what's happening between an untreated and a treated state. Or maybe you want to know what happens if you knock out a wild-type gene and replace it with a mutated version of that gene. Of course, there are also many molecular features that can only be observed at the RNA level. If you're just looking at DNA, you're missing things like alternative isoforms, fusion transcripts, and RNA editing. You can also predict the transcript sequence from the genome sequence, but it's difficult. It's just much better to go straight to the RNA. You only have guesses of what the transcript isoforms might be when you're looking at whole genome DNA data.

Another reason, and this is kind of a cool one, is interpreting mutations. Maybe you have done some DNA sequencing and identified some interesting mutations, and you want to know more about what effect they have on the protein sequence. Specifically, if it's a coding mutation, is it actually expressed? Or is there perhaps a regulatory mutation in the DNA, and you want to know if it's causing an actual change in the relative levels of isoforms of the downstream gene? You can also prioritize protein-coding mutations using the RNA-seq data.
So this is getting at the idea of allele-specific expression. If the gene's not expressed at all, then maybe you're not as interested in that mutation. Maybe it's a really cool mutation in a really interesting gene, but if it's not expressed, you're maybe not as interested in it. If it is expressed, but you're only getting expression of the wild-type allele, maybe that suggests some kind of loss-of-function, haploinsufficiency effect. If the mutant allele is being specifically expressed, maybe that's a good candidate for a drug target.

Challenges. That's actually what we're going to be talking about for much of the rest of this presentation. There are tons of challenges with RNA-seq, as with many technologies, starting right from the sample. Is your sample pure? Do you have good quantity and quality? Genes typically consist of small exons separated by large introns, so mapping spliced RNA reads back to the genome is itself a challenge, and we've had to develop splice-aware aligners that are specifically good at this problem. The relative abundance of RNAs varies wildly. You could have as much as five to seven orders of magnitude (a 10^5- to 10^7-fold difference) between the most lowly expressed transcript and the most highly expressed transcript. And that creates problems because, just like whole genome sequencing, RNA sequencing essentially works by random sampling. We can change that a little bit by doing things like captures, but at the end of the day, you're randomly sampling from the pool of RNAs and producing these reads. And what can happen, and we see this a lot, is that a small fraction of very highly expressed genes basically eats up all your reads. So sometimes 50% of all your RNA sequences represent just 10 or 20 of the most highly expressed genes. And let's face it, usually you're not interested in those genes; they're kind of housekeeping genes that people aren't that interested in.
It can feel like you're wasting a lot of your money sequencing them. So that's a challenge. Sequencing more does help, but things tend to continue proportionally. So with the second lane of data, you're wasting half of it again, but it does help. As sequencing costs get lower, this problem is reduced. We're going to talk about this more, but we're still not really at the point where it's not an issue, I would say. So we've been trying things like capture experiments to compress the dynamic range a little bit and bring some of the low ones up relative to some of the more highly expressed genes.

RNAs also come in a wide range of sizes. Transcripts can be anywhere from just a couple hundred bases to thousands or tens of thousands of bases long. When you do a poly-A selection, which is a common step in RNA-seq, you tend to select for the larger RNAs, and you tend to get 3' end bias. Just in general, you tend to get 3' end bias, so you'll basically see more reads piling up at the 3' end. This is also partly because RNA is fragile compared to DNA. It's easily degraded, and you basically have a lot more concerns on the sample end. With DNA, you can get away with a lot. You can have a sample sitting out on the counter for three months, or you can even sequence Neanderthal DNA from 10,000-year-old bones from a cave. But with RNA, unless you're starting from fresh frozen material, you're potentially facing some problems. It can be done, but it's much, much more challenging with RNA. Because of that, assessing quality is important. This is one of the things we tend to look at, which is the Agilent trace with the RIN number calculated.
I don't know exactly how the Agilent algorithm works; it might even be proprietary. But it's similar to the ribosomal 18S/28S ratio that people tend to use for assessing RNA quality, just a little bit more sophisticated. It looks at more peaks and assigns a number. So it goes from RIN 1, which would look even worse than this, to RIN 10, which is very good, where you have clean peaks at certain predefined points in the electropherogram. We usually look for RIN numbers of, I don't know, five, six, or better, but sometimes we will go all the way down to two or three. Below that, the data tends to be almost useless.

Design considerations. I think I provided this as pre-reading; really, it's just for reference. ENCODE did a pretty nice job a couple of years ago of producing some standards, guidelines, and best practices for RNA-seq. That covers things like what metadata you should supply along with your experiment, what kind of replicates you should do, how much depth you need, what control experiments are required, and so on. The main thing I would say is to think about RNA-seq the same way you would think about any other experiment, which is to actually think about it. Don't just do it. There is a tendency with a new, exciting technology for people to get a little bit carried away and think, oh, we'll just do RNA sequencing and that will solve all our problems. And of course it doesn't. It just gives you a lot of data, and you have to think about what you're going to do with that data and why you're doing it, because there are a lot of different ways you could run your experiment.

So replicates are important. One of the challenges we didn't talk about yet with RNA-seq is that it's still pretty expensive. We're still talking about maybe a couple thousand dollars to do a single sample when everything's said and done.
So you usually can't afford to do as many samples and replicates as you would like. And that tends to lead to some kind of silly experiments, where people are doing a tumor versus normal comparison and they're literally just comparing one sample to one other sample and then trying to calculate statistics from that. I would say it's okay to do such experiments if you really think about what your purpose is. If you have a cell line and you just want to characterize the transcriptome of that cell line very well, identify all the different isoforms that are present, and then maybe use that to choose some potential markers for further experiments, then having a small number of replicates can be okay. If you want to do differential expression or classification or clustering analysis, all the same rules apply to RNA-seq as they do to microarrays. You need replicates if you want to have statistical significance.

And of course there are different kinds of replicates. There are technical replicates and there are biological replicates, and the definition of these really can vary. Basically there are lots of levels of replicates, and there's almost a smooth continuum between them. Technical replicates might include things like running the same library twice on two different flow cells, or in two lanes of the same flow cell, or using two different indexes. We tend to see extremely high reproducibility for those kinds of replicates with the Illumina technology, so you don't actually see people bothering with them very much anymore. If you have the money to do another lane of sequencing, I would say sequence another biological replicate, do another cell line, do another patient, rather than making sure that data from two lanes next to each other is comparable, because they're generally quite reproducible. At that point you're just adding more depth, which is fine. You may want more depth. But it's probably not necessary from a replicate standpoint, in my opinion.
Biological replicates, as I said, are still extremely important, and of course you need to consider things like environmental factors, growth conditions, the time at which you're obtaining the samples, and all the other kinds of things that can lead to batch effects and so on. We tend to see pretty high correlation coefficients for both biological and technical replicates, so it is pretty reproducible data.

So what kind of questions can you ask of RNA-seq data? These are some of the things you might want to do with it. The most common is basically replacing microarrays: looking at gene expression and differential expression. You can also do alternative expression analysis, probably in a much more sophisticated way than you could with microarrays. You can discover new transcripts, which is hard to do with arrays but fairly easy to do with RNA-seq. And if you have a new model organism, for example, that can also help you in annotating new transcripts or annotating the genome. You can do things like we talked about, identifying allele-specific expression. I spend all my time looking at cancer samples, so that's definitely one thing I'm interested in. Basically, are we seeing the interesting SNVs and indels in the RNA-seq data, and are they more or less expressed than their wild-type counterparts? You can do mutation discovery. It's a little bit more challenging than doing it from whole genome or exome data, but it is possible. Fusion detection is another interesting thing you can do with RNA-seq data. You can identify RNA editing. You can look for lincRNAs and microRNAs and probably all kinds of things I forgot to list here.

For all those kinds of questions, there are some general themes to the RNA-seq workflow. They have distinct requirements, but there are also some commonalities. They typically start with obtaining raw data, and inevitably that involves some format conversion.
You then have to align and/or assemble your reads, depending on whether you have a reference genome or you're trying to make a de novo assembly. You're going to process those alignments with a tool specific to your goals. So today we're going to use Cufflinks for expression analysis, but maybe you'd use deFuse or TopHat-Fusion or ChimeraScan for fusion detection, or some other tool that would take an alignment and try to tell you something about the RNA-seq data. Then you're going to post-process the data that comes from those tools. They typically produce output in a format just as arcane as what went in, and you usually end up having to reformat it and load it into something like R or MATLAB and try to make some sense of it by summarizing to gene lists, visualizing, and prioritizing candidates for validation.

So I've listed some tool recommendations. We won't go through most of these, but you have them as a reference for later. For alignments, you may use BWA with a reference genome plus a junction database, especially if you have shorter reads, and we'll talk about that. But more likely you're going to use something like TopHat, which is a spliced aligner, or some of the others. For expression or differential expression, we're going to look at the Tuxedo suite, which includes Cufflinks and Cuffdiff, but there are other alternatives, including some very popular ones that are based on raw counts: HTSeq-count would give you your counts, and then things like edgeR or DESeq would give you the differential expression statistics. For fusion detection, actually in the workshop next week, we're going to look at TopHat-Fusion and talk a little bit about ChimeraScan as well. Transcript assembly, I don't know much about; I haven't spent much time with it. But here are some tools listed, and there are others, and we're actually going to do an exercise on how to find these tools.
For mutation calling, I know that SNVMix is one of the few that was designed with calling from RNA-seq data in mind, but there are probably others. I think this came from Sohrab Shah's lab, who I think was here earlier this week. I don't know if he talked about it or not.

And anytime you want to find a tool or ask a question, I suggest you visit SEQanswers or BioStar. Together, they're a pretty comprehensive resource for identifying new tools or asking questions about how to run tools. Many of the authors of those tools are there too. At the Genome Institute, when we release a tool, we actually support it officially through BioStar. So it's a good place to go for help on using the tools. Of course, make sure you search for the answer to your question before asking, because most questions have probably already been answered, though it can sometimes be challenging to find them.

Have you guys actually tried SEQanswers yet? Has someone else done this? Okay, so this is just a quick exercise. If you guys want to go to seqanswers.com and just get a feeling for what that website is like, I'll do it too. You can see I use BioStar a lot, actually. I don't think you need to log in. If you go down on the left, you can go to the wiki. Here you just see current or recent posts. Oh, nice. I guess we're all hammering it at once, perhaps. I've actually never seen that before, and I've done this quite a few times. And this, I feel, is not actually being displayed properly either. If you go to the software hub, which of course doesn't work right now, I'm very surprised. Well, if you go to the software hub, you can browse. There are tag clouds, which will allow you to select RNA-seq tools, for example, and you'll see hundreds of RNA-seq tools, and you can click on them and see brief descriptions and links to their papers and posts about them.
That's a strike against SEQanswers. So maybe we'll go to BioStar.

Another common question that we get is: should I remove duplicates for RNA-seq? There are actually some good posts on SEQanswers and BioStar about this topic. The answer is maybe. It's more complicated than for DNA. It's very common in a DNA workflow to remove duplicates. The concern with duplicates is that they might correspond to biased PCR amplification of particular fragments. If you see a lot of exactly the same pair of reads, you become suspicious that there's some kind of PCR amplification bias, where that fragment was being preferentially amplified and is giving you an artificial idea about the coverage at that particular part of the genome. So in whole genome DNA sequencing, I would say it's almost the norm to remove duplicates. The problem is that for RNA-seq, you sometimes have extremely highly expressed, very short genes. If your transcript is just 100 or 200 bases and there are thousands or tens of thousands of copies of it, we actually expect a fair number of legitimate duplicates to occur from that underlying population of transcripts. And removing them might actually eliminate real dynamic range from your expression estimates.

So one thing you can do is assess library complexity, and there are different ways to do that. Basically, it gives you an idea of how many unique species of read pairs are in your data and how many are non-unique. Then you look for an unusual distribution of complexity. For us, this usually only works by getting to know what a good library looks like and comparing back to it.
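As a rough illustration of the complexity idea just described, here is a minimal sketch that counts how often each distinct read pair occurs. It is not what any particular package does. It assumes each aligned pair has already been reduced to a hypothetical signature of chromosome, read 1 start, read 2 start, and strand, and the toy data is made up.

```python
from collections import Counter

def duplication_histogram(read_pairs):
    """Map 'number of copies' -> 'how many distinct fragments had that many copies'."""
    # Both mates must match for two pairs to count as duplicates of each other,
    # so the whole signature tuple is used as the key.
    copies_per_fragment = Counter(read_pairs)
    return dict(Counter(copies_per_fragment.values()))

def unique_fraction(read_pairs):
    """Fraction of all read pairs that represent a distinct fragment."""
    if not read_pairs:
        return 0.0
    return len(set(read_pairs)) / len(read_pairs)

# Toy signatures: (chromosome, read 1 start, read 2 start, strand).
pairs = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 260, "+"),  # same read 1 start, different mate: not a duplicate
    ("chr2", 500, 640, "-"),
    ("chr2", 900, 1050, "+"),
]

histogram = duplication_histogram(pairs)
fraction = unique_fraction(pairs)
print("copies -> fragments:", histogram)
print("unique fraction:", round(fraction, 3))
```

The histogram is the piece you would compare against a library you already trust; the single unique-fraction number is only a quick summary.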
And if you see something funny, say with a problematic library where for whatever reason you can't generate a new one, maybe it's a precious sample or there's no more RNA to go back to, and you want to proceed with analysis but you feel that there are a lot of problematic duplicates, you might remove duplicates under that circumstance. But for things like plain expression analysis, we usually don't remove duplicates. If you do, remember to assess them at the level of the paired-end reads, not the single-end reads.

How many reads? Yeah, so that's basically what the complexity analysis involves. It determines the unique set of read pairs and creates a distribution of counts for them, and then you can compare that to a good library. Things like Picard tools' MarkDuplicates will mark duplicates, and you can get a sense of the total number of duplicates. There are also some packages out there, I can't think of a name right now, for doing a more sophisticated analysis of duplicates, which would be recommended.

Another question we get a lot is how much library depth is needed for RNA-seq. This also depends on a number of factors. Really, it depends on what the goals of your experiment are. You see this sometimes: some people, for whatever reason, basically just want to recreate the gene expression estimates that an Affy U133 array would give them, but they want to use RNA-seq. Maybe they believe it's more accurate. If that's really all you want from the data, you can get away with quite a lot less depth, in terms of total numbers of reads, than you would if you want to determine very subtle changes in the differential expression of a very lowly expressed, rare isoform or something. So there's really no right answer to this question.
And my feeling is we're nowhere near the point where we're sequencing so much that there isn't more information to be gained by sequencing further. There's still so much complexity in the transcriptome. And the experiments that people can afford to do right now are, I don't know, probably just scratching the surface. Or they're getting pretty good, but they're not 100% there. It depends.

What would I consider the minimum? For me, I wouldn't do less than a HiSeq lane of data. But a lot of people are making a different choice for cost reasons. For our purposes, we want to do a lot with it. We want to detect rare fusions. We want to have a sophisticated alternative expression analysis pipeline. And we especially want to know the variant allele frequency in the RNA-seq data for a variant that may be from a rare subclone in a tumor population. That's a problem we're dealing with a lot. We're looking at tumor populations, and they're heterogeneous. Sometimes there's a very interesting subclone down at something like 5%. And that subclone maybe has a heterozygous variant, and maybe it's allele-specifically expressed in the mutant or the wild type. So there are a lot of reasons why you might miss that if you don't sequence sufficiently deep. I think for us, one to two lanes is where the cost-benefit analysis falls. But there are people who maybe don't have such stringent requirements for the data, and they're happy with multiplexing maybe two to three, two to four samples per lane. Going beyond that, to me, is such a waste. Just do some arrays. They're still so much cheaper and easier, if all you're going to get out of it is crude gene-locus-level expression estimates for the more highly to intermediately expressed genes. I don't know, maybe that's not true. Maybe it's still worth it, but it's a tie. Either way, go ahead.
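To put rough numbers on the subclone example, here is a small back-of-the-envelope sketch. It assumes a pure tumor sample, equal expression of both alleles, no sequencing error, and simple binomial sampling of reads at the variant position. The three-read threshold and the coverage values are hypothetical choices for illustration, not recommendations.

```python
from math import comb

def expected_vaf(subclone_fraction, variant_copies=1, total_copies=2):
    """Expected variant allele frequency, assuming both alleles are expressed equally."""
    return subclone_fraction * variant_copies / total_copies

def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): the chance of seeing k or more variant reads."""
    below = sum(comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k))
    return 1.0 - below

# A heterozygous variant in a 5% subclone is expected on 2.5% of reads.
vaf = expected_vaf(0.05)

MIN_SUPPORTING_READS = 3  # hypothetical evidence threshold
for depth in (50, 200, 1000):
    p = prob_at_least(MIN_SUPPORTING_READS, depth, vaf)
    print(f"coverage {depth:>4}x: P(>= {MIN_SUPPORTING_READS} variant reads) = {p:.3f}")
```

Coverage here means reads over that one position in that one transcript, so for a lowly expressed gene it takes far more total reads to reach a given value than for a highly expressed one.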
But the TCGA data, what's the depth for them? It's been changing over time, right? I think some of the earlier projects had a single lane of GAII data, maybe. I think now they're also doing one lane of HiSeq, but I'm not positive. I don't analyze much TCGA data. To address the problem here, maybe be better. In terms of my specific example, with looking at variants, yeah, well, we do both. We tend to take a pretty comprehensive approach, but not everyone has that luxury. So really, yeah, if what you're really interested in is looking at clonal heterogeneity and subclone architecture, then for sure it's more cost-effective to do an exome or some other kind of capture reagent. We want to learn, I guess, what's happening in the transcriptome at the same time. Yeah. You should also look at the libraries, right? I mean, in the full transcriptome, you probably end up with many. Yeah. So a compromise between what you're suggesting and the problem is that you can do a cDNA capture RNA-seq library, which we've actually been really satisfied with. It does a good job of bringing up the lowly expressed transcripts relative to the highly expressed ones, but we still see pretty good correlation in terms of the overall gene expression levels. You can do that? I think so. I think so. I'm not really up to speed on all the many various RNA-seq library preps there are. I know obviously poly-A selection is super common, but. Yeah, but so much more chance of bias in that case. Yeah. Yeah. I'm not sure.

So another common question is what mapping strategy to use. This is a little bit out of date now.
When people were faced with the issue of 36-mers, or reads of less than 50 base pairs, we sometimes recommended using an aligner like BWA against the reference genome plus a junction database. That junction database would usually be tailored to your read length, so that you have the best chance of getting those short reads to align across a junction sequence. Or in some cases we even used a standard junction database for all read lengths and then a slow aligner like BLAST, which could do substring alignments, for the junctions only, but it was mind-bogglingly slow. Nowadays, I think most people are getting more like 100-mers, two by 100-mers, and I don't see any reason not to use a spliced aligner like Bowtie/TopHat. So that's what we pretty much always do, and that's what we're going to go through in the tutorial.

So have you guys done an IGV session in this course? Yeah. I don't know if you've seen RNA-seq data in IGV yet, but this is just a comparison of what it looks like next to some whole genome data. This is a typical view that we're looking at. You've got three tracks of data: the normal whole genome data, the tumor whole genome data, and usually only the tumor RNA-seq data, for lack of a good matched normal sample to use, although sometimes we have a normal RNA-seq as well. And we're often looking for things like this, where you've got an interesting acceptor site mutation observed somatically in the tumor and not in the normal. And then you can actually correlate that with an exon skipping event. Here are some reads that are skipping this exon, which was predicted based on this acceptor site mutation. So you're seeing how you can use RNA-seq to essentially functionally validate something you're seeing at the DNA level. And you're also getting an idea of what the spliced alignments look like for RNA-seq data compared to whole genome data.
Instead of getting basically total coverage across the genome, you're getting reads piling up, as you'd expect, mostly on the exons, but not entirely.

Another common question you get is: how reliable are the expression predictions from RNA-seq? We did some experiments a couple of years ago to address this. We wanted to know whether the novel, previously uncharacterized exon junctions that we predict are actually real and verifiable. We wanted to know if the differential and alternative expression changes that we were predicting from the RNA-seq data would be confirmed by, say, qPCR. So we chose 384 events to validate with either qPCR or RT-PCR and Sanger sequencing. These are detailed in a publication which went along with the ALEXA-seq method, but I think the story would generally apply to any RNA-seq experiment profiled with one of the other software packages. The idea is that you have some transcript structure, and maybe there are three exons. This is a very simple example which produces two isoforms, one where the middle exon is skipped. You can design primers to look for the expected products given these two outcomes and then confirm those with a gel. We did that on 192 different events, and we saw about an 85% validation rate. So the isoforms that we predicted from what we saw in the RNA-seq data were observed in the validation experiment, which isn't bad. And at a more quantitative level, we looked at differential expression predicted from the RNA-seq data versus that predicted from qPCR experiments for the same exons and junctions. We again saw very good correlation in terms of the actual differential expression values between RNA-seq and qPCR. And depending on how you define validated, something like 88% of them were validated as significantly differentially expressed in the same direction using both technologies. So they're reasonably reproducible.
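For a sense of how that kind of comparison is scored, here is a minimal sketch that takes paired log2 fold changes from two platforms and reports the Pearson correlation and the fraction of events that change in the same direction. The numbers are invented toy values, not the data from the study, and the sketch ignores the significance testing that the real validation also required.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def direction_concordance(xs, ys):
    """Fraction of events whose fold change has the same sign on both platforms."""
    same = sum(1 for x, y in zip(xs, ys) if x * y > 0)
    return same / len(xs)

# Toy log2 fold changes (tumor vs normal) for the same eight events.
rnaseq_log2fc = [2.1, -1.4, 0.8, 3.0, -2.2, 1.1, -0.6, 1.9]
qpcr_log2fc = [1.8, -1.1, 1.0, 2.5, -2.6, -0.3, -0.9, 2.2]

r = pearson(rnaseq_log2fc, qpcr_log2fc)
concordance = direction_concordance(rnaseq_log2fc, qpcr_log2fc)
print(f"Pearson r = {r:.2f}, same direction in {concordance:.0%} of events")
```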
People have done comparisons ad nauseam in the literature between RNA-seq and other expression platforms, and they generally show pretty good correlation.

So if we want to try to break BioStar, or maybe you can just check it out, at somewhat staggered times. Usually we just introduce people to the idea. Like SEQanswers, it's a forum where you can go and search for answers to your questions, and you can pose questions. I think with both of them you have to create an account if you want to actually ask a question, but you can of course search it and use it without one. If you do become a BioStar user, I encourage you to vote up useful questions and useful answers from any of the people contributing there. That's what keeps them coming back: the obsessive need to get votes. Okay, let's try and break it. Come on, BioStar. So this is what it looks like. It's like any forum; you get all the recent posts. You can search for something like, let's try to understand what FPKM versus RPKM is, and you'll often get a post that's asking a similar question, hopefully with an answer from someone, and you can vote on answers that you like. Their posts are broken into a number of topics. There are how-tos, which are pretty good, and these are being developed extensively right now. In the next six months there's going to be a whole smattering of other how-tos, kind of along the lines of this course, actually, but more bite-sized. So, how do I get data from GEO? How do I run Cufflinks? And the idea is that they're not going to be just cookbook. They will be cookbook as well as conceptual, explaining why you would choose one parameter over another and so forth. Or also, yeah, compilations of review papers. Did you guys learn about Antibuy in this workshop? Yeah, so the how-to section is broken into tutorials, tools, and tips. And like I said, there's a lot more. You can also look for jobs there.
We have several jobs that the Genome Institute actually posted here, like this one, so check that out. You guys are gonna be experts after this workshop, so you should come and work for us. What else to show you about this? A request feature? It's a good question. I don't think there is currently a requests mechanism. Somehow I guess maybe it gets a little bit depressing for the people who create the content, because they know that there's like an infinite number of things out there that they could do, but I think for the how-tos, it's actually a good idea. I could suggest it to the admin. Yeah. So do you have 5,300 upvotes? I guess so, yeah. Like, I was referring to myself with the obsessive-compulsive behavior. So Istvan is actually the creator and admin for the site. And what's scary is that there are actually two people who have way more votes than he does. So yeah, they're really into it. But he actually built the site. So this started on, are you guys familiar with Stack Exchange? So there are many different Stack Exchanges. Like, I can't think of one off the top of my head. I think there are sub-Stack Exchanges for R and Perl. It's basically a forum like this, and it's a common framework, so anyone could set up a forum and have all this code ready-made for them to organize a forum around their topic of interest. And Biostar used to be on Stack Exchange, but Istvan got tired of the inflexibility of it. So he recoded the whole site to suit his own preferences. So it's really improved a lot as well since then. So, introduction to this tutorial. We're actually doing good for time. So this is the workflow that we're gonna go through. I'm not gonna lie, it's a lot. I hope that you will get a general feeling for the steps and the overall flow, but also recognize that you could spend a whole day or a week really diving into the details of how Bowtie and TopHat work. And those have papers.
And Cufflinks has its own paper, or two papers, and is very sophisticated, and you could spend a lot of time understanding how it works. And so on. But what we're gonna do is we're gonna start with some raw RNA-seq reads, a test data set. We're gonna align those reads using Bowtie and TopHat. We're gonna use Cufflinks to assemble transcripts and then identify genes and expression levels and merge them into a consistent set of transcripts. And that is gonna be fed into Cuffdiff to do differential expression analysis for this colon cancer tumor versus normal data that we have. And then we're gonna use CummeRbund, which is an R Bioconductor package that they've produced, which basically slurps in the output of Cuffdiff and very quickly gives you a lot of useful summarizations and visualizations to get you to a fairly advanced analysis without all of the agony of having to figure out how to do that in R yourself. I do also have a supplementary workshop which will walk you through that agony if you wanna do it the old-school way. We don't have time to cover that today, but you can take that home with you. And I recommend it, because if you do start actually doing this analysis, you're gonna run CummeRbund and at some point it's not gonna do exactly what you want. It's flexible, but at some point you're gonna say, I just wanna make a heat map my own way. And then you're gonna be back to just running heatmap.2 or something like that in R. And that's what the other supplementary tutorial covers, those kinds of situations. Important inputs for this whole tutorial: we of course need some raw sequence data, which are gonna come in the form of FASTQ files. We need a reference genome and we need some gene annotations. And these have been provided for you, but there's again a supplementary document describing how you might create such files for your own analysis, or obtain them anyway.
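To give a feel for how those steps chain together, here is a rough sketch in Python that only builds and prints the command lines for the workflow just described (TopHat, Cufflinks, Cuffmerge, Cuffdiff) from the three inputs. It does not run anything. All file and directory names here are hypothetical placeholders, not the ones used in the lab, and the options shown are just a small common subset, so check each tool's manual before relying on them.

```python
import shlex

def build_pipeline(bowtie_index, annotation_gtf, genome_fasta, samples, threads=4):
    """Return the ordered command lines for a TopHat/Cufflinks style workflow.

    samples maps a condition label to a (read1.fastq, read2.fastq) pair.
    """
    commands = []
    bams = []
    for label, (reads_1, reads_2) in samples.items():
        align_dir = f"{label}_tophat"
        # Spliced alignment of paired-end reads, guided by known annotations.
        commands.append(["tophat", "-p", str(threads), "-G", annotation_gtf,
                         "-o", align_dir, bowtie_index, reads_1, reads_2])
        bam = f"{align_dir}/accepted_hits.bam"
        bams.append(bam)
        # Transcript assembly and expression estimation per sample.
        commands.append(["cufflinks", "-p", str(threads),
                         "-o", f"{label}_cufflinks", bam])
    # Merge per-sample assemblies into one consistent transcript set.
    # assemblies.txt (hypothetical name) lists each sample's transcripts.gtf.
    commands.append(["cuffmerge", "-p", str(threads), "-g", annotation_gtf,
                     "-s", genome_fasta, "-o", "merged", "assemblies.txt"])
    # Differential expression between the conditions.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "diff_out",
                     "-L", ",".join(samples), "merged/merged.gtf", *bams])
    return commands

# Hypothetical inputs: FASTQ reads, a Bowtie index, a genome FASTA, a GTF.
samples = {
    "normal": ("normal_1.fastq", "normal_2.fastq"),
    "tumor": ("tumor_1.fastq", "tumor_2.fastq"),
}
pipeline = build_pipeline("genome_index", "genes.gtf", "genome.fa", samples)
for command in pipeline:
    print(shlex.join(command))
```

The output directory from Cuffdiff (called `diff_out` in this sketch) is what you would then point CummeRbund at in R.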