So we're just going to do a quick lecture on high-level differential expression concepts, and then we'll pivot right back into continuing the practical exercises. As I'm sure most of you already know, the idea of differential expression is to take the gene expression or abundance estimates that we're generating and tie them back in some way to genotype or phenotype. We're usually asking questions like: which genes or transcripts are being expressed at higher or lower levels in different groups of samples? And are these differences significant, where we're trying to account for things like variance and noise in the data? There's of course a practically infinite number of differential expression questions you could ask. In this course we're asking: are the UHR cells different from the HBR brain cells? Are the HCC1395 tumor cell line samples different from the normals? And in one, or maybe both, of the team or integrated assignments, we're asking: are the wild-type cells different from knockout cells or shRNA knockdown cells? So these are the kinds of simple comparisons we've set up for you. You may have similar or more complicated experiments or questions in your own research, but the general idea of differential expression still applies.
We're using, among other approaches, a package called Ballgown. So just to quickly introduce the idea of that. I think you guys actually already ran this last night; that was the last thing we did. It uses a parametric F-test, and it's a somewhat more complicated statistic than a lot of approaches, in that it's actually comparing two different models. What it does is fit two models to each feature, where a feature is a gene or transcript, using expression as the outcome: one including the covariate of interest, so tumor versus normal or HBR versus UHR, and one without the covariate of interest.
And then it calculates an F-statistic and a p-value using the fits of these two models, and it calls a feature significant if the model that includes the covariate of interest fits significantly better than the model without the covariate. This is just kind of a complicated way of doing effectively the equivalent of a Wilcoxon test or a t-test or something. It's really just asking: are the values for your gene significantly different between condition A and condition B? And then, like many approaches, it adjusts for multiple testing by reporting q-values, by default at a false discovery rate of 5%. Ballgown also comes with a bunch of visualizations. I don't think we've looked at these yet, so we'll probably do that today. That's one of the reasons we've included this package: it lets you quite quickly get to some nice visualizations, especially these transcript-level visualizations that might be a bit of work to produce on your own. It can show you not only the expression levels of the genes overall but also the individual transcript levels, and you can get a sense that, okay, this particular isoform with this structure seems to be more highly expressed and explains most of the expression from the gene locus for this particular gene.
But there are lots of other approaches, and we're going to cover at least one or two of them. Conceptually, the biggest alternative approach is to use raw counts instead of these somewhat complicated TPM values that are derived from StringTie. So we're going to show an example of how to do differential expression based on those raw counts using edgeR, instead of using the StringTie TPM-type approach. So a common question is: which one should I use, the FPKM/TPM-style differential expression or the raw count approach? This is really a long-running debate.
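
As a brief aside on the nested-model F-test described a moment ago, here is a minimal Python sketch that computes the F-statistic for a single feature by hand. This is not Ballgown's code: the function name `nested_f_statistic` and the toy expression values are made up for illustration, and it assumes the simplest case, where the model without the covariate is just one overall mean and the model with the covariate has one mean per group.

```python
from statistics import mean


def rss(values, fitted):
    """Residual sum of squares between observed and fitted values."""
    return sum((v - f) ** 2 for v, f in zip(values, fitted))


def nested_f_statistic(expr, groups):
    """F-statistic comparing a model with the group covariate to one without.

    Assumes at least one group has some within-group variation,
    otherwise the denominator would be zero.
    """
    n = len(expr)

    # Model without the covariate: every sample gets the overall mean.
    overall = mean(expr)
    rss_without = rss(expr, [overall] * n)

    # Model with the covariate: every sample gets its own group's mean.
    levels = sorted(set(groups))
    group_mean = {
        g: mean(v for v, gg in zip(expr, groups) if gg == g) for g in levels
    }
    rss_with = rss(expr, [group_mean[g] for g in groups])

    p_without, p_with = 1, len(levels)  # number of fitted parameters
    numerator = (rss_without - rss_with) / (p_with - p_without)
    denominator = rss_with / (n - p_with)
    return numerator / denominator


# Hypothetical expression values for one gene, three replicates per group.
expr = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
groups = ["UHR", "UHR", "UHR", "HBR", "HBR", "HBR"]
f_stat = nested_f_statistic(expr, groups)
print(f_stat)  # 24.0
```

A p-value would then come from comparing this value to an F distribution with (1, 4) degrees of freedom for this toy example. That step is left out to keep the sketch free of external dependencies.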
But I think the general consensus is this. If you want to leverage the benefits of something like the Tuxedo suite (that's what this set of programs we're using is called; the StringTie/Ballgown approach is part of the Tuxedo suite), if you want that isoform-level discovery and the visualizations, that's one argument for using the more complicated normalized TPMs that you get from StringTie. It's also good for visualization in general. When you're making heat maps, it definitely works better to use a normalized value like FPKM or TPM than to try to directly visualize the raw counts. The raw counts are, I would say, less useful for visualization because they haven't been normalized in any way. For things like calculating fold change, it can also be preferable to use the TPM-type values.
People like to use the counts primarily for the statistical methods that are available for differential expression. There are some very robust packages that have been shown to perform very well at identifying the truly differentially expressed genes. And there are also some packages that allow for more complex, more sophisticated experimental designs. So maybe you did a time series, or you have five different conditions that you want to compare in some complicated way, like in a multivariate model or something like that. I feel like there are a lot more sophisticated statistical approaches you can apply to the raw counts with some of the packages that are available. Whereas with the StringTie/Ballgown-type approach, you get what is probably a very well thought out statistical approach, but you kind of get what they give you; there's not a lot of configuration in terms of how you can set it up. And in general, multiple approaches are advisable, so we basically always do both count-based and FPKM/TPM-style analyses.
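
To make the "raw counts haven't been normalized" point concrete, here is a small Python sketch of the textbook TPM formula: divide each count by the feature length, then rescale so the sample sums to one million. This is only an illustration. StringTie derives its TPM values from its own coverage estimates rather than from this simple calculation, and the function name `counts_to_tpm` and the toy numbers are assumptions for the example.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM, given feature lengths in kilobases."""
    # Step 1: length-normalize each count (reads per kilobase).
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    # Step 2: rescale so the values for this sample sum to one million.
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: the three genes differ 3-fold in raw counts,
# but only because they differ 3-fold in length.
counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 3.0]
tpm = counts_to_tpm(counts, lengths_kb)
print(tpm)
```

In this toy case the three genes end up with the same TPM even though their raw counts differ, which is the kind of length effect that makes raw counts awkward to put directly on a heat map.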
This is a plot we made a little while ago, running Cuffdiff, which was the predecessor to Ballgown, and edgeR, which we're going to run, and another package called DESeq. And you can see, this is looking at the Venn diagram of the overlap between the genes that were found to be significantly differentially expressed using these three different approaches. There's a pretty substantial overlap, but there are also a lot that are unique to just two out of three of the approaches, or even one out of three. So depending on what your downstream goals are, if you run two or three different approaches, you have some options, right? You could say, I'm going to focus on the arguably most robust genes, the ones identified by two or all three approaches; those are probably really differentially expressed. Or maybe you want to be more comprehensive, maybe you really don't want to miss anything, and you take the union of all of them, so any gene that was identified by any one approach could be considered important. Or something in between. So that's why we're trying to introduce you to at least a couple of different options.
So, lessons learned from the microarray days. You guys probably don't even remember what microarrays are at this point, but before we had RNA-seq we had microarrays, and when I was doing my PhD this is what a lot of it was based on. There was a long period where this was, I would say, a highly relevant slide, because the lessons learned from microarrays were really being ignored in RNA-seq experiments. It's much, much better now, so I feel like we're kind of past this phase. But you should think about doing power analysis for RNA-seq experiments. Just because you have a great technology like RNA-seq, that does not remove the need for biological replicates and for good study design.
So there were hundreds of early RNA-seq papers where you were seeing comparisons of literally n equals one to n equals one. It's kind of like what you have now with single-cell RNA-seq, actually, or what you've had for the last few years, where because of the cost and the complexity of this new approach, and wanting to understand how it works, there's a lot of excitement. How many papers have we seen in the last two years with one sample with 500 cells compared to another sample with 300 cells, and pretty sweeping biological conclusions being made? That's probably not a good idea, just like it wasn't a good idea to do an n-of-one versus n-of-one RNA-seq experiment. We're doing an n of three versus three here for practical reasons, right, because it would be a hassle for you guys to type out or copy-paste 50 commands. In many cases you should probably be using larger sample sets with more replicates, although there may be cases where three is sufficient.
Multiple testing correction is also extremely important, more important than ever really. When we had microarrays, for a long time there were something like 20,000 or 30,000 genes on the array. Then we started getting exon-level or transcript-level arrays that had maybe hundreds of thousands of features being measured by the microarray. But with RNA-seq you have a near-infinite number of features, depending on how you define them. You have basically all of the genes, all of their possible isoforms, their introns, which can be expressed at some level, and the actual exon-exon junctions that are being expressed at some level, either because of the noisiness of the splicing machinery or because there are legitimately novel alternative splicing events happening all the time.
You can look at the data at so many different levels. You can do exon-by-exon comparisons, and you can now look for other kinds of RNA species that you wouldn't have been able to before, right, like microRNAs or lincRNAs or other kinds of RNAs. As a result we have millions or tens of millions or hundreds of millions of features. So once you start doing all these comparisons across all those features, you're going to get significant results just by random chance, and those are mostly spurious. So you need to think about what question you're asking, about maybe doing feature reduction, and then certainly doing multiple testing correction.
So, the downstream interpretation of expression analysis. This is really a topic for an entire course. The expression estimates and the differential expression lists that we're getting from StringTie and Ballgown, or from some of the other approaches like HTSeq and edgeR, can be fed into many downstream analyses, right? You can do clustering and visualize that with heat maps. Some of that is provided by Ballgown, and we also provide some kind of old-school R code to do those kinds of visualizations. You might do classification analysis, where you're trying to develop a model or a biomarker that predicts outcome or prognosis or something. You might want to do pathway analysis, and, dot dot dot, there are many other kinds of analyses you could do. We're going to cover some of these, but certainly not all of them.
So just to reorient you to where we are at this point: we've got our raw data, we've done alignments with HISAT, we've done transcript compilation and estimation with StringTie, and then we've done differential expression with Ballgown. We're going to pick up there. We're going to finish that module, which had just a little bit left, and then start looking at Ballgown visualizations, and then doing other kinds of differential expression statistics and visualizations.
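
One last illustration, going back to the multiple testing point from earlier. The Python sketch below simulates p-values for features where nothing is truly different, counts how many pass a raw 0.05 cutoff, and then applies a Benjamini-Hochberg adjustment. The feature count, the seed, and the function name `bh_adjust` are assumptions for this example, and Benjamini-Hochberg is used here as one common FDR procedure, not as a claim about what any particular package does internally.

```python
import random


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(0)
n_features = 100_000  # hypothetical number of features tested
alpha = 0.05

# Under the null hypothesis, p-values are uniform between 0 and 1.
null_pvals = [random.random() for _ in range(n_features)]

raw_hits = sum(p < alpha for p in null_pvals)
adjusted_hits = sum(q < alpha for q in bh_adjust(null_pvals))

print("features passing raw p < 0.05:", raw_hits)
print("features passing adjusted < 0.05:", adjusted_hits)
```

With no real signal at all, you would expect roughly 5% of the features, so on the order of 5,000 here, to pass the raw cutoff by chance alone. The adjusted count can never be higher than the raw count, and with pure noise it is usually at or near zero.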