Hi everybody, welcome all to this Virtuosy Computational Biology Seminar Series. Today we have the pleasure to welcome Mark Robinson from the Statistical Bioinformatics group at the Institute of Molecular Life Sciences at the University of Zurich. Mark studied applied mathematics and statistics in Canada, at the University of Guelph and the University of British Columbia. Then, from 2001 to 2005, Mark worked as a research assistant and research scientist at the University of Toronto in Canada on microarray data analysis. Mark completed his PhD in bioinformatics and biostatistics in 2008 at the University of Melbourne in Australia, working on methods for the analysis of GC-MS-based metabolomics data. During that time he was also involved in various other projects, such as comparison of Affymetrix platforms, differential expression for count data and normalization of RNA-seq data. From 2008 to 2011, he worked as a postdoctoral research officer in cancer epigenetics at the Garvan Institute and at the Walter and Eliza Hall Institute. He then moved to Europe and set up his own group as an assistant professor at the University of Zurich, and he is also a group leader at the SIB, the Swiss Institute of Bioinformatics. His group develops statistical methods and software tools for interpreting high-throughput sequencing and other genomic technologies in areas such as genome sequencing, gene expression and regulation, as well as analysis of epigenomes. So today Mark will share with us his work on modern RNA-seq differential expression analysis. Mark, thank you again for accepting this invitation, and the floor is yours. Thank you, and thanks for having me, and thanks Diana for the invitation to come and also for the introduction. So I would like to talk about modern RNA-seq differential analyses, and I'll explain what I mean by modern, but there have been a few recent innovations that are worth discussing. They introduce some new and interesting challenges for us.
Just before I begin, I want to highlight the people that were involved in this work, and of course in Lausanne people are very familiar with the first author of this, Charlotte Soneson. She was here for a few years and it's great to have her in Zurich. Also, just to say that everything I'm going to talk about today is more or less in this paper, and to point out that this is a new model for my lab: open post-publication peer review. So the paper that I'm going to talk about, on modern differential expression analyses, is published in F1000Research. This is an interesting model that we tried out for this project, and it's been quite a success, quite a positive experience. And also to acknowledge Mike Love, who in theory is my competitor, he's one of the main authors of the DESeq2 package, but as we were going through this project we found that he had been working on something similar a few months previously, and so we got together and tried to bring a unified voice to some of the opinions that we have, which I'll discuss throughout the talk. Okay, so here's the outline of the talk. I want to give a brief history of what we do with RNA-seq data, and I'm really going to focus on differential expression analyses. Of course there are lots of other things you can do with RNA-seq data, but I won't speak about those today. And then, of course, modern is in the title, so I also want to tell you what is new with RNA-seq analyses; that's the main point here. What I hope people will take away from this is that they really think about what they actually want to ask of their RNA-seq data, and it's not so easy when you really think about it, it's a bit complicated. And then I also want to highlight some subtleties that come up in the analysis, give you some of my opinions, and then talk about a couple of open questions.
And so just to motivate this, and I don't want you to think that my research program is motivated by social media, but this was an interesting blog post that came out late last year, and it caught our attention because it has kind of a shock-and-awe title, but it's also something that we've been thinking about and studying in our lab. What I hope to show you today is that we're not really doing anything wrong, there's not a major crisis here, but there are solutions that are worth discussing and teasing out, and of course the story's not over: even though RNA-seq has been around for quite a long time, there are still some open questions. And just to jump to the spoiler here, this controversy is really about how people count their reads and quantify transcriptional output, and we'll get into this in more detail. It's about union counting versus transcript counting, and that will be clear in a few minutes if you don't already know what that means. Okay, so here's my brief history. I think people are probably familiar with RNA-seq, but I'll step through it for completeness. You start out with some kind of population of cells, and nowadays you can even do this at the single-cell level, and then you need to make a few choices. The two common choices are that you select for poly-A transcripts or you deplete the ribosomal RNA. So you make one of those two choices depending on what you want to see, and then go through all the preparation steps: fragment into small pieces, reverse transcribe, add some adapters and sequence on a very large scale. And this is where we kind of take over, although in theory we're also involved at the design stage. I've condensed RNA-seq differential analysis into three main points. There's a mapping process, and then there's this counting. I've got counting in quotes because there are different ways to do that and there are some details there.
And then my research is on the statistical side, computing differential expression statistics, and that gets a little bit complicated because we're often working with very small samples, and so we want to do the best we can with very small numbers of experimental units. Typical is three replicates of one condition versus three replicates of another condition, and that's a hard statistical inference problem, but we do what we can to improve on that. And also just to say briefly that of course RNA-seq is used for lots of other things: you can discover new transcripts or new isoforms of existing genes, you can partition the reads that you see according to allele and look at allele-specific expression, and you can even look at RNA editing and various other things. So let's jump right to the heart of the bioinformatics problems, and this is where people start to, well, disagree. There are various things here, and I'll get into all of these steps in a little more detail. If you look at the mapping, you have roughly two choices. You can do full alignments, so take your reads and fully align them. Or there's this relatively new business of pseudo-alignment, and that is simply 10 to 20 times faster, so that's quite an innovation at this stage of the game. There are some caveats to both of these, and we'll discuss a few of those. One of the main issues is counting. There are really two distinct schools of thought here, and some bioinformaticians get quite energetic about expressing their views on this. I must say that I'm kind of an ex-union counter and I've now moved over to being a transcript counter, but I should say that most of that is because it's faster, not because it's necessarily mind-blowingly better, and I'll get into that a little bit later.
And then the statistics, this is my main research. Because the data are counts, it's a digital technology, you're counting things, models like the negative binomial or Poisson are prominent here. I'm not going to go into all the details of those, but there are various related statistical issues that come up. Okay, so let's talk about the first thing, and that's this mapping business. On the left-hand side is just a schematic of the cDNA that you sequence and the small reads that you get. People have been doing sequence alignments for decades, so there's not anything remarkably new here, other than that we want to do this on a very large scale, and so there are some advanced data structures that people use. I must say that this is also not my area of research; I'm really a beneficiary of very good methods that are available to us in the public domain. The one main trick there is that you need an alignment that is gap-aware, such that when you have a read that goes over an exon-exon junction, it properly maps that. There are various ways of doing this: you can map to the transcriptome, or map to the genome and add some extra pieces to the genome and do it that way. Like I said, there are lots of nice tools available to you. Relatively new is this approach here, where everything is shrunk into k-mers. You take your original transcript catalog, assuming you have a transcript catalog, and organisms like human, mouse and Drosophila have very good catalogs. You take those reference transcripts and condense them into k-mers, and then also take the reads and condense those into k-mers, and then do all the matching on the k-mer scale. The reason you want to do this is that it's a lot faster. Just to give you a perspective on the speed: on the left-hand side here, for some of the samples we have, it's probably an hour's worth of compute time across a
multi-core server, whereas on the right here, on just your average laptop, it would be about 10 minutes. So this is a pretty big innovation for 2014. The main players there are the Sailfish tool, and the same author also has a tool called Salmon, and then Lior Pachter's lab has kallisto. So there are a few methods with this kind of ideology that are getting used. The main thing is that they're fast, and we'll talk a little bit later about their accuracy and so on. Okay, so here's where it gets controversial, and people get really excited about their particular method. Like I said, there are two schools of thought here. The first school of thought is the union counters, which is kind of the no-brainer method: just sum up the reads that you see as they land in different places of the gene. So it's taking the union of all the exons of the gene and counting the reads that fall in those places. And then there are the transcript counters, who try to come up with a model that allows you to portion off the reads. So in this example on the left here, you've got some reads and you've got two different transcripts of a given gene. For these reads here, you don't actually know whether they come from the red transcript or from the blue transcript. By using an EM algorithm and a model for this, and you can model various features of the data, various biases and so on, you can portion off your reads and estimate what the transcript abundances are. One of the main claims of the transcript counting approach versus the union counting approach is that union counting is going to be bad in certain situations, and here are a couple of those situations. For example, what you can see here is that there are 10 total reads here and 10 total reads here. So if you're a union counter you would say, okay, there is no difference: it's just a log of 10 over 10, the log fold change is zero. Now if you were to do proper accounting of the
transcripts, where you know that there's kind of a sharing of blue transcripts and red transcripts here, but then over here you're shifted all the way towards the red transcripts. One thing that's of course well known with RNA-seq is that the longer your transcript, the more fragments you're going to get from that transcript on average, assuming the same expression level. So if you want to count the total output of this gene, you should do a proper accounting of this. In this case, there is actually more expression coming out of condition B than condition A after you account for the fact that the transcripts are different sizes. So the total output of the locus in condition B is higher than in condition A, and that would be missed if you just did the simple counting. That's a fair point. And the reverse also happens. In this case we have 10 reads here and 5 reads here, and so that says there's a log fold change, there's a 2-fold change, if you just do naive counting. But in fact, if you account for the length of these transcripts, blue is a lot longer, it's exactly twice the length of the red transcript, you can see that in fact there is no change in expression. So if you do this proper counting, there's no change, whereas if you do simple counting, there is. This is the whole argument. I'm not sure of the reason why people stick passionately to union counting, but I guess it's just been easier until recently, when these new tools came along. So that's the overview of counting, just so you know that these issues exist. Now I'm going to jump to the differential expression problem, and here's where it also gets a bit confusing. I just want to lay that out in simple terms and basically encourage you to define what the problem is. I'm going to throw around a lot of terms for the rest of the talk, things like differential transcript expression versus differential
gene expression, and I'm going to use these symbols, DTE and DGE, and these are going to come up again and again in the talk, so spend a few seconds and try to memorize these with me. The idea for differential transcript expression is that you want to ask the question: does the blue transcript change from condition A to condition B? And then repeat the same task for the other transcript, does it change from A to B, and then do that for all the transcripts that are expressed in your data set. That's one question you might ask. But you could also ask a different question of the same data: what is the total output of this locus? That's what I'm going to call differential gene expression. Essentially you sum these up, and the trick here is to sum them up in an appropriate way, so that you get the total output of this gene in condition A and condition B. It looks like, in this case, and I've made a guess here, there's definitely more coming out of condition A than condition B, assuming these are on a comparable scale. Just to turn the screws a little bit, what if you think about differential transcript usage? Now it's a slightly different thing. What I'm thinking about in differential transcript usage is proportionally how much comes from the blue versus the yellow transcript. I'm not sure you can see that, but maybe two thirds of it is blue in this case, whereas in this case maybe only one third of it is blue. And you can also take this and translate it to differential gene expression. So what I'm really trying to point out here is: what is the question you actually want to ask of your RNA-seq data? Because different questions will lead to different statistical tools and statistical methods. Maybe that's completely obvious, but it wasn't even to me when I first saw it expressed in this way. So the point is that there are a lot of different ways to ask the question, and once you know what the question is, there's even a lot of different
ways to answer that question with all the statistical methods. Just to take this one step further, although it's quite related: for people that are interested in differential transcript usage, and what I mean by that is you're really interested in whether the proportions change, one fairly good way to do that is to project everything onto exons and start looking at exon differences. Just to highlight that Charlotte also just finished a study comparing different methods for differential transcript usage as a separate problem, and it turns out that the exon counting business seems to work quite well in that situation. So that's just to say that there are all these different ways of doing the analysis. There's one more coming, but just to point out: differential transcript expression is different from differential gene expression, which is different from differential transcript usage. Each is a different way to ask the question. Okay, so getting a little bit closer, the question is what inference could you do, or should you do? I'm probably not in a position to tell you what you should do, but I can tell you what you could do. What you need to ask yourself is what you want to know, and I've spelled out a few of these. There's differential transcript expression, so that's doing everything at the transcript level. There is differential gene expression, where you collapse all the information you get from the transcripts to the gene. You just need to do that in the most appropriate way, because different transcripts have different lengths, so you can't just add the counts, like I showed you in the schematic example before. Another thing that we tried, and we didn't know how well it was going to work, is that you can ask a slightly different question. From your transcript-level information you can ask: is there any transcript in this gene that is differentially expressed? Or ask a slightly different
question and ask: is every transcript different in this gene? What I'm talking about there is also collapsing to the gene, but collapsing in a different way. You could think about it as collapsing transcript-level p-values into gene-level p-values, and that's maybe a reasonable thing to do depending on the question that you want to answer. And then differential transcript usage, or differential exon usage, is more about asking the question: do the transcript proportions change? So if we come back to our schematic example from before, I've basically answered this. Has the blue transcript changed from condition A to condition B? Yes it has: it was here and it was absent here. Has the red one changed? Yes: it was absent here and it was present here. Have any of the transcripts changed? Well, yes, that's a natural follow-on from this one; if there's a yes here, then there's a yes here. Has the overall expression changed? No, the overall expression has not changed, because 10 reads from a long transcript is equal to 5 reads from a short transcript, so the overall production of transcripts from this locus has not changed. And then, have the proportions changed? Yes: it's 100 and 0 here and it's 0 and 100 there. Okay, so this is obviously a very easy example. In most situations it's going to be a lot more difficult than this: there are going to be a lot more transcripts, and there's going to be a lot more uncertainty in all of this. So that's what you could do. Later I'm going to make a few recommendations about what you should do, but I may not really be in a position to tell you what you should do. Okay, so there are lots of subtleties here, in case you haven't noticed them or picked up on them. One of the main reasons people stick to union counting is that it's easy. One caveat: if your transcript catalog is incomplete, you might question how good the estimation of transcript expression is. I don't show that later, but there are certain
concerns. There are probably lots of cases where we don't know the full transcript catalog, so how well do all these transcript estimators work if you don't have a complete catalog? We can simulate a bit of that, and we did a bit in that study. On the flip side, I think it's quite clear that transcript counting is the way to go: it's more accurate and it's more precise. But you could also imagine very complicated genes, think about some MHC genes, where you have hundreds of isoforms, and it's pretty hard to distinguish which of those isoforms is really different. We'll come back to that a bit later. So there are some trade-offs here, but it's worth thinking about what you actually want. Another thing is that there are more transcripts than genes. What does that mean? It means that if you were to do testing at the transcript level, there are more tests, and there's definitely a higher multiple testing penalty. Maybe that leads to lower sensitivity, but whether it affects sensitivity or not again depends on the actual situation that you're looking at. One question I get a lot, and I'll come back to this later, is: do the standard statistical methods that we've been using for raw counts apply to these estimated counts?
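As a side note on the multiple-testing point above, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is only an illustration: the p-values and the numbers of tests (10 at the "gene level", 40 at the "transcript level") are invented, and the function name `benjamini_hochberg` is a hypothetical helper, not something taken from the paper or from any package mentioned in the talk.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented example: the same two small p-values, embedded in 10 tests
# (a "gene-level" analysis) or in 40 tests (a "transcript-level" analysis).
signal = [0.001, 0.004]
gene_level = signal + [0.5] * 8
transcript_level = signal + [0.5] * 38

gene_adj = benjamini_hochberg(gene_level)
tx_adj = benjamini_hochberg(transcript_level)

print("gene level:      ", gene_adj[:2])
print("transcript level:", tx_adj[:2])
```

With these made-up numbers, both small p-values stay below a 5% false discovery rate cutoff when there are 10 tests, while only the smaller one does when there are 40. Whether real data behave this way depends, as noted above, on the actual situation.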
So with this transcript-level counting you basically have an estimated count, so it's not the simpler situation; there's some uncertainty involved there. The short answer is yes, they do, but there's more to the story. And here's getting to my argument later. Say you were doing the analysis at the transcript level and one of your transcripts came up as differentially expressed. I've heard from many biologists on this, and they say that when that happens they would probably look at the other transcripts within that gene, because they want to know: is this transcript changing because the whole output of this gene is changing, so all the transcripts are going up, or is it changing because it's switching from one to another? I'll come back to that a bit later. Another thing is this whole business of union counting versus transcript counting: how big of a deal is this? I guess there's a lot of exaggeration on social media, and I'll try to answer this further, but does it matter in practice? The spoiler here is: not so much, but there are certain situations where it's worth thinking about. Okay, so in the paper that I presented at the beginning we made some claims, and I'll just jump into our response. Maybe we should be thinking about our analysis at the gene level; it depends exactly on the situation. I'm going to say that the actual estimation problem is much easier at the gene level, so that's one argument I would make. I'll give you a little bit of evidence for the fact that the statistical engines work well whether you do transcript level or gene level, with raw or estimated counts. And I'll come back to this interpretation: I think it's a lot easier to interpret things at the gene level than at the transcript level, but I also welcome other people's arguments. And then the difference between union counting and transcript counting is
mostly small. Okay, so the first claim is this. What we did is we simulated some RNA-seq data, and the way that we like to simulate RNA-seq data is to take a real dataset, get some empirical distributions of the abundances that we see, and build that into our dataset. What I'm plotting here is the estimated transcripts per million versus the truth, because in a simulation we know the truth. Basically, when you do this at the transcript level, there's a set of transcripts that are just really difficult to estimate. There is some non-trivial amount of expression that we put into the dataset, but the algorithm we're using, Salmon, underestimates the expression. I think this is simply because these genes are really hard to estimate; it's not really a problem in Salmon, it's a problem in the information that's in the data. And so if we do something a little bit easier, and that is take the TPMs and sum all the TPMs for the transcripts of a gene into a gene-level sum, then things get a little bit nicer. We still have a few genes that are hard to estimate, and we'll come back to those later, but this is a little bit better. So this maybe argues for the point that you want to do things at the gene level because the estimation is easier, but it also argues for answering an easier problem. This is better because it's easier: it's easier to get overall summarized gene-level expression than it is transcript-level. And on the right here is the devil, the union counting business, and well, it's actually not so bad, right?
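To make the aggregation idea concrete, here is a small sketch that sums length-normalized transcript abundances to the gene level and compares the result with a plain union count. It reuses the toy example from earlier in the talk (10 reads from a transcript exactly twice as long as the other, versus 5 reads from the short one). The transcript lengths of 2000 and 1000, the function names, and the use of reads per base instead of properly scaled TPMs are all simplifying assumptions for illustration, not the procedure used in the paper.

```python
import math


def union_count(reads_by_transcript):
    """Union-style gene count: add up reads regardless of transcript."""
    return sum(reads_by_transcript.values())


def gene_abundance(reads_by_transcript, length_by_transcript):
    """Length-aware gene abundance: sum of reads per base over transcripts."""
    return sum(reads / length_by_transcript[tx]
               for tx, reads in reads_by_transcript.items())


def log2_fold_change(a, b):
    """log2 of condition B over condition A."""
    return math.log2(b / a)


# Assumed lengths: the blue transcript is exactly twice as long as the red one.
lengths = {"blue": 2000, "red": 1000}
cond_a = {"blue": 10, "red": 0}   # only the long transcript is expressed
cond_b = {"blue": 0, "red": 5}    # only the short transcript is expressed

union_lfc = log2_fold_change(union_count(cond_a), union_count(cond_b))
aware_lfc = log2_fold_change(gene_abundance(cond_a, lengths),
                             gene_abundance(cond_b, lengths))

print("union counting log2 fold change:", union_lfc)
print("length-aware log2 fold change:  ", aware_lfc)
```

In this toy case the union count reports a two-fold change between the conditions, while the length-aware sum reports none, which is the situation described in the schematic. Real tools also scale by library size and model biases, which this sketch leaves out.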
Highly expressed genes attract a lot of reads, lowly expressed genes don't attract a lot of reads, so compared against the truth, union counting is not so bad. You might even favor union counting in these cases where your catalog isn't complete, because if you have at least a transcript that covers most of the exons of the gene, then you're not going to go too far astray. I mean, there are situations where you will, but there are a lot of situations where you're okay. So that's the first claim: maybe we should just be doing things at the gene level, because it's an easier problem. Now of course people in certain situations, and I'll come back to these later, will have to look at specific transcripts and whether they change. My second claim is that the statistical methods that you use for raw counts and for estimated counts are, I'll say, equally healthy. I put healthy in quotes because it's not a perfectly healthy p-value distribution. I'm looking at p-value distributions here. The five different panels are five different ways to do this counting. There are two that are called the simple ones, I guess: there's featureCounts and the simple counting, so these are the union approaches. And there are a few different ways to combine the data and do the proper aggregation of transcript level to the gene level. This is just a diagnostic that we do in every differential expression analysis, to see whether we get this flat p-value distribution and then a spike at zero; that's what we hope to see. There's a little bit of stuff happening here, where we get perhaps a little bit of inflation of the false discovery rate and so on, but this is reasonably healthy. I've seen a lot worse than this, basically. But I guess the point here is that whether you do this on the estimated counts or the raw counts, you get basically the same profile. So that's an argument for suggesting that the methods that you use, the count methods that you
use are still okay for these estimated counts. They're still count-distributed, and so the assumptions that you make are still reasonably appropriate. The other diagnostic that we look at is dispersion-mean plots. We don't need to dwell too much on this, but just to say that regardless of how you summarize the data, you get roughly similar profiles. This is this business of trying to moderate estimates in the data, so as to get slightly better inferences by sharing information. We use this just as a diagnostic, and it's not so different across the different types of counting. Okay, so let's come back to this again. One thing I want to point out again is this little business here of taking your differential transcript expression but then returning the results at the gene level. Just be careful of the question that you're asking here: you're asking whether any transcript in a gene is differential. Now, one result from the paper is this, and these are plots that we look at a lot. What you see here on the y-axis is the sensitivity, the power, or true positive rate, and here is the false discovery rate. What a statistician wants is to control the false discovery rate, so you want to keep to the left side, but then you want to be high. These three open circles that you see are three different cutoffs of the false discovery rate, and the dot is plotted at the actual false discovery rate. So for example this one here, I know, corresponds to the third cutoff, so I'm setting a 10% false discovery rate, and in this particular example I'm getting a 12% actual false discovery rate. So I want to keep things to the left but keep things to the top. What I'm comparing here is a gene-level analysis versus a transcript-level analysis, and there's some statistical advantage to doing things at the gene level. But also keep in mind that what we're doing is answering an easier problem again, so it's more powerful
to answer an easier question than to answer things at the transcript level, and that's part of our argument. So these are the two questions we're answering. Okay, and so I want to make the argument for which you might choose, which might be more informative to you. I want to make the argument that, to me, it's better to answer two clear questions than one broad question. What I mean by that is that when you are doing differential transcript expression, you're just asking: does any transcript change? Whereas to me it's a lot clearer if you focus on the gene as a unit and answer more specific questions about that gene. Does that gene overall change in its abundance? Do the isoforms within that gene switch? Not everyone's on board with this, and I accept that, but this is my thinking: it's easier to answer two precise questions than one broad question. Now, the counter-argument to this is that there are of course some situations where we specifically want to do transcript-level differential expression analysis. This is a paper about a very specific isoform in prostate cancer that's very predictive of response to treatment, and this made a bit of a splash when it was published in 2014. Basically, the presence or absence of this AR-V7, a very specific isoform in prostate cancer, is very predictive of whether the patients respond to the treatment. So this would be a very good example where we want to do a very specific analysis on those particular variants. I'd still question whether you want to do this on a genome-wide scale, but there are cases where you definitely want to do this. Okay, so for my last claim, I need to tell you a little bit about our simulation, because we basically made it as extreme as possible to really tease out where this union counting versus transcript counting is really having an effect. What we did is we took chromosome 1, just to make it
a little bit smaller. We chose a very typical setup that we see, two experimental conditions each with three biological replicates; that's the most popular experiment that we get to analyze. And as we do with simulations, we put in truly differential things. What I mean by things is that sometimes we put in overall differential expression, so we take all the transcripts and change them overall; or we do it such that the isoform proportions change but the total output is kept constant; or we just randomly select transcripts and put them in as differential. And so this is essentially where we can now tease apart where union counting versus transcript counting really makes a difference, and it's pretty clear. Again, the same sorts of plots here: true positive rate versus the achieved false discovery rate. What you see is that the methods break into two groups. The two lines that are here are the two that we call simple counting methods, the union counting, whereas there are three methods here, and the differences among these three methods are just slightly different ways to take the transcript-level information and put it into the inference machinery. So essentially the union counting is where it goes wrong: it's not very good, the performance is really degraded. But when you really look at it, it's really due to differential usage. The plots on the right here are the same thing but split according to the truth, whether you have differential transcript usage or not. Basically, if you have no differential transcript usage, then all these methods are pretty similar. There are a few that fall out a little bit, but there are several here that are basically indiscernible. Whereas for the ones where you do have differential usage, with the proper counting you're in pretty good shape: you're getting decent enough power and you're controlling the false discovery rate. Whereas here, with the union counting, you just really massively blow out the false discovery rate. So this is, I
guess, what partially motivated the switch to transcript counting, but I should also say a lot of it was due to the fact that these things just run a lot faster. Just to follow on from this, one thing you could do, and you can do this for any dataset, is the following. You've got this simple situation where you've got two different experimental conditions. You can calculate a log fold change with the simple counting, the union-based counting, and you can calculate a log fold change by properly adding up the transcript expression to the gene level and doing a log fold change there. We did this for our dataset, and that's up here in the top corner, that's sim2, and you can see the differential transcript usage is in red and the rest are in blue. Of course, we don't know what's truly differential transcript usage in the real datasets; we only know it for the simulated dataset. And what you can see is this is where things are going wrong. You can see that on this axis here, this is the transcript counting, the log fold change you get, and there's a bunch of log fold changes here that are basically zero, because remember, in our simulation we put in no overall change in expression but we put in a change of the isoforms. And this is where the simple counting kind of derails: you get lots of these really extreme log ratios that are really just due to the differential transcript usage. When you look back on this, what I'm showing you is actually not too surprising. What was interesting for us, and I don't think the story is finished here, is: what if we did this for real datasets? Because if we've been doing union counting for the last 6 or 7 years, how could we trust our past results? And I don't think the problem is so bad. I mean, there may be a few instances where this really goes wrong. These were just 3 datasets that we had available right away, but you can do the same exercise on
any dataset that you want. But we don't see the same kind of derailing of the log fold changes in the real datasets as in the simulated dataset. And of course we have to accept that the simulation that we did is pretty extreme; in the end we put in a lot of differential transcript usage. So this is, I guess, part of the argument for saying, well, it's not such a big deal in many real datasets. Now, there are going to be datasets where there are cases where we switch from a long isoform to a short isoform or something like that, where it really changes things. Just to summarize, I want to talk about some other considerations. I mean, this is RNA-seq differential analysis, and there are just some new, some old things that you may know about or may want to think about if you are doing these experiments. One thing that makes transcript expression challenging is that a lot of times you have fairly low expression. This is just a review, and in the abstract it says that genes of interest to pharmacologists are frequently expressed at such low levels that they are not adequately represented genome-wide. So this may be true, and for a grant application I just looked at some datasets and said, well, how big of a deal is this? And this is what I came up with. What I did is I just ranked the genes according to the number of reads they attract, and I did this across a couple of different datasets, and what it shows is that basically 80% of your reads go to the top 10% of genes. So that's not a very efficient way to collect data, unless you're really interested in all the highly expressed genes. But if you're interested in the more lowly expressed genes, and probably there are a lot of situations where you are, not a lot of your real estate goes there. There are some experimental ways around this. I don't know how well they've been received or how it'll work overall, because it adds another step in the lab and some other cost and so on; it may not work out in the end. But essentially if
you could deplete some of these really highly expressed transcripts, some structural proteins that you're not really interested in, maybe you could get a little bit deeper for the same sequencing depth. I'm not sure that's all that well appreciated, but it's something that seems to stand out pretty strongly. A recent paper from Mick Watson's group, just coming back to a point that I discussed earlier: some genes are just super difficult, especially if you have very similar gene families. They did an experiment putting a thousand reads onto every gene or every transcript and trying to estimate them, and what you see is there's a big spike at a thousand, but then there's a long tail to the left where the reads just cannot be unambiguously mapped. And they make an argument that some of these genes are of disease relevance, and certainly that's the case. Another point that's lingering in the back of my mind is that maybe in a few years we're not going to be doing any of this fragmenting into small pieces; we're just going to sequence full-length transcripts, either at a cDNA or an RNA level. This is probably not so far away. This is just a plot of some of our first experience with some PacBio Iso-Seq data, and it was just brilliant. We had a reference genome, and in this particular example at the top here, this was the annotation, which was inferred partly from RNA-seq data and some computational algorithms. You can see that with the computational algorithms that was two transcripts; it's quite clearly one transcript, and you can see these are reads just across the whole transcript. At the price point that something like PacBio is at, this is, if you want, for discovery. But as these technologies get higher and higher throughput, and maybe Oxford Nanopore is really getting to that point now, maybe we can do this in a quantitative way and sequence our cDNAs and just essentially count at the transcript level, which makes some of this business a little bit easier. And
so I'm going to just wrap up here to say that there's no real crisis here; we haven't really been doing our RNA-seq wrong all along. But there are some cases where the difference between transcript counting and union counting is large enough to think about. This is just my view: unless the need really dictates, I would suggest answering an easier question, answering questions at the gene level, because it's easier to do estimation at the gene level. And so, are any transcripts differential? That's an easier question than answering very specifically about every isoform. And so that's why I say I find it easier to interpret two precise questions. But that also doesn't mean that this isn't important: the analysis methods for differential transcription are actually essential for building into some of these analyses, and so there's a little bit of work to be done. Transcript-level estimates improve gene-level inferences. One question people ask me is, what do you do in your lab? I guess I would say that if you want a fast pipeline, I would combine Salmon with edgeR. Now, you have to take into consideration that there's a bit of a bias there, since I'm a co-author of edgeR, but that's definitely what we do in our lab most of the time. One caveat there is that, well, actually sometimes these full alignments, which are computationally costly, are really useful to look at. Of course I always encourage people to look at the data, and with these fast pipelines you just take estimates out. So these full alignments are actually really useful, especially if you're interested in splicing. I mean, you might want to slow it down if you want to be clear about things. One thing I haven't really commented on, but I should say something about, is that these methods, Salmon and kallisto, are so fast that you can do bootstraps to get uncertainty estimates of your transcript-level estimates. And so the question is what to do with those, and I don't have a full answer, because, I mean,
basically my argument earlier in the talk is that, well, actually you just throw away the uncertainties, do your standard edgeR analysis, and it's still fine. The reason I think that it's still fine is that most of the genes are actually fairly easy to estimate, so there's not so much uncertainty; it's a very small proportion of the genes that are a bit more uncertain to estimate. But of course we, from a methods side, should think about propagating those uncertainties into the differential expression calculation, and we're working on this; I'm sure lots of other people are working on this. We think we have a good approach to make use of the bootstraps. I'm not going to tell you about it today, but there's more to come there, and of course we're not the only ones working on this. OK, I already mentioned this part: the full alignments are useful. Last thing is that, as part of this collaboration with Mike Love, we created, and when I say we, I mean Mike Love created, the tximport package. This tximport is the bridge between the transcript-level estimators, so Salmon, kallisto, RSEM, all these estimators, and the other layer, which is the statistical layer, which is DESeq2, edgeR or voom/limma. So just to let you know that that exists, and I think it was just last week that it became available in the development version of Bioconductor. OK, so I will end there and thank of course Charlotte Soneson, who did the majority of the work here, also Mike Love, and then there are lots of other people in my group that are related to this project, Shadei and Gosia and Lucas, who are involved quite heavily in RNA-seq data. And there's the edgeR development team, which is Gordon and Yunshun in Melbourne, and Davis, who is in Cambridge now. I also thank the funding agencies for supporting these people and these projects, and thank you.
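
As a rough illustration of the log fold change comparison described in the talk (simple union counting versus adding transcript-level abundance up to the gene level), the following Python sketch uses a made-up gene with two isoforms of different lengths. It is an idealized toy, not the analysis from the paper: the lengths, abundances and function names are all assumptions chosen for illustration, reads are taken to be exactly proportional to abundance times transcript length, and there is no noise or replication.

```python
import math

# Hypothetical gene with a long and a short isoform (lengths in bases).
LENGTHS = [3000.0, 1000.0]

# Made-up isoform abundances (arbitrary molecule units) in two conditions.
# The total is 100 in both, so there is no overall change in gene
# expression, only a switch in which isoform dominates (i.e. DTU).
COND_A = [90.0, 10.0]
COND_B = [10.0, 90.0]


def expected_reads(abundances, lengths):
    """Idealized read counts: proportional to abundance times length."""
    return [a * length for a, length in zip(abundances, lengths)]


def union_count(abundances, lengths):
    """Simple counting: every read from any isoform counts for the gene."""
    return sum(expected_reads(abundances, lengths))


def gene_abundance(abundances, lengths):
    """Transcript-level route: length-correct each isoform, then sum."""
    reads = expected_reads(abundances, lengths)
    return sum(r / length for r, length in zip(reads, lengths))


def log2_fold_change(value_a, value_b):
    """log2(B / A) for two positive values."""
    return math.log2(value_b / value_a)


lfc_union = log2_fold_change(union_count(COND_A, LENGTHS),
                             union_count(COND_B, LENGTHS))
lfc_gene = log2_fold_change(gene_abundance(COND_A, LENGTHS),
                            gene_abundance(COND_B, LENGTHS))

print(f"union counting log2FC:   {lfc_union:.2f}")
print(f"summed abundance log2FC: {lfc_gene:.2f}")
```

In this toy case the union count changes by more than a factor of two even though the total abundance of the gene is flat, which is the kind of extreme log ratio, driven purely by an isoform switch, that shows up in the simulated dataset.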