Okay, good. So let's get started. This morning we'll talk about RNA velocity: what it is, how to calculate it, why you would do it, and what you may need to keep in mind. But before we talk about what it is and how you would do it, let's say a few words about why we would like to do it. If you have read some papers or heard about RNA velocity, you have probably seen figures like this; this is typically what you get at the end of an RNA velocity analysis. You get things like the figure here on the left, where the streamlines represent the flow in the data set. The way you interpret this is that the points are cells; this is a t-SNE plot or a UMAP plot or some other low-dimensional embedding. To see how the cells are developing or moving in this plot, you basically follow the streamlines. So you would see that over here there is some cycling going on: there are cells that are still in the cell cycle. At some point they exit the cell cycle and start differentiating; they go through these pre-endocrine cells, and eventually decide what type of differentiated cell they would like to become. That's a very typical plot from an RNA velocity analysis, and of course we will see how to generate them today. What you also often see are plots like this, where we look at individual genes. These are often referred to as phase plots or phase trajectories: they show the amount of spliced RNA and the amount of unspliced RNA, and the velocity trajectory through them. We will see how to make those as well, what they mean, and how to interpret them. So that's where we want to get to at the end of this. Then to the question of why we would like to do this, and why not just use the data we already have.
In most single-cell analyses, we just look at the RNA abundance profile, the gene expression profile, of each cell. That tells you the abundance of the mature RNA for each gene in each of the cells. That's what you estimate, for example, if you run alevin like Abby showed yesterday, or if you run Cell Ranger or another quantification tool. That's usually very good for figuring out what state a cell is in, what it's doing right now, and which other cells it is similar to. But it doesn't really tell you where the cell is headed; it doesn't say what state it will be in a little bit into the future. So that's the idea of RNA velocity: to use snapshot data, because of course it would be better if we could take the same cells and sequence them over time, then we could really see what happens to a cell one hour into the future. We can't do that with current single-cell RNA-seq protocols; we take a snapshot and then the cells are gone. So RNA velocity is a way to try to get at the dynamics without having to look at an actual time series of the same cells. Then what is it? The underlying conceptual model is what's shown in this figure. The whole idea is to use not only the mature RNA but also the unspliced, or pre-mRNA, abundance in the cell at this point in time, in order to say something about where the cell is headed. On a high level, you can imagine that if, for a given gene in a given cell, there is much more unspliced RNA or pre-mRNA than mature RNA, this gene is probably about to be up-regulated, because there is a lot of precursor RNA.
Conversely, if there is a lot of mature RNA but very little pre-mRNA, this gene is probably about to be down-regulated, because there is just not enough in the pool of pre-mRNA to keep the expression level up. That's the idea of why having both the pre-mRNA and the mature mRNA helps to say something about where the cell is headed. If you then combine this information across all the genes you're looking at, you can say where you think this cell will be sometime in the future. So the conceptual model underlying the RNA velocity calculation is shown here. You have your DNA, which is being transcribed with some transcription rate alpha that can depend on which state the cell is in: either the gene is actively being transcribed or it's not, so alpha does not have to be constant. The transcription generates the unspliced RNA, which we represent by u at a specific time t. This is then spliced with some splicing rate beta, which is assumed to be constant across all cells for each gene; it can differ between genes, but it's constant across cells and over time. When it is spliced, it turns into the spliced, or mature, RNA, which we represent by s, and this is then degraded with some rate constant gamma. That's the conceptual picture, and if we translate it into mathematics we get this system of ordinary differential equations. The first one says that the rate of change of the unspliced RNA is increased by transcription: the more the gene is transcribed, the faster the unspliced abundance grows. And it is decreased by splicing: the faster it is spliced, the slower it increases. The second equation says the same for the spliced RNA. The splicing term here is the same as in the first equation, because whatever is spliced out of the pre-mRNA pool goes into the mature mRNA pool.
And then there is the degradation term: the faster the mature RNA degrades, the slower it accumulates. So this is all the mathematics behind RNA velocity; these are the equations we would like to solve. What we do is input the values of u and s for our cells, that is, the spliced and unspliced abundances (we will see how to estimate those as well), and then we try to estimate the parameters: alpha, beta and gamma. The RNA velocity is then, by definition, the rate of change of the mature mRNA pool, ds/dt. It's nothing magical, it's really just that. And of course, once we have all these parameters and the measurements of u and s, we can calculate the RNA velocity. So these are the equations we're trying to solve, and to illustrate this I made a little app that we can run. You can run it too if you want; it's in the RNA velocity part of the course material, the velocity app. Since this is not something that you will actually modify, you can run it from there; if you want to modify it, please copy it out to your working directory first. If you open the app.R file and click Run App, it should work, and it should open this window. This is just to illustrate what the model means. We have the same two equations that we had in the slides, and we're going to assume that at time zero we don't have any expression of this gene, so it starts from not being expressed. It could start from anywhere else, of course, but we'll just assume there is no expression of this gene to begin with. We can actually solve this set of equations explicitly; if you want to see what the solution looks like written out, you can check this paper.
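To make the model concrete, here is a minimal numerical sketch of the two equations, du/dt = alpha - beta*u and ds/dt = beta*u - gamma*s, using a simple forward-Euler integration. The parameter values are arbitrary, chosen only to illustrate the dynamics; they are not taken from any real data set.

```python
def simulate(alpha, beta, gamma, t_max=50.0, dt=0.001):
    """Forward-Euler integration of the velocity ODEs, starting from u = s = 0."""
    u, s = 0.0, 0.0
    for _ in range(int(t_max / dt)):
        du = alpha - beta * u          # transcription in, splicing out
        ds = beta * u - gamma * s      # splicing in, degradation out
        u += du * dt
        s += ds * dt
    return u, s

u, s = simulate(alpha=20.0, beta=1.0, gamma=0.5)
# At steady state du/dt = ds/dt = 0, so u* = alpha/beta and s* = alpha/gamma.
print(u, s)  # approaches 20.0 and 40.0
```

With a constant alpha, the unspliced abundance rises first and both curves flatten out, exactly the steady-state behavior described above.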
This is the scVelo paper, and they write out the solutions to these equations explicitly. We will assume that beta and gamma are constant over time, but that the transcription rate can potentially change depending on whether the gene is in an active state, that is, whether it is actually being transcribed. So let's look at what we see here. The plot down here just shows the current alpha; here we're assuming that the transcription rate is constant, always 20, which is an arbitrary number, but the important thing is that it's constant. This plot here shows the actual solution: u(t), in blue, is the unspliced abundance, and the red curve is the spliced abundance over time, so this is the time course we get if we solve the equations. And here we see the spliced abundance versus the unspliced abundance, that is, the phase plot. Okay, I will just run it locally, I'm not sure why it's crashing. So this is the same thing: the plot we saw on the first slide, showing, for all time points, the spliced abundance versus the unspliced abundance. The first thing we can see is that we start here at (0, 0), the initial conditions, and as transcription starts, both the spliced and the unspliced abundance start going up, but the unspliced abundance goes up first. That makes sense, because to get to the spliced state you actually have to go through the unspliced state. So the blue curve goes up first, and a little bit later the red curve follows. That's the first thing, and it makes sense in the interpretation. The second thing we can notice is that eventually both of these will become flat.
Both of them reach a steady state, or equilibrium, where they don't change anymore. Eventually, if the transcription rate is constant, we end up in a steady state where both derivatives are zero, and the actual abundance values in the steady state depend on the parameters. If we change gamma to 0.5, you see it's actually different: the ratio between them is different and the actual values are different, but the behavior is the same, eventually it goes into a steady state if we leave it with a constant transcription rate for long enough. The way we would see that in the phase plot is: we start here at the initial condition, s is zero and u is zero, and as time passes we walk up this curve. You see that u is always increasing faster than s, which you can also see here, and eventually we end up here, at the steady state. So that's how these points would be interpreted: the points that are above the line here (we will see in just a second what this line is) represent cells in a state where this gene is being induced, corresponding to this region here where the expression is actually increasing. Pamela, are there any questions at this point, or from anyone else? No, not that I can see. Okay, so let's quickly go back to the slides. So this is all fine, but can we actually use this to calculate the velocity? We saw that the points up here are the steady-state points, the equilibrium points, and at the equilibrium, u and s are both constant, so the rate of change of the spliced abundance is zero.
If we just look at the equation we have here, that means that beta u minus gamma s is zero; in other words, beta times u equals gamma times s. In the steady state this has to hold, because the rate of change of s is zero. And if we now divide by beta on both sides, we get that in the steady state, u is proportional to s, with proportionality constant gamma over beta. That means that if we know that these points up here are in steady state (that is, if we know we let the process run long enough for some cells to actually reach steady state), we can calculate this ratio, because it is the slope of the line through the steady-state points and (0, 0). This particular line here is u = (gamma/beta) times s, so we can estimate this ratio directly from the data, as long as we can assume that some cells eventually reach steady state. Okay, so that's all fine, but how do we get the velocities from that?
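As a sketch of this steady-state reasoning, we can simulate the trajectory from the model itself (the "cells" here are simulated points, not real data, and the parameter values are made up), treat the last points as cells that have reached steady state, and fit a line through the origin to recover gamma over beta. The velocity of any other cell then follows from the model as beta times its vertical distance to that line.

```python
alpha, beta, gamma = 20.0, 1.0, 0.5  # arbitrary illustrative parameters

# Forward-Euler solution of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s.
dt = 0.001
u, s, pts = 0.0, 0.0, []
for _ in range(int(50.0 / dt)):
    u += (alpha - beta * u) * dt
    s += (beta * u - gamma * s) * dt
    pts.append((s, u))

# Treat the last points as cells at steady state, and fit a line through
# the origin, u = k * s, by least squares: k estimates gamma / beta.
tail = pts[-1000:]
k = sum(u_i * s_i for s_i, u_i in tail) / sum(s_i * s_i for s_i, _ in tail)
print(k)  # close to gamma / beta = 0.5

# The velocity of any cell is proportional to its vertical distance from
# the line: v = ds/dt = beta * (u - k * s).
s_i, u_i = pts[int(2.0 / dt)]  # a cell early in induction
v = beta * (u_i - k * s_i)
print(v > 0)  # True: this gene is still being induced in that cell
```

This is the basic steady-state calculation; real tools add robustness steps on top, for example fitting the slope only to the extreme quantiles of the phase plot.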
Well, if we now look at any point on this curve, its x coordinate is s(t) and its y coordinate is u(t): the spliced and unspliced abundances at a given point in time. Now look at the point just below it on the line: the x coordinate is the same, and since the line is y = (gamma/beta) x, the y coordinate there is (gamma/beta) s(t). So if we take the vertical distance between any point on the curve and the corresponding point on the line, we get u(t) minus (gamma/beta) s(t), which is proportional to the velocity: if we multiply it by beta, we have exactly the velocity. That means that if we can assume that some points are in steady state, we can estimate the gamma over beta ratio, from that we get this line, and from that we get a velocity estimate for each point in the phase plot: we just take the vertical distance from the point down to the line. Does this make sense, or are there questions on that? So basically this is what velocyto is doing, the original tool built for RNA velocity estimation. They assume that there are some points up here, they estimate this slope, and then they get the velocities by taking the distance from the phase plot points to that line. Then of course there is some additional code to make this more stable and robust, and to somehow make it applicable, in some sense, to genes that don't necessarily reach steady state, but this is the basic idea. I wanted to just show one more thing here. So here we assumed
that the transcription is constant. But what if it just stops at some point? So it's transcribing for a while and then it stops. We can change that here: we can set the transcription rate type to two states, meaning it transcribes for a while and then stops, and we can make this a little bit longer. Okay, so here the transcription rate is 20 for the first 10 time units, and then it stops. You see that in the beginning it looks exactly the same as before, which makes sense, because there is no way the system can know that transcription will stop at some point. But at the point where it actually stops transcribing, the abundances start going down, and they go down in an exponential fashion. And in the same way as the unspliced abundance went up first, ahead of the spliced one, the unspliced abundance also goes down a bit ahead of the spliced one: first the unspliced abundance goes down, and a little bit later the spliced abundance follows. Sorry Charlotte, we have a question from Elrique, who is asking what the time unit is. Yeah, so this is a little bit arbitrary here; this is just a model, and also this value of 20 is an arbitrary number, just to show the process. In reality the time unit is actually a bit difficult: what you get out of the tools will also be somewhat arbitrary. If you actually want to say "after one hour, this is where I think I will be", you have to impose some kind of external knowledge, because the system doesn't know. You're going to get something that is proportional to the correct time, but because the data doesn't know whether going from here to here takes one hour or two hours, it will give you back things that are internally consistent, while the time may be off by a proportionality factor. And we have one more question from
Sina, who is asking where the beta and gamma parameter estimates come from. So they come from the steady state: if you draw a line through the steady-state point up here and (0, 0), you get a line with slope gamma over beta, so that lets you estimate the ratio between them. And one more question, from Malison, asking whether we have the same alpha estimate for all the genes. No, and the beta and the gamma are also not the same for all the genes. Velocyto assumes that beta is the same for all genes, basically so that it can estimate not just the ratio but the actual gamma: velocyto sets beta arbitrarily to one, which effectively assumes that the splicing rate is the same for all genes. And of course, if you have gamma over beta equal to 1.5 and you set beta to one, then you also have gamma, because gamma will be 1.5. scVelo does not make that assumption; it allows all these parameters to vary between genes and estimates them for each gene separately. Okay, great. So what I wanted to show here is that, now that we first go up and then down, we get phase plots that look a little more like what you may see in real data. This part here is the induction phase, where the transcription rate is non-zero, so in that period of time the gene expression increases: the gene is being induced. Then up here it is in the steady state; depending on when the transcription shuts off, it will stay in this steady state for some time. Once the transcription ends, the expression goes down, and we are in the repression phase, where we are below this line. And we can estimate the velocity in the same way as I showed on the slide: we take a point here and take the distance to the line.
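This two-state behavior can be sketched numerically as well (again with made-up parameters: transcription at rate 20 for the first 10 time units, then switched off). A snapshot taken during induction lies above the line u = (gamma/beta)*s and gets a positive velocity; a snapshot taken during repression lies below it and gets a negative one.

```python
alpha_on, beta, gamma = 20.0, 1.0, 0.5  # illustrative values only
dt = 0.001
u, s, pts = 0.0, 0.0, []
for i in range(int(30.0 / dt)):
    alpha = alpha_on if i * dt < 10.0 else 0.0  # transcription stops at t = 10
    u += (alpha - beta * u) * dt
    s += (beta * u - gamma * s) * dt
    pts.append((s, u))

k = gamma / beta                     # true steady-state slope of the phase plot
s_ind, u_ind = pts[int(2.0 / dt)]    # snapshot during induction (t = 2)
s_rep, u_rep = pts[int(15.0 / dt)]   # snapshot during repression (t = 15)
v_ind = beta * (u_ind - k * s_ind)   # vertical distance to the line, times beta
v_rep = beta * (u_rep - k * s_rep)
print(v_ind > 0, v_rep < 0)  # True True: induced gene up, repressed gene down
```

Here the true gamma/beta is used directly; in practice it would be estimated from the cells assumed to be at steady state, as above.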
For all these points, the velocity is positive, which makes sense, because here the gene expression is actually increasing: the velocity is the rate of change of expression, so if the velocity is positive, because the curve is above the line, the gene expression is going to increase. Here, if we take this point minus this point, this distance, or this value, is negative, which means the RNA velocity is negative, which means the gene expression is actually decreasing. That's what's happening here. Yes? Sorry to interrupt, we have two more questions. One is about genes where we don't have introns, or splicing information, so I guess these cannot be estimated; this is a question from Sebastian. And there's a question from Simone, who is asking if there is any experimental confirmation of the profiles that are reconstructed by velocyto, in the original paper I guess. So yes, for the genes that don't have any introns we can't do this, because we don't know whether what we're measuring is spliced or unspliced; it looks the same. As for whether there's experimental confirmation of these kinds of profiles: I'm not sure, exactly. It's more that you see these kinds of patterns, you fit the model, and the model fits the data reasonably well, and there is a mathematical model underlying it. You can't look over time in a specific cell, so that's the problem: you need to make assumptions about your data in order to say anything, because there is no way to really validate what's going on with single-cell RNA-seq data. But their trajectories did make sense in terms of what was known about the biological systems they were studying. So overall they got signals, profiles and patterns that made sense biologically, and the genes that came up kind of made sense biologically too
but I don't think they could actually go in and validate specific genes afterwards. It's more on the interpretation side: yes, this gene we knew about before, and it makes sense in this context. Okay, good. The last thing I wanted to show here is what happens if we don't let this gene transcribe for long enough. If it is only transcribing for a very short time before transcription shuts off, we don't actually have time to reach the steady state: you see that this curve stops increasing and starts decreasing before it gets to the steady state. And this is where this way of estimating velocity, which is used by velocyto and also by the steady-state model in scVelo, actually breaks down. This line is the true line, what we would like to estimate, but if we just took the cells at the top of the plot and said "I think this is my steady state", we would overestimate the slope. This is where the dynamical model of scVelo, which we will use later and which actually solves this set of equations, has an advantage: it can deal with this situation without having to assume that whatever is in the top right of the plot is the steady state. Charlotte? Yeah. One more question, from Mutlien, who is asking: what is the effect of having single-nucleus data instead of single-cell data on the estimation of velocity?
Yeah, that's a very good question. I think this is still being debated. On the one hand, we typically have a much higher fraction of intronic reads in single-nucleus data, so the unspliced expression is much higher, which should make it easier. On the other hand, the "degradation" would maybe rather be something like export from the nucleus, and it's not really clear whether the assumptions are satisfied: is the nuclear export rate constant across cells and across states? I don't know if one can make that assumption. So I think it's possible, and I think both the developers of velocyto and scVelo have done it and said that it seems to work, but both have also said that they're not making guarantees; there are still things we need to learn about single-nucleus data, and the method wasn't developed directly for it. But basically, it seems to work. So let's go back. This was an overview of what we're trying to do, what these plots mean, and how we can use and interpret them. Now let's go into what we actually need to do to estimate these velocities. As I said, we need both spliced and unspliced abundances to go into solving these equations, so we can't use the regular gene expression estimation pipeline that you saw yesterday afternoon, because that will only give you the spliced, mature RNA abundances. There are tools now that are developed for estimating both spliced and unspliced abundances; I'm going to focus on the ones that work for droplet single-cell RNA-seq data, also because I think that has been studied a little bit more. As you also heard yesterday, there are different categories of quantification methods: those that align to the genome, and those that map to some form of transcriptome. So in the genome-alignment
category, we have velocyto; or actually, velocyto assumes that the data has already been aligned to the genome, for example using Cell Ranger, and STARsolo will also do that: it aligns the reads to the genome. Once you have aligned the reads to the genome, you have something like this figure: the black line is the genome, these things here are the reads, and this is a gene with exons and introns. You would basically count every read that falls entirely within exons, like these three reads here, as exonic, or spliced, reads; those that fall entirely within an intron are counted as unspliced. Then, depending on which tool and which settings you use, the other reads are treated in somewhat different ways. Reads like this one, which partly overlap an exon here and partly an intron, and here are completely intronic, will typically be counted as intronic, because they can't be exonic, since they overlap the intron. A read like this, which falls in an exon here and in an intron here: well, that depends on which settings you're using, but you may count it as intronic, because it could be intronic, or you may actually throw it away. So there are a lot of different ways of counting reads once they are aligned to the genome, depending on how permissive you want to be with reads that are not completely unambiguous. For the transcriptome-mapping tools, like alevin, which you saw yesterday, and also kallisto|bustools, which both map to the transcriptome and estimate abundances, we can't use just the standard transcriptome, because that will only give us the mature RNA. We have to somehow extend the transcriptome with either introns or pre-mRNAs, and we will see how to do that. But first we need to ask ourselves what we mean by an intron. Extracting the transcripts is straightforward: you just take the exons of each isoform and glue them
together. But extracting the introns that we want to add to the FASTA file, which we will then index with salmon, is not so easy. You can imagine doing this in different ways. You can take the "collapse" approach, here on the left, where you take all the transcripts of a gene, collapse all the exons, and define the introns as whatever is left. In this case, we would collapse these exons here into one longer exon, and the introns would be just whatever is left: the part from here to here, and from here to here. That's one way of doing it, and it basically means that if a read could be either exonic or intronic, say if it fell here, we would prefer to assign it to an exon, since we have removed all the intronic parts that could be exonic. So it encodes a preference for exonic assignment. The other option is to look at each isoform separately and define the introns as the actual introns of that isoform. Here we would have three introns that we add to the feature set; yes, they will overlap the exon here, but we will assume that alevin can deal with that. The other thing I wanted to mention here is the flanking sequence. Reads that are intronic can also partly overlap a nearby exon, so they can be partly exonic and partly intronic; they don't have to lie completely inside the intron to be unspliced. We therefore add a flanking sequence on each end of our introns, to allow these reads that are partly intronic and partly exonic to also be assigned to introns. Once we have decided what we consider an intron to be, we can extend our FASTA file. If we choose the collapse approach, the features that go into the FASTA file that we index with salmon are the transcripts, these are the two isoforms, the exons glued together, and these two
new, artificial introns. If we choose the separate approach, we again have the two transcripts; these will always be the same, the transcripts do not change depending on how we define the introns, but then we have the three original introns, including their flanking sequence. Once we have chosen, these are what we put into the FASTA file, index with salmon, and then quantify with alevin. Before the break, I just wanted to illustrate a little bit what differences you may see if you quantify with different tools. There are mostly these four tools currently used for RNA velocity quantification, and you can use them with different options: for example, for alevin, which of these intron definitions do we choose, and additionally, do we quantify exons and introns jointly or separately? That is, do we take our data, first quantify the gene expression as usual and say that this is now my spliced abundance, and then take the same data and independently quantify the introns? What would happen in that case is basically that all the reads falling in this region would be counted twice: when we just look at the transcripts, the exons, we say, yes, it could be exonic, it maps here, so I will count it as exonic, since we don't know about the introns at that stage. And if we just look at the introns and don't know about the exons, we say, yes, it could be intronic, so I will count it towards this gene too. So we would count many reads twice if we quantify exons and introns completely independently of each other; but it is an option. I just wanted to show a few aspects where the methods differ. What you see here is an example of a gene: here you have the gene model, which is on the negative strand, indicated by the red color; this is the coverage of the reads that fall on the negative strand, and these are the reads that align to the positive strand. And
as we saw yesterday, the 10x data is actually stranded, so all the reads that come from this gene should be on the same strand. In these panels I'm showing the spliced counts and the unspliced counts for each of the methods (you don't have to go into detail on what the methods are right now), and this is the fraction of the reads that are spliced. The first thing we see for this gene is that there are actually a lot of ambiguous regions that could be either intronic or exonic, in particular this region where most of the reads are: it could be exonic according to this isoform, or intronic according to this one. This means that the way we define the introns actually makes a big difference. If we define the introns with the collapse approach, where we collapse all the exons first and take the introns to be whatever is left, this region will not be intronic, because it is exonic here; when we collapse, this whole region becomes exonic. In that case, as we see here, we get almost no unspliced reads for this gene. If, on the other hand, we define the introns using the separate approach, this region could be intronic or exonic, and as we see, alevin will assign some of the reads as spliced and some as unspliced. So it really makes a difference. Which one is correct is a different question, and it depends on the gene. In this case it seems plausible that the region is exonic, because the coverage ends exactly at the end of this exon, this is the 3' end of this transcript, and with this protocol the reads should be at the 3' end. But in order to say which one is correct, you would actually have to look at the gene, and it's not always obvious where these reads actually come from. The point here is that it makes a difference, and you actually need to check your quantifications to see whether they make
sense. The other thing that's... yep, sorry, we have a question from Simone: what's the situation when you're dealing with genomes that are not very well curated, in terms of the precision of the results, and is there any benefit in testing different versions of genome annotations? Probably, yes, you would have problems. All these tools, especially the ones that map to the transcriptome, assume that all the transcripts that could be expressed are actually annotated. You could have a problem if you have a not very well annotated genome and there is, for example, another gene that's not annotated sitting in the intron of another gene. If that gene is expressed, you will have a big peak here that you will assign to the intron of this gene, whereas it would actually come from the unannotated gene. So it can definitely lead to problems if you have misannotations or incomplete annotations. And I would always recommend looking at plots like this for the genes that actually turn out to be the most important for your analysis, the ones that seem to affect the velocity calculations or the streamlines, the paths that you will get through the data. Right, so the other thing I wanted to show here is the difference between quantifying exons and introns together or completely independently. The top bar here is quantifying them together, basically letting them compete for the reads, and the second line here is quantifying them separately: you quantify first the transcripts and then the introns. You can see that the fraction of unspliced reads is more or less the same, but you're basically double counting a large fraction of the reads, because you first quantify just the exons, and then all these reads will be exonic, and then you quantify just the introns, and then all these reads will be intronic.
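To make the double-counting point concrete, here is a toy sketch in Python (made-up read and feature coordinates, not any tool's actual algorithm). A read spanning the exon/intron boundary is counted once in an exons-only run and once again in an introns-only run, while a joint quantification counts each read once:

```python
# Toy illustration of why quantifying exons and introns in two independent
# runs can double-count reads.

exon = (100, 200)
intron = (200, 300)

def overlaps(read, feature):
    """Half-open interval overlap test."""
    return read[0] < feature[1] and read[1] > feature[0]

# Three reads: one exonic, one intronic, one spanning the exon/intron boundary.
reads = [(120, 180), (220, 280), (180, 220)]

# Separate runs: each run sees only one feature type.
exon_only = sum(overlaps(r, exon) for r in reads)
intron_only = sum(overlaps(r, intron) for r in reads)

# Joint run (simplified): a read compatible with both feature types is
# counted only once, instead of once per run.
joint = len([r for r in reads if overlaps(r, exon) or overlaps(r, intron)])

print("separate total:", exon_only + intron_only)  # 4: boundary read counted twice
print("joint total:", joint)                       # 3: each read counted once
```

The real quantifiers of course work on alignments and UMIs, but the accounting issue is the same: two independent runs cannot share reads between the feature types, so reads compatible with both are counted in each run.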
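The "collapse" versus "separate" intron definitions can also be illustrated with a small sketch (hypothetical exon coordinates on one gene; real pipelines work from full genome annotations). A region that is intronic in one isoform but exonic in another is an intron under the separate definition but not under the collapse definition:

```python
# Contrast the two intron-definition strategies: "collapse" (union all exons
# across isoforms first; introns are whatever is left) vs "separate" (introns
# taken per transcript, so a region can be exonic in one isoform and intronic
# in another).

def collapse_intervals(intervals):
    """Merge overlapping/adjacent [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def introns_between(exons):
    """Gaps between consecutive (sorted, merged) exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

# Two isoforms of one gene; the second isoform's last exon covers what is
# an intron in the first isoform.
tx1 = [(0, 100), (200, 300), (400, 500)]
tx2 = [(0, 100), (200, 500)]   # (300, 400) is exonic in this isoform

# "Collapse": union exons across isoforms, then take the gaps.
collapse_introns = introns_between(collapse_intervals(tx1 + tx2))

# "Separate": introns per transcript, then pool (deduplicated).
separate_introns = sorted({iv for tx in (tx1, tx2)
                           for iv in introns_between(collapse_intervals(tx))})

print("collapse:", collapse_introns)   # (300, 400) is NOT an intron
print("separate:", separate_introns)   # (300, 400) IS an intron (from tx1)
```

Reads falling in the (300, 400) region would therefore be purely exonic under the collapse definition but potentially unspliced under the separate definition, which is exactly the difference visible in the gene shown above.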
Another way in which these methods differ, which you also heard about yesterday, is how they deal with overlapping features, or reads that map to multiple features. This is an interesting gene: here we have a gene on the negative strand, and it actually overlaps with another gene (this slightly paler red here is a different gene on the same strand). Maybe this is a wrong annotation, maybe it's actually the same gene, but in the annotation it's given as a different gene with a different name. Almost all these reads will actually map to both of these genes, and almost all the tools except for alevin will just throw these reads away, saying we don't know where they come from. Alevin, on the other hand, will use the EM algorithm, as you heard yesterday, to try to figure out where these reads come from. I think in this case it seems to make sense that a lot of the reads actually come from this gene, because these reads here don't really fit with this one. But that's another way in which the methods differ: how they handle overlapping genes, or reads mapping to multiple genes. A related thing, specific to kallisto|bustools, is that it doesn't take the strandedness of the reads into account. If you have a situation like this, with a gene on the positive strand and another gene on the negative strand, and you don't account for the strandedness of the reads, all these reads here will again be considered ambiguous, since they overlap two genes, and they will be thrown away. That's why kallisto|bustools doesn't assign these reads anywhere but just tosses them away, whereas the other methods, which account for the fact that this data is stranded, will say the reads can't come from this gene, because then they would have been on the other strand.
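As a rough illustration of the EM idea (a deliberately simplified two-gene model, not alevin's actual implementation), reads that map uniquely to one of the two overlapping genes anchor the abundance estimates, and the ambiguous reads are then split in proportion to those estimates instead of being discarded:

```python
# Minimal EM sketch for resolving reads ambiguous between two overlapping
# genes. Counts are made up for illustration.

unique_A, unique_B, ambiguous = 90, 10, 50  # reads per compatibility class

theta_A, theta_B = 0.5, 0.5  # initial relative abundances
for _ in range(100):
    # E-step: split the ambiguous reads by the current abundance estimates.
    amb_A = ambiguous * theta_A / (theta_A + theta_B)
    amb_B = ambiguous - amb_A
    # M-step: re-estimate the abundances from the assigned reads.
    total = unique_A + unique_B + ambiguous
    theta_A = (unique_A + amb_A) / total
    theta_B = (unique_B + amb_B) / total

print(round(theta_A, 3), round(theta_B, 3))
```

With 90 reads unique to gene A and 10 unique to gene B, the estimates converge to 0.9 and 0.1 here, so the 50 ambiguous reads end up mostly assigned to A, matching the intuition that the more strongly supported gene should receive most of the shared reads, rather than all of them being thrown away.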
And finally, one thing that's maybe specific to droplet data: in 10x data, all the reads should ideally come from the 3' end of the transcripts, which means that in principle, instead of indexing transcripts and introns, we could imagine indexing spliced transcripts and unspliced transcripts, so not just the introns but the entire unspliced transcripts. It seems that for droplet-based data that's a little too hard to estimate: it's pretty difficult to say from 3' data alone whether a read comes from the spliced or the unspliced version of an isoform. It's easier to say whether it comes from the exons or the introns. With 3' data there seem to be a lot of genes where the reads just get split more or less half and half, half assigned to the spliced version and half to the unspliced, because there's no real way of telling. This situation may be different for full-length data, like Smart-seq data, where we actually have reads along the entire transcripts, so there it may be easier to index the full spliced and unspliced transcripts.
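The spliced/unspliced ambiguity of 3'-biased reads can be seen in a toy example (made-up sequences): a read lying entirely within the terminal exon matches both the spliced and the unspliced transcript, so it would have to be split between them, whereas a read crossing the splice junction or extending into the intron is informative:

```python
# Toy illustration of why 3'-end reads often cannot distinguish the spliced
# from the unspliced transcript: both share the terminal exon.

exon1, intron, exon2 = "AAAA", "TTTT", "CCCC"
spliced = exon1 + exon2            # "AAAACCCC"
unspliced = exon1 + intron + exon2  # "AAAATTTTCCCC"

def compatible(read, transcript):
    """Simplified compatibility: exact substring match."""
    return read in transcript

for read in ["CCCC",    # pure terminal-exon read: matches both
             "ACCC",    # spans the splice junction: spliced only
             "TTCC"]:   # extends into the intron: unspliced only
    hits = [name for name, seq in [("spliced", spliced),
                                   ("unspliced", unspliced)]
            if compatible(read, seq)]
    print(read, "->", hits)
```

Since 3' protocols concentrate reads in the terminal exon (the first case), most reads are uninformative and get split between the two versions, whereas full-length protocols sample junctions and intron bodies along the whole transcript, which is why indexing entire unspliced transcripts becomes more feasible there.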