I'll still briefly introduce myself, but I'll give a shorter version today. Yesterday I said I'm a statistician working mostly in the biological field: I develop methods in computational biology. Yesterday we saw the first part of my research, which is stochastic modeling of single-cell data in systems biology. Today I'll talk about the second half, which is what I did at the University of Zurich: statistical methods for bioinformatics data. So here I'm actually going to talk about our packages, statistical software tools.

Again, yesterday I said the two main reasons for using a Bayesian approach are latent variables (missing data, basically, due to the measurement error and the measurement process of biological data) and prior knowledge. The same holds today in bioinformatics. Yesterday I showed how we embed priors from other experiments; today it's mostly going to be about other genes: in bioinformatics you analyze a lot of genes, and you can still use some information from the other genes. Just to be clear, I don't always go the Bayesian way (we had a non-parametric approach, for instance, which is not shown here), but I mostly do.

BANDITS is the first method I developed in Zurich. The logo is not particularly fancy; I go for simple, plain logos, and I'm not particularly good at graphics. It's a method in Bioconductor to study alternative splicing. I'll very briefly introduce the biological background, which is a bit embarrassing because you all know more biology than I do, but I'll try not to say anything silly.

In transcription, a molecule of RNA is transcribed from the DNA. This immature, unspliced molecule has both exons and introns. With splicing, the introns are removed, or spliced out, and the exons are joined together. This generates a mature, or spliced, molecule of mRNA, like this one, which typically only has the exons. Now, this process can actually happen in several ways and can therefore generate several distinct mRNAs, or transcripts. For instance, in this case we have all five exons; in this case the third exon was removed, or spliced out, which is called exon skipping; and in this case the fourth exon was skipped. So there are various biological processes that can generate several mRNAs; the details are not particularly important here. The key point is that a single gene can generate several mRNAs that can code for different proteins. This is quite an important, flexible mechanism, because it allows a single gene to generate multiple proteins. But, as usual, these things can go wrong, and this process can be disrupted in some diseases. So it can be interesting to study variations in this process.

And that's exactly what our tool does: we study differential alternative splicing, that is, changes in the alternative splicing patterns of every gene. We normally do that across experimental conditions: healthy versus disease, treatment A versus treatment B, treated versus untreated patients, and so on. So what do we look for in particular, since "differences in alternative splicing" is quite vague? We look for differences in the relative abundance of transcripts. You see here, for instance, three transcripts from a specific gene; we can look at the proportion, or relative abundance, of these transcripts, like 70%, 20%, 10%.
And we look for changes in these relative abundances across conditions: one condition could have 70%, 20%, 10%, another condition 50%, 40%, 10%, for example. This is quite different from canonical differential gene expression, where, first of all, you look at the gene level (you don't care about the specific isoforms, you look at the overall gene abundance) and, secondly, you look at actual abundance, not relative abundance. So there you look for changes in overall abundance at the gene level, while here we look for changes in the relative abundance of transcripts within a gene.

In practice, what kind of data do we use? We use bulk RNA-seq data. Again, bulk means we have an average signal across many cells; this is not single-cell data anymore. And we compare two or more groups.

Mathematically, we would ideally like to observe the expression of the transcripts. In this naive example with three transcripts, we would like to know how many RNA-seq reads are coming from transcripts A, B and C. Unfortunately, that is typically not available, because RNA-seq reads are normally compatible with multiple transcripts. Consider this even simpler example with two transcripts only, blue and red, and some reads aligned on top. Some reads align to positions which are unique and are compatible with one transcript only: this read is only compatible with the blue transcript, these reads with the red one only, so we are confident about which transcript they originate from. But most of the reads come from ambiguous regions and are compatible with both the blue and the red transcript, and this is typically the case in RNA-seq data. For these reads, we don't really know which transcript they are coming from.

You probably know algorithms like Salmon and kallisto, which basically use an EM approach to estimate the amount of these reads coming from the red and the blue transcript, and which eventually give an estimated abundance of the two transcripts. The issue is that there is a lot of uncertainty in those estimates. So if we want to build a differential method on top of them, we have to account for that uncertainty: we could just plug in those estimates, but we would be neglecting the uncertainty, and the inferential results would be affected.

So, to account for this uncertainty, we avoid the quantification step that typically comes from Salmon, kallisto or RSEM, and we instead input the raw data, which is basically the equivalence classes: which transcripts each read is compatible with. In this example, we would input this kind of data: one read is compatible with the blue transcript only; two reads are compatible with the red transcript only; and seven reads are compatible with both blue and red. And so, again, we are going to have missing data, a latent variable approach, because these seven reads are coming from one of the two transcripts, but we don't know which one they're coming from.
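Just to make this input concrete, here is a minimal sketch in R (my own toy representation for the example above, not the actual BANDITS input format) of equivalence classes as sets of compatible transcripts with read counts:

```r
# Toy equivalence classes for the two-transcript example above (a sketch,
# not the actual BANDITS data structure): each class is the set of
# transcripts a read is compatible with, plus the number of such reads.
equiv_classes <- list(
  list(transcripts = "blue",           count = 1),  # unique to blue
  list(transcripts = "red",            count = 2),  # unique to red
  list(transcripts = c("blue", "red"), count = 7)   # ambiguous reads
)
```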
And so we again use a latent variable approach with data augmentation, which is what I described yesterday: we treat the allocation of these seven reads as a parameter. So we have enlarged our parameter space; the allocation of these seven reads is now a parameter that we're going to sample. In the MCMC we will sample the allocation of these reads to their transcript of origin.

Let's consider the general case. We're considering one experimental condition, say the healthy samples, and one specific gene. This gene has K transcripts, and we have N biological replicates, so N samples from a condition like the healthy samples. We assume that the counts at the gene level are distributed across the K transcripts following a multinomial distribution: this is count data, and sample i has n_i counts for this gene. The key parameter of inference is the pi, which tells us the relative abundance of the transcripts in this specific sample.

Again, like yesterday, we have multiple samples: in a typical three-versus-three group comparison we have three samples, or five; normally it's quite a small number. We don't assume that this parameter vector is the same across those samples, and we again use a hierarchical model: each sample gets a distinct parameter vector, but there is a common prior, which is a Dirichlet. Thanks to this (and I'm saying this again in case you were not present yesterday or missed this part), a hierarchical model gives flexibility, because parameters can vary between samples, but there is also sharing of information.

Now we have these hyperparameters, the group-level parameters, the deltas, which can be interpreted as follows. The summation of the deltas is the so-called precision parameter, and it tells us the degree of variability between samples: it governs the sample-to-sample variability. And this pi-bar, which is basically the normalized deltas, represents the relative abundance of the transcripts, not anymore at the level of the individual biological replicate, but at the group level: the average relative abundance at the group level. The precision tells us how much this average relative abundance changes from biological replicate to biological replicate.

And here we get to the informative prior part. These parameters are obviously different for every gene, but there is going to be similarity across genes. In particular, for this precision parameter we can get an initial estimate from another model (not our model) fitted on our data. We get that estimate for all genes with at least two transcripts, so thousands of estimates, and we can then compute a mean and a standard deviation across them: that will represent our informative prior for the precision parameter. In practice, to be a little bit technical, we do this on the log scale: we take the log of the precision parameter, get many estimates, compute a mean and a standard deviation, and that allows us to formulate an informative prior.
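In formulas (my own notation, reconstructing what was just described), the hierarchical model for one gene with K transcripts and N replicates reads:

```latex
% Counts of gene-level reads in sample i, split across the K transcripts:
X_i \mid \pi_i \sim \mathrm{Multinomial}(n_i, \pi_i), \qquad i = 1, \dots, N,
% sample-specific relative abundances, with a common Dirichlet prior:
\pi_i = (\pi_{i1}, \dots, \pi_{iK}) \sim \mathrm{Dirichlet}(\delta_1, \dots, \delta_K),
% precision (sample-to-sample variability) and group-level abundances:
\delta_0 = \sum_{k=1}^{K} \delta_k, \qquad \bar{\pi}_k = \delta_k / \delta_0,
% informative prior, mean and sd estimated across thousands of genes:
\log \delta_0 \sim \mathcal{N}(\hat{\mu}, \hat{\sigma}^2).
```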
This is a mildly empirical Bayes approach. Empirical Bayes means that you use your data to formulate an informative prior and then you analyze your data; so, in a way, that's slightly wrong, because you analyze your data twice. That is what we're doing here, but it's extremely mild, because every gene contributes only a very small fraction of these estimates: the prior is based on thousands of genes, so each gene gives a very, very minor contribution.

We then use a sampling scheme similar to what I showed yesterday, called Metropolis-within-Gibbs, which sounds fancy but is really simple. You split the parameters into blocks, because you simply cannot update all the parameters in one shot, and you update each block at a time, conditional on the rest. It's something very natural. Our three main blocks are the hyperparameters, the hierarchical parameters, and the latent states. For the hyperparameters, we use a Metropolis sampler. For the hierarchical parameters, we use what's called a Gibbs sampler; I'm not sure you've seen this, but it is computationally more efficient because you don't need to sample and then accept or reject: you sample directly from the target conditional distribution, so you always accept what you sample. The same goes for the latent states: they are sampled with a Gibbs sampler. So although we have many parameters, most of them follow a Gibbs sampler and only a few have an acceptance-rejection step, which makes the sampler quite efficient in terms of mixing and convergence.

For the latent states, I just want to give an idea of how the sampling works. Assume you have an ambiguous read compatible with transcripts j and k. We can compute the relative abundance of transcript j, normalized over the relative abundances of transcripts j and k, and the same for transcript k. In each iteration of the MCMC, this ambiguous read is then allocated to transcript j with probability proportional to the relative abundance of transcript j, and to transcript k with probability proportional to the relative abundance of transcript k.
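As a toy illustration (my own R code, not the actual BANDITS C++ sampler), here is what one such iteration could look like for the two-transcript example, with the latent-state and hierarchical Gibbs steps; the Metropolis update of the hyperparameters is omitted:

```r
rdirichlet1 <- function(alpha) {          # Dirichlet draw via normalized Gammas
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

delta         <- c(5, 5)                  # current Dirichlet hyperparameters
unique_counts <- c(blue = 1, red = 2)     # reads unique to each transcript
ambiguous     <- 7                        # reads compatible with both
pi_i          <- rdirichlet1(delta)       # current relative abundances

for (iter in 1:10) {
  # 1) Latent states (Gibbs step): allocate each ambiguous read to a
  #    transcript with probability proportional to the current abundances.
  to_blue <- rbinom(1, size = ambiguous, prob = pi_i[1] / sum(pi_i))
  counts  <- unique_counts + c(to_blue, ambiguous - to_blue)

  # 2) Hierarchical parameters (Gibbs step): the conjugate full conditional
  #    pi_i | delta, counts ~ Dirichlet(delta + counts), sampled directly.
  pi_i <- rdirichlet1(delta + counts)

  # (The hyperparameters delta would also be updated, with a Metropolis
  #  step; omitted here.)
}
```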
I'll take a question. The question (sorry, it's a bit noisy) is whether different genes have different samplers: is each gene analyzed separately, or all at once in one big sampler? The mathematical construction is the same: a multinomial, a Dirichlet prior, the informative prior, the sampling in blocks. But each gene is analyzed separately. What I showed is for a single gene: a single gene has a multinomial across its transcripts.

We didn't talk about the multinomial in our class yet, right? So, just briefly: a binomial is when you have two possible outcomes, each with its probability; a multinomial is when you have several possible outcomes, each with its own probability. Why is it a multinomial here? Because you can have more than two transcripts: K is general, from two up to a possibly large number of transcripts in a gene. Maybe I should have clarified this earlier.

Obviously, we don't do this for genes that don't display alternative splicing. Some genes are associated with a single transcript, and in that case there is no alternative splicing, right? We only analyze genes with at least two transcripts, or three, four, as many as the gene has; in those cases we can study differences in alternative splicing, and with a multinomial you can handle more than two.

Then there was a question about the hierarchical aspect of the model. So, conceptually, forget about what we're looking at specifically and let's take it generally. You have parameters; in this case the parameters are the pi's, but it doesn't really matter. We have biological replicates, and we don't assume that every biological replicate has the same parameters; that would be unrealistic. So we allow each sample, each replicate, to have a distinct parameter, and that's why the index i represents the replicate. But, on the other hand, we also don't analyze each replicate independently: that would allow a different value for every replicate, but it would be silly, because we would be analyzing them separately. Instead, the replicates share information, and the way they share information is via the hyper-prior parameters, the deltas. The pi_i are the sample-specific parameters, and the deltas are the group-level parameters.

In particular, getting to the example, these are relative abundances. Qualitatively, you can take it like this: samples don't all have exactly the same relative abundance of transcripts; there will be biological variation. Even if a transcript is dominant, maybe in one sample it's 80% dominant, in another sample it dominates with 70% of the reads, in another with 90%. There is some variability. And at the group level you can think of the average relative abundance, say 80% as the average relative abundance of the first transcript, with some variation from sample to sample. Is that clear? Conceptually, it's the same thing as a mixed model in frequentist statistics: you allow for multiple samples and for variation between them. I personally prefer hierarchical models, conceptually, because I think you can see the hierarchy more clearly.

There was also a question about the Dirichlet distribution: not why we use it, but what it is. Well, it's just a distribution. Now, why are we using the Dirichlet? The pi's are between zero and one, and the Dirichlet is a distribution that gives values between zero and one. You need a distribution that reflects your data: these are probabilities, between zero and one, and they sum to one, and the Dirichlet is a distribution that gives exactly those kinds of values.
The second advantage is that the Dirichlet is conjugate to the multinomial, meaning (this is a bit technical) that the conditional distribution of pi, given delta and x, is again a Dirichlet. So if we use a Dirichlet prior, we know this conditional distribution exactly and we can sample from it straight away. It's a conjugate, convenient choice, because it simplifies the algorithm: part of the computation becomes analytic, so this part goes much faster, because you don't have to rely on accept-reject simulation. So it's a realistic choice, one of several plausible choices, but this one has the advantage that the solution for this part of the model is analytic.

Then there was a question about the Metropolis-within-Gibbs scheme: why block the parameters into these three layers? Well, each block of parameters is sampled from a conditional distribution. Eventually you want to target the full joint distribution of the parameters, but that's quite hard to target directly, so you normally split it into blocks that are easier to sample from. In this case the split is natural: we sample the hyperparameters, which have a specific conditional distribution; then the hierarchical parameters, whose conditional distribution is a Dirichlet; then the latent variables, whose conditional distribution is a multinomial.

Is that per sample? Well, this is a bit schematic; in practice it's more complex. This is the grand scale, three categories, but within each block every sample has its own distribution. Each sample has a Dirichlet distribution, and we have n biological replicates, so this step is done n times: we sample those values n times. And the latent states are actually multiple multinomials, because every equivalence class has a multinomial distribution; but, conditionally, they are all independent, so you sample them one after the other.

Now, all of this is done for a specific gene, but we always have multiple genes, thousands of genes. In Bayesian statistics, you may have seen that afterwards you want to make sure that your posterior chains have converged, that they look good, that they mix well, that they converged to a reasonable place. Normally you would somehow look at the posterior chains of your parameters, or, if you have many parameters, at summaries like the mean and so on. But here we have thousands of genes. And, by the way, this is a tool for users: there is no way users who don't know Bayesian statistics can look at thousands of trace plots; that is obviously unfeasible. So we use an automatic convergence test that automatically assesses whether the chains have converged or not.
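For instance (and which exact diagnostic BANDITS runs internally is an assumption on my part), the coda package offers automatic tests such as the Heidelberger-Welch stationarity test:

```r
# Sketch of an automatic convergence check on a posterior chain
# (which specific test BANDITS uses internally is an assumption here).
library(coda)

set.seed(1)
chain <- mcmc(rnorm(5000, mean = 0.7, sd = 0.05))  # toy posterior chain for a pi
heidel.diag(chain)  # automatic pass/fail stationarity test, no trace plots needed
```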
To be clear, these automatic convergence tools are not particularly accurate; they are just better than nothing, because there is no alternative here, and they ensure that the chains don't completely fail. But, in general, don't take this as a fantastic method for assessing convergence: the best way to do it is still visual inspection.

Here we also have a computational challenge, because this is a Bayesian hierarchical model with latent states, and we want to run roughly ten thousand of these models. In practice, this can become very, very heavy from a computational perspective. So we run the method in parallel (each gene runs in parallel), and we coded the algorithm in C++, because this is an R package but R is terribly slow for this kind of loop. So the key part, the MCMC, is coded in C++. On average, every gene takes less than one second; aggregating all genes, we're talking about roughly one hour on a laptop. That is slow by bioinformatics standards, but for a Bayesian method, from a statistical perspective, I think it's absolutely acceptable.

Now, a tiny detail: transcripts have different lengths. You see here, not all transcripts have the same length. Longer transcripts will have more reads, simply because it's more likely that reads align there. This is not an issue per se, because we look at relative abundances across conditions, so it affects all conditions equally. But when we compute the relative abundance of the transcripts and test that parameter, we are actually testing the wrong thing: we don't want to test how many reads are compatible with each transcript, we want to test the relative abundance of the actual biological mRNAs. To go from one to the other, we just divide by the effective length of the transcripts, and so we go from the proportion of reads that map to each transcript to the proportion of actual biological transcripts. So we go from pi to what I call pi^t, and we test this new parameter, which accounts for the length. It's a marginal thing, but it gives a bit of an improvement in performance, which is quite logical, because we are now testing the right biological object.
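In formulas (my notation), with L_k the effective length of transcript k:

```latex
% From the proportion of reads mapping to transcript k
% to the proportion of actual biological transcripts:
\pi^{t}_{k} \;=\; \frac{\pi_{k} / L_{k}}{\sum_{j=1}^{K} \pi_{j} / L_{j}},
\qquad k = 1, \dots, K.
```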
So now we want to do a test and compare two conditions. Say we have run the previous model on two conditions, A and B, healthy and disease, and we want to test between them. How do we do that? We consider the difference in the group-level relative abundances between A and B. Under the null hypothesis, this difference is zero, because there is no difference in the splicing patterns; under the alternative, there is some difference. Now, I don't know if you've seen this aspect, but statistical testing from a Bayesian point of view is actually very challenging; for me it's one of the main drawbacks of Bayesian statistics. Normally people use Bayes factors, but, as I was saying yesterday, the computation of Bayes factors is tricky: it can be very unstable, and it's very hard to get good estimates of the Bayes factor. So here we did something extremely unusual: we basically went back to the frequentist world.

We take these differences, the omegas, and we approximate them with a normal distribution whose mean and covariance are obtained from the posterior chain. Somewhat surprisingly, we noticed that this difference is actually approximated really well by a normal distribution. And once you have approximated the difference by a normal, it's very easy to use a multivariate Wald test, which is just the multivariate version of the classical Wald test. This gives us a gene-level test. We then do exactly the same thing on the individual transcripts, looking at the difference in just one of the omegas, one transcript, which gives us a transcript-level test as well.

What's the advantage? The gene-level test tells you whether there are differences in the splicing patterns of the gene overall. But once you have found that a gene has differences in its splicing patterns, you probably want to know which transcripts are actually involved, and that is what the transcript-level test tells you: which individual transcripts are changing between conditions. In some genes you may have a large number of transcripts, and only some of them change.
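As a sketch of this testing step (toy R code with hypothetical names, not the exact BANDITS parametrization; dropping one component of the difference is my own fix for the fact that proportions sum to one):

```r
# Toy multivariate Wald test built from posterior draws: `omega_A` and
# `omega_B` are MCMC draws (iterations x K) of the group-level relative
# abundances in the two conditions.
wald_from_chains <- function(omega_A, omega_B) {
  d <- (omega_A - omega_B)[, -1, drop = FALSE]  # drop one component: the K
                                                # proportions sum to 1, so the
                                                # covariance would be singular
  m    <- colMeans(d)                           # posterior mean of differences
  S    <- cov(d)                                # posterior covariance
  stat <- drop(t(m) %*% solve(S, m))            # Wald statistic
  c(stat = stat,
    p = pchisq(stat, df = length(m), lower.tail = FALSE))
}

# Toy usage: 1000 posterior draws of 3 transcript proportions per condition.
set.seed(1)
omega_A <- t(apply(matrix(rgamma(3000, c(7, 2, 1)), ncol = 3, byrow = TRUE), 1,
                   function(g) g / sum(g)))
omega_B <- t(apply(matrix(rgamma(3000, c(5, 4, 1)), ncol = 3, byrow = TRUE), 1,
                   function(g) g / sum(g)))
wald_from_chains(omega_A, omega_B)
```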
Having explained the method, let me show how we assess its performance. We've done several simulation studies; I'm going to show the main one here. The way we simulate data is to make it as realistic as possible, as close as possible to real data, while introducing a ground truth: a real differential effect that we try to recover. So we basically infer parameters from real data and then simulate from these parameters. We simulate at the read level: we don't simulate from the multinomial distribution, which would be too naive; we actually simulate RNA-seq reads, and then we go through the alignment and quantification procedure, aligning to a reference genome or transcriptome and obtaining our input data, the equivalence classes. This way, the simulation contains the mapping uncertainty that we want to model. We compare two groups of six samples, where we have artificially introduced a difference in the alternative splicing pattern of one of the two groups, and we check how well the method recovers that difference.

We benchmark against several competitors, because our method is not the only one that does this kind of analysis; you see the list here, with BANDITS in green. I'm not sure you're familiar with this kind of plot: it's similar to a ROC curve. On the Y axis you have the true positive rate; you want to recover as many true positives as possible. On the X axis you don't have the false positive rate but the false discovery rate, which tells you how many false discoveries you have in your set of significant genes; you want it to be as small as possible. So you would like to be towards the top-left corner. BANDITS, in green, has good control of the false discovery rate: the circles are close to the theoretical thresholds, or conservative; they shouldn't be more extreme than the nominal levels. And, at the same time, we have a higher true positive rate, meaning we can recover more real differential splicing patterns than other methods.

We see pretty much the same for the transcript-level test. This plot was for the gene-level test; for the transcript-level test we again get good control of the false discovery rate and a good true positive rate. You see there are fewer methods here, because not all methods allow a transcript-level test; several only do a gene-level test.

Lastly, we also looked at a real dataset. Real datasets are good because they are realistic, but they don't have a ground truth, so it's hard to evaluate whether a method does well or not. Thankfully, in this dataset there was a set of 82 genes which had been validated in the lab, so we are confident that these 82 genes really display differences in their alternative splicing patterns. We've done something like a ROC curve, which is of course very rough, because we're treating these 82 genes as true positives and the remaining genes as negatives, which is wrong; but it still gives an idea of how well we can recover these 82 genes. And again, for a given false positive threshold, we get, in green, a higher true positive rate. I don't really like showing off with these kinds of plots, but it's an indication that the method is doing well and recovering the right genes.

Obviously, this is because the method accounts for the variability between samples, via the hierarchical model, but importantly also for the uncertainty in the mapping, the technical variability. And this comes at a cost: running thousands of full MCMC models with latent variables is computationally expensive. If you look here, in blue you have the cost of the pipeline that comes before, when you run Salmon, kallisto, RSEM and so on; you see that most of the time that is actually the major cost of the analysis. In red you have the cost of the differential method itself. If we zoom in and consider the methods that share the same input, we are the slowest among them, but the computational cost, at least from my perspective, is still reasonable: it takes one hour on a laptop to do a differential analysis. The other methods just input the estimates; they don't do this allocation of the latent variables, which is the costly part.

Wrapping up briefly: we've developed this method to detect the genes and transcripts that display differences in alternative splicing from bulk RNA-seq data. I didn't mention this, but the method works both with alignment to the transcriptome, with Salmon or kallisto, and with alignment to the genome, with STAR. The key feature is that we mathematically model the uncertainty in the mapping as well as the variability between samples. We correct for differences in the lengths of the transcripts, and we do testing at the gene and at the transcript level. It has good performance, but it's obviously slower than other methods, and it does not allow for covariates like batch effects. But most of the time batch effects don't affect the relative abundance of transcripts, because they normally affect overall abundance at the gene level.
And this doesn't play a role when you look at relative abundances, so it's unlikely to have an effect on the relative abundance of transcripts.

Now, I don't want to run late, but if there are questions about BANDITS, I'll take them now. There was a discussion in the chat between Suzanne and Vandril, but I think it has been solved between them. I don't know, Vandril, if you want to ask; otherwise, go ahead. (Excuse me, sorry, I was struggling with my mute button.) There was a question in the room: is one sample a batch? So, a batch effect is when you have a batch of samples that were, for instance, sequenced in another run or something like this; that's one kind of potential latent effect. And it could be that you sequence one sample at a time, in which case, if you are sequencing them in completely different conditions, that might not be ideal. But for this particular problem, apparently, that's not too problematic, because the sort of batch effects you would expect are not the ones that affect what you are looking at. Right, Simone?

Yes. By batch effect we mean that if you have, say, three samples analyzed in one batch and three samples in a different batch, you will normally observe systematic differences between them, and sometimes you want to correct for those. What I was saying is that these differences normally affect gene expression, not the relative abundance of transcripts. Second thing: if you analyze every sample in a different batch, that does not introduce a bias, but it increases the variance. So let's distinguish between bias and variance. When you analyze three samples in one batch and three samples in a different batch, that introduces a bias between the groups. But if every sample is in a different batch, you're not biasing your analysis; you are, however, definitely increasing the variance, because now the difference between samples is not only the biological difference, it also includes a batch difference. So it's not biasing the results, but it's not ideal, because you have a lot of variance that is kind of unnecessary.

Other questions? I have a small question. You account for differences in length between the transcripts, but have you also looked at other potential effects that may affect mappability? I know that GC content may sometimes play a role, or something like that. If we had this information, could it be plugged in? Sorry, what did you mention? I couldn't hear. GC content; it sometimes plays a role. No, I haven't accounted for that. An implicit assumption of the model is that you have uniform coverage; obviously it drops towards the ends, but otherwise coverage is assumed uniform. I haven't accounted for other effects. I think it would be complicated and quite hard: you need extra information, and you also need a good model, because if you model it wrong, it's better not to model it at all. And a good model for this is probably not that easy to formulate. But no, I haven't even looked at it, to be honest. Okay, thank you.

So, I'm going to go through another package we've developed, which we haven't published yet; we'll have a preprint, probably in winter. Conceptually, it's very similar to BANDITS, but it extends it.
The package is called DifferentialRegulation. What it does is take the idea of BANDITS and use it in a different context, on single-cell data for now, but also bulk data later, to do a different kind of biological analysis. Obviously we had to make some modifications, so it's a similar mathematical model, with a slight extension.

Again, a brief introduction to the biology here. In transcription, as we've seen, you transcribe an unspliced, immature mRNA, and then, with splicing, you get a mature, spliced mRNA, which of course can later degrade. Now, in bulk data we actually have good resolution on the individual transcripts, the spliced mRNAs. In single-cell data, at least in droplet single-cell data, resolution at the transcript level is quite low, but you can still distinguish, at the gene level, between the unspliced and the spliced mRNAs. So you don't know which transcript they come from, but you can still separate the mature from the immature mRNAs.

You're probably wondering why this is useful if you can't distinguish the transcripts. Because it is still informative: it can give information about the direction of gene expression. The most notable example of this is RNA velocity. In RNA velocity, you use this information to study the derivative of the spliced mRNA; in mathematics, the derivative tells you the change you will have, in this case in spliced mRNA, in the near future. The assumption behind this is that if you have a higher unspliced proportion than you would expect at equilibrium, the gene is probably being up-regulated, because that large fraction of unspliced mRNA will be spliced and will increase the spliced mRNA by more than the decrease due to degradation. Vice versa, if you have a lower unspliced relative abundance than you would expect at equilibrium, the gene is probably being down-regulated, because the newly spliced mRNA will not fully compensate for the large amount of spliced mRNA that is going to be degraded. So, very simply, looking at the relative abundance of spliced and unspliced mRNAs gives you a hint about what's happening in the near future: whether gene abundance is going up or down.

We take this idea and use it to compare experimental conditions: we check for differences in the relative abundance of unspliced mRNA between conditions. If a condition has a higher relative abundance of unspliced mRNA, we speculate that, in that condition, the gene is being up-regulated, not in absolute terms, but compared to the other condition. Again, this is different from differential gene expression, which would simply look at differences in the overall abundance of spliced mRNA; here we look at differences in the near-future change, in the direction of the spliced abundance, whether it's going up or down.

And since we have single-cell data, we can group similar cells into clusters, typically cell types, and do the analysis on every cell cluster separately. So we cluster cells and then identify changes in regulation that are specific to a cell type. And this, for instance, is a very sketched output of what you would get.
In this case, for instance, we're comparing three- and six-month brain organoids, and we can see that this particular gene, in this particular cluster, has a higher relative abundance of unspliced mRNA; so we would speculate that it's being up-regulated at six months compared to three months. Again, it's not absolute up- or down-regulation, it's always in relative terms between two conditions.

Now let's get into the mathematical, or better, technical aspects. How do we get these spliced and unspliced reads? We can use pseudo-aligners like alevin, alevin-fry, or kallisto|bustools, which are basically the single-cell analogs of Salmon and kallisto. But, just like in bulk data, we have uncertainty, and here to a larger extent, because in droplet data we have mapping uncertainty on two levels. First, we have reads that are compatible with both the spliced and the unspliced version of a gene: within a single gene, you're not sure whether those reads are spliced or unspliced, because the spliced and unspliced versions of a gene have parts in common, exons in common. In our analyses, those represent some 5 to 20% of the reads. Second, while in bulk data you rarely have reads that map to multiple genes, here a significant fraction of the reads map to multiple genes, because you only really observe a short portion of each molecule. So very often you have multi-mapping reads across different genes, and those amount to roughly 5 to 40% of the reads. So there is a lot of uncertainty in these estimates.

Once again, we try to account for this mapping uncertainty, and we do it in two different ways. For the mapping uncertainty across genes, we again use a latent variable approach: we sample the gene that each read comes from. For the mapping uncertainty within a gene, spliced versus unspliced, we considered something similar, but that is actually quite challenging, because the probability that a read is spliced or unspliced is very hard to estimate: it depends on several factors that are not known. For instance, coverage is not uniform; a gene can have multiple spliced and multiple unspliced transcripts; those have different lengths; and, importantly, you don't know their abundances, because you only know things at the gene level. So this probability, the probability that an ambiguous read is spliced or unspliced, is unknown.

We still tried the latent variable approach, assuming, very roughly, that this probability is 50%: ambiguous reads have a 50% probability of being spliced and 50% of being unspliced. This is obviously very rough. Alternatively, we tried a different approach where we artificially enlarge the parameter space. In biology, reads are either spliced or unspliced; but from a technical point of view, reads are spliced, unspliced or ambiguous. So we model this trivariate vector, which accounts for the technical aspect. We've seen that this approach works better, which is also intuitive, because assuming a 50% probability without any idea of what that probability actually is is, well, trivially a rough choice. So in the end we go for this option, and we enlarge the parameter space, considering the relative abundance of spliced, unspliced and ambiguous reads.
And so, in this case, we can use a Dirichlet-multinomial like before; but, as I said, this is an extension, so we have two models. One model for every gene, which is basically like in BANDITS: a Dirichlet-multinomial over the relative abundances of the unspliced, spliced and ambiguous reads within the gene. (By the way, they're ordered this way simply because it reads as USA; it's a bit catchier.) Then, additionally compared to BANDITS, you have another model, a multinomial, which gives the relative abundance of the genes. We're not interested in that model per se; we're interested in these within-gene relative abundances. But it is needed because of the latent variable allocation that we do across different genes.

Again, let's make an example. The latent variable allocation is similar to what I showed before, but it accounts for these two pi vectors. If a read is compatible with the spliced version of gene g and the unspliced version of gene j, then it's allocated to the first choice with probability proportional to, what? The relative abundance of gene g times the relative abundance of the spliced version within gene g. And it's allocated to the second choice with probability proportional to the relative abundance of gene j times the relative abundance of the unspliced version within gene j. So it's basically the product of the two probabilities, normalized to sum to one, and you do this iteratively. Conceptually it's the same; the mathematics is just a little bit more difficult, because you have two layers instead of one.
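Here is a toy numerical version of that allocation (my own code and numbers, just to make the product-of-probabilities idea concrete):

```r
# Toy two-layer allocation: a read compatible with the spliced version of
# gene g and the unspliced version of gene j (hypothetical toy values).
pi_gene <- c(g = 0.6, j = 0.4)                    # gene-level relative abundances
pi_usa  <- list(g = c(U = 0.2, S = 0.7, A = 0.1), # within-gene USA proportions
                j = c(U = 0.5, S = 0.4, A = 0.1)) # (Unspliced, Spliced, Ambiguous)

p1 <- pi_gene["g"] * pi_usa[["g"]]["S"]           # spliced version of gene g
p2 <- pi_gene["j"] * pi_usa[["j"]]["U"]           # unspliced version of gene j
probs <- c(p1, p2) / (p1 + p2)                    # products, normalized to sum to 1

set.seed(1)
sample(c("g (spliced)", "j (unspliced)"), size = 1, prob = probs)
```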
How do we proceed with the pipeline, now that we have single-cell data? We first group cells into clusters; as I said, we identify cell-type-specific changes. But then, within each cluster, we aggregate the signal across all cells, because our model does not distinguish between cells: we compute pseudo-bulk measurements in every cluster, and then run the inference in every cluster separately. Just like before, we have a dispersion (precision) parameter, and again we use an informative prior obtained from all the other genes; again, a mildly empirical Bayes approach. Then we proceed pretty much like in BANDITS: we look for differences in this vector of unspliced, spliced and ambiguous relative abundances, and we test with a multivariate Wald approximation on its posterior. So conceptually, as I said, it's a similar model; to clarify the differences, here we have a double mapping uncertainty, hence the two-layer model, and the analysis framework is quite different.

Another thing which we haven't done yet, but is ongoing: so far we applied the method to single-cell data, identifying gene-level changes that are cell-type-specific, in specific populations of cells. We're now extending this to also work on bulk data. The interesting thing is that there you can go deep in a different direction: in bulk data you have an average signal across many cells, so you cannot identify changes in specific cell types, but you can identify changes that are specific to transcripts. You can identify the individual transcripts that show differential regulation, differences in regulation.

We've done rough benchmarks on this one, comparing against the two other methods that do this kind of analysis on single-cell data. Honestly, I didn't find many; I only found these two, and both of them ignore the mapping uncertainty, which is, again, the whole motivation of our work. We proceed similarly to what I showed before: we take a real dataset, so we have real observations, split it into two groups, and artificially introduce a difference that we then try to identify. We actually do this in two ways: we introduce a difference in regulation, and also a difference in gene expression that we do not want to detect, because it's a nuisance difference. Then we simulate at the read level with minnow, again in order to have mapping uncertainty, and we try to recover our parameters.

Similar plots as before: true positive rate versus false discovery rate, and you want to be in the top-left corner. We are again in green. (That's by chance, but apparently I tend to have green methods; BANDITS was also in green, I hadn't noticed.) You see that the other methods show, first of all, a strong inflation of the false discovery rate: they aim to provide 1, 5 and 10% false discovery rates, but eventually they get a much larger amount. We roughly control the false discovery rate and, at the same time, have a higher true positive rate, meaning we can recover more truly differential genes.

Wrapping up, roughly: we propose a method similar to BANDITS, but now with two sources of mapping uncertainty, and therefore this double model. We're developing a double analysis framework: for single-cell data, cell-type-specific changes at the gene level; for bulk data, transcript-specific changes, but across all cells. I haven't shown a computational benchmark, but obviously we have a speed disadvantage compared to methods that just input estimates and don't account for the mapping uncertainty. And, just like BANDITS, we don't account for covariates; but, once again, we're looking at relative abundances within genes, so it's unlikely that covariates or batch effects play a role.

Maybe I can stop again here, if there are a couple of questions on this one, and then I'll show a hint of preliminary work on proteomics. I can also slow down; I was rushing because I was expecting a lot of questions. Are there questions on this one? No, no questions in the room; all good.

All right. This next part is actually kind of a job advertisement, although I don't actually have funding. I'll talk about a project aimed at integrating proteomics and transcriptomics; this is, again, a statistical method that we will release in an R package. I think this is the most interesting method I have ongoing, at least so far. This project was actually awarded a Marie Curie Fellowship, which is a two-year postdoc fellowship, but I cannot accept it because it's incompatible with my current position. So I'm looking for someone who is interested in taking this on and applying for a Marie Curie Fellowship with me as supervisor. If you don't know it, the Marie Curie is basically the European postdoc fellowship.
It's two or three years: two years if you stay in one place, three years if you go abroad for one year. The salary is really high, and there's also a good research budget, basically 1,000 euros per month. Applications are in September, but if you're interested we obviously need to prepare, so in that case let me know by sometime in winter.

So now to the project itself. The aim of the project is to infer proteins. How do you infer proteins? Normally with bottom-up proteomics, where you don't observe proteins directly but peptides, and you use peptides as surrogate markers for the proteins. But, just like in transcriptomics, you have a multi-mapping issue, because most peptides are compatible with multiple proteins; like this peptide here, which is compatible with proteins A, B, C and D. So when you see a signal at the peptide level, you don't really know which protein it's coming from.

This is very similar to the multi-mapping we have in RNA-seq data. But there we can essentially solve the problem and estimate the abundance of transcripts. Why can we not do the same in proteomics? Because the data are of lower quality: there are many fewer peptides than there are RNA-seq reads, and we also have an issue with the peptide intensities, because the relationship between protein abundance and peptide abundance is not linear. In RNA-seq data, if the abundance of a transcript doubles, you can roughly expect its reads to double. Not here: if a protein doubles in abundance, the peptide signal will increase, but not linearly; it won't really double. So we have a lot more uncertainty, and that's why inference at the isoform level is not normally done.

So what do you normally do in proteomics? You work on groups of proteins, typically at the gene level: you do inference at the gene level, identifying which genes are present and estimating abundance at the gene level. What we aim to do instead is inference at the isoform level. How can we do that? The main problem with this kind of data is the data itself, and the way we tackle it is by using more data. By that I don't mean more proteomics data; I mean a different kind of data. In practice, we want to jointly model proteomics and transcriptomics data, because transcriptomics data is a lot more accurate and we can use it as a prior for the proteomics data: mRNAs are translated into proteins, and there is a good correlation, normally 65 to 70%, between protein and transcript abundance. So we can use that to enhance the information we have.

What is our goal? A model that can infer whether an individual isoform is present in a sample and also estimate its abundance. And, being statisticians, we also want to give a measure of the uncertainty of these estimates, which we can do with a posterior probability of the presence of an isoform and a posterior credible interval for its abundance.
Then, in a later stage, we're going to extend this to jointly model multiple samples, basically biological replicates, so that we can average out the biological noise and get a group-level answer. And once we do that, we can easily compare groups of samples (healthy versus diseased, treated versus untreated) and do differential testing, not at the gene level, but at the isoform level. To give a very simple, naive example application: we know that this transcription factor is differentially abundant between some types of melanoma, but we don't actually know which specific isoforms are differentially abundant. Our method would try to answer that question and give a more accurate identification of the isoforms that are differentially abundant.

In mathematical terms, the input data are the abundances of the peptides, and you know which proteins are associated with each peptide; each peptide is associated with one or, typically, more proteins. Then you have the relative abundance of the transcripts, estimated from RNA-seq data. What you want to infer is about the proteins: their total and relative abundance. So we start from the former and want to get to the latter.

I have sketched an implementation with, again, a multinomial-Dirichlet structure, because I think it's a natural structure for this kind of data, but there are alternative ideas which I'm going to consider. Why this structure? Because, again, you can think of the overall signal across all the proteins in your data and distribute it across the protein isoforms, according to the relative abundance of the proteins, which follows a Dirichlet prior. And where does the transcriptomics data enter? It enters in the prior: the prior is very informative, and the information comes from the transcriptomics data. We use the relative abundance of transcripts as an informative prior for the relative abundance of proteins. And then we have the peptide abundances, and we try to allocate them back to the proteins compatible with each peptide, again with a latent variable model. This is not the only model I have in mind; it's the one I sketched because I'm familiar with it. A natural alternative is a mixture model, where the peptide signal is a mixture of the protein signals. That's something we will investigate.
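Since the model is still being designed, the following is only a speculative toy sketch of the core idea, with hypothetical names and numbers throughout: transcript abundances act as an informative Dirichlet prior on protein isoform abundances, and peptide signal is allocated to compatible isoforms via a latent variable.

```r
# Very preliminary toy sketch (my own code; the actual model is still
# being designed, so everything here is an assumption).
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

pi_rna         <- c(0.7, 0.2, 0.1)  # transcript relative abundances (RNA-seq)
prior_strength <- 50                # hypothetical knob: how much to trust RNA
pi_prot <- rdirichlet1(prior_strength * pi_rna)  # protein abundances, a priori

# A peptide compatible with isoforms 1 and 2, carrying 10 units of signal:
# allocate it with probability proportional to current protein abundances.
idx   <- c(1, 2)
p     <- pi_prot[idx] / sum(pi_prot[idx])
alloc <- as.vector(rmultinom(1, size = 10, prob = p))
alloc
```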
Now, in terms of validation: as you've seen, we normally validate these tools in simulations. We simulate data, we have a ground truth, and we check whether we can recover it, whether we get close to the ground truth. That is not possible in this scenario, unless you make a naive simulation, which would be unrealistic and absolutely uninformative, because there is no good simulator for mass spectrometry proteomics data. So, in practice, the realistic simulation happens in the lab, and thankfully we collaborate with biologists in the States who are doing experiments to validate our tool: they collect data for which they have a ground truth, and we can check whether we recover it.

In more formal terms, this is the kind of output our model provides. Here you have a bunch of proteins: you see from the first part of the labels that all of these are isoforms of the same gene, and these are a bunch of isoforms from a different gene. What do we provide? The probability that each isoform is actually present in the dataset (in this case, this one has a high probability and the others a low probability), then an estimate of the abundance, and a posterior credible interval around that abundance, which indicates, for instance, how uncertain we are about the abundance of this isoform.

I'll take the question now and continue later. I wanted to ask if you are aware of ribosome profiling, Ribo-seq, the method that allows you to see which RNAs are being translated at a certain moment. I think that would be a better experimental validation of your model than doing RNA-seq and mass spec in parallel; Ribo-seq could be the best combination, because you do two in one, at the same time, even on the same sample.

Yes, this was mentioned in our discussions. The issue is that I know very little about the biology, and this is very much ongoing; I only really understand the biology once I'm done with that part, and this one is still ongoing. I honestly don't remember, because I have put this project on hold (that's why I'm looking for someone to take it over), how, and whether, we were going to use Ribo-seq data in the project. It was definitely part of the discussion; definitely not in the model, but possibly in the validation. That was a possibility we discussed.

Another question from the room. Can you hear me? Yes, very well. I was wondering whether you think it would be possible to build a universal ground truth for this mass spectrometry data. Say you have a bunch of proteins and you experimentally verify how abundant they are in a certain mixture; then you could use those as prior parameters for your simulations, for those proteins. It would be a kind of database of priors, obtained experimentally, that you could use as a ground truth for your data or your simulations.

I'm not sure this is possible. Maybe I forgot to clarify this, I apologize: the prior is specific to the sample you're analyzing. For every sample, you collect mass spectrometry and RNA-seq data on that same sample, so the prior is sample-specific. Otherwise, I think it would be too general; it wouldn't be informative enough. So no, I don't think you could build a general prior knowledge base that you could always use for all the samples you analyze, if that was the question.

Maybe, though, you could build in relationships of expression. Because, from what I know, there are some clusters of proteins that are more expressed than others, right?
And some proteins also tend to be expressed more than other proteins. Maybe there is variation in this, but maybe it could be accounted for in your model; maybe it could help the model predict eventual relationships between proteins, or something like this.

Yeah, it's not something we have considered so far. We had a few thoughts about other pieces of information we could incorporate in the model, again to add more data so that we can get more reliable results, but this is quite preliminary, so we haven't really explored the space of possible additions. There were several candidates, like the ORF quality; we can consider this one as well, but we haven't done that yet.

Okay, maybe I'll show the very final slide. As I said, this is preliminary; we haven't benchmarked it on experimental data yet. But we tested it on some real data and obtained results which are, I would say, comforting: they are reasonable, they go in a good direction. Here on the X axis you have the log TPMs, the transcript abundances, and on the Y axis the log protein abundances estimated by the model for the same isoform. So this is at the isoform level: transcript abundance versus estimated protein abundance. You see that when we don't use an informative prior, so when we only use mass spectrometry data, the two are not much correlated: a correlation of 19%, because there is just not enough information in the data to get good estimates. But when we use the informative prior, here on the right, we get a correlation of about 63%, and you see good agreement between our estimates of the protein isoforms and the corresponding transcript abundances. To put that in context: at the gene level, this correlation is about 65% on this dataset. That's roughly the most you can expect, because that is aggregating transcripts; once you add the noise and variability across transcripts, you will get a value close to that one, but obviously a bit lower. So, again, this is not a full, proper validation, but it indicates that the results are reasonable, and that we do a much better job when we use the prior information from the transcript data.

The other comforting bit is that the isoforms were stratified into three categories depending on their translation ability, that is, their ability to be translated into proteins: low, medium and high. We can see that the correlation is actually quite low for isoforms with a low translation ability, which is expected, because those transcripts are not translated well; for those with a medium or high translation ability, the correlation increases, which again is reasonable and comforting, because the more an isoform is translated, the more coherence we would expect between transcripts and proteins.

I will stop here. If you want some references, I think they are in the shared slides: here you have the packages and the papers. This last project is still ongoing, but I'm referring to a paper where this setup was initially tried.
This is a picture of the Robinson lab, where I've been for six years. I'd like to thank Mark, who supervised me in all these projects, a lot of co-authors, and the lab in general. And lastly: yesterday we had the David; today I decided to go for painting, so I'll close with these circles from Kandinsky. Thank you for being here until 6:30, which is quite late.