Okay, thank you. So, I am Simone, a postdoc in Mark Robinson's lab in Zurich. I am a statistician, so I develop statistical methods in computational biology and in particular in bioinformatics. This one is called DifferentialRegulation. It's a method for single-cell RNA-seq data, which we've actually developed jointly with a bunch of people from Mark's lab and also Rob Patro's lab. Intuitively, the method studies regulation from single-cell RNA-seq data. Just as a bit of background: from this kind of data, you can separate the abundance of spliced and unspliced reads, and this is, for instance, used by RNA velocity tools to study the derivative of the spliced mRNA over time. The basic assumption here is that if the ratio between unspliced and spliced mRNA is higher than you would expect at equilibrium, then the gene is probably being up-regulated, because this large unspliced amount is going to get spliced and increase the mature mRNA. Whereas if this ratio is lower than you would expect at equilibrium, the gene is probably being down-regulated, because the newly spliced mRNA is not going to fully compensate for the degraded mRNA, and so in the near future the spliced mRNA is going to decrease. We take this assumption and try to bring it, if you like, a step further: we use it to compare experimental conditions, like developmental time points. So what we assume is that if, in one condition, a gene has a higher relative abundance of unspliced mRNA, this suggests that the gene is going to be up-regulated in that condition compared to the other, because this larger amount of unspliced mRNA is going to be spliced and increase the mature mRNA more than in the other condition. And obviously conversely, if this relative abundance is lower.
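The equilibrium intuition above can be sketched in a few lines of illustrative Python. The counts here are made up for illustration; this is not code from the actual package:

```python
def unspliced_fraction(u, s):
    """Relative abundance of unspliced reads for one gene."""
    return u / (u + s)

# Hypothetical spliced/unspliced counts for one gene in two conditions:
frac_a = unspliced_fraction(u=120, s=880)   # 0.12 unspliced in condition A
frac_b = unspliced_fraction(u=300, s=700)   # 0.30 unspliced in condition B

# A higher unspliced fraction in B suggests the gene is being up-regulated
# in B relative to A: the larger unspliced pool will be spliced and feed
# the mature mRNA more than in A.
direction = "up in B vs A" if frac_b > frac_a else "down in B vs A"
```

Note that the comparison is always relative between the two conditions, never an absolute call of up- or down-regulation.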
And so, mathematically, the way we identify these genes is not by relying on all the assumptions of the velocity tools, but simply by looking at the relative abundance of unspliced reads and looking for changes in this relative abundance. Since we have single-cell RNA-seq data, we can of course cluster cells before we do the analysis, so that we identify cell-cluster- or cell-type-specific changes in regulation. Just keep in mind that this is very different from differential gene expression studies, because there you would look for changes in the total abundance of reads; here we don't really care about the total abundance. We just look for differences in the relative abundance of unspliced reads. The idea is that you look for differences in the near-future change, so in the direction that the spliced mRNA is taking, regardless of its total abundance. Here is an example of the kind of output, a gene that we identify. We compare three- and six-month brain organoids, and you can see that at six months this particular gene, in this particular cell type, has a higher proportion of unspliced reads, and so we would conclude that it's being up-regulated compared to the other condition. Again, it's not an absolute up- or down-regulation; it's always a comparison between two conditions. Getting a little bit more into the mathematical and statistical aspects: the kind of input you would use is estimated spliced and unspliced counts, which you can easily obtain with pseudo-aligners like alevin, alevin-fry, or kallisto|bustools. But from droplet protocols, there's a large amount of uncertainty in the assignment of reads. There are so-called multi-mapping reads that can map, first of all, to multiple genes, and secondly, to the spliced and unspliced versions of each gene. We call these latter reads ambiguous between S and U.
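As a naive first pass at "changes in the relative abundance of unspliced reads", one could run a simple two-proportion test per gene and cell cluster. This is illustrative Python with hypothetical counts, not the actual method; a plain test like this ignores both replicate-to-replicate variability and mapping uncertainty, which is exactly what the hierarchical model is there to handle:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical spliced/unspliced counts for one gene in one cell cluster,
# aggregated across cells, in two conditions:
table = np.array([[880, 120],    # condition A: [spliced, unspliced]
                  [700, 300]])   # condition B: [spliced, unspliced]

chi2, pval, dof, expected = chi2_contingency(table)
# A small p-value flags a shift in the unspliced proportion (0.12 vs 0.30),
# i.e. a candidate differentially regulated gene in this cluster,
# irrespective of the gene's total abundance.
```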
So of course you can still use these estimates, but if you neglect the uncertainty in them, your inference is going to be affected. What we do to account for this uncertainty is two different things. For the reads compatible with multiple genes, we use a latent variable approach: this basically means that we sample the gene allocation of reads, so it's just treated as an unknown parameter. For the ambiguous reads, my original idea was to do the same, but in order to use a latent variable approach, basically to assign these ambiguous reads to the spliced or unspliced versions, you need an estimate of the probability that an ambiguous read is spliced or unspliced, and that is actually not trivial to obtain. So instead we decided to treat these ambiguous reads separately. Instead of working with a bivariate vector, which is what happens in biology, obviously, spliced and unspliced reads, we work with a trivariate vector where you have spliced, unspliced, but also ambiguous reads, which are treated separately. So we basically have two models: a multinomial for the relative abundance of genes, which we're not really interested in, a nuisance model, basically, that is useful for the latent variable approach for the reads compatible with multiple genes; and then a hierarchical Dirichlet-multinomial model, which is the one we're really interested in, for the relative abundance of spliced and unspliced reads. Now, here I said we compare experimental conditions, but obviously we need biological replicates, so we compare multiple samples in each condition. We embed them in a Bayesian hierarchical model, so that each sample has its own specific parameters, because we account for the biological variability between replicates, but at the same time there is sharing of information across them.
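A minimal sketch of the Dirichlet-multinomial piece (my own illustrative Python, not the package's code): each replicate contributes an (S, U, A) count vector for a gene, and the Dirichlet-multinomial likelihood lets the spliced/unspliced/ambiguous proportions vary between replicates around a shared mean, which is how biological variability is absorbed:

```python
import numpy as np
from scipy.special import gammaln

def dm_logpmf(x, alpha):
    """Dirichlet-multinomial log-pmf of one (S, U, A) count vector x
    given concentration parameters alpha (both length-3 sequences)."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()        # multinomial coefficient
            + gammaln(a0) - gammaln(n + a0)              # normalising constant
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

# Hypothetical (spliced, unspliced, ambiguous) counts for one gene
# in three replicates of the same condition:
reps = [[80, 15, 5], [75, 20, 5], [85, 10, 5]]
alpha = np.array([8.0, 1.5, 0.5])   # shared concentration parameters
loglik = sum(dm_logpmf(x, alpha) for x in reps)
```

The larger the total concentration `alpha.sum()`, the closer this gets to a plain multinomial, i.e. less between-replicate variability.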
And for inference, I won't go into details, but we use MCMC schemes where we basically alternately sample from the conditional distributions of the parameters given the latent states, and then the latent states given the parameters. We've done benchmarks against a couple of methods, which are the ones that, at least as far as I identified, do this kind of analysis: eisaR, which is based on edgeR, and BRIE2, which is another Bayesian method. But the key point here is that they ignore this mapping uncertainty I mentioned. So we've done benchmarks on real data, which I'm not going to show, and in simulations, where we start from a real data set: we have four samples that basically belong to the same experimental condition, we split them into two groups, and we artificially introduce a differential effect into one of the two groups, so it's effectively a semi-simulated data set. We ran two simulations: one where we only introduce a, let's call it, differential regulation effect, and one where we also introduce a differential gene expression effect. The idea is that this is a nuisance effect that we don't want to detect, and obviously it's going to make our life harder when detecting differential regulation without detecting differential gene expression. And then, as I said, the key point of the study is that we want to deal appropriately, mathematically, with mapping uncertainty, so we obviously have to introduce it into the simulation, and we do it with minnow, which is a read-level simulator from Rob Patro's group. So here are the key plots from the simulation: true positive rate versus false discovery rate. If you're not familiar with this, you basically want to be in the top-left area: many true positive discoveries and few false discoveries. Now, without going too much into detail, our method is in green; you see it doesn't manage to get a really very high true positive rate.
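The alternating MCMC idea can be illustrated with a toy data-augmentation Gibbs sampler for the gene-allocation part. This is illustrative Python under simplified assumptions (3 genes, a flat Dirichlet prior, multi-mapping reads given as 0/1 compatibility masks); the real model is considerably richer:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha0 = np.ones(3)                  # flat Dirichlet prior on gene abundances
unique = np.array([40, 30, 10])      # reads uniquely mapped to each gene
# Each multi-mapping read is compatible with a subset of genes (0/1 mask):
compat = np.array([[1, 1, 0]] * 15 + [[0, 1, 1]] * 5)

pi = np.full(3, 1 / 3)               # initial relative abundances
draws = []
for it in range(3000):
    # 1) Latent step: allocate each multi-mapping read to one compatible
    #    gene, with probability proportional to the current abundances pi.
    w = compat * pi
    w = w / w.sum(axis=1, keepdims=True)
    alloc = np.array([rng.choice(3, p=row) for row in w])
    extra = np.bincount(alloc, minlength=3)
    # 2) Parameter step: conjugate Dirichlet full conditional
    #    given the (now complete) counts.
    pi = rng.dirichlet(alpha0 + unique + extra)
    if it >= 1000:                   # keep post-burn-in draws
        draws.append(pi)

posterior_mean = np.mean(draws, axis=0)
```

Treating the allocation as a latent state, rather than fixing it to a point estimate, is what propagates the mapping uncertainty into the posterior.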
I think this is because of the simulation itself, but relative to the other methods it has a higher true positive rate and also controls the false discovery rate. FDR control is a little bit worse for all the methods when you introduce differential gene expression, which, as you can see, distorts the results a bit. So, wrapping up, and yeah, I think I'm doing fine with the time. Wrapping up: we haven't really proposed a new analysis; as you've seen, there are tools that do this. But our idea is to have a more, if you like, sophisticated mathematical approach that can lead to better performance, because we account for the uncertainty in the mapping of droplet reads. Obviously this comes mostly at a computational cost, because you have a more complex model to fit. But, I haven't spoken about this, we actually have two versions available: the more complex version, if you like, but also a faster one, which doesn't do the gene allocation; it's slightly less accurate, but it's also faster. Covariates are also not modeled, because it's not very easy to include them in this kind of model. And we're currently working on an extension of this to also work on bulk data, where my idea would basically be to have two kinds of analysis at different resolutions: on single-cell data, you can identify changes at the gene level which are cell-type specific, as we've said, because you can cluster cells; on bulk data, obviously, you cannot have cell-type-specific changes, but at the same time you can have transcript-level changes, because on bulk data you can actually distinguish the transcripts of each gene. So you could study things at different resolutions on the two data sets. The package is already available on Bioconductor, and after we've done this extension we'll write a pre-print, so hopefully by September or October we'll also have a description out. Okay, thank you very much for your attention.
So, as we come to the end of the session, we do have at least one question for you, Dr. Tiberi, from Ryan Thompson: "In the past, I've simply run salmon or kallisto on the reads while including the unspliced transcript sequences in the reference, and just let the EM algorithm sort out the ambiguous reads, in the same way as for alternative splicing. How does your method compare to this approach to splice-ambiguous reads? I guess, what's the marginal benefit of properly modeling it with a hierarchical model?"

I need a minute to re-read that, I'm not that fast. Running salmon on the reads while including unspliced transcripts... Well, I mean, that's basically what the other two methods I mentioned do, right? They take these EM estimates and work with them. These estimates are not wrong; they're accurate, but there's uncertainty in them. They're estimates. The other methods just treat them as if they were real values: they neglect the uncertainty, and you can see that this has an impact on the inference.

So I guess I'll follow up a little bit on that. Suppose that you applied the bootstrap estimates from, say, a kallisto run or an alevin run. How does that compare to properly hierarchically modeling the uncertainty therein?

Well, I mean, that would be another way to do it. This is, of course, not the only way you can deal with the uncertainty in these multi-mapping reads; you could use those bootstrap estimates. But then, of course, you also need a mathematical model to correctly propagate the uncertainty via the bootstrap estimates. This is our approach because, given my background, it is the most natural one for me. But of course you could develop something that starts from those bootstrap estimates, yeah.
So I guess then, yet further: will there be a comparison of, say, sleuth and DifferentialRegulation in the pre-print?

So, I don't remember the details of sleuth very well. Sleuth, that's differential, er, splicing, I guess? Yes, differential transcript usage. Yeah, I would have to think about whether it's possible to fit it into this kind of framework. Although I'm not sure, because in the end it would be an extension, right? Because that one is for bulk RNA-seq; these are somewhat different features. But yes, I'll think about it, and if it's possible, we'll add it.

Very cool. Thank you for a lovely talk. Yeah, thanks for your questions.

Do we have any further in-person questions? Leo?

Hi. Hello, can you hear me? This is a question for Stefano. I was interested by your method for detecting outliers, and it seems that it makes the assumption that the data is going to be unimodal. Could it maybe also help you identify doublets in some cases, or are you concerned that you could actually be throwing away things that you want to keep?

Yes. So, for sure it's a unimodal model, and in some data sets, for example, this bimodality is captured as outliers. But the estimates are for the user to judge. In some cases, where you have a huge presence of outliers for many cell types, it's probably an indication that you might be missing some covariates, or that your model should be more complex to explain those. So the concept is: in some cases you have just single outliers, and by investigating them you might find out what they are; in other cases you have to judge, and run your model iteratively, to explain that bimodality. So no, it's definitely not a bimodal model, but it's good that the model probabilistically identifies for you that those points are outliers according to your model.

Good. Hi, this is for Simone.
Great talk; especially the idea of using the uncertainty in the read mapping is pretty interesting. I was wondering about the A category you were talking about, the ambiguous reads, right? Is there any inherent bias, or have you seen any correlation of the A reads with, let's say, the length of the gene, or the number of introns or exons in specific genes? Because if there is a correlation, then certain groups of genes would have more uncertainty, and eventually that would make the differential analysis more complicated, right? So do you have any sense of whether certain groups of genes are affected by this relatively more than others? Thank you.

Hi, yeah, thanks, it's a very good question. I don't know, I haven't looked it up, but what you say is somewhat connected to what I said: the probability that an ambiguous read is spliced or unspliced is very hard to estimate, because coverage is non-uniform; in the end it's gene-specific, even transcript-specific, because even within the same gene, alternative splicing is going to affect this. And that's why we actually decided to keep it separate, because it would be very messy to get a good estimate of that, which ultimately, I think, would depend on many of the things that you mentioned. Are you satisfied with the answer? Yes, thank you. I don't see the camera, but okay.

I have a question for Simone. So you mentioned that you can cluster the cells, but do you do that clustering on just the spliced reads, the unspliced reads, or both combined? Because I know 10x Genomics is now making it the default in their Cell Ranger to just count everything together; it doesn't matter, it's better to count them all together.

Yes, so I've thought about that. We tend to use the spliced reads, because it's a little bit more coherent with what is normally done.
Normally you would cluster based on counts, and counts in the end are probably better reflected by spliced reads only. But I think you could use whatever you want; I normally use spliced, but that is, if you like, a prerequisite of the package, let's put it that way; it's not something the package does. So I would use spliced, but you could also use, I don't know, the sum of the two, if that makes more sense.

Hi, this is a question for Gabriel, if he's still here. Is he there? Yes, thanks. So I found your talk really interesting: you're talking about a method that is capable of analyzing a very large single-nucleus RNA-seq data set while at the same time being faster, less memory-intensive, and using linear mixed-effect models. So I feel like you're taking advantage of pseudobulking in order to make the single-nucleus RNA-seq look like bulk, but is there a trade-off at some point in how many cell types you can look at? Because maybe once you start to have a lot of small cell types, your pseudobulk data set will still be pretty large, and maybe that will impact the performance of your method, I don't know.

Yeah, that's something we've thought a lot about; it's going to be a trade-off. Power to do differential expression in bulk and pseudobulk is impacted by the read depth, and when you get increased resolution in your cell types, you're going to be losing some of the read depth there, right, because you're splitting your reads across multiple cell-type clusters. So it's a balance between having, say, 10 cell clusters with very large read depth versus 50 cell types with lower read depth. I think it really depends on the data set, and we're looking at this empirically.
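For reference, the pseudobulking being discussed is just summing cell-level counts within each (sample, cluster) pair. A minimal illustrative sketch (toy matrix and labels of my own, not Gabriel's actual code) shows why finer clusterings split the available read depth:

```python
import numpy as np

def pseudobulk(counts, samples, clusters):
    """Sum a cells x genes count matrix into one row per (sample, cluster)."""
    keys = sorted(set(zip(samples, clusters)))
    out = np.zeros((len(keys), counts.shape[1]), dtype=counts.dtype)
    for i, key in enumerate(keys):
        mask = np.array([(s, c) == key for s, c in zip(samples, clusters)])
        out[i] = counts[mask].sum(axis=0)
    return keys, out

# Toy data: 4 cells, 2 genes.
counts = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
samples = ["s1", "s1", "s1", "s2"]
clusters = ["T", "B", "T", "T"]
keys, pb = pseudobulk(counts, samples, clusters)
# Each sample's fixed pool of reads is divided among its clusters, so the
# more clusters you define, the shallower each pseudobulk row becomes.
```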
But one of the reasons that the computational efficiency is so important is that when you want to explore the pseudobulk at these multiple different resolutions, if each run takes a special machine 12 hours, then it takes that much longer to get to the downstream analysis. So I think it really depends on the biology, and yeah, we'll see.

I have a question too, for Stefano. Yeah, that was a great talk. I'm more interested in combining the different data types that you showed in the early slide, so how you map and then aggregate the different information from the DNA, RNA, and proteins. How do you combine those kinds of data types together?

Yeah, so the method doesn't deal with that; it's just downstream of all that. The method assumes you are given the identities of clusters of cells. So it means that you look at each data type separately? No, no, basically it's very simple: the input data is just counts of cells per cluster, and how you derive that is all upstream of the method. The focus is just on the statistics of that problem. But do you have any kind of method that you've developed to combine different data types together? Not me, but there are plenty of people here, some of whom spoke before, so you can find that knowledge in this crowd.

Kind of a public service announcement: there's a tremendous amount of interest and follow-up in this session, but at the same time we have only 10 minutes until Bhramar Mukherjee's keynote in the Cure Auditorium. So to the extent that we can take this offline and keep a record of it in the WebEx sessions, that would be wonderful. Thank you all for three wonderful talks and a tremendous amount of stimulating discussion. Thank you. Thank you so much.