Okay, thanks for the introduction. I'll say a couple more words about the scientific aspects. I am a statistician, but very much on the applied side, and I have pretty much always worked with biological applications, mainly in two very different fields. The most recent thing I've done is to work in bioinformatics. In that field I mainly develop statistical tools, statistical methods that are then released as R packages, for several omics data sets. So in this case I don't really have a particular data set that I care about and want to analyze; I try to develop a general tool, a general method, that others can use in their analyses. I did that at the University of Zurich, where I was a postdoc, and I still work in this field now in Bologna. Before that, I did my PhD in England, where I worked in systems biology. There I had pretty much the opposite scenario: I analyzed a specific data set, which was actually very complex, and so we developed an ad hoc methodology for it. This introduction is useful because I'll be talking about both fields: a couple of projects from the bioinformatics side and one or two projects from the systems biology side.

Generally, I mostly use Bayesian statistics, for two main reasons. First, I work with biological data, and biological data is characterized by a lot of missing data. Biology is already complex, so there's a lot of noise on the biological side, but the measurements we have are also inaccurate and noisy. We wish to observe, say, mRNA or protein abundances, the actual populations of molecules, but we never really do; there is always noise, and that typically introduces latent states, or missing data, for the actual quantities we're trying to observe. With frequentist statistics it's very hard to work properly with missing data and to propagate the uncertainty about it forward. In Bayesian statistics that's a lot more intuitive, because every missing point is just a parameter, so you treat it like any other parameter: you just sample it. It gives a very natural framework to deal with latent variables. The second reason is that in biology you often have additional information about your data or your parameters, which could come, for instance, from other studies: maybe someone five years ago did an analysis and estimated the degradation or synthesis rates in the organism you're studying. Or the information could come from other genes. In bioinformatics you often study many genes, and although each gene behaves differently, the other genes can still tell you something about the specific gene you're studying. That again can be embedded in a very natural way as prior information. So Bayesian statistics gives you a very elegant and formal way to deal with these two issues.

As I said, I'm going to give you an overview of a couple of projects in bioinformatics and then, depending on time, one or two in systems biology. I'll stop in the middle, because the topics are very different, so you can ask questions about the first part and then about the second part. Okay, I'll start with BANDITS. BANDITS is the first project I did in bioinformatics, and it deals with bulk RNA-seq data.
I assume most of you know what RNA-seq data is, but I'll give a very short introduction. In bulk RNA-seq you observe a signal that represents an average over many cells, so you cannot distinguish what comes from each individual cell. On the other hand, you can study the transcript-level signal quite decently, so you can disentangle which transcript of a gene the signal is coming from. With single-cell RNA-seq you pretty much have the opposite: you can separate the single cells, but it becomes harder to disentangle the signal coming from the transcripts of the same gene. And then obviously there is spatial transcriptomics and other things, but let's focus on bulk RNA-seq here.

Very schematically, what kind of data do we have? We typically have these RNA-seq reads, which are basically sequences of nucleic acids reverse transcribed from the mRNAs. Each read originally came from some specific mRNA, some specific transcript, but we don't observe that label; a read doesn't say "I'm coming from transcript A" or "transcript B", you just observe the sequence. So what you have to do is align the reads to a reference, typically a reference genome or transcriptome, and check which locations of the reference each sequence is compatible with. As you will see later, this is a noisy process; it is typically an imperfect allocation. After you do that, you can just count: how many reads are compatible with this transcript, how many with that one. You get a count table and you do your inference with that. With BANDITS we try to account for the uncertainty in this alignment step and avoid using the count matrix directly, but this is the classical and most common scenario.

Okay, so let's get a bit more into what the package actually does. We work with alternative splicing. I am a statistician, so I am very ignorant in biology; I'll try not to embarrass myself and stick to general things. In transcription, a molecule of RNA is transcribed from the DNA of a gene, and originally this RNA contains both exons and introns; this is also called unspliced or immature RNA. Then, during splicing, the exons are joined together and the introns are removed, or spliced out, and you're left with a molecule of mature, spliced mRNA, which is then translated into a protein. With alternative splicing, a single gene can actually lead to multiple mRNAs, multiple transcripts, which can code for several proteins. This is a very useful process, but it can be disrupted in disease or altered by drug treatment, so it is often interesting to study how alternative splicing changes between conditions: healthy versus disease, treated versus untreated, or treatment A versus treatment B. What BANDITS does is exactly this: it looks for the genes that display a change in their alternative splicing patterns. Now, when we talk about changes in alternative splicing patterns we have to define what we mean, and in practice we mean the relative abundances of the transcripts.
So if you have a gene that, like here, codes for three transcripts, you can think that in one condition you express transcript A 70% of the time, transcript B 30% of the time, and never express transcript C, while in another condition you might express A 50% of the time, B 20%, and C 30%. So we look at the relative abundances of the transcripts within a gene. You might have heard of differential gene expression; I don't know your backgrounds, but that is a different thing, because there we look at the overall abundance at the gene level. We don't care which transcript the reads are coming from, we only care about the overall abundance of the gene, so we aggregate everything together. This is a different kind of analysis.

Now let's get more into the mathematics, or at least the concepts. To do this, what we would love to have is simply how many RNA-seq reads come from each one of these transcripts, but we don't know that, because, as I said before, the alignment step is noisy. We don't always know which transcript a read is coming from, and that complicates our inference, because in practice we have what are called multi-mapping reads. Take this very simple gene with two transcripts, blue and red, and these bars are meant to represent RNA-seq reads. In some cases there is a unique alignment: you know this read is coming from the blue transcript, because only the blue transcript covers that location, and these two reads are coming from the red transcript. But these seven reads are compatible with both. In reality they come either from the red or from the blue transcript, but you don't know which. There are tools like salmon or kallisto that use something like an expectation-maximization algorithm to estimate the total abundance of the red and the blue transcript, but, as I was saying, these are estimates, so there is uncertainty in them, and you have to propagate this uncertainty forward if you want accurate inference.

So instead of using those estimates, we use the latent variable approach I mentioned before. We work with what are called equivalence classes: we just count how many reads map to which set of transcripts. So you have one read mapping to the blue transcript only, two reads mapping to the red one only, and seven ambiguous reads that map to both blue and red. As I said, each of the seven reads has an actual origin, blue or red, but we don't know it, so for us this is missing data: the allocation of those reads is unknown and we treat it as a latent variable. This allocation is a parameter that we sample within our model, and obviously we don't sample it once; we sample it iteratively, so that every iteration of the sampler has a different allocation of the seven reads, and that propagates the uncertainty in these allocations. Now, for a moment, assume the allocation is done, so you know which transcript each read maps to. You have a gene with K transcripts and several biological replicates, say three or five healthy samples. You can then assume that, within this gene, the reads you observed are distributed across the K transcripts according to a multinomial.
This is a classical assumption: you just have a multinomial allocation, and this multinomial has a parameter, pi, that gives the relative abundances of the transcripts, like the percentages I showed before. We then use what's called a hierarchical model, which is conceptually the same as a mixed-effects model: we assume that every sample has its own parameter vector pi, because the relative abundances of the transcripts can vary between samples, so each sample has a different parameter. At the same time it wouldn't be very clever to analyze each sample independently, so we have a common prior. Samples have their own parameters, but they also share information with each other. This is a classical Bayesian hierarchical model, nothing fancy; I probably made it sound more complicated than it is.

This is useful because it not only allows sharing of information, it also gives us parameters at the group level. If we want to study healthy versus disease, we don't care so much about the individual people, or the individual mice, in the healthy group; we care about an average at the group level for the healthy people and an average for the diseased people. These parameters are called hyperparameters, and they give the group-level description. They can be reparameterized as the group-level average relative abundances plus a precision parameter that indicates how much this group-level average changes from sample to sample, basically how much variability there is between the samples of a group.

I said at the beginning that there are two main reasons for me to be a Bayesian: latent variables and prior information. I've described how we use a latent variable approach; here we get to the second point, which is sharing of information via an informative prior. I anticipated this a bit when I said that we analyze multiple genes: this kind of analysis is done for 10-20,000 genes. Each gene obviously has its own number of transcripts and its own parameters, so we cannot assume the same parameters across genes. But there is something we can think of as being similar, not identical but similar, between genes, and that is this dispersion, or precision, parameter, which again indicates the variability between samples. So what we do in BANDITS is get an initial estimate of those parameters across genes and use it to formulate an informative, empirical prior. I heard empirical Bayes was already mentioned; that's the same thing. Empirical Bayes basically means we use the data twice: it's a bit of a trick, because you use the data to get an initial estimate and then you reuse it in the analysis. It can be risky, because you're using the data twice, but in this case it's very mild, because the prior is built from thousands of genes, so each individual gene contributes to it in a very marginal way; we're almost not reusing the data at all. This then leads us to the overall set of parameters that we want to sample, and the first ones, as I said, are the latent variables, the allocations of the ambiguous reads.
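To make the core of this concrete, here is a tiny toy sketch of the two ideas just described, sampling the latent allocation of the ambiguous reads and then the transcript proportions, for a single gene with two transcripts. This is my own illustration under simplifying assumptions (a plain Dirichlet prior, a single sample, made-up counts), not the actual BANDITS implementation:

```python
# Minimal sketch (not the actual BANDITS code) of the latent allocation idea:
# reads in an ambiguous equivalence class are re-allocated at every iteration,
# and the transcript proportions pi are updated given that allocation
# (Dirichlet-multinomial conjugacy). Toy gene with 2 transcripts.
import numpy as np

rng = np.random.default_rng(1)

unique_counts = np.array([1.0, 2.0])  # reads mapping only to transcript 1 / 2
n_ambiguous = 7                       # reads compatible with both transcripts
prior = np.array([1.0, 1.0])          # Dirichlet prior on pi

pi = np.array([0.5, 0.5])
pi_draws = []
for it in range(2000):
    # 1) allocate each ambiguous read to a transcript, proportionally to pi
    #    (both transcripts are compatible here, so the weights are just pi)
    z = rng.binomial(n_ambiguous, pi[0] / pi.sum())
    counts = unique_counts + np.array([z, n_ambiguous - z])
    # 2) sample pi from its full conditional (a Dirichlet)
    pi = rng.dirichlet(prior + counts)
    pi_draws.append(pi)

pi_draws = np.array(pi_draws[500:])   # drop burn-in
print("posterior mean of pi:", pi_draws.mean(axis=0))
```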
Besides those latent allocations, we also want to sample the hierarchical, sample-specific parameters and the group-level hyperparameters, because there is no analytic formula that just gives us the posterior distribution of the quantities of interest. As you can imagine, sampling all these parameters at once is not realistic, because we're talking about many, many parameters, so we sample them in blocks. We use what's called Metropolis-within-Gibbs, which is a fancy name that basically means we update a little piece at a time instead of everything together, which is intuitive and, I think, reasonable. We know the conditional distributions of these parameters, and so we sample from them.

The issue with this kind of method is that, first of all, it can be computationally intensive, and then you have to check that things are converging. Thankfully, most of our steps, I did not attend the previous two days so I'm not sure whether you have seen the difference between Gibbs and Metropolis samplers, but assuming you have, most of the steps follow a Gibbs sampler, which means you sample directly from the target conditional distribution, and that is quite efficient in terms of convergence and mixing. Only a few parameters follow a Metropolis sampler, which is a bit nastier and slower. On top of that, we want to ensure convergence. In my opinion, the best way to assess convergence of MCMC posterior chains is to actually look at the chains, because that works much better than any test you can find around. But we analyze thousands of genes, and we release a tool for users who may not know Bayesian statistics; you cannot ask them to check thousands and thousands of chains, that is absolutely unfeasible. So in practice we have to use a convergence test, which is not ideal, but it's better than nothing: at least it guarantees some sort of stability in the results.

And then, I'm pretty sure you have already talked about the fact that MCMC can be very computationally intensive, because you have to sample a lot of parameters and you do that every iteration, for thousands of iterations typically, depending on the problem. So that can take a long time, particularly if you're coding in R. These are more practical things, but if you're familiar with R you know that nested loops take a very long time; it's not efficient to do nested loops in R, and MCMC is all about nested loops, because it's an iterative process. So we coded this in C++ and managed to get it running in a much smaller time than we originally had in R: overall it takes less than one second per gene, which means something like one or two hours for a full data set on a laptop.
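Going back to the sampler for a moment, here is roughly what a Metropolis-within-Gibbs skeleton looks like. It is a generic toy illustration, not the BANDITS C++ sampler; the target distribution and all numbers are invented:

```python
# Generic Metropolis-within-Gibbs skeleton (toy illustration): parameters are
# updated in blocks; blocks with a tractable full conditional get a Gibbs draw,
# the awkward ones get a random-walk Metropolis step on the log scale.
import numpy as np

rng = np.random.default_rng(0)

def log_target_delta(delta):
    # placeholder log full-conditional for a precision-like parameter,
    # here just a Gamma(4, 1) kernel for illustration
    return 3.0 * np.log(delta) - delta

state = {"pi": np.array([0.5, 0.5]), "delta": 1.0}
for it in range(5000):
    # Gibbs block: full conditional available in closed form (illustrative)
    state["pi"] = rng.dirichlet(np.array([5.0, 3.0]))

    # Metropolis block: no closed-form conditional
    delta = state["delta"]
    prop = delta * np.exp(0.3 * rng.normal())            # multiplicative random walk
    log_acc = (log_target_delta(prop) - log_target_delta(delta)
               + np.log(prop) - np.log(delta))           # Jacobian of the log-scale proposal
    if np.log(rng.uniform()) < log_acc:
        state["delta"] = prop
```

The point is simply that the cheap, exact Gibbs draws and the slower Metropolis steps live inside the same loop, one block at a time.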
Statistically, we then do a test for every gene and for every transcript, and we have thousands of them, so we want to correct for multiple testing. We do that with the Benjamini-Hochberg correction, which controls the FDR. I'll spare you the next part, but just keep in mind that we do account for the length of the transcripts. Intuitively, take two equally abundant transcripts: if one is longer, you will see more reads from it, simply because there are more positions reads can come from. So we account for that and normalize by transcript length.

So far I've shown how you can do inference within the healthy group, the disease group, group A and group B, whatever you want to call them. But I said the goal was to identify genes that differ between groups, so now we want to compare groups. What Bayesian statistics is not very good at is doing tests: you don't get a p-value; you can compute Bayes factors, but the interpretation is not the same, and it's definitely not as clean as a frequentist test that gives you a p-value, FDR control and so on. So here we use a bit of a trick: we go back to the frequentist world. We approximate our posterior chains and use a frequentist test. Let me show you what we did, it's quite intuitive. If you're studying two groups, A and B, you have these group-level relative abundances, and we compare them between group A and group B by taking their difference. If the alternative splicing patterns are the same, those quantities are expected to be the same, so the difference is zero; that's our null hypothesis. If there are differences in alternative splicing, some of those quantities will differ, so some of the differences will be non-zero; that's our alternative. We approximate the posterior of these differences with a multivariate normal, whose mean vector and covariance matrix (it's not just a variance) are estimated from the posterior chains. Once we have a multivariate normal, it's very easy to use a multivariate Wald test and rely on frequentist machinery. That also allows us to test each individual transcript: same thing, but with a univariate Wald test.
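As a concrete illustration of that testing step, here is a small sketch, my own and with made-up chains, of how a Wald statistic and p-value could be computed from posterior draws of the group-level relative abundances:

```python
# Sketch of the testing idea described above (not the package code):
# approximate the posterior of the difference between group-level relative
# abundances with a multivariate normal estimated from the MCMC chains,
# then apply a Wald test.
import numpy as np
from scipy.stats import chi2

def wald_test_from_chains(chain_A, chain_B):
    """chain_A, chain_B: (n_iterations, K) posterior draws of the group-level
    relative abundances of the K transcripts of one gene."""
    # the K proportions sum to 1, so drop one component and keep K-1
    diff = chain_A[:, :-1] - chain_B[:, :-1]
    mean = diff.mean(axis=0)
    cov = np.atleast_2d(np.cov(diff, rowvar=False))
    stat = float(mean @ np.linalg.solve(cov, mean))
    pval = chi2.sf(stat, df=mean.size)
    return stat, pval

# toy usage with fake chains for a gene with 3 transcripts
rng = np.random.default_rng(2)
A = rng.dirichlet([70, 25, 5], size=4000)
B = rng.dirichlet([45, 25, 30], size=4000)
print(wald_test_from_chains(A, B))
```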
After developing the method we obviously had to test it, so we ran several benchmarks; I'll just give you the key idea. We try to simulate data that is as realistic as possible: we start from a real data set, infer its parameters, and then use those parameters to generate a new data set. And we do that by simulating at the read level. Going back to our scheme, we simulate the reads themselves; we don't simulate a count table, because that would neglect the uncertainty that comes from the alignment. We simulate actual reads that then have to be aligned and so on, so the uncertainty of the alignment step is present in our simulation. Then we simulate two groups where we artificially introduce a difference in alternative splicing between the groups. The key point methodologically is that of the two things we model, mapping uncertainty and sample-to-sample variability, none of the other competitors handles both, and this is actually what motivated us to develop the method.

I won't go into too much detail here, because I don't think it's very interesting, but I have to at least show this. This is a true-positive-rate versus FDR plot, essentially a ROC curve but with the FDR instead of the false positive rate. You want to be in the top left corner: high true positive rate and low false discovery rate. We have BANDITS in green and several competitors, and you see that we're up here, so we're doing quite well both in terms of true positive rate, so statistical power, and in terms of false discoveries. This is the gene-level test, and we have a pretty similar pattern for the transcript-level test. Additionally, we also used a real data set where we have some genes that were validated in the lab. Let me say a general thing here: simulations are good because you have a perfect ground truth, you simulated the data yourself so you know the truth, but however hard you try to make them realistic, they're never as realistic as real data. Real data, on the other hand, is perfectly realistic, but you never have a perfect validation set. So we try to do both and get an overall picture. Here we have a validation set of 82 genes, and that allows us to build something like a noisy ROC curve, where again you have BANDITS in green, so also on real data we had pretty good performance. Computationally we obviously take longer than simpler methods, but we're still in the range of one or two hours for a full analysis on a laptop, because of the computational choices I mentioned before. I'll spare you the summary; I'll just say that the package is on Bioconductor, and if you want more details there is the paper. If there are no questions, I'll tell you about the second project in bioinformatics, and then I'll pause before covering the systems biology part.

Okay, completely different project. This one is about proteomics, in particular proteogenomics, because we actually try to integrate transcriptomics and proteomics. So forget everything we said before, this is a very different application. Bottom-up proteomics is the main way to study proteins, but the technical challenge is that you wish to infer proteins and you don't observe them: you observe peptides. And peptides have the issue that most of them are associated with multiple proteins, so you don't really know which protein the signal is coming from. On top of that, peptides are occasionally erroneously detected, although you do have an estimate of their error probability; obviously this makes the inference even more complicated. If you think about it, this is very similar to the mapping issue I mentioned before with RNA-seq data: there we had reads that map to multiple transcripts, here we have peptides that map to multiple proteins. So why not use those methods? Because things are messier in proteomics. You have a lot less information, far fewer peptides, and a lot more technical noise. While in transcriptomics we can get good estimates of the transcript abundances even just by using salmon or kallisto, in proteomics that is very, very challenging. So normally you have two options: either you use methods that are not very accurate at the isoform level, or you do inference at the gene level, in which case you lose resolution and you don't actually know which individual isoforms are involved in whatever process you're studying.
So with this project we try to build a model that actually does inference at the isoform level, and one of the key ideas is that we try to enhance the data we have by using transcriptomics data. Transcripts are obviously correlated with proteins, because they are their precursors, so we can use them as additional information and try to do a better job of studying proteins at the isoform level. Getting a bit more into the details, what we want to do is: determine whether each protein isoform in the database is actually present; estimate its abundance, whether it's highly or lowly abundant; and also find the isoforms where the transcript and protein abundances are very different, those that maybe have a low transcript abundance but are really abundant at the protein level, or vice versa, highly abundant at the transcript level but lowly abundant at the protein level, meaning that something is going on between transcription and translation. And obviously, since I'm a statistician, we associate a measure of uncertainty with all of this: we don't just call an isoform present or absent, we have a posterior probability of presence, and for the abundance we also have a credible interval, an interval estimate.

Just to give you an example application, which I think is quite intuitive: there is the MITF transcription factor, a gene which we know is differentially abundant between subtypes of melanoma, but we don't know which of its isoforms are differentially abundant. With this kind of approach you can think of studying the individual isoforms that are abundant in each melanoma subtype and then comparing the subtypes. The nice thing about this project, from my side, is that for the first time in my bioinformatics work it is really supported by a biological lab, which is in the States. They're really helping us: motivating the study, providing the data for the validation, and also guiding us in the right direction, not just towards what is statistically pleasant but towards what is biologically useful.

Now I'll get a little bit into the mathematics, but I'll try to keep it at a high level and only talk about the concepts. What kind of data do we have? We observe peptides, as I said, so you can think of observing the abundance of each peptide. Each peptide is associated with one or multiple proteins, so you have a vector that tells you all the proteins the peptide is associated with, basically just a map. I said that peptides can be erroneously detected, but we do have an estimate of the probability that they're wrong, so we have that estimate too. And then, if we collect RNA-seq data, we also have information about the abundance of the isoforms on the transcript side, and we take that as prior information. So we have information about peptides and transcripts, and what we want is information about proteins: the overall abundance and the relative abundances of the protein isoforms. How do we get there? For a moment, assume that you actually observe the overall protein abundance, so you don't have missing data, you don't have these latent variables. Then it's very easy.
You can distribute that abundance, again, with a multinomial: you have n isoforms and you just allocate the abundance across them following a multinomial. And the relative abundance parameter has a prior that depends on the transcript relative abundances, so the transcript relative abundances are used to formulate a prior for the protein isoform relative abundances. This is not a hierarchical model: it looks similar to before, but there is no hierarchy, we're just analyzing individual samples here. But, as I said, this Y is not observed; these are the protein abundances, and we don't see them, because we only have information at the peptide level. So, again, this is missing data, and we recover the information at the protein isoform level with a latent variable approach. Actually a double latent variable approach, because, as I said, peptides can be erroneously detected and they map to multiple proteins. So first we sample whether a peptide has been correctly detected, and then, only for peptides that have been correctly detected, we go to the second step and allocate their abundance to the proteins of origin. Mathematically it's actually a fairly simple model, because again it's a Bernoulli and a multinomial, nothing really fancy. Whether a peptide is correctly or incorrectly identified is just a draw from a Bernoulli that depends on its error probability, and then, only for peptides estimated to be present, we allocate their abundance with a multinomial. Obviously this multinomial does not depend on all the pi parameters, but on a subset of them, only on the proteins the peptide is compatible with: intuitively, if a peptide is compatible with proteins A and B, you're going to allocate it between proteins A and B. This is the key idea, but in the MCMC we do it iteratively, over a few thousand iterations, and at each iteration you repeat these steps.
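Here is a toy sketch of those two latent steps for a handful of peptides; it is my own illustration with made-up numbers and a fixed pi, not the actual package, just to show the allocation logic:

```python
# Toy sketch of the two latent steps described above: first sample whether each
# peptide was correctly detected (Bernoulli with its error probability), then
# allocate the abundance of the retained peptides to the compatible protein
# isoforms, proportionally to the current isoform relative abundances pi.
import numpy as np

rng = np.random.default_rng(3)

pi = np.array([0.6, 0.3, 0.1])                 # current isoform relative abundances
peptides = [
    {"abundance": 10, "error_prob": 0.01, "proteins": [0]},       # unique peptide
    {"abundance": 4,  "error_prob": 0.05, "proteins": [1, 2]},    # shared peptide
    {"abundance": 7,  "error_prob": 0.40, "proteins": [0, 1, 2]}, # uncertain peptide
]

protein_abundance = np.zeros_like(pi)
for pep in peptides:
    # latent step 1: was the peptide correctly detected?
    if rng.uniform() < pep["error_prob"]:
        continue                                # treated as a wrong identification
    # latent step 2: split its abundance among compatible proteins,
    # with weights proportional to pi restricted to those proteins
    idx = np.array(pep["proteins"])
    w = pi[idx] / pi[idx].sum()
    protein_abundance[idx] += rng.multinomial(pep["abundance"], w)

print(protein_abundance)
```

In the real sampler pi itself is then updated given these allocations, and the whole cycle is iterated.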
Now, for the validation we used a different strategy compared to BANDITS. Simulations in proteomics are not as good, not as accurate, as they are in transcriptomics, so we actually used real data. Again, the nice thing about having collaborators is that they collected very accurate real data for us: they had what's called a multi-protease data set, where, without getting too technical, the same material is analyzed with six distinct proteases. What we can do then is something like a cross-validation approach: we analyze one protease and use the other five to build a ground truth, so we check the results from the protease we analyze against the other five, and then we rotate; we analyze one protease at a time, use the remaining five, and do that for all six. As I said before, real data means that the ground truth is noisy, it's not perfect, but it's accurate enough. How do we build it? We use a subset of the data, and in particular only the peptides we are very confident about: those with a very, very low probability of being wrong and that are associated with a single protein.

We then test our model in two configurations: using the mRNA abundances, but also without them, because although I've shown the model with mRNA abundances, in principle it can also run without them, with a vaguely informative prior instead. Obviously the rationale is that if you have them, results should be more accurate. We have a few competitors, but as I said there are very few methods that do this kind of inference, so in the end we have only three. We try to validate two pieces of inference: the presence or absence of the individual isoforms, and their abundance. For presence and absence it's quite intuitive: we have a ROC curve, sensitivity versus one minus specificity, so true positive rate versus false positive rate, and again you want your curve to be as high as possible; the diagonal indicates pretty much randomness, the toss of a coin. These three lines represent the competitors, and these two lines represent our method: in red the version without the mRNA abundances, which is there because it uses the same information as the other methods, and you can see there is a gain; when we add the mRNA abundances we get even more accurate inference, which is exactly what we were hoping for. We also validated the abundances we estimate, simply by comparing them with our noisy ground truth, and we got very good agreement, a correlation of 0.72 on the log scale. In my view that is particularly good because here you have noise on both sides: there is noise in our estimates, that's always the case, but there is also noise in the ground truth, which is not perfect. Keeping that in mind, I think it's a pretty good agreement.

The project is not complete; it's not available publicly yet, but we're writing the package, and I think we'll have the package and the preprint in the summer, hopefully July, if we don't get delayed. To me this is one step in a series of potential analyses in this field, and you can think of extending this model to do other things in proteogenomics. The most natural extensions I have in mind are, first, to consider multiple samples: this model uses one sample, but as you've seen with BANDITS you can handle multiple samples with a hierarchical model, and that has multiple advantages. You have more information, because samples share information with each other; you get group-level results, like an average for group A and an average for group B; and then you can do differential testing at the isoform level between groups, which takes us one step further. Similarly, we can explore single-cell applications, where you can think of doing inference at the isoform level but per cell type, so cell-type-specific inference. And importantly, I do have funding for a one-year postdoc position in Bologna, which is exactly to continue this project, to keep developing these Bayesian methods in proteogenomics. It's in the Department of Statistics.
So there is a requirement of having a PhD in statistics or a similar field, like mathematics or biostatistics. We should have the application up in July, and the start date has to be within the year, so probably sometime between September and December. And, apart from the project, Bologna is a wonderful city; maybe not the best city to visit in Italy, but a really wonderful city to live in. So if you're interested, drop me an email and I'll give you more details. I'll stop here and take your questions, and then with the time left I'll cover the second part, which I think is less interesting, or less relevant, at least for me. Thank you.

So I guess no one has any questions; we've given it enough time. Okay, I'll move ahead, and then you'll have a second shot at the end. Ah, someone is writing in the chat. Oh, I guess you mean the proteogenomics package. No, it's not out yet; we are writing it, it's almost complete and we're refining it. Realistically I think the package could be out in June, and then the paper takes a bit more time, because I don't have a lot of collaborators, so July or August maybe. But the package will come first; I think next month it should be out. I'll put it on Twitter.

Okay, changing topic completely: let me show you a couple of applications of Bayesian inference in systems biology. As I said before, in bioinformatics I worked on developing methods, tools without a particular data set at hand; in systems biology I did the opposite. You will see two actual data analyses for which we developed an ad hoc methodology.

Another question is coming: do the common peptides contribute equally? Okay, I'm asked, I don't know if you can see the question, whether the intensities of shared peptides contribute equally to the groups the peptide maps to, so basically, if you have a peptide here that maps to four proteins, whether it's allocated equally among them. It's not: the information from a peptide is allocated to the proteins it maps to, but not equally; that would be a very inaccurate model. Take this peptide here that maps to isoforms D and E: the allocation of its information is done proportionally to the relative abundances of those proteins. There is a probabilistic model behind it: the relative abundances of the proteins determine how you allocate the information from the peptides, so if one isoform is a lot more abundant, let's say, it carries a lot more weight, and the information from this peptide is going to be allocated mainly to that one. It's pretty much the same in BANDITS, when we allocate reads to the transcripts of origin. Okay, happy to take more questions if you have them.

So let me talk now about this project where we were studying a particular transcription factor, Nrf2, which is relevant because it plays a key role in regulating the expression of some important genes. I won't go too much into the biology, but it's an important gene that regulates others, and so it's quite useful to study it.
To do that, we collected several measurements; in this case we have light intensity measurements, fluorescence data. That means you introduce a fluorescent reporter into the cell, I'll tell you the details later, then you stimulate it with a laser and it produces light. The light is proportional to the protein abundance: if there is more light there is more protein, if there is less light there is less protein. But it is not a one-to-one match: you don't get the actual number of molecules, only something proportional to it. So you do this, you get these images, and then, semi-manually, you have to draw the borders of the nucleus and of the cytoplasm; once you've done that you can compute the average intensity in the nucleus and in the cytoplasm, so you summarize each image with two numbers. If you do that every two minutes for several hours, you get a bivariate time series of the intensities in the nucleus and in the cytoplasm. You can see that they are partially independent but also partially correlated: here, for instance, there is a drop in the nucleus and an increase in the cytoplasm, which is intuitive, because a large amount of Nrf2 is moving from the nucleus into the cytoplasm. The interest of the project was to study these movements, these translocations in particular, to understand more about the system.

We had an original idea of how the system works, which, by the way, is why it's called systems biology: you have these kinds of systems that explain how a process works. But that was too complicated and too hard to work with, because we didn't have all the data we needed; it involves several actors, as you can see, but we only measured Nrf2. So we had to simplify this system, which is more accurate but impossible to fit, into something like this, with five possible events that summarize the key things that can happen. You can have synthesis of a new molecule, which our biologists told us is supposed to happen in the cytoplasm; degradation, which can happen in the cytoplasm but also in the nucleus; and then molecules can move from the cytoplasm into the nucleus, the import, or from the nucleus into the cytoplasm, the export. So you have five possible events, and mathematically, based on some biological rationale, you can associate with each event a hazard, which represents its probability of happening in a short time interval. In particular, we assume that synthesis is constant, that degradation is linear, so proportional to how much protein there is, and that the import is linear too, while the export is more complicated: it's nonlinear and it also depends on a delay, that is, on the amount of protein you had in the past, because there is an intermediate process happening, and so a delayed quantity plays a role. Anyway, these hazards are defined together with the biologists, but the key point is that they allow you to formulate a model.
In particular, this is what's called a Markov jump process: you have several reactions, the five reactions there, and the hazards tell you how likely each one is to happen in a small time interval. I'll skip this part because it's probably not relevant for this audience, but with further approximations you get to a normal likelihood, where you basically consider the difference from one time point to the next. Every two minutes you have a new measurement, so every two minutes you recompute the hazards and look at where the time series went next. One of the key things is that this model is stochastic; it's basically a stochastic differential equation, and stochastic models are quite challenging to work with, because it's very hard to infer their parameters. So why didn't we use a deterministic model? Because deterministic models are good when you work with an average signal: if you have a large population of cells and you want to study the overall Nrf2 signal averaged across many cells, then ODE models work very well. But single-cell data is very noisy, and so deterministic models are not accurate there.

Okay, so far I've described the complexity on the biological side, but there is also complexity on the measurement side. Originally you have DNA that is transcribed into mRNA, which is translated into protein, and that is the protein we would like to study, but it's not what we observe. We engineer the DNA and reintroduce it into the cell, and this engineered DNA then produces a reporter protein, sorry, a reporter mRNA that is hopefully similar to the original mRNA. When you stimulate that with the laser, you get the light intensity, the images you've seen before, and this light intensity is what we actually have; the assumption is that the process on the right behaves very similarly to the process on the left. The other issue is that this light intensity, as I said, does not give the overall abundance of protein; at most it is proportional to it, so we have a proportionality constant, and additionally there is a random error, a random noise, on top of that. So we assume there is stochastic noise in the measurement process, and the fact that this error is stochastic complicates the inference a lot, because it introduces, again, latent states, latent variables, for the true number of protein molecules. You can think of the underlying process X, observed every two minutes, as the process of the protein molecules, and at each time point we get an independent noisy measurement of it; that is what we observe. So again we try to separate the two sources of variability, the biological one and the measurement one, because the biological one is what we're interested in, while the measurement one is something we want to remove; we don't care about it. And we do that, again, with a latent variable approach: we assume that all these X's are latent variables, missing data, and we just sample them.
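To give an idea of what that normal likelihood looks like, here is a rough toy sketch of the transition density over one two-minute interval. It is a simplified version I wrote for illustration (constant synthesis in the cytoplasm, linear degradation in both compartments, linear import, and a linear export with no delay term), not the actual model:

```python
# Rough sketch of a normal approximation of the jump process over one interval:
# the increment is approximately Normal(S h dt, S diag(h) S^T dt), where h are
# the hazards and S the stoichiometry. State x = (nuclear, cytoplasmic) amounts.
import numpy as np

# columns = synthesis, deg_cyt, deg_nuc, import, export
S = np.array([[ 0,  0, -1,  1, -1],    # effect on the nucleus
              [ 1, -1,  0, -1,  1]])   # effect on the cytoplasm

def hazards(x, theta):
    nuc, cyt = x
    k_syn, k_deg, k_in, k_out = theta
    return np.array([k_syn, k_deg * cyt, k_deg * nuc, k_in * cyt, k_out * nuc])

def log_transition_density(x_now, x_next, theta, dt):
    h = hazards(x_now, theta)
    mean = S @ h * dt
    cov = S @ np.diag(h) @ S.T * dt + 1e-9 * np.eye(2)
    diff = (x_next - x_now) - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff) + 2 * np.log(2 * np.pi))

theta = (5.0, 0.05, 0.02, 0.06)   # made-up rates, for illustration only
print(log_transition_density(np.array([100.0, 300.0]),
                             np.array([98.0, 305.0]), theta, dt=2.0))
```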
So this was for a single cell, but we actually observe multiple cells, which means that, again, we have a hierarchical model: you can think of the Y's as the data, the bivariate time series for every cell, and we have about 35 and 36 cells in the two conditions. Each cell's time series is associated with its own parameter vector, so each cell has a distinct parameter vector, same as before, which gives flexibility, but there is a common prior, so there is sharing of information across cells. I was just checking the time, I'm supposed to finish at three, right? Yeah. So that gives us a model to sample from, and it has many parameters: there are about nine parameters for every cell, so nine times 35, plus the hyperparameters, so there are a lot of parameters involved. Again, we don't sample all of them together; we sample them in blocks of correlated parameters, so we try to find parameters that are correlated and sample them jointly.

The other key thing for me, again, comes in here with the informative priors. In the bioinformatics projects we analyzed several genes, so we had information from other genes; here, instead, we have information from additional analyses. First of all, there was a study, which is now quite old but still relevant, that gave us some indication about the degradation rate, and we used that in our analysis. And then we did some additional exploratory analyses that gave us information about the measurement process, which allowed us to estimate the measurement parameters and include them as prior information.

And here are the final results of our inference. These are the posterior chains of the, let me check, 35 cells in the basal condition and 36 in the stimulated condition. Obviously you can't make much of these plots, because there are 71 lines for each parameter: these are ten panels, one per parameter, each with 71 lines. So we summarize them with the hyperparameters, the group-level parameters, one set for the basal and one for the stimulated condition, and there you get a much more schematic view that actually allows you to draw some conclusions. We understood that the export from the nucleus is a lot faster, about three times faster, than the import into the nucleus. Then, and this was a pretty long time ago now, when we stimulate the cells most of the parameters don't change; what really changes is the speed of the movements, so import and export both become faster. Basically the cell becomes, let's say, a bit more hysterical: the protein moves more quickly between nucleus and cytoplasm, while synthesis and degradation are only marginally affected. And a third key finding, I didn't show you how we got there, is that this is a noise-induced oscillator, meaning that the system, as you've seen in the original image at the beginning, oscillates, there are these movements, but that only happens in a stochastic context; in a deterministic setting this system does not oscillate. I don't want to talk too much about the biology; this is more to show you what kind of conclusions you can get from a Bayesian analysis of this kind.
And lastly, I'll show you the final analysis, again in systems biology, where we were trying to study transcripts, so mRNA abundances. This is still fluorescence data, still fluorescent images, but the difference, statistically, is that we don't have a time series here. Technically there is a process that evolves over time, but we only observe it at one time point, just a snapshot; if you like, it's a bit like my bioinformatics data, omics data, a snapshot rather than a time series. So we first thought a bit about what a good model for transcription would be, and we started from the most basic one. This is probably the simplest transcription model, where the gene is believed to always be on and to transcribe at a constant rate: the gene transcribes at a constant rate alpha and the mRNA is degraded at a constant rate beta. A very simple model. It leads to a stationary distribution that is Poisson, meaning that if you observe the process over time, or if you look across many cells, it's like drawing from a Poisson distribution. This is clearly not a realistic model, because real data show a lot more noise than a Poisson allows, and that's why we jumped to a second model, which is much more realistic. The top part is the same as before, but now there is an additional part at the bottom saying that the gene can be active, on, but also inactive, off; the gene switches between the two states, and it does so stochastically, with waiting times governed by these parameters k0 and k1. This is a lot more realistic because it basically says that genes are mostly off and transcribe a lot of mRNA only in short periods of time. This models what are called transcriptional bursts: the gene is mostly off, then it turns on, transcribes a lot, and turns off again. You can study this too, and the stationary distribution is a Poisson-beta.

But the model we ended up using is this one here, which is the same as before but with one additional arrow: we don't assume that transcription is completely zero in the off state; we allow a non-zero transcription there. So the gene, when it's off, is not completely off, but works at a background level. Our hope was that this marginally increases the realism of the model, because you also allow the gene to transcribe while it's quiet, and it still results in a Poisson-beta distribution. Just to give you an idea, this is a simulation, but I think it helps to understand the process. This is the mRNA abundance and this is time. You see the mRNA abundance sits around this range, then the gene turns on for a very short time, a lot of mRNA is produced, and then it is degraded again; then the gene turns on again, a lot of mRNA is produced, which is then degraded, and so on. If you then look at it horizontally, basically if you compute the density of this, you get a distribution that looks like that, which is a Poisson-beta distribution. Sorry for the noise, it's on my side.
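Just to make the bursting mechanism concrete, here is a small Gillespie-style simulation of this "leaky" two-state model. It is my own toy sketch with invented rates, where sampling each run at a fixed time mimics the snapshot data:

```python
# Toy simulation of the leaky two-state transcription model sketched above:
# the gene switches on/off, transcribes at rate alpha_on when on and at a small
# background rate alpha_off when off, and the mRNA degrades at rate beta.
import numpy as np

rng = np.random.default_rng(4)

def simulate_snapshot(k_on, k_off, alpha_on, alpha_off, beta, t_end=150.0):
    t, gene_on, m = 0.0, 0, 0
    while t < t_end:
        rates = np.array([
            k_on if not gene_on else k_off,       # gene switches state
            alpha_on if gene_on else alpha_off,   # transcription
            beta * m,                             # degradation
        ])
        total = rates.sum()
        t += rng.exponential(1.0 / total)
        event = rng.choice(3, p=rates / total)
        if event == 0:
            gene_on = 1 - gene_on
        elif event == 1:
            m += 1
        else:
            m -= 1
    return m                                      # snapshot of mRNA count at t_end

# many cells observed once each, like the snapshot data described above
sample = [simulate_snapshot(0.05, 0.45, 8.0, 0.2, 0.05) for _ in range(200)]
print(np.mean(sample), np.percentile(sample, [5, 50, 95]))
```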
So that was the biological part, but, just as for Nrf2, there is also a complication on the measurement side. In this case the measurement procedure is different: the original DNA transcribes the mRNA, and we introduce a fluorescent tag on the original mRNA, so there is no parallel reporter process going on, we just put a tag on the actual mRNA. When you stimulate it with a laser, that gives us a measurement. From a statistical point of view, from my point of view, it's the same situation, because the key point is that the measurement is random, and that creates trouble: it introduces a proportionality constant that I have to estimate and, importantly, a random error. So once again the underlying quantity, the mRNA abundance, is not observed without error, and it becomes a latent state, missing data, that we recover with our model.

We handle this missing data in a different way here, though, which is maybe good because it lets me show you an alternative. So far I've shown one approach for dealing with missing data, which is typically called data augmentation, and basically consists in sampling the missing data: as I said, missing data and latent variables are just parameters and you can sample them. That is one approach. There are many others, but here I'm showing an alternative, which is to integrate out the latent variables: you literally compute an integral with respect to those variables and get rid of them. The challenging part is actually doing that integral, because it's not trivial. Without going into too many details, we simulate from our model many times, and that allows us to compute an approximation of the likelihood. Normally in MCMC you would use the likelihood to do your inference; here the likelihood is given by a very complex integral, so we estimate that integral with an unbiased estimate, and we use this unbiased estimate in place of the original likelihood. This is called the pseudo-marginal approach, and conceptually it's very simple: you replace the original likelihood with an estimate of it, which has to be unbiased for the MCMC to remain valid, and then the MCMC proceeds just as usual.

You might ask why we did something different here and did not sample those latent states. Because that would have led to a very, very big posterior space. In this model, for every cell, sorry, for every sample, we only have seven parameters, which is a fairly small posterior space, but there are 2,000 latent variables. Sampling 2,000 latent variables on top of seven parameters would have been very messy, because it gives a very big posterior space to explore, so we got rid of them and integrated those 2,000 latent variables out. To be a bit more technical, when I say messy I mean that convergence is harder and so is mixing: the chains tend to get stuck, you don't manage to explore the whole posterior space, or at least it's harder, let's put it that way.
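Here is a conceptual sketch of a pseudo-marginal Metropolis step. It is a generic toy example I wrote (a single shared latent rate, integrated out by simulating from its prior), not the actual transcription model; the important detail is that the likelihood estimate of the current state is stored and reused, not recomputed:

```python
# Conceptual pseudo-marginal Metropolis-Hastings sketch: the intractable
# likelihood p(y | theta), which involves integrating out a latent variable,
# is replaced by an unbiased Monte Carlo estimate, and MCMC proceeds as usual.
import numpy as np

rng = np.random.default_rng(5)

y = rng.poisson(12.0, size=10)               # toy data, Poisson given a latent rate

def loglik_hat(theta, n_sim=500):
    """Unbiased estimate (up to a constant in theta) of p(y | theta) on the log
    scale: simulate the shared latent rate from its Gamma(theta, 1) prior many
    times and average the conditional likelihoods (importance sampling from the prior)."""
    lam = rng.gamma(theta, 1.0, size=n_sim)
    ll = (np.log(lam)[:, None] * y[None, :] - lam[:, None]).sum(axis=1)
    return np.logaddexp.reduce(ll) - np.log(n_sim)

theta = 5.0
ll = loglik_hat(theta)
for it in range(2000):
    prop = theta + 0.5 * rng.normal()
    if prop <= 0:
        continue                              # flat prior on theta > 0: reject
    ll_prop = loglik_hat(prop)
    if np.log(rng.uniform()) < ll_prop - ll:  # symmetric proposal, flat prior
        theta, ll = prop, ll_prop             # keep the estimate together with the state
print(theta)
```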
So once again we have a hierarchical model, but in this case the hierarchy is not on the cells: we cannot put a hierarchy on the cells, because each cell gives us a single observation. We would like to, but we just can't. Every experiment gives us about 1,000 single-cell observations, so a vector of 1,000 observations, and each experiment is associated with its own parameter vector; we have four replicates, and there is again a common prior, so it's a hierarchical model, but on the replicates instead of the cells. In this project, just like in the other one, we obviously tested the model on simulations first, I'll skip that part, just to make sure things work before fitting the model to real data.

But, as in all the other cases I've shown, we also take advantage of informative priors, and I'll show you why this was really, really important here, conceptually. Think about a density, say the normal density. The normal is defined by two parameters, the mean and the variance; most densities are defined by one, two or three parameters, like the normal, the exponential with one, the negative binomial with two. The idea is that with two or three parameters you can already get a very wide spectrum of densities. This particular density here, instead, is defined by seven parameters, some depending on the biology and three depending on the measurement process. That means that what you're trying to do is to say: I have data from this density, let me try to infer the vector of seven parameters that generated it. That is very hard, because there are many vectors of seven parameters, many combinations, that give you almost identical densities, technically different but almost identical. So in practice it's very, very hard to identify a seven-dimensional parameter vector from a density. The way we deal with this is, again, informative priors: we ran additional experiments on the measurement process, and that allows us to use a very strong informative prior on the measurement error parameters. That doesn't exactly reduce the parameter space, but it simplifies it a lot, because two parameters are almost fixed by the prior, and so we're left with essentially five parameters. It's still a difficult job, because you have to identify five parameters from a density, but a lot easier than seven. So once again, having an informative prior was absolutely essential to be able to identify those parameters.

After validating this on simulations we took some real data. We studied the HIV gene, but that was really just an example gene, we were not so interested in studying HIV in this case; we had the HIV gene under two conditions of stimulation. This is the kind of data we had: as I said, each sample is a density of basically 1,000 measurements, four samples under the lower level of stimulation and four under the higher level. This is the input of our method, and you can see there is a difference between the black and the red dotted lines: when the stimulation increases, the density shifts to the right, which means you observe a higher light intensity. So the question was: which parameters are changing? That is what we tried to answer.
These are the results of our inference on the seven parameters; here I'm plotting all the hierarchical parameters, because we only have four replicates, so that is feasible to visualize. The first thing to notice is that most of the parameters don't change between the conditions; what does change are the parameters that refer to the switching rates. So the first answer to our biologists was that the measurement process and the transcription rates are not really affected by the stimulation; what changes the most is the activation of the gene. When the level of stimulation increases, the gene is more active: it turns on and off more frequently, so it's more dynamic.

Secondly, we also wanted to see whether adding the extra parameter was actually useful. As a small reminder: we started from this model and then added this little branch, allowing for transcription in the off state, and we wanted to see whether that was necessary, whether it was useful or whether it changed nothing. The answer was: it depends. If you look at the value of the transcription rate in the off state, it's a lot smaller than in the on state, between zero and 4% of it, so when the gene is on it's something like 25 or more times more active than when it's off. But at the same time the gene is mostly off: it's only on about 10% of the time, so 90% of the time the gene is inactive. That means that even if the transcription rate in the off state is small, since the gene is mostly inactive you still get a good portion of the transcription from the off state, about 10-20%. So indeed most transcription happens when the gene is active, but a non-negligible fraction, around 10-20%, actually comes from the off state.

Now, I'll skip ahead. We don't have a package here, obviously, because as I said this is a real data analysis, but if you're very interested you can read the paper. I'll spare you the summary, and I conclude with the same picture as last time, shame on me, I promise I'll change it next time: an image of the David. So now I'm happy to take questions about both parts.