So here in this presentation, what I'm going to present to you: you saw the gene page, where we have information on presence of expression with an FDR value and an expression score, and I'm going to dig more into the statistical analysis that we do to generate this information on the Bgee website. Okay, so this is an overview of the Bgee pipeline. We integrate different data types, notably bulk RNA-seq and single-cell RNA-seq, but also Affymetrix data, ESTs and in situ hybridization. We do a lot of quality control and condition filtering to keep only the healthy wild-type data of high quality, and then we reanalyze all this data to detect active signals of expression, and this is the part I'm going to present now, especially for RNA-seq data, both bulk and single-cell. So yeah, our question is: how can we detect the conditions where a gene is actively expressed? Just to get back to basic molecular biology: here on the DNA you would have a promoter sequence near a gene, and you have a transcription factor binding to this promoter to activate or repress expression of this gene. And when the gene expression is activated, you will have the RNA polymerase binding and generating the mRNA transcript, and then again a ribosome binding to this mRNA to perform the translation and produce the protein. So the point is that each of these steps is a molecular binding process, and they are all stochastic: you have a binding affinity representing a probability of the transcription factor binding to this promoter. Or here again, the ribosome has a certain likelihood of binding to the mRNA to produce a protein. So all of these processes are stochastic, and that leads to variation in gene expression. So for instance, here was a study looking at the expression of a GFP in yeast over time, and each line here is the expression level in one yeast cell.
And you can see that this expression varies a lot over time, even though they didn't do anything — they didn't change the conditions while they observed this expression. And in this graph, what they did in this experiment is that in each cell they incorporated both a yellow fluorescent protein and a cyan fluorescent protein, and these two genes were under the control of the same promoter. And they were trying to see how much variation there is between the expression of these two genes under control of the same promoter. Because the idea is that there is variability due to stochasticity — so just the inherent molecular processes — and there is variability due to external resources available: for example, ribosomes have to be available for translation, and if they are all busy on other genes they cannot do the translation for your actual protein. Or maybe you don't have the nutrients available to generate the protein. So there are these two kinds of limitation: intrinsic limitations, which are the stochasticity of the molecular processes, and extrinsic limitations, which are due to the availability of resources. And here, basically, you can see that there is a dispersion of this measurement of the expression level of one fluorescent protein compared to the expression of the other fluorescent protein. So there is a dispersion along the diagonal of this graph, and a dispersion orthogonal to the diagonal. And here I'm going to ask you a bit of a tricky question, to see if you have an idea of what each represents. So, I'm going to have to switch often like that, sorry, so I can check my screen in Firefox. So, I'm going to ask you a question: if you take the master document — I gave you in the presentation a link to the activities — that will lead you to this page.
And for this first part you can go to the Wooclap — Marc, maybe you can actually launch the Wooclap, and I will share the results when it's done. So I asked you, on the graph that I showed — okay, thanks — on the diagonal, does it represent intrinsic or extrinsic noise of expression, or both? And on the axis orthogonal to the diagonal, does it represent intrinsic noise, so stochastic processes, or extrinsic noise, so resource limitation? Just to get an idea of how you perceive that, and then I'm going to detail this graph. So please let me know, Marc, when it's over so that I can click on the link. 20 seconds left. Yeah, vote, vote, vote. 10 seconds left. Many people haven't voted. You don't have the figure displayed, so that's five, four, three, two, one. Okay. I cannot see the results here. Can you share the results, please? Yeah, it's pretty simple. So you didn't take the risk and say both. Okay, so actually I'm going to show you the result. It's not both. Can you stop the sharing? Maybe I'm going to manage to share both my Firefox and my presentation. Yes, so getting back to this: actually, along this axis, the diagonal axis, it is the extrinsic noise. Because it means that the variation is correlated between the two genes, meaning that the available resources were limiting for both genes at the same time. Whereas the intrinsic noise, which is really the stochastic molecular process, should be uncorrelated between these two genes: even though they're under control of the same promoter, it will affect the two genes differently because it's totally random. So you have this random noise, uncorrelated between the two genes, and this non-random noise, depending on resource availability, that will be correlated between these two genes.
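The decomposition the graph illustrates can actually be written down directly: across cells, the uncorrelated part of the variation between the two reporters is the intrinsic noise, and the correlated part is the extrinsic noise. Here is a minimal sketch following the standard two-reporter definitions (the function name and input format are mine, not from the study):

```python
import numpy as np

def noise_decomposition(c, y):
    """Two-reporter noise decomposition.
    c, y: fluorescence of the CFP and YFP reporters, one value per cell.
    Returns (intrinsic, extrinsic) squared noise terms."""
    c = np.asarray(c, float)
    y = np.asarray(y, float)
    mc, my = c.mean(), y.mean()
    # Intrinsic: mean squared difference between the two reporters
    # (uncorrelated fluctuations), normalized by the mean levels.
    eta_int2 = np.mean((c - y) ** 2) / (2 * mc * my)
    # Extrinsic: covariance between the two reporters
    # (correlated fluctuations, e.g. shared resource limitation).
    eta_ext2 = (np.mean(c * y) - mc * my) / (mc * my)
    return eta_int2, eta_ext2
```

When the two reporters move in lockstep (dispersion along the diagonal), the intrinsic term vanishes and everything lands in the extrinsic term.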
So what I wanted to show you here is just that gene expression is noisy: there is variation in the gene expression process that is not intended, in a way — though expression noise can be a feature for some genes. But the point is: to detect whether an expression is active, is it active when the gene expression is as low as in this cell, or active when it is as high as in those cells? It's actually a tricky question to answer. So this is the first kind of noise, which is the biological noise, but then we also have technical noise. So here I show you the preparation of an RNA-seq library. You have your cells, you extract your RNA, and then you capture a specific RNA species: in most experiments you will have a poly(A) selection, so you will get only mature protein-coding mRNAs, or you can have a ribo-depletion to remove the ribosomal RNA but keep all the other species. Then you do the reverse transcription to get your cDNA, and based on that you will ligate adapters, do PCR amplification, and do the sequencing. And after that you have the bioinformatics analysis: you have your sequenced reads, you try to align them to your genome, and then quantify the abundance of expression of your different genes. So at each of these steps, noise will be introduced. So now if you go back to the Google Docs — you see my browser here, right? Right, right, we see it. So if you go back to this document, you have a second question here, where I ask you to think about what the different possible sources of technical noise are in an RNA-seq experiment. So please, in this table, put your name and propose some sources of technical noise in an RNA-seq experiment, and I show my screen again. So, maybe this one: you have these two steps, the library preparation step and the bioinformatics analysis step. I give you a minute or so to answer. Okay, you have lots of suggestions. So I'll try reading what you put.
So first is time of sampling. I would say time of sampling, in my opinion, would be more a biological noise. Because, yeah, based on circadian rhythm for instance, you will have a variability of gene expression, but that will not be due to the library preparation step or to the bioinformatics pipeline; instead that will really be an actual variation in gene expression. Then you have batches in library preparation, different hands extracting the material, different dates or reagents. Yes, exactly. That's a big source of variability, of technical noise. And this is why most of the time you need replicates: in good experiments, you always have one sample and you do technical replicates just for the sake of quantifying this technical noise. Actually, in a lot of experiments, you don't have such replicates using the same biological sample, so you need a method that can accommodate this problem. So, sample preparation, library construction: yes, those are all technical noise sources. In terms of the cells, in my opinion, that's more a biological variation. And then different technicians, RNA isolation kits — exactly. We have RNA extraction, library preparation, cDNA synthesis, batch effects: exactly, all of that are sources of technical noise. And level of gene expression, very low versus very high — that would be a biological variation; that wouldn't be a technical variation due to the preparation or the sequencing of the cDNA. Okay, so I just wanted to emphasize that in gene expression measurements we always have these two sources of noise: a biological noise, which is due to the stochasticity of the process and the availability of the resources, and the technical noise, due to everything that you mentioned during all those steps. So, yeah, I went through that a bit quickly.
Here it's more the noise from the bioinformatics analysis: for instance, you could have errors in reads or errors in read mapping; you have reads that are ambiguous and can map to different parts of the genome. And depending on the genome quality or the genome assembly, you will have different mappings occurring. So all of that are sources of technical noise. But so again, the question is: at which point can you say that yes, the gene is actively expressed, over background stochastic transcriptional noise and over technical noise? I'm just checking, yeah. Okay, so what we want to do in Bgee, as you saw on the gene page — Marc presented that before — is to say where a gene is active, and in a way we want to transform all the data. It's a bit like in situ hybridization data, as I show here — it's an in situ hybridization from ZFIN, in a zebrafish embryo, and the spots here are zones of active expression for a given gene. In a way, we want to transform all our data so as to have areas where we know that a gene is actively expressed; then it is information that is easily comparable between genes and between species. So I'm going to show you how we do that from bulk RNA-seq data, and also from single-cell RNA-seq data. So, just showing you again the analysis pipeline: we extract the RNA, get the mRNA, do the fragmentation, the sequencing, and then the alignment to the genome. From this, we get reads mapped to the genome, and the question is: at which point do you consider a gene as expressed? In most experiments, in most analyses, authors often use an arbitrary cutoff on the expression level, and then they say, okay, above this cutoff the gene is actively expressed. But it is very unlikely that one threshold can fit all situations of gene expression, all different cell types, or all different species, or all different experimental conditions.
So I think there is very little consensus on what is a good value to consider that a gene is actively expressed. So, what I'm asking you now, if you go back to the Google Doc: what is, for you, an appropriate threshold to consider a gene as actively expressed? For instance, that could be — I don't know, I'm saying whatever — above 10 TPM. So TPM is the unit of gene expression level in our analyses, but you also have the unnormalized read count: for instance, do you consider a gene as expressed as soon as you have one read mapped to the gene, or 10 reads mapped to the gene? So if you could please enter here in this document what you expect to be a good threshold to consider a gene as expressed. Again, maybe 10 TPM for instance, or 100 reads mapped to the gene. If you have absolutely no idea but you have a guess, please provide it here. So thanks for taking the risk of entering an answer when you have no idea. Maybe if you're a bit shy of putting an answer, you can put it and just not put your name. Yeah, it's not mandatory; it's more to know whom we are interacting with, not to put peer pressure or whatever. There's still time if someone else wants to take a chance. Okay, so I take your answers — sorry, there is a last answer, but I will comment as it is written. So, most of your answers consider read counts. Read counts would be difficult to use, because they are not normalized for gene length and library size. If you have a library with 100 million reads, what does it mean to have 10 reads or 100 reads mapped to a gene? And also, if you have a very long gene, it may be very easy, just stochastically, to generate 20 or 30 reads; it really depends on the gene length and the depth of sequencing. So unnormalized read counts would not be a good measurement to assess the activity of a gene.
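To make the contrast with raw counts concrete: TPM first normalizes each gene's count by its length, then scales the library so all values sum to one million, which is what makes the numbers comparable across genes and libraries. A minimal sketch (the `tpm` helper is illustrative, not Bgee code):

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Transcripts Per Million.
    counts: raw read counts per gene; lengths_bp: gene lengths in base pairs.
    Normalize by gene length (in kb), then rescale so the library sums to 1e6."""
    rpk = np.asarray(counts, float) / (np.asarray(lengths_bp, float) / 1e3)
    return rpk / rpk.sum() * 1e6
```

With this normalization, a gene twice as long with the same raw count gets half the TPM — exactly the length effect that makes raw counts misleading.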
So usually that should be a normalized unit such as TPM; TPM is the default normalized unit in RNA-seq gene expression studies. And I see here, for instance, over 10 TPM. So 10 TPM would actually be quite high; in some studies it can happen, but in most studies it is one or two TPM. But again, very little consensus: the most robust experimental analyses I saw were suggesting 2 TPM as a good overall threshold in human. So, in most studies today, you will see that over one or two TPM you consider the gene as expressed, and then you use those genes in downstream analyses, such as a differential expression analysis. Here I see "three standard deviations over average expression": for me that would be more related to overexpression, to differential expression analysis. I would say that your gene is overexpressed as compared to the other samples, but that wouldn't tell you whether a gene is active or not. Maybe a gene is below the average expression but still actively expressed in its own context — a gene might be very lowly expressed all the time but still very important. So, yes, thank you for giving that; it shows that it's really not clear when you can consider a gene as actively expressed. So here I show you a figure from a recent paper of our lab showing the true discovery rate and the false discovery rate at different TPM thresholds. As the baseline, the gold standard for the truth, we use Ribo-seq data in mouse liver. Ribo-seq data allow to really identify the mRNAs that have been translated: you get the mRNAs that were protected by the ribosome, so the mRNAs that were being actively translated at the time of the library preparation. So this is the baseline of truth for which genes were actively expressed.
And so we compare that to RNA-seq data: we used 89 RNA-seq libraries in mouse liver, applied different cutoffs from 0.5 to 10 TPM, and compared that to this truth provided by the Ribo-seq data. You see that for a high TPM value such as 10, you have a very, very low false discovery rate, because you're very stringent — you consider only very high expression as being truly actively expressed — but you have a true discovery rate of 75%, which means that you miss a lot of genes that were actively expressed. And on the other side, if you take a very low cutoff, of course you will recover almost all genes that are actively expressed, but at the cost of a high false discovery rate: you will have a lot of false positives in your results. So you see it's a bit tricky to define a threshold like that, and usually it's one or two TPM — you can see that indeed with two you have a good balance of true positives and false positives. And this is looking at all the species in Bgee. It is a box plot, for each of the 52 species in Bgee, of the percentage of protein-coding genes that are considered actively expressed using a threshold of 2 TPM, across all the RNA-seq libraries integrated in Bgee. And you can see that with this threshold you have a huge dispersion for some species — here it's human — and you have a lot of libraries in which 0% of protein-coding genes are called expressed with this 2 TPM threshold. So the library preparation, or the sample, led to a very high average level of expression, and if you define a threshold of 2 TPM, none of your genes are going to pass this threshold. So for some species, overall, the median percentage of protein-coding genes considered expressed is quite low — usually you would expect something like 70% of your protein-coding genes to be actively expressed. And you can see that there is a high variability between the different species here.
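The benchmark above boils down to sweeping a cutoff and scoring the resulting "expressed" calls against the Ribo-seq truth. A sketch of the two metrics (function and variable names are mine, not from the paper):

```python
import numpy as np

def tpr_fdr(tpm_values, truth, cutoff):
    """Sensitivity (true discovery rate) and false discovery rate of an
    'expressed' call (TPM >= cutoff) against a truth set, e.g. Ribo-seq.
    tpm_values: TPM per gene; truth: bool per gene (actively translated)."""
    called = np.asarray(tpm_values) >= cutoff
    truth = np.asarray(truth, bool)
    tp = np.sum(called & truth)       # correctly called expressed
    fp = np.sum(called & ~truth)      # called expressed but not translated
    tpr = tp / truth.sum()            # fraction of true genes recovered
    fdr = fp / max(called.sum(), 1)   # fraction of calls that are wrong
    return tpr, fdr
```

Raising the cutoff lowers the FDR but also the TPR, which is exactly the trade-off the figure shows between 0.5 and 10 TPM.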
Okay, so now I'm going to show you how we do that in Bgee — how we determine a level of active expression in any RNA-seq library and in any species. Before showing you how we do that, I'm going to ask you a question on Wooclap. So Marc, if you can activate the Wooclap. I'm going to ask you: if you prepare an RNA-seq library and sequence it, what do you expect to find in your results? Do you expect to find reads mapped only to genic regions? Or reads mapped also to other regions of the genome, such as intronic regions and intergenic regions? So please follow this link here. Marc, activate the vote, and vote for what you expect to find as results from an RNA-seq library: which genomic features are you going to find when you map your reads to your genome? Okay, the vote is activated and one person has voted. And I stop the sharing so that you can share the results, Marc. Okay, I'll do that when we get to the end of the time. So the results are pretty stable. Okay. I forgot to mention that multiple answers were allowed, so maybe that tricked you. So most of you answered protein-coding genes, as well as non-protein-coding genes. Okay. And actually, what is more surprising is that from an RNA-seq library, you will also find reads mapped to intronic regions and intergenic regions. It's surprising because we're supposed to have captured only mature mRNAs, with intronic regions removed. And intergenic regions are not supposed to be expressed at all. And yet we find some expression level, some reads mapping to intergenic and intronic regions. So — I'm going to share my screen again — this is here. This is a density plot of the measurement of reads mapped to the genome. So here you have the log RPKM: it was an expression level unit used at the beginning, when RNA-seq started to exist. It has some flaws, so now it's TPM that is used, but in this paper it was still RPKM.
So you get the expression level here in log RPKM, and the density. And here, this line is the exonic regions — what you expect — but you can see that you still have expression signal for intronic regions (this one is intronic) and for intergenic regions. It's not so surprising, because again, expression is a noisy process: just randomly, you could have an RNA polymerase binding to an accessible intergenic region, just by chance basically, and you will find some reads mapping to this region of the genome. What I want to show you as well is that you get this distribution, this density, for exonic regions, and you see here a shoulder on the left of the distribution. This is very typical: in most RNA-seq libraries, when you look at the density of the expression levels of genes, you will see this shoulder on the left. And in this paper, what the authors hypothesized was that this shoulder here and this peak on the right were actually the sum of two distributions: a class of what they call lowly expressed genes, and a class of what they call highly expressed genes. And these lowly expressed genes here, when they looked at them, showed all the characteristics of non-actively expressed genes: the chromatin was not accessible, or the histone methylation marks were consistent with the genes not being expressed. So the genes were not actively being expressed when looking at these features, and yet you have some expression level associated with them. And what we noticed, actually, is that this distribution of the lowly expressed genes matches closely the level of expression of intergenic regions. And intergenic regions, in well-annotated genomes — well, if they don't contain any genes, they are not supposed to be expressed, right? So probably, by using the expression level of intergenic regions, we can have an estimate of a threshold above which genes are actively expressed. So this is what we do in Bgee.
So here I show you this density plot, generated by us from a Drosophila melanogaster library. This is a library integrated in Bgee. Again, you can see here this shoulder on the left for genes — here it's protein-coding genes, so it's even lower — and this peak of expression of actively expressed genes. And in blue you have the distribution of the intergenic regions. So you can see that here, for instance, you would have a higher expression level for intergenic regions, on average, than for these protein-coding genes; so probably the protein-coding genes here are not actively expressed — it's just expression noise. So what we do is that we estimate the distribution of expression levels of intergenic regions. Then we compute a Z-score: we take the TPM value of your gene and compare it to the distribution of intergenic regions, in terms of standard deviations from that distribution. And from there we compute a p-value, which means that using this approach, at each TPM level here, we can give you a p-value related to the hypothesis that the gene is actively expressed or not. I'm just checking — yeah. Okay, and these are the results here. This is what I showed you earlier: the percentage of protein-coding genes called expressed in all RNA-seq libraries, for all the species integrated in Bgee. This was with a threshold of 2 TPM, and this is applying our method of using intergenic regions to estimate the background noise and get a p-value for the hypothesis of active expression. And here you can see that, first, it's much more consistent between species — you have much less dispersion — which is a good indication that the method is maybe more robust across species. Here for human, you have much less dispersion, and you have no libraries where no genes at all are called expressed.
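A rough sketch of this kind of call — an illustration of the idea, not the actual Bgee implementation; in particular, the assumption that intergenic log-TPM is roughly normal, and the pseudocount, are mine:

```python
import math
import numpy as np

def presence_pvalue(gene_tpm, intergenic_tpm):
    """Z-score of a gene's log2(TPM) against the log2(TPM) distribution of
    intergenic regions, with a one-sided p-value for 'expressed above
    background'. Assumes the intergenic log-TPM values are roughly normal."""
    eps = 1e-6  # small pseudocount to avoid log(0)
    bg = np.log2(np.asarray(intergenic_tpm, float) + eps)
    z = (math.log2(gene_tpm + eps) - bg.mean()) / bg.std(ddof=1)
    # Upper-tail p-value under a standard normal, via the complementary
    # error function (stdlib only, no scipy needed).
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p
```

A gene far above the intergenic background gets a large z and a small p; a gene sitting inside the background distribution gets a p-value near 0.5 and is not called present.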
So, looking at these graphs, you can already get the idea that our method is probably more robust across samples and across species. I'm just checking the question I ask you afterwards, because since I don't have the presenter mode, I don't remember exactly where I put the question. So here again, as in the previous graph I showed you, this is using a TPM cutoff — it's the exact same graph I showed you earlier — and this is using our approach, where the cutoffs are p-values based on this comparison to the distribution of intergenic regions. So here, with the lowest, most stringent cutoff, you see that we have a very low false positive rate, and quite a good average true positive rate. If we look at this p-value cutoff of 10^-3, it's probably about equivalent, as you can see, to a cutoff of one or two TPM. So that would achieve a good balance of false positives and true positives. The advantage here is that it's not arbitrary: you set a threshold on the p-value, and you have a statistical hypothesis that you can test. And that's going to be much more robust across samples. To show you that, here is a distribution — again a box plot — of the percentage of protein-coding genes considered as expressed, in human blood samples from GTEx. If you use this 2 TPM threshold on these human blood samples from GTEx, you get this distribution, across the different libraries, of the percentage of protein-coding genes called expressed, with an average of something like 30% of the genes expressed. While if you use our method, you get a much more normal percentage of protein-coding genes considered as expressed, as expected in other organs or samples. And it's because in blood you have the globin genes: something like 80% of the reads map to those genes.
And actually, you usually do globin depletion when you prepare a blood sample, to remove those reads, because they capture all your signal and would not allow you to look at the expression of other genes. But we realized that in GTEx, some samples were prepared with globin depletion and some were not. And this leads to this high variability. And actually, our method accommodates the fact that you have either globin depletion or not. It doesn't matter, because the intergenic noise is going to be modified depending on whether you have 80% of your reads mapped to globin genes or not: the intergenic region noise estimate will vary accordingly and allow you to recapitulate a proper signal, without a lot of dispersion. And so I think here I had a question. Yeah, maybe actually I was overenthusiastic. Oh, yeah. Okay, it's this question. So I'm going to ask you: with this method based on intergenic region estimates, what limitations could you expect in using this method? So I know it's a tough question — it's this question here — I know you're not familiar with our method, I just presented an introduction. But if you think about it: we use intergenic regions, defined by the genome annotation, and we map reads to these intergenic regions to estimate the background noise of expression. Don't we have any limitations using this approach? I give you a minute or two to answer that. And no worries, I know it's a tricky question. We have only a few people answering, showing that the question is quite tricky. So thanks to the three people answering right now. I don't think we're going to have more people answering, apparently.
So, if I read the answers: correct annotation of intergenic and expressed regions; variability in the proportion of true intergenic expression between cell types and tissues; biased annotation; and technical variability in library preparation between experiments, which may affect the sequencing of intergenic regions and may affect the threshold determination. So, I see kind of two answers here. The first one is about the annotation. And yes, it's totally correct: for human, mouse, Drosophila melanogaster, we have very well annotated genomes, so we are pretty confident that the intergenic regions do not include unannotated genes. But for some other species — we have a lot of genomes being published now, and lots of those genomes are not at the level of quality of the human or the mouse genome — they miss a lot of genes. These genes are unannotated, and they are going to be considered as intergenic regions. And of course these genes are going to be expressed, so they are going to skew our estimate of the expression level of what we consider intergenic. So the genome annotation is a huge limitation of this approach, and I'm going to show you how we address that. The other answer I see is variability in the proportion of true intergenic expression — kind of the same answer here, like technical variability affecting the intergenic expression. But this is actually what we want: this variability due to the library preparation is going to affect the gene expression as well. So if the preparation somehow affected the gene expression, it is going to be reflected in the expression level of the intergenic regions. If the library preparation led to more noise in the expression levels, well, it is going to be reflected in the expression level of the intergenic regions, and this is exactly what lets us have an adaptive threshold that accommodates any library, rather than a fixed threshold that would not deal with libraries with high technical variability, for instance.
So this is exactly the advantage of our approach: that variability is going to be taken into account by the intergenic level of expression. Getting back to this presentation, about the genome annotation quality: it is exactly the main limitation of this approach. So here, for Macaca mulatta, where the genome is less well annotated: in blue here — we took all the RNA-seq libraries that we had in Bgee, mapped them, and looked at the distribution of expression — this is the intergenic regions. And here the intergenic regions are shifted to the right, and much more overlapping with the protein-coding genes, and it shows that indeed, in these intergenic regions, there are a lot of unannotated genes that are actually expressed. So it is not a good reference for the expression noise. What we do in Bgee for that is a deconvolution of these intergenic regions. So here — I didn't check exactly what Bgee does for this species — you see that we have several Gaussians representing these intergenic region expression levels, and in Bgee we are probably going to keep only the leftmost distribution here, the deconvoluted component, so that we keep only the intergenic regions represented by this Gaussian. And so in Bgee, using this method, we have a subset of true intergenic regions, for which we are very confident that they do not contain any unannotated genes, so that our method is robust to the genome annotation quality. I'm not going to give you more details about that, but just know that what we use in Bgee is not the default intergenic regions coming from the genome annotation: we refine these intergenic regions to keep a subset of robust, true intergenic regions to estimate this background noise in the expression signal. So this method for calling genes present or absent based on intergenic regions is available in a recently released Bioconductor package called BgeeCall.
Oh, and I kept a slide from last year, sorry — there is no practical this year about this package. You can use this package really easily for any species in Bgee, because for any species in Bgee we provide these pre-computed true intergenic regions. If you want to perform such an analysis on a species that is not in Bgee, then you will have to provide these intergenic regions yourself. We have a method to do that, and we have a repository where you can upload your own set of true intergenic regions to be used in the package as well. So if you use one of the 52 species in Bgee, it's going to be very easy; and if you want to do that on another species, we provide all the scripts and tools allowing you to do it, but it's going to be more work. So, moving to single-cell analysis. For single-cell analysis, the most typical use is to perform cell-type clustering, so that you can identify cell types, or new cell types; it's not commonly used to identify active signals of expression for genes in individual cells. So this is a density plot showing, for an experiment, the proportion of cells in which each gene is expressed. So in this experiment, these genes here are expressed in all the cells, and these genes here are expressed in none of the cells. You have a clear bimodal distribution here. So if you look really at individual cells, the expression in cells of the same cell type is very, very consistent: some genes are expressed in almost all cells of a given cell type, and some genes are expressed in none of the cells of that cell type. So there is really a signal of expression that is very interesting to get from this single-cell data. So here I show you the full-length protocols — you have kind of two types of single-cell protocols, and one allows to sequence the full length of the mRNA.
And so it is pretty much the same as a bulk RNA-seq library: you just have well-established PCR amplification steps, but it is very similar to bulk RNA-seq. And when we look at the density plot that I showed you earlier for bulk RNA-seq, you can see it is pretty different, but we still have a difference between the intergenic regions here and the protein-coding genes here; we can still tell them apart. And if we look at the p-value distribution, we have a p-value distribution that looks usable to define a threshold and identify genes that are actively expressed. So here I show you the percentage of protein-coding genes considered as present from bulk RNA-seq in human and mouse, and from single-cell full-length RNA-seq in human and mouse. We have many fewer genes considered as present, but that is expected, because we have many fewer reads in these libraries: about 10 times fewer reads in these full-length libraries. And we have many more genes that receive zero reads, many more dropouts, which does not allow us to quantify their expression. But still, our method can be used plug-and-play on full-length single-cell RNA-seq data. Target-based single-cell RNA-seq is a different protocol. It allows you to study many more cells at the same time; you can study hundreds of cells in just one experiment, but you will have many fewer reads per cell. The classical approach, for instance, uses beads carrying a PCR primer, a cell barcode, a unique molecular identifier, and a poly(dT) tail. Through a microfluidic device, you encapsulate one cell and one bead in a droplet, then trigger the reagents and the reactions, analyzing hundreds of cells at one time, cell by cell. So just keep that in mind: you have many fewer reads.
So if we look at the density plot here that I showed you earlier for bulk and full-length RNA-seq, well, here you have almost no signal. And if we zoom in on this little tiny area here, well, we don't see any difference between intergenic regions and protein-coding genes, because we have fewer reads per cell and more dropouts. So it is going to be difficult to use our method cell by cell. What we do instead is that, in an experiment, we take all the cells mapped to a same cell type. For instance, in an experiment we have a B cell population, made of several cells, and we pool the RNA-seq results of all the cells belonging to this population. And then we can again recover a signal where we are able to distinguish between intergenic noise and active signals of expression for genes. So then we can estimate, okay, this gene is active in this cell population, and we then get back to individual cells: if we have one read mapped to a gene considered expressed in the population, then we are going to say, okay, this cell has one read, the gene is expressed in this cell. The take-home message here is that for target-based protocols, we estimate gene activity at the population level, and then, if it is significantly expressed, we can get back to each individual cell of the population. So, considering the time I have left, I'm not going to ask you more questions; I'm going to be just a bit late, so I'm going to skip a lot of material. Here is just to show you a naive cut-off approach: so it is a different unit, called CPM, in target-based protocols, and lots of authors use a naive cut-off of one CPM to consider a gene as expressed. We compared a pathway analysis when considering genes as expressed with this one-CPM threshold versus using our method.
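The pool-then-propagate logic described above can be sketched as follows. The gene names, counts, and the fixed pooled-read threshold are all illustrative placeholders; the real pipeline calls presence at the population level by comparing pooled expression to the intergenic background rather than a fixed count:

```python
# Hypothetical UMI count matrix: gene -> counts in each cell of one
# B cell population (toy values, not Bgee data).
counts = {
    "GeneA": [1, 0, 2, 0, 3],   # sporadic reads across cells
    "GeneB": [0, 0, 0, 0, 0],   # never seen
    "GeneC": [4, 5, 3, 6, 2],   # consistently seen
}

def population_call(per_cell, min_pooled_reads=5):
    """Call a gene present at the population level if the pooled signal
    clears a placeholder threshold (stand-in for the intergenic test)."""
    return sum(per_cell) >= min_pooled_reads

calls = {}
for gene, per_cell in counts.items():
    present_in_population = population_call(per_cell)
    # Back-propagate: a cell is called expressing only if the gene is
    # present at the population level AND the cell has at least one read.
    calls[gene] = [present_in_population and c >= 1 for c in per_cell]

print(calls["GeneA"])  # pooled = 6 reads -> present; cells with a read are True
```

Note that GeneA and GeneB both have cells with zero reads, but only GeneA's non-zero cells are called expressing, because only GeneA passes the population-level test.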
And what we observed, and you can see the results in this paper, is that this naive approach was missing a lot of pathways that are very important for this cell type, and that using our method, you identify these very relevant pathways. Meaning that our approach provides a better true-positive/false-positive ratio than this naive cut-off. And I'm going to try to wrap up, but basically, just to recapitulate: for bulk RNA-seq, we use a Z-score, in terms of standard deviations from the mean of the reference intergenic regions, which allows us to compute a p-value. We use the same approach, plug-and-play, for full-length single-cell RNA-seq data, because we have enough reads in those libraries. And for target-based data, we pool cells of a same cell type in an experiment, then use the same approach, and after that we can get back to individual cells. And just to mention that we also have Affymetrix, EST, and in situ hybridization data, and for all of these data types we produce a p-value. This is really our aim, as you will see in the next presentation: for each gene in each sample, we provide a p-value testing the hypothesis of active expression, as opposed to noisy random expression. Okay, so Bgee produces expression calls, p-values of expression, for each sample at the gene level. A present expression call, corresponding to a significant p-value, represents expression of a gene above the biological and/or technical noise. We produce that from all the data types integrated in Bgee. We also produce absent expression calls, that is, reported absence of expression, when the p-value is not significant, and we produce those only for a subset of the data types: in situ hybridization, Affymetrix, and RNA-seq. Because, for instance, we consider that for EST data and for target-based single-cell RNA-seq data, we don't have enough reads to conclude that when we don't see expression, it is because the gene is not expressed.
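The Z-score idea recapitulated above can be sketched like this: compare a gene's log-expression to the mean and standard deviation of the reference intergenic background, and convert the Z-score into a one-sided p-value assuming a normal background. The numbers are toy values, and the exact transformation in Bgee may differ:

```python
import math
from statistics import mean, stdev

def presence_pvalue(gene_log_expr, intergenic_log_expr):
    """Z-score of a gene's log-expression against the reference
    intergenic distribution, converted to a one-sided p-value
    (upper tail of a standard normal)."""
    mu = mean(intergenic_log_expr)
    sd = stdev(intergenic_log_expr)
    z = (gene_log_expr - mu) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(N(0,1) >= z)
    return z, p

# Toy intergenic background centred near 0 on a log scale (illustrative).
background = [-0.5, 0.2, -0.1, 0.4, -0.3, 0.1, 0.0, -0.2]
z, p = presence_pvalue(3.0, background)
print(z > 2 and p < 0.05)  # well above background -> significant presence
```

A gene whose expression sits inside the background distribution would get a non-significant p-value, yielding an absent call for the data types where absence is reported.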
We consider that it might be because we just missed the expression. So absence of expression in Bgee is only inferred for some specific data types, where we are confident that we have access to enough statistically significant information. And how we integrate all this information in Bgee, that is going to be the topic of the next presentation.