I'm going to explain how, in practice, we integrate data into Bgee, and after the break we will show you how to access this data, download it, and make use of it. As a start, I will use the example of single-cell RNA-seq data, because these are the data a lot of people are working on nowadays, and they add more complexity than bulk RNA-seq data did previously.
So, just as a reminder of what we obtain from single-cell RNA-seq data: for droplet-based technologies you have beads like that, with sequences attached to each bead, made of a PCR primer, a cell barcode, a molecular identifier, and then a poly-T tail to capture the poly-A mRNA. Then, using a microfluidic device, you isolate in each droplet one cell with one bead, which allows the sequencing to be done cell by cell. And sometimes you have errors: you can end up with droplets that are totally empty, or with a bead but no cell, or with a cell but no bead, or you can have multiple cells in one droplet, which complicates the sequencing.
So that's how the base technology works, and here is how we make use of it. We extract a sample, isolate the cells, and at the end of the day what we produce, most of the time, is a cell clustering, where each dot here represents a cell, and the cells have been clustered based on their gene expression. Then the aim is to identify marker genes highly specific to each cluster, so that we can determine the function of each of these clusters, the cell type of each of these clusters. For instance here, you have a gene that is really not specific to any cluster, because it is expressed at almost the same level in all clusters, while this gene here is very specific to this cluster. Then biologists perform functional enrichment analysis on these marker genes and identify which cell type each cluster represents. So here, for instance, those are Kupffer cells in this data set.
This work of identifying the cell type of each cell is actually a very complicated process, requiring a lot of processing steps. After generating the reads, you have to filter the cells to remove, for instance, cells that were not properly captured in a droplet. Then you have to filter the genes, for instance those that don't have enough expression information. You have to perform normalization. Then you have to do a dimension reduction step before doing your clustering, performing differential expression analysis between your clusters, and finally doing a functional analysis to determine which cell type each of your clusters represents. And each of these steps is going to impact the final cell type assignment. For instance, you could do the clustering in many different ways, more or less stringent, and that would impact the end result. So in Bgee, that complicates our curation work a lot, because which of these different aspects should we capture when we provide the final information about how this data set has been analyzed?
So I would like first to refer to the Google Doc. Okay, I'm going to ask Mark and stop the sharing, because I'm not actually sure. So you have the link again to the specific Google Doc. I hope I'm sharing. Yeah. So you have the link here again to the activities, and you get this first question here. You can make use of single-cell data in different ways, and I would like to see how you use single-cell data. So what is, in your opinion, the most valuable information that you can obtain from single-cell analysis data? I give you one or two minutes to fill in this table. And please don't hesitate to participate in this Google Doc, because it allows us to maybe reframe what we're saying or identify something that was not well understood. And thanks to those doing it. So I go line by line.
So, composition of heterogeneous tissues. Yes, it's interesting because it allows us to see the contributions of the different cell types to the overall bulk expression information. Gene signatures, yes, to see which genes drive the definition of cell types. Cell type segregation. Discovering new cell types: I would say that probably the most widely used application of single-cell analysis data is to perform the cell type clustering and identify the cell types, which allows new cell types to be discovered. Lineage, yes: you will have a developmental type of trajectory, so you will see which cell type evolved from which cell type, the origin, the lineage, as it is written here, of cell types. Function identifiers from homogeneous cell populations: it seems to be an interesting one, but I'm not sure I understand exactly what it means. Probably that similar cell populations would have different functions, if this is what you mean. Don't hesitate to speak as well if you want. Proliferation stage, yes. Complexity of the tumor microenvironment. And yes, spatial transcriptomics is more and more widely used. So, obviously, the audience here has a very clear idea of what we can obtain from single-cell analysis data. Thank you.
So I'm going to share the presentation again. Okay. So what do we need if we want to reproduce an analysis, that is, if we want to integrate into Bgee, in a reproducible way, the information provided in any single-cell experiment? Again, most of the time you will go through a cell clustering and a cell type identification. We want to present this information as produced by the authors, showing the cell types overlaid on the cell clustering like that, but we would also like to show the expression of various genes overlaid on the cell clustering.
So here you see a specific gene that is actually quite specifically expressed in this cluster. What do you need if you want to reproduce this analysis as performed by the authors in their paper? In the paper you will often see that, but how can you reproduce it? Actually, for single-cell data sets, it's not yet well standardized, and it's very complicated.
First, how you process the data depends on the protocol. Mark mentioned that so far in Bgee we have categorized 32 single-cell protocols, but there are many more than that. For instance, the first difference would be whether you isolate the full cells or only their nuclei. If you do single-nucleus analysis, it changes how you process the data, because you also need to map the reads to the introns of the genes, since in the nucleus the mRNAs are unprocessed. So you still have the introns, and it impacts how you process the data.
You also need to know the specific protocol to know where to find the barcode, which identifies the cell, and the UMI, the unique molecular identifier, which allows you to account for PCR biases. The barcode and the UMI are not at the same place in the sequencing reads depending on the technology, depending on whether it's Smart-seq or Smart-seq2, or 10x Chromium version 1 or 2. So we need to capture this information if we want to process the data properly in our pipeline.
Then the annotations are going to depend on the other steps, like the dimension reduction step and the clustering step. For instance, taking the example of the Fly Cell Atlas again, they provide different types of annotation. They performed two clusterings, one that they call the stringent clustering and another one that they call the broad clustering. And some cells have a completely different annotation based on which clustering is used.
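To make the point about protocol-dependent barcode and UMI positions concrete, here is a minimal Python sketch. It is not Bgee's actual pipeline. The names `ReadLayout`, `LAYOUTS`, and `split_barcode_read` are hypothetical, and the two layouts are only illustrative: they assume the commonly documented 10x Chromium v2 and v3 designs (a 16-base cell barcode followed by a 10-base or 12-base UMI at the start of the barcode read), which should be checked against the kit documentation before use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReadLayout:
    """Positions (0-based, end-exclusive) of the cell barcode and UMI in the barcode read."""
    barcode: tuple[int, int]
    umi: tuple[int, int]


# Illustrative layouts only (assumption): 16 bp barcode, then a 10 bp or 12 bp UMI.
LAYOUTS = {
    "10x_v2": ReadLayout(barcode=(0, 16), umi=(16, 26)),
    "10x_v3": ReadLayout(barcode=(0, 16), umi=(16, 28)),
}


def split_barcode_read(sequence: str, protocol: str) -> tuple[str, str]:
    """Return (cell_barcode, umi) extracted from a read, according to the protocol layout."""
    layout = LAYOUTS[protocol]  # raises KeyError for a protocol we have not described
    needed = max(layout.barcode[1], layout.umi[1])
    if len(sequence) < needed:
        raise ValueError(f"read shorter than {needed} bases expected for {protocol}")
    barcode = sequence[layout.barcode[0]:layout.barcode[1]]
    umi = sequence[layout.umi[0]:layout.umi[1]]
    return barcode, umi
```

The same read gives a different UMI depending on the declared protocol, which is why the protocol has to be captured during curation before the raw reads can be reprocessed.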
So some cells have been annotated as one cell type in the stringent clustering, but as another one in the broad clustering. So we reached out to the people working on the FCA and asked them which clustering should be the final clustering, or how we can capture information about the different clusterings they performed.
Then, if we want to reprocess the data and reproduce the analysis as the authors did, we need to find the barcode information. We need to know which cells belong to which clusters to make the link between the cell and the cell type. But often you actually cannot find the barcode information in the primary repositories, in the Sequence Read Archive for instance. It's not mandatory to submit the barcode files when you deposit your data. So often we have to dig into the supplementary material of the paper, or we have to reach out to the authors. In the case of the Fly Cell Atlas, for instance, the FASTQ files on the Sequence Read Archive were corrupted. We downloaded the data, but they were not usable, and again we had to reach out to the authors to have them fix these files. So from the curation point of view it's much more complicated, but at the end of the day we do all this work. And when you go on Bgee, and we will show you after the break how to access this data, the data you download have already been annotated. You will find, for each cell, the barcode and the cell type annotation in a very consistent, reproducible format.
So now there is a WooClap link. I stop sharing for you to go to the WooClap. So you have the WooClap link. Again, it's the same link. And Marc, maybe if you can launch the question, please. Sure. Sorry, which one should I launch? About what information we need to recover from a single-cell data set. In a typical poly-A-seq library, what features do you expect to find? No, can I actually launch it myself? Sorry. Ah, the last question? No, the question named, what information do you need to recover from?
It's written as last. Okay, it started for everyone. You have 57 seconds. No stress, 53. Don't hesitate to answer even if it doesn't make sense. It's a bit to show also the complexity of this data, the complexity of curating this data. In the meantime, I see a question in the chat: do you propagate annotations between different studies, so that the cell type names are the same? Yes, we do, and I'm going to present that during this presentation, actually. Okay, so time is almost up. I'm going to share. I'm going to try to share two windows at once. I know it doesn't work. Okay, apparently it's weird, I don't see the answers. I should be. So maybe can you share your screen, Mark? I apologize, but I can't see the results. Should be able to.
Okay, so what information do we need? Do we need the unprocessed raw data? Well, it depends. For instance, we could go directly from the gene expression counts. But if we want to remap to the latest genome version, for instance, then yes, we need the unprocessed raw data. And in this case, we absolutely need to know what protocol was used, to know where the barcode and the UMIs are positioned in the read. So either you have the gene expression counts and you are happy, or you need both the unprocessed raw data and information about the protocol.
Then, if you want to display the annotation as provided by the authors, either they provided cell type information for each cell, and then you're good, it's simple. Or you get only cell type information for each cluster, which is the case most of the time: the authors provide information per cluster in their papers. In that case, we need to make the link between each cell and the cluster it belongs to, so we need the association between cell barcodes and clusters. So you see, the information you need depends on where you start from.
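As a small illustration of that last case, where the authors only give a cell type per cluster, the sketch below joins a barcode-to-cluster table with a cluster-to-cell-type table. The function name `annotate_cells` and all the barcodes, cluster names, and cell types are made up for the example; this is only one way the link could be represented, not the format Bgee uses.

```python
def annotate_cells(barcode_to_cluster, cluster_to_cell_type):
    """Assign a cell type to each barcode through its cluster.

    Returns (annotated, unresolved): a dict barcode -> cell type, and the list
    of barcodes whose cluster has no cell type provided by the authors.
    """
    annotated = {}
    unresolved = []
    for barcode, cluster in barcode_to_cluster.items():
        cell_type = cluster_to_cell_type.get(cluster)
        if cell_type is None:
            # Without a cluster annotation we cannot say anything about this cell.
            unresolved.append(barcode)
        else:
            annotated[barcode] = cell_type
    return annotated, unresolved


# Hypothetical inputs: the barcode file and the per-cluster annotation from a paper.
barcode_to_cluster = {"AAACCTGA": "cluster_1", "AAACGGGT": "cluster_2", "AAAGATGC": "cluster_7"}
cluster_to_cell_type = {"cluster_1": "hepatocyte", "cluster_2": "Kupffer cell"}

annotated, unresolved = annotate_cells(barcode_to_cluster, cluster_to_cell_type)
```

If the authors provide a cell type directly for each barcode, the first table is not needed at all, which is the other branch described above.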
So if you get only cell type information per cluster, then you need the information linking the barcode and the cluster. But if the authors provided the information directly for each cell, you don't need the link between the barcode and the cluster. So this question was just to show a bit the complexity. And I appreciate that someone answered, okay, just give me a nice visualization.
Okay, so now I'm going to present what we actually do in Bgee to give you a simple answer about where a gene is expressed. Our aim in the processing pipeline is to detect the conditions where a gene is active. What do we mean by where a gene is active? It's a bit similar to what we can do with in situ hybridization data. Here you have an image of in situ hybridization data on a zebrafish embryo. It's pretty clear in a way, because you see exactly the stained areas where a gene is expressed. There is a hybridization of a probe, a fluorescent probe, to detect where a gene is actively expressed, and then you see the areas where this gene is expressed. But then it depends on how long you perform the hybridization. If you wait too long, maybe the whole embryo is going to be stained. But still, here you can see clear areas where there is active expression of the gene of interest. And this is what we want to do in a way, but also using bulk RNA-seq data and single-cell RNA-seq data: showing you where your gene is actively expressed.
So why do we need that? It is important, for instance, when you perform differential expression analysis. The first step in most cases is to remove genes that have a very low expression, because you won't be able to perform a differential expression analysis between two conditions where the gene is absent in one condition and present in the other. So in most cases, these genes that have no expression in a condition are filtered out.
But there is no clear criterion for filtering these genes out. Or, if you study sex-biased genes, for instance, you want to find the genes that are expressed in males but not in females. Again, you want to identify the genes that are not expressed in one of the sexes, but you don't have a clear criterion to do that. And also, simply for answering the question 'where is this gene expressed?', in the same way that with in situ hybridization data you see where your gene is expressed. This is what we want to provide in Bgee: to tell you clearly, look, this is the important expression pattern of your gene of interest.
But then how do you do that? Basically, you need to identify an expression signal that is over the background transcriptional noise and over the technical noise. Because in RNA-seq data you can even find reads for intergenic regions, which are not supposed to be expressed. So gene expression is, first, a noisy process in itself, it's stochastic, and the technology also generates noise that you need to account for. I'm going to focus on RNA-seq data, both bulk and single-cell, to explain how we do that.
Okay, so what we want is to detect active expression in different conditions. In most cases, authors of experiments use an arbitrary cutoff on the expression level to say, okay, over this cutoff, my gene is expressed. But it's very unlikely that one threshold can fit the expression of all the different genes in various conditions. And there is also little consensus on the exact value that you should use as a threshold. So I would like to ask you to go to the Google Doc and tell me what is, in your opinion, a good threshold. The question is: in your experience, what is an appropriate expression level threshold to classify genes as actively expressed? Please tell me the value and the units, read counts, or TPM, or CPM, just so that we get an idea of how variable this threshold can actually be.
So Raja, you say 'I was told 1.5'. Can you please add the unit? 1.5 fold change, thank you. Okay, so you can already see here that there are many answers. Raja, getting back to you: 1.5 fold change means that you have to compare two conditions. But what do you do if you only have one condition, if you have one library? How do you identify, in that one library, the genes that are actively expressed? I see 'one filtered mapped read', so as soon as you have one read, you have expression. That makes sense from a logical point of view. But again, the technology is noisy and you find reads for intergenic regions, for instance, so one read is probably not going to work. And a very long gene is more likely to produce one read than a short gene. Yes, 'one exon', okay, I understand. But again, one read does not represent the same thing depending on the length of the gene. Or 10 counts. I see 5 TPM, and 3 to 5 times the background level. And then the question is how you identify the background level, and I'm going to show you how we do that, but basically that's a really good idea. 'More than one read.' So you can see that even among the audience here, there is little consensus about what a good threshold is. Thank you for your answers.
Okay, so getting back to the presentation. What I show you here is a plot of the true positive and false positive rates at different TPM cutoffs. We took an experiment where they used Ribo-seq data, so that you know what the truly expressed genes in your sample are. We used that as a gold reference data set to know the truly expressed genes in that data set, which was in mouse liver. Then we used several TPM thresholds to check the true positive and false positive rates. We tried various cutoffs: here it's 0.5 TPM, 1 TPM, 2 TPM, 5 TPM, 10 TPM. And then it's just a matter of stringency. What is the best value here?
If you go for 0.5 TPM, you get almost a 100% true positive rate, but you get a high false positive rate, so you identify genes as expressed while they actually are not. It's about a 20% false positive rate, the red one, the red false positive rate here, so it's quite high. The most used threshold, I would say, is 2 TPM; in most cases, it's 1 or 2 TPM. At 2 TPM, you achieve quite a good balance between true positives and false positives: around 90% true positives and around 10% false positives. But you see, it's not clear where you should stop, where you should put the threshold. And in our experience, integrating thousands of data sets together, it varies between experiments. Sometimes 10 TPM is a proper value, and sometimes it's 0.1 TPM.
So the approach we have in Bgee is the following. When you look at the distribution of gene expression in a data set, and this is a density plot of the expression level of exons here, you see that there is a distribution with a shoulder on the left like that. And in that paper, this 2011 paper, they show that this distribution with a shoulder on the left is actually a sum of two distributions: in yellow, the lowly expressed genes, which they identify as not being actively expressed, and in purple, the highly expressed genes, which they identify as the truly active genes. On the plot on the right, they also show the expression level of intergenic regions, the dotted line. So you can see that you have an expression signal for intergenic regions, and that it closely matches the signal of the non-expressed genes. So using the expression signal of intergenic regions like that can give you an idea of the background expression signal. But actually we are currently in the process of improving this method.
So far, we are using the expression signal from intergenic regions to estimate this background signal. But we're in the process of improving this method, because we are not totally satisfied with the false positive rate that we have with it. Basically, our approach is to compare the expression of a gene to a distribution of non-expressed genomic features, and we expect intergenic regions to be non-expressed genomic features. By comparing the expression of a gene to this distribution, we can obtain a p-value for the significance of your gene being actively expressed in a sample.
So there are several things we are currently evaluating. What is the best choice of non-expressed genomic features? Should it be intergenic regions? Or maybe intergenic regions close to genes, because if they are very far away from genes, maybe the chromatin is totally closed in that area, the region is absolutely not accessible to the transcription machinery, and it won't be a good estimator of the background signal. Or we could use genes that we are sure are not expressed, by using a very low TPM threshold, taking all the genes below the threshold, and looking at their distribution. And then how should we represent this distribution of non-expressed genomic features? Is it a normal distribution on log TPM values, or a negative binomial, or a Poisson distribution? There are discussions in the field about what the true distribution of non-expressed genomic features is. Apparently it's a negative binomial distribution. So we are in the process of evaluating all of that. I'm not going to go too much into the details here, but what you can remember is that we compare a gene's expression value to a distribution of non-expressed genomic features, and we provide you with a p-value.
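The comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bgee's implementation: it takes the first of the options mentioned, a normal distribution fitted on log2 TPM values of intergenic regions, and the function name `expression_p_value`, the pseudocount, and the toy TPM values are all invented for the example.

```python
import math
from statistics import NormalDist, fmean, stdev


def expression_p_value(gene_tpm, intergenic_tpm, pseudocount=1e-3):
    """One-sided p-value that a gene's signal exceeds the intergenic background.

    Assumption: log2(TPM + pseudocount) of non-expressed features is roughly normal.
    """
    background_logs = [math.log2(tpm + pseudocount) for tpm in intergenic_tpm]
    background = NormalDist(mu=fmean(background_logs), sigma=stdev(background_logs))
    gene_log = math.log2(gene_tpm + pseudocount)
    # Probability of seeing a background value at least as high as the gene's value.
    return 1.0 - background.cdf(gene_log)


# Toy background: TPM values of a handful of reference intergenic regions.
intergenic_tpm = [0.05, 0.1, 0.2, 0.4, 0.8]

p_high = expression_p_value(100.0, intergenic_tpm)  # far above the background
p_low = expression_p_value(0.01, intergenic_tpm)    # within or below the background
```

A gene far above the intergenic signal gets a very small p-value, while a gene sitting inside the background does not. Swapping in a negative binomial model on counts, as discussed above, would change the distribution but not the overall logic.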
And in the current release of Bgee, it is based on intergenic regions, but it's going to be improved for a lower false discovery rate in future releases of Bgee.
Okay, so I invite you to go back to the Google Doc, please, so that you can put your mind to this question: what could be the limitations of using intergenic regions as the reference for non-expressed genomic features? I don't expect you to have the correct answer to this question; it's just so that you think about it and think about what could be the best approach to solve this issue. I see that you probably struggle with this one, considering that not a lot of people are answering. And there is a second question that you can answer as well: what would be the limitations of using genes with an expression level below a cutoff? Those are the two solutions I suggest here. Okay, so please continue filling in the document. I'll start going through the answers that we already see. Maybe I'll wait a little bit, because it's probably going to be difficult for you to type and listen at the same time.
Okay, so the first answer is that the intergenic regions will not be expressed. Well, that's kind of the point, actually: we want non-expressed genomic features to estimate the background signal. And we do find reads mapped to intergenic regions, so we will detect some, which is surprising in a way, but it is the case. So we want them not to be expressed; but then someone has put that they are not represented in the data set. And yes, that would be an issue if they were not. Could that be from DNA contamination? Actually, we consistently find them in all data sets, so it's very unlikely. All authors studying this question report a distribution of expression for intergenic regions. So we do find them consistently in all libraries.
And they do represent non-expressed genomic features. Intron retention: we don't work at the intron level because, for instance, for single-nucleus data, you expect to find introns because the mRNAs are unprocessed. So we cannot use introns as non-expressed genomic features, because we will find them in single-nucleus data.
I see on the second line: low annotation quality of the genome. That is a very good point. Between model species and non-model species, we see huge differences. And actually in Bgee, we re-curate the intergenic regions to remove those that show a very high expression signal, which are most likely unannotated non-coding genes. So for non-model organisms we do have this problem, and we have a curation step in Bgee to take that into account. We use only reference intergenic regions that we are sure do not contain unannotated genes.
Differences between species, tissues, and cell types in the level of these intergenic regions: yes, it is a good point, and this is why we are looking at using intergenic regions closer to genes, to be sure that they are actually accessible in all tissues and that they can represent an actual non-expression background signal. What about an enhancer RNA, which would be intergenic but with a transcription signature? Yes. So an enhancer, in that case, should not be expressed, right? We don't expect to find reads mapped to them; they are not expressed features. 'They have low quality and a large number of data sets might be needed.' Actually, no: we do find them in each RNA-seq library. So I move on. Thank you for your answers.
'Using genes with an expression below a cutoff will depend on the anatomical location; the expression depends on the tissue, tissue-specific expression.' Okay. So it is actually kind of the same problem; we come back full circle: what would be the appropriate threshold so that, in all contexts, we truly identify the non-expressed genes? So this is a very good point.
But probably, if we take a very low TPM threshold, we're going to be safe in all conditions. These are all things we are currently benchmarking. You can just remember that in the current release of Bgee, it is intergenic regions that we use, and intergenic regions that we have re-annotated to make sure they do not contain unannotated genes. Okay, so thanks a lot for all the very interesting answers. I think I need to move on a little bit faster to get to the end of this presentation.
So I'm going to show you single-cell RNA-seq data. Here I show you the same density plot as before, of the expression level of genes, intergenic regions, and protein-coding genes. These are full-length RNA-seq data, so Smart-seq technology. With Smart-seq technology, you study a much lower number of cells per experiment, but with a much higher number of reads per cell, as compared to droplet-based technologies. And you see here that we still have an expression signal where we can distinguish between protein-coding genes and intergenic regions. So basically, for Smart-seq data we use the same approach as for bulk RNA-seq.
Here I show you the percentage of protein-coding genes considered expressed with our method in bulk RNA-seq data in human and in mouse, and in single-cell full-length data in human and mouse. You can see that far fewer genes are considered expressed in each cell as compared to bulk RNA-seq data. But it is expected, because in bulk RNA-seq data you have a mixture of cell types, so you're probably going to have more diverse genes expressed. Still, in single-cell RNA-seq data, about 60% of the genes in a cell don't receive any reads, while in bulk RNA-seq data it's more between 5 and 10%.
So part of that is biology, because you don't look at a mixture of cell types, but part of it is technical: you have fewer reads per cell, so you have more dropout, more genes with no reads.
And if you look at droplet-based data: droplet-based technologies such as 10x allow you to study hundreds of thousands of cells in a single experiment, but with far fewer reads per cell. In this density plot here, I have to zoom in to look at the actual expression signal in each cell. And you can see that, in one single cell, you actually cannot distinguish the expression signal of intergenic regions from gene expression. So what we do in Bgee for droplet-based data is pseudo-bulk: we pool the data coming from all the cells of the same cluster to recover more expression data. This technique is called pseudo-bulk. Then we are able to recover an expression signal that allows us to distinguish gene expression from intergenic expression, and we perform this statistical test. And if we know that a gene is expressed in a cell type, then we go back to each cell and we look: do we have one read mapped to the gene? This gene was expressed in the cell type, so it is expressed in that cell. So maybe it's a bit complicated, but at the end of the day, going through this pseudo-bulk step, we are able to tell you in Bgee, for each cell, the genes that are actively expressed in it.
So basically, and I'm maybe going to skip that a bit, for each data type in Bgee we define methods like that, allowing us to have, for each gene, a p-value for the expression signal. And then how do we integrate all this information from all these data types?
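The pseudo-bulk step and the return to individual cells can be sketched as follows. This is a toy illustration with hypothetical names (`pseudo_bulk`, `expressed_in_cell`) and made-up barcodes and counts; the statistical test on the pooled counts is not shown and is represented here by a precomputed set of genes called expressed per cluster.

```python
from collections import defaultdict


def pseudo_bulk(counts_per_cell, cell_to_cluster):
    """Pool read counts over all cells of the same cluster."""
    pooled = defaultdict(lambda: defaultdict(int))
    for barcode, gene_counts in counts_per_cell.items():
        cluster = cell_to_cluster[barcode]
        for gene, count in gene_counts.items():
            pooled[cluster][gene] += count
    return {cluster: dict(genes) for cluster, genes in pooled.items()}


def expressed_in_cell(gene, barcode, counts_per_cell, cell_to_cluster, expressed_in_cluster):
    """Call a gene expressed in a cell only if it is called expressed in the
    cell's cluster AND the cell itself has at least one read for that gene."""
    cluster = cell_to_cluster[barcode]
    called_in_cluster = gene in expressed_in_cluster.get(cluster, set())
    return called_in_cluster and counts_per_cell[barcode].get(gene, 0) >= 1


# Toy data: three cells, two clusters.
counts_per_cell = {
    "AAAC": {"geneA": 1},
    "AAAG": {"geneA": 2, "geneB": 1},
    "TTTG": {"geneB": 4},
}
cell_to_cluster = {"AAAC": "c1", "AAAG": "c1", "TTTG": "c2"}

pooled = pseudo_bulk(counts_per_cell, cell_to_cluster)
# Stand-in for the result of the significance test on the pooled counts.
expressed_in_cluster = {"c1": {"geneA"}, "c2": {"geneB"}}
```

Pooling gives each cluster enough reads for the background comparison, and the per-cell call then only requires one read in a cell whose cluster already passed the test.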
Because in Bgee, we also have Affymetrix data, EST data, and in situ hybridization data, but for all the data, at the end of the day, we provide for each condition and each gene a p-value of whether this gene is significantly expressed in that condition. Mark presented to you that we use ontologies to annotate the data, and that the terms in an ontology are connected to each other. Here I'm going to use a simple example: the pancreas, which has two children, the endocrine and exocrine pancreas, and the developmental stage, with two terms, sexually immature being part of the fully formed developmental stage. And here I imagine the expression of three genes. So here you have three genes in one of six SRRM4. And we have calls of presence and absence of expression. In the lower graph, we have expression of gene A in the exocrine pancreas, sexually immature, and of gene B in the endocrine pancreas, sexually immature. So we have information for these three genes, but at different levels in the ontology. What we do is propagate this information along the graph of terms, so that at some point we have integrated information where we can compare the expression of our genes. You see that for gene A, which had information at the level of the exocrine pancreas, sexually immature, thanks to propagation we also know that it is expressed in the pancreas at the fully formed adult stage. Okay. So we propagate information for each gene on this graph, and at some point it becomes comparable: it converges to the same condition, where we can actually compare the expression of these three genes. So this is how we integrate information in Bgee. There was a question earlier about how you can make the cell types comparable, and that's how we do it.
We use the cell type ontology to propagate the information, and at some point we converge to a common cell type for the expression of various genes. Maybe I'm going to skip that because I'm running a bit late.
Okay. So what we actually propagate is the p-value. As I showed you, for each library, for each sample, for each gene, we provide a p-value for the significance of the expression. These are the p-values that we propagate in the ontology. Then, at the end, we do an FDR correction of all these p-values, so that we get one single answer to the question 'is my gene significantly expressed in this condition?', by the integration of thousands of p-values, FDR corrected. We give you one single answer: yes, this gene is significantly expressed, with this FDR-corrected p-value. And then we give you an answer to 'is my gene present or absent?', all based on this FDR-corrected p-value. I'm not going to go into details; you can find that in the documentation on the Bgee website, and we're going to show you the Bgee website. But basically, we use this test of significance of expression, we propagate the p-values in the ontology, and at the end of the day we give you a simple answer: this gene is present or absent, with three levels of confidence, gold, silver, bronze.
So these calls tell you whether your gene is actively expressed or not, but then you lose information about the expression level, and you still want to know whether your gene is highly or lowly expressed. To do that in Bgee, we use what we call expression scores. For the expression score, we use nonparametric statistics to be able to compare expression levels between any data sets, because otherwise we would need to account for batch effects to normalize the data, and we don't know what the batch effects will be across thousands of experiments.
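Going back to the propagation of p-values and the FDR correction, here is a minimal sketch of the idea using the pancreas example. Several things in it are assumptions rather than statements about Bgee: the correction is taken to be Benjamini-Hochberg (the talk only says 'FDR correction'), the summary for a condition is taken to be the smallest corrected p-value among the condition and its sub-conditions, and the function names and p-values are invented.

```python
def descendants(term, children):
    """All sub-terms of a term in a graph given as {parent: [children]}."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def best_corrected_p_value(term, children, observed):
    """Pool the p-values observed in a term and all its sub-terms, correct them,
    and return the smallest corrected value (None if nothing was observed)."""
    terms = {term} | descendants(term, children)
    pooled = [p for t in terms for p in observed.get(t, [])]
    if not pooled:
        return None
    return min(benjamini_hochberg(pooled))


# Toy ontology and per-sample p-values for one gene.
children = {"pancreas": ["exocrine pancreas", "endocrine pancreas"]}
observed = {"exocrine pancreas": [0.001], "endocrine pancreas": [0.2]}

p_pancreas = best_corrected_p_value("pancreas", children, observed)
p_endocrine = best_corrected_p_value("endocrine pancreas", children, observed)
```

With these toy values, the significant signal observed in the exocrine pancreas carries up to the pancreas, while the endocrine pancreas on its own stays non-significant. The thresholds that turn such a value into present or absent calls and into gold, silver, or bronze confidence are described in the Bgee documentation and are not reproduced here.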
So to integrate expression level information, we use nonparametric statistics. So basically, we rank the genes in each data set. We normalize those ranks to make them comparable between different technologies and different conditions. And the ranks are also propagated in the graph, as I showed you earlier. So when we provide the expression level information in Bgee, it's based on all the expression-level information in all sub-conditions and in the condition itself. Yeah. And then we transform these ranks into a simple value between zero and 100. So at the end of the day in Bgee, you get expression score information between zero and 100, representing expression level, based on nonparametric statistics. I'm going to skip the last workflow because we don't have time, and just show you an example of how you then retrieve this information in Bgee. So if you go to a gene page on the Bgee website, and we're going to show you how to do that after the break, this is the gene page for the APOC1 gene, which is an apolipoprotein gene, very important in liver. And here you see the FDR-corrected p-value, so a very significant p-value in the right lobe of liver, and the expression score. So it's 99.94 out of 100, so a very high expression level. Probably this is one of the most highly expressed genes in that condition. And those are the non-expressed conditions, where the FDR is non-significant. And then the point is that, thanks to the integration, if you look at all the species present in Bgee: in mouse, the top expressed condition for this gene is left lobe of liver. In zebrafish, it is also liver. In chimpanzee, in bonobo. You see that in all those species, we correctly identify that the top expressed condition for that gene is liver, with a very high expression score that is very similar in all those species.
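The rank-to-score transformation described earlier can be sketched like this. It is only an illustration: the linear rescaling formula is an assumption, the expression values are made up, and the normalization of ranks across technologies and their propagation in the graph are left out.

```python
def rank_genes(expression):
    """Rank genes within one data set; rank 1 is the most highly expressed.

    Genes with identical values share the average of their ranks.
    """
    ordered = sorted(expression, key=expression.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with ordered[i].
        while j + 1 < len(ordered) and expression[ordered[j + 1]] == expression[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for gene in ordered[i:j + 1]:
            ranks[gene] = mean_rank
        i = j + 1
    return ranks


def expression_score(rank, max_rank):
    """Assumed linear rescaling: rank 1 gives 100, the last rank gives 100 / max_rank."""
    return (max_rank + 1 - rank) * 100 / max_rank


# Made-up expression values for four hypothetical genes in one data set.
expression = {"g1": 250.0, "g2": 12.0, "g3": 12.0, "g4": 0.5}
ranks = rank_genes(expression)
max_rank = max(ranks.values())
scores = {gene: expression_score(rank, max_rank) for gene, rank in ranks.items()}
print(scores)
```

Because only the ordering of genes within a data set is used, the score does not depend on the measurement units of each technology, which is the point of using nonparametric statistics here.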
You can see it's in all cases above 99 out of 100, with a very significant FDR-corrected p-value. So you see how this data integration allows you to have consistent, comparable information between all those species. So again, even more species. You see, high expression scores of 100, 99.9. You see how consistent it is, I hope. And then we have a tool that we're going to present as well, where you can automatically compare expression between a list of genes. So here I entered the orthologous genes of APOC1 in our tool. And the organs that are identified as most highly conserved in their expression are hepatobiliary system, liver. You see that all the orthologous genes have significant expression in this tissue. So you can also do that automatically, without going through each gene page individually. Yeah. So in summary, to know where genes are expressed, first we perform a precise manual annotation so that all the data sets are comparable. And then we generate p-values, for each gene and each sample, of the significance of the expression level. We compute expression ranks and scores for each gene and each condition, and we propagate all this information along the graph of conditions to give you integrated, consistent information at any level of the