Okay, this part is more about data integration; there was a question this morning about how we integrate the different experiments and data types. I first showed you how we generate these present/absent expression calls, and now I'll show how we do the overall integration across all these experiments and data types. So I presented these parts of the pipeline, and now I will focus on the two last steps.

An important aspect of our work in Bgee is not only the analysis that I presented, it's the expert curation. Marc told you a lot about that this morning, about how we do quality control and annotate each sample manually, and I'm going to tell you a bit more about it. As Marc said, each and every sample in Bgee is annotated to an anatomical entity, a developmental and life stage, a sex, and a strain. So when we say that we know the conditions where a gene is expressed, it means we know where a gene is expressed with respect to organ, developmental stage, sex, and strain. Other aspects of gene expression we do not capture, but that could be extended, for instance to physiological or environmental conditions; it depends on the amount of curation work needed to perform these annotations.

Okay, now I'll just throw a bunch of numbers at you. You have the different data types in Bgee, RNA-seq, Affymetrix, in situ hybridization, ESTs, with the number of samples, the number of experiments, and the number of annotated conditions. For instance, for RNA-seq data, we have about 600 different organs that have been sampled, but for in situ hybridization data, we have 6,000. That's because it gets very, very detailed; it can go down to cell-level resolution. This is why we value this kind of data: it's low-throughput, not a lot of genes at the same time, but it's very fine, very detailed. So from RNA-seq data we have on average only about 20 organs per species, while for in situ we have on average more than 1,000 organs. There are large differences. But if we look only at the high-throughput methods, Affymetrix and RNA-seq, we have data in about 1,000 organs across the different species, which makes on average about 35 organs per species. And if you combine that with the developmental stages, we have about 4,000 organ-stage conditions, so more than 100 on average per species, for which we have close to complete genome-wide expression.

As Marc said, we focus only on healthy wild types. That means no abnormal genetic background, no disease, no gene knockouts, no treatments not expected in the wild. That's sometimes hard to define. Take fasting time, for instance: we have experiments where mice were fasted. After how long does this stop being normal? Individuals in the wild experience fasting from time to time. Is 12 hours of fasting acceptable? Is one week? We have to draw a line in the sand at some point and say: this we consider normal, this we do not. We have guidelines with a clear set of criteria, which I will put in the Google Doc. Patricia, can you please add a note in the Google Doc for me to share the criteria? Thanks a lot.
We also check information consistency a lot, and I'll give you a few examples of how deep we dig into the papers to be sure that the data are high quality. Because we are a secondary database, not a primary database such as SRA, which has to accept everything, we can be picky: we take only the highest-quality data available.

So we perform quality control, and we also do things that are unusual for a primary repository; for instance, we remove hidden redundancy. We identified that in our Affymetrix datasets, initially, as much as 14% of the samples were duplicated, meaning a sample was reused in different experiments, or an experiment was fully duplicated as part of another experiment, that sort of thing. We developed methods to identify and remove these redundant data points, because it's important for us not to count twice, not to treat as two independent experiments the same experiment giving us the same information.

For our analyses, we use FastQC for quality control, and then we draw density plots of the TPM values to check whether the distribution of expression values is in an acceptable range. We also check the counts of aligned reads: if in a library only a few thousand reads could be mapped, obviously something went wrong with that library. We check that sort of thing, and the idea is that we are very picky in Bgee; the data you find in Bgee are only of the highest quality. A toy sketch of these threshold checks is shown below.

For Affymetrix data, we actually developed our own quality metric which, and of course I'm going to say this in a very objective way, outperforms every existing quality control method. It's called IQRray, and it's published. We use that quality score when we have access to the raw CEL files. When we only have access to processed MAS5 files, we use metrics such as the percentage of genes considered present: for each chip type, we have defined a minimum percentage of present genes for us to accept the chip. If on a chip only 5% of the genes are considered present, something is fishy, because a cell cannot function with only 5% of its genes expressed.

Okay, for in situ hybridization data, we rely on the information from the model organism databases, which annotate these data, such as ZFIN or WormBase. So we rely on their expert curation, and with some of them we have ongoing collaborations: with WormBase, for C. elegans, we collaborate on annotating Affymetrix and bulk RNA-seq data. In Bgee we focus on the healthy wild type, while they also annotate gene knockouts and treatments.
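To make the automated filters concrete, here is a minimal sketch in Python of the kind of threshold checks described above, on mapped read counts and on the fraction of genes called present per chip. The cutoff values and function names are purely illustrative, not Bgee's actual thresholds:

```python
# Illustrative QC filters; the thresholds below are made up for the
# example and are not the actual Bgee cutoffs.

MIN_MAPPED_READS = 1_000_000  # hypothetical floor on aligned reads per library
MIN_PCT_PRESENT = 0.20        # hypothetical minimum fraction of present genes per chip

def keep_rnaseq_library(n_mapped_reads: int) -> bool:
    """Reject libraries where only a few thousand reads could be mapped."""
    return n_mapped_reads >= MIN_MAPPED_READS

def keep_affymetrix_chip(present_flags: list[bool]) -> bool:
    """Reject chips where implausibly few genes are called present:
    a cell cannot function with only 5% of its genes expressed."""
    pct_present = sum(present_flags) / len(present_flags)
    return pct_present >= MIN_PCT_PRESENT
```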
Just to show you an example of the kind of curation we do to check whether the data are really good: here is an Affymetrix experiment in GEO. You get the list of the samples you could retrieve, with basic information attached to each sample. For instance, there was a replicate here: for this sample, they say it was macaque, and that the individual was 6.5 years old. But for the same replicate, from a different tissue, they give a different age. So the same individual is assigned two different ages within the same experiment. Obviously something is fishy, so we just discard both samples. That is, we first try to contact the authors, but if we get no answer, we discard them.

Another example, this time in SRA. Here the information was in the supplementary data, not on the website. In the supplementary data you get the sample names, and two columns: source name and tissue. For this sample, the source name is longissimus dorsi, consistent between the two columns. But here, for the same tissue information, there is a different source name, multiple muscle tissue, which doesn't fit longissimus dorsi. So again, we contact the authors, and if we don't get the information, or if the information is suspicious, we just discard the sample.

Another example: an experiment in GEO with different samples. We have this sample that is available from the GEO interface with no further information. But when you go back to the supplementary data of the paper, this sample carries little stars, and the stars say it was removed because the sample was an outlier. So the authors discarded this sample, but they had to submit it to GEO, because the guidelines of most journals require you to submit all your data. They submitted everything, it's all available, and if you used it as-is from GEO, you would not be aware of that. So we do this kind of work, going back to the publication to check each and every sample. This is why it takes so much time, why you need a human to do it, why it cannot be done automatically: as you can see, the format is obviously not standardized. The work of curators is really valuable. A toy version of the age-consistency check from the macaque example is shown below.

And a last example of how picky we are. It was an Affymetrix dataset of human placenta samples. On one axis you have the value of the quality score I mentioned, IQRray, and on the other the correlation to a reference: from all the placenta samples we built an average gene expression profile, and we compared each individual sample to this average reference; this is what we call the correlation to the reference. You can see a clear relationship: the better the correlation to the reference, the higher the quality score. Here, for instance, you have a low-quality sample, which shows a low correlation to the average of the human placenta samples. But here we had one outlier: a high quality score, but a poor correlation to the reference. So we went back to the data and to the paper, and realized that this sample was not whole placenta, it was decidua. Calling it placenta was not strictly wrong, since the decidua is a substructure of the placenta, but it was not the full placenta, and you can actually see it in the data; it's obvious from the gene expression. So we went back and re-annotated this sample. This is the kind of case for which it is very hard to create automated tests that handle everything; again, manual annotation is really needed to have high-quality expression data.
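As a toy illustration of the kind of consistency check a curator performs, here is a small Python sketch that flags an individual annotated with two different ages across samples. The metadata fields and values are hypothetical; in practice the formats are not standardized, which is exactly why a human is still needed:

```python
from collections import defaultdict

# Toy consistency check: flag individuals whose samples report
# conflicting ages. Field names and values are hypothetical.
samples = [
    {"id": "S1", "individual": "macaque_03", "age": "6.5 years", "tissue": "liver"},
    {"id": "S2", "individual": "macaque_03", "age": "7 years",   "tissue": "muscle"},
]

ages = defaultdict(set)
for s in samples:
    ages[s["individual"]].add(s["age"])

for individual, reported in ages.items():
    if len(reported) > 1:
        # Suspicious: contact the authors; with no answer, discard the samples.
        print(f"Inconsistent ages for {individual}: {sorted(reported)}")
```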
Marc showed you this slide this morning, about the annotation of the GTEx dataset, version 6: about 11,000 samples from 200 donors, something like that. And you can see that we kept only about half of the samples. We went back manually through each and every sample; for each donor, we reviewed whether the whole donor was acceptable, whether we should selectively discard some samples, or whether we could accept everything. So we reviewed these 11,000 samples manually and kept only about half of them.

Okay, so that's the part about curation. It's a lot of work: each sample is mapped to an anatomical ontology, a developmental stage ontology, a sex vocabulary, and a strain vocabulary. We have to harmonize the strain names, because they are not harmonized in the literature. All this work is about being picky; the work of a curator is all about a mindset of being very precise and focused.

So with this annotation, we are able to know the conditions where a gene is expressed: we have the statistical analyses to detect expression, and the curation to know where and when. Now what I'm going to present is: okay, we have the conditions, we have the calls; how do we make sense of all of that? How do we integrate all of that?

The first important thing we do is what we call propagating the present/absent expression calls. Let me give you an example of what that means. As I said, we use ontologies to annotate our data, and ontologies are very useful because, as Marc said, you have relations between the terms. For instance, here is a small part of the anatomical ontology we use: you have a term such as pancreas, and the endocrine pancreas and the exocrine pancreas, which are both part of the pancreas. We also use a developmental stage ontology: you have, for instance, the sexually immature stage, which is part of the fully formed stage. The names are a bit weird, but that's because they need to accommodate any species.

Now we have these gene expression calls: a gene said to be present in the exocrine pancreas at the sexually immature stage; a gene present in the endocrine pancreas at the same stage; and a gene absent from the pancreas at the fully formed stage. You can see that the underlying data were annotated at different levels. For this last gene, we didn't get the precise subpart of the pancreas; we just know it was pancreas, or maybe the whole pancreas was mixed together. And for its developmental stage, the only information was that the organism was fully formed, not an embryo. For the other genes, sexually immature, we just know that the individual was not an embryo and not yet sexually mature; we don't know more.

So this can be represented as a graph of conditions. You say, for instance, that exocrine pancreas at sexually immature is a subcondition of pancreas at sexually immature, which is a subcondition of pancreas at fully formed. I hope this is clear: from the individual ontologies, we can recreate a kind of ontology of conditions, combining the relations between the organs and between the developmental stages. And here, in this condition, you get expression of the first gene; in this condition, expression of the second gene.
And in this condition, you get absence of expression of the third gene. If you look at the data like that, you see there is no condition where we have information for all of the genes: we cannot compare, we cannot integrate. So what we do is propagate these present/absent expression calls in the graph of conditions, so that after propagation you get something like this.

Present expression calls, we propagate to all parent terms. For instance, if you say that a gene is expressed in the endocrine pancreas, it means it is expressed in the pancreas. In the same way, if a gene is expressed in, say, the hippocampus, it means the gene is expressed in the brain, which means it is expressed in the nervous system, which means it is expressed in the organism. So we can propagate all present expression calls to all parent conditions.

For absent expression calls: if you say that a gene is not expressed in the brain, can you really say it is expressed nowhere in the brain? Maybe it is expressed in just a small group of cells in the brain and you missed it. Still, we consider that we can propagate absence of expression to the child conditions, but just one level down. So if we say that a gene is not expressed in the pancreas, we propagate this absence of expression one level down, to the exocrine pancreas and the endocrine pancreas. In this case, it means this absence of expression has been propagated here, to the endocrine pancreas at fully formed. And we do not propagate absent calls along the developmental stages; we use only the anatomy.

So to rephrase: present expression calls, we propagate to all parent conditions, parent organs and parent stages. Absent expression calls, we propagate one level down, and for the anatomy only. We do not propagate them along the developmental stages, because obviously not each and every developmental stage was sampled; whereas for an organ, the authors may have taken the whole organ, crushed it, and measured expression in the full organ, including all its substructures. So the absence of expression of this gene, we propagate here, to the endocrine pancreas at fully formed and the exocrine pancreas at fully formed. And the presence of expression of these genes, we propagate to all their parent conditions.

At the end, as you can see here, after propagation we have information for all three genes: we know that the first gene is expressed in the pancreas at fully formed, that the second gene is expressed in the pancreas at fully formed, and that the third gene is not expressed in the pancreas at fully formed. So from the original calls, we get propagated calls: much more information, which has now become comparable. This is a first level of integration. How do we compare between different conditions and different experiments? We do this propagation, so that we arrive at comparable conditions where the data for the different genes are integrated.
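To make the propagation step concrete, here is a minimal Python sketch using the toy pancreas example above. The relation dictionaries are simplifications I introduce for illustration; the real pipeline works over the full anatomy and developmental stage ontologies:

```python
# Minimal sketch of call propagation in a condition graph; toy relations.
anat_parents  = {"exocrine pancreas": "pancreas", "endocrine pancreas": "pancreas"}
anat_children = {"pancreas": ["exocrine pancreas", "endocrine pancreas"]}
stage_parents = {"sexually immature": "fully formed"}

def cond_parents(cond):
    """Direct parent conditions along the anatomy and stage axes."""
    anat, stage = cond
    parents = set()
    if anat in anat_parents:
        parents.add((anat_parents[anat], stage))
    if stage in stage_parents:
        parents.add((anat, stage_parents[stage]))
    return parents

def ancestors(cond):
    """All ancestor conditions, walking up both axes."""
    seen, frontier = set(), cond_parents(cond)
    while frontier:
        c = frontier.pop()
        if c not in seen:
            seen.add(c)
            frontier |= cond_parents(c)
    return seen

def propagate(calls):
    """calls: {(gene, (anatomy, stage)): 'present' | 'absent'}"""
    out = dict(calls)
    for (gene, cond), call in calls.items():
        if call == "present":
            # Present calls propagate to all parent conditions.
            for c in ancestors(cond):
                out[(gene, c)] = "present"
        else:
            # Absent calls propagate one level down, anatomy axis only;
            # setdefault ensures an existing present call is never overwritten.
            anat, stage = cond
            for child in anat_children.get(anat, []):
                out.setdefault((gene, (child, stage)), "absent")
    return out

calls = {
    ("gene1", ("exocrine pancreas", "sexually immature")): "present",
    ("gene2", ("endocrine pancreas", "sexually immature")): "present",
    ("gene3", ("pancreas", "fully formed")): "absent",
}
propagated = propagate(calls)
# After propagation, all three genes have a call in ("pancreas", "fully formed").
```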
I also want to mention that this anatomical ontology we use is very smart; I love it. We are among the developers, but the people who built this machinery are ontology specialists, and I want to show you how useful it is to use a good ontology.

Here, for instance, is the example of the olfactory tubercle in mouse. There is a structure called the Islands of Calleja, and in mouse, the Islands of Calleja are part of the olfactory tubercle. But in primates, the Islands of Calleja are part of the nucleus accumbens, not of the olfactory tubercle. So you have the same anatomical structure, which is not organized the same way in different species: it does not belong to the same parent structure. How can you capture that in an ontology? Well, you have a mechanism to say that in mouse, Islands of Calleja is part of the olfactory tubercle and not part of the nucleus accumbens, while in primates, including human, it is part of the nucleus accumbens. Ontologies allow you to do that.

I'll quickly show you how it is done, though you may not care so much about the details. Here it's the format called OBO: for this term, Islands of Calleja, you get a part-of relationship to the nucleus accumbens that is valid only in a given taxon, identified by its taxon ID; and in rodents it is part of the olfactory tubercle, with Rodentia as the taxon targeted by the relation. And here is the same thing in Manchester syntax: Islands of Calleja, in Primates, is part of the nucleus accumbens; Islands of Calleja, in Rodentia, is part of the olfactory tubercle. So we have these different kinds of relations depending on the taxon, and we use them accordingly when we propagate the calls: in human, we will not propagate expression in the Islands of Calleja to the olfactory tubercle, but to the nucleus accumbens. It can be very precise, thanks to the ontology. So I want to stress how important it is to use ontologies when you annotate your data, rather than an ad hoc vocabulary or free text, because then you can do this kind of thing, which is quite amazing in my opinion.

Then Marc got this question: how do we reconcile individual calls from different experiments? Let's say, for instance, that a gene shows expression in the pancreas at the adult stage, but absence of expression in the pancreas at the embryonic stage. That would be a first kind of discrepancy, between different conditions. In Bgee, present calls always win. So in that case, if you ask for expression in the pancreas at any developmental stage, we will tell you: this gene is expressed in the pancreas. But if you ask specifically for the pancreas at the embryonic stage, where there is no conflicting data, we will tell you it is absent.

We can have a different case, where two experiments study exactly the same condition, and one says present while the other says absent. Again, present wins; see the sketch below. This is one reason why it's important for us to correct for FDR. But it also makes sense anyway: maybe in one experiment the environmental conditions were not the same, and we just didn't capture that in our annotation, because we are limited in what we can capture. It's a matter of the level of detail you go into: if the condition is just pancreas, then yes, the gene is expressed; if it's pancreas during the night, maybe it's absent, but we didn't capture that information. We can only go as far as our annotations allow.
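The "present always wins" rule is simple, but a sketch makes the behavior explicit. This is my own minimal rendering, not the pipeline code:

```python
def reconcile(calls: list[str]) -> str:
    """Reconcile calls from several experiments for the same gene and
    condition: a single present call outweighs any number of absent calls."""
    return "present" if "present" in calls else "absent"

reconcile(["absent", "present", "absent"])  # -> "present"
reconcile(["absent", "absent"])             # -> "absent"
```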
Okay, so this slide is a bunch of text, I apologize, but I also want to tell you that we compute a confidence level. As I said, at the top of the slide, individual calls have two quality levels, low and high. But then we aggregate the different samples and different experiments in the same condition, and at that point we compute a confidence level for the aggregated call. If, in the same condition, two independent experiments give us the same information, we say we are highly confident about this call, and we label it gold confidence. If we have only one experiment giving a high-quality call, or two experiments giving low-quality calls, we call that silver confidence. And if we have only one experiment giving a low-quality call, we call that bronze confidence. So we have three levels of confidence from the aggregation, based on the number of experiments supporting the call: gold, silver, bronze. For us it makes sense: if you get confirmation from independent experiments, then you can start trusting the information.

The same logic applies to absent calls: gold means two independent experiments telling us that the gene is not expressed, with absolutely no contradictory information, even in the substructures. So when Bgee tells you that a gene is absent with gold quality, it means it has been verified by independent experiments, and nowhere in the condition, not even in a substructure, was expression detected. When Bgee tells you it's absent, it's really reliable, because we are very picky about absent calls: it is very hard to say that something which is not there is really not there. It's very difficult, so we are very picky about that.

Okay, so you see how we can now integrate different conditions and different experiments. This is because we have these calls, this qualitative information that is comparable between data types and between experiments. It's very, very useful, but you miss the quantitative information: you want to know how highly your gene is expressed, and the calls don't give you that. It's a limitation.

To overcome this limitation, we developed what we call expression ranks. Expression ranks are a way to evaluate expression levels. Obviously, between different data types, you cannot just compare the expression values: how would you compare the signal intensity of probesets in Affymetrix with TPM values in RNA-seq experiments? You cannot. So we developed a non-parametric statistic based on ranks. For each data type, we compute a rank based on the expression data, and then we normalize these ranks between data types and conditions to make them all comparable. That allows us to give you a value for the expression intensity. I will give you examples, but let me briefly show you the steps for RNA-seq, because it's the easiest. For RNA-seq, we first take each library and rank the genes based on their TPM values, using fractional ranking. Then, for each gene in a condition, we compute a weighted mean over all the libraries available in that condition. It's a weighted mean; the exact parameters are not important, but basically we weight by the informativity of each library: the more a library allows us to distinguish the expression of genes, the more weight we give to that library.
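Here is a small sketch of these two steps, fractional ranking per library followed by a weighted mean per condition, assuming SciPy for the ranking. The weights are placeholders for the informativity weighting just mentioned:

```python
import numpy as np
from scipy.stats import rankdata

def library_ranks(tpm):
    """Fractional ranks within one library: the most highly expressed gene
    gets rank 1, and tied TPM values share the average rank."""
    return rankdata(-np.asarray(tpm), method="average")

def condition_rank(tpm_by_library, weights):
    """Weighted mean of the per-library ranks of each gene in a condition."""
    ranks = np.array([library_ranks(t) for t in tpm_by_library])
    w = np.asarray(weights, dtype=float)
    return (ranks * w[:, None]).sum(axis=0) / w.sum()

# Two libraries of the same condition, three genes:
tpms = [[120.0, 3.5, 0.0], [90.0, 4.1, 0.1]]
print(condition_rank(tpms, weights=[1.0, 2.0]))  # -> [1. 2. 3.]
```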
So at this step, from the RNA-seq data in each condition, you get a mean of the ranks of all genes studied in that condition. We also store some parameters that will later allow us to normalize between data types and conditions. But that's it for RNA-seq: we rank the genes per library, and we average that within each condition. Very simple.

We also do that for Affymetrix, but for Affymetrix we first normalize between different chip types, because different chip types do not carry the same genes: you may have, for instance, 5,000 genes on one chip and 10,000 genes on another. So we normalize so that the ranks averaged over chips are comparable.

Let me give you the example of in situ hybridization. In situ hybridization is very hard, because it's not quantitative: it's just stained areas on a picture that have been annotated by curators. We just have the information that this gene is expressed in the brain, for instance. How could you get quantitative information from that? What we thought is that the more often an expression pattern is reported in a database, the more biologically important this expression is. For instance, for the Dlx genes during embryonic development, you have hundreds of pieces of evidence, because these genes are essential in development, so their expression is reported very often. So we simply assign a score to each piece of evidence, whether it reports present high quality, present low quality, absent high quality, or absent low quality; then we sum the scores, and we rank the genes in each condition based on this sum. And it actually works. It does not allow us to distinguish gene expression very finely, but it works: for genes with only in situ hybridization data, the ranking of conditions makes sense given the known biology of the gene. And again, we normalize the ranks between data types, so that in situ hybridization data will not mess up the RNA-seq expression ranks. For ESTs, it's pretty much the same: we pool all the ESTs in a condition, count the number of ESTs for each gene, and rank the genes based on that.

Okay, and then we have a normalization between all data types. We look at the maximum rank in a given species, and in each condition we look at the maximum rank in that condition for a specific data type. Then we normalize the ranks of the genes in that condition and data type relative to the maximum rank in the species over all data types. There is a formula on the slide, but that's the basic idea: for each data type in each condition, we rank, we average, and we normalize across all conditions and data types so that everything is comparable. And at the end, we take a global mean over all the data types. It's again a weighted mean: each data type gets a weight based on the data that were available, so that you give more weight to RNA-seq than to in situ hybridization, for instance, when you have information from both. So at the end of the day, for each gene in each condition that has been studied, Bgee gives you a rank telling you how highly the gene is expressed.
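Here is a sketch of that normalization, of the global weighted mean, and of the conversion to the 0-to-100 expression score described next. These formulas are my simplified reading of the idea; the exact ones are in the pipeline source on GitHub:

```python
def normalize_rank(rank: float, max_rank_data_type: float,
                   max_rank_species: float) -> float:
    """Rescale a rank so that data types covering different numbers of
    genes become comparable across conditions and data types."""
    return rank * max_rank_species / max_rank_data_type

def global_rank(ranks: list[float], weights: list[float]) -> float:
    """Weighted mean over data types, e.g. giving RNA-seq more weight
    than in situ hybridization when both are available."""
    return sum(r * w for r, w in zip(ranks, weights)) / sum(weights)

def expression_score(rank: float, max_rank_species: float) -> float:
    """Map a rank onto a 0-100 scale where higher means more highly
    expressed (the inversion discussed next)."""
    return 100.0 * (max_rank_species - rank + 1) / max_rank_species
```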
But ranks are really non-intuitive, and users often ask questions about this, because the higher the expression, the lower the rank. When you look at the value, you see a small number corresponding to a high expression, and that's confusing. So we translate these ranks into what we call an expression score. An expression score is normalized between 0 and 100, so it's very clear, you don't have values outside this range, and the higher the expression score, the higher the expression. It's much more intuitive.

So, I think I should stop soon, but just to finish, I'll show you an example of what we do with all that. Marc is going to present the tools in more detail next, but I will show you how we use these scores and ranks on the gene pages of our website. I'll take the example of one gene, APOC1, an apolipoprotein gene mostly active in the liver. You can see that liver is the top-ranked condition in human. This is the rank, normalized and weighted, computed here from Affymetrix and RNA-seq data together. And we have this expression score, which is really, really high. What's interesting is that the expression is actually high in a lot of other tissues: this gene is expressed in more than 200 organs and highly expressed in most of them. This is often surprising, because in the literature, authors put forward the condition where a gene is important and don't talk about the other conditions. But actually, most genes are expressed in lots of conditions, even where they are not supposed to have a function, which is a bit disturbing. So this gene is highly expressed in a lot of tissues, but still Bgee manages to identify the condition where this gene is essential, I mean, where this gene is the most highly expressed, from different data types. You get Affymetrix data here; here you don't get Affymetrix data, but still it all blends together.

And then you get the orthologous genes in other species. I show you here seven other species, from primates and rodents to fish, and you can see that every time, the top-ranked condition is liver. For zebrafish, that even comes in part from in situ hybridization data; for this species you get only RNA-seq; here you get Affymetrix; here only Affymetrix, there RNA-seq. So we manage to take all these data, from different data types in different species, and give you one single answer to the question: where is my gene expressed? And this answer makes sense, because it's ranked by our expression score.

Okay, so that's it for this presentation. In summary: to know the conditions where genes are expressed, we annotate anatomy, development, sex, and strain. We have very stringent filtering of the data, through manual curation and quality controls. We are picky; we are a secondary database; we keep only the highest quality possible. We integrate all these data by generating present/absent expression calls, by propagating and reconciling these calls, and by computing expression ranks and expression scores. And if you want to look at all of that in detail, the source code of our pipeline is available on GitHub. Everything is available; you can check absolutely everything. And that's it.