So I presented to you part of the statistical analysis we do in Bgee to generate a p-value for each gene in each sample, testing the hypothesis that the gene is actively expressed in it. Now I'm going to show you how we integrate all of that in Bgee so that you can compare expression between conditions, experiments and species.

Okay, so again I'm showing you these pipelines. I presented the detection of active signal of expression, and now I'm going to present how we integrate all this data consistently into Bgee. An important point that Marc mentioned in the overview is that we have expert curation. It's very important to have very precise and detailed annotation if you want to be able to compare the data between different species, and I hope that will be clear after what I present. As a reminder, what we annotate is anatomy information, so anatomical entity and cell type; development information, so developmental and life stage; and sex and strain.

I'm just throwing a bunch of numbers at you here so that you have an idea of what this represents as an amount of data. For bulk RNA-seq, for the latest release of Bgee, we have above 16,000 RNA-seq libraries manually annotated like that to anatomy, development, sex and strain, about 13,000, and like 1,500 full-length single-cell RNA-seq libraries. For the next release we are now working on integrating target-based single-cell RNA-seq libraries. And here you can see what that represents. In situ data are really very detailed, precise annotations, in a very specific sub-region of the brain or whatever, so that generates a lot of conditions: we have a lot of information, but for a small number of genes at this precision, while for bulk RNA-seq we will have fewer conditions but for the whole genome.
And when we look at that, for instance considering only bulk RNA-seq and Affymetrix, where we have access to expression information for almost all the genome, that represents almost 7,000 conditions in 700 organs. So integrating all this manually curated information provides us with gene expression information in a great many conditions and anatomical entities. In typical experiments taken in isolation, you would look at expression per experiment, and typically you would have access to expression information in maybe 50 organs, for instance, and not 700 as in Bgee.

So now I'm going to show you how we use the ontology. Marc presented to you what an ontology is, and I'm going to show you how we use it to propagate and reconcile all this expression information across experiments, across anatomy, across species and whatnot. I'm going to give you an example with a simple part of the ontology. For the anatomy, you have endocrine pancreas, part of pancreas, and exocrine pancreas, part of pancreas. We are going to use only these three terms for the anatomy. For the development, you have the sexually immature stage, so the juvenile stage for most species, which is part of the fully formed stage. The fully formed stage is the post-larval, post-embryonic stage, where your organism has completed its development and all its parts are there. Among these you will have sexually immature or sexually mature, but I'm going to use only these two terms.

Okay, so now let's imagine that we have information for three genes in these conditions: endocrine pancreas, sexually immature; exocrine pancreas, sexually immature; and pancreas, fully formed. What I represent here is a graph of conditions considering only these three anatomical entities and these two developmental stages.
So we use all permutations of these terms to generate what we call a graph of conditions, an ontology of conditions. You will have the exocrine pancreas at the sexually immature stage, which is part of the exocrine pancreas at the fully formed stage, which is part of the pancreas at the fully formed stage. You can see here that we have all possible permutations, with relations between them. In the first place we have expression information only for these three conditions here: exocrine pancreas, sexually immature, for one of the genes; endocrine pancreas, sexually immature, for another gene; and pancreas, fully formed, for the third gene. You can see that at this point there is no condition in which we can make a comparison of expression, because we don't have data for these three genes in the same condition. If we were to use the data just like that, we would be stuck here and not be able to do any comparison of expression between these three genes.

So what we do is propagate the expression information in this graph. We know, for instance, that if this gene is expressed in this condition, it means that it is also expressed in this condition, and in this condition, and in this condition. This is what we do here: we propagate the information. Okay, I hope that is clear. What we end up with is comparable expression information in this pancreas, fully formed condition: because we have propagated the information from all the different conditions, at the end of this process we end up with one condition with information for the three genes. So this is how we manage to perform comparisons in Bgee.

What I also want to show you is that, in the same way that our curation process is very precise, the ontology work to describe anatomy and development is also very precise, and it accommodates differences between species. I give you an example here of differences between species in anatomy.
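To make the pancreas example concrete, here is a minimal sketch in Python of building the graph of conditions and propagating information to parent conditions. It is only an illustration under simplifying assumptions: the dictionaries, the function names and the gene labels ("gene A", "gene B", "gene C") are hypothetical, each term has a single parent, and the sketch tracks only which genes have data in a condition rather than the values that are actually propagated.

```python
from itertools import product

# Toy ontologies from the example: child -> parent ("part of") relations.
# Assumption for this sketch: every term has at most one parent.
ANATOMY_PARENT = {
    "endocrine pancreas": "pancreas",
    "exocrine pancreas": "pancreas",
    "pancreas": None,
}
STAGE_PARENT = {
    "sexually immature": "fully formed",
    "fully formed": None,
}


def ancestors_or_self(term, parent_of):
    """Return the term followed by all of its ancestors."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parent_of[term]
    return chain


def parent_conditions(condition):
    """All conditions that an (organ, stage) condition propagates to, itself included."""
    organ, stage = condition
    return set(product(ancestors_or_self(organ, ANATOMY_PARENT),
                       ancestors_or_self(stage, STAGE_PARENT)))


def propagate(observations):
    """Map each condition of the graph to the set of genes with data after propagation."""
    # The graph of conditions: every permutation of organ and stage.
    genes_by_condition = {c: set() for c in product(ANATOMY_PARENT, STAGE_PARENT)}
    for gene, conditions in observations.items():
        for condition in conditions:
            for target in parent_conditions(condition):
                genes_by_condition[target].add(gene)
    return genes_by_condition


# Hypothetical gene labels; one annotated condition per gene, as in the example.
observations = {
    "gene A": [("exocrine pancreas", "sexually immature")],
    "gene B": [("endocrine pancreas", "sexually immature")],
    "gene C": [("pancreas", "fully formed")],
}
propagated = propagate(observations)
comparable = [c for c, genes in propagated.items() if len(genes) == 3]
print(comparable)
```

Before propagation no condition holds data for all three genes. After propagation, the only condition where the three can be compared is pancreas at the fully formed stage, which matches the slide.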
So in the brain you have a structure called the islands of Calleja, which I like, yeah, and which is part of the olfactory tubercle in mouse. You have the structure here, which is part of the olfactory tubercle; those are brain slices in mice. But if you look in primates, the islands of Calleja are actually not part of the olfactory tubercle, they are part of the nucleus accumbens, here. So the structures are present in mouse and in primates, but they are not in the same area of the brain: they don't have the same part-of relationship. If we were to propagate information using the graph of conditions here, the graph would look different depending on whether it was a propagation for mouse or a propagation for primate. To summarize: islands of Calleja are part of the olfactory tubercle in mice, and not of the nucleus accumbens, but they are part of the nucleus accumbens in human, for instance, and not part of the olfactory tubercle.

So you have specificities like that in the graph representing anatomy in different species, and we have an ontology tool for that. I'm just throwing this at you, you don't have to understand it; it's just to show you how we capture that in an ontology. The ontology here is represented in OBO format, and here it is in Manchester syntax. What it says, basically, is: when the islands of Calleja are in primates, they are part of the nucleus accumbens, but when they are in Rodentia, they are part of the olfactory tubercle. So this is the ontology trick that we use to be able to represent anatomy in any animal species and have correct propagation of expression information depending on the species.

So here I have a little Wooclap for you. Again, from the main page, you have the data integration.
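Going back to the islands of Calleja example, the idea of a taxon-specific part-of relation can be sketched as a lookup keyed by taxon. This is only a toy illustration in Python: the dictionaries and the function name are hypothetical, and it is not how the OBO or Manchester syntax axioms are actually stored or evaluated.

```python
# Hypothetical, simplified encoding of taxon-specific "part of" relations.
PART_OF_BY_TAXON = {
    "islands of Calleja": {
        "Primates": "nucleus accumbens",
        "Rodentia": "olfactory tubercle",
    },
}

# Hypothetical species-to-taxon lookup, limited to the two species of the example.
TAXON_OF_SPECIES = {
    "human": "Primates",
    "mouse": "Rodentia",
}


def parent_structure(structure, species):
    """Return the structure that `structure` is part of in the given species."""
    taxon = TAXON_OF_SPECIES[species]
    return PART_OF_BY_TAXON[structure][taxon]


print(parent_structure("islands of Calleja", "mouse"))
print(parent_structure("islands of Calleja", "human"))
```

With a lookup like this, the parent used during propagation depends on the species, so the graph of conditions built for mouse differs from the one built for human.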
Yeah Frédéric, the link to the activities here, the sharing doesn't work for this Google document, I don't know why. You don't see my screen? I see your screen, but if we click on the activities link, it's not shared; you have to do the sharing. Yeah, my bad, I thought that it was okay. Well, it should be shared actually, I don't know. Does it work for other people? For me it says not at all. Okay, try the link here again, because it works now. No, access denied; we need to provide access. Does it work now? Yeah, now it works.

Okay, so here we are at this first question, about the graph of conditions in Bgee. Please follow this link, and Marc, if you can activate the vote. It's active, you have 56 seconds left. Okay, so please vote. Multiple answers are allowed, so please click any answers that you think are correct, and then I will stop my screen sharing for Marc to display the results. We have one answer, 32 seconds left. Yeah, so please follow this link here. Yes, answers are coming in. Okay, so I stop my screen sharing for you, Marc, to display the results at the end.

Okay, so the graph of conditions generated in Bgee considers all permutations of anatomical entity, dev stage, sex and strain: yes, it considers all of that. As I showed you, we had only three anatomical entities and two developmental stages, and we generated conditions for all permutations of these terms in the example. We do that for all organs, all dev stages, all sexes and all strains, leading to something like tens of thousands of conditions in Bgee representing conditions of expression. It varies between species: yes, it does. As I showed you, you have different relationships between organs in different species, and we take that into account to produce the graph of conditions, so this graph will be different depending on the species. It uses the anatomical ontology: yes. It uses the developmental stage ontology: yes, as well. And yes for everything: that was a trick question, everything
was true. I didn't show it to you, but we also use a kind of sex ontology and a strain ontology. If we have data annotated in male or in female, at some point we might like to make the comparison without taking the sex into account, to make the comparison for any sex, really. This is what you see on the gene page in Bgee: when you see only the anatomical entity, it is not taking the sex into account, which means that we have propagated all the information from female and male to a common root. So in a way it is an ontology, because female is a subtype of any sex. For strains it's the same: all the strains that we annotate in Bgee are a subtype of wild type, because in Bgee we only annotate healthy wild-type data. So again, if you want to compare your gene expression without taking the strain into account, we need to propagate all this expression information to wild type so that it is applicable to any strain. So here, basically, all the answers should have been clicked.

Okay, so if you can stop your sharing, I go back to mine. Okay, you see full screen, right? Okay, so I continue.

So we propagate this expression information, meaning p-values. For each sample, for each gene, we have generated p-values to estimate whether the gene was actively expressed, and then we propagate these p-values. Here it's written present/absent, but what is really propagated are p-values: each p-value from each gene and sample is propagated along this graph of conditions, so that at the end of the day we get p-values in this condition for all three genes. But then we might have many samples and many conditions for the same gene, so we would end up with many p-values for a given gene in this condition, for instance. So what we do is an FDR correction: we compute the FDR-corrected p-value by the Benjamini-Hochberg procedure, the BH procedure for correcting for false discovery rate. At the end of the day, this means that for each gene in each
condition we will end up with one FDR-corrected p-value, and this is what is displayed on the gene page in the FDR column. What we also keep as information is the FDR-corrected p-value in sub-conditions. For gene A, for instance, we will have one FDR-corrected p-value in this condition, but we will also keep in mind the best FDR-corrected p-value among the sub-conditions. That is important for the next slide.

At the end of the day, in Bgee we show simplified information, present or absent, with three quality levels: gold, silver or bronze. Present gold is when, for a gene in a condition, the FDR-corrected p-value is below 0.01. Present silver is when the p-value is below 0.05. Present bronze is when, in the condition itself, the p-value was not significant, but in a sub-condition the p-value was significant. If I go back here, imagine that for gene A in this condition the p-value is not significant, but maybe here, in that substructure, it was significant. When we took all the information into account, the correction led to a non-significant p-value here, but the gene was still expressed in this sub-condition. We do that to have consistent information. For instance, let's say that you have expression of the gene in the hindbrain, but then when we integrate all the data in brain we conclude that it's not significant. It would not make any sense to say that the gene is expressed in the hindbrain but not in the brain, right? So this is why we also take into account information from the sub-conditions, but with a low quality level: it is present bronze. In Bgee, by default, we only display the gold and silver levels. So if you use our packages or download our data you might see the bronze information, but on the Bgee website you will see only the gold and silver, highly reliable information.

Then we also have absent expression calls, and basically I'm not going to go into
too much detail, but absent is when it's not present and you have an FDR p-value provided by what we call trusted data types for absent calls, meaning RNA-seq, Affymetrix and full-length single-cell RNA-seq, because we don't trust target-based single-cell or EST data to produce reliable information that the gene was not active. So then you have absent gold if the FDR p-value was above 0.1, silver if it was above 0.05, and bronze if the FDR p-value was not significant and came from any data type, so even target-based single-cell RNA-seq. But again, we won't display that in the Bgee interface.

So again I have a small question here for you, the second question, about present expression calls. Please follow this link here, and Marc, if you can activate the Wooclap. Marc, yeah, it's the "A gene is considered expressed in a condition" one. Yeah, starting the right one. Okay, it's started. Okay, so I'm going to stop my sharing so that Marc can show the results. A lot of last-minute changes in the votes.

Okay, so a gene is considered expressed in a condition if, for this gene, the FDR p-value is below 0.05 in this condition: yes. The FDR p-value is below 0.05 in any of the sub-conditions: yes, at worst that would be a present bronze, but it might be a present silver or gold if the p-value is significant in the condition itself as well. The FDR p-value is below 0.05 in this condition and above 0.05 in all sub-conditions: yes, it is true as well. It was kind of a trick: it doesn't matter that it's not significant in the sub-conditions if it is significant after propagation. It means that maybe we didn't have enough statistical power in the different parts of the brain, for instance, but when we integrate all this data, when we propagate all this data at the brain level, then we have enough statistical power and we can say that the gene is expressed in the brain. Okay, so again, all these answers were supposed to be true.

Okay, so I get back to the presentation. So now I'm going to quickly give
you an idea of what we call expression ranks and expression scores. The problem with our approach of generating binary present/absent expression calls is that it's qualitative only: you don't have any quantitative information about the expression level, so you miss this information to compare expression levels between experiments and between genes. But how can we generate an integrated expression level from different data types, different experiments, different genes and different species? We could not use parametric statistics for that, so we use non-parametric statistics based on ranks, meaning that in each sample we rank the genes based on their expression levels. Then, for each data type, we compute weighted mean ranks based on the expression data, and we normalize these ranks between data types and conditions so that they are comparable. We also propagate these ranks using the graph of conditions, in the same way that we propagate the p-values.

Now, ranks are not intuitive, because the lowest rank is the highest expression level, and users were often confused by that. So what we do is transform these ranks into what we call an expression score. An expression score is between 0 and 100, and the higher the expression score, the higher the expression level of your gene.

Okay, so I have a last Wooclap question here, very quickly. If you can quickly launch the vote please, Marc: a last question about integration of expression information, just to see if all of that makes sense to you. The vote is started. Yeah, if you can go to the vote page and vote, then I'll just wrap up the presentation. We should have put an option: we decide by a Wooclap vote of users. So for now the variance in the replies is pretty low. Ah, more variance is appearing.

Okay, so how is expression information integrated in Bgee? P-values are propagated along the graph of conditions:
yes, so this was all part of this presentation, and I had hoped for more yeses to this one. Present/absent expression calls are propagated along the graph of conditions: no. We generate the present/absent expression calls from the p-values, from the FDR-corrected p-values that were propagated, so we don't propagate the calls directly; we propagate the p-values and then we make the calls. Only the expression information in the condition itself is considered: no, we also consider the p-values coming from all substructures, all sub-conditions, because we propagate them and then look at them in each condition. Expression ranks and expression scores are propagated along the graph of conditions: yes. So it was a kind of question summarizing a bit of everything, and obviously the latest things presented are what you had in mind. This one was not true and this one was not true, so the first and last answers were true.

Okay, so I'm going to wrap up the end of this presentation in 30 seconds. I wanted to show you an example. It's the gene page, the result gene page, for APOC1, which is an apolipoprotein, a digestive protein produced by the liver. If you look at human, you can see that thanks to the expression score we manage to show the highest expression, at the top, in liver. Just to mention also that we report absence of expression, so this is where this gene is not significantly expressed. You can see a lot of epithelium terms here, so apparently, considering this data, this digestive protein is not expressed in epithelium. What I want to show you is that we have very consistent information across all species: mouse, liver; zebrafish, liver; primates, liver. For all these species, the top term that comes up for this APOC1 gene is always liver. And if I do an expression comparison using the tool that Marc highlighted just before, so I enter the list of all orthologs in Theria, and the
result I get is that expression is conserved for all my genes: I have conserved expression in liver, epithelial system, endocrine and digestive system. So again, this is how we can compare this information across species and experiments.

So, just to summarize: to uncover the conditions where genes are expressed, Bgee performs manual and precise annotation to anatomy, development, sex and strain. Then we integrate all of that by generating a p-value for each gene in each sample and computing expression ranks and scores for each gene in each condition, and then we propagate all this information along the graph of conditions and generate an FDR-corrected p-value.
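As a recap of the call-quality rules described in this talk, here is a minimal sketch in Python. It first shows the standard Benjamini-Hochberg adjustment on a list of p-values, then a classifier that applies the thresholds quoted earlier (0.01, 0.05 and 0.1). Several things are assumptions of this sketch and not a description of the Bgee pipeline: the function names, the handling of values exactly equal to a threshold, and the fact that the classifier takes the single FDR-corrected p-value of a condition as an input, since how that value is derived from the many corrected p-values was not detailed here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_call(fdr, best_sub_fdr=None, trusted_for_absent=True):
    """Classify one gene in one condition as (call, quality).

    fdr: FDR-corrected p-value in the condition itself.
    best_sub_fdr: best FDR-corrected p-value among sub-conditions, if any.
    trusted_for_absent: True when the data come from a data type trusted
        for absent calls (hypothetical flag for this sketch).
    """
    if fdr < 0.01:
        return ("present", "gold")
    if fdr < 0.05:
        return ("present", "silver")
    # Not significant here, but significant in at least one sub-condition.
    if best_sub_fdr is not None and best_sub_fdr < 0.05:
        return ("present", "bronze")
    if not trusted_for_absent:
        return ("absent", "bronze")
    # Assumption: values between 0.05 and 0.1 inclusive count as silver.
    if fdr > 0.1:
        return ("absent", "gold")
    return ("absent", "silver")


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])
print(adjusted)
print(classify_call(0.2, best_sub_fdr=0.01))
```

The second call illustrates the present bronze case from the brain example: the gene is not significant in the condition itself, but it is significant in a sub-condition, so it is still reported as present, with the lowest quality level.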