So thank you for staying with us. We now have the last part of the lectures, on our tools and how to access the data in practice. Frédéric showed you a lot of how it's done behind the scenes, and I'm going to show you what we do with it in front of the scenes. I'm going to start by showing you how you can get our data, because we have a lot of data integrated and maybe you need it for your own analysis. The first thing to be clear about is that our Bgee data is completely free to use. It is released under CC0, Creative Commons Zero. That is the license which is the same as the public domain. So it's like the text of Shakespeare or something: you can do whatever you want with it. This is the same as, for example, all the resources at the NCBI. If you take a gene sequence from GenBank, you don't need to ask them for permission to do science with it. You don't need to cite them. You don't need to do anything. There is no legal obligation at all. Then there are academic standards: we expect that if you use Bgee a lot, you would cite us, but we are not going to send lawyers after you, basically. So that's the first thing. And I have a small Wooclap poll to make sure that this is clear for everyone. I will stop sharing to launch the poll. There, I have launched the poll. Can you please vote on the availability of Bgee data? Yes, it's a different link from the master document; you have the link to the Wooclap there. The Wooclap link is always the same, actually, so if you just keep that tab open, it's always the same. That's what's nice with Wooclap. And since I just explained it, I hope it is clear, but this way we reinforce the learning. You can tell the SIB training group that we learned from the trainers' course. So I think this is pretty clear. I'm going to restart sharing immediately. Okay, so do you see the Wooclap on your screens?
Am I sharing the Wooclap? Yeah, yeah. Okay, so everyone agrees the data is freely available. So I think this is clear to everyone. That's cool. Now, when you download our data, we have download files that you can access from Bgee, and we can provide either processed data or expression calls. What we call processed data is really the quantitative data. So for RNA-seq it will be counts, RPKMs, TPMs. We provide processed data for the data types which we process ourselves and where this quantitative expression is relevant: Affymetrix, bulk RNA-seq and single-cell RNA-seq. We do not provide it for in situ, because that comes from the model organism databases, and we do not provide it for expressed sequence tags (ESTs), because we have them for historical reasons but it's not such good data. For each species you get a folder you can download, which is compressed and which you can uncompress, and which contains all the information on processed expression data. You can know that this data has already been curated, so it's wild type, healthy. You can know that if we tell you this is liver, we checked that it's liver. And you can know that it was processed by uniform pipelines, so that if you compare the different datasets and combine them in some analysis, they were all processed the same way and mapped to the same genomes in a consistent manner. We also have files with expression calls: what Frédéric was just explaining, where and when a gene is present or absent, and you can download this. For example, say you want to look at interactions between proteins in the brain, but you want to limit this to genes which have expression in the brain, and not keep all the interactions which are found by an interaction experiment but involve genes which would not be expressed in the brain of a healthy individual. You can download this file and filter on present in the brain.
And you can either use our call or decide to be more or less stringent, since we also provide the p-values and the FDR-corrected p-values. We have simple files, which are like when you first go to the gene page and just see the anatomy. So that's what the simple files are: you're just going to see the anatomy without much detail, summarized over all the data types. Or there are the advanced files, where you're going to be able to see all the parts of the condition, if you want to select on sex, developmental stage and so on, and also the different types of data. If you only trust calls from RNA-seq, but not from microarrays, you can filter on this. These are classic CSV files that you can import into any tool such as R or Python pandas, provided you have enough memory, depending on the file. And so what's in these files? In the processed data file, you have the experiment ID, the library ID, so all the information if you need to go back, and the library type. Then, for each gene, so you have a row per gene per library, you will have the gene and where it is expressed. So here you have the cell ontology term, the developmental stage, the sex and the strain, all this information. Then you have all the quantitative information. If you just want the raw read counts and you want to treat them yourself, you can have them. We also provide the normalized values: TPM, which we recommend, and FPKM, which we provide for historical and consistency reasons. You have the rank, which corresponds to the rank score that Frédéric was speaking about, whether the gene is present, with what p-value, and how it is used in Bgee, if you want to be consistent between what you see in the libraries and what you see on our web page. A gene could be found present in one library, but when we combine it with the other information, with the correction for multiple testing, the FDR, it is no longer considered present, so it might not be part of the call. And we provide something similar for microarray.
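The filtering on calls and FDR described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the column names (`Gene ID`, `Anatomical entity name`, `Expression`, `FDR`), the tab-delimited layout and the four rows are assumptions made up for the example, so check the header of the file you actually download and adjust `delimiter` if needed.

```python
import csv
import io

# Hypothetical excerpt of an expression-calls file. Column names and
# values are illustrative assumptions, not the real Bgee header.
SAMPLE = (
    "Gene ID\tAnatomical entity name\tExpression\tFDR\n"
    "GENE_A\tbrain\tpresent\t0.0004\n"
    "GENE_B\tbrain\tpresent\t0.03\n"
    "GENE_C\tbrain\tabsent\t0.6\n"
    "GENE_A\tliver\tpresent\t0.002\n"
)


def genes_present(handle, organ, max_fdr=None):
    """Return the set of gene IDs called present in `organ`.

    If `max_fdr` is given, additionally require FDR <= max_fdr,
    which is stricter than relying on the call alone.
    """
    genes = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Anatomical entity name"] != organ:
            continue
        if row["Expression"] != "present":
            continue
        if max_fdr is not None and float(row["FDR"]) > max_fdr:
            continue
        genes.add(row["Gene ID"])
    return genes


brain_default = genes_present(io.StringIO(SAMPLE), "brain")
brain_strict = genes_present(io.StringIO(SAMPLE), "brain", max_fdr=0.01)
print(sorted(brain_default))  # genes called present in brain
print(sorted(brain_strict))   # same, restricted to FDR <= 0.01
```

For a real download you would pass an open file handle instead of `io.StringIO`. Reading row by row also avoids loading a very large file into memory at once.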
And then for the presence/absence calls, you have a simplified file with the gene, its name, where it is expressed and when it's expressed. So you have here the anatomy, development and so on. This is the simplified file, so there's not a lot of detail: whether the gene is present, and what trust we have in the call. Frédéric spoke about gold, silver, bronze. There's an expression rank again, and you have complex or simplified files. So this is our data; now how do you obtain it? Well, first you can go to the web page. If you go to our homepage, you see all those species. They're not just pretty pictures: you can click on any one of them, and when you click on one of them, it unfolds. So here I clicked on human to unfold this box where you have all the downloads. You can download the processed data here for RNA-seq, Affymetrix or single-cell: either just the experiment information, to know what experiments we have, which is smaller, or the full files of processed data with the counts, TPMs and so on. On the web page you can also switch between accessing these processed data and the presence/absence calls data. You can also get the data directly through an R package, and if you're working in R, I really recommend this way, because these files are quite big and annoying to parse yourself, whereas in the package we already have all the methods to parse our data. So it's R functions which allow you to directly retrieve RNA-seq or microarray data from Bgee, and we provide you processed gene expression data with functions to format it into an ExpressionSet object, which is classic Bioconductor. Our package handles expression data, we have a publication, and we have documentation on Bioconductor. Because it's part of Bioconductor, we get regular alerts, which make sure that it keeps working in all the different environments. So if you have some older version of R on Windows, it should work, and whatnot.
So that's one way to recover the data, and now I will do another Wooclap, and here I'm going to poll you. This is not a quiz. I want to know from you, if you see these different ways of getting data from Bgee, what would you like to do? How would you like to get data? I presented you the web and R, but we also have FTP and SPARQL for those who are more into programmatic access. So can you tell me what would be your favorite way to get the data? So we don't have a hardcore geeky bioinformatician accessing through SPARQL so far? Because I have hidden slides to present the SPARQL endpoint. I'll totally do it if people vote for this. So you can tell your hardcore bioinformatician friends that we do have a SPARQL endpoint. Okay, so most people want to download from the web or go through the R package. Okay. So thank you. This was just a little poll to know how you feel about it. Now I'm going to present the gene page. I will be pretty short on this, because both in my demo at the very beginning and in what Frédéric presented just now, we heard a lot about the gene page. This is just to make sure that, when I present all our tools, we remind you of this. Expression conditions are defined in the gene page by anatomy, development, sex, or strain. You can restrict the data you use. If you remember, earlier this morning I showed you that we had the list where we could click yes or no for RNA-seq, microarrays, and so on. We show significant present and significant absent calls, from direct observation and from propagation, as Frédéric just explained. You can see the orthologs and the paralogs, which we take from the OMA database, the Orthologous MAtrix, a database of orthologs and paralogs. You can send them directly to our expression comparison tool or just get the list. And we have cross-references at the bottom of the page too: for example UniProt, or Ensembl, which is our key reference for genes for species which are in Ensembl, and model organism databases when relevant.
For example, if you go to a zebrafish gene, there will be a link towards ZFIN, which is the model organism database of zebrafish. And I have a little Wooclap for this too. So this is again asking your opinion: I would like to know, when you go to the gene page, what would you like to see first? You can click on only one or on several. So if you're only interested in developmental stage, not in anatomy, sex and strain, just click on this. But if you want to always see all four conditions together, click on all four, okay? Yeah, actually we won't be able to distinguish between people who want only sex or all of them at the same time. This is not a full usability study. Yeah. But at the same time, it confirms that the first axis of expression that most of you want to see is anatomy. Okay, so thank you for your votes. I see that there's less love for strain, and developmental stage and sex are obviously important, but really anatomy remains the main structure. When I ask where a gene is expressed, when I ask about the expression pattern of a gene in an animal, I usually mean whether it's expressed in the brain, or wherever. So that's consistent with our expectation. Thank you. Now I'm going to present a slightly more complicated tool. The previous ones went fast because this was stuff you had already heard about, and I was just putting it back in the context of the tools of Bgee. This is a tool we haven't spoken about yet and it's quite important, so I'm going to take a bit more time on this. TopAnat: Top like "the top", and Anat like "anatomy". What TopAnat does is very similar to a Gene Ontology enrichment, but instead of asking which Gene Ontology functional annotations are more frequent than I expect by chance in my list of genes, I'm going to ask in which anatomical structures my genes are more expressed than I expect by chance. So I have a list of genes and I'm going to ask: are these genes actually much more frequently expressed in some structure?
I have, for example, digestive genes: are they much more expressed in liver and intestine than expected by chance? And for example, if I did some experiment where I was trying to find such a list of genes but I don't know if I succeeded, this would give me a strong confirmation. So for each gene in Bgee, we have expression in anatomical structures. That's what we do. And if I have a list of genes, some structures will be represented many times in the list. It means that many of the genes on my list have an association to the same anatomical structure. So for example, here I have a list of genes, and in my list I have 46 which have expression in the frontal pole of the brain, which is UBERON:0002795. I also have 56 which are not expressed in the frontal pole, so my list is only about 100 genes. So is this 46 out of 102 more than expected by chance? Well, I can compare with what I expect from the general genome. If I look at all the genes in the human genome, these are human genes, I have about 10% which are expressed in the frontal pole. This total is much more than the 20,000 protein-coding genes because it also includes non-coding genes which have significant expression, okay? And so we can do a ratio between these two. From these proportions I would expect six genes of my list to be expressed in the frontal pole, so we have 7.6 times more than expected by chance. We can do a Fisher exact test, which is like the more exact version of a chi-square test, and we find that this is super duper mega significant. And even if I correct for multiple testing, since I tested over all the anatomical structures, it's still very significant. So we have a strong enrichment of the genes in this gene list in the frontal pole of the brain.
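The arithmetic of this example can be written out as a one-sided Fisher exact test, which is the upper tail of a hypergeometric distribution. The sketch below is not TopAnat's code. The 46 and 102 come from the example above, but the background counts (`K=2000`, `N=34000`) are invented for illustration and chosen only so that the expected count comes out at six.

```python
from math import comb


def enrichment(k, n, K, N):
    """One-sided hypergeometric (Fisher exact) enrichment test.

    k: genes in the list expressed in the structure
    n: size of the gene list
    K: background genes expressed in the structure
    N: size of the background
    Returns (expected count, fold enrichment, p-value for >= k).
    """
    expected = n * K / N
    fold = k / expected
    # Exact tail probability P(X >= k), summed with exact integers.
    tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    p_value = tail / comb(N, n)
    return expected, fold, p_value


# 46 of 102 list genes are expressed in the frontal pole (from the talk).
# The background counts are hypothetical, picked to give an expectation of 6.
expected, fold, p_value = enrichment(k=46, n=102, K=2000, N=34000)
print(f"expected={expected:.1f} fold={fold:.2f} p={p_value:.2e}")
```

Changing `K` and `N` shows directly how much the result depends on the choice of background.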
And these genes are the number one example. If you go to the TopAnat website, we have pre-computed examples. These are genes which have been associated, in various studies, with autism and similar phenotypes. Autism and schizophrenia, actually, that's it. Okay, so it makes sense biologically to find them more in the frontal pole of the brain, which is known to be involved in autism and schizophrenia. So in a more general way, what we do is that for each anatomical structure for which we have data, we're going to compare our gene list to a background, look at how many genes are expressed there relative to the expectation from the background, and do a Fisher or hypergeometric test. And we're going to do a decorrelation of the ontology graph, which is a bit complicated, but basically if you have a gene which is expressed in the frontal pole, it's necessarily also expressed in the brain, as Frédéric just showed. So if these genes are more expressed in the frontal pole, they're also more expressed in the brain, but that's not extra information; it's the same information which I'm saying again. It's the same thing in the Gene Ontology. If you have more hydrolases in your list, you probably have more enzymes, because hydrolases are a type of enzyme, but that's boring. So we have modified the R package topGO, which is a package for Gene Ontology enrichment which does this decorrelation, and we've modified it so that it treats not the Gene Ontology but Uberon, the ontology of anatomy, and we can do the same decorrelation. So our TopAnat tool allows you, again, to put in any gene list and get the enrichment in anatomical structures, for any species which is in Bgee. Obviously, for the species with more data it will be more powerful, but you can do it in any species which is in Bgee. And something I should mention here, which I didn't write on the slide: it only works on data which is observed in that species.
So if you do a Gene Ontology enrichment test on platypus, what you're going to see is what we hope to be the function in platypus, from BLAST hits on human and mouse, basically. No one has done the molecular studies of gene function in platypus. If you do a TopAnat analysis on platypus, what you see is expression in platypus, strictly in platypus. So it's experimentally observed in platypus, on data which was curated from platypus. None of you study platypus, but it's the same for any other species. So if you do such an enrichment analysis, and this holds both for Gene Ontology and for anatomy, I just want to draw your attention to these important pitfalls. Background is critical. Here I compared my genes to the whole genome in the example, because any gene in the genome could have been found associated with autism or schizophrenia. But for many studies, you could not have found all genes. If I look at genes for which I find positive selection in humans relative to chimpanzee, that's only genes for which I had orthologs between human and chimpanzee, and for which I was able to do an alignment and calculate positive selection. And so my background should be those genes, not the whole genome. This is very important, because otherwise you get very wrong results. Because we test over all the anatomical structures, or for the Gene Ontology over all the GO terms, we have a lot of multiple testing, and you have to pay attention to the multiple testing correction in the interpretation of your results. And as I mentioned, the terms are in a graph, as you've seen, so they are not independent: if you're in the frontal pole, you're in the brain. So it is very important to pay attention to this. I don't mean that you always have to decorrelate, because sometimes the interpretation without decorrelation makes sense, but you have to be aware of it and pay attention to it. It's the biologist who must make the final interpretation, knowing the pitfalls.
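To see why terms in the graph are not independent, here is a toy propagation of expression up a "child to parent" graph. The three-level graph and the gene names are hypothetical, and this is only a sketch of the idea, not the decorrelation algorithms of topGO or the propagation code of Bgee.

```python
# Toy "child -> parents" graph; real ontologies such as Uberon are far larger.
PARENTS = {
    "frontal pole": ["brain"],
    "brain": ["central nervous system"],
    "central nervous system": [],
    "liver": [],
}


def ancestors(term):
    """Return every structure that contains `term`, directly or indirectly."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def propagate(direct):
    """Map each structure to the genes expressed in it or in any substructure."""
    result = {}
    for term, genes in direct.items():
        for target in {term} | ancestors(term):
            result.setdefault(target, set()).update(genes)
    return result


# Hypothetical direct observations: gene names are placeholders.
direct = {"frontal pole": {"g1", "g2"}, "brain": {"g3"}, "liver": {"g4"}}
propagated = propagate(direct)
print(sorted(propagated["brain"]))
```

Every gene counted for the frontal pole is counted again for the brain and for the central nervous system, so an enrichment in the child term automatically pushes up the parent terms. That shared signal is what a decorrelation step tries to account for.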
In TopAnat, we can use the algorithms from topGO for decorrelation: elim, which is very stringent, or weight, which is less stringent. And TopAnat is on our web page and is in the BgeeDB R package. So say you have a gene list and you want to look at it in a cool, easy way. You can just paste that gene list on our web page, like on the many web pages you have for Gene Ontology enrichment. But if you also want to include anatomical enrichment testing in an R script or pipeline, if you are already in R and you already have your gene list there, then you can just call our package and very easily get your anatomical enrichment. Also, all the options which exist in the topGO package, for those who know it, are in the TopAnat version in BgeeDB, whereas on the web page we had to restrict the options a bit for usability. So here I'm going to ask you on the Google Doc. So I'm going to the Google Doc. Oh, there. Can you tell us a gene list that, from your experience, you would like to test in TopAnat? When you have time, you'll go back to TopAnat and think, oh, this is cool, I can test where my genes are expressed, and I have this list of genes from my work. What would be the list of genes? Genes involved in schizophrenia. So I see some gene names. I think if you write all the gene names, that's going to be a bit long. So rather a more general list like, you know, genes involved in this disease or this process, or found by this test for selection, or differentially reacting to this drug, and so on. I don't know if you hear background noise on my side, but we have a campus visit day for kids, so I have a lot of noise behind me. I'm sorry if you hear kids screaming. Home and ambassador disease genes, nice. Inflammation genes, genes for tissue-resident memory T cells. Where do you expect to find them expressed? Let me ask. Jose. Okay, so thank you for your examples. If you have other examples, you can add them. So this is from my experience of playing with TopAnat for several years.
It's most interesting when you have a pathway which is not directly anatomical. For example, home and ambassador disease genes, and you're wondering where they are expressed. Genes first found in liver endothelial cells will probably be expressed in liver endothelial cells. So it can allow you to check that our database works well, but you might not learn so much new biology, right? Inflammation genes is also interesting. So I'll go back to my slides. Thank you very much. You can continue writing if you want. So one thing that Bgee does, and which is quite unique, I think, is that we have homology of anatomy. I don't just think it's unique, I know it's unique, because the other people who said they would do this 10 years ago didn't, because it's way too much work, and we were the only ones crazy enough to do this. So what we do is that our biocurators, mostly our lead biocurator, read the literature on homology of anatomy. You all know homology of genes, but there's also homology at the anatomical level, right? So our arm is homologous to the forelimb of a mouse and to the wing of a chicken, right? That's anatomical homology. And a lot of anatomical homology is not trivial. So our curator reads paleontology literature, evolution of development literature, comparative anatomy and zoology literature, and textbooks, to get the most accurate information on this homology. This is all recorded in our database, and so we give you access to it. The first way we give you access to it is a tool which you will find if you unfold the search menu here: there is a search for anatomical homology. You will get to this page where you will be able to put identifiers from Uberon or the Cell Ontology, you select species, and you can find which of these anatomical structures are homologous between these species, and what the homologous structure is.
So here I put all the anatomical structures from human which have been studied in the GTEx project. And I asked, do they have homologs in the zebrafish, Danio rerio, which is the model organism most amenable to experiments in fish? So the question is, how many of these structures which are studied in GTEx can I compare to studies in zebrafish? And I have 31 here, which have, where's my mouse? I lost my mouse, that's cool. I don't know where my mouse is. Now I got my mouse back. Okay, 31 here, which have homology. And you see that sometimes the homology is quite simple: both zebrafish and human have a hypothalamus and it's homologous. Sometimes it's not so obvious. So the left ventricle of the heart and the primary heart field are homologous. Or the lung, you see the last one here. And again, I lost my mouse, this is an annoying bug, okay, here. Here on the last one, you notice that the lung of the mouse, of human, sorry, is not homologous to the gills with which the fish breathe, but to the swim bladder, which the fish fill with air and use to change their level in the water. This is a well-known fact in comparative anatomy, but maybe most molecular biologists don't know it. So here we give this information. And again, my mouse disappears. This is a bug I already had in Zoom. Okay, so that's one thing we can do with the anatomical homology. And the other thing we can do, which we already showed you a bit, is the expression comparison. So if you click here on the analysis menu, you can unfold expression comparison here, or on the homepage there is a big expression comparison button. Here you can put a list of gene identifiers, either from your own study or directly from the orthologs or paralogs of a gene page. And you're going to get the structures in which these genes are expressed. And you will see only the structures which have homology.
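The idea of keeping only structures with a recorded homolog can be pictured as a simple lookup. The two pairs below are the ones quoted in this example (human to zebrafish). The function name, the table layout and the placeholder structure are hypothetical, and this is a sketch of the idea rather than how the Bgee tool is implemented.

```python
# Homology pairs quoted in the talk (human structure -> zebrafish structure).
HUMAN_TO_ZEBRAFISH = {
    "hypothalamus": "hypothalamus",
    "lung": "swim bladder",
}


def comparable_structures(human_structures):
    """Split structures into those with and without a recorded zebrafish homolog."""
    mapped = {}
    unmapped = []
    for name in human_structures:
        if name in HUMAN_TO_ZEBRAFISH:
            mapped[name] = HUMAN_TO_ZEBRAFISH[name]
        else:
            unmapped.append(name)
    return mapped, unmapped


# "example structure" is a placeholder with no entry in the table above.
mapped, unmapped = comparable_structures(
    ["lung", "hypothalamus", "example structure"]
)
print(mapped)
print(unmapped)
```

Only the structures in `mapped` could be shown side by side across the two species; anything in `unmapped` has no recorded counterpart in this toy table to compare against.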
So these are all brain; this is a brain gene, the first example here. It's a known brain gene, which is found in all animals, all bilaterian animals. And here you have, for example, Felis catus, the cat, and this is FlyBase, for fly, and so on. And you find that overall, everywhere, it is found in the central nervous system with a very high score, and found present in all species, right? So this is a brain-specific gene common to bilaterians, and what we see here is that I have the central nervous system, which is a homologous description of all the central nervous systems of the different animals. And going back to the Google Doc. So this was a rapid presentation of our tools. And I'm just going to ask you again for your opinion: what is your favorite Bgee tool? So with everything we presented this morning, what would be your favorite tool to use? TopAnat, the gene page, the expression comparison, the homology of anatomy, the download files. You can also mention the R package, the BgeeDB R package, which allows you to get the data into R or to run TopAnat in R. So I'm letting you write your favorite Bgee tool in this Google Doc. Again, if you're shy, you don't have to put your name. The tool is more important. So Vinayaka, you're in the table above or in the table below. Otherwise we can copy-paste it. Is the Google Doc shared? Are you seeing it? Frédéric, can you confirm? Yeah, yeah, yeah. Okay. I don't know. Usually there's this kind of green thing around it. I didn't see it. So a lot of love for TopAnat, either because it's super cool, which it is, or because it's the last thing I spoke about. Oh, you spoke about expression comparison. You're right, you're right. But I think TopAnat is very cool. Also, if I may say so, something really cool about TopAnat is that every time we test it on some gene list, it works, and I like to play with it.
So I see some paper with a gene list, or some study on Twitter or something, and I try it in TopAnat. And the results always make sense. This is super reassuring, because TopAnat is built on everything else we've presented to you this morning. So if we made mistakes in curating the datasets and then annotating them, TopAnat wouldn't work. If we made mistakes in the quality control of the data, it wouldn't work. If we made mistakes in calling genes present or absent, it wouldn't work. It only works because everything else is correct. So it's also an excellent quality control for us to check that it works. Could I ask Vinayaka what would be your use case for the anatomical homology tool, if you accept to tell us? Because it's maybe the least cited tool among our tools, and I'm curious to know what you would plan to do with it, because it's directly useful to our expression comparison, but on its own it's interesting to know, yeah. Okay. It's a relationship. Yeah. And so I should emphasize again that the anatomical homology you see is never derived from our expression data. This is a question I get quite often. We do not use our expression data to derive homology, because then it would be circular if we used our expression data to derive homology and then asked, is expression conserved in the homologs? Well, yeah, because that's how we calculated it. Our homology only comes from the literature. In the literature, expression might be used as evidence, but then it's used as evidence by someone external to Bgee who's an expert and who's been considering this, and usually it's specific marker genes, for example. And then, if you are really interested in our anatomical homology, we have a detailed file on our GitHub where you can have all the details of how we inferred it, with every line of evidence and every bibliographic reference.
So maybe the same homology has been reported by three different studies, one based on process, one based on a marker gene and one based on developmental patterns, and we will give these separately. So thank you for your contributions.