In this presentation, I will show how you can download and retrieve those data and use them in your own pipeline. And in the next presentation, Marc is going to show you how to use some tools to obtain more knowledge about this gene expression. Just a second, just checking, yeah, okay.
So first, what you are allowed to do with our data, so the license. The license is CC0. It means that you can basically do anything you want with the data. It is not even mandatory to cite us. Of course, if you make use of our data, we would much appreciate it if you cite us in your publication, but it is not mandatory. The advantage of this license is that we can provide our data in a variety of tools that require the data to be CC0. That is the case, for instance, for Wikidata, which is like the backbone of Wikipedia for producing structured data, and which is notably also used independently of Wikipedia. Wikidata requires all the structured data in there to be CC0, to allow integration of any data.
Okay, so we are going to start, just as a warm-up, with a Wooclap about this, to make sure it is very clear to everyone. Marc, maybe you can launch the Wooclap, "Bgee data is..." So please go to the same Wooclap link. Bgee data is available on request, freely available, or available from NCBI? Just to make sure it is totally clear to everyone. Okay, so it is totally clear to everyone. Exactly, it is freely available, completely free to use, no citation required. You can really do anything you want with those data, with no restriction whatsoever, for commercial use or not, okay?
Okay, so there are various ways to retrieve the data, and we provide the data in different formats. The most basic format is the TSV files, tab-separated files, which we provide in different flavors. Either you are only interested...
So first, we provide one file per species, and there are simple and advanced files. In the simple file, you will basically find only the calls of expression and the conditions. In the advanced file, you will find much more information: the p-values per data type, if you want to use one single data type and not the integration over, for instance, Affymetrix and in situ hybridization as well; the expression score per data type; the number of samples. In the advanced file we provide as much information as we can, but it means that the file is much larger, so it might be more practical to use the simple file. And we provide either an anatomy-only file, where you have only the information at the organ level, or a file with all condition parameters, meaning anatomy, developmental stage, sex and strain as well. But again, that file is larger.
Those files are for retrieving the expression calls that were presented in the previous presentation, but you can also retrieve the processed data for your own analysis, for instance with the TPM values for each gene. It means that we retrieve the FASTQ files and process them, for single-cell data taking into account the protocol used, the position of the barcode and the UMI, as I presented. Then you can directly download these processed TPM values, along with the annotation of each library. That is what we call the processed data. So we have these two types of data: either the calls, or the processed data with the actual expression levels for each sample.
This is an example of the kind of information you can retrieve in these files. You will have the gene ID and the gene name, and the Uberon anatomical entity ID and name. When it is single-cell data, you will also have the cell type ID and cell type name. Then the developmental stage ID and stage name, so blastula, for instance. Then the sex information. In that case, the sex information was not provided.
So it is NA, and then the strain information. And you see that in the simple file you will have the expression call, "present", and the call quality based on the FDR p-value, "gold". You will also have the actual FDR p-value and the expression score. In the processed data file, you will retrieve the read count, the TPM value, also the FPKM value for historical reasons, the actual rank of the gene in that sample, the detection flag, and the p-value from that one single sample, meaning without the integration by propagation and all of that. You really look at one library independently, and you see the p-value of your gene expression in that library. We also re-provide the annotation, and we provide information about the protocol: the sequencing technology used, whether it was paired-end or single-end reads, 3' end, all of that.
So how can you retrieve this data? You can go to the Bgee website, where you will have links like that. If you go to the download section for a specific species, here it is for the processed data, you can have the bulk RNA-seq data, the processed Affymetrix data, and the single-cell data. For now, in the current release, it is the full-length data, but I will show you the annotation of droplet-based data as well.
And we have our R package called BgeeDB, available from Bioconductor. With this package you can ask, for instance, "give me all the libraries that are available in human brain or in mouse brain", and you can retrieve the processed gene expression data. We provide functions so that you can reformat the data into an ExpressionSet object, which is used by many packages for downstream analysis. So you can identify all the relevant data in the conditions you are interested in, and process them as ExpressionSet objects for any downstream analysis. You can refer to the documentation on Bioconductor, and of course you can reach out to us if you have any questions.
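To make the layout of the simple call file described above more concrete, here is a minimal Python sketch that reads such a tab-separated file and keeps only "present" calls of "gold" quality. The column headers, the gene identifiers and the values in `SAMPLE_TSV` are assumptions made up for illustration; check them against the header of the file you actually download.

```python
import csv
import io

# Hypothetical excerpt of a "simple" call file. Headers and values are
# illustrative assumptions, not copied from a real download.
SAMPLE_TSV = (
    "Gene ID\tGene name\tAnatomical entity ID\tAnatomical entity name\t"
    "Expression\tCall quality\tFDR\tExpression score\n"
    "GENE0001\tgeneA\tUBERON:0000955\tbrain\tpresent\tgold quality\t0.001\t87.5\n"
    "GENE0002\tgeneB\tUBERON:0000955\tbrain\tpresent\tsilver quality\t0.03\t40.2\n"
    "GENE0003\tgeneC\tUBERON:0002037\tcerebellum\tabsent\tgold quality\t0.9\t5.1\n"
)


def load_calls(handle):
    """Read a tab-separated call file into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def present_gold_calls(rows):
    """Keep rows whose call is 'present' with 'gold quality' (assumed labels)."""
    return [
        row for row in rows
        if row["Expression"] == "present" and row["Call quality"] == "gold quality"
    ]


rows = load_calls(io.StringIO(SAMPLE_TSV))
for row in present_gold_calls(rows):
    print(row["Gene name"], row["Anatomical entity name"], float(row["FDR"]))
```

For a real file, you would pass an open file handle instead of `io.StringIO`; the advanced files are much larger, so reading them row by row rather than into a list may be preferable.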
And we also have, and this is more for computational biologists, a SPARQL endpoint. For people not very familiar with SPARQL endpoints, it is a way to interrogate a database and to perform federated queries between many databases. Using a SPARQL endpoint, you can ask Bgee "give me the genes expressed in the human brain", and at the same time ask UniProt which of them have the Gene Ontology annotation, I don't know, "neuron development", okay? So thanks to this SPARQL endpoint, you can make sense of Bgee data along with many other data available in other databases. We also provide a web interface at biosoda.expasy.org that allows you to perform this kind of federated query with a nice user interface, if you like. It still requires you to be able to write SPARQL queries, but we provide example queries, so if you are unfamiliar with this language you can find a query that matches your need and just edit it.
And maybe the most user-friendly way to browse the annotations is the web interface. I am going to do a live demo of this, but just to tell you: first, you have the production version of Bgee at bgee.org, where you can browse all the data that have been integrated in the current release of Bgee. And we have a mirror site, annotations.bgee.org, where you can see the data that we have already annotated but not yet fully integrated, with data propagation and calls of expression, into the production release of Bgee. So we have these two websites: one where you can see the information currently integrated, and the other where you can see the information that has already been annotated and is going to be integrated in the next release. For instance, this is where you can find the annotation that we have performed for the Fly Cell Atlas. We have annotated it, but it is not yet in the release of Bgee. It is going to be released soon, but not yet.
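The federated query described above (genes expressed in the brain according to Bgee, restricted by a Gene Ontology annotation held in UniProt) can be sketched as follows in Python. This only builds the query text and the request URL; nothing is sent. The `SERVICE` keyword is standard SPARQL 1.1 federation, but the endpoint URLs, the prefixes and every predicate used here (`genex:isExpressedIn`, `lscr:xrefUniprot`, `up:classifiedWith`) are assumptions to be checked against the example queries provided on the web interface. The species restriction is omitted and the GO term IRI is left as a caller-supplied parameter.

```python
from string import Template
from urllib.parse import urlencode

# Assumed endpoint location; verify before use.
BGEE_SPARQL_ENDPOINT = "https://www.bgee.org/sparql/"

# Assumed vocabulary: every prefix and predicate below is illustrative.
QUERY_TEMPLATE = Template("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT DISTINCT ?gene ?protein WHERE {
  ?gene a orth:Gene ;
        genex:isExpressedIn ?anat ;
        lscr:xrefUniprot ?protein .
  ?anat rdfs:label "brain" .
  # Federated part: this pattern is evaluated by the remote UniProt endpoint.
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein up:classifiedWith <$go_iri> .
  }
}
""")


def build_query(go_iri):
    """Fill the GO term IRI into the (assumed) federated query template."""
    return QUERY_TEMPLATE.substitute(go_iri=go_iri)


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder IRI; replace with the IRI of the GO term you are interested in.
url = build_request_url(BGEE_SPARQL_ENDPOINT, build_query("http://example.org/GO_placeholder"))
print(url[:80])
```

Keeping the query as a template mirrors the advice in the talk: start from a provided example query and edit only the part that matches your need.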
So I am going to show you how to walk through this web interface to make queries and retrieve annotations. Let me share my screen again. I managed to, yeah. Okay, so here I have these two windows. On bgee.org you have this link here, raw data annotations. That is the information in the current release of Bgee, but I am going to use annotations.bgee.org to show you the annotations of the Fly Cell Atlas, okay? On annotations.bgee.org you only have access to the annotations. You don't have access to the gene search and all of that.
In single-cell RNA-seq data, so we have one tab per data type here, we have three types of information. The experiments listed; then the raw data annotations, meaning the list of samples with their annotation, and you see that we have a lot of information, so the table is quite large; and then the processed expression values, meaning the actual TPM value and read count for each gene in each sample. So that is much more data, of course, okay?
If I start from the experiments and just browse the single-cell RNA-seq data, you see that for now we don't have so many experiments, only seven annotated, but each experiment is already hundreds of thousands of cells for droplet-based technology. For the Fly Cell Atlas alone, we have several hundreds of conditions represented. So I am going to focus on that, and first, you can browse the information for one specific experiment by following this link here. On this page you will have a description of the experiment and a link to the source data, here in the Sequence Read Archive, NCBI SRA, so you can go back to the original data. And here you will see the annotation of each cluster in the Fly Cell Atlas because, as I mentioned, for single-cell droplet-based data we do pseudo-bulk. We annotate at the cluster level, and we pool all the cells that are part of the same cluster to perform the present/absent calls.
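The pseudo-bulk step just mentioned, pooling all cells of the same annotated cluster before making calls, can be sketched in Python as below. This is a minimal illustration of the pooling idea only, not the actual pipeline; the cell names, cluster labels and counts are made up.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_to_cluster):
    """Sum per-cell UMI counts over all cells assigned to the same cluster.

    cell_counts: {cell_id: {gene_id: umi_count}}
    cell_to_cluster: {cell_id: cluster_label}
    Returns {cluster_label: {gene_id: pooled_umi_count}}.
    """
    pooled = defaultdict(lambda: defaultdict(int))
    for cell_id, counts in cell_counts.items():
        cluster = cell_to_cluster[cell_id]
        for gene_id, umi_count in counts.items():
            pooled[cluster][gene_id] += umi_count
    # Convert nested defaultdicts to plain dicts for a clean return value.
    return {cluster: dict(genes) for cluster, genes in pooled.items()}


# Toy data: three cells in two clusters (hypothetical labels).
cells = {
    "cell1": {"geneA": 2, "geneB": 1},
    "cell2": {"geneA": 3},
    "cell3": {"geneB": 5},
}
clusters = {"cell1": "neuron", "cell2": "neuron", "cell3": "glial cell"}
print(pseudobulk(cells, clusters))
```

Each pooled cluster then behaves like one library, which is why one row of the annotation table corresponds to a cluster rather than to a single cell.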
So here it means that we have 1,518 conditions representing the Fly Cell Atlas, meaning the different cell types at different developmental stages, for different sexes, for instance, okay? You can see here the cell type annotation for each cluster. There are many cell types represented in this atlas. I am going to change the number of rows I display, so you can see that we have many, many cell types already obtained from this one single experiment. Most of them come from the insect head in adult fly, okay?
We also provide information about the technology that has been used, the sequencer that has been used, whether it is a full-length technology or, in that case, 3' end, and what has been fractionated, I mean whether it was the full cell or single nuclei. The Fly Cell Atlas is single-nuclei data, this is what you see here, and it is paired-end reads. So you find different information about the protocols used, so that you can reprocess the data yourself if you want, but we also provide processed data.
At the end of the table you see a link, "browse results", to see the actual gene expression values in that specific sample, in that specific cluster from the Fly Cell Atlas. In that specific cluster, for each gene, we see the expression level provided in CPM. It is not TPM because you don't have to correct for gene length, since it is only 3' end sequencing, and you have unique molecular identifiers, which let you know how many transcripts you had before PCR amplification. So the unit used is CPM, okay? You have the number of UMIs, the CPM, and we re-provide the annotation again so that it is easier to browse.
So that is really for when you are focused on one single experiment: that is the way to browse the data. But maybe you are not. Maybe you want to find all the information available in the brain, for instance in the human brain.
So for that you have this form here. First you select a species, for instance human. Maybe you are interested in one specific gene: "give me all data containing information for Hox genes", for instance. But here, as an example, I show you how to retrieve all the information available in Bgee for the human brain. You select the term "brain" here, and there is an important check box here. It asks whether you want to retrieve data in all substructures of the brain, meaning also the data annotated at the cerebellum level, the hippocampus level, whatever, you know? So do you want to retrieve only data annotated exactly to the brain, with no more information, or also data from all the substructures of the brain? That is what this check box controls. I would say that in most cases you want to keep this check box checked, okay?
You can also select, for instance, the sex, the ethnicity, the developmental stage. For the developmental stage, we present a simplified view in the form, but you are going to retrieve the exact terms, as I will show you. So here I am going to say I want data at the fully formed stage. Since we need to accommodate many species, the term is not human-specific. So here, for prime adult and post-juvenile, you would probably like to see "adult" in human, but since we are a multi-species database, the labels are more generic, right? So here I am going to ask to retrieve all the bulk RNA-seq data that we have in Bgee in the human brain, including all the substructures, at the adult stage. And then, yeah, let's start with that, okay?
And I submit. Okay, so it is a bit slow here. Of course, live demo. The annotation website is not our main website, so it is a bit slower. Okay, so here you can first see that you retrieve information in the cerebral cortex even though you didn't specify that term in the first place. That is thanks to the ontology, thanks to the propagation.
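What the substructure check box does can be sketched as a traversal of the anatomy ontology: the query term is expanded to all of its descendant terms before libraries are matched. The Python sketch below uses a tiny made-up part-of hierarchy and made-up library annotations; the real ontology (Uberon) is far larger and has several relation types.

```python
from collections import defaultdict, deque

# Toy child -> parent relations, for illustration only.
PART_OF = {
    "cerebral cortex": "brain",
    "cerebellum": "brain",
    "putamen": "brain",
    "frontal cortex": "cerebral cortex",
}


def terms_to_query(term, include_substructures=True):
    """Return the set of terms matched by a query on `term`.

    With include_substructures=False only the exact term is matched;
    otherwise all descendants in the part-of hierarchy are added.
    """
    if not include_substructures:
        return {term}
    children = defaultdict(list)
    for child, parent in PART_OF.items():
        children[parent].append(child)
    matched, queue = {term}, deque([term])
    while queue:  # breadth-first walk down the hierarchy
        for child in children[queue.popleft()]:
            if child not in matched:
                matched.add(child)
                queue.append(child)
    return matched


# Hypothetical library annotations: library ID -> annotated term.
libraries = {"LIB1": "brain", "LIB2": "frontal cortex", "LIB3": "liver"}
wanted = terms_to_query("brain")
print(sorted(lib for lib, term in libraries.items() if term in wanted))
```

With the box unchecked, only `LIB1` would match; with it checked, the library annotated to the frontal cortex is returned as well, which is the behaviour seen with the cerebral cortex results in the demo.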
So sometimes we don't have precise information, the authors only told us it was adult, but sometimes we have the exact age, and since you asked to retrieve the child terms you will retrieve all of that. And here, when it is available, we have the sex information, and the ethnicity in the case of humans. So here it was white Caucasian female, okay? And then again, we tell you whether it is full length, the sequencer, the fragmentation, whether it was paired-end or single-end reads, okay? And again, for each sample you can go and find the expression levels in that specific sample for each gene, okay?
An interesting feature is that you have this form to make the first query, and then you have filters that depend on your query. It is similar to e-commerce websites, where you do a first query and then refine it thanks to filters. So here I asked for all data in the adult brain of human, and then I can see here all the exact terms that were annotated in the adult human brain, okay? You can see that we have data in putamen, for instance, and you can refine your query like that. Within my brain query, I would like, just temporarily, to see all the putamen results, okay? So you can refine your query, and it is interesting because you can see all the values: all the anatomical entities in the human brain, and all the ages that have data in the adult brain, okay? And the sexes: we have both male and female. Sometimes we don't know the ethnicity, so we don't have much here, because a lot of our data come from GTEx, and GTEx does not allow us to publicly release the ethnicity information. This is why you have "confidential restricted data". You can see all the experiments that include some adult brain information, okay? And you can also see the exact library IDs, okay?
And when you browse between experiments or processed expression values, you keep the information that you put in the form in the first place, so you can easily move between these different pages. The website here is slower, probably because you are playing with it, I guess, some of you, and also because annotations.bgee.org is not our production server. It is a server to have a glance at the next annotations being integrated in Bgee, okay?
Okay, so from the Bgee homepage I go to raw data annotations, and I am interested in the dog. Here you get the scientific name but also the common name, so I type "dog". I am interested in the brain, so I just enter "brain", and I want to retrieve all data including the subparts of the brain, so I keep this check box checked here, okay? So I submit and, yeah, the answer is that, if I look at experiments, the number that you can find either here or here is 11: we have 11 experiments that include data in the dog brain, okay?
And then for the samples, it is what we call raw data annotations. We don't call that samples because, again, we accommodate several data types. For in situ hybridization data that would be incorrect, and for single-cell clusters they are not really different samples, so we call that raw data annotations. If I click there, the form has retained my settings, so I don't have to enter them again, and I see that I have 29 bulk RNA-seq libraries. And you can see here the anatomical entity: in most cases it is actually a subpart of the brain. In those cases where it is the brain, it means that we didn't have more precise information provided by the authors, so we annotated at the brain level, or maybe they extracted the whole brain of the dog to perform the analysis, okay? But that is unlikely considering the size of the brain. So 11 is the correct number of experiments, and the number of samples is 29, so I guess 28 was a typo. And 63,000, I guess it is the processed expression values, I don't know.
So I am not sure how you get that number, Mikaelo. If you want to comment, maybe you can speak as well if you want. I thought you were speaking about the gene expression quantification, but no, I don't see how you get that number. It would be interesting to know how you got it, so do not hesitate if you want to comment.
And then how can you know the number of anatomical entities? I mean, it is a bit annoying, but again, using the filters you can see all the precise terms that were used. So here, yeah, there is really no other way than counting yourself. So it is one, two, three, four, five, six, seven, eight, nine, 10, 11. We have 11 terms representing the dog brain. Sometimes it is the brain itself, because we didn't have more precise information, but sometimes it is more precise: cortex, hypothalamus, okay? So we have 11 organs, sorry, 11 entities sampled in dog brains, okay? So 10 was also correct, I guess, because you excluded the brain itself from the number. That was well played.
Okay, and then I asked you to provide the range of expression level in TPM of a specific gene, SRRM4, in the dog cerebellum. So not the brain, okay? Again, you have to go back to the interface and play with it. You need to find the expression level information for that specific gene in that specific tissue, dog cerebellum. Again, we are a bit late, so it will be three, five minutes late. I give you a bit more time to answer this.
Okay, so I start showing you how to do that, since we already have quite a few answers. I go here, to processed expression values, which is the way to get the expression level information. I am not interested in the brain anymore. I could do that in different ways, you know: I could just filter here, for instance selecting cerebellum, or I could edit my entire query if I wanted. So I am going to use the filters here, but actually I forgot that I asked for a specific gene.
So I am going to edit my query here and select SRRM4, brain here, and I am going to use a filter afterwards. So now I have results only for my specific gene, and I am going to refine my query to look at the cerebellum. Okay, so here I really don't have many rows. It means that I had four libraries in the dog cerebellum, so it is pretty easy to see. And then, yeah, the expression level range in TPM goes from 8.15 to 30.08. So that is the correct answer. And I guess, Mikaelo, you provided the read counts, right? They go from 380 to 2095. So yeah, this is what you put. But here I asked for the expression level in TPM. That was the trick, okay? You also have the read count information, which is not that useful on its own because you need to normalize by sequencing depth and gene length.
Okay, and finally, there is also a Wooclap. Marc, if you can launch the Wooclap about the favorite way to get Bgee data, please. Launched. Thank you. So yeah, please go to the Wooclap link. It is for us to understand what would be your preferred way to retrieve this data. It helps us to know what we need to work on and refine.
And thanks, Mikaelo. I think you answered my question about how you got the incorrect answers: you didn't press submit. That's interesting, okay? Because you have results appearing immediately, you didn't realize that your form was not submitted.
Okay, so your favorite ways to get Bgee data would be download from the web, and then our R package. Basically, download from the web means that you have these query tools, and then you can click on the link to the data, which will be on the FTP. The files are all hosted on the FTP, and you can also browse the FTP directly. But okay, so no fans of SPARQL here, and you are more used to R as a programming language. So that's helpful for us. Thank you.
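The point about read counts needing normalization by sequencing depth and gene length can be sketched with the usual TPM and CPM formulas. This is a minimal Python illustration with made-up counts and lengths, not the pipeline used to produce the downloadable values: TPM divides each count by the gene length first and then rescales to one million, while CPM (the unit mentioned earlier for 3' UMI data) rescales the counts only.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize each count, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def cpm(counts):
    """Counts per million: scale by library size only, no gene-length correction."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


# Toy library: geneB has twice the reads of geneA but is also twice as long.
counts = {"geneA": 10, "geneB": 20}
lengths_bp = {"geneA": 1000, "geneB": 2000}

print(tpm(counts, lengths_bp))  # both genes get the same TPM
print(cpm(counts))              # geneB gets twice the CPM of geneA
```

In this toy library the two genes have different raw counts but identical TPM, which is why a range reported in read counts and a range reported in TPM for the same gene and libraries are not comparable.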
Okay, just to finish, I think I forgot to show you something that is quite important, so I go back to annotations.bgee.org. Sorry, I just want to show you, yeah, the single-cell RNA-seq data. For the Fly Cell Atlas, for instance, you have a download link here, and it is an H5AD file, because here you have hundreds of thousands of cells, and TSV is not a convenient format for that much data. If you click on this link, you will retrieve an H5AD file that gives you, for each cell, the complete annotation of cell type, stage, sex and strain, along with the expression values, the CPM values for each gene in each cell, okay? So when you go to the experiment page on Bgee, you always have a link to retrieve the processed data for that experiment. And for single-cell droplet-based data it is going to be in H5AD format, which is in our opinion the best format so far to provide single-cell RNA-seq data.