Overview of Bgee. So in this first part of the course, I'll give you a very broad overview of what is in Bgee and how we collect information there. And then we'll go into more technical details of some specifics of the database. Okay. So first, as we already mentioned, Bgee is the work of a team. So it's not one person. And if you have any questions, we have this email address, which goes to a ticketing system; we reply as rapidly as possible, and it goes to the right person inside the team, so you don't need to guess. We are on Twitter and Mastodon, and we monitor these and also answer questions there. I should have put that we also answer issues on GitHub, but most experimental biologist users don't use that way to ask questions.
This is the team last summer. There's been some change, so I put a more recent photo, although Chris Mungall here is not actually from the team. He's a very close collaborator and good friend of ours, but this is otherwise the team at the recent biocuration meeting in Padua, Italy.
And so what does Bgee do? Our goal with Bgee really is to help biologist users and to understand gene expression. Gene expression is a complex trait. There are many aspects to gene expression, especially in multicellular organisms like animals. And so what we want to do is to make it easier for people to understand and use gene expression, and to help biologists, whether experimental or computational, have the easiest access to this information. So everything we do is trying to fulfill these two goals: can we understand gene expression better, and can we make it useful to biologists?
And so now I will do a quick demo, which is always a bit dangerous, but let's try this, just because the easiest way to show you what we do in Bgee is to go to Bgee. So okay, Zoom has lost my part. Okay, so here you see the homepage of Bgee, which you can access by typing www.bgee.org.
And here what we see first is that we have a pretty big list of species, so 52 species, as Frédéric said. In Bgee, our goal is to integrate gene expression from animals. So we have only animal species: no plants, no yeast, no bacteria. And in principle, we are open to integrating any animal species, although of course there are always some trade-offs between what data is available, what time we have and what priorities we have. But you see here the five classic animal model organisms, human, mouse, zebrafish, Drosophila melanogaster and C. elegans, then various animals of agronomic interest, and then various animals which represent different aspects of biodiversity.
And if you go to our homepage, the first thing you can do is search for your favorite gene. So if I look for insulin, I get here all the results for insulin in different species, and if I click on one, I see the expression of that gene. And that's what we do, we provide you gene expression, right? We also have on our homepage various tools, which we will present to you over the day: to compare expression between species, to access the annotations, to explore expression calls, and to analyze enrichment of gene expression. And we have here on the page all the information that you might also need in various ways: the more specialized access, the list of species with information on which genome version we used and which source of data, and you can download our data. We have some more specialized resources such as R packages and a SPARQL endpoint, and we have information on the publications, the videos of this course, and so on. So all this information is there.
Now I go back to my slides, and I will start the first of my Wooclaps. So on the page of the course, you have a link to... or maybe I'll share this so you see what I'm doing. So going back to this. Here, if you're on the page of the document, you see that there's a link to activities.
If you click on it, you get to a Google Doc, and here there's a Wooclap link, and this Wooclap link will actually be the same during the whole course. Wooclap is a tool to provide interactions during courses. Okay. And so I'm going to launch the first Wooclap, and it's just asking you to answer a very simple question: whether we have information from these different species or data sources. So you have one minute to please vote on this. You follow the link and you vote.
Okay. So thank you for the votes. You were all correct that there is fly and healthy human data. There is also platypus. This is important: we don't restrict ourselves to the classic model organisms. When there is sufficient data in a non-model species which can be of interest for some evolutionary questions or some other biological questions, even biomedical ones, as some fish are models for aging or whatever, then we take it, inasmuch as there is a high-quality reference genome and sufficient expression data. So we have platypus, we have weird fishes, we have amphioxus and various weird species. Also, we're in the Department of Ecology and Evolution, so that's interesting to us. We do not have yeast and we do not have Arabidopsis: Bgee is restricted to animals, and this is not going to change. We are animal-centric.
And this was a bit of a trap, because I did not yet speak about this, but we curate only healthy wild-type data. So for humans, for example, we do not have cancer data. If you look into the databases of gene expression where all the data is, like GEO and ArrayExpress, you'll have a lot of cancer data for humans. It's most of the data, actually, and we do not keep that data, because what we want to show you is how gene expression is in a standard normal state. So I'll go back to my slides. And so what we do in Bgee is that... sorry, every time I change windows I have to reset things.
Okay, so we take gene expression from a diversity of sources: ESTs, which is a very old-fashioned way of getting expression, but we still have it; Affymetrix microarrays, which were the main way to get genome-wide expression 10 years ago; bulk RNA-seq, which has been the main way in the last 10 years; and increasingly single-cell RNA-seq, which you will hear a lot about this morning. And also in situ hybridization data. This is very precise data, usually one experiment at a time, typically in developmental biology but also sometimes in adults, especially in mouse. People take a transparent embryo or do a cut through an organism, hybridize a marker for one gene, and see exactly where the gene is expressed. This is reported in text, and then curated and verified by the model organism databases, such as MGI for the mouse, ZFIN for the zebrafish, FlyBase for Drosophila and so on. We do not do this curation, that's not our job, but we recover it in agreement with them, and again we keep only the healthy wild type.
And all these data we have to process. So we do quality control and only take those which are healthy wild type. We process the data so that Affymetrix, bulk RNA-seq and single-cell RNA-seq are all processed in a consistent manner. We standardize and map these to ontologies; all these terms will be explained over the morning. And we integrate all this together in Bgee. I think this is really an important point: as far as we know, we're the only database where these different sources of information on gene expression are integrated together and not presented separately. You don't have to choose, when you go to the Bgee page or use the Bgee tools, whether you want to see data from microarray, from bulk RNA-seq, from single-cell RNA-seq or from in situ hybridization. We're going to give you all the information we can, together.
Now, I've spoken here about quality control, filtering, mapping. All this is part of biocuration, and Bgee is a curated database, and
it's very important to understand this concept: what distinguishes a curated database from a non-curated database. Uncurated databases are very common, and they have advantages and drawbacks, of course. A typical example of an uncurated database, which all of you are probably familiar with, is what used to be called GenBank and is now called NCBI Nucleotide, where all the DNA sequences which were ever made public are stored. Because they're all there, even if 20 groups sequenced the same gene, there's a huge redundancy. Because it only depends on the information that people put in, without additional verification and without additional organization, there's a low organization of the knowledge. So you have whatever information someone put in when they submitted it, and nothing more. It has an added value, of course: it can be complete and up to date, because it's automatically generated. So GenBank or another nucleotide database is updated every 24 hours.
A curated database is one where you have curators, human beings who are experts, who verify all the information that is put into the database and organize it. A typical example of a curated database is the Swiss-Prot part of UniProt, which probably most of you are familiar with. When you go there, the data are verified. There is minimal redundancy: if different groups sequence the same protein or analyze the same protein, it comes together into one entry. Annotations are standardized, so it's always the same term used for the same thing. And the added value is that the knowledge is organized and reliable, so it's much easier for you to recover it and you can trust it much more.
And this is just a quote from the biocuration society: biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets. I think it is really important that it's the translation and integration of
information. So the biocurators are doing this work for you: finding what the relevant information is, finding how it can be expressed in a way which can be manipulated computationally, and integrating it so that you have access to all this together.
And so what is annotation? Annotation is associating a biological object to a feature... sorry, I have a window from Zoom which is blocking my slides, which is a bit annoying. I have to change this and I cannot. There, sorry. Okay: associating a biological object to a feature, based on evidence. Both of these parts are important. Associating a biological object to a feature is, for example, associating a gene to a Gene Ontology term. And based on evidence means that we don't just associate them randomly: we have to know why we're associating them, and we're going to document why we associate them.
And you can have this association, this annotation, without curation. If you give every gene in a new genome Gene Ontology terms according to the first BLAST hit to another genome, that's uncurated annotation. You do not verify anything; you trust the automatic system. Whereas if you read papers which describe functional assays on a gene, and from there you say, okay, according to these assays the function is this or that, and you find the right Gene Ontology terms corresponding to this function from the assay, and you annotate them to the gene, that's curated annotation. That's what's done, for example, in Swiss-Prot.
And Bgee is a curated database. So all expression data in Bgee is verified: we only take wild-type healthy expression data, and any data which does not fulfill those criteria is excluded. And then every expression data set is annotated by manual curation, so it's curated annotation. We associate the expression data to the anatomical term it comes from, the age of the individual, the species, and all this. We have curators whose job it is, who are professionals, and read all the metadata which is
submitted to GEO, ArrayExpress and so on, but also go to the paper, to the supplementary material of the paper, and in cases where it's unclear, contact the authors. We need to be certain of what we do. And we follow standards for the annotation: which terms we use, how we control the confidence we have. And those standards are also curated, so as to have the highest possible standards.
And so now, if we go to the Google Doc (you have the link to the Google Doc in the activities), I'm going to ask you to each write two examples of curated databases you know of and of uncurated databases you know of. There's a column for participant name, but if you feel uncomfortable putting your name, you can just leave it empty. Someone asked where the Google Doc is: there's a link from the Word document of the course. I can show you this. So here's the Word document, and if you click on this link, you arrive at this document.
So we see here a diversity of types of databases. Of course sequence ones, but also various others: OMIM, which is phenotypes; Fly Cell Atlas and FlyBase, where Fly Cell Atlas is gene expression, single cell, and FlyBase is, like ZFIN, phenotypes, gene expression, many things about an organism; KEGG, which is pathways; the GWAS Catalog, which is associations between genes and phenotypes, so again phenotypes. I have to mention that MGI is curated in the same way; it's a model organism database like FlyBase or ZFIN. GEO, for example, is the equivalent of NCBI Nucleotide but for gene expression data: people put their microarray or RNA-seq data there, and it is not verified. Okay, so we see a diversity here. I will go back to my slides.
Okay, so as I told you, we only curate wild-type healthy gene expression. And why do we do this? Because wild-type healthy gene expression is informative on what we call the causal function of a gene. So if you ask what a gene does, take the example of BRCA1: when it's mutated, you
increase your chances of breast cancer, but this is not the function of the gene. The function of the gene is not that, any more than the function of your tire is to stop your car when it deflates, when it's punctured. The function of BRCA1 is what it does in wild-type healthy individuals, and that is what we want to capture. And it's evolutionarily relevant, because if you're comparing gene expression between human and zebrafish, you want to compare a healthy wild-type human to a healthy wild-type zebrafish, not a sick human to a mutant knockout zebrafish. And it provides a reference for biomedical studies. We started from an evolutionary biology question, but actually we find that it is very useful for biomedical studies, because often, if you study gene expression in disease, in treatment and so on, you want to know what the reference is: how does this gene behave in a wild-type healthy individual? And so this provides this reference.
To give you an example of what it means to do this curation, take the GTEx data set, which was, until the Human Cell Atlas, the largest data set for gene expression in humans. It's used by many people as a reference relative to diseases or other conditions. They say in their documentation that samples were collected from 54 non-diseased tissue sites across nearly 1,000 individuals. But if you look at their documentation, they also say: we have not excluded specific donors from specific tissues based on their cause of death or medical history. So the curators of Bgee went over all the pathology reports and all the annotation of the GTEx data, and what we found is that in many cases these were not healthy individuals. Among the subjects which were included in GTEx, we have 235 which indeed are healthy, but we have 179 for whom we rejected all data, because for example they died of drug abuse or they died of cancer, and we have 158 for whom we rejected some tissues and kept others. So if you look at the samples here, we kept about half, and there are
cases where we discarded the whole subject and cases where we discarded just the sample. For example, if you have someone who died with dementia, we would take the gene expression from the liver and the muscle, but not from the brain. If someone had a liver disease, we would take the brain but not the liver, and so on. And so in fact you see that in the end we only kept half of the GTEx samples. So when you use the original GTEx, it is not curated, it is not wild-type healthy, and in Bgee we do this job so that you can be sure it is healthy.
In total we reviewed 12,000 RNA-seq libraries and only kept almost 5,000. And then we annotated this to reference standard ontologies, with the anatomy, the age, the sex and the ethnicity. But because GTEx is from humans, and is partly confidential data, we only show publicly in Bgee the anatomical entity and a broad age range, so the decade in which they are, and not the ethnicity. And you can recover all this information following such a link, where you have all the annotations we put on GTEx.
And then we standardize the data. So what does that mean? For example, metadata about the library... sorry, I have to change my mouse again to be able to do something, this is a bit annoying. There. So we standardize the metadata about library construction. For example, if you take RNA-seq, there can be strand selection (forward, reverse or unstranded), library types, fragmentation. All these features are going to change how you can analyze the data and what you can do with it, and we standardize this when we capture the data.
And then we annotate the diversity of protocols of bulk RNA-seq. We have a classification of all the bulk protocols that we have encountered in annotating and curating data for Bgee, and we have 40 protocols, which are classified according to what we can do with them. Can we call genes present if they're there, but also absent if they're not there? Or, if for example you specifically looked only for coding genes and you don't find a non-coding
gene, that doesn't mean it was absent from the sample, only that you didn't recover it in the experiment. If you have only the 3' ends of the RNA, then you don't need to normalize by RNA length, because you do not sequence the whole RNA length, and so on. And so we have to adapt our protocol, and when you recover the final processed data, we've already taken this into account.
And we do the same for single-cell RNA-seq, which I must say is a lot of work, because single cell is much newer than bulk, and yet we already have 32 protocols, and almost every two weeks we have a new protocol that we have to classify. There are different ways to isolate the cells, to isolate the RNA, to sequence the RNA, to barcode it or not, to identify the cells. Right now we classify all those protocols, but we only keep in Bgee four protocols: Smart-seq, Smart-seq2, 10x Chromium v2 and v3. And it actually makes eight, because for each of these we can accept single nuclei or single cells.
And if we go back to the Google Doc now, you have a question. So here I put you a little quiz, asking you to tell me what the healthy wild-type data are from an experiment. You can read the description, and this is the kind of job our curators do. We have: circadian time course of liver mRNA profiles of WT, Bmal1 liver KO, Rev-erb alpha/beta liver double KO, and Cry1/Cry2 double KO after 12 weeks of high-fat diet feeding, ad libitum or time-restricted feeding. What would you keep here as healthy wild type?
So we had a question about the eQTL analyses. The answer for this will also depend on your personal access to the sensitive part of the data, because as far as I know, to run the eQTL analysis you need to be able to associate the gene expression with the SNPs, which can only be done if you have full access to the individual information.
But okay, if I look at what you have been writing on the Google Doc, there's some diversity. Some think that none of this is healthy wild type. Several of you kept
the WT annotation, which indeed means wild type. Some people also kept some of the knockouts. So no, none of the knockouts would be kept in Bgee, because they are not wild type. We keep neither the Cry1/Cry2 double knockout nor the Bmal1 liver knockout. Those who think that nothing here is wild type have noted that there is a treatment. Indeed, there is high-fat diet feeding, ad libitum or time-restricted feeding, and that is the kind of question we have to address all the time when we curate data for Bgee: what is in fact wild-type healthy? If I give some mice more food and some mice less food, is this still a healthy environment? This is a difficult question. In this case, we consider that any feeding regime which could happen in nature will be considered within the normal biological variation. So we're going to accept the high-fat diet and the time-restricted feeding, and we are only going to accept the wild type.
And this is a circadian time course, so there will be samples from different times of the day, where obviously gene expression will vary. But again, this is part of the natural variation of gene expression in the wild-type healthy individual. So we'll accept the different time points, we'll accept the two diets, but we will not accept the knockouts, single or double.
These are the kinds of choices we have to make. Our aim is really to give you gene expression which represents what you could find in nature, what has in fact been selected by natural selection over millions of years in these species. Mice over the last millions of years have lived by day and by night, and have had sometimes more food, sometimes less food, so this is within their natural variation. But within their natural variation they did not get a liver-specific knockout of Bmal1. Okay, so that's our logic in these annotations.
And now, that was the first part of the biocuration, which is to choose the data we're going
to use. But then we have to annotate this data to make it useful to you, and for this we annotate it to ontologies. So what is an ontology? An ontology in bioinformatics starts with a list of terms: you agree which words you will use for what. For example, you will say that for brain you will use "brain" or "central nervous system", but you choose one term. If you only have a list of terms, you have a controlled vocabulary, which is quite useful already; for example, the enzyme nomenclature is a controlled vocabulary. Then you have definitions of the terms. If you have a list of terms with definitions, this is a dictionary, not only in our life outside biology but also in bioinformatics, and generally in the management of knowledge and in computer science. And now, if we have relations between the terms, then we have an ontology.
So what are relations between the terms? I'm not only saying I'm going to use the word cerebellum, I'm not only giving it a definition, but I'm saying the cerebellum is part of the brain, and I have a relation between them. There are various types of relations, and these allow automatic reasoning. What is automatic reasoning? Well, it's simply the fact that if I say I want all genes expressed in the brain, and there is a gene which has an annotation that it is expressed in cerebellum, then I can recover that gene, because automatically I can know that anything expressed in cerebellum is expressed in brain, since the cerebellum is part of the brain.
The most well-known ontology, which certainly you all know, is the Gene Ontology. The Gene Ontology has these different parts, right? It has specific terms, they have definitions, and they have relations between them. So cell migration in hindbrain is a cell migration, and cell migration in hindbrain is part of hindbrain development. You have different types of relations, and cell migration in hindbrain in the end is part of biological process, right? And these are used in all databases which want to annotate the function of genes. For example, here you have
an entry from UniProt Swiss-Prot for a homeobox gene in zebrafish, and you have here these GO terms, with here cell migration in hindbrain, which was here.
And we use other ontologies, ones which allow us to describe where and when a gene is expressed. The main feature of where and when a gene is expressed in an animal is anatomy. If I tell you this gene is expressed in mouse, or in zebrafish, or in fly, you are going to want to know what that means: where is it expressed, in which organ? So we have here the ontology called Uberon, which is an ontology which describes the anatomy of any animal. Here you have, for example, the liver. And it includes within the ontology the fact that some organs, or some terms, you will only find in some groups of species. So here the liver is only in taxon Vertebrata: the liver is only in vertebrates, within the same ontology where you also have the terms for, say, fly. But specifically here we can do automated reasoning to say I should only recover liver if I have a vertebrate. And then you have the same part-of relations you had for, oops, sorry, for the Gene Ontology, and you also have other relations, such as develops from, contributes to morphology of, part of, and so on.
So we have all this description, which is quite complex, of the anatomy of an animal, and this we use to annotate the gene expression we find to terms as specific as possible in the anatomical ontology, in Uberon. As specific as possible means that if someone tells us this is a given lobe of the liver, we will annotate to that specific lobe of the liver. But if they say I took this from the head, without saying whether it's the brain, the eyes, whatever, we're going to call it the head. And we have these less specific terms, like here you see abdomen, and these more specific terms, like liver here or hepatobiliary system.
So here is a little Wooclap; we'll see if you've been following this. So you go to Wooclap, and I'm going to start a new one; there
are actually two Wooclaps here, sorry, this is the wrong one. So, quitting. Yeah, so about ontologies, about anatomy annotation to an ontology: can you tell me, in an experiment which has RNA-seq from six organs across 10 species of mammals and birds, how many different Uberon identifiers, so specific terms in the Uberon ontology of anatomy, should be used for the annotation of this gene expression? The Wooclap link is the same as previously. I have one vote. It's a very close race between two options. All options have been chosen now.
So a majority voted for 60, which is six times 10, and this is not the correct answer. Because we use one ontology which covers the anatomy of any animal species, and these are the same six organs, the homologous six organs, across mammals and birds, we can use the same Uberon identifier to recover the liver of a human, a mouse, a chimpanzee or a chicken. So we only need six for this. This way it is standardized, and you can easily recover the gene expression from the same organs in different species.
Okay, and so, rapidly: what makes these ontologies useful is that they're used by many resources, so we have one common standard. You know the Gene Ontology is used by the main databases, but Uberon is also used: it covers all animal species and it is used in all the big projects, and many small projects, annotating expression or other features of biology which are relevant to anatomy. For example, GTEx annotates to Uberon, the Human Cell Atlas annotates to Uberon, the Fly Cell Atlas annotates to Uberon. It covers a large domain of knowledge, so all of animal anatomy. And what makes an ontology useful is that there are tools leveraging it. I think we are the main provider of tools for Uberon, and you will see this later today.
We also increasingly have to use the Cell Ontology, because we have single-cell data, and so just describing organs and tissues is not sufficient. The Cell Ontology describes cell types, and it is again a similar structure as what
you saw for the Gene Ontology or Uberon: you have terms with definitions and relations. So a Kupffer cell is a tissue-resident macrophage, it is part of [unclear], and so on.
Now, there are some challenges with this. Organs have been well known for centuries, and anatomy is pretty well described. But single-cell RNA-seq is making us learn a lot about cell types, and so we often get new cell types, or specific cell types, described in single-cell RNA-seq which are not yet described in the Cell Ontology. For example, from the Fly Cell Atlas we get cell types of T neurons, T4a, T4b, T5a and T5b, but these types are not in the Cell Ontology. So we had to annotate them all to a parent term, cholinergic neuron, which is a type of neuron which includes all these T4a, T4b, T5a and T5b, but doesn't include other T neurons. So we keep it as specific as possible, we don't just say neuron, but we cannot have the granularity, the level of detail, of the original experiment, because it is not yet in the Cell Ontology. That's why we're working, together with other groups who work on the organization of information for single-cell gene expression, to improve this. And we constantly require new cell type terms, simply because new cell types are discovered. For example, we have the distinction between octopaminergic neuron and tyraminergic neuron, which is not yet in the Cell Ontology either.
We are curating the Fly Cell Atlas right now, and we found some errors which we had to correct, a bit like for GTEx. For example, the male reproductive system was reported with two different identifiers; it should have only one, it should be consistent. And sometimes there are obviously errors, because we find an ovary cell annotated in a male, or testis in a female, so clearly there was some error in reporting here. And sometimes we find that where there should be a cell type, they report an organ or a tissue, and [unclear] is obviously not a cell
type. So we have to go back over this and re-curate it systematically. We've curated a lot of the Fly Cell Atlas, all the public data, in Bgee, and it is available through this link. We have curated the 61 libraries, 1,500 conditions, and we have obtained 27 million processed expression values.
And overall, what do we annotate to in Bgee? We annotate to anatomical entity and cell type, which I just described. But we also have a separate ontology for development and life stages, so all the embryonic development, but also aging and post-embryonic development such as metamorphosis. And in species where there is sex: male, female, undefined, and sometimes hermaphrodite, like in C. elegans. And strains or populations, if we have this information. And so one piece of expression information in Bgee is going to be a gene in a species, with all this information on anatomy, cell type, life stage, sex and strain.
This is an example of annotation, where from the paper, or from the metadata in the public database, you have: poly-A RNA-seq of embryonic day 12.5 mouse gut from three wild-type males and three wild-type females. We structure this: you have a specific term, intestine, with the Uberon ID; you have a stage of the mouse which corresponds to day 12.5, with an identifier; you have the strain, which was not written here, but if you go back to the paper you can find it; and so on.
And so this is the end of the overview part. We put together curation of data and integration, which you will get more information about later and which allow us to compare between species, and all this comes together into Bgee. And so I see one question in the... so Frédéric wanted to add some specificity to what I said?
No, yeah, it was just a small mistake. Those different neuron types you mentioned: it's not that they were not in the Cell Ontology, it's that from the single-cell data they could not clearly identify which of these four neuron types it was. So they provided the four annotations, but in Bgee that doesn't work: we cannot have
uncertainty about the annotations, so we map to the common parent describing these four neuron types.
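
To make the idea of mapping several candidate annotations to their common parent a bit more concrete, here is a minimal Python sketch. It is not Bgee's actual pipeline: the `PARENTS` table is a toy hierarchy invented for illustration (the real Cell Ontology is much larger and has several relation types), and the function names are hypothetical. It only shows the principle: collect the ancestors of every candidate term, intersect them, and keep the most specific term they share.

```python
# Toy ontology: each term maps to its direct parents.
# ASSUMPTION: this hierarchy and its term names are illustrative only,
# not the real Cell Ontology content.
PARENTS = {
    "T4a neuron": ["cholinergic neuron"],
    "T4b neuron": ["cholinergic neuron"],
    "T5a neuron": ["cholinergic neuron"],
    "T5b neuron": ["cholinergic neuron"],
    "cholinergic neuron": ["neuron"],
    "neuron": ["cell"],
    "cell": [],
}


def ancestors(term):
    """Return the term itself plus every term reachable through its parents."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Length of the longest path from the term up to a root (root = 0)."""
    parents = PARENTS.get(term, [])
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def most_specific_common_parent(terms):
    """Deepest term shared by the ancestor sets of all the given terms."""
    common = set.intersection(*(ancestors(t) for t in terms))
    # sorted() only makes the choice deterministic if two terms tie on depth
    return max(sorted(common), key=depth)


candidates = ["T4a neuron", "T4b neuron", "T5a neuron", "T5b neuron"]
print(most_specific_common_parent(candidates))
```

The same ancestor lookup is what makes the earlier brain and cerebellum example work: a query for a general term can recover annotations made to any of its more specific descendants.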