Good morning, everyone. I'm going to present the Bgee database to you this morning. First, I would like to emphasize that Bgee is not my work alone; obviously, it's the work of a team. This is the team as of a year ago. Of course, there's always a bit of turnover, but it's pretty stable. The people with red circles will be presenting today: Frédéric, in the middle of the red circles, will be giving the other half of the lectures this morning, and Sarah and Julia here will be giving the hands-on session this afternoon, which they also had to prepare. If you have any questions in the future, we have an email address which we monitor, so don't hesitate to write to us, whether you need new features, don't understand something, have a bug to report, or whatever.

So what is Bgee? It's a database of gene expression. Our aim is to understand gene expression and to help biologist users, so we really try to be at the interface of these two missions, to make gene expression data useful to people. For this, we work both on improving the data and on making tools which make it useful. I'm going to go through a number of concepts which are important to Bgee, and to databases in general, to understand what we have and how we do it.

The first important concept is biocuration. There are two main types of databases in the life sciences: uncurated databases and curated databases. A typical uncurated database is GenBank. In GenBank, anyone who submits sequences, the sequences are there. So the advantage is that it is complete: you know that all the sequences are indeed there. But there is a lot of redundancy: several people can submit the same sequence, and they do. There is no organization. There's no guarantee that people put in all the information you need, or even that the information they put in is correct or written in the right way for you to find it. So the main added value is that it's complete and very up to date, but it is not very organized and not very useful for
recovering knowledge.

The opposite of this is curated databases. The typical example is the Swiss-Prot component of the UniProtKB database. In curated databases, the data is verified, redundancy is removed, and annotations are standardized. This is done manually by experts, called curators, who verify the knowledge and structure it in the database. So the main added value of a curated database is that the knowledge is organized, reliable, and easy to recover. It is not necessarily quite up to date: in Swiss-Prot you will not have every protein sequence, but the ones you have will be non-redundant and have reliable annotation. Underneath, I put the definition of biocuration from the International Society for Biocuration, of which Frédéric is on the executive committee.

Bgee is a curated database, and for this we use annotation. Annotation is associating a biological object to a feature, based on evidence. That sounds very abstract, so for example, if you have a gene and it has a Gene Ontology term associated with it, this association is an annotation. You can annotate automatically, without curation. For example, you say: every gene in a new genome I sequenced, I'll give it all the GO terms, all the Gene Ontology terms, from the first BLAST hit I get in UniProt. Then it's not curated; you just transferred automatically. Or you can say: I read a paper about this gene; in this paper they describe a function with experimental evidence, and thus, based on this evidence, I will give this Gene Ontology term to this gene. That is then curated annotation, and in Bgee we do curated annotation.

So Bgee is a curated database. First, all the expression data that we include in the database is verified. We have criteria, which Frédéric and I will explain later, and data which does not fulfill these criteria is excluded.
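As a minimal sketch of the definition above (an annotation associates a biological object with a feature, based on evidence), the automatic and the curated route could be written as follows in Python. All names here (`Annotation`, `transfer_from_best_hit`, `curate_from_paper`, and the gene and reference identifiers) are hypothetical illustrations, not code from the database or from UniProt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    obj: str        # biological object, e.g. a gene
    feature: str    # feature it is associated with, e.g. a Gene Ontology term
    evidence: str   # what the association is based on
    curated: bool   # True only if an expert checked the evidence


def transfer_from_best_hit(new_gene, best_hit_terms):
    """Automatic route: copy every term of the best hit, with no expert check."""
    return [
        Annotation(new_gene, term, "transferred from first BLAST hit", curated=False)
        for term in best_hit_terms
    ]


def curate_from_paper(gene, term, reference):
    """Curated route: one term, backed by experimental evidence a curator read."""
    return Annotation(gene, term, f"experimental evidence in {reference}", curated=True)


# Hypothetical identifiers, for illustration only.
automatic = transfer_from_best_hit("new_gene_1", ["cell migration", "brain development"])
manual = curate_from_paper("gene_x", "cell migration in hindbrain", "hypothetical_paper_1")
```

The only difference between the two routes in this sketch is where the evidence comes from and whether a person verified it, which is the distinction the talk draws.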
So we only include data which has been verified a priori by our curators. Then every data set that we include is annotated to the right annotation terms by manual curation: reading the description in GEO or ArrayExpress and, when needed, when it's unclear, reading the paper, reading the supplementary material, checking the website, sometimes even contacting the authors. These annotations follow standards which are themselves curated, and I'll explain some of these standards later, so that it's work which is reproducible and reliable.

The main way we annotate is to ontologies. I've already spoken about the Gene Ontology; I will speak a bit more about ontologies now. What is an ontology? An ontology is first a list of terms, so that we agree on which words we use. Just a list of terms is a controlled vocabulary, which is already very important, so that we don't use different words for the same thing. We don't use, for example, "arm" or "upper limb" or something else depending on who says it, but we always use the same words, so that we can then easily recover the information attached to these terms. In addition, we define these terms so that we can know what they mean, reason on them, and use them. When we have definitions of terms, we have a dictionary. What makes an ontology is that we also have relations between the terms. For example, if I say that the hand is part of the arm, or that the arm is attached to the top of the body, these are relations between terms. If I say the brain is in the head, it is a relation between terms.
If I say the brain is part of the nervous system, it is a relation between terms. When we have these relations, we can reason on them. We can say that any gene which is expressed in the brain is also necessarily expressed in the nervous system, since the brain is part of the nervous system. You can see that we can have different types of relations, because the relation between the brain and the nervous system is not the same as the relation between the brain and the head: the brain is in the head, but it's part of the nervous system. We can have more complex relations, such as some structures develop from others in embryogenesis, some structures interact with other structures, etc. So we have these different types of relations, and together the terms, the definitions, and the relations form an ontology, which allows us to annotate our data in a reliable and reproducible manner and to reason on these annotations.

The typical example of an ontology that everyone knows in the life sciences is the Gene Ontology. Here, I hope you see my mouse, we see a term, "cell migration in hindbrain", and we see that it has relations. The "is a" here are types of relations: cell migration in hindbrain is a hindbrain development, is a cell migration. But we also have other relations: for example, brain development is part of central nervous system development, and it is an animal organ development. We see that an ontology is a graph, not a tree, because one term can have several parents and one term can have several children. These are used in annotation.
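The reasoning described above (a gene expressed in the brain is also expressed in the nervous system, because the brain is part of the nervous system) can be sketched as a walk over a small graph of typed relations. This is an illustrative toy, not Uberon or Gene Ontology tooling: the relation names, the three example triples, and the choice to follow only `is_a` and `part_of` while ignoring `located_in` are assumptions of this sketch.

```python
from collections import defaultdict

# (child term, relation type, parent term); a toy fragment, not real ontology content.
RELATIONS = [
    ("hindbrain", "part_of", "brain"),
    ("brain", "part_of", "nervous system"),
    ("brain", "located_in", "head"),
]


def ancestors(term, relations, follow=("is_a", "part_of")):
    """Return every term reachable from `term` through the followed relation types."""
    parents = defaultdict(set)
    for child, relation, parent in relations:
        if relation in follow:
            parents[child].add(parent)
    found, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(expressed_in, relations):
    """Extend the structures a gene is expressed in to all of their ancestors."""
    result = set(expressed_in)
    for term in expressed_in:
        result |= ancestors(term, relations)
    return result


calls = propagate({"hindbrain"}, RELATIONS)
```

Because a term may have several parents, the walk keeps a set of visited terms rather than following a single path, which is what treating the ontology as a graph rather than a tree amounts to in practice.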
For example, I spoke about Swiss-Prot, where annotations are manually curated. Here I have homeobox protein Hox-B1a, with manually curated annotations to Gene Ontology terms, including this cell migration in hindbrain, which is here.

In Bgee, the main ontology we use does not describe gene function but anatomy, because we describe gene expression in animals, and the main feature of gene expression in animals is that it is expressed in different parts of the anatomy. So we use Uberon, which is an ontology which was made by first aligning and merging different species-specific anatomy ontologies from model organisms, so anatomy ontologies describing mouse anatomy, zebrafish anatomy, Drosophila melanogaster anatomy, etc., and then enriching this with additional terms and additional relations to describe as much as possible all animals.

Now, of course, this is always a work in progress. Whether it's the Gene Ontology or Uberon, every ontology captures our knowledge at a given point in time, and as we learn more, or just as we have time to add things, it improves. So whenever you use an ontology, be careful to use a recent version, and do not be surprised that it improves with our knowledge, because anything which computationally captures our state of knowledge at a given point is going to change as our knowledge changes.

Now what makes an ontology useful? Sorry, I think I heard that there was a question, but I don't know how to see it. Yeah, okay, I'll continue for now. What makes an ontology useful, what makes the Gene Ontology useful, is that it is used by many resources, so you can see the same terms used in the same way. Uberon is made so that it's common to all animal species, so it can be used in different applications for different species, whether it's human health, farm animals, or model organisms such as fly, nematode, or zebrafish, and so on. For example, it's used in large projects such as GTEx and FANTOM.
And now the Human Cell Atlas. An ontology is more useful if it covers a large domain of knowledge, such as gene function, or animal anatomy, which is relevant to all animals and relevant to a lot of functions. And an ontology is more useful if there are many tools leveraging it. For example, with the Gene Ontology, many people use Gene Ontology enrichment tools, which make the Gene Ontology directly useful. So I will present today tools which we develop which use Uberon. Of course, other people can also leverage Uberon. It's newer than the Gene Ontology, so it's a work in progress.

We annotate to four types of conditions, all of this manually curated in Bgee. The first is anatomy, which has the biggest number of terms and is the biggest work of annotation, and which is what I just showed you: we annotate to Uberon. We can be very precise, because Uberon goes down to cell types. You can be very general and say this is expressed in the body or in the head, or you can be very precise: this is expressed in beta cells of the pancreas.

We also annotate to developmental and life stages. We have one ontology per species, which we develop and make available to the community, and which describes each species from zygote to old age. This has to be per species, because development and aging are a bit different in each species. The ontologies we develop for this are reference ontologies, and are used for example by GTEx or the Human Cell Atlas for human.

We also annotate sex, which in most species is simple: male, female, or undefined when we did not get the information. In some species it is a bit more complex, but I don't think I'll go into the detail of this now; if someone is working with Daphnia or ants, you can ask us.

And we also annotate, when the information is available, strains or populations. For humans it will be populations, like European ancestry or Japanese ancestry and so on; for model organisms it will be strains; and for livestock it can be different breeds. This information we don't always have, and we only
put it when we are sure of it, when it is clearly provided by the people who provided the data. Again, all of this is manually verified by a curator.

Now, I told you earlier that we only integrate some types of data. The main curation we do at that point is to only take what we call wild-type, healthy, or normal expression. The idea is that we want gene expression which is informative on the causal function of the genes and which is evolutionarily relevant. What do I mean? The causal function is a concept in biology: it is the function for which something was selected, the function which it has to do. For example, the function of BRCA1 is not to cause breast cancer. When there are mutations, a problem can arise, which is breast cancer, but the primary function of the gene is not to cause cancer. In the same way, the heart's primary function is to pump blood; its primary function is neither to make noise in our chest nor to have heart attacks. The primary function of a gene is what it was selected to do, and this we will not learn from the expression in a cancer or in a knockout. We will learn it from the expression in a wild-type, healthy individual.

So wild-type healthy means we do not take cancers, we do not take diseases, we do not take experimental mutants such as knockouts, knockdowns, or knock-ins. We do not take modifications where you put a chemical product in the tank or in the food, or where you modified the expression by, say, microRNA injection or something. So: wild-type, healthy. Now, in some cases you can discuss: what is the exact definition of healthy?
So we try to be rather broad and to capture maximum information, because sometimes you can eat a little bit more or a little bit less and it's still healthy.

Evolutionarily relevant, because if I want to compare gene expression, say I want to find conserved expression in the brain between different mammals, I want to compare the expression in the healthy brain. I don't want to compare a healthy mouse brain to a tumor in a human brain, because this does not inform me on the conserved expression. And while, when we started Bgee, our main concerns were these two, we have found with experience that it is actually very relevant for biomedical studies, where you often have data on medical issues, so diseases, cancers, mutants, and so on, and you want to be able to see how this compares to the default, what I have when I am healthy. We can provide that reference and guarantee that we verified it carefully.

I'll give an example of how we have curated this wild-type healthy from the GTEx data. GTEx is the Genotype-Tissue Expression project. It's a very large project to do RNA-seq and genotyping from many humans, with many tissues sampled. If you look at the definition on the home page of GTEx, they say "from non-diseased tissue sites", and many studies use the GTEx data as a healthy reference. But if you look in the FAQ, you see: "we have not excluded specific donors from specific tissues based on their cause of death or medical history". So in fact, you do not know that they were healthy. Our curators read all the descriptions and all the pathology reports of all the data in version 6 of GTEx, and we found that many were not actually healthy people. There were 179, about a third of the individuals, whom we did not use at all: we did not use people who died of drug abuse, who died with an invasive cancer, who were morbidly obese, and so on. Then, among another quarter, there were many where we discarded at least one tissue.
For example, if someone had Alzheimer's, we kept the liver tissue and the muscle tissue, but not the brain tissue. If someone had ascites, which was a liver disease, we did not keep the liver, but we kept the brain, the muscle, the intestine, etc. So in total, after this large amount of work, which actually represented one year full-time equivalent of curation by several people, we kept only half the samples of GTEx version 6. Half the samples are in fact not representing healthy gene expression, but gene expression which was either modified by drugs or impacted by a very severe disease. And we do this for every data set we integrate, even though usually it's less work than for GTEx, which is the largest data set that we have.

So in total, for GTEx version 6, out of 11,900 libraries we kept 4,800, which is about half. We annotated in the database, in our annotation files, in great detail, the exact anatomical entity, the exact age, the sex, and the ethnicity. I should say that we also removed those where the pathology report showed that they did not actually sample the anatomical entity that they said they would sample. You can have that the GTEx guidelines said to sample a certain tissue, and the pathology report says: I didn't quite manage, and I took a bit of the tissue next to it. In GTEx this will be annotated to the tissue originally wanted, but it's actually not true, so we remove it.

We are not allowed to make all this data available, because GTEx is restricted data, out of respect for the privacy of the people who donated. So we provide, of course, all the expression annotated to the exact anatomical entity, but not the exact age and not the ethnicity. In the end, from GTEx we get 539 conditions, where a condition is a combination of anatomy, age, and sex, in 75 anatomical entities.

What are the types of expression data that we integrate into Bgee?
So there are different ways to measure expression. The oldest large-scale way, quote-unquote, because it's not so large-scale by modern standards, the first untargeted way to obtain gene expression, was ESTs. Those of you who are old enough will remember that when we started doing genomics, we started getting expression from this Sanger sequencing of random mRNAs. It's in Bgee because we started putting it in at the beginning and there's no reason to remove it, but it's no longer updated, and actually the official database of ESTs at the NCBI was retired last year. We have it in four species, which were the species we had when we started annotating these ESTs. So it's a bit anecdotal now, but I include it.

Now this is very important: in situ hybridization. In situ hybridization is when you manually do experiments of hybridizing, typically in an embryo, although it can also be in slices of an adult or, depending on the species, a whole adult. There are 44,000 in situ experiments in Bgee, which correspond to 343,000 evidence lines. An evidence line would be saying: here RAR gamma is expressed in, for example, here, the end of the tail. We do not annotate this directly. We have agreements with the model organism databases of mouse, zebrafish, C. elegans, and Drosophila melanogaster, so that we can recover the data they already curate. This is so as not to duplicate work: there's enough curation to do in the world that we can let them curate this, which is their main job, and they do it very well, and we curate other types of data. But we integrate it, and it's very important, because in situ hybridization provides an amazing level of detail. Usually when you do large-scale experiments, you take big chunks of body: you take brain, you take liver. When you do in situ hybridization, you can be extremely specific on where the gene is expressed, and this information is very precious for us to be able to say where genes are expressed. And although these are small-scale experiments, there are many
of them, so you see we still have quite a lot of data: 44,000 in situ experiments is not nothing.

Now, the big source of data for transcriptomics has historically been microarrays, and we only took Affymetrix, because they were the most common and we did not want to have to manage different data types which didn't bring a lot of information. Microarrays, unlike ESTs or in situ, are quantitative, and they cover most of the transcriptome. Not all, because you need a probe set for the gene, and if you did not know this gene, or did not manage to synthesize a good probe set for it, then it's not there. It is very important, because a lot of experiments have been done with microarrays, really a lot, and many of these have not been redone. So it is easy for young researchers to think: okay, we don't need microarrays because we have RNA-seq. And it's true that today I would not do a microarray experiment; I would do an RNA-seq experiment. It is better. But if you want to use historical data: for a lot of data on aging we only have microarray data; for a lot of data on circadian rhythm we only have microarray data; a lot of data where reproducing the experiments is costly or very heavy has not been reproduced yet. So the microarray data is still very important to complete our view of gene expression.

We have at present 1,200 experiments annotated in Bgee, which corresponds to 12,000 chips. It's for more species than in situ, although not all species, only species for which a reliable Affymetrix microarray was indeed developed and used. We do a lot of quality control, which Frédéric Bastian will explain to you in more detail, but one thing I would just mention is that we remove redundant chips. As far as we know, we are the only ones who do this, because we discovered this problem. Many people would submit different experiments, with different experiment identifiers, to the databases, but using the same control experiment. So they would just resubmit the
same data exactly as if it's a new experiment. That shows the problem of uncurated databases and of redundancy in uncurated databases. If you take all the Affymetrix data from, say, human or mouse, from GEO or ArrayExpress, you will have the same experiment several times, and it will wrongly increase your statistical confidence. We remove this.

Finally, the bulk of the data nowadays is RNA-seq, bulk RNA-seq I should say. Most of you know RNA-seq, I suppose. It's quantitative. It can cover the whole transcriptome. Unlike microarrays, we don't need to design a specific one for each species, so it's very easy to integrate new species: once you have a reference genome, the RNA-seq will be of good quality in every species. We have at present 8,400 libraries annotated, and our main effort of annotation right now is increasing this. We also do quality control, which Frédéric will present to you. Although we can integrate different types of RNA-seq libraries, right now our priority is poly-A libraries, the ones which target messenger RNAs, so the gene expression of protein-coding genes. Of course, with this you sometimes also recover expression of other genes, which we integrate, but we have not right now given priority to, say, specific libraries for short RNAs or long non-coding RNAs. This will probably come in the future.

Now we have these different data types, and we need to bring them together, so we do data integration. That's a very important part of Bgee: Bgee is not just a pile of data. It is data which is integrated to give you biological knowledge about the genes, about the organs, and about the expression. So this is a very broad view of how we integrate the data. We have these different data types.
I already mentioned them: the ESTs from these databases, Affymetrix from ArrayExpress and GEO, RNA-seq from SRA, GEO, and dbGaP, which is where restricted data such as GTEx lives, and in situ hybridization data from the model organism databases. On all the data that we annotate ourselves, we do quality control and condition filtering, which means we remove what is not healthy and wild-type. We reanalyze the data to make sure we detect active expression. And we map everything to the Uberon ontology, even that which comes from the model organism databases, because they do not necessarily use Uberon, or not in the same way that we do. Then all this is integrated: all curated on the same conditions, all mapped to the Uberon ontology, all integrated into Bgee.

And how do we integrate these very different data types? There are different ways to do it, and one main way we do it is that from every data type we call genes expressed, present or absent. Okay, I hope this does not disturb the microphone, because I hear noise from works next door. I'm sorry about this. This being said, I've already taught in a classroom with works next door, so these things happen.

So, calls of present or absent are a signal which is comparable between all data types. If I have an in situ hybridization experiment, I can say this gene is expressed in this place.
Yes or no. But I can also get similar information from ESTs: if I have an EST, the gene is expressed. If I have a significant microarray signal, the gene is expressed. If I have a significant RNA-seq signal, the gene is expressed. What we mean by significant in these cases, Frédéric Bastian will go into in more detail.

We also use this expression in a more quantitative way, to see how important the expression of a gene is in a given anatomical structure, in a given organ, tissue, or cell type. For this we use a score which takes into account the rank of the gene expression from every data type in every anatomical structure. Again, we can apply this to each data type in a different way, and then we can integrate these scores, so that in the end we have one score which represents how important the expression of a gene is in a given anatomical structure and age or developmental stage. So in the end we can give you information which integrates everything. You don't have to look separately at what the information from RNA-seq is, what the information from microarray is, what the information from in situ is, whereas in other databases you have to look at these things separately. We give you one piece of information, where we have summarized the knowledge in an expert way.

Bgee covers many species. Most of the examples I took up to now were for human, but in fact we cover right now 29 species: obviously human and the main animal model organisms, so mouse, zebrafish, Drosophila melanogaster, C. elegans, but also, you see here, a variety of mammals, some other vertebrates, some flies. One of our priorities for the future is increasing this, as we have good reference genomes and sufficient transcriptome data. We have already curated data for the next Bgee release. It just takes time to run all our pipeline and software to put the release out there, but it will come out this year, and we will have almost 18,000 new RNA-seq libraries.
Well, that's what was curated; we'll still do the quality control. And it will add many primates, many new fishes, and many other species which might interest some of you more than others: some relevant to agriculture, such as turkey, sheep, or honeybee, some evolutionarily interesting, such as the coelacanth or the sea anemone, etc. If you have an animal which you would really like to see in Bgee, do not hesitate to send us an email. Two conditions have to be met. It has to be an animal, because we do not integrate, for example, plants or fungi. And there needs to be expression data available for diverse anatomical structures. If you have your favorite mosquito species, and all that was done was a whole-body transcriptome to annotate the genome, we cannot integrate it, because there's no anatomy to annotate it to. We could put it in, but that would not add any knowledge: we leverage anatomy.

And speaking of anatomy: since we have different species, we want to be able to compare them, and to be able to compare them we use the concept of homology. You all probably know orthologous genes, which allow you to compare genes between species. We can also define such concepts at the anatomical level. For example, if you want to compare the tetrapod lung to zebrafish data, you want to compare it to the homologous structure. The homologous structure is the swim bladder, not the gills; this is the homologous structure between tetrapods and ray-finned fishes. So we manually annotate the anatomical homology between the different anatomical structures among animals. How this is done is by reading papers of evo-devo, evolutionary developmental biology, of paleontology, of comparative anatomy, and so on. It is not from our gene expression data. This is very important if you compare gene expression between anatomical structures which are declared homologous in Bgee.
It is not circular reasoning: we did not use this data to define the homology. The homology was defined from external expert evidence from papers. Sometimes this even forces us to improve Uberon; as I told you, such things are a work in progress. For example, here there was a paper which showed the homology between annelids and chordates for the chord, which is, think of it as kind of like the vertebral column and the nerve in the back. There is a chord in the back of all chordates, and in vertebrates it is where the vertebrae and, I'm missing words here, the nervous system in the back is. In other animal species it's different, and in the annelids there is an axochord. So we had to add a presumptive axochord term so that we could define the homology between the notochord and the axochord.

Then we capture this information in a very detailed manner, according to every statement which was made in every paper. Here you see a detailed table from our curators' work. We have the homology between the axochord and the notochord defined: the Uberon terms; the type of homology; the references; the type of evidence, for which we use the same codes as used in the Gene Ontology; how confident we are in the evidence, because maybe in the paper they say that this clearly shows this, or maybe they say this indicates it, so we capture this; at what level it is homologous, since two things can be homologous at different evolutionary levels; who annotated it; and the supporting text from the paper or the book. So we capture this in detail, and this is available publicly on our GitHub. But because not everyone wants such a level of detail, we also summarize it. Here you just have that these two classes, notochord and axochord, are homologous in Bilateria, and these two classes are homologous in Bilateria also, and this is then something that you can simply use. Again, this is available on our GitHub, and it's also used in Bgee.

So this is the end of my first part. I went a bit faster than I expected
because I'm really not used to speaking with no one in front of me, looking at me and asking me questions. I'm sorry about that. I hope I was clear. So we have more time for questions. Please don't hesitate to ask.