Okay, so this is the last presentation of the morning. I hope you're all still awake. Those who are still here, you're doing well. So we have all this data and, as was asked in the first presentation, what do we do with it? Good question. I will first show you how you can get the data, and then some tools that we've developed to help you use it. Of course, your imagination is boundless, and if you want to use it in other ways, you are most welcome.

So how do you get the data? Well, the first thing I should say is that you can get this data and do what you want with it. We have made it all available under the license called CC0, which is a fancy way of saying no license, so public domain. CC0 means you can take this data, modify it and sell it without citing us. You can do what you want. It's as free as the writings of Shakespeare or, actually, as any sequences which are, for example, in GenBank. Of course, this does not mean that academic norms do not apply, and we still appreciate it very much if you cite us when you use our work. But this avoids what's called license stacking, where you have to cite everyone who has to cite everyone who has to cite everyone, and it becomes impossible. So if you use us, we feel it's a good idea that you cite us, but don't feel that this hampers your work, okay?

So the first way to get our data is to download files which contain the data. These files are available as TSVs, tab-separated value files, which means you can import them easily into R, Excel or whatever. You should know that many of our files are too big for Excel, but you can try. We have two main types of download files: files of what we call processed data, and files of expression calls. The processed data are levels of expression, essentially, and we only provide them for Affymetrix and RNA-seq. They are in separate folders per species, and in each folder there is one file per experiment.
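Since many of these TSV files are too big for a spreadsheet, one option is to stream them row by row instead of loading them whole. The following is a minimal sketch using only the Python standard library; the column names and values in the sample are hypothetical placeholders, not the real header of any download file, so check the documentation for the actual names.

```python
import csv
import io
from collections import Counter

def count_rows_by_column(handle, column):
    """Stream a tab-separated file and count rows per value of one column.

    Only one row is held in memory at a time, so this works on files
    that are too large to open in a spreadsheet.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[column]] += 1
    return counts

# Tiny in-memory stand-in for a downloaded file.
# Header names and values are hypothetical, for illustration only.
sample = io.StringIO(
    "Experiment ID\tGene ID\tAnatomical entity name\n"
    "EXP1\tGENE_A\tbrain\n"
    "EXP1\tGENE_B\tbrain\n"
    "EXP1\tGENE_A\tliver\n"
)

rows_per_entity = count_rows_by_column(sample, "Anatomical entity name")
print(rows_per_entity)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")` on the downloaded TSV.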
For the expression calls, there is one file per species, with the calls summarized from the data as Frédéric just explained. There are simple files, where there is just a summary, and advanced files, where there is all the detail. We make both types so that the simple files are smaller: if all you want to know is whether this gene is expressed or not in this place, you don't want all the detail, you want a file which is actually manageable on your computer. So I'm going to show you in a bit of detail the content of these files. These are maybe not the most exciting slides I have ever presented in my life, but they will give you an idea of what's in there, and of course all this is in our documentation on the Bgee website, and on GitHub if you need more details.

So, the processed data files. Here I turned them around 90 degrees so that we can actually read them normally: what are rows here are columns in the real file, and then there are many rows. For every row we have a combination of experiment ID, chip ID and probe set ID. On Affymetrix arrays, for those of you who are too young to remember, the signal is not mapped to a gene but to a probe set, which is, as its name indicates, a set of probes on the microarray which targets a gene. One gene can be targeted by several probe sets, and they can have different signals, so we provide the different probe sets separately. Of course you also have the gene which is targeted, so you can just decide to parse the gene out of this. Then there are the anatomical entity and the stage, for which you have both the ontology code, which is unique and reproducible, and the name, which is more human readable but might change with time and might be more ambiguous, and all the other information.
We have the sex, the strain, and here the intensity: we provide the log of the normalized signal intensity, which is usually what is actually useful for Affymetrix. In addition, we provide how this is integrated into Bgee, so whether we called it present or absent, with high quality or low quality, as was presented earlier. This allows you, if needed, to go back to Bgee and see that your results are consistent with what we show, because we give you the information we use. But primarily, what you're really going to want to use here is the gene, the anatomical structure, the stage, sex and strain, and the log of the normalized intensity.

Similarly, for RNA-seq we provide the same type of information. Now there is no concept of probe set and, as Frédéric said, we map all the different transcripts to the gene; we don't present the different transcripts, as was discussed earlier. So you have the gene ID. I don't know if we mentioned this in all these presentations, but our primary identifier for genes is always the Ensembl ID. So if your primary identifier is something else, like UniProt or HGNC gene names, you have to map them to Ensembl IDs, but that is normally pretty easy with the tools provided by Ensembl. Then we provide the read count and two normalizations which are very commonly used: TPM, transcripts per million, and FPKM, fragments per kilobase per million. TPM is the recommended one, I would say, in most of the literature, but we want you to be free to incorporate this into your pipeline and your usual analyses, so we provide them all and you can manipulate them as you want. Again, we provide the way we integrate this into Bgee, so present or absent, level of quality and so on. These files are quite big, because you will have one row per combination of gene and structure, stage, sex and strain.
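As an aside on the two normalizations just mentioned, here is a minimal sketch of the textbook definitions of TPM and FPKM computed from raw counts and gene lengths. This is an illustration under stated assumptions, not the pipeline used to produce the download files: the gene names, counts and lengths are made up, and the choice of which length to use (for example an effective length) is left out.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}

def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million mapped fragments."""
    total_fragments = sum(counts.values())
    if total_fragments == 0:
        raise ValueError("all counts are zero; FPKM is undefined")
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene in counts
    }

# Made-up counts and lengths, for illustration only.
example_counts = {"gene_a": 100, "gene_b": 300, "gene_c": 600}
example_lengths = {"gene_a": 1000, "gene_b": 3000, "gene_c": 2000}

tpm_values = tpm(example_counts, example_lengths)
fpkm_values = fpkm(example_counts, example_lengths)
print(tpm_values)
print(fpkm_values)
```

One property visible in the sketch is that TPM values always sum to one million within a library, whereas the sum of FPKM values differs from library to library, which is a common argument for preferring TPM when comparing samples.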
So you will have, with replicates if needed, many, many rows. But these are the files you can use if you want to integrate the data into your work and do differential expression, clustering, anything you want.

Now, Frédéric just presented how we call expression present or absent, and this we provide so that you can use it, for example, to keep only genes expressed in a certain condition before you do a presence/absence analysis, or before you look at the genes which are deregulated in a disease, or if you want to compare present genes between species and take only those which are common, and so on. We provide simple files and advanced files, and this is what the simple file looks like. It is indeed quite simple. Again I turned it 90 degrees, so these are columns in the real file. You have the gene ID, its name, the anatomy and the development. You do not have here, sorry, sex and strain, so it's very simple. You have whether the gene is called present or absent, with what quality, and the rank score that Frédéric just presented. And you actually have two versions of this file. If you want something even simpler, if you're only interested in anatomy, say I only want to know what is expressed in the brain and I don't care what age or development stage, we have a version without the development. So for each species we have expression simple development, which has these two columns, and we have expression simple.tsv, which does not have these two columns. Then you really have a file which just says, for every combination of gene and anatomical structure, which can be an organ, a tissue or a cell type: is the gene present or not, and what is the rank?
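The kind of filtering described here can be sketched in a few lines. The example below assumes a simple calls file with one row per gene and anatomical structure; the column names ("Gene ID", "Anatomical entity ID", "Expression") and the gene identifiers are hypothetical stand-ins, and the two Uberon identifiers are used only as illustrative values for brain and liver.

```python
import csv
import io

def load_calls(handle):
    """Read a tab-separated calls file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

def genes_present_in(calls, entity_id):
    """Return the set of genes called present in one anatomical entity."""
    return {
        row["Gene ID"]
        for row in calls
        if row["Anatomical entity ID"] == entity_id
        and row["Expression"] == "present"
    }

# In-memory stand-in for a simple calls file (hypothetical header and rows).
sample = io.StringIO(
    "Gene ID\tAnatomical entity ID\tExpression\n"
    "GENE_A\tUBERON:0000955\tpresent\n"
    "GENE_B\tUBERON:0000955\tabsent\n"
    "GENE_C\tUBERON:0000955\tpresent\n"
    "GENE_A\tUBERON:0002107\tpresent\n"
    "GENE_C\tUBERON:0002107\tabsent\n"
)

calls = load_calls(sample)
present_in_brain = genes_present_in(calls, "UBERON:0000955")
present_in_liver = genes_present_in(calls, "UBERON:0002107")

# Genes called present in brain but not called present in liver.
brain_not_liver = present_in_brain - present_in_liver
print(sorted(present_in_brain), sorted(brain_not_liver))
```

The same set operations apply to comparing present genes between two species, once the gene identifiers have been mapped to a common set of orthologs.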
Very simple. Now, sometimes you want to be able to reproduce our analyses, or to filter on data type; for example, you maybe don't trust ESTs but trust RNA-seq, and so on. So we have the advanced files, which in addition to the columns I just showed have all these (I should have written columns here, not rows, sorry). They have "including observed data", yes or no. Because of the propagation Frédéric showed you, we can be calling a gene present, for example, in the brain when there was no experiment done on the brain and the only experiments were done on subparts of the brain; in that case it would be "no" here. If it's "yes", it means there was at least one experiment actually done on this anatomical structure, so you can trust that this was studied directly. Then, for every data type, we tell you whether data from this data type was used, either by propagation or directly, whether this data type was directly observed in this tissue, and how much was used for present or absent calls of different qualities. This way you can say, for example, I only want genes called present with direct observation by in situ hybridization.

I have a question, according to what is blinking here; I don't know how to see it. Okay, so there was a question. Okay, I continue, sorry.

So in the advanced file we have this times the four data types, so you have a lot of additional columns, which you can then parse, especially if you're the kind of person who parses big files. If you're a bioinformatician, whether you want to use some simple tool like grep, import it into R, or program your own Python function to do this, you can then find all the information. Nothing is hidden. You can get these files from the web page, which links you to the FTP, or directly from the FTP server if that's your religion. For each species you can click and get the processed data, or you can click and see the presence/absence files, and here you can
decide: I want the development stage, and I want the advanced columns, yes or no. There is always documentation available. This is for every species: on our home page, where you see all these pictures of species, you click on the species and you get these options.

Now, you may want the files directly to manipulate them, but maybe you want to integrate this into your analyses in R. So we have made an R package in Bioconductor called BgeeDB, which allows you to query the database directly and obtain these data without downloading everything yourself. I'm not going to go into great detail on this package; for those of you who registered for the hands-on this afternoon, there will be a detailed tutorial on it. But briefly, this package allows you to retrieve the annotations of the RNA-seq and microarray experiments which are in Bgee, so that you can say: I want all the experiments which are in the brain of aging adults, and you can get them; I want all the experiments which have both sexes, and so on; or RNA-seq, or microarray. And you can obtain the processed gene expression data, so what I showed you: for Affymetrix, the logarithm of the expression intensity, and for RNA-seq, counts, TPMs and FPKMs. So you can write directly in your R code: find the experiments which have male brain and RNA-seq; with these identifiers, now go through them and get the TPMs; and now use the TPMs for whatever you wanted to do downstream with male brain expression. This will come directly into R. There is also a function to reformat this into an ExpressionSet object which, for those who use R, is a type of object used by many functions which manipulate expression data. So this is to make your life easier if you need it. The package and its documentation are available in Bioconductor, and this is the stable URL for Bioconductor, which I put here on the slide. Since the slides are available on the Google Doc, you can download them and copy the link from the slide, or frankly
if you Google "BgeeDB package" you will also find it, and there is also a link from the Bgee page. So this package allows you to integrate easily: you can make a workflow where you say, I'm going to use Bgee to get the RNA-seq or microarray data that I need, and then do other analyses downstream. That is easier than having to download everything, load it into R and parse out only what you want.

Another way to get Bgee data, which I'll explain briefly because it is a priori for more advanced users, but I want to mention it since I don't know your level, is a SPARQL endpoint. So first I should say what a SPARQL endpoint is. SPARQL is... how can I explain this simply... a way to obtain data from another database programmatically, using triples. So you're going to say, I want... I have an example afterwards; I think I will show the example afterwards, because I don't know how to explain this simply. Okay, so those of you who know SPARQL, you know, that's cool.

What's important to know is that our SPARQL endpoint does not query Bgee directly, because Bgee is very complex and very big, and a lot of the concepts which are useful to users are in fact what is called implicit. In the database it is not directly written that this gene is expressed in this tissue. When we make a call on the web page, in the app, in the package, or when making a download file, at that point we do all the propagation and the resolution of conflicts that Frédéric showed you, and give you a conclusion programmatically. In Easy Bgee we have already done this, so that you have an easy database which just tells you what you want to know: which gene is expressed in which organ and at which developmental or life stage. It corresponds to what you see on the gene page on the web. This Easy Bgee database is much smaller and easier to understand, so it's what we suggest people download if they want to set up a local database, and it's what the SPARQL queries will query. This will allow you to make simple queries such as: in which organs is a gene expressed
in a given species, which genes are expressed in a given organ in a given species, which genes are expressed in a given organ and development stage, or in a given development stage, and so on. This is the URL of the SPARQL endpoint. We have documentation, sorry, with examples, and we are also part of a project called Bio-SODA, which allows you to combine SPARQL queries across different databases, so that you can query both Bgee and, say, UniProt, and say: I want the genes which are expressed in the liver and have an annotation of being involved in liver diseases.

What a SPARQL query looks like, briefly, is something like this. What is important is that here you see that I'm saying what I want to recover. Where is it written... I filter: I want only genes which have this gene name, and I want to recover their anatomical entities. So I'm going to get the anatomical entities where the gene with this name is expressed, and I restrict it to a certain species by its taxonomy identifier, so the rat: where are rat Apoc1 genes expressed, in which anatomical structures? What's good with such a query is that when Bgee updates, the results update automatically, and you can easily integrate it with other such queries.

Now, all this was about recovering the data. Now I'm going to show you some of the tools that we have built on top of Bgee, which allow you to take advantage of our structured data to get some biological knowledge. The first of these is TopAnat. TopAnat is very similar to a gene ontology enrichment. I suppose most of you have already done a gene ontology enrichment. The principle is that you get a gene list from some experiment: the genes which are conserved between two species, or the genes which are duplicated, or the genes which are differentially expressed between treatment and no treatment, whatever. You want to make sense of this gene list, and you're not going to read all the papers about thousands of genes. So instead you paste that gene list into a gene ontology enrichment tool
and it tells you: hey, that gene list is enriched in kinases, so there is something interesting about signal transduction. How this works is that it uses a contingency table and a Fisher exact test. You have your gene list and the universe of other genes, and whether the genes are annotated to a certain term of the gene ontology or not, and you test whether the frequency is different from what is expected by chance. So if, say, 5% of all the genes in the genome are kinases but in my list 20% of the genes are kinases, that's more than expected by chance, so I'll say it's enriched in kinases. These gene ontology terms, as I said earlier this morning, are annotated to the genes through either automatic or manual annotation.

Now, we can do exactly the same, because instead of the gene ontology we have another ontology, with the same structure computationally, which is Uberon, which is anatomy. We also have associations between the terms of the ontology and the genes: instead of an association between a gene ontology term and a gene, as in the gene ontology, we have an association between an anatomical term and a gene, and this association is through gene expression. So if the gene Apoc1 is expressed in the liver, then the gene Apoc1 has an association to the anatomical term liver. So now, for each anatomical structure of the ontology, I can build the table. I have the genes from my gene list of interest, say genes which have duplications in human, or genes which are involved in autism, or anything. I have my gene list and I look at whether they are expressed in this structure, for example the liver: a certain number expressed, a certain number not expressed, and in the whole universe of possible genes a certain number expressed or not expressed. And I can do a Fisher test or a hypergeometric test, usually a Fisher test in our case, to see if I have an excess or a deficit of genes from my list which are expressed in this structure. This is very similar in principle to a gene ontology enrichment test. One
difference is that we only use experimental data for these genes in this species. So if you do the test in mouse or in rat, in mouse we only use gene expression data experimentally derived in mouse, and in rat we only use gene expression data experimentally derived in rat. When you do a gene ontology enrichment test, most of the annotations have been transferred by orthology between species, so you don't actually know that this function really exists in the rat, say, whereas here we know this gene expression really exists in this species. We also have a deconvolution of the ontology graph; I will explain this a bit on the next slide. Basically, we use an R package which does gene ontology enrichment, called topGO, and we modified its code to adapt it to the ontology of anatomy. This allows us to use all the same statistical tools, because it is in fact quite similar: the ontology is the same mathematical object, so we can manipulate it the same way.

It looks like this. Here I have an example where I took the genes which have a phenotype annotated in ZFIN, the zebrafish model organism database. We took all the genes for which, when you mutate the gene, there is a known phenotype in the pectoral fin. We put these genes here at the top of the TopAnat web page on the Bgee website, and I get here the anatomical structures where these genes are expressed. I have 7.4 times more genes expressed in the pectoral fin than I would expect: there are 98 genes here, and if I sampled 98 genes randomly in the genome I would expect about seven times fewer expressed in the pectoral fin. This is very significant; we provide a p-value and an FDR from the test. So this means that I have a very strong enrichment: genes which have a phenotype in the pectoral fin when they are mutated are expressed in the pectoral fin. That is not in itself a very biologically surprising result, but it allows us to check
that the method works. I see that I have several questions, so I'll interrupt for a second to look at them. Someone asked why TPM is recommended; I'll come back to this at the end, I think.

Okay, and here you see that there are several options on the website. One option is the background: here it was by default all the genes for which we have data in Bgee for zebrafish. You can change this and put custom data. Why is this important? Because if I have a data set which I filtered for various reasons, this can bias my set, and I should compare only to what I can reasonably expect. I'll give you an example. If I look at genes which duplicated specifically in primates relative to other mammals, I could only do that analysis for genes for which I had orthologs between primates and other mammals. So my background should be the genes for which I have orthologs between primates and other mammals; it should not include the genes for which I could not decide whether they were duplicated or not. If I take genes which have a certain protein function, this is only possible for protein-coding genes, so my background should be protein-coding genes, etc.

Sorry Marc, you got a question about the enrichment test: how can I find out what level of the GO is most meaningful to describe my data?

The GO? So this is not a...
Sorry, what is the meaningful cut-off, to be neither too broad nor too restrictive?

That's a very good question. Yeah, so I have a 45-minute lecture on the GO, which this is not, but let's say briefly. If I go to this part here, decorrelation: a problem with ontology enrichment tests, whether it's the gene ontology or anatomy with Uberon, is that there is a lot of redundancy. If I have a gene list which tends to be more expressed in the cerebellum than by chance, it will also tend to be more expressed in the brain than by chance, and in the nervous system than by chance, but this is actually the same information several times. If you say "no decorrelation" here, this will give you what most enrichment packages give you, treating every term as independent although they are not. And here there are three decorrelation algorithms, which come from this package called topGO, so they are explained in the topGO papers.

Elim is the simplest to understand. Elim, as its name indicates, eliminates. So if I have, for example, a significant enrichment of expression here in the pectoral fin, then all the genes which are annotated to the pectoral fin will not be used at all to study other structures which might be its parents, for example fin, or paired limb/fin, and so on. So anything which gives me a significant signal in a precise structure will not be used further up. A result of this is that this algorithm will give you almost only precise structures: unless there are genes which are specifically called expressed only in broad structures, most of the signal will be in the precise structures, and that's all you will get. So in that case it will bias your result, not in a bad way, okay, it's a correct result, towards precise structures, which is something you sometimes want. Conversely, you can use parent-child, which will bias towards the broad structures: it will first tell you which broad structures have an over-representation, and not give you the subparts. So in that case it would tend to
tell you something like limbs, or maybe even something broader. And weight is in between. I like it; that's why I took it here. It is going to give you in priority the more precise structures, but it is not going to totally remove their information when looking at the broader structures; it is just going to downweight it. So the more information a gene brought to a precise structure, the less information it will be allowed to give to the broader structure which is its parent in the ontology graph, but that is still taken into account a bit, so we don't totally lose this information. That's what I used here.

So I don't think it's a good idea to give one cutoff in an ontology, because ontologies are not built like this. They will have more or less levels of granularity and detail in different parts of the graph, because of the quantity of knowledge we had at the time it was captured, which is always evolving because biology is a changing science, and also simply because some structures are more complicated to describe than others, so there will be more terms. You see sometimes that some tools for the gene ontology will say we can cut at level four or something, but level four means nothing: sometimes it will be a very precise term, sometimes a very broad term. So I think it's better to use algorithms like these, which either start from the parents or start from the leaves. Usually we are interested in the precise terms, so we will start from the leaves, from the precise terms, and go up, emphasizing the leaves. So that would be my recommendation: not to put a hard cutoff, but to use an algorithm such as those provided in topGO, and here in TopAnat, which will weigh differently the information from the precise terms and the broad terms. Usually, in my experience, we want information from the precise terms, but sometimes we want it from the very broad terms, and then you can use that algorithm. For the details of how to use this on the gene ontology, I would send you to the topGO paper, for which we can put a link in the Google Doc. I hope this answers the question.

Other options we have here: we give you the results separately for the embryo and the post-embryo, because we consider that expression in, say, embryonic brain and post-embryonic brain is not necessarily in the same structure and does not have the same meaning. And you can separate by expression data type or use them all together. This, I should emphasize, is only based on our calls of present and absent. The fact that it works very well (we have made many tests since we have this, and it usually works well, as in this example) shows that our calls work well. When we make the calls, we can never be 100% sure of what we do, obviously, and we know we make some mistakes, because that's life. But how reliable is it? Well, here we have a tool which constructs a whole complicated thing based on these calls, and it works: we get the biology out when you put in the gene list. We've made various sanity tests: you take genes which have the gene ontology term spermatogenesis and you get the testis; you put in the genes which are involved in autism and you get the parts of the brain which are involved in autism, and so on and so forth. So it works really well.

So I started to speak a bit about this already, but TopAnat analyses have the same pitfalls, the same things you should be very careful about, as gene ontology enrichment tests. One which is really important, I cannot emphasize this enough, is the background. If you use the wrong background, you'll get the wrong conclusions. For example, there are a lot of annotations done in human and mouse to ovary and breast, so genes which are involved in sexual differences have a lot of annotations to these even if they are male specific. So you have to set the background accordingly: if you have a subset of gene ontology as your test, you want
the background to be the genes which have some gene ontology annotation; if you have a subset of protein-coding genes, you want the background to be protein-coding genes, and so on. You have to be very careful about this, or it can completely change your results in the wrong way. You should basically think of all the steps in your pipeline that generate your data, and of the step you're testing: if you're testing this step, your background should be the data from the step just before.

Another pitfall I just mentioned is the non-independence, and there are algorithms which allow you to deconvolute the graph, so to take this non-independence into account, in simple terms. And of course there's multiple testing. There are tens of thousands of gene ontology terms and tens of thousands of anatomy terms, thousands of which are used in Bgee, and you do all these Fisher exact tests or hypergeometric tests term by term, so you have a huge problem of multiple testing. Some people want to take it into account and some do not; there is a philosophical debate in the field. So we give you a choice: we provide you the uncorrected p-value and the FDR, but it's something to always be aware of, at least. And I should emphasize that these are not specifically pitfalls of TopAnat, but pitfalls of any ontology enrichment test.

TopAnat I showed you on the web, but it's also in the BgeeDB package, and in the BgeeDB package there are more options. For example, if you want to restrict an anatomical enrichment test to a very specific age or stage of development, you can, which you cannot on the web page. And you can include it in your pipeline, so that for example you do various analyses and run the gene ontology enrichment and the anatomy enrichment as a step of your analysis, to see how to interpret your results and whether they make sense.

So I see that I'm running late on time, so I will try not to go too slowly. We told you this morning that we annotate anatomical homology. We make this available to you in two ways. The first way is simply a web page
where you can paste Uberon IDs for one species, and it tells you what the homology is with other species. For example, here I took the Uberon IDs of all the human tissues which are in the big GTEx data set, and I asked for human and zebrafish. So I have here the homology between these tissues and zebrafish. You see some have a direct, easy homology: hypothalamus exists in both zebrafish and human, and it's homologous at the chordate level; chordates are the group which includes vertebrates and others. Some are a bit more complicated; for example, heart left ventricle is homologous to primary heart field. These are the 31 which have a homology, and there are also all the others, which do not have any homology between human and zebrafish according to the evo-devo and paleontological literature. So this means that if you want to compare GTEx data to zebrafish data, say because you have phenotypes in zebrafish that you want to use to interpret the data, you can only interpret these 31 tissues. So you have this information. Right now you have to paste the Uberon IDs; to find these Uberon IDs you should go to the Uberon web page, there is a link here, put in the name, get the ID and paste it.

Another way we use homology is to compare gene expression between species. Say you have a list of orthologous genes in several species and you want to see whether they have conserved expression, and how high this expression is. Well, you want to compare the expression of the orthologous genes between the homologous organs, and this is what we provide you here in the expression comparison tool. You paste here your list of orthologous genes; here it is all orthologs of SRRM4, which is a gene which is known in several mammals to be brain specific. Here we see the expression conservation: of the 13 genes I gave, all 13 are expressed in the brain, with high scores, in the 13 species. And if I would... here it's a screenshot, so I cannot do this, but on the real website I
could unfold and see the detail for every gene, and see that they're also all expressed in the central nervous system. In the forebrain I have 10 out of 13 which are expressed, three which have no data, and none which have absence of expression. If I go down, and I could sort it in different ways by clicking on these columns, I could say I want to see in priority the places where they don't have expression, and so on.

So this concludes the presentations of this morning. I see that I already have several questions, so I'll go to the questions soon. A new one appeared, magic. So the aim of Bgee is really to make gene expression useful by providing our expertise. We spend a lot of time trying to understand what the best methods are, what the best data are, and what the best ways to treat them are. When there is something from the literature, we use it; when there is not, we invent something and benchmark it. The idea is to make this step easy for you. We provide you the calls, and you can trust them because we did all the work before. We provide you the curated data, and you can trust it. We provide you the enrichment, and you can trust it. Of course, as Ronald Reagan famously said, trust, but verify. That's why all our code and our data are open and available. But really, our aim is to make things easier for you, because what you want to do is the analyses which are downstream from this. So thank you for your attention.