So, I don't need to explain to you how this functions, because it's the same as the previous days. Since I didn't attend all of the course on the previous days, don't hesitate to tell me, in the chat or in the live document, if something is already known to you or is unclear.

Today I'll be speaking about gene function and gene expression and how they evolve in orthologs and paralogs. I will have three parts. The first part will be less about evolution directly and more about the bioinformatics of function and expression, and which resources and databases we can use, with a focus on the Gene Ontology and on the database Bgee, which we developed in my group for gene expression. Then we will speak about gene and genome duplication, and finally evolution after duplication. So in the first part we're going to speak about gene function and the Bgee database; this will be the more bioinformatics of the three parts, and the other two parts will be more evolutionary.

During this presentation I'm going to try to get your feedback on various points. For this I ask you to follow this link, which is also in the live doc, and I will activate the first question. Now, normally, you will be able to answer a poll about what the function of the heart is. I don't see any answers yet, so I'll wait a bit. You go to Wooclap; the code is SIBCG3. Most of the votes are obviously "to pump blood through the body".

In the case of the heart it might seem obvious that the function of the heart is to pump blood, and not to make a noise doctors can listen to. But think about what happens when you speak of the function of a gene. If you ask what the function of BRCA1 is, for example, many people would say it's a breast cancer gene. But saying that a gene is a breast cancer gene is a bit like saying the function of the heart is to get heart attacks: it is describing what it does when it doesn't do its job right. So this makes us think more about... So these are your answers; so far everyone is answering about
pumping blood, obviously. But if now I go back to my slides... "One thing, sorry: we are not seeing your slides." Yeah, I need to get back to it, sorry, I forgot to click on share there.

So we can distinguish in biology two meanings of function. There is the selected effect definition of function, which is what the structure was selected to do. The heart was selected by natural selection to pump blood: organisms which had a heart which was better at pumping blood were kept by natural selection, and organisms whose heart was less good at pumping blood were not kept by natural selection. There was no selection to make the heart easier for doctors to listen to, and there was no selection to allow the heart to have more heart attacks. But these other roles exist, and that's what we call the causal role definition of function: function is what it does, whether it was selected or not.

This is very important for genes and genomes. All this discussion started with the ENCODE project, which looked at RNA-seq and ChIP-seq all across the human genome. If you find somewhere where there is transcription, or somewhere where a transcription factor binds, this is in a way a function, but it doesn't tell you that this function was selected for, and it doesn't tell you that this is what it has to do for the organism to survive. Yet we studied it, and it is somehow relevant. And if you look at medical genetics, or even at the function of genes in, say, fly, where we know a lot from mutants, many of the functions we know are the functions you get when the gene doesn't work. So we have these two roles, the selected effect definition of function and the causal role definition of function, and the only one which is interesting for evolution and for comparative genomics is the selected effect.

Now, a point I want to speak about first when I speak about function information is biocuration. What is biocuration? It is verifying the information we put in databases. All of you have used databases, but it's not
magic: what's in it has had to be put there, and it can be put there in several ways.

One way is what we call uncurated databases. An example of an uncurated database is NCBI Nucleotide, which we also call GenBank for historical reasons. All the nucleotide sequences which were ever sequenced and published are there, all of them. So if something is wrong, it's there. If something is redundant with something else, because several people sequenced the same gene, it's all there. There's no organization. If someone put the wrong species when they entered it, it's the wrong species. If someone wrote the function not in the function field but in another place, it's there. It's very poorly organized and you cannot trust the knowledge there. On the other hand, it's complete and up to date: GenBank is updated every night, so you know that all the new sequences are there and everything which was ever done is there.

The other type of database is a curated database, of which the most famous example is Swiss-Prot/UniProt, that is, the Swiss-Prot part of UniProt. There, information is verified by humans we call biocurators: people read the information, verify it, put it into the database and approve it. So things are verified and trustworthy, there's very low redundancy, and annotations are standardized, so it's much easier to compare things, since everyone uses the same words because professionals are putting this in. It's organized and reliable. On the other hand, it is not necessarily up to date, because humans take time to do things.

So the definition of biocuration is the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets. When we say the scientific literature: in Swiss-Prot, for example, the curators read on average one paper per day per curator to add this information, which you can then access automatically.

I've been saying "annotation". What is annotation? It is associating a biological object, for
example a gene, to a feature, for example a function. So for example, if I do a BLAST of a gene of unknown function against Swiss-Prot and I find a Gene Ontology function term, I can associate that term to my gene of unknown function through the BLAST hit. That will be an annotation: I'm carrying the function from one gene to the other. But it's uncurated; it comes directly and automatically from the BLAST hit. I can also read an article which made experiments on this gene and say, "Aha, it has this function from the experiments"; that would be a curated annotation. In both cases it is annotation, which is associating a feature, for example a function, to a biological object; in our case what interests us is genes.

Now, for annotation you can use anything, and historically Swiss-Prot, which was the first annotated database, started with free text, which means people write and people read. But that has limitations when you want to do this at a large scale and use it automatically, as we would want to do in genomics. To solve this we use what are called ontologies.

So what is an ontology? In bioinformatics, an ontology is first a list of terms, so we agree on the terms that we use. If we have just a list of agreed terms and nothing else, it is not an ontology, it's a controlled vocabulary. Then we have definitions of the terms. So maybe we have the term "enzymatic activity", and then we have a definition: enzymatic activity is the activity of a protein which catalyzes some chemical reaction. If I have a list of terms with their definitions, I have a dictionary. What makes it an ontology is that we have relations between the terms. So I'm going to say that hydrolase activity is a type of enzyme activity, and that allows us to do reasoning. That is to say, if I want all enzymes, I don't only take those which have the annotation "enzyme", I also take those which have the annotation "hydrolase". This is a very simple reasoning, and we can do more complex reasoning, but what's important is that it allows us to organize the huge
knowledge we have in a way which allows us to manipulate it automatically.

The most famous ontology in biology is the Gene Ontology. Here's an example from the Gene Ontology: you see that cell migration in the hindbrain is a part of hindbrain development and is a cell migration. So we have different types of relations. I hope you see my mouse, sorry. Here you have "part of" in blue and "is a" in black; there are other relations which are not used here. So cell migration in the hindbrain is part of hindbrain development, which is part of brain development, which is part of head development but also part of central nervous system development. Here we see that it's a graph, not just a tree, because you can have several parents and you can also have several offspring. One term can have several relations both up and down, if you want. That's because biology is complicated, and we want to represent all this information.

This can then be used either by directly manipulating the graph or by just taking out the terms which you are interested in. For example, if you go to UniProt, for each gene you will see under function all the Gene Ontology terms. You notice here that it is separated into molecular function and biological process; I come to that in the next slide. And you see here the source, that is, how it was annotated: here it was annotated by the curated database ZFIN, which is a database for zebrafish; here it was annotated by similarity, which means it just has a similar sequence to another gene; and so on.

The Gene Ontology actually has three ontologies in it: molecular function, which you can imagine as what the gene does when it's alone in a test tube; biological process, which is what it does in the organism; and cellular component, which is where it is in the cell.

Relative to what I was saying at the beginning, the Gene Ontology does not describe all the possible ways you could want to describe function. It describes selected effect functions, so what we think the gene or the protein was selected to do. So, for example, the Gene Ontology does not
cover pathological roles: you don't have a Gene Ontology function "making lymphomas". Nor does it cover experimental conditions; evolutionary relations, importantly for us, so this we must find elsewhere; or gene product type. For example, if you have an RNA which catalyzes a reaction or a protein which catalyzes a reaction, this is the same function for the Gene Ontology, which does not distinguish them.

There are many ways we can use the Gene Ontology. As I just showed you, it is used in annotations in UniProt to make things consistent and comparable between databases. So UniProt took some annotation from the zebrafish database ZFIN and put it into UniProt, and since both databases use the Gene Ontology it can be easily coordinated.

But one of the big uses we have for it is what we call gene list enrichment. In genomics we often get gene lists. You look for all the genes which are differentially expressed between two conditions; you look at all the genes which don't have an ortholog in another species; you look at all the genes which were duplicated; you look at all the genes which are clustered on the chromosome. Whatever you do in genomics, you usually end up with not one gene, not two, but many genes. Now, if I have many genes, it's hard to read all the papers on every gene and come to a conclusion about what they do. The Gene Ontology, or other ontologies, give us tools to get to an automated understanding of what gene lists do.

I'm just going to take a short break here, because I hear lots of alerts and I want to see if they have to do with messages for me. But no, okay, no special messages for me there.

So how does gene list enrichment work? You have biologically relevant terms, that is, annotations of your genes, for example Gene Ontology terms. So I have genes, and they're associated to something I'm interested in. I have a list of genes which to me are interesting: maybe it's the genes differentially expressed between young and old, maybe it's the genes which are differentially
expressed between male and female, maybe it's the genes which are found in all mammals except human, whatever. And I have associations between my genes and my terms, which are the annotations. Now my question is: in my list of interesting genes, are these associations, for some terms, more frequent or less frequent than expected by chance? Are my genes which are, for example, found in great apes but not in other mammals more often involved in behavior? Are genes which are lost in birds more often involved in walking? So I'm going to try to find these associations. These are stupid examples, but I have better ones.

The principle is very simple; this I stole from the head of the Gene Ontology Consortium. You have a reference gene list, which is not only your genes of interest but all the genes your genes of interest come from. For example, if I'm comparing human genes which have an ortholog in chimpanzee to human genes which do not have an ortholog in chimpanzee, my reference is all the human genes. But if I'm comparing, among human-chimpanzee orthologs, those which were selected in humans to those which were not under selection in humans, my reference is all the genes for which I have a human-chimpanzee ortholog. Among these, some are annotated with your Gene Ontology term of interest and some are not; for example, here my Gene Ontology term of interest is the dark blue. So I have the reference, which is all the genes I was able to study, and my gene list of interest, which is the ones I'm really interested in, for example the ones with selection in human. And so I ask: is there more dark blue here than expected if I took this number of dots by chance here? Oops.

For this we simply build a contingency table, where we have whether the genes have this annotation or not and whether they're in my set of interest or not, and this sums up to my background, my reference set. From this contingency table I can do a Fisher's exact test, a chi-square test, a binomial, a hypergeometric; usually we do a Fisher
exact test; any test which tests for non-independence will do. If there is independence, it means that my set of genes of interest is random relative to this Gene Ontology term. If there is no independence, it means my set is not random relative to this term: I get a very low p-value, and I can get a ratio of how much I observed relative to how much I expected, and that's an enrichment.

For example, you can do this on the PANTHER database website, which is the one recommended by the Gene Ontology Consortium and the one we'll use for those of you who do the exercise this afternoon. You get something like this, where you have the Gene Ontology terms of interest. As there are relations between them, you see that the hierarchy is kept: metabolic process is a type of organic substance metabolic process, and so on. Oh sorry, it's the other way: organic substance metabolic process is a type of metabolic process, and lipid metabolic process is a type of organic substance metabolic process. For each term you get the number of genes in your background, how many you observed in your foreground, how many you expected, and how many more you had. So you see here that by chance I would have 1.4 and in fact I had 17: it's a 12-fold enrichment. That's an enrichment, so the sign here is plus, not minus, and it's very significant: the p-value is very small, 10^-9.

So now, after this short introduction about annotation and the Gene Ontology, let's go to another question on Wooclap: do you trust information from databases? You can reply honestly, although I was speaking about curation just now. So either this is a bias of young people, or it's a bias because of what I was telling you, because I know that quite a few more senior researchers do not trust annotation databases. I've had people tell me, "No, the Gene Ontology is all crap." But you seem pretty confident, that's good. It's going down, it's going up. I'm influencing you, it's terrible, I should not speak. New poll: do professors know when to stop speaking? No. Okay, I think we reached an
equilibrium for now. Going back to my slides... I hear something in the microphone. "Yeah, Mark, there is a question: which website were you referring to earlier?" The PANTHER database website. I can put a link in the... If someone puts the question in the live doc, I'll put an answer at the break. "Thank you."

Okay, so now let's speak about the database that we make in my lab, which is curated, so I'm very happy that the majority of you trust curated databases. Bgee is a database of gene expression, and our aim is to help biologists understand and use gene expression: we become experts on gene expression and share this, so that gene expression is easier for you to use. Gene expression is a very complex object. It can be seen in terms of amount: genes are more or less expressed. In terms of presence: genes are expressed or not. In terms of relative expression: genes are overexpressed in some conditions relative to others. Gene expression can be seen at the level of an organ, a whole body, a developmental stage, a sex, an age, and so on and so forth. So we try to organize this.

The first thing we do in Bgee is that we curate, we verify manually, only wild-type healthy gene expression. This means that we look at all the available gene expression data sets, and anything which comes from a tumor, a knockout, a treatment by some chemical product in the water of the frog tank, and so on, we do not use. That is because what we want to capture with our gene expression is the selected effect function of the gene. We want to capture what the gene does, and has done historically in evolution, and for this we want to be as close as possible to the evolutionary conditions, and not, for example, a knockout mouse or a tumor, which is not the selected function of the gene.

So wild-type healthy gives us information on the selected function of the genes. It's evolutionarily relevant. We can compare between species: if I compare the gene expression in a human lung tumor and in the lung of a mouse which had a
knockout of a gene, I don't know what I'm comparing; I'm comparing some weird stuff. If I compare a healthy human lung to a healthy mouse lung, then this makes much more sense. We built it for evolution, but we find now that it's also useful for biomedical studies, because when you study cancers, for example, you want to know how things are when there is no cancer. And so we serve as a reference for biomedical studies, to know what indeed is the healthy expression.

We annotate gene expression. We annotate it to five features. I only put four here because... no, to four features, sorry. Yes indeed: the fifth is the species, but I didn't put it here.

The most complicated one is anatomy, so I'll come back to this, but we use an ontology which describes in detail the anatomy of every organ down to the cell level, because it doesn't mean anything to say "this gene is expressed in zebrafish"; you have to say where it is expressed. Secondly, we have an ontology, separate per species, of development and life stages, which goes from the egg to the aging old adult, and so allows us to say this gene is expressed in the liver of the juvenile, in the liver of a 45-year-old, and so on and so forth. Then sex, which is pretty straightforward in most species, male and female, although some species like C. elegans are a bit more complicated. And, when we have the information, strains, that is, for example, Black 6 mouse, or for humans different population origins, when we have this information and when we're allowed to share it, because sometimes for humans there are confidential parts of the data.

For the anatomy, as you can see on this slide, it is rather complicated. Anatomy is a complicated thing because you have the different structures, of which there are many and which have different terminologies. So we have to put together the terminology, for example, for humans from medicine and for various animals from zoology, which might use different terms for the same thing. And the relations
are very complicated, because things can be inside other things, but they can also have the same function as other things, or they can develop from other things. So for example the cerebellum is physically in the head and is functionally part of the nervous system.

Then we use this data that we curated to be wild-type healthy, and these ontologies we have to annotate with, to integrate all the different types of data that we have. For gene expression you have many different ways to measure it, and we want to integrate all this together, between different experiments, between different data types, between different species, so that you can use it in a transparent way all together. Historically we took ESTs, which were the first type of measure of gene expression, then microarrays, and we continue because there are still microarrays available, surprisingly. More and more, of course, RNA-seq; I mean, now it is mostly RNA-seq. Also in situ hybridization data, which continues to grow, and there are whole projects of systematically doing in situ hybridization on, for example, the embryos of mouse or zebrafish.

We integrate all this data with common standards of healthy wild type, and quality control per data type but always with consistent criteria. We reanalyze everything to say when a gene is expressed or not. We map all this to the same ontologies, notably Uberon for the anatomy, which is the most complicated one. And this is what we integrate into Bgee: all the different data types are integrated together, so that when we say "this gene is expressed", we took all the information available.

And how do we integrate, given that these are very different data types? The first level of integration is that for every data type we call whether a gene is expressed or not expressed in a given condition. A condition will be an anatomical structure, at an age or developmental stage, in a species, in a sex, in a strain. Of course we don't always have this information, so sometimes we might know it's mouse
liver but we don't know the exact age nor the strain. That's why ontologies are useful: you can map to more or less precise levels.

In any case we always call present/absent. For in situ hybridization that's the only information you have. All the others are quantitative, but you can define a background noise level, for example for microarrays, and say everything above this background noise level is expressed. For RNA-seq we also define a background noise level, by mapping reads to non-genic regions, intergenic regions, and this allows us to say that what is expressed more than the background noise is present. Then we can integrate all this: if we have different data types which all say this gene is expressed in the liver at different ages, we can put this together and say this gene is expressed in the liver at all these different ages.

We can also try to integrate a more quantitative aspect, because it's also important to know if a gene is more expressed or less expressed. For this, for every data type we do a non-parametric ranking, so we can say this gene is more expressed than the others, or less expressed than the others, in a given condition. That you can do for every condition: every anatomical structure, every age, and so on. But then if we want to take the gene's point of view, we can ask, for this gene, in which anatomical structure, at which age, in which sex did it have the highest rank? If a gene has rank 10,000 out of 20,000 in the liver but in other organs it has rank 15,000 out of 20,000, I'm going to say it's all the same more important in the liver than elsewhere. So then we integrate these rank scores across all the data types and all the conditions to get a general idea of where a gene is most expressed, and in that way to get an idea of where this gene is most important. We integrate the scores by weighting by the amount of information, so that RNA-seq, which brings more information, will actually bring more weight in the end than, for
example, in situ hybridization, which has very low information on this score.

For RNA-seq I'll just do one short slide to explain how we do it, because this is the least classical one. Most people use RNA-seq simply to do quantitative comparisons, not to call present/absent, or if they call present/absent they put an arbitrary threshold of, say, TPM 2, with no justification. Our starting point is that it has been found that when you look at the distribution of RNA-seq over genes, what you typically get is a kind of bimodal distribution, with a smaller shoulder here and a big peak here. This big peak is in fact the genes which are actively expressed: it's not just that they are expressed in the sense that the polymerase was around there and made some RNA, but that there is actually active opening of the chromatin, transcription factors binding, and a function for the gene. And this shoulder here is more the background transcription of the genome. What you can see is that if you look at intergenic mapping of reads, it overlaps a lot with this background of genic expression.

So we define for every species a reference intergenic. This is from an old study; we have refined this since, but my graphs are more complicated, so I put the old study here. We refine the intergenic for every species to define a real intergenic, which we guarantee is not genes. For this we analyze all available RNA-seq. This is important because in non-model species genes and intergenic regions are poorly defined. In human or mouse or Drosophila melanogaster this is not a problem. You go to even another Drosophila or another primate, and suddenly lots of genes are not annotated, and what we call intergenic might include a lot of non-annotated genes. So we define a reference intergenic, and then for every library you map the reads to the reference intergenic and to the genes, compare the two, and those genes which are more expressed than the reference intergenic we call expressed. We actually have an R package, I will put you a link later, which allows you to do
this on your own data sets if you want.

In the end, when we integrate all this information, you get the page of a gene, like this. This is the gene APOC1; it's a liver-specific gene in human. Here, first, you have all the anatomical structures where this gene is found present, and you see there are 217, so this means it's a very widely expressed gene, a gene which is found in many structures. Yet we know from its function that its most important function is in the liver. And indeed, if we look now at the scores we have from integrating the levels of expression in all the different organs, we see that the top score is in liver, the next is right lobe of liver, and the next is adrenal gland cortex. So we have here this highest expression in the liver, across 14 developmental or aging stages. So we can both find all the structures in which this gene is expressed and find which is the top one.

And we can do this in every species for which we have data. You see that in different species the top expression for the ortholog of this gene is always liver: in mouse, in zebrafish, in chimpanzee, in macaque, in rat, in dog, in guinea pig. Here I stopped because of the limit of the screen. You see that we always have APOC1 top expression in the liver, and it's from various information sources: for example, here it's Affymetrix; here it's Affymetrix and RNA-seq; here it's Affymetrix, in situ hybridization and RNA-seq. For the non-model species, sorry, it's usually only RNA-seq, because that's the only thing we get from non-model species.

So let's go back to do a poll. You can already follow the link; I'm going to share the screen soon. So: what would you compare expression in human lungs to? Answers are less homogeneous than in the previous polls, this is interesting. What will win? Nobody, no love for the gills? You know that gills are what allow fish to breathe? No? No fishers, no people who go fishing in their spare time in this group? Okay, it's stabilizing. Oh, someone was influenced by what I said. This
is interesting. I don't know how many of you have studied zoology, but fish don't have lungs. That's an interesting observation: a zebrafish does not have lungs, so if you want to compare to the zebrafish lungs, you're going to have a tough time. That's where it's interesting that experts are annotating databases and sharing their expertise with you, because of course you cannot know everything, right? So it's useful that we provide you this information, so that you can find it even when you did not study everything, or when you studied it but don't remember it.

When we compare expression, we want to compare homologous anatomical structures. This is exactly the same logic as comparing orthologous genes: you want to compare what is comparable. So when you compare expression between two species, you want to compare the orthologous genes, but you also want to compare their expression in the homologous anatomical structures.

I actually don't have the lung here, but if you want to compare the human lung to something in a fish, the homologous structure is actually the swim bladder. The swim bladder in teleost fishes is an organ which allows them to change their buoyancy, that is, how dense they are relative to the water: they put more or less gas in it, and it allows them to float up and down in the water. In teleost fishes this actually derived from an ancestral organ which allowed inefficient exchange of gas with the atmosphere. In the ancestor of tetrapods, land vertebrates such as us, this inefficient organ was selected to become more efficient and eventually became the lung. And in lungfish, which are actually closely related to tetrapods and not to teleost fishes, you have both: you have the gills, which allow them to exchange oxygen with the water, and a lung, not so efficient, but which allows them to survive when there's no water. So if you compare the human lung to something in zebrafish, it should be the swim bladder if you
want homology; if you wanted the same function, it should be the gills. If you look for the lungs in zebrafish you won't have much luck: there are none.

One way to do this is that we have put into the Uberon ontology of anatomy all the knowledge we have from the literature about homology at the anatomical level. We did not do this by comparing the expression of everything against everything, because then it would be circular. If we use the expression to know what we should compare, and then we compare after that, then we are comparing what by definition had the highest expression similarity, and we don't know what the biology is here; we just know that what is most highly correlated is most highly correlated. It's circular. So we want to have external information. We read the literature of paleontology, of developmental biology, of comparative zoology, to know which anatomical structures are homologous between species, all the way down to the level of animals, Metazoa. And at every taxonomic level, like the taxonomic levels you can have in hierarchical orthologous groups, we have a level of homology at the anatomical level.

Now, this makes a horrible file that you don't want to read, but we made a tool on the web so that you can do this easily. You put orthologous genes here; here are different genes which are all known to be brain-specific. From the literature we know the human gene is brain-specific, and we put the orthologs in various mammals here. We search, and we get that these 13 orthologs are all expressed in the brain, that this is also where there is the biggest score among these genes, and that the next is the central nervous system. So there is conservation of expression of this gene in the brain of various mammals, and here we can guarantee you that we took the homologous structure. The brain for mammals is kind of trivial, but if, for example, you started with human data and you wanted to find the comparison
in zebrafish, because you are interested in verifying some biomedical hypothesis in zebrafish, it would automatically take only the structures which have homologs in zebrafish.

I spoke before about gene set enrichment on the Gene Ontology, but I said this is in principle something you could do with anything for which you have an annotation. So we can also do this with our annotation of genes expressed in anatomical structures, and that's what we've done: we've created a tool called TopAnat, which allows you to find the top anatomical structures for a gene list and whether they are significant. The principle is exactly the same as for a Gene Ontology test, but instead of associating a gene to a Gene Ontology term by annotations such as those in Swiss-Prot or in the model organism databases such as MGI, ZFIN, FlyBase and so on, we associate a gene to an anatomical structure by gene expression.

So for each anatomical structure, for example for brain, you would have, in your gene list of interest, for example genes selected in human, those which are expressed in the brain or not expressed in the brain. And for the other genes, for example all the genes for which you could find orthologs on which to compute selection, you will also have whether they're expressed or not expressed. Then you can do a Fisher test, a hypergeometric test, a chi-square and so on, as usual; we have implemented only Fisher and hypergeometric. Both for the Gene Ontology and for Uberon, as I told you, we have relations between the terms, which means that the comparisons are not independent. I'm not putting slides here on the detail of how this works mathematically, but we allow what we call a decorrelation, which allows you to make these comparisons more independent.

On the website it looks like this. Here are genes which were chosen because when you knock them out you get a phenotype in the fin, in the pectoral fin, in the zebrafish. So you knock out this gene and the fin looks weird. You put all these
genes into Bgee TopAnat, we run this comparison, and we find that they are seven times more frequently expressed in the pectoral fin than expected by chance, and this is highly significant: very low p-value, very low false discovery rate. You can do this on any gene list you like and on all the species for which there is data in Bgee. A disadvantage relative to the Gene Ontology is that you don't have all the aspects of function, you only have expression. An advantage is that if you are working in a species other than, say, mouse or Drosophila melanogaster, which are very well studied, we guarantee you that this is the gene expression in that species: we only use the gene expression curated and annotated in the species you are studying. Whereas with the Gene Ontology, because most annotation is done by orthology, if you look at the function of genes in, say, chimpanzee, you're actually looking at what the function of these genes was in mouse or human, because that's what was studied; no one studied function in chimpanzee. If you're looking in flatfish, no one studied function in flatfish: you're getting the zebrafish or mouse data, or maybe fly. For the expression, if you look here at the expression in chimpanzee, it will give you only the expression in chimpanzee, even if there is specifically different expression in chimpanzee than in other species.

Both TopAnat and Gene Ontology enrichment analyses have pitfalls you should be careful of, and the number one, which is really important and which I cannot insist on enough, is the background. If for example I look at genes which have orthologs between human and chimpanzee and are under selection in human, my background should absolutely be the genes which have orthologs between human and chimpanzee. If my background is all the human genes, then I am not testing genes which have selection relative to genes which don't have selection; I'm testing genes which have an ortholog and selection relative to genes which either don't have an ortholog or don't have selection,
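As a hedged illustration of this background pitfall, here is a minimal sketch of a one-sided hypergeometric enrichment test, using only the Python standard library. It is not the TopAnat implementation, and every number in it is hypothetical: a genome of 20,000 genes, of which 12,000 have an ortholog; an anatomical structure in which 2,000 genes are expressed, 1,800 of them having an ortholog; and a list of 300 genes under selection, 45 of which are expressed in that structure.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, hits):
    """One-sided hypergeometric test: P(X >= hits) when `selected` genes are
    drawn without replacement from a background of `universe` genes, of which
    `annotated` are expressed in the structure (or carry the term) of interest."""
    max_hits = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
list_size, list_hits = 300, 45

# Wrong background: all genes in the genome.
p_wrong = enrichment_p_value(universe=20_000, annotated=2_000,
                             selected=list_size, hits=list_hits)

# Right background: only the genes which have an ortholog, because only
# those could ever have been tested for selection.
p_right = enrichment_p_value(universe=12_000, annotated=1_800,
                             selected=list_size, hits=list_hits)

print(f"background = all genes:           p = {p_wrong:.4g}")
print(f"background = genes with ortholog: p = {p_right:.4g}")
```

With these made-up counts, 30 hits are expected against the whole genome but 45 against the genes with orthologs, so the apparent enrichment comes entirely from the choice of background.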
and if a lot of genes don't have an ortholog, that's the main thing I'm testing. You have to be super careful about this. The background might be the genes for which you can find an ortholog, it might be the genes which are on your microarray, it might be the genes for which mapping is possible in your RNA-seq experiment, but you should always be very careful about this.

You test many Gene Ontology terms or many anatomical structures, so there is multiple testing; we provide you an FDR measure. There is also the PANTHER DB tool, which I'll give you a link to afterwards. And then there is non-independence of terms, which means that, of course, the cerebellum is in the brain, so if there's more expression in the cerebellum there's more expression in the brain; of course a hydrolase is an enzyme, so if there are more hydrolase genes there are more enzymes. For this there are algorithms to decorrelate this non-independence, which come from the package topGO, which can be used, as its name indicates, on the Gene Ontology, and which we use in TopAnat on anatomy.

The Bgee data is available from the website, which is what I've been showing you, and we also have R packages, which I'm not demonstrating today because that would be a whole different course, but we can give you a link to a course we gave specifically on Bgee in June. If you want to use Bgee data, go wild: it's CC0, which means there is no copyright at all. You don't even have to cite us, legally; you can just take it, repackage it and sell it under your name, and it's legal. It's just not very polite, so we ask you to please cite us, because that's the polite thing to do among researchers. And there are packages, with the links here, which you have in the PDF I gave you: the first package allows you to retrieve Bgee data and run TopAnat, and this one allows you to call expression present or absent from RNA-seq according to our methods. So that was my presentation of Bgee. I will now look at the questions, and we can also take a short coffee break if you want, but I will first look at the questions. So I will stop
sharing. So let me see, how do I see the questions now? OK, so there are questions in the slide, in the Zoom, sorry. Good morning, good morning, good morning.

Could I provide the citation? I think it's on the slide. Let me check, because I gave you the PDF slides, right? So the citation should be on the slide. Let me check. Yes, it's on the slide.

Rob Waterhouse says that the point I make on background is very important. Thank you. OK, Rob put PANTHER DB here, but I'll also put it in the live doc.

A question from James Odd: many gene expression experiments will have data from multiple treatments; is all the data incorporated into the annotation, or just the data from control conditions? For Bgee, only the data from control conditions. So if you took some, I don't know what animal is interesting to you, say you took some flies, and on some of them you put an endocrine disruptor, or you put them in very high heat or something, and others you left alone as a control, we will only take the gene expression from the control. This allows us to cover as many gene expression data sets as possible, because many data sets where there's an intervention also have a control, hopefully, while guaranteeing that what we integrate is only wild-type, healthy, non-manipulated expression.

What's the difference between intergenic and intron? Introns are part of genes; they are transcribed. The definition of an intron is that it is transcribed and then spliced out, so there is as much RNA coming from introns as from exons produced in the cell; it's just used differently. Intergenic means it's not in a gene at all, so it should not be transcribed: if there were zero errors in the machinery, if the cell functioned perfectly, intergenic regions would never be transcribed. The definition of a gene is a whole other question, apart from the definition of function, but let's for our purposes here define a gene as something which is
transcribed to make either a functional RNA, like ribosomal RNA, a microRNA and so on, or to make a protein. Then intergenic is what is not in any of these genes. And within the boundaries of these genes, for example for a protein-coding gene (everyone knows protein-coding genes), there is not only what codes for the protein. There is the 5' untranslated region, which is transcribed but not translated; then the translation starts. You have exons, which include the untranslated and the translated parts; then you have introns. So you would have transcription of all this, then splicing out of the introns, and then translation will start at the translation start site and go to the stop codon, leaving at the two ends the 5' and 3' untranslated regions. It is legitimate to get RNA from all of this: it was all transcribed, which is what makes RNA, whereas the intergenic regions are regions which are not transcribed at all.

If I found transcripts mapping into an intergenic region, does it mean the gene is misannotated? No, because the transcription machinery is not perfect. There is noise in the cell; processes are noisy. So what we do to define reference intergenic regions is that we take all the RNA-seq libraries available in healthy conditions for an organism, all of them, across all conditions, all sexes, all anatomy, all ages, and we map them all. Then, if there are regions which are annotated as intergenic in the genome but where we actually find a lot of reads mapped, and we do, especially in non-model organisms, we say this is suspicious: we don't think it's a true intergenic region, we think there was a misannotation and in fact there should be a gene here. Now, it's not our job to improve gene annotations, so we don't do that for now; we just put those aside. And those regions where there is no gene annotated and there are very few reads which map over all libraries, all experiments, those we call reference intergenic, and
that's what we use then for each library to estimate the noise. And also there is what Rob said, that it could be low-complexity reads; that depends on the parameters of your mapping, obviously.

How often do you update the database with newly submitted data? The serious answer is: not often enough. We work continuously on updating behind the scenes, but we push updates to the web every one or two years, and we are trying to do this faster now. So we have one which came out earlier this year, one which will come out at the end of this year, and another probably in spring 2021. I should also say we make small updates, where we add data to the species which are already there, and big updates, where we add new species. Right now we have 29 species; in the next big update we'll have more than 60.

What would be an estimate of outliers among data sets of a given tissue? Would you provide a flag for these data sets for other researchers? Stefan Boots, could you take the mic and explain your question more, please?

Yes, sure. Maybe if you're analyzing, I don't know, over time 20 testis data sets for mouse gene expression, would you often see outliers occurring where you say, OK, this is for sure not the testis, or this is maybe bad quality? And would you then maybe have a kind of blacklist of data sets where you say, maybe don't trust this data, maybe go to these?

Yes. So first, for the end of the question: however we get there, we do have a blacklist, and it was a project to publish it separately. We haven't yet done this because it's kind of delicate to publish an official list of "don't trust these data sets". We're discussing it also with people in some other databases; Swiss-Prot, for example, also has this experience that there are papers they know are false. Should they say so? How we do this is data-type dependent, so we have different quality controls. Clustering such as you describe we do in single cell, which is not yet available on the website right
now; it will be available early in 2021. For microarrays, for example, we have a quality control which correlates very well with what you describe, but it's not how we do it: we do it rather by a measure of the dispersion of the genes. I'm not going to get technical here. The danger with what you're saying is that we have different conditions, and some we capture in the annotation and some we are not able to capture; biology is very diverse. So suppose we had data for many virgin female flies, and now we get data from one mated female fly, and the expression is different, and we didn't capture the fact of virgin or mated. Then we should not say it's wrong: it's also a bona fide female fly, it's in a different condition. We see this when we dig: there are things which look like outliers but are different conditions. So we use other criteria, which often correlate strongly with the fact of being an outlier but are more objective than that. Otherwise it's very difficult.

Do you have any similar database in mind for plants? This is like my most frequent question, and no, I'm sorry, I don't have enough funding to do it also for plants, and I don't know anyone else doing it. But yeah, it would be cool.

Could you quickly go over the importance of the background, and why is that important? OK, so I will share a slide again. Or wait, I will try something: I will try to use the whiteboard here. Let's see how this works. I hate drawing on these things, but let's try. OK, you all see my whiteboard? So now, what we do: we have a contingency table (you see that I draw super well) where we have what interests me and what doesn't interest me, and I have having a function or an anatomical structure, F, and not having this function. And the sum of these two is what I will call my universe, which is everything, or the background. I draw super well, it's a pleasure; I feel like I'm back in kindergarten there. OK, now I'm doing a contingency table on, let me
change color. How can I change color? I don't know. OK, too bad. I'm doing a contingency table on this. This is from my measures; this here is the difference between these two sides, everything I have and what is in my list. So what's in my contingency table depends strongly on what I call my universe. Let's suppose that this is all the genes in the genome, and these are the genes which have a certain function. And now I'm looking at orthologs between two species, say between a human and a bird. In birds there's a big bias, because there are GC-rich chromosomes which are very hard to sequence, so there are parts of the bird genomes for which I don't manage to get an ortholog. And maybe this is biased; maybe these genes have my function less often than other genes. So it's like this. Now, if I compare this frequency to this frequency I have one result; if I compare this frequency to this frequency I have another result.

To give an example we had for real: the first time we tried this, we looked at genes which have a Gene Ontology term related to testis function, and we found many of them expressed in ovaries. That was weird. Then we found out that it's because there is a lot of Gene Ontology annotation from breast and ovary cancer data sets, which means that there's a lot of functional annotation for the female reproductive system and much less for the male. So if you take just all the genes in the Gene Ontology, not looking at which term, just whether a gene has a Gene Ontology annotation of some detail, and look where they're expressed, they're more expressed in ovary. So our background was wrong: our background should not be all genes, it should be all genes which have a Gene Ontology annotation. And when we did this, we recovered that these genes which had a function in testis were more expressed in testis, but not in ovary, which is what we expected. I hope this makes sense.

A flag for data
sets which reflect the tissue the most? Hmm, that's an interesting question; I'll think about it. It's always delicate to tell people which data sets are good or bad, because the data sets are made by people, they are published, they are part of students' CVs, of people's careers. It's an interesting question, I'll think about it. What I can tell you is that if you're interested in human, the biggest data set in human is GTEx, and the GTEx data set is very good but is not directly curated by GTEx: they take everything that the pathologists give them. So if they say "give me this piece of brain" and the pathologist says "here it is", they use it. But the pathologists also write a report, so the curators of Bgee read all these reports, which was a lot, and we removed about 50 percent of the GTEx data, which is not from healthy individuals or not actually what they say it is, because the pathologist wrote "I tried to take the structure, it was hard to dissect, and I also got a bit of the neighboring structure". But you also have things like the brain of someone who has Alzheimer's, or the liver of someone who has a liver disease. We removed entirely all the samples from people who died from drug overdose, because we said the gene expression is not normal. So you can access, through the Bgee R package for example, the subset of GTEx which we curated as healthy and well annotated. That kind of half answers your question, I think.

OK, so five minutes later, I'm ready to restart; I hope you are. I'm just setting up the right things. So, part two: we're going more into comparative genomics and evolution, with patterns of gene and genome duplication. Let's start with a little poll: what's more common? So the race is on: point mutation is in the lead, but duplication has a strong camp in its favor. It seems to have been stable for a while: it's three quarters point mutation, one quarter duplication. So how frequent is duplication in evolution? Well, actually quite frequent. This is a rather old paper, but I've checked, and its results have kind of
stood, not too badly, the test of time. They took the genomes available in 2000 (for those of you who were born after 2000, we had actually already started doing genomics), and these are the genomes we had available at the time in eukaryotes; that's all of it, that was complete comparative genomics in 2000. For each genome they looked for genes which have a homolog within the same genome, so paralogs by definition, and they looked at their divergence. The idea is that genes which diverged recently have lower divergence and genes which diverged a long time ago have higher divergence. What they found was this exponential decay here, in every species they looked at, and exponential decay is what you expect if you have very frequent birth of duplicate genes and very frequent loss. They estimated in this study that the frequency of duplication is on about the same scale as the frequency of point mutation. I've tried to update this, because it's from 2000; I found several papers which update it, and they always get to the same scale. Then how exactly you measure point mutation frequency and how exactly you measure duplication, there's always a lot of debate, but let's say duplication is really not rare: it's on the same scale as point mutation. That was kind of surprising, because I would have voted like the majority here 21 years ago; I mean, we expected duplication to be a rare event. And that's because of this exponential decay: if you don't look at the recent duplications you don't see so many, because most duplications are negative.

And actually, I thought about it, but we don't have time in this course to have a discussion about the differences between what we see in evolutionary genomics, comparative genomics, on the one hand, and medical genetics, medical genomics, on the other hand. When you look at the frequency of events in diseases you get completely different statistics, because in disease you get the things which cause phenotypes which decrease
fitness, whereas in evolution we see, kept over the long term, genetic modifications which either do not change fitness or increase it, and very rarely those which decrease it. So for example you will never see trisomies being fixed in evolution, because if you have an imbalance of chromosomes it's negative; but it happens of course in medical genetics, as we all know.

Now, there are different types of duplication. I just spoke of trisomies, and here I'm illustrating them rapidly with the Hox cluster. Everyone knows the Hox cluster, I hope; it's one of the most famous gene clusters in biology. So, rapidly: these are genes which are transcription factors, which are found in a cluster along the chromosome. They were first discovered in fruit flies. So you have these genes which are in a cluster along the chromosome, and they are expressed in embryonic development, notably from one end of the embryo to the other according to their order in the cluster. So that's very nice, and you find them in all animals, and most times it's preserved as a cluster. I'm not going to go into the evo-devo biology of this cluster now, but just to say: how do we get this? If we have a cluster of genes which are actually all Hox genes, so all actually homologs, it means that they are paralogs and that they came from tandem duplication, local duplication. So we had a gene which duplicated locally to make another gene, and then maybe they duplicated to other chromosomes and came back together, and in the end you got this array of genes which were duplicated in small-scale duplications, one gene at a time. But then, for example, if you take human or mouse, you have four Hox clusters, and these four Hox clusters all have at least part of the cluster, in a way that you can see that the whole cluster was originally duplicated. So here we have a large-scale duplication which duplicated the whole cluster, and actually what we now know happened is that the whole genome was duplicated. When you duplicate the whole genome, the whole
cluster is obviously duplicated, because it's part of the genome, and so now you get not one Hox cluster but two. And if you duplicate the genome again, which happened in the ancestor of vertebrates, with two whole genome duplications, you get two times two, four Hox clusters. And in zebrafish, for example, there are seven, because there was another genome duplication in fishes: four times two makes eight, then you lose some and you get seven. So these are the different types of duplication which most affect evolution. There are also big segmental duplications, but they often lead to the kind of problems we see in medical genetics, so they're rarely kept. So what you usually get is either small-scale, one or two genes, or whole genome.

Whole genome duplication is of special interest to me because it allows us to study the evolution of gene duplication in a very controlled manner. All the genes duplicated in a genome duplication were duplicated at the same time, with all their regulatory elements, and so if we study their evolution we have a controlled experiment, a kind of natural experiment, where all the genes which we are comparing have the same age and the same history. Genome duplication has become a big topic in comparative genomics since the late 90s. The first eukaryotic genome which was sequenced was the yeast Saccharomyces cerevisiae, and Saccharomyces cerevisiae was sequenced because it was simple; it was studied because it's simple, it's supposed to be a simple eukaryote, and it has a small genome. To their surprise they found that there were lots of blocks, here colored, which were homologous between chromosomes. When I say blocks, it's whole groups of genes which keep their order, and if you paint the chromosomes with these homologous groups you paint almost everything. This is the 1997 paper; it's been improved since, but what we find is that in fact most regions of the genome have a double somewhere else, a paralog. So this means there was a genome
duplication in the ancestor of yeast. So even a simple organism like S. cerevisiae had a genome duplication in its evolution. Genome duplications happen, of course not as frequently as gene duplications, but they happen. And an interesting thing about genome duplication is that it duplicates everything, but then not everything is kept. I showed you before that for small-scale duplications there is huge loss, there's this exponential decay; it's not quite the same mechanism for genome duplication, but you have a lot of loss. And so what you will have here, this is from the study in 2004 which proved there was genome duplication in yeast. The 1997 one suggested it; this one proved it, by comparing to an outgroup. You've heard of outgroups; outgroups are important. So if you had this gene order on a chromosome in the ancestor, and you have a split, a speciation, so orthologs: in this lineage, no genome duplication, so let's say that nothing changes and you get to this. In Saccharomyces you have a genome duplication. Now you have everything exactly double: immediately after the duplication of the genome, all the genome is double. But these genes which are in double are not always useful; you don't need both genes, so you can lose one, and you lose randomly one or the other copy, because they are symmetric at the start. So now you have this feature here where many genes are in only one copy: you have only one copy of gene one, only one copy of gene two, two copies of gene three, only one copy of gene four, and so on. So if you count the genes in duplicate between these two blocks, which is what first allowed people to suggest the genome duplication, you have not so many. So it was suggestive, but didn't convince many people, because it's not overwhelming. But now, if I compare these two blocks to an outgroup which is similar to the common ancestor, I find that these two chromosomes have an alternating orthology to the outgroup: on this copy, one, I have these genes which are orthologous to the outgroup, and here those genes, and so
sometimes I have double but often I only have one, and that alternates like this, because these two regions of chromosome come from this duplication of the genome. Thanks to such studies we can estimate the loss and see that most genes are indeed lost after genome duplication, but quite a few are kept; otherwise we would never have any evidence. So on the order of 10 to 20 percent, depending on the genome duplication, will be kept in the long term.

So now I'm going to ask you to do something a bit different on the Wooclap, which is not a poll but brainstorming. When genome duplication was first suggested to be important to evolution, in 1970, it was by Susumu Ohno, who said vertebrates, and especially mammals, are too complicated to have developed only by point mutations, so there must have been massive duplication, which allowed the complexification of vertebrates and mammals relative to invertebrates. So I'd like you to think about this; I'll give you a few minutes. I'm going to go to the Wooclap and stop the previous one. OK, so here you can write what you think is a higher animal and what is the relation between genome duplication and higher animals, and you can also like the answers of other people. I'm not going to share for now; I'm letting you do it. So I see some answers on what is a higher animal; you can also answer the second question, on what is the relation to genome duplication. Yeah, thank you, one answer. And as I said, you can also like other people's replies. You cannot dislike them, because it's a very polite environment here.

OK, so I'll share the screen so we can look together. So, some answers on higher animal: evolved more through time; two people have liked that, so three people have written it. A concept that assumes there are higher and lower organisms, obviating tree-like evolution; this is a complicated way of saying the same as the "no such thing" answer, I think. A more complex genome; more complexity; subjective; more duplication; more brain; further down the evolutionary
intellectual. So I kind of see three clusters here: those who say there's no such thing; those who say that it evolved more or longer; four clusters, then: those which relate it to behavior and intelligence; and those which relate it directly to the genome, so more complex genome, more regulation, more redundancy. And on the relation to genome duplication: it increases variability; provides more genetic raw material; does not necessarily lead to more complex functions; provides the possibility for the same gene to evolve differently; allows for redundancy; increased neofunctionalization; allowing increased expression of beneficial genes; increased possibility for evolution. So there seems to be a general trend of saying that it increases possibilities; basically it opens doors, right? I leave this open if you want, but I will go back to my slides.

So, whole genome duplication is found repeatedly in plants, and a bit in animals, and also in other organisms. Here I have some slides on plants and animals, but for example, as I showed you, in yeast; there are also multiple genome duplications in Paramecium, which is a ciliate. So this is a review from 10 years ago, but it gives you an idea of the distribution. Here you have Paramecium; here you have the yeast; here you have, in animals, the only ones known at the time, two genome duplications in vertebrates and one in fishes; and here you have plants, where you see them all over the place. So genome duplication is quite common in plants, actually. I don't know if any of you work on plants, but if you think about plant breeding or the diversity of plants, you have many hybrids, many tetraploids or octoploids, and some of these hybrids or tetraploids will eventually get fixed in evolution and become genome duplications. So genome duplication is quite common in plant evolution, much rarer in animal evolution. And let's say, when we discovered this one at the origin of vertebrates, it was very intuitive to say: hey, we are vertebrates, we feel kind of
superior to those stupid flies, and we have genome duplication; there must be something to this.

Sorry, we don't see your slides.

You don't see my slides? Oh, I'm sorry, thank you for telling me. I forgot to click; I didn't click the... OK, sorry, sorry, sorry. Back one slide: I said this, then this. Sorry. OK, so, start again. These are genome duplications in various species known in 2009. Here is Paramecium, yeast; here are plants, with many genome duplications as you can see; and here are the animals, with only vertebrates, because those are the only ones we knew, and you see two in vertebrates here. And as I was saying, when we found genome duplication in vertebrates, it was kind of intuitive to say that's linked to our complexity as vertebrates. But then that doesn't really square with the fact that there is also genome duplication in teleost fishes; there are more, which are not shown here, in other fishes; there are some in Paramecium, which is not known to have especially complex behaviors, for example; and plants have way more genome duplications than animals, including vertebrates. So it's not clear in any way how you could link any concept of complexity to genome duplication, as far as the phenotype goes. And as far as higher animals go, I would agree with those who said there is no such thing; it's a subjective thing. There is no objective definition of higher animals: depending on who you speak with, they will say it's humans, it's primates, it's mammals, it's vertebrates, whatever. The only common point people would have is to include us, basically the person speaking. And the same would be true for higher plants: something which is going to include, in the end, Arabidopsis, but apart from that it's not very clear. So you should realize that all these species have evolved for the same time, right? All vertebrates, for example, have had the same amount of evolution, the same time of evolution, from the Cambrian to us.

This is an updated slide of genome duplication in plants. You see that it's really very common; the dates are
sometimes unsure, but it's really very common in plants, and unrelated to any clear features such as complexity. There's a lot of debate on whether genome duplications in plants are linked to increases in speciation rates, and to my knowledge this is not yet solved. So if there is something, it is very weak; it's not a strong feature which would be easy to see. And for example, yesterday Rob told you that the biggest number of animal species is in arthropods, and the biggest number of species in arthropods is in beetles, and there is no genome duplication in beetles. So you can clearly have diversity without genome duplication.

And I told you there are more genome duplications known in fishes. This is a recent paper on genome duplications in vertebrates. You see that there are two at the origin of vertebrates. Here there's a more recent one, just in one frog, Xenopus laevis. There is one at the ancestor of teleost fishes; there's also one in cyprinids, goldfish and carp; in salmonids, salmon, trout and so on; and in sturgeons we're not clear how many there were; some sturgeons are octoploid, it depends on the species, but it's very diverse. And sturgeons and gars, for example, are fish which did not have the teleost radiation. Almost all fish you can meet are teleosts, this is like 99 percent of fish species, and these are sister groups which were not so successful, and yet this one had lots of genome duplications. So it's not very easy to relate these things.

As I told you before, not all genes are kept in duplicate after genome duplication, and this is actually very interesting, because it is not a random set. This was shown originally by a paper in 2004, which as far as I'm concerned is one of the really classic papers of comparative genomics, because they showed the importance of controlling. So if you look at duplicate genes in one species, for example S. cerevisiae, you can compare the genes which are duplicated and the genes which are not duplicated. And the genes which are duplicated differ from the genes which are not
duplicated in two ways. One, they evolved after duplication, so anything which happens after duplication happened to them; that's the obvious thing, that's what we're usually interested in. But two, it means they duplicated and they were kept in duplicate. How do we disentangle these two features? When I compare these genes to those genes, they have these two differences. And so the idea of this paper is to use outgroups. You have the insect ortholog of a gene which did not duplicate in yeast or C. elegans, and the insect ortholog of a gene which did duplicate, and now I can compare the properties of these. These properties are independent of the evolution after duplication, so they only depend on what the ancestral properties were which made some genes be duplicated and kept in duplicate and other genes not kept in duplicate.

One of the ways to characterize these genes is codon evolution, so I'm going to do a super rapid one slide on dN/dS. dN, also called Ka depending on the study, is the non-synonymous substitution rate: you change the codon, the DNA of a gene, and you change the amino acid, so presumably the function which is carried by the protein. dS, or Ks, is synonymous changes: you change the DNA but you do not change the protein, so presumably you do not impact the function which is carried by the protein. So to a first approximation only the non-synonymous changes are seen by selection on the function. If it's neutral, if there's no selection, then the two rates will be the same, so the ratio will be one. So if the protein is evolving neutrally, the ratio will be one. If, as in most cases, there is selection to keep the function, not to break it, it will eliminate more of the substitutions which change the protein and fewer of the substitutions which do not change the protein, so the ratio will be less than one: dN will be smaller than dS, Ka will be smaller than Ks. So when you have negative selection, this ratio is less than one, and you can
quantify: the stronger the selection is, the lower this is. For histones, which have super strong selection because if there's any change you break the chromatin and you're dead, dN is essentially zero, so the ratio is zero. Some other genes which are kind of important but not so important can be at 0.5, for example. And if there is positive selection, selection to change things, so there's a change in the amino acids of the protein which is improving the fitness of the organism, improving the function of the protein, then this will be fixed by selection faster than it would be fixed by drift. So there is a higher rate of change of the amino acids than of the synonymous DNA, and the ratio becomes larger than one. Again this can be quantitative: the stronger the selection to change, the higher this is. You typically see very high dN/dS ratios, for example, in immune system genes which are in an arms race with pathogens. Why is this important? Because it allows us to compare the selection between different categories of genes depending on their duplication history. What we see here is that for the genes which were kept in duplicate in yeast, their Ka in insects is lower, so they had fewer non-synonymous changes fixed during evolution. And I remind you, on this graph we're looking at the orthologs here, so there's no impact of the evolution after duplication. This is just saying that the genes which are kept in duplicate are the genes which were under stronger purifying selection: the Ka is lower. The same for C. elegans. The same is also true for the fact of getting insertions and deletions. So the genes which were kept in duplicate are not random genes; they are genes under stronger purifying selection. And there are also differences in function and in Ka, or dN, the non-synonymous rate, between the genes which were kept in duplicate after small-scale duplications and after whole genome duplications. So the whole genome duplicates have even
lower Ka; they were under even stronger purifying selection. So the genes which did not duplicate were under the weaker purifying selection, those which were kept in duplicate after small-scale duplication were under a bit stronger, and those which were kept after genome duplication were under even stronger. So you see that if I look at the evolution of the function of genes after genome duplication, I should take into account that this is a biased set, also biased in terms of function. This is a very naive, it was 2004, gene ontology test, and we see that there are functions which are much more common; for example, here in the whole genome duplicates there are many more structural molecules. We see this also, for example, in fishes. This is a study we did as a follow-up to this, and we find, notably and interestingly, that there are many more developmental genes and behavior genes, neurological genes, kept after genome duplication in fishes, which kind of goes in the direction some of you are writing about, saying that genome duplication allows more complexity of behavior. At the same time, humans did not have that genome duplication, so we should temper this. And there's also the same observation that for the genes which were kept duplicated after genome duplication in fishes, their orthologs in other species which did not have this genome duplication, here human and mouse, have lower non-synonymous substitution rates. Here it's a higher bar for lower Ka or dN. This duplication bias was studied in the early 2000s. I just showed you evolutionary rates of genes, but it's also true of other features, and notably we looked into gene expression more recently. In fact that's how we invented TopAnat, to make this study. So sometimes the biology motivates the bioinformatics, and sometimes the bioinformatics allows the biology; the two improve each other. So we looked at where genes which are kept after genome duplication in fishes are expressed, and we find that these are all the anatomical structures which
are enriched, and if you look at those in bold, they're all in the nervous system. There's a hugely significant enrichment of genes which are expressed in the nervous system among those kept after genome duplication in fishes. This is also seen if we look quantitatively at the level of expression: the genes which are kept in duplicate, here and here, are much more expressed in the nervous system than the genes that are not kept in duplicate. And this is also true for the orthologs in mouse. So it's not that they evolved after duplication to be expressed in the nervous system; it's that those which were preserved after duplication were already expressed in the nervous system, otherwise we would not see that expression in mouse. Now this is an interesting question of causality, because we know from the old studies that the genes kept in duplicate after genome duplication evolve slower, and there are also studies from the early 2000s which show that genes which are expressed in the brain in vertebrates and also other animals evolve slower, because there is selection against misfolding of proteins. There is very strong selection on proteins highly expressed in the brain, because the cells in the brain do not divide during life like in other organs, and so it is very difficult for these cells to get rid of misfolded proteins. So we have genes which are kept after duplication which have low evolutionary rates, genes which are expressed in the brain which have low evolutionary rates, and genes kept after duplication which are expressed in the brain. Which way does this go? What we expected is that either the genes which are kept after duplication always have low dN/dS and thus are enriched in genes expressed in the brain, or that the genes which are kept after duplication are always expressed in the brain and thus have low dN/dS. What we find is actually a weird mix, where among the genes which are expressed in the nervous system the duplicates evolve slower, but
when the genes are expressed elsewhere there's no difference in evolutionary rate. And if we compare duplicates which are expressed in the nervous system and duplicates which are expressed elsewhere, we find that the duplicates which are expressed elsewhere behave the same as the singleton genes, and the singleton genes which are expressed in the brain also behave the same as the genes expressed elsewhere. So we have only one category which is really under strong selection: the genes which are kept in duplicate and expressed in the brain. This was a very complicated observation which took us a while to understand, but in the end, oh yeah, okay, I will skip this one, we found a model which was published by another group on medical genetics and evolution, which is that if you have genes which duplicate, and some mutations of these genes give a very grave phenotype, then it is very hard to mutate these genes. If you have a gene whose mutations are very deleterious, then natural selection removes the mutations, and so it's very hard to lose the gene, because most mutations which make you lose the gene go through intermediate steps which are very deleterious. So genes whose function is very fragile, for example genes which can be misfolded in the brain, are kept in duplicate under strong purifying selection on the sequence, whereas for genes which are either expressed elsewhere or less sensitive to misfolding, there's less selection on the protein and you can lose the gene, because there's a larger target of mutations to remove them. It's an interesting observation, which has now been made by several groups in several species, that genome duplication tends to enrich your genome in what some people have called dangerous genes: genes which are targets for mutation, where the mutation, when it happens, is actually very negative. So here we have an effect of genome duplication which is not to improve function, which is not to increase complexity, but actually to increase the number of targets
for mutations which give you problems; it increases the targets for medical genetics, if you want. We find that many of the genes which are linked to genetic diseases are genes which were kept in duplicate after the vertebrate genome duplications, because they are hard to lose. The genome duplication duplicates everything, and because when you mutate these genes you are sick, the mutations in these genes are eliminated by natural selection, so the genes are kept, and so they are there as a target for selection. So in a way genome duplication makes our genomes, and even more so those of fish, more fragile. So you see that it's not easy to put a direction on evolution. Maybe that's something that Christopher and Rob already said, I don't know, but evolution does not have a direction, right? It's not improving things, it's just what happens. And maybe the selection, short-term and at the individual level, to keep these two genes is going to make the species more fragile in the long term, but there's nothing to do about it: natural selection does not see the future. So we enrich our genomes in dangerous genes, and they're kept by purifying selection. That's an interesting thing, because the intuitive idea from the 70s was that if you have duplicates kept, it's because they have a new function. The intuition is that it was positive selection: you get two copies, one has a new function, so that's advantageous, and so you keep the two copies because one is advantageous. But here what we see repeatedly is that what we have is genes which are fragile, which are under purifying selection, which are expressed in a complex organ, and which don't necessarily bring new function; they're just hard to get rid of. So the duplication is a random event, it's a mutation, a macro-mutation if you want, but it's a mutation. Then the genes are kept in duplicate by purifying selection, which prevents you from breaking things, but it is not fixed by positive
selection; it's not advantageous. So you're going to get more and more genes without really an advantage for the individual or the species; at least short term, that's not what fixes it. Okay, I was afraid of being too long in this part, but I'm on time, so I'll be able to look at your questions, cool. So an important take-home on this is that if you look at paralogs and you want to look at the evolution of paralogs, you have to think that your set is always biased. First, a bias in generation. I didn't speak so much about this because I spoke mostly of genome duplication, but if we look at other mechanisms of duplication, for example, tandem duplication can more easily duplicate short genes. Because when there's a duplication of a region of the genome, you should not think that the mechanism of duplication, this is a random mutation, knows where the gene starts and stops; it's going to cut randomly. So the shorter a gene is, the higher the chance that you duplicate a DNA region which has the whole gene and not half of the gene, because if you duplicate half of the gene it doesn't work. And there are other biases. For example, another mechanism of gene duplication is retroposition: you have the gene transcribed into an RNA, and then there's a reverse transcriptase which makes this back into a DNA which is integrated in the genome. Now this will happen with higher probability if there's more RNA for the gene. And for it to be relevant to evolution, it has to happen in the germline, at least in animals: it has to integrate in a genome which we pass to the next generation. So if in my arm muscle an RNA is reverse transcribed and integrated into the DNA, it does not affect my descendants; the only way it affects them is if it's in the germline. As a result, the genes which are more expressed in the germline, in spermatogenesis or oogenesis, will be more duplicated by reverse transcription, and because there are many more cell divisions in spermatogenesis
then it's much more those which are expressed in spermatogenesis. So you have a bias towards genes expressed in spermatogenesis being duplicated by this mechanism. So you have a bias in the generation of duplicates, then you have a bias in the retention, as I showed you, genes which are hard to lose, and then when you look at the evolution of the paralogs you should think that this always affects you, so you should always control for it. You should not just say, I'm comparing the expression or the function or the dN/dS or the cellular localization of the duplicate genes to the non-duplicated genes in an organism, and I'm seeing the impact of duplication. I see the impact of duplication, but also before that the impact of retention of the genes, and before that, if it's not a whole genome duplication, the impact of the generation. So that's it for this part, which is a bit shorter, so now let's look at the questions. Okay: 'Such high levels of gene duplication would mean that organisms in the same population would have different numbers of genes in pretty much every generation. Has anyone calculated the half-lives of gene duplications?' So people have calculated the half-lives, but I haven't managed to find one consensus number; it depends a bit on organisms and studies. Even for the point mutation rate, it's relatively recent that we have a consensus number: the studies of families, of evolutionary divergence, and of cell culture gave different numbers. But to answer your question on populations: yes. The one species where we do really a lot of population genomics is humans, and we find a lot of what we call copy number variants. A copy number variant is a duplication which is not fixed but is right now polymorphic in the population. So a substitution, when it's not yet fixed, we call it a SNP, and a duplication, when it's not yet fixed, we call it a copy number variant. And we find copy number variants all over the human genome and all over human
populations, so it's very common, and it's actually a big source for explaining human variation in various aspects. 'How do we know about the loss of about 85 percent of the genes after whole genome duplication?' That's from studies such as the one I showed; this is a simplistic cartoon figure, but it's been done in more detail. I will try to go back to the slide and share it. Wait. Okay, so you see my slides again. When you have things like this here, you know that all the genes were duplicated, because it was a whole genome duplication, and you can count how many are now duplicated with a chromosome position and an age of the duplicates which is consistent with the genome duplication. So you can see which proportion of the genes now have a paralog consistent with the genome duplication and which proportion of genes do not, and that's how you get to about 15 percent, depending on organisms. That's for example what you get typically in the teleost fish genome duplication as well as in yeast: approximately 15 percent of the genes have this pattern of having both kept in duplicate, and 85 percent have this kind of pattern. Depending on the age of the duplication you get anything from 50 percent to 10 percent of the genes kept in duplicate. Usually once you reach 10 to 15 percent it stops going down, because these genes have been fixed for various reasons, either because they have become part of a network where you cannot break things by removing a part, or because they eventually got a new function or something. 'In all the example studies, what could be the impact of absence of information? For instance, besides expressed and absent, there might be a third category: information unknown. How can this be controlled?' Well, that's where it's important, depending on the type of study you do, to indeed have the proper control. So if you're doing, for example, an enrichment test, if I'm looking at genes expressed in a certain circumstance, for example in the brain, my reference should be genes for which I
have expression data. For example, that table I showed you, I didn't go into detail, but that table specifically is only from in situ hybridization data, which has the highest level of resolution, in zebrafish, and for this our reference was the genes which have been studied by in situ hybridization in zebrafish. But this will depend; you should always think, for every study, case by case: what are my potential biases and what are my controls? That's a very good question; you should always control. 'For dN/dS, are synonymous and non-synonymous mutations always identified by comparing the sequence of interest to an outgroup or common ancestor?' Yes, always by comparing to an outgroup. To calculate dN/dS, or Ka/Ks, you always need at least two sequences, but if I'm comparing paralogs between themselves, I need to compare them to an ortholog. If I go back to a slide there, actually I'll get back to the right slide and then I will share it. Okay, so sharing the slide. So here, that's why for this study, this was 2004, it's not 2000, but there were still not that many genomes available, it was important to have these two insects here, so that you could calculate dN/dS between the two. You couldn't just have one insect and calculate dN/dS on its branch; dN/dS always has to be in a comparison. Now we have phylogenetic models, maximum likelihood models, which are more complex than the original ones, which were just comparing a pair of sequences, but still, for the phylogenetic models you need at least four sequences, actually, to calculate dN/dS on the different branches. So yeah, you always have to be able to compare something. It's funny, every time I leave Zoom and come back I have to re-scroll down to the latest questions. 'Is whole genome duplication linked to speciation?' If you have the answer, call me. Seriously, it's been a big hypothesis for many years. For example, when genome duplication was discovered in teleost fishes: teleost fishes are the most diverse species group of vertebrates. It's about the same age as birds,
of which there are about 8,000 species, and there are about 25,000 species of teleost fishes, which is about half of all vertebrates. So you take all vertebrates, including sharks and birds and mammals and lizards and whatnot: half of these species are teleost fishes, and they had a whole genome duplication. So this was a hypothesis which was proposed in the literature, and there were also some anecdotal cases in plants of clades which had a hybridization or a genome duplication at their origin, which also suggested this. But then it's been extremely difficult to test, because you can always find an example. So I look at a species group, I find a genome duplication, cool; but now how do I control this? How do I make an unbiased study? Because we don't sequence random genomes, so our knowledge about genome duplications is not random. And it's also very difficult to estimate what happened in time. If a genome duplication happened, for example, in teleost fishes 300 million years ago, but the big diversification of teleost fishes was, I don't know exactly, 100 million years ago or something, that's a huge lag; it seems difficult to explain like this. And we know that groups like beetles are very diverse without genome duplication, and we know that some groups had genome duplications, like sturgeons, and went nowhere in number of species; there's a handful. So there's a whole debate among people who are specialized in modeling speciation and extinction rates, who regularly publish a new and improved model, find the opposite of the other team, and argue. There are some specialists of speciation rates who think that genome duplication does increase speciation rates, there are some who think it does not but that it decreases extinction rates, and there are others who think it has no impact. I don't know. I can tell you we tried to study this in the case of fishes. What we found is that we could find no relation between the timing of the duplications and the
timing of speciations in the fossil record. So we analyzed together the genomes and the fossil record, which was kind of fun, but we found no pattern, which was kind of disappointing, and it made us a PLOS ONE paper, which is what happens in such cases. So yeah, I don't know; I'd like to know. Are there questions? Can the assistants ping me on Zoom: are there questions I should look at in the Google Doc? No? Okay, so I suggest we make a five-minute break and start the next part a bit early, because I never know if I'm going to be a bit faster or a bit slower. Okay, so five-minute break, coffee, bathroom, and I start again. Okay, so welcome back. I hope you all had time to have a small break. I'm sorry, I'm such a slave driver, I'm not giving you any breaks, I'm awful. So I'm putting directly on the screen the next Wooclap poll; this way you can look at this. No love for paralogs for now. I feel the influence of Rob's talk here; he mentioned phylogeny for orthologs. Okay, so I leave the poll up, but it's pretty clear that what's winning is, first, that you need orthologs for the species phylogeny, and secondarily, that only orthologs can be compared or only orthologs share function, which are kind of two ways of saying more or less the same thing. And no one thinks paralogs are interesting, which makes me very sad, because I started using ortholog databases to find the paralogs. That's something else that bioinformatics teaches you: you can take a database of orthologs to get the paralogs, because they're the ones which are not orthologs. So I will stop sharing this, you can continue voting, and I will share my slides. Okay, so part three and last: evolution after duplication. So now that we know that we can characterize the function of genes somehow, that we know that they duplicate not infrequently, that this duplication and retention are biased, which we have to take into account, can we all the same see how they evolve after duplication? So we just did this, and most of you think orthologs are
important for species trees. I bet that's because Rob said so yesterday. I bet most of you don't do species trees in practice when you do comparative genomics. I should make a new poll on this: do you actually do species trees in your research? You know, the main reason most people want to distinguish orthologs and paralogs is that they want to annotate function, and they think that function can only be annotated through orthologs. And this is a rather old idea. So Susumu Ohno, I mentioned him before, he was the first to suggest that genome duplications were important in vertebrate and mammal evolution. He had a rather, let's say, outdated view of evolution: he thought that mammals are more complex and superior, and that this could not happen just by point mutations, but that you needed large-scale macro-mutations of gene or genome duplication to create the new material to make this complexity. And he has this very famous quote: 'natural selection merely modified, while redundancy created'. If you try to really think about this and put it in equations, it doesn't mean much, but it corresponds to a kind of intuition that when you have redundancy, when you have more copies of the genes, then you can get new functions. And this is a very strong intuition. For many years, so the terms ortholog and paralog were invented in 1970, and I did my PhD in the 1990s, and I was always told that you need to find orthologs because orthologs have the same function and paralogs change function, because after duplication you get new functions. This was not tested for a very long time, and only recently, in the last 10 years, have people started investigating this. And although I'm pretty proud to have published a paper in 2009, before anyone else was asking the question, I'm not so good at marketing apparently, and someone else invented the good, cool term, which is the ortholog conjecture. So that's what we call it: the ortholog conjecture is the idea that orthologs keep function and paralogs don't. So you can think of this hypothesis or
conjecture in a more phylogenetic way or in a more pairwise comparison way. If I think of it in a phylogenetic way, here I have a gene tree. Here are the species, sorry: species one, species two, species three, species four, species five, species six, species four again, species five again, species six again, species six yet again. Why are there species again? Because I had duplications, and this is a gene tree. In a gene tree you can find the same species several times, because after a duplication you get two copies of the gene, and so you have the species twice. And here, under the ortholog conjecture, let's assume that function evolves very slowly after speciation, almost not at all. The color here represents a kind of quantitative measure of function, like an enzyme activity or a level of expression, something you can quantify, and the letter is more a function that you would describe in words, such as a gene ontology term. So let's say that here, after the duplication, I have a jump: I have a huge quantitative change, I go from these variants of red to these variants of blue, or I have a difference in the function I describe, say I catalyze a different reaction, I bind a different ligand, and so the function goes from A to B. And here at this duplication it goes from B to C, or to a totally different color again. So in this view function evolves very slowly most of the time, and when there is a duplication, it jumps. And so the orthologs without any duplication at all have the same function. When I have one-to-many orthologs, such as between this species three and species four, there's this one and that one, then there's one ortholog which kept the function and one which didn't. So that's a phylogenetic way of seeing it. And the alternative, if this hypothesis is not correct, is that during evolution I have duplications, and in an unrelated manner I have, at random places in evolution, shifts in function, but the two are not related. And so this can mean that I can have, as here, two orthologs which have
close functions, but I can also have paralogs, here, which have close functions, or I can have orthologs, like here, which have very different functions. Now the thing which is important here is that if I compare few species, I might not see the difference between these two patterns. So if I'm only looking inside one species at the paralogs within the species, and these paralogs have different functions, for example I look in species four: here I have this paralog and that paralog; here they have different functions, A and B; here they have different functions, A and B. So the fact that I find paralogs inside one species, yeast paralogs, human paralogs, Arabidopsis paralogs, which have different functions does not tell me that the ortholog conjecture is true. It does not tell me that the duplication created the functional change. It just tells me that since the duplication, enough evolution happened that at some point there was a change in function, but maybe it happened related to the duplication, like here, or maybe it happened unrelated, like here. So to test this I need to make phylogenies and have some kind of measure of function that I can map on phylogenies. Another way of seeing it is to compare pairs of genes. These pairs of genes are either pairs of orthologs or pairs of paralogs, and they have a certain age, a bit like in the graph I showed you at the beginning of the previous hour, where we had duplicates which had different ages measured by synonymous changes. So I can use some proxy like this, either a phylogenetic estimate of age or synonymous divergence or something like that, to estimate the age of the pairs. And now, under the ortholog conjecture, at birth, I mean just after the speciation, orthologs should have very high function similarity. Oops, sorry. So here I'm measuring some measure of similarity of function: maybe it's that they have the same expression, maybe the same gene ontology terms, maybe the same interaction partners, some measure that
they have the same function. And the orthologs at speciation are almost the same, whereas the paralogs, already at the duplication, might be different, because maybe they duplicate into a different genomic environment or something. And then as time passes, evolution happens, so function diverges, so the function similarity decreases, but always, for a given age, the orthologs are more similar in function than the paralogs. So here you allow evolution to happen for both, but you see it happens faster and steeper for paralogs. This is the alternative: if there's no ortholog conjecture, they both diverge in the same way with time. And again, depending on the experimental design you have, you might see different things. If for example I'm comparing human-mouse orthologs and then human paralogs which come from the vertebrate genome duplication, then the human-mouse orthologs have this age and so are at this function similarity, whereas the vertebrate genome duplication is much older, so the paralogs have this age and this function similarity. So under both models, in that case, if the paralogs are older than the orthologs, obviously the paralogs are more divergent in function. So you have to be very careful, and that's something we saw when we started studying this in 2008-9: you have to be very careful of your study design. A lot of the early papers saying that duplicates diverge, and by early I mean papers from 2000 to 2010, just compared paralogs within one species, and they found they were different, and they said, hey, see the impact of duplication. But in fact any model where evolution is non-zero will have the paralogs a bit different. The question is: are they specifically different? Are orthologs more similar and paralogs more different? That's the question. And so this question, I would like to say that when in 2009 I published the paper saying, hey, we should study this, it had a huge impact, but the sad truth is no one read it at the time. What happened is that in 2011 a study was published giving
apparently experimental support for the ortholog conjecture being false, and this study had a huge impact, and then people read our old paper and everyone continued studying it. So in this paper from 2011 by the group of Matthew Hahn, they compared the similarity of gene ontology annotations, as well as the similarity of gene expression, measured at the time by microarrays, between orthologs and paralogs in human and mouse, according to the divergence between the genes, which gives you a measure of age. And what you have here, so this is pretty similar to the theoretical graphs I had here, but this is with real data, is the gene ontology annotation similarity. Here you have in red the orthologs, all the way down here, or down here, so not very similar. And up here you have in-paralogs, which are paralogs which duplicated after the speciation between human and mouse, all the way up here, and here within-species out-paralogs, so duplicates which are older than the human-mouse speciation, for example from the vertebrate genome duplication, but where both copies were studied in the same species. And what you see in these graphs is that not only do you not have the ortholog conjecture, you have the opposite: the genes which have the most similar function are the paralogs, and the genes which have the most different function are the orthologs. So don't start taking too many notes on this, because it's not correct, but it's an interesting error. There are two types of errors: errors which teach us something and errors which don't teach us something. And as a scientist you want to make the errors which teach you something, because if your aim is to never make mistakes, I have news for you: you will make mistakes. So will these be fecund mistakes? This one was, and actually this was very interesting. Christophe Dessimoz and I started a collaboration in 2009, after we published that maybe the ortholog conjecture should be tested, and they published a paper using gene ontology annotations to test orthology. So
we said, you know, to test orthology using gene ontology, let's do the opposite: use orthology and gene ontology to test the evolution of function. So we did this, and we got almost exactly the same graphs. We didn't choose the same colors, but apart from that we had the same graphs. But we didn't publish it at the same time as they did, because we found there was an error in this; there was a bias. And this is actually an interesting bias, because it shows how we have to be very careful in comparative genomics about what we compare and how. So here I'm going to summarize a rather complicated unraveling of a bias in one slide, so bear with me. If you think of the experimental labs you know, they usually study one species. You don't have many labs which study the orthologs of one gene in mouse, human, zebrafish, Drosophila melanogaster, C. elegans, Arabidopsis, yeast and so on. You have a lab which will study some process and the genes found to be involved in it in zebrafish, or in fly, or in Arabidopsis. And actually these labs are then part of communities which do the same type of experiments, use the same vocabulary to describe things, and exchange students and postdocs. So what we found is that if two genes were published in the same paper, so have their gene ontology annotation from the same paper, the similarity of function here is very high, and it's not different between orthologs and paralogs. If the gene ontology annotations come from studies published in different papers which share authors, so the same labs basically, the similarity is much lower, and again there's not a very striking difference between orthologs and paralogs. If I only take gene ontology annotations which come from papers with no common author, then the similarity is even lower, but here I have a difference: this is the color of the orthologs in this case, and the one-to-one orthologs are significantly above, all along. And now this is interesting, because same-species paralogs, so paralogs from the same species, say, uh, Hox C
one, Hox C two, Hox C three of zebrafish, would tend to be studied in the same papers much more often. 47 percent of same-species paralog annotations come from the same paper, but for orthologs it is much less often: only one percent of one-to-one ortholog annotations in the Gene Ontology come from the same paper. So what we have is that when you study things in the same paper, you do the same type of experiments, ask the same questions, answer in the same way, using the same terminology, the same vocabulary. And then there's the biocuration. Remember biocuration from this morning: the biocurators read the papers, they don't do the experiments. So they read a paper, and this paper on gene A and gene B finds similar stuff, with the same terminology, described in the same way, with the same type of figures, the same type of results, so they give the same Gene Ontology term. With different papers on orthologs, one did some set of experiments, the other did a different set of experiments, and they describe them differently. Since the Gene Ontology is in the end capturing words, which represent concepts, but the curators are reading the words, they're going to capture different things. And so we find that the orthologs tend to be captured with different terms in the Gene Ontology, because they are studied by different people, and the paralogs, especially the in-species paralogs, tend to be studied by the same people and so have the same Gene Ontology annotations. That actually explains this result here, that even the old within-species out-paralogs have more similar function than the orthologs: they were studied by the same people. This is not biology. This is actually very interesting, and I have a 45-minute talk just on this part, but I'm not bothering you with it today. You can really disentangle what in the Gene Ontology allows us to do comparative genomics or not, and it's actually very tricky to do comparative genomics
with the Gene Ontology. It was not meant for that, it's not meant to be compared, so you have to be very careful if you do compare, and these biases are at all kinds of levels. What we found at that point, in 2012, is a kind of tentative answer: if you correct for a lot of biases, and here I showed you one bias but actually this Gene Ontology annotation had at least four major biases, you get this difference here. On the one hand it is statistically significant, because we have a lot of genes. These are the numbers here, I don't know if you can read them, 160,000. OK, so it's significant. On the other hand, the effect size, which is what should interest us as biologists, is very small. What we want to know is not just whether the p-value is small, because in genomics you always get small p-values, because you have big numbers, and that's boring. What you want is a big effect size, something which actually explains a lot of biology, and here the difference is tiny. So if this is the final word, it means yes, orthologs are a bit more similar than paralogs, but really you shouldn't care. So we looked for other ways to do this. I told you the Gene Ontology is not very well adapted, so we and many other groups tried to look at gene expression, because gene expression, as I showed you this morning, can be compared between species, and it's a kind of objective measure of function. It only captures a part of function, but you can characterize it much more objectively. The problem is that there are also a lot of issues in comparing gene expression between species. If you use correlations, then you have biases, for instance the more specific genes have higher correlations. I'm not going to go into detail, but there are many issues with this, and there were quite a few papers comparing gene expression between species which each said it was conclusive, but if you put them together it was not conclusive. So we decided to look at gene expression in a different way: not just the level of gene expression, but the specificity. So the idea
was that one measure of function is whether a gene is housekeeping, so ubiquitous, expressed everywhere at the same level, or whether it is specific, for example a gene expressed highly in the brain and lowly elsewhere, or highly in the muscle and lowly elsewhere, so it has a tissue-specific function. This describes only a very small part of function, but the advantage is that, as we showed in some boring bioinformatics paper, it is very robust: you can measure it in different ways, using different data sets, and you always get the same conclusion. That's very reassuring, because we want to compare different species, and in different species you have different experiments done by different labs with different techniques, so you need a measure which is robust, otherwise you're just measuring the differences between labs. There's a famous paper from a few years ago where they said that orthologs actually don't have the same gene expression at all in human and mouse, and there was a follow-up paper by another lab which said, well, actually you used different generations of Illumina machines for the human and the mouse, and that explains everything you see. So you see, you have to be very careful. So tissue specificity, we showed, was very robust, and then we compared the tissue specificity between orthologs and between paralogs, and we got this. Here one point is one comparison: it's one data set of expression compared either between two species for orthologs, or between the paralogs within a species, for a given age. This time we did the age not by sequence divergence but by the phylogeny. This is something people often have issues with, so I'll show you how it's done. If you have a phylogeny like this, I can say this duplication happened after this speciation and before that one, just from the tree shape. So I can date this duplication at this point, somewhere on this branch. That's how we did it, so it's a more robust way of dating.
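To make the idea of a tissue-specificity score concrete, here is a minimal sketch. The talk does not name the index that was used, so this is an assumption: it uses one widely used index, usually called tau, which is 0 for a gene expressed at the same level everywhere and 1 for a gene expressed in a single tissue. The function name and the toy expression values are hypothetical, and in practice the index is usually computed on normalized, log-transformed expression.

```python
def tau(expression):
    """Tissue-specificity index for one gene (assumed index, not named in the talk).

    expression: non-negative expression values, one per tissue.
    Returns 0.0 for uniform (housekeeping-like) expression and
    1.0 for expression restricted to a single tissue.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        raise ValueError("gene is not expressed in any tissue")
    # Average, over tissues, of how far each tissue falls below the maximum.
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Toy profiles over four tissues (made-up numbers, for illustration only).
housekeeping = [8.0, 8.0, 8.0, 8.0]
brain_specific = [9.0, 0.0, 0.0, 0.0]
print(tau(housekeeping), tau(brain_specific))
```

With one such score per gene, the comparison described in the talk would amount to correlating the scores of the two members of each ortholog pair, or of each paralog pair, across many genes.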
What we get here is that the orthologs have very similar tissue specificity. If you take recent ones, like human-chimpanzee and human-gorilla here, there's an almost perfect correlation of tissue specificity between orthologs, and as the species are more divergent this goes down, which is kind of reassuring about the fact that evolution happens, because anything which is a flat line is kind of disturbing when you look at evolution: you think the more time passes, the more it will change. The paralogs also go down, but more importantly they're always under the orthologs. Here I'm comparing all the paralogs inside human. We also did it centering on different species, but here it's centered on human, and you find that the paralogs have a much lower correlation. They don't have a zero correlation, OK, paralogs are not random pairs of genes, they are genes which share a lot of features, so they do correlate, but much less than orthologs. When we published this we were very happy, because we thought that now we had answered the question we had had since 2009: are orthologs more similar in function than paralogs? My secret hope was that the answer would be no, because it's much cooler to overturn what everyone thinks than to confirm it, but OK, I can settle for second best, confirming it. This looked very robust and I was very happy. But science doesn't stop just like that. I told you this follows the phylogenetic view, and a few years later, so that was 2016, this is 2018, two years ago, Casey Dunn, a specialist of phylogeny at Yale, published this paper. He was nice, he gave me a heads-up a few months ahead, saying: if you do pairwise comparisons, you are comparing things which are not independent. There's a tree, actually, behind the evolution. See, here you have PCC twice, because there's a duplication. So here you have the duplication, you have the paralogs, and so the comparisons here are all mixed up. What you should do is take the tree and look at the evolution on the
duplication branches and on the speciation branches. There are actually tools developed for this since the 80s, for the evolution of phenotypes on trees, to calculate how fast phenotypes evolve on phylogenies, notably phylogenetic independent contrasts, PIC. A phylogenetic independent contrast is simply how much the trait diverges on the branch relative to how much we expect from the branch length, which is expressed as a variance. I'm not going into the math here, but basically you transform the branch length into a variance, then you look at how much the phenotype changes relative to this, and then you can compare these normalized changes. What's interesting is that you can reconstruct the ancestral states of the phenotype at the internal nodes here, and so you can also calculate contrasts on the internal branches. So now you can compare the PIC, the phylogenetic independent contrasts, for the duplication branches and the speciation branches, and you have a proper test of the ortholog conjecture. What they found is that when they took real data, so they reproduced our analyses with only part of our data, for various reasons, they found the same result as us, which is kind of reassuring and shows that we did reproducible science. All our code was shared publicly, so they could just pull our code and rerun it, and they thanked us for it. I think it's natural, and I really encourage you to share your code, because it allows such things to go much smoother, not to fight with people but to work together, because then they shared their code and we were able to build on it further. So really, do open science, share your code, we'll all get further together. But when they did phylogenetic independent contrasts, and I find this figure very unclear, but what's important is that they compared by a Wilcoxon test the phylogenetic independent contrasts between the duplication branches and the speciation branches, they didn't find any
difference on the real data. Then they did simulations. If they simulated data under the ortholog conjecture, they found a difference with the pairwise comparisons, like we found on the real data, and they found a difference with the phylogenetic independent contrasts. But when they simulated without the ortholog conjecture, they still found a difference with the pairwise comparisons, which implies that these are biased: they find a difference even when there is none, because when you simulate you know what you're doing, you know there is no difference. Whereas the phylogenetic independent contrasts managed to find no difference. So they argued from this that when you correct for the phylogeny, there is no difference between the functional evolution of orthologs and paralogs. Now, I had a number of issues with this study. Not that they contradicted me, as I told you I was actually hoping to find the ortholog conjecture false, because it was more fun, but because they treated the branch lengths as something very reliable. This is what you normalize by. But we know that duplication and the evolutionary rate of genes are not independent. We know that duplicate genes are not a random set of genes relative to their evolutionary rate, which is going to change the branch length, because the branch length is the product of the evolutionary rate and the evolutionary time which passed. And we also know that after duplication there's weaker purifying selection than otherwise, so genes accelerate, so branches are longer. So we really revisited this in detail in this bioRxiv preprint, and I really don't have time to show you all the details because it gets quite technical. But basically, if you correct the trees for all the biases which are in fact in trees that have both speciations and duplications, then you recover a difference in phylogenetic independent contrasts, which are larger for the duplication branches. So in fact, if we correct for phylogeny, taking into account that the phylogeny of a
gene tree with duplications and speciations is not the same as the phylogeny of a pure species tree, then we do recover the ortholog conjecture. I wouldn't say this is the end of the story, because obviously every time I thought it was the end, it wasn't, and the same goes for my colleagues who say the opposite. But it shows that you have to be very careful with these methods when you have gene trees which mix duplication and speciation, because there are a lot of biases to take into account. Just to give an example of another bias: you need absolute dates for these methods. You can date the nodes which are speciations from fossils. We cannot date duplications from fossils, we have to estimate them, so it's not the same, and this also introduces a bias. And this is still not the final word, but there is a recent paper I found very interesting on the ortholog conjecture, and in general a really cool paper, if you have the opportunity to read it. They knocked out genes in yeast and replaced them with the human ortholog, to see if they recovered the function. If you knock out the gene, you lose some function. If you put back the gene, you recover the function. Now, instead of putting back the gene, I put back the ortholog from human in yeast. If you have one-to-one orthologs, the replacement works in about half of the cases. But if you have one-to-many, so there was a duplication somewhere on the human branch, which could be anywhere, ancestral to animals, to vertebrates, human-specific, whatever, they don't know, then you can replace the function in only a quarter of the cases. And if you have duplications on both sides, then only in a small proportion. So it means there's clearly something which happens during the evolution of function after duplication, since it's much easier to replace one-to-one orthologs, and here we have a full measure of function, right: the yeast lives happily or it doesn't. So this, I would say, is not a real test of the ortholog conjecture, because
you can immediately see that they don't control for differences between gene retention and evolution after duplication, but it's a hint that yes, one-to-one orthologs do have more similar function than other genes. So I would like to know what you think of all this: do you believe the ortholog conjecture or not? I'm going to go to Wooclap to open the next question. So vote: from what I told you, do you believe the ortholog conjecture? The race is on. It was very close at the beginning, but now there's a clear majority coming up for the ortholog conjecture, which I suppose is reassuring to the speakers of the previous two days, whose full-time activity is to find orthologs. My specialty is working on duplicates, paralogs, so I have a different perspective on this, although I've spoken with specialists of duplications who clearly believe strongly in the ortholog conjecture, because they think the only way you get new functions is by duplication. OK, so let's say that after all I said, and I don't know how biased or unbiased I was, it's supposed to be totally unbiased, we get 80 percent of you who believe in the ortholog conjecture and 20 percent who are unconvinced. OK, so let's go back to my slides. Next slide. One thing I take home from this is not whether the ortholog conjecture is true or not. At this point I think it's probably true, because I've looked at these data really a lot, and most of the time, when you're careful, you find the ortholog conjecture. But what I find really interesting is that it's really hard to prove something which is supposed to be obvious. When I was a student, I was doing my PhD in the 90s, it was obvious that orthologs have the same function. The first time I presented at a conference, in the late 2000s, 2008 I think, saying hey, in fact we should ask this question, we don't know, people were shocked. People who work on orthologs were telling me: no way, you cannot even ask this question, it makes no sense. And actually, when you ask it, it's
surprisingly hard. For something which should be obvious, it's surprisingly hard to prove, and this, I think, tells us a lot about where we are in comparative genomics when we want to go beyond genes to function. If we are not able to answer this question, then it means it's very hard for us in general to answer any question about the evolution of function. That's why I spent time on this, not just because I was involved in it, but because I think it's an interesting window into the limitations of all our methods for doing functional comparative genomics. This is a little table I published recently, summarizing the ways you can get information which can be compared or not, from the Gene Ontology or expression or whatever. If you're comparing two genes, whether they're orthologs or paralogs, and the function was transferred by orthology or by homology, you're going to see that they have the same function, but it doesn't mean anything, because you transferred it by homology. This looks ridiculous when I put it like this, but there are papers using, for example, the Enzyme Commission nomenclature, you know, those little numbers next to enzymes, 1.12.4.1. These numbers are overwhelmingly transferred by homology, and there's no annotation allowing you to know how they were transferred. So of course they find similar function between orthologs, because they were transferred between orthologs. But the same goes for a lot of function annotations. If you have function X in one species and function Y in the other species because you did different experiments, the functions look different, but no, you just did different experiments, so you cannot directly compare. That was the problem of the different papers and same papers. Now, if we did the same experiment and we get the same function, this gives us information about evolutionary conservation. The easiest case of this is obviously expression: I do RNA-seq, I have the same expression, then it's conserved, and
if I do the same experiment and I get a different function, I do RNA-seq and I have a change in expression, this tells us there was evolutionary change. Just to tell you that this can be tricky: for example, there have been papers comparing the essentiality of genes between orthologs. An essential gene is a gene where, if you knock it out, you're dead or sterile, so your fitness is zero in evolutionary terms. People have compared the knockout phenotypes between, say, zebrafish and mouse, or between mouse and human medical genetics results. But when you dig deeper, and I tried to do this and we never published, it becomes very complicated, because knockouts are very rarely clean, as in I knock out the gene exactly from one end to the other. You knock out part of the gene, you put a stop codon in the middle, you remove part of the promoter, you have a conditional knockout only in some tissues. When you look at all this, it's almost never the same experiment which was done, so you cannot really compare. So it becomes very, very difficult to compare the function of orthologs or paralogs when the experiments were actually different. Another little poll: I will go to Wooclap to activate it. This one should be fast, because we've been speaking about this a lot. I see I'm running a bit late, unlike previously where I was ahead of time, so I will go back to my slides while you vote in the background. OK, so now we're going to come to the most interesting part at the end. No, it's not true, everything is interesting, I love all comparative genomics. Evolution after duplication. Here's a little figure I made to summarize all the different levels at which function and duplication can interplay. You have the different types of mutations, then the fixation of the duplicates. You should remember, by the way, that duplication is a type of mutation, even genome duplication is a type of mutation, because sometimes people say mutations can't create new functions but duplications can. But
that's false, because duplication is a type of mutation. Then you have the fixation, and we saw the retention before. This is population genomics, and we will not speak of population genomics in this course, but when we spoke, for example, about the variation in populations of copy number variants, it's at this level. Then we have this bias by gene function, which retains different duplicates according to their function. But then, after they are kept, they can evolve their function in different ways. The easiest way here is dosage. I had one gene with two patterns of expression from different promoters. I had one originally and now I have two, and the two are kept and expressed, and so I have twice the amount of product. This happens for example for ribosomal proteins: you have arrays of duplicate genes which express the same protein, because you need a lot of ribosomal protein. But the hypothesis which was the most popular when people started studying gene duplication was what is now called neofunctionalization. We have two copies of the gene. One is under purifying selection to keep the ancestral function, which is still needed, so the other is no longer under strong selection, so mutations, including large-effect mutations, can get fixed. This allows the evolution of new functions, whether it's new promoters, new domains of expression, or new protein functions, and that's neofunctionalization. In the early 2000s Mike Lynch, the same person who made the frequency of duplicates figure I showed you earlier this morning, proposed a third model to explain why we find so many duplicates. Because if this is the only way, and that was for a long time what we thought to be the only way, it should be very rare to keep two genes, because the probability of a mutation which breaks something is much higher than the probability of a mutation which improves something. OK, if you give your
computer code to someone you don't know and say change what you want, randomly, the chances that he improves it are much lower than the chances that he breaks something, right? So random changes usually don't improve things. The beauty of the subfunctionalization model is to say: if I have several sub-functions which are independent, here for example it's domains of expression, but it could be protein interaction partners, it could be subcellular localization, anything you can imagine to be separate, it's easy to have a mutation which loses one, because that's breaking something. Now if one copy loses one sub-function and the other copy loses the other sub-function, I have no gain of function, but I need to keep both genes to keep the ancestral function. To have, for example, these two expression patterns, these two domains of expression, I need to have the two copies. So now I'll keep both by purifying selection, because I want to keep the ancestral function, which was useful to the organism, without any gain. This model was very powerful and interesting when it was published in 2001, because it was the first model to say we can complexify genomes without any gain. This comes back to the question I asked you earlier, whether genome evolution is related to some concept of higher animals. We can actually increase enormously the complexity of a genome without any gain of function. It becomes a kind of Rube Goldberg machine. If you know Rube Goldberg machines, that's when the ball falls and breaks the balloon, which spills the water, which wets the cat, which jumps onto the ladder, which makes the book fall, and in the end it serves you coffee or something. So it's useless, but once you make something so complicated, if you remove a part it doesn't work anymore. Duplication can actually create that, through subfunctionalization. And so, a bit like the ortholog conjecture, once you have these two hypotheses you want to test them, to see what happens in reality, in practice. I
mean, I can make some theoretical model and another theoretical model saying this is frequent and this is rare, and that's what they did in the early 2000s, but a theoretical model has a lot of assumptions, and we don't actually know what really happens. So what really happens? It's been studied in several ways, and one way is to look at natural selection, so dN/dS on the sequences. Just to say what the expectations of selection were under the different models. This is a bit complex, but bear with me. Here you have the gene frequency. Gene 1 is always there, at 100% frequency in the population. Gene 2 appears at the duplication in one individual, increases in frequency, and then gets fixed. So now I have two duplicates, it's at 100% in the population, everyone now has two genes. Neofunctionalization means that at one point, so here nothing is happening, I have the two genes and it's drifting, but then I have a mutation, in black here, which gives a new function to this copy. And now there is positive selection, see, the slope of the curve increases tremendously. Positive selection is pushing for the fixation of this gene, because it has a new positive function. Positive selection is also pushing to keep this one, because you need both. And then you have purifying selection which keeps both. And this is subfunctionalization: I have a gene, I have a duplication, I'm drifting. Now I lose a function in one of the two genes, but they're still redundant on part of the function, so it's still drifting. But now I have a mutation on the other gene, in the other sub-function, and now I have purifying selection keeping both, because I need both to make my function. The big difference between these two graphs is this red part here: in this model there is positive selection, and the subfunctionalization model is neutral, there is no positive selection to fix mutations. This has been tested in several ways and never convincingly, so I don't have a convincing result to show you, but I can show you a little bit of evidence. So one thing we
see repeatedly is that duplicates evolve asymmetrically: you have two duplicates, and one evolves fast and one evolves slower. This has been interpreted as evidence for neofunctionalization, because here one of them gains a new function and changes, and the other doesn't. And when you cartoon subfunctionalization, you always draw each copy with one loss of function, and that's also what I put in my cartoon earlier: they each have one loss of function, right? So you always draw it symmetric, whereas here it's asymmetric. But actually there's no reason for that, there's no theory which says it should be asymmetric or symmetric, so you can have asymmetry under subfunctionalization also. So although there have been quite a few papers which said asymmetry was indicative of neofunctionalization, that's not necessarily the case. What you need, to decide whether it's subfunctionalization or neofunctionalization, is to know what the state of the ancestor was, because the big difference between neofunctionalization and subfunctionalization is whether together the copies describe the function of the ancestor, or one has the function of the ancestor and the other has a new function. Now, we don't have a time machine, because the physicists still haven't given us the time machine, and if you have any physicist friends you can complain to them. But we have outgroups, which is second best, and that we have thanks to comparative genomics and orthologs, so let's rejoice. We should compare our duplicate genes to an outgroup which is not duplicated, and see: if one copy is like the outgroup in function and the other is different, this looks like neofunctionalization. If the two paralogs are a bit different and together they describe the function of the outgroup, that looks like subfunctionalization. I don't have a result of mine on this, because I've been trying to test this for years, and every time I try to start testing it there are new results on the ortholog conjecture and I go back to that. So it's still my
plan for next year, but it's been my plan for next year for six or seven years. Other people have worked on it in the meantime. So here, for example, you have RNA-seq in human and mouse, where they looked at the expression of the duplicates which are human-specific or mouse-specific. Here they don't have an outgroup, sorry, my bad, so you cannot tell. So this is an example of asymmetry. Here you have two genes with different expression: see, this gene is expressed highly in the lung, the kidney, the liver and the testis, and the paralog is also expressed in the testis, but highly in the heart, where this one wasn't, and highly in the cortex, where this one wasn't. And here I have a gene which is highly expressed in one copy and lowly in the other, and actually this is quite frequent. It's never in our models, but it's quite common. So this kind of study is interesting but doesn't really allow us to choose. So we need outgroups, and this is where the fish genome duplications are very interesting, because in fishes we have not only genome duplications, which means a lot of genes which are duplicates of the same age, so we can compare them, but we also have outgroups. For one thing, a nice outgroup to all fishes is human and mouse, where we have a lot of data. So when I work on fishes I treat human and mouse as the outgroup, which is very convenient because there is a lot of data. We also have an extra genome duplication in salmonids, and salmonids have an outgroup, all the other fishes for which we have data, notably zebrafish. I wish I could tell you I'm going to give you the final answer here, but it's not the case. This figure is a bit complicated, but what's important is what they did here: they sequenced the gar, and the gar, if you go back here, is a fish which diverged by speciation from other fishes before the teleost genome duplication, so it's an outgroup to the teleost genome duplication.
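The outgroup logic (one copy like the outgroup and the other different, versus the two copies together describing the outgroup) can be sketched with plain correlations. This is only a toy illustration under stated assumptions, not the method of any of the papers discussed: the function names, the tissue order and the expression values are hypothetical, and representing "together" by summing the expression of the two copies is an assumption of this sketch.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)


def compare_to_outgroup(outgroup, copy_a, copy_b):
    """Correlate each duplicate, and their summed expression, with the outgroup."""
    summed = [x + y for x, y in zip(copy_a, copy_b)]
    return {
        "copy_a": pearson(outgroup, copy_a),
        "copy_b": pearson(outgroup, copy_b),
        "sum": pearson(outgroup, summed),
    }


# Made-up values; tissues in the order brain, heart, liver, muscle.
gar = [10.0, 10.0, 0.0, 0.0]          # single-copy outgroup gene
zebrafish_a = [10.0, 0.0, 0.0, 0.0]   # duplicate kept in brain only
zebrafish_b = [0.0, 10.0, 0.0, 0.0]   # duplicate kept in heart only
result = compare_to_outgroup(gar, zebrafish_a, zebrafish_b)
print(result)
```

In this toy case each copy alone correlates only moderately with the outgroup, while the sum matches it, which is the pattern one would read as subfunctionalization. A copy that correlates highly on its own while the other does not would instead point toward neofunctionalization. Any cutoff for calling one or the other would have to be chosen and justified separately.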
All the genes which are duplicated in all teleost fishes from the genome duplication are single copy in the gar. So they did the genome and the transcriptome of the gar and compared expression. We have genes like this, where you have the gar in gray, and one paralog in zebrafish, here in red, is expressed in the same place as the outgroup, the gar, and the other paralog, in blue, is expressed in a different place. So that looks like neofunctionalization. You can do the same thing with the medaka, which gives you a kind of biological replicate, a different fish, also a teleost, and they have the same pattern. But you also have patterns like this, where the gar gene is expressed in the brain and the heart, and one copy from the genome duplication, one ohnolog, in zebrafish is expressed in the brain and the other in the heart. So this really looks like subfunctionalization. When they counted the two types, they had many more subfunctionalizations than neofunctionalizations. They also verified this by correlating the levels of expression. Here, for example, you see that one copy in the zebrafish is lowly correlated to the gar outgroup, the other copy is also lowly correlated to the gar outgroup, but if I sum their expression, so under subfunctionalization I'm going to assume that the sum of the two corresponds to the outgroup, then I have a high correlation to the gar, similar to what they get for singletons. So this is for the genes which have a pattern like subfunctionalization, whereas for those which have a pattern like neofunctionalization, the duplicates differ from the singletons, so it's not the same. And you see this again in zebrafish and medaka, with slightly different scales. So here they found evidence for overall subfunctionalization. That's very nice, it's a lot of data, and it would be nice to call the question settled, but the same year another group studied the salmonid genome
duplication. They sequenced the genome of a salmon which had the duplication, Salmo salar, the usual salmon, and of the pike, which is closely related to salmons but didn't have the genome duplication, and they compared the gene expression like this. So you have gene expression, every column is a tissue, a sample, and every line is a gene. You see that, for example, all these genes here are highly expressed in the brain for one copy in the salmon, highly expressed in the brain for the outgroup, the pike, but much more lowly expressed in the brain for the other copy in the salmon. On the other hand, that second copy had high expression in all kinds of other places, which was present neither in the outgroup nor in the first copy. So this goes in the direction of saying that we have neofunctionalization: we repeatedly have genes with one copy which is like the outgroup and the other copy which is different. These two groups have fought through different papers. They use different methods, different metrics, so there is a paper by the salmon group which says if we reanalyze everything it's clearly neofunctionalization, and a paper by the gar group which says if we reanalyze everything it's clearly subfunctionalization. At this point I would say I don't know, and I'm collaborating with the gar people to get more data and try to understand. There was also a paper more recently in amphioxus. Amphioxus, if you look at the phylogeny, is the outgroup to the vertebrate genome duplications, and with what we find here we cannot really distinguish. This is a complicated figure, but what's important is that when you have ohnologs, you can compare the expression in the tissues in amphioxus to the zebrafish, and what you often have is a kind of specialization. So it's not really subfunctionalization in the cartoonish way, but more that one of the two zebrafish copies is broadly expressed, more or less in the same places as the amphioxus, and
the other is much more narrowly expressed. This is something we see a lot, and which we also saw in the human and mouse data, although we didn't have an outgroup there: in a lot of cases where there are two duplicates, one is more lowly expressed and more specialized, and the other is more highly and more broadly expressed. But how this relates to sub-functionalization and neo-functionalization, we don't yet know. This third hour, I'm sorry, is full of question marks: we don't have the full answer to the ortholog conjecture, we don't have the full answer to neo- and sub-functionalization. That's research for you. But I think we are refining our view and seeing that things are more subtle than the cartoon images we had 10 years ago.

So, if I want to keep a few minutes for questions, I will go to a conclusion slide. On average, after genome duplication there is evidence for both sub- and neo-functionalization. One of the things I think people would like to do now is to understand this better: clearly both happen, but when do they happen? Is it because neo-functionalization happens more in the short term and sub-functionalization in the long term? Is it because different categories of genes are kept? Is it because some changes are at the expression level and some at the protein sequence level? I didn't show you, but there's starting to be some evidence for selection on the protein sequences as well.

Also, many genes do not diverge in expression. I didn't show this much in the slides because they're kind of boring pictures, but in fact these slides are only a part of the picture. Here you have the genes which diverge between the copies, but salmon has many more genes than this, about 40,000, because they had several genome duplications, and most of them are actually all the same between the copies. We see this also, for example, in Xenopus. And we are studying these questions a lot on expression because it's, I would not say easy, but feasible, tractable, to get expression in different species across the whole genome and compare it. Other
features of function are much harder: they're harder to get at large scale, and when we get them at large scale they're harder to compare. So we have limitations on both sides. On the experimental side it can be very difficult: it's almost impossible to do systematic knockouts in any species which is not yeast or E. coli, and it's very difficult to do systematic protein-protein interactions. And then most features of function are actually specific to subcategories of genes: if you have enzymes you want to study their catalysis, if you have receptors you want to study their ligands, if you have transcription factors you want to study the genes they regulate, but then this becomes much harder to put all together. And bioinformatically, as soon as you go outside of RNA-seq, which already has biases, it becomes even worse. What I showed you for the Gene Ontology also complicates all the other things: whether it's knockout phenotypes, protein-protein interactions, or enzyme activity, all these things become very complicated to compare. So I'm not saying we won't manage, but it's clear that we started with expression because it was tractable.

So, in conclusion of this morning: first, thank you for listening; I see that some of you are still here. I hope I showed you that gene and genome duplication are quite common, and so we should take them into account in comparative genomics. I think I showed you that the evolution of function is hard to study. I don't want to discourage you; it also means it's interesting, and challenges make for interesting research. So far, as I was saying, expression is the best-studied proxy for function evolution. That doesn't mean it's the best, and maybe in 10 years we'll look back and say everything we did on expression was really crappy and luckily now we have a better way to measure function, but that's what we can do now. So we're like in the tale of the drunk person who is looking for their keys under the streetlight, not because
they dropped the keys under the streetlight, but because that's where there's light to look for keys. And finally, I hope I showed you that curated databases are useful, and that to study function we really need this curation and these structured databases. Thank you for your attention. I will now look at the questions.

Question: are pseudogenes included in sub-functionalization? No, pseudogenes are included in loss of the gene. A pseudogene is considered a step towards losing the gene in most cases. If you have two copies of a gene, I showed you only the scenarios where you keep the two copies. There's actually another scenario, which is the most common one: you have two genes and you go back to one gene, you lose one. That's the exponential decay we had at the beginning of the morning. Now, how do you lose genes? It's very rare that molecular scissors come and cut out exactly one of the two copies. What happens is that you have mutations which little by little inactivate this gene, so that it's not expressed, or it's expressed but not translated, or it's translated but it's not folding, et cetera. So you get a pseudogene. These genes can still be transcribed sometimes, and pseudogenes are basically duplicate genes being lost.

Actually, this brings us back to the question someone had earlier this morning: if duplication is frequent, we should see a lot of duplicate genes in the population. I spoke of copy number variants, but there are also the pseudogenes. There are a bit fewer pseudogenes than genes in the human genome, but close; it's something like 19,000 to 17,000, sorry, I don't remember the exact numbers. The pseudogenes are all evidence for duplications which happened not so long ago and are being lost, because once a pseudogene has no function at all, it's not under any purifying selection; it will be a target for mutations, which will occur and get fixed by drift without any selection to prevent them from getting fixed, and so
pretty fast you will lose them. You cannot compare orthologous non-functional sequences between human and mouse; that's too far away, because after 80 million years on each side, so 160 million years of total evolutionary time, mutations have scrambled the sequences so much that you've lost everything. And you should know that in most genomes, at least the eukaryotic genomes that I know, small deletions are more frequent than small insertions. So if you have a piece of sequence which is useless, you basically lose it little by little, just by deletions. If you wonder why our genome then isn't much smaller, it's because we also have transposons which come and increase the size of our genome, but that's a different discussion.

Going to the questions. Does co-expression of two genes speak for sub-functionalization? Co-expression of two homologs, paralogs, or co-expression of two unrelated genes? Paralogs. So if two paralogs are co-expressed, that says that they share function, so from an expression point of view it would favor just an increase in quantity. Now, I know anecdotal stories where there's shared expression with changes of function, but it's hard to study at large scale. For example, the estrogen receptor: in vertebrates there are two genes, estrogen receptor alpha and beta; they form a heterodimer, they are expressed together, and you need both paralogs to form the functional heterodimer. In some circumstances you actually need the homodimer and in others the heterodimer, so the heterodimer has a new function. And you have various cases like this. The most well known is hemoglobin: hemoglobin is an assemblage of paralogs which are expressed together but have different sub-functions. So yes, this can happen, but it's hard to characterize at large scale, because you can look in detail at one transcription factor or one globin, but to characterize this at large scale is difficult.

Would you consider duplications somewhat useful, because we can compare them to an outgroup? Useful for what? Because it
depends: useful is always useful for something. I like duplications because I study them, so they're very useful to my career, but I'm not sure what the question is. If this is the question: I know some papers which use duplications as additional phylogenetic information, so they improve the species phylogeny by the dating of the duplications. If that's the question, I can give you some links. There are some methods which exist, and Christophe Dessimoz probably knows some others, which use the order of duplications in the phylogeny as additional evidence for the species phylogeny.

Additional question: as a rough measure, can we use the number of present/absent calls in Bgee? Use it for what? Use it as a measure of tissue specificity. Yes, you can, if that's the question. In our benchmark of tissue specificity measures we showed that this works not as well as a quantitative measure, but not too badly. And actually we're implementing this in Bgee, but it's not yet on the website, because before we put something on the website we have to verify it a million times.

Is there a course on biases of genomic analyses? It seems like a vast and interesting subject. The thing is, once you look into it, it's interesting, but if we put this as a title no one will come. So that's an interesting suggestion. I don't know if Patricia Palagi is here or not right now. "I am." Yeah, she's here; she's coordinating the training at SIB, so we can take note of the question. Something like "when genomic methods don't work": it would be fun to teach, but I don't know if a lot of students would come.
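
To make the answer about present/absent calls more concrete, here is a minimal Python sketch contrasting a quantitative tissue specificity measure with a rough one based only on present/absent calls. It is an illustration under stated assumptions, not the Bgee implementation. The quantitative measure is the widely used tau index; the call-based formula, the function names and the toy input values are all hypothetical choices made for this example.

```python
from typing import Sequence


def tau(expression: Sequence[float]) -> float:
    """Tau tissue-specificity index computed from expression levels.

    0 means equal expression in every tissue, 1 means expression in one tissue only.
    """
    n = len(expression)
    x_max = max(expression)
    if n < 2 or x_max <= 0:
        raise ValueError("need at least two tissues and some expression")
    # Sum, over tissues, how far each tissue falls below the maximum.
    return sum(1 - x / x_max for x in expression) / (n - 1)


def call_based_specificity(present: Sequence[bool]) -> float:
    """Rough specificity from present/absent calls only.

    Hypothetical formula for illustration: the fraction of the other tissues
    in which the gene is absent. It gives 1 for a gene present in a single
    tissue and 0 for a gene present everywhere.
    """
    n = len(present)
    k = sum(present)
    if n < 2 or k == 0:
        raise ValueError("need at least two tissues and one present call")
    return (n - k) / (n - 1)


# Toy values, invented for this example: one gene in three tissues.
levels = [10.0, 5.0, 0.0]
calls = [x > 0 for x in levels]  # assumption: present means level above zero
print(tau(levels), call_based_specificity(calls))
```

The call-based measure ignores how strongly a gene is expressed where it is present, which is one plausible reason for a measure of this kind to perform somewhat worse than a quantitative one, as mentioned in the answer.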