Wherever you are, we're really pleased to run this course, actually for the second time but for the first time online. Hopefully this is going to be an okay experience. It's necessarily not going to be as interactive as if we were all in the same room, but we can make some efforts, and we're really happy to have your participation. My name is Christophe Dessimoz. I run a lab mainly based at the University of Lausanne, and we are obsessed with the topics of this course, so hopefully we can share a bit of our experience in inferring trees, interpreting them, and identifying and relating genes across different species.

As you could see, there's already a link to a Google Doc. The reason we do the questions and answers with the Google Doc is that we found it easier to keep track of the questions and provide answers in a somewhat asynchronous way. It doesn't mean that you are not allowed to use the chat function: if you have small remarks, or you'd like to draw my attention to something, it's also possible to use the chat, because I will not be monitoring the Google Doc during the lecture. However, as you could hear, we have some teaching assistants who may directly answer some of the questions there, or draw my attention if there's something that really merits clarification.

I suggest that in the first few minutes, if you fancy this, you briefly write in the chat who you are, where you're connecting from, and maybe one sentence on why you're joining the course. That will give the other participants a sense of who's joining. I'm going to proceed in the meantime; we can do that in parallel with the introduction to this first hour.
So I think that's all for the meta business. Well, maybe the last thing I will say is how I've divided things. We have three hours now, and I've divided these lectures into three parts, about 45 minutes each, I hope, which will leave a 15-minute break at the end of each session. During the breaks you'll be able to stretch your legs, maybe get hydrated, and if you wish also interact a little with the other participants of this course: we're going to create some breakout rooms. So this is how we're going to get organized.

The first part is quite general. We are going to talk about how sequences evolve, typically along trees, how to interpret these trees, and how to infer them. It's fairly brief for a very big topic. I think many of you will already have worked with trees, but we think it's a good investment to get a bit of a peek under the hood.

I just want to start by stressing how pervasive trees are in modern biology. Of course the tree of life is still something that is... I hope you see my cursor. Can I get quick feedback in the chat: do you see where I'm pointing with the mouse? Is there perhaps another way to do that? I don't see it. OK, that may not be possible. Well, in the upper right part, and I'm going to go more or less clockwise from there (I can point with my finger, but it's mirrored), is the tree of life. This remains something of very high interest. You may have heard that there are initiatives to keep sequencing new species, and there are still parts that are not resolved, so Darwin's tree of life remains an important topic. But there are also many other uses of trees, some of which I have depicted here, that
are very applied: for instance modelling lineage trees, host-pathogen co-evolution, or cancer progression, which is also an evolutionary process. We've also seen quite a bit in the news this year how trees can be used for contact tracing in epidemiology, perhaps not to pinpoint a particular transmission route definitively, but rather to exclude many very unlikely routes. So trees are used across the life sciences.

As I mentioned earlier, we're first going to discuss how to interpret a tree. This part will hopefully already be quite familiar to many of you, but we want to cover it to make sure that we're all on the same page. Again, please make use of the Google Doc if anything is unclear, so we know where to spend a little more time.

First thing to note: phylogenetic trees are used to relate biological sequences, that is DNA and protein sequences. I think a tree is a natural model for that because of how DNA replicates: it splits into two, and then the two copies can each take on their own destiny, and typically they then diverge. So a tree is usually quite a good model, at least to capture the evolutionary history of particular fragments of DNA. Then things can get a bit more complicated.

In terms of representation, I've shown three different ones here. There's Carl Woese's iconic three-domains-of-life tree, which is depicted as an unrooted tree: there's no specification of exactly where the root is, which is why I call it unrooted. There, every branch has a meaning in terms of the amount of evolution that took place along it. A short branch means fewer changes at the sequence level, or it could mean less time, if the tree is calibrated and conveys time. We're going to discuss
the connection between sequence evolution and time in just a few slides.

But there are other representations. You may have seen this type, either in this vertical layout or drawn horizontally; it's very common. Here we have to be mindful of whether the branch lengths carry any meaning. In some cases they carry none, and the figure just depicts what we call the topology of the tree, which is the order of connections. We need to look at the fine print, maybe the figure caption, to know whether the branch lengths mean something. If they do, that would be the vertical length of the branches here. The horizontal segments are just used to space out the tree, so it would have exactly the same meaning if I widened, for instance, that segment, which is only there for layout. It's really the vertical branch that potentially conveys some meaning.

This is also quite a popular representation, a circular tree. In both of these cases the trees are rooted: here the root is in the center, and there the root is at the top. Again, the meaningful branch lengths here are the radial distances. The lengths in the other direction, I'm not quite sure how to describe that, maybe the angular distances, are just for layout. In fact you can see the leaves are all equally spaced. This has sometimes been a criticism of these trees: you can have species drawn close together that are actually quite distinct. Here that's maybe very obvious, but in other cases it's a bit less obvious.

Yes, you cannot see the pointer. This is a bit of a problem with this setup, so I may have to change the way I share the slides. I thought this would be a good idea
because you can see both myself and the slides at the same time, but perhaps we'll have to go back to the more classical layout so that you can see the pointer, which in this case is really quite helpful. So let me just change the way I share this. OK, can you see my pointer now? "Yes, now we can." Now you can. All right, so we're going to do it that way; I think it's preferable. However, I no longer see the chat. Oh well, we can't see everything. I've got my secondary device here. I see plenty of feedback, so thank you for all of it.

So the angular distance here is not really meaningful. But the key takeaway is that when you see a tree, you have to ask yourself a few things. First of all, what is always conveyed by any tree is a topology, the order of branching on the tree. Here, if we go back in time, the chimp and the human have a common ancestor first: they coalesce first. Then you go back to the common ancestor of that common ancestor and the gorilla. This is the branching order, and all trees convey it. Then you have to ask yourself whether the branch lengths are also conveyed.

And sometimes, even if you have a rooted representation such as this one or that one, the root is not really meaningful. I will discuss this later as well: the programs used to infer trees typically can't infer the root, so it may be quite an arbitrary root. In other cases there's a meaningful analysis that attempts to ascribe a root to the tree. So even if you have a rooted representation, you've got to be a bit cautious in your interpretation and ask whether whoever produced the figure really intended to make a claim that this is their best, most likely root. OK, so I think that's enough for
this slide. If you have more questions, ask them in the Google Doc; we're going to revisit this. At the beginning we're just familiarizing ourselves with these representations. A tree can be rooted or unrooted. This here is a bit of an unusual type of representation for an unrooted tree, but it also works, and in this case the horizontal lines would be meaningful in terms of evolutionary distance, if that is conveyed at all.

Now I'm already bridging a bit to the question of how to infer trees. We need to get more familiar with the complexity of trees, so I'm starting with a question that will hopefully set some thoughts in motion. If we are just considering a fully binary tree, so every split has one incoming edge and two outgoing edges, how many branches do we have if, for instance, we have three leaves?

This is one small interactive part I have here. You have the link if you want to answer this question; it is provided in the Google Doc, and maybe someone can also paste it in the chat. It's a Wooclap link in the schedule, and the Wooclap code is SIVCG1. So I can open the first question now: how many branches are there in an unrooted binary tree with three leaves? If you have pen and paper in front of you, you could try to draw this. If you have three leaves and an unrooted binary tree, how many branches will you have in your tree? So let's go ahead and answer. OK, I see some answers starting to come in, and I'll try to draw things a little. Don't be shy; it's also quite helpful for me to get a
sense of your familiarity with this data structure. So far I see six answers. If you don't understand the question, feel free to ask for clarification, either in the Google Doc or in the chat. Someone ventures n - 1 as a guess. "You are muted, we cannot hear you." Oh, thank you very much. I don't know how I got muted; maybe when I switch device I get muted automatically.

What I was explaining is that if we start with the three leaves, this is one way to connect them, and here you have three branches. So those of you who answered three are in good shape. Now you might ask whether there is a different way to connect them. Let me go back here. You might think you can get away with just two branches, but I would argue that in this case this one is not a leaf: it is an internal node. And then you might wonder, what if this branch is just so short that it's not quite an internal node? Then we have three leaves, and don't we just have two branches? Well, actually we don't have two branches: if we zoom in here, we still have that third branch, which is just very, very short. So all of you who answered three branches are correct: for an unrooted binary tree with three leaves, you have three branches. If this is not clear to you and you still have questions, you can ask them in the Google Doc.

I'm going to move to the next question, which is about four leaves. Let me see how I can stop this one. OK, I'm going to the second one. You see the second question: how many branches are there now for an unrooted binary tree with four leaves? So imagine now you have one more leaf.
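Coming back briefly to the three-leaf case just discussed, the argument that two branches are not enough can be checked with a few lines of Python. This is only a minimal sketch and not something shown in the lecture: the edge-list representation, the label X for the unnamed internal node, and the helper names `degrees` and `describe` are all assumptions made for illustration. A leaf is taken to be a node that touches exactly one branch.

```python
from collections import Counter

def degrees(edges):
    """Count how many branches touch each node of an unrooted tree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def describe(edges):
    """Return (leaves, internal nodes, number of branches) for an edge list."""
    deg = degrees(edges)
    leaves = sorted(n for n, d in deg.items() if d == 1)
    internal = sorted(n for n, d in deg.items() if d > 1)
    return leaves, internal, len(edges)

# Hypothetical labels: A, B, C are the leaves, X is the unnamed internal node.
star = [("A", "X"), ("B", "X"), ("C", "X")]
# Attempt to get away with two branches by hanging A and C directly off B.
path = [("A", "B"), ("B", "C")]

print(describe(star))  # (['A', 'B', 'C'], ['X'], 3)
print(describe(path))  # (['A', 'C'], ['B'], 2)
```

In the second tree, B touches two branches, so under this definition it is an internal node rather than a leaf. That is the sense in which two branches cannot connect three leaves, unless the third branch is there but of length zero.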
Some answers already: four or five. Four and five are almost evenly split, and I don't want to influence you too much. Draw the tree on a piece of paper to convince yourself. So far the only answers I see are four or five, quite evenly split.

One way to answer this question is to grow our tree. We start from what we had before, and we add one more leaf. Now I need to connect it somewhere, so I pick a branch to connect it to. I'm going to pick this branch, for instance, and now if we count, we have one, two, three, four, five branches. What happened is that we had to add one branch to connect the new leaf, but in so doing we divided an existing branch into two. If these are A, B, C, D, a more common representation would be a more balanced layout, and maybe we would draw it like this: A, B, C and D. This is the same topology. Maybe you've seen that I've actually swapped A and B, but that is also okay; I'm just conveying how I've connected these four leaves.

So you see that by adding one leaf we increase the number of branches by two. This is good to know: there are roughly twice as many branches as leaves. It's not exactly twice, as you could see, because with three leaves we have three branches, but every time we add one leaf we add two branches. So that's fantastic, and we actually have a formula. I'm going to show it later in the slides, but it's essentially 2n - 3 for n leaves. You can see this works for three: if you plug in three for n, you get six minus three, that's three, that's fine, and then
it grows: every time n increases by one, you add another two. So this is the formula. The formula itself is not that important; the important bit is that the growth is linear in the number of leaves. So if, for instance, we want to estimate the branch lengths, and we add some more taxa and know how they are connected, it's not such a difficult problem, because the number of branches doesn't explode.

But now, how about the number of topologies? How many ways do we have to connect, for instance, our three taxa? I had A, B and C here, so how many topologies do we have with A, B and C? Someone answered four and a half as a solution to the previous question; I'm not sure what a fraction of a branch would be. Let me see how I can get back here. The question is how many topologies we have. OK, the question I've set up is actually for four leaves, so don't answer that one quite yet. Sorry, I should have asked it for three leaves first, so use the chat for that. Sorry for all the gimmicks; we are learning as we go. How many ways are there to connect three different leaves?

Some people are suggesting three ways. I wish I could ask you to draw the other ways. If, for instance, you're suggesting putting B, A and C like this, well, that's still the same tree: it still connects the leaves in the same way. A rotation doesn't change the meaning of the tree. OK, now I see most of you are going for one, and this is indeed the case: there's only one way you can connect three leaves in an unrooted tree. The edge case I mentioned before, where one of them is an internal node, is not really a different tree. So I see vertical and
horizontal, no, plus vertical and horizontal. I'm not quite sure what you mean. If you mean drawing it like this, A, B, C, or like this, and I'm going to make it extra confusing, like this, A, B, C: all of that is the same tree. Now you may say, but how about that one, with B here and then A and C there? Well, these are different trees, but that's because they are rooted differently, so these are not unrooted trees. If you view them as unrooted, if you're saying the root here is meaningless and I've just drawn it this way, then it is the same tree; let's say they are equivalent in an unrooted interpretation. Why? The way I think about this is that you keep all of the connections as they are and you pull that branch here. You pull it and pull it, keeping all of the connections, and then you get that tree. You don't have to change the order of any connection. If you pull it, you get C, A and B, and now you can see how this is very much the same topology as this one.

So these were all unrooted topologies, and with three leaves, I hope this is clear, we really have just one way of connecting them. I'm going to clear all of the drawing now. Just to show you that all of these are the same, I'm not putting them in order. OK, so A, B, C. Now how about four? Let's forget about the Wooclap thing, because I see that the chat works just as well, and I regret having done that gimmick. Let's just use the chat: tell me, if we now add one more leaf, how many topologies do we get? How many ways can I add one more leaf, that's another way
of putting it. Someone thinks there's still only one possibility, the majority seem to think three, someone says two. OK, so indeed we have three branches, so we can connect our node D either here, and that would be one topology, in which case, if you look at the meaning, A and D are especially close, and then B and C. But I could also connect D here, in which case C and D are close to one another, and A and B are close. Or I could connect it here, giving B and D, and then A and C. Sometimes people just write this in text form, saying that A and C, and B and D, go together in parentheses, for example ((A,C),(B,D)). That also conveys the tree, now in a text format. If you add a semicolon, this is actually the Newick format. Anyway, there are indeed three ways, so we went from one way to three.

"Can D be added at the center?" Very good question. So what happens if I add D at the center, like this? This is no longer a binary tree. We have what we would call a polytomy, or a multifurcation: more than three edges meet at one node. Now you may say, but what if there is a teeny-weeny middle branch? Then yes indeed, that's potentially the same tree as this one, with this branch set to zero. But there are still three ways you could resolve it: it could be A and C that are together, with that branch of length zero, or it could be A and B, with that branch equal to zero. So if you go back to binary trees, you are back to your three topologies.

You may say, isn't this a bit of a theoretical question? But actually, if you look at the coronavirus, the SARS-CoV-2 virus to be precise, and you look at its tree, there are so few changes that
there are many, many such internal branches that are essentially of length zero. This is important for the interpretation, because these zero-length branches are meaningless. Even though most programs will still return a binary tree, you cannot draw conclusions about the order of the branches that connect at the same place. At least this was true about six weeks ago, when we were looking at sequences: at that time there were maybe already around 50,000 sequences available worldwide, with very few differences among them. These genomes are about 30,000 characters long, and on average there were only about 30 differences. So you can think of this as a tree that is extremely compact, with the vast majority of branches having a length of zero, one or perhaps two substitutions over the entire genome. That matters for the interpretation: if a branch length is zero, then you may have multiple binary topologies that mean the same thing.

Well, I digress a little, and that's the risk of having a whiteboard and interesting questions. Maybe we should go back to the slides and try to progress, given that according to the timetable we only have five minutes to cover all of tree inference, which seems a little ambitious. We'll see what we manage to do. So let me get back to my slide deck. OK, we've covered all of that, wonderful.

The important thing here, on which we could have spent a little more time, is what happens when we add more taxa. For three leaves we saw that we had one unrooted tree, and then we went to three. From there it grows very quickly. When we go to five, remember that we have three possible trees of size four, and each of these trees has five branches, so when we add our fifth taxon we've got five choices to
connect it to, in each of three different starting topologies. That's three times five, so 15. Then, when we go from five to six taxa: remember that going from four to five leaves adds two more branches, so we have seven branches, and we have 15 topologies. 15 times seven, that's 105. So really the key here, contrary to... "You're sharing the wrong screen. We see the book." Oh, sorry, thank you. A few transitions that can be improved. It's treacherous, because I took this book cover for the slide. So am I sharing the right screen now? Yes? OK, that's better. Sorry about that, and thank you, Rob, for the intervention; I could have gone a long time before noticing.

So you can see that when we add more and more taxa, the number of possibilities multiplies: it is the number of starting topologies times the number of branches where you can attach the extra leaf. That grows very quickly. This is the case for unrooted trees, and for rooted trees it's in fact the same number, just shifted by one, because you can think of the root as being like an extra taxon, and the position of the root as the position of that extra leaf. By the way, I say taxon or leaf interchangeably; this is jargon about trees.

In my drawings I already tried to convey the notion that there are many equivalent trees as well. One mental model, as you saw, is to think of the branches as strings and then pull on some of these strings. Another quite nice analogy is that of the mobile, these devices that are often used for babies. There too the order of connections doesn't change, but everything can rotate, and it's still essentially the same topology, and in this case also the same
branch lengths. As a result, if we look at this tree of primates, these two representations are equivalent. The only differences are some rotations around internal nodes. You can see here that the gibbons are still together, and the agile and crested gibbons form a clade, for instance. However, this is different from that topology. Can someone spot the difference between these two trees and put it in the chat? Forget about the mobile; I mean the difference between this tree and that one, since these two here are equivalent. Let me see your responses in the chat. Yes, the orangutan, very good. This is where we have a difference: at the top, the orangutan is outside of the hominids here, and in this case the gorilla is outside and the orangutan is inside. So I guess you start to see the differences that are meaningful and the ones that are not. Excellent.

One important thing, and I already alluded to it, is that there is also some connection between the amount of divergence at the sequence level, which is the thing that we can infer, and time. There's this really classic paper by Zuckerkandl and Pauling from 1965. You can see this is pretty shortly after the structure of DNA was discovered, in 1953, and after people started to understand the genetic code. Fairly quickly there was this idea that the changes observed might also be indicative of the amount of time elapsed. That's also connected with the neutral theory of Kimura, which came a bit later, first proposed in 1968 and published in book form in the early 80s. Around that period people were trying to understand this relationship between sequence divergence, here on the x-axis, and divergence time, here indicated by the fossil record. And it looks, for instance if you look at cytochrome c, like there is a
pretty linear relationship. Particularly if you get rid of all the outliers, you get a very nice line. So this already hints at the fact that it's not a perfect law: for any gene you will find some exceptions, and typically the further you go back in time, the bigger the departure you'll observe between sequence divergence and divergence time. However, it is still helpful for the interpretation of trees.

This is a bit of follow-up work, I think a beautiful figure. You can see different types of genes that evolve at different rates, and in this figure it looks like a perfect molecular clock, where this relationship holds beautifully. But of course all this data was quite seriously massaged, and as more data came in, these things started to break down a little. This is more recent work, though arguably still before the high-throughput genomics era, and already we can see that, depending on the clades and the distances, there are some jumps. It's no longer really a straight line, and things get messier. Still, you will often observe fewer changes between sequences that have diverged more recently, so even if there's not a linear relationship, at least you can get a loose ordering.

So if the clock holds, you should be able to calibrate your tree and give dates to some of the key splits. And importantly, if all of the species here are still alive today, still extant, then they should all appear at the same distance from the root, at the same height of the tree. Now, if I go back to the COVID data, for instance: because this is quite a quick evolutionary process and sampling is
happening at different time points, even for a clock-like tree you may have variation in where the leaves are in terms of the height of the tree.

OK, I'm actually running out of time, and I'm thinking that we are going to leave some of the methods part, how to infer a tree, to the practicals this afternoon. At the end of the day I think the interpretation of trees is much more important for all of you than knowing how to infer them. Both are important; I just have to make some tough choices, and I'd rather have you start with a good interpretation of the tree, which is why we may have to cut a bit on the inference part. I think that could also be done more hands-on, together with the exercises. And of course I will give you my slides, so you can have a look at the summary, just enough so that you can do the exercises and learn in a more active way.

The last thing I'd like to cover before we have a break is how to root the tree, because we've already started discussing this a little. Again, you can't get the root straight out of most programs and most models of evolution, because they're essentially just measuring differences, and if you just see some differences it's hard to know in which direction the change happened. If you have two people standing apart, and you know they were once at the same place, it's hard to know who walked away, or maybe both walked away. You can think of it a bit in this way. A tree relates distances among all of the leaves, but it's very hard to know where it all started just from this divergence. The only methods that can attempt to do that from the sequences themselves
are methods that assume there is a general trend, say towards a different GC content or amino acid composition, in which case you can maybe polarize your tree in this way. But these are very unusual methods. The way most people root a tree is by using prior knowledge. For instance, if these are all apes, you may want to use a monkey, which you postulate is an outgroup. You add this monkey to the tree reconstruction and then simply postulate that the root is on the branch leading to the outgroup. That would be a very common way to do it.

But I want to draw your attention to something we're going to discuss in the next hours, when we go from a species perspective to a gene evolution perspective: it's sometimes not so easy to find a good outgroup, because, for reasons that I will discuss later, one is not always available. So an outgroup is what you need here. Of course, it could be really quite distant. For instance, if you took a bacterium, you would say: OK, I really want to be on safe ground, I'm going to take E.
coli as an outgroup. You could in fact do that for some genes that we can still relate across primates and bacteria, across all of the tree of life. However, the problem is that your tree then becomes really quite uncertain in terms of where this bacterium connects. So you then have more of a problem: you know definitely on which branch the root is, but the rest of the tree, particularly the point where this branch connects, may start to become quite uncertain. So the ideal outgroup is not too far, so as not to ruin your topological inference, but is far enough that you know with confidence that the root is on that branch.

And for the sake of completeness, and sorry, I do have this tendency of not being able to stop talking, there is also another way which is really quite common, which is to take the midpoint. Midpoint rooting is to say, okay, let's just find the point in that tree that will result in the most balanced tree. This might fail when you have several competing choices, because a tree is rarely completely balanced. For instance, if you looked at the mammals, the rodent clade tends to evolve faster, and that may mislead you: if you use the midpoint to root the tree of mammals, you might get the wrong root. However, it gives you a rough idea. It stands to reason that if the root was here, in a place that would really unbalance your tree, that would imply that in the same time as you went from the root to the crested gibbon you had all of this other evolution taking place, which is really unlikely. So midpoint rooting, giving a balanced tree, still gives a rough idea. I wanted to mention this: sometimes it's disparaged, but actually it's better than nothing, and in some cases, particularly
when you have few splits toward the middle, it may be a very good estimate. I'm thinking ahead to what we're going to see later: if you have one duplication near the root, then long branches, and then two subtrees, it's going to be very difficult to envision a rooting that requires something completely unbalanced.

Okay, so I think you still deserve some break. We're a bit late, but I don't want to cut too much into the break, so how about we try to start at 10:05, well, in 10 minutes, in whatever time zone you might be. You can grab a beverage, and I'm going to open up these breakout rooms, if you like. So if you'd like to spend some time chatting with some people on the course, once you have your beverage or you've stretched your legs, you can join one of the breakout rooms. Otherwise you don't have to, and we will restart very soon. Thank you very much.

So let me actually... Patricia, are you still on the call? Yeah, I am. You've got to do that, because I need to be host for that, not merely co-host. Okay, I have created some rooms. Okay, and people can decide to join in or not whenever they want? Yeah. I see my iPad is asked to join breakout room two and then my computer breakout room six. So you have to double your personality.

Okay, very good. This was a very short break, but I had an exchange in a breakout room, and I was also thinking about it as I was pouring my coffee: perhaps some of you would be very disappointed in not hearing more about tree reconstruction now, particularly those of you not attending the tutorial this afternoon. So I have prepared a little poll, there's no correct answer, on Wooclap to ask you whether you'd like to hear a bit more about trees now, to survey a little bit
your preferences. So please go and answer this now, and it will give me a sense of whether you want to move on to orthology or do a little bit more on trees. Okay, so as not to spend more time on the survey than on the topic itself: I can see there's about two-thirds who would like to have just a very brief overview, to leave enough time for the rest, about 28% who would be keen on just moving on, and very few who want to hear about tree inference all morning. That's very helpful to know. So I will try to be really concise, and then those of you who are in the practicals will have a chance to put things into practice a bit later. Let me just share my screen.

Okay, so how do we infer a tree? I'm just going to give an overview. The one thing to note first is that trees relate an average of the evolution over the entire sequence. You're going to have some positions that are highly conserved and some that are more variable, and this is what all of the models do: they treat every column in our aligned sequences as one independent piece of information. As I already pointed out, you first need to find the corresponding characters, and this is the process of alignment. I'm sure that most of you know about this already. The purpose is to identify the characters that are homologous, that all descended from the same residue in the same molecule. There are lots of tools for that, I'm sure you've got your favorite, and I'm not going to spend time on this. What I would just say is that, I mean, it depends how you want to classify these things, but one way to do it is to say there are two main
groups of methods. Some try to infer a tree that explains this pattern based on the similarity of every pair of sequences, and for that you have methods such as UPGMA and neighbor joining; I'm just going to give a 30-second overview of that. Or you could try to build the tree with a more explicit model of how the sequences might evolve along the tree, look at lots of different scenarios, and then decide whether these scenarios are likely in light of the data that you observe. For instance, if you observe very few differences between mouse and rat but more differences between the rat and the human, then a model of evolution along a tree where the human and the rat are very close is going to be rather unlikely, and so maybe this topology and these branch lengths are going to be dismissed. That's roughly the categorization.

In general, what is happening under the hood when you run a typical tree inference package is that you first need a way to get a starting topology, a first guess of what the wiring of your tree might be. For that you could use, for instance, this type of clustering approach that starts with all the pairwise differences. Then you score that tree with a model, for instance a probabilistic model: what is the chance that the sequences evolved along that scenario, what's the chance of observing the data that we have now? Then you compare this with another tree, and if you see that this other tree is more likely to have given rise to the data, you're going to favor it. So it's a maximum likelihood type of approach. Or you may also have a Bayesian flavor of that, where you also have some priors and you don't only consider the maximum likelihood
estimate but a distribution of trees. And how do you search that enormous space? We saw there are many, many different topologies. I no longer have the slide on, but you remember there were many billions of topologies already for trees of a handful of species, maybe a dozen leaves, and usually we're dealing with much larger trees. We cannot test them all. So typically these programs apply some local changes to the tree and see whether they get an improvement or not. I'm not going to go into the details now.

I mentioned UPGMA: this is an approach that starts with some sort of pairwise distances, and then you recursively group the things that look the closest. Without going into detail, if this was a representation, an embedding in 2D, you would pick first these two because they are very close, and then maybe these two, and then you group everything, and that gives you a topology.

Distance trees are a bit different: you have some unknowns, the branch lengths on your tree, and you have the distance between any pair of leaves, and you're trying to reconcile the two by finding the branch lengths that best explain these pairwise distances. Because the number of pairwise distances grows quadratically, but we've seen earlier that the number of branches only grows linearly (every time you add one more leaf you have a constant number of additional branches), this can be solved. But again I'm not going into detail here.

Then trees that use parsimony: again, they need to already have some candidate topologies, and what they will do is to say, okay, how many changes do we need to explain the data that we have, given that topology? And so
for instance, to give a very concrete example: this is not DNA, these are just characters that I have drawn here. If all of these species have a tail, then no matter which tree you're considering, you will never need any change; you could just assume that all the ancestors also had a tail. But here, species one and two cannot fly and species three and four can fly. With this topology you just need one change: you could assume that the ancestor here couldn't fly, and then on that branch they acquire this ability once, so that's one change, and it's retained in species three and four. So that topology will get a better score than if you had, for instance, one and three grouped together and two and four grouped together, because then you would need flight to be invented twice or to be lost twice. This is how you can start scoring. The parsimony approach just counts the number of changes. It has some shortcomings, but it's extremely fast.

And maximum likelihood, I mentioned already: essentially, in this mathematical formula, we want to maximize the probability of observing our sequences given a model of evolution, which may have some parameters. But don't worry too much about the model here. All we've discussed so far, and I think that's enough for this discussion, is just counting the number of changes. Of course we know that some substitutions are more likely than others: there's a difference between transitions and transversions at the DNA level, and in fact there's some difference between any types of substitution. We see this, relating this again to SARS-CoV-2, where we see an excess of C-to-T changes; that's very specific and could be taken care of by the model. But here we've just talked about counting the number of
changes and assuming that this is equivalent for any character to any character, and that's a start. Also, this is given the tree, so this is given a topology and some branch lengths. Now you can see that if you change the model and the tree and you get a high likelihood of observing our sequences, that may indicate that we have the correct model and tree. But the cost of that is that you have to try many, many different combinations, and it's not always easy to do this optimization. You're never going to be able to consider all the possibilities, so even if your data is very well described by your model, you may still not find the best tree in some cases.

Okay, the only thing that is maybe really worth mentioning here is bootstrapping, and branch support in general. I've highlighted some of the difficulties we have in inferring the correct tree, so gauging uncertainty is of high interest. I'm sure that many of you have already seen these numbers that are provided as branch support measures. First of all, I want to make it very clear that almost all of these measures are with respect to a branch: not a node, not the rooting or some other aspect of the tree, not a global measure of the tree. This is something that measures the confidence we have in a branch. Now what does this mean? This is actually quite debated in the literature, but the rough idea, which I hope is defensible in all circumstances, is that if you have a high branch support you have a high confidence that that branch exists. And what does it mean that a branch exists? It means you can separate the things that are on one side of the branch from the things that are on the other side of the branch. So for instance, if
I go back to a tree, yeah, here, let me take maybe the simplest example. If this branch has a high support, I have a high confidence in this branch; this means I believe that the gorilla, human and chimps are on one side of the tree and the orangutan and all of the gibbons on the other side. So this has an unrooted interpretation. The support is almost always like this. I say almost because I'm sure someone can brandish a paper where they did it a bit differently, but that's 99 percent of the cases. So the support relates to a split, a branch.

And maybe that one is uncertain. In fact, because of incomplete lineage sorting, you can find some genes where the human and gorilla copies coalesce with each other before they coalesce with the chimp copy, because these were not just single individuals, you had a population. In the same way, any two persons participating in this course have some genes that have a much more recent common ancestry and some genes where you need to go back deeper in history to find a common ancestor, because of recombination, and that could even span beyond the boundary of species. So that short branch is actually rather uncertain, in the sense that we can certainly find a lot of genetic material where that is not even going to be the correct topology. In terms of the species in general, when we think of the time of speciation, it is well accepted now that human and chimp should group together. However, from a tree reconstruction point of view, depending on the molecular characters that you're looking at, that branch may be quite uncertain. But if this branch has a high
support value, then regardless of the particular ordering of the branches in that subtree, we will definitely be able to split these two parts.

Okay, so how do we compute that? Bootstrapping is quite an elegant approach, and I do have a slide on this that I want to show you. Remember we have our input matrix, our input alignment, and each column is like one observation. What we would ideally want to do is to say: why don't we take another sample of the data, reconstruct the tree, and see how different the trees are? If we get exactly the same tree, by which I mean all the branches are still present in that tree, then we have a very high confidence. If however we see some rearrangements around some of the branches, then we may wonder how robust our prediction is. So it's a kind of resampling test.

Now the way this works with bootstrapping: we typically can't generate extra data for our data set, but we can resample from our data. We are resampling with replacement, and this is backed by some theory: if you've got enough data, this approximates sampling from the underlying distribution. So for instance here, if your best estimate, your maximum likelihood tree for these four sequences, gives you A and B on one side and C and D on the other side, now you can do some replicates, or pseudo-replicates as they are sometimes called. You can see in this case I sampled the second column twice, because it's sampling with replacement, and I've lost the fourth column, but this is still a legit thing to do, and when you have large data matrices it approximates this underlying distribution quite well. In this case maybe it doesn't affect the topology. And you do that a number of times, quite often people do that a hundred times, and then you could look at, for
instance, your middle branch here, which is the only branch that is uncertain, and see how often you observe the branch that you observed on your original data. Here we have just two replicates, and we only got A, B on one side and C, D on the other for one of the two replicates, so here we would say that the support measure is 0.5. You can see it's a bit costly because you have to do all these replicates: if your inference process took you a certain time t, now when you do a hundred trees you need a hundred t to do the inference. Of course, these days people use clusters, so maybe it's just a parameter that you change to ask for a hundred processes in parallel, and then you're not spending more time waiting for your tree. But there are also lots of other measures that have been developed, which are typically approximations for speed. Still, the bootstrap is something that I would call almost the gold standard for branch support.

Okay, I was going to stop here, but I see there's one question, and actually there have been a few more questions, so let me answer these. First one: is there really a big difference between trees inferred by different algorithms? That's a good one. You could sometimes have differences even between trees that are inferred by the same algorithm if you run it on different machines, because of some of the decisions made along the way. But if you take into account the uncertainty of the tree with the support measures, hopefully you won't get really inconsistent results. If however your conclusion depends on the choice of a particular tree, you have to be very careful. In the literature there have been a lot of debates. Usually the methods that are viewed as being
more reliable are the likelihood approaches, maximum likelihood and Bayesian approaches, rather than parsimony. You can find some cases where parsimony will fail systematically, and people don't like the idea that as you add more and more data you converge to the wrong answer; that's unbearable to some people. But I think if you need to rely on a particular model to support your conclusion, you should have a very careful look at your tree and ask yourself why it is that these assumptions are affecting your conclusion. Ideally your conclusions don't hinge on a particular choice of method.

Okay, now another question: if the bootstrap value is low, is it better to take the sequence out of our data? Well, again this depends on your conclusion. I think it's perfectly fine to have parts of your tree that are unresolved if these are not the parts of the tree you're drawing strong conclusions on. That's okay: you let the data speak and provide some confidence intervals, and it may be irrelevant. If however this is to support your main finding, then you have got to discuss your conclusion and maybe weaken it a little bit. And in some cases there may be some tactical decision: if it's a part of the tree that is not so relevant and you'd rather have a fully resolved tree, yeah, it could be good to remove it. In fact, coming back to SARS-CoV-2 data, there we also have this phenomenon, because we have so many very short branches. It may be that many of these short branches are hard to resolve, but we can still tell at a more general level that this group of sequences is separated from that group. So if the focus is on these more distant ones, then yeah, definitely remove all of the intermediate ones; they are not that relevant. Then you end up with one long
branch which is well supported and which at least enables you to relate these two groups. So you might want to remove some sequences if they are not that interesting but induce a lot of uncertainty in terms of a collection of short branches.

Okay, I see also some questions on the Google Doc. Are there really big differences between trees inferred by different algorithms? Okay, that is the same one, so that's covered. The tree similarity measure, I should say, is specific to an algorithm, so I would lump that into the same thing. The similarity measure could also have a probabilistic interpretation; there are many different distance measures, so I would just consider this as part of the algorithm.

How do you combine several protein sequences to measure species distance? Okay, very good, we are going to have that in the practicals, and this is already something that we can discuss in the second part. In the second part we are going to identify the relationships between genes, and then indeed, if we think of a species as a bag of genes, a common way to do that is to concatenate, to put together, the observations that you have for orthologous genes. But you need orthologous genes, and we're going to see that in a few minutes, so maybe hold that question for a bit later.

And then: could bootstrapping be misleading when looking at highly similar tandem-repeat-containing units in the multiple sequence alignment? Okay, this is quite a specific question. I think what whoever asked this question means is that when you have tandem repeats it is quite difficult to do a good alignment, because you have different units that are all quite similar and you're not quite sure which one relates to which, so the alignment methods may struggle. And what's more, if the number of repeats varies rapidly in evolution, you
tend to acquire one more unit, or lose some; in fact you could argue that you don't even have a one-to-one relationship between one unit in one gene and maybe two units that are the result of a very recent expansion. So for that, the alignment itself may not even be a very good model to relate these individual residues. I think it's potentially a big problem for the alignment. Now in terms of the bootstrap, it is true, and I think now I see where you're going with this, that the bootstrap usually fixes the alignment, assumes that the alignment was correct, and only resamples the columns. So to the extent that you made mistakes in your alignment and there's uncertainty in your alignment, this is not going to be reflected in your resampling procedure. There's been some specialized literature on this, and I don't want to spend too much time on it because it's really quite specific, but you could think about a resampling procedure that also takes into account the alignment uncertainty. That's also quite tricky, though, because you are treating every observation, each column, as independent, and they no longer are if you are trying to sample in a different way, particularly one that connects with the alignment. So this is a tricky question.

I should say also one more thing already ahead of time: this afternoon, for those of you who are in the practical, in the last hour from four to five we're going to have a clinic where those of you who have specific questions about your own project, your PhD project or postdoc project or whatever project you might have, will have more time for discussion and questions, and we could do that even one-on-one with myself or some of the posts i can
agree okay. I think we still have two more questions. I really enjoy the questions; I think this is also some of the added value here, as opposed to me doing a recording and putting it on YouTube, so it's hard for me to resist answering them now. But I'm also mindful of time, so I'm just going to take these two, and please don't write more questions for now; I won't be able to resist answering them otherwise, and we don't have any moderator to cut me off.

So: referring to the previous example, if the chimp, human, gorilla branch has high support, will the orangutan branch score the same support, or could it be different? Yeah, with the classical bootstrap this is completely orthogonal. The measure is going to be based on the different replicates, and you only focus on the branches that are in the original tree and see how frequently they occur. Typically the longer branches are more stable. So you could have this scenario where you have a long, very clear branch and a short branch next to it, and for the short branch the rearrangements are all over the place, so its support is very low, while that long branch has very good support. Now of course, if you have a whole part of the tree that is highly uncertain, say a rapid radiation of species, it could be that you have a whole cluster of branches that are poorly resolved. So everything's possible.

And then: neighbor joining and distance methods don't consider the evolution of the sequence but are still widely used in papers; how reliable are these methods? Okay, they do consider the evolution of the sequence. The problem with neighbor joining and distance methods, conceptually, is that because you look at things in isolation from one another to compute the distances, when you put everything together you have
to make sure that your distances are actually consistent, that they measure the same thing. So for instance, if you did the measurement based on very different sequence lengths, and some pairs of sequences have a lot of characters in common, which evolved at a different rate than in other pairs which have very few in common, that could skew your distances a lot. In some cases the pairwise distances are even computed with inconsistent alignments: the distance between A and B is computed on a set of characters that implies some homology, then B and C implies also a certain homology, and together that implies some homology between A and C, but then the alignment of A and C is actually different. So this is more the type of problem that you have with distance methods, which is that they could be less consistent. But less consistent sometimes could also mean less consistently wrong, right? It could be in some cases a bit more robust. So I am quite open-minded about this. I think neighbor joining trees can sometimes be surprisingly good given how quick they are. But it is true that you have to be very careful when you use this type of method in papers, because referees have very strong opinions about these.

The problem when comparing these different methods is that quite often we really don't know what the correct tree is, and the method that has the most elegant model and is shown, at least in simulation, to do the best is also going to be the method that everyone favors. The likelihood approaches also have the advantage that you can add a lot of bells and whistles, a lot of sophistication, to these models. So it's tempting to argue from first
principles, but I caution anyone against thinking that the neighbor joining and distance tree methods are necessarily worse. I would say you can find cases where they are not as good, but the devil's in the data: it depends what you're using to compute your distances, and also what sort of checks you're doing. Are you checking whether your distances are additive or not, and how much your conclusions are affected by the choice of method? For instance, trying to relate this also to the current data from SARS-CoV-2: there you have so few changes that whether you use a parsimony or a likelihood approach or in fact a distance measure, you get essentially the same estimates. There the challenge is much more about trying to remove characters where you have systematic errors; it's a different set of challenges, and not so much about these different approaches. Okay, so many questions about that, I hope that was helpful.

So let me still start a little bit on the orthology business. I'm aware that, even with all of the discipline I try to have in not getting drawn too much into these fascinating discussions, I'm still badly behind time. But anyway, we'll try to teach you as much interesting material as we can during the allotted time, and you will have the feedback form at the end, and if you want another course, more material, you can express your views there. Okay, so let me share my screen here.

So we're moving now from relating species to relating genes, and we'll see this essential concept of orthology and paralogy. I love this quote here, which just highlights how much, even more so at the molecular level than at the morphological level, there's so much that is shared across all of the species, which is
why we are motivated to relate them. Keep in mind that Albert Kluyver was a biochemist, so you could see this in terms of the biochemistry of different species, and that is certainly true at the genetic level too. Okay, so I think I may be able to cover a bit of the basic definitions and motivation for orthology and paralogy before the break.

So duplication is really the key concept here, gene duplication. Genes duplicate in a way that species don't. I mean, okay, you could think of a speciation event as a kind of duplication also, but there it is really the whole of the species that duplicates and then evolves separately. Here we have quite often some small parts, just individual genes, that duplicate. I think the key book there is Ohno's book from 1970, Evolution by Gene Duplication. It's the 50th anniversary this year of that really wonderful book. It's very accessible, and I encourage all of you, if you're interested in this topic in general, to have a look at that book; it is mind-blowing to think that a lot of the ideas are still current 50 years later. In this book there's a great deal that is discussed about the fact that there's this tremendous constraint on the functional part of genomes, and that the way you can evolve a new function is by having some part that duplicates, and then the new copy might take on some new role. Okay, it's possible that it might still keep on doing whatever it was doing originally, but generally, if that is the case, you could ask how purifying selection can then apply and get rid of random mutations that are happening in these parts. And so quite often what will happen
then is that, after the duplication, you actually lose the functionality. But in some cases the copy is then free to evolve a new function, so neofunctionalization. The other phenomenon that is often mentioned, and Michael Lynch is a good reference for that, is the idea of subfunctionalization, where both copies can specialize a little bit into different subtasks that were maybe performed in a general way when you only had one copy, but then necessarily with a bit of a trade-off between these two more specialized tasks. In any case, duplication seems to be a very interesting phenomenon to understand the emergence of new function in the genome. But this also has some practical implications. So here I'm giving a very simple case with, say, mouse and rat, and we have these two copies. It's very clear in this case, when you have the two copies, that you will have two pairs. Sorry, so here the model is: we have an ancestral duplication that gave rise to these two copies, and when we compare the mouse and the rat, the corresponding copies are the orthologs. Okay, I don't want to get too much ahead of myself. So we have got the orthologs, and then the paralogs will be either within the species, these two still related copies, which were once just the same gene but are the result of a duplication that happened quite a bit of time in the past. But you could also have paralogs across the two species. That's sometimes a misconception. So here we still call these two genes paralogous even though they are not within the same species, because they are the result of this ancestral duplication. Now where it starts to get
messy is when you add some extra species, maybe some of which have only one copy, and then you've got to ask yourself: okay, what is the ortholog? Maybe these are the orthologs, maybe these guys are the orthologs. In fact, in this case, and we're going to look at the definition a bit more precisely now, the human gene here is orthologous to all of them. So how do we define this a bit more rigorously? This is the scenario that happened in this example. We first have a speciation event between the rodents and the primates. Then within the rodents, still before the speciation of mouse and rat which takes place here, we have a gene duplication, here depicted by a star, which gives rise to these two copies, the blue and the red copy. And then we have a second speciation event. Now, because speciation events affect the whole species, all of the genetic material, this induces, in terms of the gene family tree, splits in both the blue and the red gene. This is why we have two internal nodes that correspond to the second speciation event. The last concept, the formal thing I want to introduce before the break, is that orthologs are pairs of genes, so it's a pairwise relationship, and they are the pairs of genes that started diverging through a speciation event. So for instance these two blue genes. These are genes; I just depict the species in which they are found. Don't get confused: here we're not talking about two mice, two rats and one human, we are talking about two mouse genes, two rat genes and one human gene. And then if you're looking at, for instance, the blue mouse gene and the red mouse gene, then because they started diverging through a duplication, they are called paralogs. They are not orthologs. Okay, and now I told
you that the black gene here was orthologous to all of these, and that is because the last common ancestor in each case is a node where the genes diverged through speciation again. So if you know the tree and you know the type of event at each internal node, you always know what all the orthologs and all the paralogs are. Just to cover the last case: we said paralogs within the species, here that's a duplication event. Paralogs across species, that blue gene with that red gene: if you go back in time you can see there's a duplication event, so they're paralogs. And what I really have here is a kind of orthology graph, where each node is a gene and I've added an edge when two genes are orthologous. An edge between two nodes is a pairwise relationship. Okay, so we've covered this definition already. I suggest we have another break. The previous one was quite a short break; I suggest we still keep our 15 minutes, so we can reconvene at 11:05. I don't know how the breakout rooms worked for the others; perhaps you can leave feedback in the chat. I was just with one other person. I suspect, since not everyone's joining the breakout rooms, we're going to try to have bigger rooms. Don't feel obliged to join, but if you want to join you can, and if you join for a while and then run out of things to discuss, it's also okay, you can leave. Yeah, I can see I was not the only one; in the breakout rooms we were just two. So we're going to do fewer, bigger rooms. Again, you don't need to join, but if you'd like to have a chat, reflect on the course or discuss something else, then we're going to do that again now, and we will carry on at 11:05. Enjoy your break. I know that some of the breakout rooms are still closing now, but we're going to get ready to continue. This is actually already the last session. Time flies when we're having fun.
So let me recap briefly. We've discussed the definitions of pairwise orthology and paralogy, and as you remember, orthologs start diverging through speciation and paralogs through duplication. If I go back to that idea from Ohno about evolution by gene duplication, you can see why this may be relevant. If we think that something special happens after duplication, and the fact that you have two copies that are still retained may be indicative of something new, then we've got to be a bit careful when we are looking at pairs of genes that are the result of duplications, because maybe something has changed, at least at the functional level. So that is one motivation to distinguish orthologs and paralogs, but actually it's not the only reason. One other reason, if we now relate to what we've seen earlier today, is if we for instance want to build a species tree. We had this question earlier. So this is essentially the same... no, it's actually not the same thing. Here the duplication happened even earlier; you see, this is in the ancestral mammal. I just have a different representation. James asked me earlier: do we always use a star to indicate duplication? Not necessarily. You can see in this case it is just a circle in a different color, so I'm afraid you have to read the caption or the legend of the figure for the interpretation. But here we have an ancestral duplication in the ancestral mammal, then the speciation of primates and rodents, and then within the rodents. After the duplication, the two copies have been called K and L. Now the thing is, if you just go and sample human, mouse and rat genes of the family from GenBank, if you're not careful you might end up taking, for instance,
sampling the human K copy, the rat K copy and the mouse L copy. If you then look at the relationships in the true tree, you have human and rat on one side and mouse on the other. So to the extent that you're looking at a rooted tree, you're going to get the wrong topology. And in terms of the branch lengths, things are going to be really out of whack, because you have seemingly much less divergence between human and rat than to mouse, which is all of these branch lengths. So when you're building a species tree, or you're trying to use the genes as a proxy for the genome in general, you've got to take orthologous sequences, or else you might end up reconstructing the history of that particular gene family, because it will look like that. The more general point is that the gene tree is not necessarily congruent with the species tree; they are two different things. However, if you take a gene tree and only consider genes which are orthologous to one another, meaning there are only speciation nodes in the tree, then the two should be congruent. I'm saying "should" because we've also discussed the fact that not every gene in a genome follows the species tree, even when you're taking orthologous genes, the corresponding loci: because of population-level effects and the fact that it takes some time to coalesce back in time, because of incomplete lineage sorting, you may occasionally have some small rearrangements around the tree. But if you take a paralog, then you will almost definitely have a different tree. Okay, so just as trees pervade the life sciences, orthology is also used in lots of different contexts. I mentioned how there are some functional implications. We have very few model species and complete genomes that are well annotated, and usually whenever we sequence a new species, a new genome, we really rely on finding orthologous sequences to then
propagate functional knowledge. In some cases we even use paralogous sequences, because although orthologs tend to be a bit more conserved, the difference is not always very large, and so it may still be better to have a rough idea about the function using any sort of homolog than to know nothing about the gene. Still, orthologs are usually favored: if you have to make a choice between different potential sequences from which to propagate function, people use the ortholog. That could also be useful if you want to know what sort of model species is appropriate for a particular disease or a particular aspect of human biology. For instance, I'm thinking of diabetes and the fact that insulin is duplicated in rodents. Arguably this one-to-two relationship suggests that there have been some functional changes within the rodents, so it's maybe not ideal to use mice as a model for diabetes. Now of course there are also some practical considerations: maybe it's your favorite model species, the one where you already invested all of the technology and infrastructure. But nevertheless, where you have some choice, it makes sense to try to see which aspects are conserved. And I think the trend is towards more model species, because the tools that we have at the molecular level make it a little bit easier to establish new model species. Okay, I see there's a question here; I'm going to take this first question: if protein K duplicates into K and L, what determines which will remain as protein K and which is protein L? Is it arbitrary? Yes, indeed, it is entirely arbitrary. In this case we just have two copies and we call one K and one L, but we could have called them alpha and beta, or one and two, etc. Now in terms of how the function evolves, hold that thought. I'm
going to have something about that aspect; we'll refine this. And then the other question: would it be right to say that orthologs would be best suited for the study of, say, arboviruses? I think that's too general a question for me to answer. I mean, what do you mean by studying arboviruses? We'd need to look a little more specifically at which question you're trying to answer there. Okay, and one more question: what if we're looking at pathways involving a group of genes rather than a single gene or protein? We're going to talk about groupwise orthology, but maybe that's not what you mean. Here, if you want to see what is best conserved, indeed you may want to take a broader view, looking at the whole pathway and seeing, for all of the members of a pathway, how many of them are conserved in a particular species. That would be: can you find an ortholog for each one of them? Then maybe the whole pathway is conserved, or in some cases you may see that only some parts are well conserved and others are not, in which case you may get a little bit nervous about doing inference about the pathway in general. There's also the technique of phylogenetic profiling, which is relevant there. It is the notion that parts of the same pathway tend to be kept and lost in a correlated manner. So this is more like an application of orthology; this is something I could have added to that slide. If you're interested in identifying genes that are part of the same pathway, you could look at which genes tend to have orthologs in the same species in a correlated way, and which ones are such that when one copy is lost or gets duplicated, the other ones also get lost or duplicated. So that's also something that is maybe giving a broader view than
just a single gene at a time. And I see one more question here: in comparative analysis, when searching for a gene in a genome using a tool such as BLAST, how can you be sure that you're returning an ortholog or a paralog? Okay, hold that thought. We're going to look at inference now; we need to move forward to cover that part. So that's a good question, but just a little bit too soon. Here, and this is maybe also relevant to the question about K and L and which copy is which: sometimes orthology is a little bit misused. The first case: I did mention that orthology could be indicative of having the same function. One aspect to keep in mind, and you could see this in the tree, is that between two species the orthologs are also the most closely related sequences, so they had the least time to diverge functionally, because by definition they were the same gene in the last common ancestor. If you just go back to the last common ancestor, all of the orthologs map back to the same gene, so at that point they are the same thing, and it's only then that they start diverging, whereas the paralogs start diverging earlier. However, this is not part of the definition. So don't say "these are orthologous genes" if you mean "these are two genes that have the same function"; use a different word for that. There's not really a good word that has stuck. I prepared this slide many years ago, and I can't say that there's been much improvement in the literature in terms of having a word for genes which have the same function. Maybe you should use a paraphrase for that, maybe "isofunctional homologs" or "equivalents". But still, isofunctional homologs also requires your two copies to be homologous, that is, to have a common ancestry. If you have two genes that converge to the same function but they're
not related, you wouldn't want to call them isofunctional homologs. "Equivalents", that's been used. But the bottom line here is that although orthology, which is an entirely evolutionary definition, may tell you something about function, it's not defined that way. So don't use orthology to mean genes that have the same function. The other thing, which perhaps relates a little bit to the question, is this: if you think about the process of duplication, you may have a tandem duplication, perhaps through slippage during replication, or you may have retrocopies that are copied somewhere else, or you may have some rearrangements. So you may have a situation where you think: actually, there's one copy that is in its ancestral locus. This is what I've tried to convey here. Imagine in the ancestor you have red, black, yellow. Then on one side you remain in the red, black, yellow order, and here you have a duplication, but it's really that black gene that gets copied somewhere else. You might be tempted to say that the two copies that are syntenic, that are in the same context, are more orthologous than these two copies. But actually orthology in itself has no implication about positions, and there's no asymmetry about which one is the original copy and which one is the new copy. Both would be called orthologous to that black gene here, following the duplication. So if you want to be more restrictive, you can say, for instance, positional orthologs. The positional ortholog of that black gene in this species would be that ortholog, and this other ortholog will not be positional, not syntenic. Okay, I see there was another question about whether there's a publication that could guide
the points that I made about genes in similar pathways having correlated orthologs. Yes. We'll make sure to add this question to the Google Doc, and we will make sure that it is answered, but Pellegrini 1999 is the reference that springs to mind, and there have been many since then. Okay, so now, inferring orthology. That was the question: how do we now find these orthologs? There are two main classes of methods. I have to say, this has been the classical way of teaching orthology and of classifying orthology methods in the literature, but as time goes by I'm questioning more and more how helpful this distinction is in practice, because most methods these days mix up these two concepts a little bit. But let's say for didactic purposes it may still be helpful to think of these as two distinct ways, and so I'm going to handle them in turn. One approach is called graph-based pairwise approaches, and then there's one that is based on trees. We learned all about trees earlier today, so we're going to put this knowledge into practice. So let's start, and for now just view these as little pictures, but we are going to get back to them. Pairwise approaches are based on this observation which I made a few minutes earlier. You're looking at two species, here fish and human, and again this is a very simple case of an ancestral duplication: you have two copies in the common ancestor of fish and human, which are retained in the present-day genomes. So remember which ones are the paralogs and which ones are the orthologs? If you like, you could leave a message in the chat: x1, x2, y1, y2, please give me a pair of orthologs or a pair of paralogs. But the key point here is that the orthologs, for instance... okay, no one's typing anything in the chat, but I'm
sure that most of you following here remember the definition I mentioned: of these four genes, the pairs of orthologs are the ones that start diverging through speciation events. So for instance, exactly, x1 and y1, or x2 and y2, these are the orthologs. You'll note that they have less divergence; they are closer in terms of distance on the tree than any pair of paralogs, which requires us to go all the way back to the duplication event. So we have that distance and that extra distance. This means the orthologs are closer than the paralogs. This holds just when you're comparing two different species. That's why people are using this BLAST approach that was mentioned earlier in the chat message from Eliza Ramos. If you start for instance with x1 and you do a BLAST search in human, chances are y1 will give you a better score than y2, because it is a little bit closer. Closer genes usually have higher alignment scores, and so the top hit is likely to be an ortholog. Now this can fail sometimes. Imagine you start from x1 and somehow y1 is missing from your database, maybe y1 was lost. Then the closest one is going to be y2, so you're going to make a mistake. This is why, as an extra requirement, symmetry is often required: x1 is closest to y1, and then for y1 the closest entry in X is also x1. This is referred to as bidirectional best hits or reciprocal best hits. It's very commonly used to try to find orthologs between two species. Okay, any questions about bidirectional best hits? Does it make sense? Feel free to type your questions, but I'm already going to be moving on to some improvements. So there are some limitations with this. The first thing is the score, the BLAST score.
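The bidirectional (reciprocal) best hit idea just described can be sketched as follows. This is a minimal illustration, not the procedure of any particular tool: the function name `reciprocal_best_hits`, the dictionary layout and all the scores are made up for the example, and a real analysis would take the scores from an alignment search such as BLAST.

```python
def reciprocal_best_hits(scores):
    """Return the gene pairs that are each other's top-scoring hit.

    scores: {(gene_in_X, gene_in_Y): alignment score}, higher is better.
    Ties are resolved in favour of the pair seen first.
    """
    best_xy, best_yx = {}, {}
    for (x, y), s in scores.items():
        # Best hit of x among the genes of species Y.
        if x not in best_xy or s > scores[(x, best_xy[x])]:
            best_xy[x] = y
        # Best hit of y among the genes of species X.
        if y not in best_yx or s > scores[(best_yx[y], y)]:
            best_yx[y] = x
    return sorted((x, y) for x, y in best_xy.items() if best_yx[y] == x)


# Made-up scores for the fish (x1, x2) and human (y1, y2) example.
SCORES = {("x1", "y1"): 200, ("x1", "y2"): 120,
          ("x2", "y1"): 110, ("x2", "y2"): 190}

# Same scenario, but y1 is missing from the database.
SCORES_Y1_LOST = {("x1", "y2"): 120, ("x2", "y2"): 190}

print(reciprocal_best_hits(SCORES))
print(reciprocal_best_hits(SCORES_Y1_LOST))
```

With the full score table both orthologous pairs are recovered. When y1 is missing, the one-way top hit of x1 would be y2, but the reciprocity check rejects it because y2's own best hit is x2, which is the extra requirement motivated above.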
That's all nice, but the problem with the BLAST score is that it really depends on the length of your match: if you double the length of the sequence, you will typically roughly double the score. So for instance, if you're comparing fragments of genes, that will usually fail quite badly if you just use BLAST. What was introduced very early on is the idea of reciprocal smallest distance. So now you compute an evolutionary distance, for instance a number of substitutions per site. That was introduced by Wall et al. in 2003. One thing also to note: here I have a bit of a more complex scenario. It's a bit more abstract, but it is a scenario we have seen. Remember the human gene that was orthologous to every gene in the rodents? This is basically another depiction of that. So we have a first speciation event, then a duplication followed by the second speciation. In this case, the problem is that with BBH you're only taking the top hit. So even though A is orthologous to b1 and c1 and also to b2 and c2, it's actually orthologous to all of these guys, when you're comparing with the other genome you're only going to pick one of the two, the one that has a marginally higher score, perhaps because there have been fewer substitutions in one of the two copies. So you're going to miss out on non-one-to-one orthology; with one-to-many orthology, BBH is going to fail. There are some methods, and we've developed one with OMA, but there are others, that address this issue and try to also capture these one-to-many relationships. You may also want to take into account the uncertainty in your distance estimates. I see there's also a question about that in the chat: what do we mean by distance? Again, you need a model of
evolution, and the distance is typically going to be computed as an average number of substitutions per site. Again, the model could be more or less sophisticated. However, this has the nice property that the expected distance is going to be the same whether you're looking at half of the sequence or the whole sequence, so it's a little bit more robust to variation in sequence length. In particular, when you're working with low-quality genomes, which tend to have much shorter fragments, or sometimes if you're looking at different isoforms, where you have fewer or more exons shared across the protein sequences that you are comparing across species, you will be a little bit more robust with a distance approach than by just taking alignment scores. And then finally there's the last point, the last refinement, and this is something that we did, so we're particularly proud to mention it in lectures. How essential this is, I think, will depend on how often this scenario occurs. Differential gene loss is a case where you have first a duplication, and then, well, actually let me get back to that example here. In the fish you lose one copy, and in the human you lose the other. So now you're only left with x2 and y1, and there, even with bidirectional best hits, you're going to fail: these are going to look like the mutually closest copies simply because you've lost the other ones. And this can actually happen after whole-genome duplication, where you often lose one copy or the other on some of the branches, and there's a reduction in many pathways to just one copy, and that could be a differential
gene loss. So there have been some extensions, and we've introduced one where we compare this scenario to a third species which will hopefully have retained both copies and which can act as a witness of non-orthology. We've got some references for that, but for the purpose of this presentation we're not going to go into detail. It's just to say that there have been quite a few refinements, and you probably shouldn't use just BBH; there are better approaches these days to infer orthology that will take these things into account. This is going to be revisited in quite some detail in the tutorial this afternoon. Now a few caveats. Actually the main caveat, which is a little bit like the question we had about distance methods for trees, is: how do you get from a pair to more than one? We are going to see that this is a real challenge. But before we get to this groupwise orthology, I do want to talk about tree-based methods. This is the other paradigm to infer orthology, and it is also very useful. Probably the most elegant method there is the species overlap approach, which is very simple but very effective. So let me show you how it works. Let's say you build this tree. This is a gene tree now, which is why you may have two copies in frog, two copies in gorilla, two copies in human. This is your tree, and you've used good methods, parsimony, likelihood, distance, and they all give you a consistent tree. Somehow you even managed to root that tree. This is important: the rooting is important for this approach, so here's already a weakness of the approach, that it needs a rooting. But okay, let's say that you've got a rooting, let's say that midpoint rooting worked here and it is quite reliable. Now the question is: can we infer what are the duplication and what are the speciation
nodes? Because remember, from my introductory slide, if we can mark every internal node as duplication or speciation, we're done: we can take any pair of genes and figure out whether it's a pair of orthologs or paralogs. So we just have to find the duplication nodes. You can look at this scenario and ask: if we take the tree at face value, can anyone make a guess at one internal node that is very likely to be a duplication? I'm going to give you a little hint: concentrate on that part of the tree. You see these two frog genes? We've got two frog genes, quite closely related. How could this be a speciation node? It doesn't make any sense to think that we would have a speciation node that relates two genes that are inside the same species. So here the fact that we have two genes inside the frog is telling us this must be a duplication node, because of this overlap in the species. But we can take that a bit further. For instance here, we have a human gene on this side and a human gene on that side, so surely this is not a speciation event. So we're going to tag this as a duplication event. And actually we have one more: you see here these two gorilla genes, which indicates to me that this has got to be a duplication event. So we can already tag three of these internal nodes as duplication events. Now, is this one a duplication event or a speciation event? Hard to tell. We've got frog genes on one side, then gorilla and human. That could well be a speciation; it could be the speciation, within the vertebrates, between the amphibians and maybe the mammals. But it could also be a duplication followed by some losses.
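The species overlap rule walked through above (a node whose two subtrees share at least one species is tagged as a duplication, otherwise it is treated as a putative speciation) can be sketched like this. It is a simplified illustration under assumptions: the rooted gene tree below is a hypothetical one, not the tree on the slide, and the `<species>_<copy>` naming convention and the function names are invented for the example.

```python
def species_of(gene):
    # Assumed naming convention for this sketch: "<species>_<copy number>".
    return gene.split("_")[0]


def label_events(node, events):
    """Tag every internal node of a rooted binary gene tree.

    A leaf is a gene name; an internal node is a (left, right) tuple.
    Returns the set of species found below the node and fills `events`
    with "duplication" or "speciation" for each internal node.
    """
    if isinstance(node, str):
        return {species_of(node)}
    left, right = node
    species_left = label_events(left, events)
    species_right = label_events(right, events)
    overlap = species_left & species_right
    events[node] = "duplication" if overlap else "speciation"
    return species_left | species_right


# Hypothetical rooted gene tree with frog, gorilla and human genes.
GENE_TREE = (("frog_1", "frog_2"),
             (("gorilla_1", "human_1"), ("gorilla_2", "human_2")))

EVENTS = {}
label_events(GENE_TREE, EVENTS)
for node, event in EVENTS.items():
    print(event, node)
```

In this sketch the node joining the two frog genes and the node joining the two gorilla/human subtrees are tagged as duplications, while the root (frog on one side, gorilla and human on the other) is left as a speciation. As noted in the lecture, a node without overlap could still be a duplication followed by losses; the rule cannot tell these apart without extra information.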
However, you notice what I did here: the reason I'm thinking this is probably a speciation node is that I'm using my knowledge of biology about the relationship between these species. So far, with species overlap, I didn't take that information into account. I was just looking for nodes that connect two subtrees which contain the same species left and right. That was enough to tag these three nodes. Here I'm using some extra information. So that's actually the other approach. By the way, species overlap is much more recent than the classical gene tree/species tree reconciliation, where, a bit like what I did here, you use your knowledge of the species tree to infer the duplication nodes in the gene tree. So for instance here, if I do reconciliation: you can see I cannot use species overlap, there's no overlap, there's only one gene in each of these species. However, I know that gorilla and human are together in terms of the species tree. So the only way this could be the correct tree is if you had a duplication here, way before the speciation between the amphibians and the mammals, or their common ancestor, and then you have losses: you've got a loss in the frog and a loss in the human, and on this side you've got a loss in the gorilla. That's how you do the reconciliation. So this is in a way taking into account more information: you need a model of the species tree, and then you reconcile the species tree and the gene tree. What do I mean by loss? Well, genes can duplicate and they can get lost. We saw this in the first scenario: if a duplicate is not really useful, or maybe even detrimental in some cases, it's going to be wiped out. Maybe it's going to be retained in some of the species, but then it's wiped
out later. So here this is one scenario. Another scenario is that you lose one of these frog genes; that's possible. And if you're looking at a gene family which has a lot of duplications and losses, you might end up with a very sparse tree. I'm going to skip the slide here on a simple algorithm to do the reconciliation. If you're interested and you don't have access to the slides, you could also ask me this question on the Google Doc and we'll make sure to add the reference at the end. This is a description of a method to do this reconciliation; it looks a bit daunting here, but it's a very simple idea. Okay, just a few caveats, because I do want to spend a bit of time also telling you about groups. So, a few caveats about tree-based methods. We've already seen one, which was the rooting. But of course the tree accuracy could be challenging. You know how difficult species tree inference is; I mentioned to you there are many parts of the tree of life that are still hotly contended. Now you can imagine that with a gene tree you have so much less information to infer these trees, and because of deep duplications they might also encompass an even longer evolutionary time range. So in terms of the amount of information that you can take into account, it's usually even harder to infer a gene tree than a species tree. And these tree-based approaches, most of them at least, will not take into account the quality of the gene tree. Again, I'm mentioning these caveats, but there are ways to overcome most of these things if you're aware of them. The rooting could be tricky. This also gives me the opportunity to mention another rooting technique, which is that you could use the number of implied duplications and losses as a rooting criterion. There have been some papers trying to do
that. So in other words, you could test different rootings and see which one looks most likely. So that may be a way: consider multiple rootings and then pick the one that seems most likely. But it is clear that both the species overlap and the tree reconciliation techniques that we just saw depend on accurate rooting. And there's an issue also about error propagation. You may have one or two really bad sequences, sometimes called rogue taxa, that are hard to place in your tree and kind of ruin it for everyone else, because that will force you to introduce a duplication node quite deep in the tree, and that has to be fixed through lots of losses, etc. And so there's a risk also, if you're comparing let's say 10 different genomes and one of them is of very bad quality, that this might affect tree-based methods, which are trying to build a global picture for all of the species, more than if you are using a graph-based approach. Okay, and if you are interested a little bit more in these different techniques, we have a book chapter that is referenced in the material, and there's also a lot of work that went into trying to compare these different approaches on different datasets: benchmarking. Benchmarking would be a topic for a whole course, so I'm not even going to attempt to say anything about benchmarking here, other than that, if you're interested in these different approaches, it's probably something that you could be looking into; happy to give you some references too. And yeah, tree methods tend to be also a bit more computationally demanding than graph-based approaches. Now, I have quite a few points here, but I'm not saying that the tree-based approaches are necessarily going to be worse than the pairwise approaches. I mean, they do consider more
information at the same time. Some tree-based methods are very quick. So I'm giving you more conceptual challenges, and then the devil's in the detail. Okay, yeah, I see some helpful references that are provided here in the chat. Okay, so one thing I do want to cover before we close, and this is really quite critical, is how do you go from pairwise orthology to groupwise orthology. The times where we just compare human and mouse, or human and C. elegans, just two species at a time, are over. It seems for most applications foolish to ignore all of the other genomes that we have. To relate different species, usually we like to look at groupwise orthology, but how to deal with this is actually quite challenging. Ideally we like to think in terms of classes, in terms of groups; this is the way our minds, especially as biologists, are wired. Ideally we would like to have groups of orthologs, and inside your groups you would have all of the orthologs: every member of that orthologous group is orthologous to every other, and none of the orthologous pairs are in different groups. Mathematically you would call this an equivalence class, which captures all of the pairwise relationships within a group. Unfortunately, the problem is that this is not possible, because orthology is not transitive. What do we mean by that? We mean that if A is orthologous to B and B is orthologous to C, it does not necessarily follow that A is orthologous to C. And if you don't believe me, I'm going to give you an example here. For instance, on that tree, we annotated that tree, let's say this is complete. So frog two is orthologous to gorilla two, right, because this is a speciation event, and gorilla two
is orthologous to frog one, but frog one and frog two are paralogous. Okay, so how do you now group these sequences in such a way that you capture the orthologous and the paralogous relationships? This is a fool's errand; at least with simple groups, you cannot do that. You are going to either miss out on some orthologous relationships or you are going to include paralogs within the same group. So what do we do? Well, we could go back to our reconciled tree. Here it's a slightly different representation, but the stars here are still the duplications, with the speciations here; I've just drawn it in a different style. But as I've pointed out now several times, if you reconcile a tree you have related this whole group and you've got all of the information. That's wonderful. I guess the challenge there is that sometimes these reconciled trees, even if they are correct, could be quite messy and complicated to interpret. You see, with just a few species you start to have very large trees. But there are some methods and some databases, Ensembl Compara, PANTHER, PhylomeDB, and there are others, that provide you with annotated trees, and that is still quite a powerful approach. You can get a lot of information from that: certainly all of the paralogous relationships, but also more information. You see here all these nodes are speciation nodes, so this group forms a very nice group of orthologs where you have a one-to-one relationship between A1, B1, D1, E1, and here you have another nice subgroup. So that carries a lot of information. There's the COGs, I don't know if some of you have come across these. They are based on triangles of BBH, well, not exactly BBH, actually it's one-directional hits, but it's essentially pairwise matches, and they form triangles and build groups. And so you get things like that, where, you know, the way
it's defined is that inside a COG everything has evolved from one ancestral speciation event. So you don't know all of the more recent relationships. This is what it implies in terms of the tree: you don't quite know, but you know that at the root you had a speciation node, if the COGs are correctly identified. So that's very good if you're interested in this kind of last common ancestor. There are COGs defined for bacteria or for eukaryotes, and they tell you something about maybe the last common bacterial ancestor or the last common eukaryotic ancestor, but they don't tell you so much about the more recent relationships. Okay, there's OrthoMCL, which is also a very common method, and that also forms some groups. The problem is again the interpretation. These groups will not contain all of the orthologs; they may also contain some recent paralogs. And so you get some groups, but it's not really quite clear what it is that you get. You can think of that in terms of the gene tree, I don't know if I have this representation, yeah: it's like you cut your gene tree somewhere in the tree, it's not really quite clear where, and then you take all of these things together, and some of them will be orthologs, some will be paralogs, but it's not quite clear which. I mean, here in this case it's quite easy, because you have C1, C2, C3 and they're all in the same species, so you know they're paralogs. But if you had C1, D2 and E3, it would not be completely obvious from the grouping which ones are the orthologs and which are the paralogs. I'm going to skip this one, but I want to mention the strict orthologous groups. So you may say, okay, let's forget about capturing all of the orthologous relationships; what we want to do is at least have a group in which I'm sure that any pair inside is orthologous. So I'm going to lose out on some of the, for instance, one-to-many relationships, but let's just capture a subset such that I only have
speciation events. And if that is the case, one nice thing is, for instance, that I can then build a tree from these sequences, and since these are only speciation events, well, I'm not guaranteed, but I think I can expect them to follow the species tree. I mean, I can use them as material to infer the species tree. Okay, those of you in the tutorial, you're going to look at these OMA groups. And then finally, this is the last concept I wanted to introduce, because that's something that is close to our heart. It is not completely trivial, but it's going to be very useful. And I would say upfront, we have a YouTube video that is also introducing this concept, so if after my explanation you're still confused, I encourage you to go on YouTube and type in "what are HOGs" or "what are hierarchical orthologous groups"; I'm sure you will find this video. For us, I think this is the most exciting concept if we want to relate multiple genes at the same time. And here's how it works; if I can just explain this to you, and maybe take some of your questions, I think I will have achieved all of the essential points I wanted to cover this morning. So we're going to go straight to a real-case scenario that I want to use as an example. Here this is the alcohol dehydrogenase in human; that is just the first copy, ADH1, which is itself in three copies. So this is a family; the enzyme that turns alcohol into aldehyde is very broadly used, a highly duplicated gene across all of the tree of life. And so we have these three copies in human. And so you might say, you sequence the chimp, and there, let's say through BBH and similarity search, you find there are two copies that seem to be quite close, and you want to understand the relationships. Okay, in this case maybe it's fairly straightforward and you have a
one-to-one relationship: this is maybe your ADH1A copy in the chimp and your ADH1B copy, through pairwise orthology. But how about ADH1C? We don't find any counterpart. Okay, maybe, we think at this point, it was probably lost in the chimp, but we want to know more about that, so we need some more species. Okay, so let's go to the baboon. There we find four genes that are somewhat related. We can look for the pairwise orthology again, using the techniques I mentioned earlier, and that is what we get. We have here a one-to-one relationship, but here all these three seem to be orthologous to ADH1A. That's an actual example, by the way. Now what is happening here? Still no trace of ADH1C, and it's starting to get a bit confusing. Maybe let's dare to go a bit deeper in the primate tree of life. We get the marmoset; there are only two copies, and we look at the orthology, and now it gets so messy. And so, even as someone who's taking this course, you may think: okay, what's going on here? I have no idea what happened in the history of this gene. And that's just a small part. And that is because your pairwise orthology, even if you infer it correctly, doesn't really give you the greater context; it doesn't give you one scenario that captures all of that. So what do we do with the hierarchical orthologous groups? Well, okay, what we could do, if we take for instance these strict orthologs I mentioned: maybe you take clusters of one-to-one orthologs. That simplifies the picture a lot. You can at least do a group with these three sequences and another group with these two. Okay, let's do these two groups; that's nice, it's much simpler, but you lose a lot of information. All of these genes, all of these misfits, they're not part of this: simple, but too incomplete. So what do we do with the HOGs? What we do is we go back to key ancestors.
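As a side note to the strict one-to-one clusters just described, here is a minimal Python sketch of why pairwise orthology cannot be packed into simple groups. It reuses the frog and gorilla genes from the earlier non-transitivity example. The gene labels, the function names `transitivity_violations` and `strict_groups`, and the brute-force search are all assumptions made for illustration; this is not the method used by any particular resource.

```python
from itertools import combinations

# Pairwise orthology calls from the frog/gorilla example (labels are illustrative).
orthologs = {
    frozenset(("frog2", "gorilla2")),
    frozenset(("gorilla2", "frog1")),
}
genes = sorted({gene for pair in orthologs for gene in pair})


def is_orthologous(a, b):
    """True if the unordered pair (a, b) was called orthologous."""
    return frozenset((a, b)) in orthologs


def transitivity_violations(genes, is_orth):
    """Return triples (a, b, c) where a~b and b~c hold but a~c does not."""
    bad = []
    for b in genes:
        partners = [g for g in genes if g != b and is_orth(g, b)]
        for a, c in combinations(partners, 2):
            if not is_orth(a, c):
                bad.append((a, b, c))
    return bad


def strict_groups(genes, is_orth):
    """Maximal sets in which every pair is orthologous.

    Brute force over all subsets, so only suitable for tiny examples.
    """
    groups = []
    for size in range(len(genes), 0, -1):
        for combo in combinations(genes, size):
            all_pairs_ok = all(is_orth(a, b) for a, b in combinations(combo, 2))
            if all_pairs_ok and not any(set(combo) <= g for g in groups):
                groups.append(set(combo))
    return groups


violations = transitivity_violations(genes, is_orthologous)
groups = strict_groups(genes, is_orthologous)
print(violations)
print(groups)
```

In this toy case the only violation is the frog1, gorilla2, frog2 triple, and gorilla2 ends up in two different strict groups, so no single partition of the three genes captures both orthologous pairs without also putting the two frog paralogs together.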
We're going to go back now, and it's nice because it ties our species tree with the gene tree. So back to the species tree here, where we have the ancestral primate, then the marmoset on one side, then the simians, and here I don't even try to resolve that node, so this is not a binary tree; here we have a polytomy. But in any case, let's try to have a model for the ancestral simian. Maybe through our pairwise comparisons we can infer that there were really three genes already in the ancestral simian. So there are three groups, and we can assign the genes to these groups: all of the orange genes here have descended from that ancestral copy, the green ones have descended from that green copy, and these three blue ones have descended from that blue copy. And so in a way human is a very good representative of a simian, because it's just kept these three ancestral genes. But the baboon, this is why things were confusing, because it's lost one copy and you had further duplication of another copy. Viewed in this way, it's much simpler to make sense of what has happened here, if we think in terms of this ancestral simian. But you can see that we can now do the same exercise at the level of the ancestral primate. Yeah, I'll just note that you can also then compare the events that happened on each branch. And at the level of the ancestral primate, maybe through our pairwise analysis, we may infer that we have one group, so everything descended from one gene. So really, in terms of the coloring, everything at that level is inside the same group. And so you can see how we now have nested groups that are going to correspond to one ancestral gene in each of these key ancestors. And we can play with the resolution: if we want to relate a marmoset gene with a human gene, we have to think of it in terms of, you know, all of these genes have descended from
one ancestral gene. For instance, it's not helpful for the chimp: even though the marmoset and the chimp both have two copies, since they are the results of different duplications, let's not try to ask which of the pairs is better, should we relate one five one or two to three nine three six zero or two four one three five six. Well, they are both orthologous to one another, and we can see they all descended from the ancestral primate. And to the extent that there's a difference between these two genes, this is the result of changes that happened on that branch, which is quite distinct, unless you have some convergence, which is typically not what's happening, from, let's say, any changes that have accumulated between these two genes. So you're not going to learn much, and so it makes sense at that level to lump everything in one group. Okay, so I think I did want to cover this. I'm not telling you how we infer these hierarchical orthologous groups; you can just take them at face value in resources. Sorry, I would have spent some time on this, but I do want to plug our method, the OMA Browser. Particularly if you're not coming to the practical this afternoon, you can use this: we have a lot of material online, tutorials and some more videos and some more explanation, in case you want to try to use this and you're not part of the practical this afternoon. And we will make the slides available. Even if you can't take part in the practical this afternoon, if you're interested in doing some of the exercises on your own, get in touch with me and I will try to give you access to some of these exercises and tutorials that could be done in an autonomous way. I'm going to still stay for a few more minutes to take on some questions, but since we are at the end of the session, I do want to first thank you all for participating, for asking
lots of questions along the way. It's really fun to deliver a lecture when there is a bit of interaction, and we know it's a challenge when we're doing these things remotely, but I think it's been as good as I could have hoped for in terms of getting some of your inputs, so thank you very much. Keep in mind this is only day one of three days. Tomorrow you're going to revisit some of these concepts with Rob Waterhouse, who's going to tell you also a little bit more about another resource, and who will also provide a broader perspective about gene evolution and how to relate it across multiple species. And on the third day my colleague Marc Robinson-Rechavi is going to tell you about how function evolves and about selection across this data. Some of the concepts that you've seen today are going to be revisited by my colleagues, and so even if there are some things that are unclear in this first exposition, you will have a chance to consolidate this knowledge. So I'd like to thank you all at this point, and I'm still staying for a bit longer to take on questions, of course with the knowledge that, for those of you who are taking part in the practicals this afternoon, this is starting at one o'clock sharp, so we all also need a bit of time for lunch. But I'm going to take some questions now for a bit of time if you have any. And again, thank you all for your attention; feel free to leave if you don't have any questions and you've got other plans.
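For those working through the material on their own, the species overlap criterion from earlier in the lecture (tag a node as a duplication when its subtrees share at least one species, otherwise as a speciation) can be sketched in Python as follows. The `Node` class, the function name `label_by_species_overlap`, and the toy frog/human/gorilla gene tree are assumptions made for illustration. The sketch also assumes the gene tree is already correctly rooted, which, as discussed, the method depends on.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str = ""
    species: Optional[str] = None          # set on leaves only
    children: List["Node"] = field(default_factory=list)
    event: Optional[str] = None            # filled in for internal nodes


def label_by_species_overlap(node: Node) -> Set[str]:
    """Tag internal nodes and return the set of species found under `node`."""
    if not node.children:
        return {node.species}
    seen: Set[str] = set()
    overlap = False
    for child in node.children:
        child_species = label_by_species_overlap(child)
        if seen & child_species:           # same species on both sides
            overlap = True
        seen |= child_species
    # No overlap is read as speciation; a duplication followed by
    # complementary losses would be missed, which is where reconciliation helps.
    node.event = "duplication" if overlap else "speciation"
    return seen


def leaf(name: str, species: str) -> Node:
    return Node(name=name, species=species)


# Hypothetical rooted gene tree: an old duplication, then speciations in each copy.
apes1 = Node("apes1", children=[leaf("human1", "human"), leaf("gorilla1", "gorilla")])
apes2 = Node("apes2", children=[leaf("human2", "human"), leaf("gorilla2", "gorilla")])
copy1 = Node("copy1", children=[leaf("frog1", "frog"), apes1])
copy2 = Node("copy2", children=[leaf("frog2", "frog"), apes2])
root = Node("root", children=[copy1, copy2])

all_species = label_by_species_overlap(root)
for n in (root, copy1, copy2, apes1, apes2):
    print(n.name, n.event)
```

Here only the root is tagged as a duplication, because frog, human and gorilla appear on both sides of it; the four nodes below it have no species in common between their subtrees and are tagged as speciations.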