 Don't hesitate to jump in with your questions. So the first part of today, we're going to revisit a few of the topics that Kristoff briefly touched on yesterday, focusing here on the concepts and methods in orthology, and less on the sort of phylogenetics and tree theory that Kristoff concentrated on yesterday. Then we'll take a short break. Just like yesterday, we'll have the breakout rooms, which are totally optional, of course. You can go and get a coffee or you can stay in a breakout room and say hello to whoever else is joining this morning. Then in the second session, I'm going to switch a bit and talk about how we use benchmarking universal single copy orthologs to assess the completeness of genomic data. So walking you through a little bit of the behind the scenes of what BUSCO does and how it does it. And then finally, at the end, instead of talking about concepts and methods, I want to just present a few concrete examples of a question in comparative genomics, the approach that we took to try to address that question, and the answer that we were able to reach. So a bit more of a sort of cookbook recipe style with examples of using orthology data to answer questions in comparative genomics. So for the first session, we're going to focus on concepts and methods in orthology. And hopefully by the end of this class, this lecture, you will have a better answer to each of these three questions. Firstly, what do we mean by orthology? Then a little bit on the methodology behind how we actually go about delineating orthologs. And finally, why do we do this at all? Why do we need orthology in our lives? So before I focus on orthology itself, I think it's worth taking a step back and thinking about orthology as a sort of child term of the parent term homology, where orthology is a subclass of homology. So we need to first talk about what we mean by homology before getting into the details of what we mean by orthology. In essence, homology is simply the designation of a relationship of common descent between any entities. So there is no further explicit specification of any evolutionary scenario. It simply posits that any entities have evolved from a common ancestor. So they share a common ancestry. And I put this review at the bottom here of the slide by Cunin. Even though it's from 2005, it's in annual reviews and genetics. And he very elegantly summarizes a lot of the concepts that we are going to be talking about today, such as the relationships between homology and orthology and paralogy and the use of orthology in evolutionary genomics. So orthology then takes it further and specifies that this common ancestry for genes in particular mean that the genes originate from a single ancestral gene in the last common ancestor of the species that are being compared. So you can see how the more general definition of homology, which simply posits common ancestry, now becomes a much more specific definition when we're talking about orthology because we introduce the evolutionary scenario of the last common ancestor of the species that we are comparing. Now, paralogy is another child term of homology where genes are simply related by gene duplication rather than by speciation. So paralogs are genes in the same genome of a given species and have originated by duplication. Here we don't yet specify where or when this duplication occurred, simply that the two genes currently reside in the same genome. So putting them all together, we have the general term of homology that simply posits a common ancestry and the two child terms of orthologs and paralogs where orthologs arise through speciation events and paralogs arise through gene duplication events. Now in the world of genomics, of course, we are focusing on the use of sequence homology to define orthologs and paralogs in the genomes of species that we are sequencing today. So we need to first think a little bit about what we mean by sequence homology. So homology at the sequence level, whether it be between protein or DNA sequences, is typically inferred from their sequence similarity. So here we're just looking at protein sequence alignment. And even if you've never really looked at alignments before, you can see that there are a number of residues that are identical or similar between these two sequences. So what sequence homology search tools, such as the all famous blast, are trying to achieve is simply to detect an excess level of similarity between two sequences, so greater similarity or greater identity than expected by chance. So basically a statistically significant sequence similarity. Where we have to be careful here is when thinking about sequence homology, is this link between what I was just talking about here, the statistically significant sequence similarity and the definition of homology, because it is often misunderstood. And Bill Pearson's introduction to sequence similarity homology searching sets us out quite well. Just to be clear, a pair of sequences just as we saw on the previous slide can have high or low sequence similarity. But this does not translate to strong or weak homology. So even though you might see in scientific papers, even this phrase referring to protein in the mouse has strong homology to protein in the human, this is technically incorrect language. Homology is simply the conclusion. Given a particular level of similarity, the sequences are likely to have risen from a common ancestor. Hence the expectation or E value associated with a blast sequence search or other sequence searches. So this is something to bear in mind when thinking about what we mean by homology and the distinction with sequence similarity or sequence identity itself. And just to point that out that it's still worth mentioning this in 2020. So if you search for strong homology in PubMed today, you will see that the misuse of this term has been growing over the decades until I guess people like Bill Pearson pointed out this misunderstanding. And slowly but surely this incorrect use of this term has been decreasing. So I think in 2020 we have about 10 so far publications with that exact phrase in. So if you do see it, you are not alone. But hopefully by the end of today you'll have a better understanding of this distinction and hopefully avoid using it in your own texts. So we've talked about the general term of homology and the child terms of orthology and parology. Now where do they come from? The term homologue itself was introduced by Richard Owen back in 1843 long before sequences, long before DNA, long before evolution really to designate simply the same organ in different animals under a variety of form and function. So already recognizing that although we're different species, living in different niches, there are commonalities that would suggest that they are the equivalent organ in different animals. So although we might think of this concept as being extremely useful to think of when trying to understand evolution, Darwin himself actually never used the term homology. But very quickly after the publication of the origin of species, in fact Huxley when he reviewed Darwin's work invoked homology as evidence of evolution. For us it might be very clear that having the same organ in different animals under a variety of form and function and starting to think about how that could have happened, it's obvious at least now that there is some common ancestry and that they have descended from some common entity in a last common ancestor. But this wasn't obviously that obvious in the late 1800s. So that just gives you a little bit of history about where some of these terms have come from. A century later Walter Fitch actually now in this sort of classic paper actually begins to really define the terms of orthologs and paralogs. Now of course in the sequence era because we have protein sequences and we can really see through sequence alignments how the relationships between different sequences match up to our understanding of the different relationships between the animals that they come from. So the distinction was first introduced by Walter Fitch in 1970. So enough of the history lessons, now I just want to walk through a very simple scenario using a tree-like form which you got to grips with Christoph yesterday to describe the most simplest inheritance from a last common ancestor. So here we have six species, frog, duck, rat, mouse, chimp and human and the last common ancestor of all six species. Now under the most simple evolutionary scenario the only thing that we have happening are the speciation events at each node here on the tree resulting in two descendant lineages giving rise to a single copy ortholog in each of the six species that we are comparing since their last common ancestor. Now of course evolution isn't always that simple. You don't just have speciation events, we also have for example gene loss events. So this is a really simple one to understand and to illustrate. This gene has simply been lost somewhere in the history of the duck lineage and is therefore missing and so we have still a relatively simple evolutionary scenario just by introducing a single gene loss event where the rest of the genes in each of the other five species remain as single copy orthologs. Now of course the other thing that can happen along the way as Chris described yesterday are of course gene duplication events and here we've just added a copy in the human lineage to create human gene one and human gene two. So now we have single copy orthologs in those four species and a pair of paralogs in the human genome. Then of course not all gene gains necessarily are young and lineage specific we can have older gene gains. So here in the common ancestor of both rat and mouse a gene duplication event created a second copy in the common ancestor. This results in a pair of rat paralogs gene one and two and a pair of mouse paralogs gene one and two. All together these genes still form an orthologous group all descended from the single gene in the last common ancestor of these six species. Now at least one other evolutionary scenario that we should consider is of course sequence evolution. The other scenario that we were looking at earlier losses and gains was just considering copy number evolution but of course sequence evolution is inherent as well. So imagine that for here we had this duplication in the ancestor of rat and mouse that was followed by fast sequence divergence and here in this tree like form we normally represent that with longer branch lengths. So you can begin to see how our relatively simple scenario that started off just with speciation events can start to get more and more complex as we add more and more realistic evolutionary events such as losses duplications and rapid sequence evolution. So all together these genes form an orthologous group consisting of the rat paralogs, the mouse paralogs, the human paralogs, the single chimp gene and the single frog gene because they're all descended from this last common ancestor of the six species that we are comparing. So here's a different representation of the same sort of concept whereby we have four species A, B, C and D each with different numbers of genes in them and depicted in a tree like fashion the evolutionary scenario from the last common ancestor of A and B to the last common ancestor of all four species and the last common ancestor of species C and D and I think this helps to illustrate why the definition of where your last common ancestor is is important. So from the perspective of last common ancestor one where we only have two species we have an orthologous group consisting of gene A1, B1 and B2 all the genes that descended from that ancestral gene in last common ancestor one including this duplication event. Now if we move to last common ancestor two so the root of all four species again we include all the genes descended from that single gene in the last common ancestor A1, B1, B2, C2, C3, C1 is missing because of a loss event D1, D2 and D3. So all of these genes would cluster together in a single orthologous group taking into account the two old duplication events that happened in this lineage and the one young duplication event that happened in this species. Now this is where it perhaps becomes more interesting for illustrative purposes if we consider the last common ancestor three here which is simply the ancestor of species C and species D. Now hopefully you can see that in their last common ancestor there were three genes because these two duplications happened prior to this speciation event. So we have three distinct copies of this gene in last common ancestor three. So if we're delineating orthology just between species C and species D we now have a much more fine-scaled resolution where we have one group consisting of C2, D2 another group consisting of C3, D3 and a singleton D1 because of the loss in species C of C1. So broadly speaking homology is simply the task of recognizing similarities and using that as an evidence of shared ancestry. When we refine that and introduce the concept of the last common ancestor and talk about genes in particular we can define orthology or orthologs as arising by vertical descent from a single gene in the last common ancestor of the species under consideration. This inherently means that orthology is a hierarchical concept as I hope I was able to demonstrate with this figure and that it is relative to the species radiation under consideration. So you're going to get much more fine-scaled resolution if you don't go back so far in time and you're going to get much broader scale larger orthologous groups consisting of all genes descended from that last gene in the common ancestor when you go further back in evolutionary time. And this gives rise to the concept of our orthologous groups that try to capture all of the genes that have descended from a single gene in the last common ancestor. So hopefully that has helped set the groundwork if you like for some of the concepts in orthology and just to check that you're paying attention. I would like you to please quickly answer this quiz question. And don't worry the link is going to appear in the chat now. Oh, there it is. Okay, excellent. So you don't have to type out that horrible URL yourselves. Given what we've just been through in the last 25 minutes or so, which description best describes your understanding of orthology? Orthologs are genes in different species that have evolved from an ancestral gene without duplications or losses that perform the same specific biological function that evolved from a single gene in the last common ancestor that have the highest significant sequence homology or that produce a gene tree that matches the species phylogeny. And hopefully I can switch to the results and see if any of you have been listening. Okay, 46 responses. That's not bad. Out of 72, 48. Okay. So I think I may have mentioned that orthologs are genes descended from a single gene in the last common ancestor a few times in the last 25 minutes. And it definitely sunk in with 90% of you for the time being at least reaching the correct conclusion. Okay. So now we go back to PowerPoint. There we go. Are there any questions on the first part? Talking about definitions and concepts and history. I don't see yet any in the quiz. And I see one. How is speciation defined? Yes, tough question. We could certainly talk about that for a very long time. So technically speciation occurs when two populations are no longer reproductively fertile. That's one working definition if you like. So over evolutionary time populations may be separated either geographically or genetically and eventually become so separated that they are no longer able to produce fertile offspring. And therefore you have two new species emerging from the ancestral species. I think we'll probably leave it at that for this because it is a rather tough question. Okay. We have another question coming in. How is the species tree decided? Again, another tough question. So in the examples that I gave it was fairly, how should we say, non-controversial because we're looking at frog, duck, rat, mouse, chimp and human. Now if you were to add a bat in there for example or a horse, some of these nodes are a little bit contentious and jump around a bit. Essentially, one of the uses of orthologs is to collect enough common markers across the species of interest to build a sequence alignment using as many single copy orthologs as possible and to use all the changes that have accumulated in all the species to try to infer the most robust species phylogeny. And we'll probably, yes, we will touch on that in the second part this morning. Some paralogs are useful. Why did you decide to with mainly orthologs? There must be some good paralogs as well. Okay. So I was talking about for inferring the species phylogeny. Generally, we try to stick with single copy orthologs because in fact, if I go back to the very first slide, what I was talking about here is this super simple evolutionary scenario where only speciation events have occurred. So we haven't had any duplications. We haven't had any losses and we haven't had any rapid sequence evolution. Therefore, these kinds of orthologs are going to be the most useful to determine the species phylogeny because their evolutionary history exactly matches the species evolutionary history. Yes, there are cases of course where paralogs and losses and where those duplications and losses have occurred that could help you to infer the species phylogeny. But that would use slightly different methods than what I was talking about in terms of sequence alignment of universal single copy orthologs to reach a best estimate of the species phylogeny. Okay, good. So now I will move on to a little bit more of the methodology rather than the concepts and focusing now on how we actually go about delineating orthology. And just to advertise Kristoff's book chapter here, which gives a very good overview and detailed discussion with lots of examples of different types of methods and approaches for inferring orthology and parology, which we will not be able to cover in detail today, but I just want to give you a flavor of what's out there. So overall, they can be separated into graph-based approaches and tree-based approaches. So what do we mean by that? First of all, this is just to illustrate to you that there are lots of different methods, both for tree-based approaches and for graph-based approaches. And I think that a lot of people can become rather overwhelmed by the choice of methods that are available and which one to go for for their particular dataset for their particular question. And hopefully we can discuss a little bit about this further. But here, just broadly speaking, to try to illustrate the difference conceptually between the tree-based approaches and the graph-based approaches. Actually here, I think that this illustration shows how they are really aiming for a very similar outcome just using slightly different approaches. So here we have the same tree that we saw earlier. And basically you're using sequence alignments to build gene trees to infer the orthologous relationships between the genes or proteins in the species of interest or the species that you've sequenced and annotated. And the graph-based approach is similar except that instead of trying to infer a tree-like relationship between all the proteins in your extant species, instead we start off with an all against all pair-wise distance matrix between all the genes in all the extant species. And then we try to analyze this graph to determine the last common ancestor and all the orthologs descended from that last common ancestor. So which approaches are best? I think that would probably be a three-day conference in and of itself to try to fully answer a question of which is the best orthology delineation approach. Lots of benchmarking has gone in to try to assess the performance of different methods, either using reference sets for benchmarking or different metrics that compare the performance in terms of recovery of false positives or true positives based on some sort of golden standard. So here the golden standard of reference gene trees or curated protein families can be used. And for now I think I will direct you to the Quest of Orthologs website if you want to learn a bit more about some of the different methods that are out there and how they perform in these different benchmarking studies. Okay, so a little question here. Is it possible to convert tree-based approach to graph-based approach? So I'm not quite sure what you mean by convert. I think so if we go back here I try to illustrate that conceptually they are trying to achieve very similar things. And yes of course if you have all the pair-wise distances between the proteins of interest you could imagine inferring a tree from that. But the graph-based approaches are trying to avoid generally the computationally expensive task of trying to compute this tree in the first place. So there really are two diverging strategies to try to tackle the same kind of question. And one of the advantages of graph-based approaches is that you don't have to perform all the multiple sequence alignments which you would then need to infer the trees which you use to then infer the orthologous and parologous relationships. So here we just have six species. You can imagine that's a fairly tractable computational problem. But imagine you have all the thousands of bacteria in CBI and you're trying to create multiple sequence alignment and use that to infer orthologous and parologous relationships. Then suddenly the computational task becomes a lot more challenging and perhaps you might opt for a graph-based approach where the alignments are all pair-wise. So you still have to make a lot of alignments but overall the task is a lot less computationally demanding. Okay, so we're not going to discuss all the pros and cons of tree-based versus a graph-based approaches in the time that we have today. But I point you to some of the resources out there to learn a bit more about that. So because I have most experience in my postdocs working on the orthoDB hierarchical catalog of orthologs, I would like to spend a few minutes just discussing how orthoDB goes about delineating orthologs. And this is an example, if you like, of a graph-based approach. So the first step is to select your input data and just like any orthology resource, you first need the proteins from all your species of interest. And we also collect as much functional information as we possibly can about those proteins. So usually those are coming from model organisms where most experimental work has been carried out. Because as we'll see later, one of the main uses of orthology data is to be able to make hypotheses on gene function from genes in newly sequenced species based on their orthology to genes in species where a lot more experimental work has gone on to try to understand the functions of genes. The first step then is to select the longest protein coding transcript from any genes with alternative transcripts because we're performing a gene-based orthology delineation. Then for each species, we first filter their sets of proteins to remove highly identical sequences. So here we have three yellow ones and four red ones that are highly identical. And we just select one protein sequence representative to go through as input for clustering. Then we perform all against all Smith-Waterman pairwise alignment, in our case using a software called Swipe. This is to begin to build the graph that I was talking about earlier. So basically we have all the proteins from species A and we compare them with all the proteins in species B to find what we call best reciprocal hits. So the best match from A to B is reciprocally the best match from B to A. Now in some cases, of course, we can have more than one match, but what we're looking for is this reciprocal nature of the best hit. So the best hit from A to B is mirrored by the best hit from B to A to form what we call this BRH or best reciprocal hit. Now you can imagine that when you have a lot of proteins coming from the whole genome themselves being homologous. So lots of kinases, lots of proteases, lots of large gene families, zinc finger proteins, etc. This is when the best reciprocal hit can break down, where the best hit from species A to species B is not matched reciprocally from species B to species A. So we start with the best reciprocal hits and here it just shows why we started by removing the very highly similar proteins to start with. So imagine we have two homologs in species A which are distinguishable enough, so less than 97% identical. From the perspective of species B, we can find a clear winner and therefore we can define a best reciprocal hit. Whereas if we have two proteins in species A that are highly identical, so 99% identical, from the perspective of species B they would score basically the same and therefore you could lose the possibility of finding the best reciprocal hit. So that's simply a technical reason why we remove them to start off with. Then clustering proceeds from the best scoring best reciprocal hits between pairs of species, so here A and B. Moving down the list to lower and lower scores. Forming triangles with certain cutoffs, so with a species C is a best reciprocal hit with species A here and then we ask the question, does that close the triangle between C and B? Yes it does. This begins to form the core of our orthologous group. As we move down the list of best reciprocal hits, the distances if you like start to increase but we're still requiring a reciprocal hit between each species here A and D and then again asking does it close the triangle with another species that is already a member of the cluster? If so then it can join the orthologous group. We keep growing the orthologous group adding more and more distantly related species so long as they form triangles that close the best reciprocal hit triangles and allow the new genes to enter the orthologous group. Now those which don't form triangles but which themselves do form best reciprocal hits are also allowed to join the orthologous groups but we have a slightly more stringent cutoff. So here we have a protein from species G which forms a best reciprocal hit with species F but with none of the other species in other proteins in the orthologous group. We apply a more stringent cutoff and we allow it to join the group. Just very briefly on this alignment overlap requirement. So in a pairwise alignment you can have the full sequence being alignable and therefore homologous between species A and species B then when you add species C what we're asking for is that there is a commonality in this alignment region so this green bit here between A and B and between B and C and between C and A so we are comparing like for like regions instead of the situation here where between A, B and C A and B match over here B and C match over here and A and C match over here and there is no overlap between their region. So finally we have to think about how we also consider what we call in-parologous groups so these are within species homologs that form distinct clusters during the clustering procedure. So here we have essentially the same group that we were building earlier and we have a trio out here with species A, B and C that is nevertheless homologous to the proteins in this major cluster. So these black lines are simply indicating that they are within species homologous homologs so A is homologous to A prime, B is homologous to B prime and C is homologous to C prime. This is conceptually slightly hard to grasp so we resort to our tree-like visualization and this is essentially the scenario that we're talking about here. So from the last common ancestor of species A to G what we're talking about here with this A prime, B prime and C prime is essentially a duplication that occurred here at the last common ancestor of species A, B and C to produce this in-parologous group here and here which from the perspective of the last common ancestor of all of these species should be included and should join this orthologous group because they are all descended from the last common ancestor gene. Finally we also add in-parologs so these are singletons from the same species that didn't join any other cluster and the very last step is going back to those highly identical protein sequences that we excluded in the first step and of course adding them back to form the final group. So we have a quick question here from can you quickly clarify does within species homology equal parology in-parology? Yes again from the perspective of the last common ancestor so if I just go back to this tree where is that tree? Okay so if we were only looking at species A, B and C then we would form one orthologous group with the primes A prime, B prime, C prime and one orthologous group with ABC but because we've gone further back in evolutionary time all the way to E, F and G these are now parologs in species A, B and C and should be considered a part of this orthologous group. Hopefully that clarifies the situation and actually it's a good question because it really highlights the perspective of which last common ancestor you are considering and how that affects the resolution of the orthologous groups that you're going to obtain. Okay so quickly I think I'll skip that because it's a bit complicated. Okay and finish here with a real world example of the kind of graft-based approach. So here we have I think this is in vertebrates yes okay and these are all the distances if you like the similarities between two groups of proteins. Pop three up here and pop two down here. What you can see is that within each orthologous group there is a very high level of connectivity so lots of best reciprocal hits connecting all the proteins in pop three group and pop two group. However because of missing genes in four species in pop two and in several species in pop three what we find is that the best reciprocal hits start to cross between pop two and pop three. So this is just a nice example of how messy real data can actually be when trying to use a graft-based approach to delineate orthologs. Okay so that's the methodological considerations if you like and so I would ask you now to fill out the second little quiz which you should hopefully receive the link for in the chat and I see there's a question from about the resolution. So we have quiz B okay so the link now for the second quiz is in the chat and let's see this question here why are you reducing this to 97% resolution whereas in BLAST the higher the similarity the better. Okay I'll try to interpret that question. I think for you're referring to the 97% filtering that we did before starting to build the graphs of pairwise relationships between all the proteins so I had to I went through that a little bit fast perhaps basically highly closely highly similar proteins from the same genome mean that when you compare proteins from different species you could get almost exactly the same BLAST E-value or exactly the same alignment bit score to those two highly similar proteins in species B therefore we choose just one representative rather than including both of those in the all against all search to improve the recall of best reciprocal hits then at the very last step of the clustering if the representative protein made it into an orthologous group then it's highly similar 97% identical. Parallogue will join that orthologous group so this is a a technical specification and obviously one might alter these parameters depending on whether you are looking at species that are very closely related evolutionarily speaking or if you're going much further back in time but this is just the default we have another question here on how to add gene expression distance to the sequence distance in the overall BRHs okay I'm not sure so here we're not talking about expression at all but it could because we're talking about trying to delineate the evolutionary relationships between proteins in extant species but it is possibly something we might talk about later and okay one final question is there a way to calculate the likelihood that one ortholog can be missed in one species clade due to real loss or bad genome annotation so that's excellent question because it is exactly what we will come on to in the second lecture this morning and of course the exercises this afternoon so I think here I will switch to see the results of the poll quickly and then we will take our first break for the morning so I just need to get quiz b up and running okay 41 responses what was the question which description best describes your understanding of how ortho db delineates orthology I need to click responses okay so you definitely got it best reciprocal hits determine how genes are progressively added to form orthologous groups okay so we are now one hour and four minutes into this morning session so it's probably a good time to take a pause yesterday we launched some breakout rooms and some of you joined them the idea of these breakout rooms is simply to say hello to other people who are joined for this morning but you don't have to and you're more than welcome to go and grab a coffee take a short break and I will aim to be back to start let's say at five past 10 so according to my clock that is in 13 minutes time so take a break grab a coffee say hello on the breakout room and we will be back to the second part at five past 10 see you later and before we do so there's a question about fast evolving sequences so if we fail to find a good best reciprocal hit are these going into outgroups in the graph or are they basically simply excluded from the orthology analysis would you be able to account for local sentinine in these cases as some kinds of proteins for example zinc fingers often emerge in cysts nearby in the genome so there are some orthology delineation techniques that do take sentinine into account they can be useful for resolving some of these tricky situations if you like and yes fast sequence evolution after gene duplication will often result in those descendants being called as a completely separate distinct orthologous group so when you find orthologous orthologs that appear to be restricted to a specific lineage so for example an ortholog that is only found in rodents for example there are two evolutionary scenarios here that we can think of one would be that it is a de novo gene that was born in the ancestral of rodents and therefore really does not have any orthologs in other mammals the other would be what we were just talking about so a duplication in that ancestral lineage that was followed by rapid sequence evolution leading to it no longer being able to be recognized as an orthologue and therefore it's being called as an orthologous group that appears specific to the rodents if you think about it in terms of a tree you can imagine that after the duplication you would then have a very very long branch length because of the relaxed constraint on sequence evolution of the descendants of those of that duplication and depending on where you cut the tree you would either be able to include those genes as orthologs or you would indeed exclude them from the orthologous group and they would form their own independent orthologous group so there is no golden solution if you like for that kind of scenario but it certainly does happen so that's a very valid question okay so now I will try to share my screen for the second part of this morning's presentation there we go okay you should be able to see my screen now and for those of you who may be just joining now again feel free to say hello in the chat ask your questions in the google doc and and the zoom chat as we progress through the lectures and I will try to answer as many as I can as we go along and the rest we'll try to provide answers for you in the google doc especially if they're a bit more complicated to respond to immediately right just my laser pointer which is here there we go okay so in this second session now we are going to switch from purely talking about orthologs and paralogs to actually using orthology data to assess the quality of genomic data with the tool called busco the benchmarking universal single copy orthologs and hopefully at the end of this session you will be able to answer the following three questions so a little bit about what busco actually is what it's attempting to do how it how it goes about its business if you like and finally why we need it which from the questions that we see in the chat already it's it's it's quite clear how these universal single copy orthologs can be used in different scenarios for assessing quality as well as other interesting evolutionary questions in comparative genomics so even though you're all on mute I can I can sort of hear you groaning when you see this slide again of the fall in cost of sequencing over the years at least this one is updated so it goes to July 2020 but it's really to illustrate the point that over the last few decades we've gone from recognizing what DNA is and being able to see it and look at chromosomes to being able to sequence it and then to be able to sequence it a lot so we're now reaching this what we call a nucleotide resolution of genetic and genomic data and there's a huge flood of new sequence data arriving every single day in terms of arthropods which of course my favorite animal group here we're just illustrating the increase in the number of fully sequenced genomes over the years that we have been including in our orthodB catalog of orthologs with the latest version orthodB 10 now consisting of 170 arthropods and 1200 eukaryotes so as I alluded to earlier the tree-based approaches where you are required to align all the sequences in order to infer the tree to then infer orthology and parology relationships become computationally very very expensive when you're talking about thousands of species so in the world of arthropods here a more up-to-date view of the genomes that are currently available and the different orders for which we have samples so you see across the whole of arthropoda we have more than 500 assemblies for different species and several of those species have been done more than once so there are different versions of those assemblies this lighter purple bar at the top just represents a re-sequencing or reassembly of species for which we already have at least one genome so the amount of data is increasing every single day and here I choose to show the example of the i5k initiative so sequencing of 5 000 arthropod genomes here we're just showing the number of predicted genes in each of these species versus the assembly size for previously sequenced arthropods and arthropods that were sequenced as part of the i5k pilot project and what i'm just trying to demonstrate here is that over the years now we are more and more able to sequence and annotate larger and larger genomes so a lot of the species that were some of the first to have their genome sequenced and annotated were specifically selected because they had very small genomes so they were much more tractable when the cost of sequencing was much much higher nowadays we can move into much larger genomes a much more diverse set of species to sample the genomic diversity that is out there but this of course raises the important question how can we gauge the quality of all of these resources that we are generating which is a very important question when it comes to downstream analysis in comparative genomics as someone alluded to in the chat how do we know if a gene is lost because of a real evolutionary event or it was simply missed due to technical problems with the genome sequencing genome assembly or the genome annotation steps and so the first thing or the first question that you might ask to try to gauge the quality of these genomic resources is simply does the assembly size match the expected genome size so we're not going to go into details of how one might achieve that but it's certainly a very important question to ask when dealing with newly sequenced genomes now the second question is hopefully familiar to some of you but it's more of a technical definition so we're asking the question how fragmented is this genome assembly the ideal genome assembly will come in as many parts as there are haploid chromosomes for that given species but that's not the reality genome sequencing is difficult and assembly is hard and so what we end up with quite often are contigs and scaffolds that for which we don't necessarily know the complete order and orientation according to the carrier type or the chromosome structure for the given species and so here we'll often hear spoken of this term called the n50 so the contig or scaffold n50 size and this is a measurement of the level of fragmentation of your assembly simply specifying that half of the assembly is found on contigs or scaffolds of length n50 or greater so it gives you an idea of how contiguous your genome assembly is now a third question that often gets overlooked is how gappy is your assembly so if you have all of these small little red chunks here and you just simply glue them together with ends to denote unsequenced or unknown regions of the genome you will start to improve your n50 value because more of the genome will be contained in longer contiguous scaffolds but because of introducing the ends your assembly remains still somewhat gappy with missing information a fourth question that you can ask is simply does the assembly contain all of the genes that it is expected to one way of trying to answer that question is to sequence the transcriptomes from multiple live stages or different conditions and then try to map those back to the assembly so you're asking the question how many of the expressed genes in my species can I find when I search the genome assembly and if you can't find all of them then your genome assembly is likely missing some of the genes that should be present in your species so that's one way but of course that requires you to do transcriptomics to get the samples and it's very species specific so you'd have to do it for every single species that you look at so an alternative is to simply ask the question how many of the expected genes are actually present in this assembly and this is where uh bosco comes into play so what do we mean by evolutionarily expected genes here I highlight the blue fractions which are what we're calling widespread orthologs so in this kind of typical orthology plot what we have here on the left are the species and their evolutionary relationships and the bar charts are simply the counts of proteins in each species or group of species and the extent to which we can find orthologs across the full representation of these species so the blue ones here have orthologs in almost all species and as we move to the right they become lineage specific so here we have dipteran specific orthologs lipidopteran specific orthologs coeliopteran specific orthologs hymenopteran specific orthologs etc so these are genes that uh orthology could only be found within that group of organisms and finally on the right we have either genes that exhibit some level of homology to genes in other species but not enough to delineate orthology and species specific or lineage specific genes for which no significant homology could be obtained the evolutionary expected genes are these ones in blue here because they are found in most organisms we expect that when we sequence a new organism from this group we should find these genes so instead of talking about genes to try and get this concept across here we just talk about physical attributes so let's say you've been studying flies for many many years and you are counting the features that you consistently observe in all the species of flies that you look at so you would say they all have six legs they all have two eyes they all have one pair of wings etc therefore if a new fly comes along someone says this is definitely a fly there would be an expectation that this new species should have also six legs two eyes one pair of wings etc so this is simply what we mean with this evolutionary expectation but in terms of genes now instead of um morphological features and I should say at this point that this concept of using um evolutionarily conserved genes to define this expectation is not a new one that we completely invented um when we started to develop busco and many of you may have heard of the previous incarnation if you like uh segmas of the core eukaryotic genes mapping approach which identified about 250 genes present in five eukaryotic species and defined these as universally conserved genes that you should be able to find in any newly sequenced eukaryotic genome so it's busco builds on this conceptual concept and assesses many many more lineages now so I should also thank the the core flab and and Keith Bradman for their support in promoting the use of alternative tools such as busco since segma has been discontinued I also want to take this opportunity to say that you know there's a whole team of people working on the busco project um so Felipe and I worked on the original implementation and then Matthew took over for the subsequent version and now Matthew is working on the latest release and so when you ask for help these are the people that will be providing to you so it's not just me it's a whole team behind the resource okay so now to go back to this concept of evolutionarily expected genes and these widespread blue genes now if we just take a slice and turn it on its side um busco is actually looking for both widespread and unique genes so here we just delineated these genes according to how widespread or how restricted they are but we also need to consider a second dimension which is their uniqueness so how often are they found as single copy orthologs as opposed to multi copy orthologs resulting from gene duplication events so this is a slightly complicated uh graph to understand but basically this is representing these two dimensions that I was talking about so the orange dimension here is duplicability from orthologs that are mostly found as single copy genes in all species examined to those that are found mostly as multi copy genes in all species examined and the second dimension here the green dimension of universality orthologs that are widespread so they are found in almost all the species to those that are very specific or sparsely distributed across the set of species that we are examining and this landscape is uh one built for drosophila melanogaster with orthology from 80 different insect species and delineating uh the distribution of orthologous groups in these two dimensions so hopefully you can already guess where on this orthology landscape uh the benchmarking universal single copy orthologs might sit hopefully you were also looking at this peak in the back corner here um where they are mostly single copy and very widespread so these genes are found in the majority of species as single copy orthologs this gives rise to the evolutionary expectation that for any newly sequenced genome you should be able to locate and find and annotate orthologs of these genes and busco implements uh assessments in three different forms to allow users to assess the completeness of their genome assemblies but also of their annotated gene sets so that's the genes that you annotate in your genome assembly as well as your assembled transcript terms so that could be from a an RNA seek experiment in different life stages or tissues etc and the bonus features are of course that um buscos are also ideal markers for performing um phylogenomics analyses and uh because busco when assessing genome assemblies is trying to predict genes it also um produces valuable parameters and valuable data to use for gene predicted training so for a brief overview what busco is attempting to achieve here is to identify and classify orthologs as either complete duplicated so present in more than one copy fragmented that means an ortholog was identified but it was not uh completely annotated or missing so that's a complete failure to identify the given busco in your genome assembly in your transcriptome or in your annotated set of genes that's the basic principle now the simplest uh type of assessment that one might perform is on your annotated set of genes so you would extract all the protein sequences from uh your annotation and give them to busco to try and identify and classify those genes as complete duplicated fragmented or missing simple steps the next would be um if you were starting off with a transcriptome for example then we first have to introduce um the open reading frame so identifying the putative protein coding regions of your transcriptome and then feed them to busco to do the identification and classification process now the most complicated is when you're trying to assess a genome assembly this is because there are several steps that need to occur before we can start to try to identify and classify the orthologs first step is to try to identify the regions in your genome assembly that might encode a putative ortholog to a busco gene that's done with tblast 10 then to try to predict genes in these regions because remember we're giving it just the raw genome assembly with no annotation information whatsoever and for this uh we're using augustus but i put a little um asterisk here because uh busco version five now is out in beta version and for that uh we've decided to switch from augustus to meta yuk which also removes the time-consuming search step in an effort to try to speed up the process of assessing genome assemblies after predicting the genes then we have a first round of identification and classification to find the complete single copy busco genes these are then used to train augustus to retrain augustus with the species specific parameters derived from the first round complete genes that were identified a second search is then performed using a more extensive um library of sequences and the second round of prediction for all of those that failed to be found as complete in the first round then we have a second round of identification and classification to finally reach our um assessment of complete duplicated fragmented or missing now that's all uh in the papers all in the user guide all fairly um standard in terms of what busco is trying to achieve so here i want to just take a brief moment to explore a little bit behind the scenes of how we actually uh create these busco sets and the work that goes in to uh building all of the uh lineage data sets and libraries so the first step is to select orthologous groups from ortho db at each relevant level so what i mean by relevant level is uh different levels in the tree of life that are present as single copy orthologs in 90 percent of the species this is to ensure that evolutionary expectation that in any newly sequenced uh species these genes should be present and they should be present as single copy orthologs the first step then is to uh perform multiple sequence alignments for each of these orthologous groups with these multiple sequence alignments we then go on to produce hidden markoff model profiles for each alignment we also produce a consensus sequence for each of these alignments so the most common amino acid across the alignment as well as a consensus sequence for each busco uh we develop a variance of this consensus sequence that are used in the second round search to try to identify putative regions in the genome that were missed in the first round search that only used the consensus sequences on top of that we also use these alignments to build the augustus block profiles these are simply um scores according to the alignment that are given to augustus to help augustus identify the most conserved core of these proteins while it is trying to predict genes in the genomes another step that uh is perhaps not so obvious when uh choosing the species that go into each of the lineage datasets here we have uh tree representation of the whole of eukaryotic species in uh orthodibi and here we have uh on the left every single species and on the right a filtered set of species when i say filtered we are trying to select the best representatives from each clade to avoid biasing the alignments with an over representation of many closely related species so here in yellow i'm just highlighting uh mammals so you can see there are a lot of mammals that are sequenced and annotated and have ortho orthologous uh calls in orthodibi but the mammals are all very very closely related as you can see here compared to the diversity of eukaryota so after selecting based on sequence identity of best reciprocal hits we can filter out just six representative mammals to include in the whole uh dataset so here we have a question uh about this how do you control for good genomes going into the busco sets uh chromosome scale genomes so chromosome scale genomes would be one way but if we did that we would be left with very few because there are still not that many chromosome scale genomes available um so there are two considerations for this filtering process one like i said we want to remove too many closely related species because you can imagine if we included all of these species in the before tree the alignment would be very much biased to this region of the tree where there are a lot of closely related species the second criterion of course is the phylogenetic placement so we would like represented representatives from as many different uh lineages as possible so even if the species doesn't score very highly in terms of completeness we may still retain it if it has a phylogenetic position that is really informative from the perspective of building consensus sequences building h m m's and building these block profiles um another way to control for good genomes is to iteratively select different representatives and ask how many orthologous groups does one uh retain so if if you select a poorly sequenced or poorly annotated species as your representative you're going to get a lot fewer buscos out because it is missing uh more of these core universally conserved genes and therefore the filter of being present in 90 percent of these species would result in fewer uh buscos at the end so it's a it's an iterative um curated process if you like thank you for that question uh additional filtering uh continues after that so um we run the h m m profiles against all of the proteins that went into building the alignment in the first place and we we filter the scores and we determine the score and length cutoffs to fine tune them such that the proteins that were used to build the h m m's in the first place are identified as true positives by the h m m's and homologs those that are from other orthologous groups or not from the group in question are correctly identified as not belonging to this orthologous group so again this is an iterative process to define the score and length cutoffs for each busco group so that it accurately identifies its true members and no false positives so we're trying to keep high sensitivity and specificity and finally we also test the profiles against uh non-input species to try to remove those where the augustus block profiles fail but that's more of a technical consideration we have another question if a genome has a high number of repetitive sequences in that case how does busco work during annotation so in theory uh busco is not annotating your genome it is just looking for a given set of conserved genes in your genome when you're annotating your genome normally you would start by masking repetitive sequences so in theory the highly repetitive sequences in your genome assembly should be hidden from your annotation pipeline and hidden from the busco assessment so in theory once you've done the masking the amount or diversity of repetitive sequences in your species shouldn't adversely affect the assessment procedure thank you for the for that question so here's a brief look at the sets available in the busco version 4 and as you can see this is a lot of work so all of the selections of the orthologous groups the building of the hidden markup model profiles the iterative selections and the delineation of the score cutoffs and the length cutoffs takes a lot of work so it doesn't happen overnight and it involves a lot of man hours if you like so briefly to summarize we saw that from your annotated set of genes all we do is use the hidden markup models to identify and classify orthologs from your transcriptome assembly we first find the reading open reading frame and then use the HMMs again whereas from your genome assembly we use the consensus sequences to identify regions we use the block profiles to predict genes with augustus we then use the HMMs to do the classification and identification and in this second round we use the variance of the consensus sequences to perform a much more thorough search of the genome assembly so even if you've never run busco yourselves if you ever do bear in mind that assessing a genome assembly is a lot more complicated task than assessing your annotated set of genes or your transcriptome so here we have a quick quiz just to check that you're all still paying attention which description which should appear in the chat in a moment so that you can get the link which description best describes your understanding of what busco aims to achieve okay it's not appearing yet so I will do that okay so the link is now in the chat for you to be able to answer this very simple question basically is busco attempting to assess the sequencing quality of genomic data such as genomes gene sets and transcriptomes is it trying to identify and score all the highly conserved genes in a newly sequenced and annotated genome or a transcriptome or is it trying to estimate completeness of genomic data such as genomes gene sets and transcriptomes in terms of expected gene content let's see if we're getting some responses yes we are excellent okay this is interesting as the responses come so clearly we have a winner to estimate completeness of genomic data including genomes gene sets and transcriptomes in terms of expected gene content nevertheless a quarter of you still are answered to identify and score all highly conserved genes in a newly sequenced and annotated genome or transcriptome so I purposely uh made the answers options um quite close to each other so busco is not trying to identify absolutely all universally conserved orthologs in your genome it is trying to identify and classify those orthologous genes those universal single copy orthologs that are good markers for completeness so expected gene content is the the real key here so we are trying to estimate completeness in terms of expected gene content and so in this way it's a very complimentary if you like tool or a measure of the quality of your genome so it is not really assessing the sequencing quality while of course the quality of sequencing that went into producing your genome will affect the busco scores that is not what busco itself is aiming to achieve and this is uh important to bear in mind when you're looking at the different metrics for your genome assemblies they are often very complimentary and tell you different things about the quality of your data whether it's the per base quality of the sequencing itself whether it's the n50 that is attempting to describe the contiguity of all the sequences in your assembly or in this case whether it is a measure of the expected gene content have you managed to sequence and identify and annotate all the genes that should be present in your genome of interest and there's a question in the google doc um this one i presume so when evaluating a new genome assembly would you consider superior or more reliable the mapping results of a reasonably good set of transcriptomes or the busco results i.e if i get 90 percent of orthologs from busco but 100 percent of the transcripts from the transcriptome should i be happy or should i suspect something is wrong and the the good old question what do you consider a good busco value excellent questions okay so let's start with the transcriptome first of all um one of the aims that that the busco is really trying to achieve is to provide a score that one can use to compare different assemblies from different species so this transcriptome approach that we mentioned earlier is very good for assessing the completeness of your genome of your species of interest if you have a good set of transcripts from a lot of different tissues and life stages etc you can get a pretty good feeling a pretty good sense of whether or not your assembly has managed to capture all of these genes that you expect because they are being transcribed what it then doesn't do is allow you to say is my genome assembly more or less complete in terms of genes than another genome assembly for a related species so you know across a set of rodents or across a set of beetles how does my assembly compare to others because the transcriptomics that you've done are going to be very specific to your species so you can still get a good sense of completeness but you won't be able to do a relative assessment if you like of your genome assembly compared to others of closely related species so 100 percent of the transcripts from the transcriptome is fantastic um because it means that everything that you've managed to sequence in your transcriptome you can find in your genome and indeed that of course surveys a lot more different types of genes than what we would be surveying when we're just talking about buscos because buscos are really a limited subset of all the genes in the genome based on this evolutionary expectation for them to be there and to be single copy whereas your transcriptomes are trying to assess a much greater diversity of all the genes that are present in your samples um so you should be happy and you shouldn't suspect that something is wrong on the same topic what do you consider a good busco value so here i go back to this idea of um the relative comparison of your species versus other similar species so 90 percent could be good for a lineage where not many other species have yet been sequenced and sampled and therefore we don't have a very accurate estimation of this expected gene content for that given clade right in a clade where we have many many more um species represented then you would probably expect a higher busco score now given that the buscos are selected to be present in single copy in at least 90 percent of the species basically anything above 90 percent is good because that was the bar that was set for selecting them in the first place this bar however allows for this 10 percent of missing genes to be missing because of evolution or to be missing because of technology there's no distinction there we do not know a priori whether or not these genes were missing due to a real evolutionary gene loss in that species or whether they were simply never sequenced never assembled and never annotated because the genomes of the species that we are using to build the buscos data sets are themselves not necessarily 100 complete beautiful chromosomal fully annotated genomes we can only work with the data that we have so hopefully that um answers that question and provides some food for thought for the rest of you as well okay so uh very briefly now in the last 10 minutes or so i would like to switch from what busco is and and how it works to um looking at a few applications in comparative genomics so uh this is just a pretty picture if you like word clouds if you don't i'm sorry of the words most frequently appearing in the abstracts that have been citing busco over the years and so unsurprisingly we have a lot of genes and we have a lot of genomes and a lot of species and assembly and sequence so here in action um this is uh using busco to answer a very simple question about the quality of your data here we have a selection of different insect gene sets ranging in size um in in size by that i mean the number of predicted genes and we're simply showing that here large gene sets so genomes that were predicted to contain many many genes are not necessarily the most complete so just because you have more genes doesn't mean you've got more universal single copy orthologs and a small gene set the the inverse doesn't necessarily mean that you have missed or incompletely sequenced and annotated all the genes in your genome here in contrast we have a another small gene set that scores much much lower in busco completeness so this species i would be slightly worried about the completeness of its assembly in terms of gene content whereas this species with almost the same number of genes predicted scores almost a hundred percent and is therefore um probably a much more reliable um annotation set and assembly here i want to go back to the very beginnings of the busco development day so this is before we had the let's say that even the prototype and using universally conserved orthologs to try to assess the quality of the mosquito genomes that we were sequencing and annotating at the time so we have i don't know 20 or so different mosquito genomes and here we have the complete orthologs found the ones found duplicated in blue and the ones missing putatively missing in red and please note i've committed the cardinal sin of cutting the axis at 80 but we need to in order to zoom in here and there's a couple of points to note here so first of all most of our genome assemblies are remarkably complete in terms of gene so yay we're happy most of the data that we've produced seems to be pretty good then uh one to note here is um Anopheles maculatus which has the lowest completeness all the way down to 84 percent here with uh quite a high level of duplication and the most missing orthologs and indeed this mosquito genome Anopheles maculatus was the most highly fragmented of all the assemblies that we were working with at that time so this assessment of gene space is reflecting the quality of the underlying assembly here in contrast we have a second one Anopheles christii here um which is here which was also almost as fragmented as the maculatus assembly just a little bit better but in terms of gene space it's actually doing pretty well so while it does have quite a few multi-copy hits uh most of the orthologs have been found so this little red bar here is pretty small so in this case even though Anopheles christii genome assembly was rather fragmented we could say at least we had managed to recover most of the genes that we were expecting to and a final example here in terms of duplicates in Anopheles melis so here this um species with the biggest uh blue uh fraction if you like this alerted us to the fact that our assembly had failed to reduce haplotypes to collapse haplotypes so the assembly for the species contained alternative haplotypes of the same haploid region of the genome and so we had to go back in for a second round of assembly and cleaning up of the haplotype information to try to get a better representation of the haploid genome assembly for Anopheles so just a couple of examples of the kind of red flags that this kind of analysis can throw up and then you can go back to your data to try to find out what is the underlying problem here perhaps a more typical um scenario of the use of busco in uh in a publication in a paper or in the supplementary materials this goes back to the question we discussed earlier about showing how well does my species genome and gene set score in comparison to other related species so here we're looking at the the focal species of this beetle the Asian longhorn beetle and we assess the completeness of its genome and its gene set in the context of a whole lot of other insects so here you can get pretty quickly an idea that it is as good as or better than most other sequenced and assembled insects at the time and that's the main message that you're trying to get across to your collaborators to the community eventually to the reviewers of your paper that the data that you have produced is as good as or better than the current standard in the field and is therefore useful now to proceed with your downstream comparative genomics analysis here uh just to not focus on arthropods all the time i have a example from plant genomes and gene sets so in plants they had already identified a core set of genes that they were using to perform a similar kind of assessment of the completeness of different plant genomes and just here in this example it's a nice example of where the busco genome score is much much higher than the busco gene set score here in dark green the genome way above 95 percent and the gene sets just maybe breaching 90 percent and this is a red flag for your genome being much more complete than your annotated gene set telling you that perhaps you need to go back and reassess your annotation protocol because it failed to identify all the genes that should be present in your genome assembly so very briefly that's the more obvious uses for quality control but now quickly to show you some examples of busco uses beyond quality control in comparative genomics the training of gene predictor algorithms and in phylogenomic analyses so in comparative genomics our downstream analyses after comparing all against all species to identify orthologs for example are going to be sensitive to the quality of the data of your input data sets and so therefore assessing the completeness of the gene space using busco can be a useful mechanism for selecting the species that you want to include in your analysis now several years ago of course you would probably just take all the species that were available but now these days with many many more genomes being sequenced and annotated we can afford to be a little bit picky in the ones that we choose to include in our comparative genomics analysis so obviously other factors will come into play like the taxonomic sampling the availability of complementary function genomics data so do you have expression data for the species do you have population genomics data for the species but all else being equal you can use busco to try to provide a logical selection criteria to help you focus on really the the genomic resources that are of the best quality to use to address your questions and here just showing an example from the world of bacteria where obviously if you're looking at the number of missing buscos on the x-axis and the number of fragmented buscos on the y-axis if you want a few streptomyces genomes to include in your analyses which ones are you going to pick you're obviously going to pick representatives from down here where there are very few or no missing buscos and very few or no fragmented buscos and you're going to avoid including any of these lower quality genomes for the same species this more or less represents the same kind of idea the the little triangles are a little bit hard to see in in these plots but the point is they are the reference genome assemblies as specified according to NCBI assembly database and all I'm trying to show here is that there are non-reference assemblies that are as good as or better than some of those assemblies that are marked as reference assemblies in NCBI so just going on a tag that says this is a reference this is the best is not necessarily always the safest route to go and you can use busco to obtain a much more qualitative a quantitative assessment of which are the best representatives to use briefly in terms of gene predictor training it's a complex task and especially when supporting evidence is lacking and you have to perform ab initio predictions and so gene prediction tools are very often the first step is to try to optimize their parameters for each species that they are trying to predict genes in in order to achieve the best results and so optimizing these predictors is usually performed by using a predefined set of high quality genes to start with quite often from a sort of RNA seek analysis now busco is because they are widely conserved they are often found to be really good for use as a predefined set to use in a gene predictor for their parameterization and training step so basically you can use the first round of complete busco matches to train augustus as busco already does but you can also use those complete matches to feed to any other gene predictor that is using a training step to fine-tune the parameters for the species under investigation and so here just very briefly to illustrate this point let me take the example of a butterfly here so danaeus plexipus the top firstly the more dark blue that you see in this plot the closer the ab initio annotation matches the official gene set annotation so the more dark blue you have the closer you are to the true annotated truth if you use fly parameters so that is parameters trained on the fruit fly to run an augustus ab initio prediction you get this level of match to the official annotation set whereas if you use the parameters that were trained using busco also reported gene set then the you get a massive improvement in the amount of match between your ab initio prediction and the official gene set so that's just one example of where you could use the parameters that are built during busco analysis to perform better ab initio gene predictions across your new genome assembly lastly these are universal single copy orthologs so they are inherently reliable gene markers for phylogenomic analysis that is for estimating the true phylogenetic relationships among the organisms that you are trying to study and this is almost a prerequisite for any evolutionary study that you might go on to perform afterwards and so they represent a predefined set of reliable markers that you can use in phylogenomics studies and this is illustrated here where we have some rodents where some of them we have only transcriptome data sets so these asterisks indicate we don't have a genome we only have a transcriptome and some we have genome so the mouse for example um and we can perform a busco analysis using either the metazoa data set or the more the larger mammalia data set or the even larger your contiglius data set to identify um busco matches in the transcriptome and the genomes subsequently filter to identify all of those that were identified in all of the species in question and then build super alignment of the amino acids that were identified as buscos in each of the species to then build the molecular species phylogeny so obviously analyzing with higher resolution data sets means you get more markers to use for your phylogenomic assessment but in this case at least using either the mammalia or metazoa produced a reliable phylogeny and that agreed with all three different data sets so i hope that i've managed to explain a bit about how busco works and what it's trying to do and convince you for its utility in its main purpose which is the quality control of genomics data but also highlights just a few extra um uses of the results from your busco analysis in selecting species for comparative genomics analysis in training your gene predictors to annotate your genomes and in uh large-scale phylogenomic analyses to determine the evolutionary relationships among the species that you're interested in and because we're on to the end of the hour now i'm going to wrap it up there um with hopefully you're now having a better idea of what busco is how it works and how it's built and the kinds of things that we can use busco for so i see a couple of questions in the chat um please go ahead if you have any others that we can briefly address now before taking a short break because we are over the hour um okay so we have a question here it's quite a specific one for species with almost 50% completed genome shows us that potentially we do not have yet enough genomics responses so not confident enough to continue with comparative analysis okay i should try and translate that that question a little bit yes so if you go back to what we were talking about first thing this morning an orthology delineation if you imagine trying to build that uh all against all graph that we were talking about where we're looking for best reciprocal hits between all pairs of species if one of the species that you're including in this analysis is incomplete and and dramatically so then of course you're going to miss a lot of those best reciprocal hits your graph is going to be incomplete and therefore your inference of orthology is going to be impacted by that incomplete graph so yes an incompletely annotated or incompletely sequenced genome will limit the kinds of analyses or the quality of the kinds of analyses that you can then perform downstream so while some cases you're stuck with the data that you have and you have to move forward no matter what but at least you know that this is the case with your data and therefore any downstream interpretations that you make you need to take into account what could be arising from technical artifacts or if it's not your focal species you can simply avoid those that are less than perfect if you like and choose a related representative species for your comparative analysis that is much more complete in terms of expected gene content so that you can then be more confident with your orthology calling your tree building or any downstream comparative genomics analyses that you want to perform yeah so thank you for for that question one more question before we break in busco can i browse the already built database busco genes and compare species for of genes of interest part of a path okay i'm struggling to understand that one so perhaps we can copy and paste that one into the google doc where we can elaborate on it and then i can try to answer in full perhaps after the break um okay so we are seven minutes past the hour so i think it's good for us to take a little break now before we come back for the third part of this lecture series so according to my clock it's eight minutes past so let's say at 20 past 11 so in 12 minutes time we will reconvene for the third part of this morning's discussions thank you very much and see you all in okay good so we have everyone on board for the third and final session of this morning thank you for all the questions and i see a couple more in the google docs that perhaps we can get to at the end if we have time i'll try and keep this last one fairly short especially because it's already 20 past the hour but please yeah continue to put the questions in the chat and the google doc and we will get to them at some point so in this third part i'm now going to shift from talking about sort of methods in orthology and what busco is aiming to achieve and how it does it to try to provide you with a few real-world examples if you like of comparative genomics using arthropod genomes where i just want to briefly set up the question in each case and then talk through the approach that we used to try to address the question and then a brief look at the result and these are really just tiny little snippets of real-world examples to give you a flavor of how orthology and busco can be used to answer questions in comparative genomics so the first one we're going to look at possibly some questions that are you know some of you might be asking of your own genomic data i've done my annotation i predicted a very small gene set why is this real or the opposite i finished my round of annotations and i have way way way more genes than i thought i was going to have because other related species have fewer also i've been working really really hard to improve my annotation has it actually improved at all then also looking at how we can use in a bit more detail the busco results for building species phylogenies and then finally i think a nice little example that goes a little bit beyond orthology because orthology we're usually focusing on the genes for which we can recognize equivalent counterparts in different species but we usually tend to ignore the fraction of genes that appear to be lineage specific or species specific and so how can we find out what these genes might be doing what are their functions because orthology and homology are not helping us here so in the first example we focus on particular humanus which is the human body louse which you can see a little cute picture of up at the top if you're not too offended by lice and here the question is from a comparative genomics perspective because this body louse is an obligate parasite that lives on our bodies it it has a small genome is there any evidence now that we've sequenced and annotated the genome that it has lost genes over evolutionary time by genome reduction and this comes from the idea that obligate parasites that live in a very controlled environment where they might be obtaining nutrients from their direct host might lose genes over evolutionary time that are no longer necessary for their survival because of this niche that they have adapted to so that's the the evolutionary question if you like we we've sequenced the genome it's small there are a few genes is this because of evolutionary loss driven by some sort of genome reduction process and so the approach that we designed to try to answer this question is we first perform orthology delineation with representatives from different insect orders and an outgroup so this comes to what we were discussing about earlier selecting the right kinds of species to perform your comparative genomics analysis now i will say this is back in 2010 so we didn't have many species to choose from back in 2010 but we did have representatives from different insect orders and an outgroup in order to design this experiment once we have delineated orthologs across our selected species we can then examine or count the numbers of orthologs that are shared amongst different pairs and different sets of species in our analysis so that's the question in the approach and this is more or less the summary of what it looks like in terms of counting the number of shared orthologs so on the left we have the species phylogeny we have drosophila melanogaster triboleum castanaeum nasonia vitropenis and particulars humanus and our outgroup species dafnia pulex and homo sapiens and what this Venn diagram is showing here is the number of orthologs that are shared between each pair or combination of species so here in the center we see we have 5,693 orthologous groups that are present in all four of these species 80 that are present only in the dipterans 163 that are present only in the body vows etc and the key numbers that we want to look at now are the orthologs that are shared between our little body louse here and nasonia vitropenis here in yellow or triboleum castanaeum in green compared to how many they share with drosophila melanogaster the fly and what we can see is that compared to nasonia vitropenis or triboleum castanaeum the body louse shares more has more orthologs in common with the wasp compared to the fly so 182 is bigger than 141 and it has more orthologs in common with the beetle than the beetle does with the fly so 275 is bigger than 213 so it seems that this body louse has more exclusive orthologs in common with beetles and wasps than either of them do with flies despite them being more evolutionary closely related to flies so that's already an indication that the body louse actually has maintained many ancient insect orthologs the other contrast that we can make is here in red showing the number of orthologs that the body louse shares exclusively with both the wasp and the beetle compared to what they share with drosophila melanogaster and here the contrast is much bigger so 576 compared to 395 so again this is pointing to a conserved set of orthologs that are present in all insects including our body louse but in fact are not present in the much more derived drosophila melanogaster the dipterans suggesting that the body louse has in fact maintained this ancient repertoire of conserved insect orthologs rather than having suffered losses so it suggests that instead of undergoing some sort of large-scale gene loss to reach to reach this lower overall gene content it is not because of some progressive loss in the sort of parasite lifestyle sense so perhaps instead this small gene set that we observe could be due to a lack of duplications a lack of gene family expansions rather than an excessive loss of ancient conserved genes so here again we can use the orthology data and we're here filtering to look for orthologous groups with at least one ortholog in each of our four representative species but with a total of at least six genes so these are orthologous groups that are present in all species where at least one or two duplications have happened over their evolutionary span the first thing to note is that less than half of the orthologs have a duplication in the body louse compared to 60 70 or 65 percent of the groups in the wasp beetle and fly that contain the duplications so more of these groups are showing duplications in the wasp beetle and fly than they are in our body loss and also what's shown here in the box plots is that the body louse in purple has a lower overall mean and median proportions of orthologs in these orthologous groups and this is statistically significantly different so it's basically saying that instead of gene losses leading to a small gene set what we have is fewer gene duplications fewer expansions of gene families in the body louse which again we could possibly relate to its lifestyle close association with mammals over evolutionary time obviously we are its host at the moment but our ancestors long before us and hosted it over the years may provide a sort of protected stable environment where having many copies of genes such as those perhaps involved in sensing your environment or communicating with others are less important and less needed than in insects that are free living such as the wasps or the beetles or the flies that have to navigate their environments and therefore the expansions of gene families and gene duplications that allow for a bigger repertoire of genes for sensing and interacting with the environment could be driving the higher copy numbers in those species rather than a gene loss contributing to the lower number of genes in our body loss and this slide you've seen before but just to reiterate that this is the body loss that we were talking about in this study where despite having very very few total genes way below the median for this set of insects at least it showed a remarkably complete genome in terms of busco score so here we can establish that this small gene set is not necessarily incomplete and that it has not arisen because of progressive losses of genes of evolutionary time but it's more likely a result of expansions that are occurring in free living species so that's the kind of approach that I'm hoping to to work through for each of these examples setting up the question having a look at the approach and then looking at the result now we have the opposite scenario where we have many many more genes than we might have expected and we want to find out why is this a technical problem or is there perhaps something about the biology of these species that could explain the larger gene set and here we turn to mosquitoes my favorite so this is culex quinca faciatis the vector of West Nile virus and the first thing to notice here we have a typical orthology plot for culex quinca faciatis Aedes aegypti another mosquito anonopheles gambii the malaria mosquito and our newly sequenced culex genome was annotated and it had many many more genes than the two previously sequenced mosquitoes so we needed to establish whether or not this rather large predicted gene set is simply due to the inclusion of many haplotype regions so we mentioned this earlier during genome assembly of a diploid organism if you have high heterozygosity it can often be difficult to fully collapse the essentially the two alleles that you're sequencing into the single haploid genome and so you can end up with certain regions of the genome included in your assembly twice or multiple times if these regions contain genes then of course you're going to predict more genes you're going to annotate more genes and you're going to look like you have a higher copy number of genes where in fact it's an artifact of your assembly process so this is the question that we wanted to try to address so the approach that we took is to again use orthology delineation to identify pairs of paralogs in each of these species so within the species Aedes aegypti anonopheles gambii and culex squintofaciatis then we asked the simple question of what does the percent identity distribution look like for all the pairs of paralogs and differentiate those that occur on the same scaffold with those that occur on different scaffolds so this is 2010 so we're still talking scaffolds we don't really have chromosomal level assemblies and I'll try to explain why we contrast the percent identity distributions between paralogs on the same scaffold versus paralogs on different scaffolds in the next slide so first of all globally across all paralogs what we're looking at here on the x-axis is the percent identity of 44 000 culex 37 000 Aedes and 20 000 anonopheles pairs of paralogs from the highly identical probably very young paralogs so recent gene duplications all the way back to the much older barely detectable paralogs with much lower percent identity and indeed if we look at the blue line here which is culex we can see it does have a higher number of very highly similar paralogs now where might those highly similar paralogs be coming from this goes back to our question about possibly including alternative haplotypes in the assembly the alternative haplotypes are different enough to have failed to be collapsed during the assembly process but they are still very similar because they are currently circulating alleles in the population so genes that have been erroneously included in the assembly multiple times would appear to have very high sequence identity because they are very close they are alternative alleles they're not even duplications if they're coming from the haplotypes so it does look like the culex mosquito has slightly more of these very closely related paralogs than the other species which could suggest that indeed we have included multiple haplotypes or we fail to collapse those haplotypes in our genome assembly so this is where it becomes important now to distinguish between paralog pairs that are on different contigs or chromosomes and paralog pairs that are on the same contig or chromosome now this requires a little bit of an explanation gene duplications that occur locally in the genome are going to produce tandem copies of paralogs all within a short genomic region around the original progenitor gene whereas haplotype regions are generally going to be much longer and are not going to be assembled into a contiguous supercontig like the duplicated genes like the truly duplicated genes so it's a bit of a technical assumption but it's the basis for what we're going to use to try and distinguish between what we consider real gene duplications so those that have happened and have stayed next to each other in the genome probably real gene duplications and those that are present in different contigs or different scaffolds and that therefore may be more likely to be from haplotype regions that we fail to collapse so first of all on different chromosomes or different supercontigs in fact now the blue line which is Qlex goes way down to about the same expected value for an off lease the red line so it doesn't look like there's an over representation of paralog pairs on different supercontics which would indicate potentially a problem with haplotypes instead that's a high number of close paralogs that we saw in the previous slide are almost all coming from paralog pairs that are on the same supercontict or same chromosome so they are local tandem gene duplications and so therefore we expect that these are much more likely to be real tandem gene duplications rather than a problem with failing to collapse the haplotypes if these two graphs have been inverted then indeed we would say that we probably have a problem with the haplotype regions and we need to go back into our assembly to try to correct for those errors so it does have many many more genes and it doesn't look like the paralogs are coming from the erroneous inclusion of haplotypes and here we can see again that in this orthology plot the all the categories with n in culex are greater in culex so this blue region the brown region and the purple region are all bigger in culex so these are orthologous groups that contain multiple copies of gene that is also found in the other two mosquitoes what we haven't managed to resolve with this approach of course is what's going on here with this large fraction of apparently lineage specific or species specific genes so here a comparative genomics approach is not really going to help us because we can't identify orthologs in the first place so to try and get a feeling for whether these are real gene annotations or not you would need to perform some sort of transcriptomics map them back to the assembly and quantify the expression levels of these genes to give yourself confidence that they might actually be real genes encoding real proteins performing real biological functions this being back in 2010 we didn't do that so this fraction remains a mystery but at least the high duplicated genes and the paralogs that we were able to identify don't look as though they are coming from erroneously included haplotype regions so now as genomes go through several different versions of improvements with new technologies being used to improve the assembly and therefore often require a re-annuation a question that is important to ask at that point is did my genome assembly upgrade and genome annotation upgrade actually work so do we now have genomic resources that are better than what we had a year ago and for this we'll look at the example of improving the honeybee genome annotations so we've done a lot of work to try to identify more genes correctly in the genome of the honeybee but has all of this effort actually paid off do we now have a better annotated gene set in this species and you can see it's the approaches that we used involved a lot of people and basically every single thing you can possibly think of from back then lots of ESTs lots of manual curation and I think I have it yes on the next slide here so running 32 different annotation sets comparing them and trying to produce a consensus gene set of all of the different predictions looking at conservation with RefSeq sequences and peptides to also quantify back in those days SEGMA because BUSCO didn't exist the conserved core set to try and identify the best possible consensus of all of these annotation sets which finally resulted in a official gene set which contained 5000 more genes than the original annotation of the honeybee genome which was done back in 2006 I think so we have 5000 more genes but are they better are they real does it actually result in an improved annotation compared to the first official gene set and just briefly here we have a question from the last example is there any particular software which can help with comparing paralogs on different contigs versus the same contig or does this just require a bit of manual work I'm not aware of any software that does that but it's fairly simple to do once you have your orthology delineated then you can identify all pairs of paralogs and if you have the GFF files so then you know where those genes are located in your genome you can simply flag each pair whether or not it's on the same or different contigs or scaffolds and then produce a similar plot to the ones that we just saw so I'm not aware of anything particular that does that job but it's it's it should be fairly doable with minimal programming skills okay so what we did here for the honeybee annotation set was to assess the quality of the new set we didn't have busco in those days so instead we were counting the number of rare gene losses but you can already see how these kinds of questions are leading us to start thinking about developing busco in order to have a one stop shop that can answer these questions for us instead of coming up with creative design experiments to try to convince ourselves that the genomic data that we've produced is good or better than what we had before so of course rare gene losses can happen but we can also use them just as we do in busco as an estimate for how many genes might be missing from your annotation set so here very briefly we have a selection of arthropod species and we are counting the number of orthologous groups for which there are orthologs in all the other species but not in this given species so basically the larger these bars are the more potentially missing orthologs that species has now in this case I would say ignore the outgroup because that's not really a fair comparison here they're quite likely to be missing more ortholog simply because they are the outgroup in this analysis and what we want to focus on here are the three hymenopteran species because that's more like for like evolutionarily speaking we have the honeybee before and we have the honeybee after and then we have two ant species here and basically what this is telling us is that there were many many more orthologous groups that were missing a counterpart in the honeybee in version two annotation and this is about halved in our latest version 3.2 annotation suggesting that all the work that we did to try to improve the annotation actually did recover more widely conserved orthologs so now moving on to using this for phylogenomic questions here we have the question that I need species phylogeny in order to proceed with my evolutionary analysis but I want to include some species for which we don't yet have a genome for which we don't yet have a genome annotation for which we don't have orthology how could I possibly come up with a strategy so that I can include species both those that do have genomes and annotations and those for which I could only find transcriptome assembly from a collaborator or from NCBI but I would really like to include them in my species phylogeny so here we use busco in two different modes so we use it in the genome mode to assess genomes that we have available and we use it in the transcriptome mode to assess the transcriptomes that we have available then we take the results from the genomes and from the transcriptomes and we identify the conserved set of buscos that were found in all of the genomes and that were found in in this case both of the transcriptomes so we have two transcriptomes here and the rest are all genomes so we can identify the buscos in the transcriptomes combine them with the buscos that we identify from eight other genomes produce a super alignment of amino acid sequences and use that for phylogeny inference in this case using a software called Raxamel and here we were able to build a species phylogeny with representatives of different arthropod insect orders the new genome that we had just sequenced and annotated and at that point there were no other ordinate genomes available there were only a couple of transcriptomes available but by applying this combined approach we were able to include them in the species phylogeny so we had more than one ordinate representative in our species phylogeny this one you've seen before where again we were using data from transcriptomes and from genomes assessing them with three different busco lineage data sets with different levels of resolution identifying those that are in common amongst all our species using those to build a super alignment and then using those super alignments to build the species phylogeny and this is along the lines of one of the practicals that we will be tackling in the class this afternoon so finally in the the last few minutes I would like to present a last example of comparative genomics from arthropods using orthology but in this case looking at these lineage specific genes for which we often have no idea what their biological functions might be here we turn to the Hessian fly which is renowned for secreting through its saliva into the wheat plants that it's feeding on a number of effector proteins that control and limit the ability of the wheat plant to reject the Hessian fly larvae that are feeding on the wheat plant so our approach in this case because we're asking a functional genomics question we also require transcriptomics not just comparative genomics but here we're going to combine the transcriptomics with the comparative genomics to partition the genes according to whether or not they are found in other species or whether they are specific to the Hessian fly and then use the transcriptome data to see whether or not these species specific genes are coming from the saliva and in fact might be the key proteins that are responsible for this manipulation of the plant host so this kind of plot you've seen a couple of them already today this morning where we have the widely conserved genes across all species on the left and as we move to the right we have more lineage specific and finally species specific genes and here is our Hessian fly appropriately called M destructor for the destruction that it causes to wheat across the world and here we have 15 percent of all genes predicted in its genome that showed no orthology to any of the other genomes species that were included in our analysis when we looked at what those genes might encode by comparing them to the transcriptome from the salivary gland a very large chunk so this yellow chunk here of these lineage specific genes matched those that were found in the saliva in the salivary gland transcriptome either in the self-homology only category or with no significant homology to any other protein so just to zoom in this is the yellow bit that we're talking about here and so compared to most of the other sequence genomes the Hessian fly had a large fraction of genes without homologs in other organisms but within this fraction there were almost a thousand of these salivary gland proteins with a perfect match to the annotations in the genome with two nearly 300 of them present in the single copy no homology fraction but a large fraction this dark yellow one in the multi-copy self-homology only fraction what we're what this means is that in the Hessian fly genome this kind of protein with this overall structure that appears so far to be unique to the Hessian fly until we start sequencing more flies it's going to be unique so it appeared at some point in the evolutionary history of this lineage and has subsequently duplicated many times because it has created this large repertoire of more than 600 self-homology only genes in the Hessian fly genome so clearly it's an important gene family because there's many of them and by doing the transcriptomics we could establish that in fact a lot of them are being expressed and secreted into the saliva and are therefore likely to be some of the key molecules key proteins that are involved in manipulating the wheat plant host response as it tries to fight off the the larvae of the Hessian fly so that was a slightly alternative example to using orthology in comparative genomics where for once we actually focused on those genes for which we could not identify any significant orthology so hopefully with these examples you get a flavor of some of the questions that we can ask in comparative genomics for which we can use orthology data and or busco assessments to try to reach a conclusion sometimes it's a technical trying to make sure that the data you've produced are good good quality and usable for downstream comparative genomics analyses other times to build a species phylogeny which is going to be important for almost any downstream evolutionary analysis that you want to do and finally looking at the genes that we normally ignore those poor little no orthology category orthologs genes at the end of those orthology plots so that just gives you a little bit of a flavor of some applications of orthology and busco in atheropod comparative genomics and that leaves us just a few minutes at the end for me to take any questions that might be in the google doc or any questions generally about any of the presentations that we've been through this morning an interesting one yes i do manual curation to bring context containing missing busco genes into a denova assembly from assemblies generated with other assemblers for example to improve my busco score is this just cheating my busco score to make the genome appear better or can i assume it brings in other missing genes which are also on the context alongside the core busco genes would you recommend this strategy perhaps we should have a poll to see how many of the audience would recommend this strategy no basically this is cheating i wouldn't recommend this strategy at all um all manual manipulation that you perform on your assemblies afterwards should be limited to cleaning out contamination cleaning out errors correcting incorrect scaffolding so especially now with some of the the high c data there are in fact tools to be able to visualize how your assembly has been put together um and highlight potential errors that could benefit from some manual curation so basically i would i would recommend against it you are cheating the system because you're bringing in busco genes just to get a better busco score at the end of the day you are going to use your genome and your genome annotation to answer your biological or your evolutionary questions and you're going to provide this data to the community hopefully for them to use to ask and answer their questions so by simply pulling in extra context to bring your busco score up i don't think you're doing anyone any favors in the long run you just need to keep working at your assembly strategy uh fine tuning the parameters and trying different approaches especially with pre-processing for example of some of the reads to eventually arrive at an assembly that is both scoring well in terms of busco completeness but also showing good contiguity and other metrics that we use typically for quantifying the completeness and quality of your genome assembly so a nice question there to to finish on um yep my official recommendation is don't do it because it'll only come back to bite you in the future