Hello, my name is Anthony Bretaudeau, and I'm from Rennes, in France. Today I'm going to show you how you can use Galaxy to annotate a eukaryotic genome using Funannotate. We will use the genome of a fungal species, Mucor mucedo, which is 40 megabases long, and we'll try to predict all the genes on this genome.

To do this we will use Funannotate, which is designed to perform genome annotation. Originally it was written for fungal genomes, but now it works on all eukaryotic genomes. To predict genes on a genome, Funannotate needs some evidence data, like alignments of proteins or transcripts from other species, or RNA-seq data sequenced from the same genome you're trying to annotate. Funannotate also uses ab initio predictors like Augustus and SNAP, but this is handled entirely internally by Funannotate, so you don't have to set it up yourself.

Once we have run Funannotate to produce our gene predictions, we will run some functional annotation tools, in particular eggNOG-mapper and InterProScan, which will be able to assign names and functions to the predicted genes. And finally we'll evaluate the quality of our annotation, compare it with another annotation, and visualize it using JBrowse.

So as I said, we're going to annotate Mucor mucedo. If you look in the training material, the sequence of the genome was assembled from sequencing data in this tutorial, genome assembly using PacBio data. So if you run that tutorial, you will be able to reconstruct the genome sequence we will use for the annotation tutorial today. After assembling the genome, we followed the repeat masking tutorial in the genome annotation section, which is here. There's a video for that too, which you can follow along to produce a genome ready to be annotated in Galaxy using Funannotate.

So let's begin this tutorial. First, we need to find our tutorial, which is here, genome annotation with Funannotate.
Then I create a new history that I will name Funannotate. The first step is to get the data we want to analyze. If you look at the beginning of the tutorial, you get all the URLs needed to fetch the data directly into your history. So we just copy this whole block and upload the data by pasting all these URLs here. Then you click on start and wait a little. It will become green soon.

Okay, so the datasets are green now and we can have a brief look at the content of each one. The first one is the masked genome. That's the sequence of the genome we want to annotate today. You can have a preview here. As you can see, there are some regions that were soft-masked by RepeatMasker in the previous repeat masking tutorial, and that information will be used by Funannotate to correctly predict genes on the genome.

And you have a few other datasets. The two following ones are RNA-seq data. It's just FASTQ data directly from an Illumina sequencing dataset. It's paired data, with read 1 and read 2 files.

Then we have a Swiss-Prot subset. These are protein sequences from Swiss-Prot that were selected because similar sequences are found on the genome we want to annotate, and they will be used by Funannotate to correctly predict genes based on these expected sequences. These sequences come from other species, like Solanum lycopersicum for example, but there are many, many other organisms that have proteins similar to what is found on our genome. In real life, you would want to align the whole of Swiss-Prot on the new genome you want to annotate, but here for the tutorial we reduced the Swiss-Prot database to only a subset, to speed up the computing.

And then we have three other datasets that will be explained a bit more later. First, the alternate annotation. These are files from another annotation that was performed a bit differently.
At the end of this tutorial, we will compare the results of the annotation we perform with Funannotate against this other annotation, to see which one is the best. And finally, template.sbt is a file from NCBI. We will use this file to prepare an annotation that could be submitted someday to the NCBI server.

Okay, so as a first step in our tutorial, we have uploaded the data. Now we want to prepare the genome sequence. We will use two tools: Funannotate assembly clean, and then sort assembly.

Let's run assembly clean here. We just select the genome sequence from our history. As you see, there are a few parameters. They mean that we will try to remove some sequences from our genome that are not useful. First, all the contigs that are shorter than 500 bases will be removed from the assembly, because these contigs are very short and there is a low probability that you can find a gene on them. This tool will also try to identify repeated sequences, like chunks of contigs that are included in another one, which can happen, for example, in diploid genomes. This step is not mandatory. Maybe you will have an assembly to annotate that is already considered to be perfect and you don't want to clean it more. But in this tutorial, we do it to make sure we have a clean genome. So that's it, we run it.

Then we want to sort the assembly, because at the end of the clean tool we will have a lot of contigs that are randomly sorted, with short ones in the middle of the file. This tool will ensure the longest contigs come first, and it will also ensure the names of the contigs are standardized for the rest of the analysis. We just need to select the output of the clean step. We want all the contig names to be based on a prefix, which will be scaffold, followed by an incrementing number.
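The length filter and the sort step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Funannotate's own code: the function name `clean_and_sort`, the `scaffold_N` naming pattern and the numbering starting at 1 are assumptions, and the detection of contigs contained in other contigs is deliberately left out.

```python
def clean_and_sort(contigs, min_len=500, prefix="scaffold", start=1):
    """Drop short contigs, sort the rest longest first, and rename them.

    contigs: list of (name, sequence) tuples.
    Returns a new list of (new_name, sequence) tuples.
    Note: unlike the real clean step, this sketch does not look for
    contigs that are duplicated inside longer ones.
    """
    # Keep only contigs with at least min_len bases
    kept = [(name, seq) for name, seq in contigs if len(seq) >= min_len]
    # Longest contigs first
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    # Standardized names: prefix plus an incrementing number (assumed pattern)
    return [(f"{prefix}_{i}", seq) for i, (_, seq) in enumerate(kept, start=start)]


# Hypothetical toy assembly: lowercase stands for soft-masked bases
toy = [
    ("contig_7", "ACGT" * 200),    # 800 bases, kept
    ("contig_12", "A" * 100),      # 100 bases, removed
    ("contig_3", "acgtACGT" * 250) # 2000 bases, kept and sorted first
]
for name, seq in clean_and_sort(toy):
    print(name, len(seq))
```

Soft-masked (lowercase) bases are left untouched, since only the order and the names of the sequences change.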
We don't want to filter shorter contigs in the sort tool, because we've already done that in the previous step, so we don't need to do it again. We execute it and we wait a little.

Okay, so now the assembly is cleaned and sorted. We can have a quick look at the FASTA files. First, the cleaning. As you can see, the sequence looks much the same. But if you look at the number of sequences, you only get 1425 sequences, while the original genome sequence contained 1461. That's just as expected. And then the sorting. If you look in the cleaned sequence, you have headers that look like contig 1000. If you look at the sorted one, they all have standard names like scaffold 1, scaffold 2, etc., and the longest one is at the beginning of the file. That's what we need for the rest of the tutorial. We don't want to look at the whole file. And we still have the masked regions, which are in lowercase like this. So we are ready for the rest.

To perform an annotation, Funannotate, like all the annotation tools for eukaryotic genomes, needs RNA-seq data, because these data correspond to the sections of the genome that were expressed in the living cell, which indicates regions where there are genes on the genome. Here we have a dataset in FASTQ format, with forward and reverse reads, that comes from the Sequence Read Archive. We will take these FASTQ files and map them on the genome to identify all the regions that were expressed in the conditions of this dataset.

To perform the mapping, we will use a very widely used tool, which is STAR. It's here: if you search for star, we just select RNA STAR. We have paired data, so we say it is paired-end, as individual datasets, R1 and R2. The forward reads are R1 and the reverse reads are R2. And we want to map these reads using a reference genome that is in our history here.
It's the sorted and cleaned assembly, dataset number nine. And finally, we set the length of the SA pre-indexing string to 11. It's explained in the tutorial, and it's not a magic number: if you left 14 here, you would get a warning in the output of the tool saying that you should use 11. We already know it, so we put 11, but that's how you can find the right value. We can leave all the other options at their defaults and execute. In a moment you will see the output of this mapping step.

All right, so now the mapping is finished. You get three datasets. The least interesting one is the splice junctions.bed file. It's just the positions of the splice junctions on your genome. You can safely delete it, we will not use it in this tutorial. The other two datasets are quite interesting. If you look at the log file here, you get information on the results of your mapping, and the most important line is probably the uniquely mapped reads: 96%, which is very good. It means 96% of the reads from the FASTQ files were correctly mapped on the genome. This is quite satisfying, and it means the RNA-seq data was of good quality and we can use it for the rest of the annotation. And if you look at the BAM file here, it is not really useful to read it line by line, but it contains the alignment of each read on the genome.

Okay, so with this data, we should now be able to run our structural annotation. It's time to run Funannotate predict. You can find it here, where you have all the tools of the Funannotate suite. We'll take this one. That's really the place where you determine which options to use for Funannotate, and it's very important to fill it in correctly. First we want to specify on which assembly we want to perform the annotation. Once again, it's the sorted and cleaned assembly, as we prepared it at the beginning of the tutorial.
Then we have to select the latest database from Funannotate. Depending on the Galaxy instance where you're running this tutorial, you might have different values, so just take the latest one. You can leave this option on.

What is important after that is to specify some information about the organism you want to annotate. First the species name, Mucor mucedo. Then a strain name: here we use muc1.

There is this question: is it a fungus species? I recommend setting it to no unless you know what you're doing, because if you specify that it is one, it will run an additional tool, CodingQuarry, which might give not so good results depending on the input datasets. It can even make the whole Funannotate tool fail, depending on the data you provide. So in this tutorial we set it to no, and in real life you would probably want to test with and without it and see which works better in your case.

Then, the most important part is the evidence. That's where we give as much data as possible that can be used by Funannotate, and by the tools it will launch, to find the correct gene structures. First we have some RNA-seq data, so we just give the output of STAR here. Then, we don't have full mRNA or EST sequences, so we don't select anything here. But we have some protein sequences. By default, you should use Swiss-Prot, but in this tutorial, to speed up the computing, we select the subset, as I explained a bit earlier. So it's the Swiss-Prot subset FASTA file. And that's it.

Then you have the BUSCO setting. BUSCO is a dataset of genes that are expected to be found in one single copy in a specific clade. For example, if you select here the one closest to our species, so the mucorales taxon, BUSCO will provide a list of protein sequences that are found in most of the species of this clade. So we expect to find them in this genome.
Funannotate will be able to use this data to train the ab initio predictors to recognize genes in the particular species we are annotating. For Augustus, we also want to specify the species closest to the genome we want to annotate. Here it's Rhizopus oryzae.

I think that's it. Afterwards you can filter the results of Funannotate to say, for example, that you don't want genes with very short introns. You could set the minimum intron size here. We will just leave the value at the default, which is 10, and you have some other filters here. You can even select some advanced options here, if you know exactly what you're doing. And here we select which outputs we will get from Funannotate. We want to have everything. I guess that's it, so you can then execute.

This step will take a bit of time, because what Funannotate will do is take your RNA-seq data, take the protein sequences from the Swiss-Prot subset and try to align them on the genome, then run a few ab initio predictors like Augustus and SNAP, and also try to align the BUSCO sequences from the clade of this species. From all these data, after aligning them correctly, it tries to produce good quality gene predictions, taking all of this into account. It's a real genome we are analyzing here, so it will take a bit of time, maybe two hours or so.

So maybe it's a good moment to take a break, have a cup of coffee, and come back later to see the result. Or you can continue to follow the tutorial, launch the following steps, then get the final result at the end and look back at each step afterwards. It's just as you prefer. Here I will pause the recording and come back when the result is there.

Okay, so Funannotate predict is finished now, and all the datasets are green.
You can have a look at them one by one. The first one is the GenBank format annotation. That's the whole annotation in GenBank format, as on the NCBI or EBI websites. You can see the position of each gene on the genome with an identifier, which always has the FUN_ prefix and an incrementing number after that, and you have the sequence of the protein, the positions of the exons, etc.

You can get the same information in GFF3 format, which is much more widely used. You just have to click, and it will come soon. There we go. The GFF3 format is always the same, it's a tabular format. The first column is the scaffold where the gene was found. The second one is the source, so here all the genes were predicted by Funannotate. Each line is a feature with a start and end position on a specific scaffold of the genome. In the third column you have the type of the feature: you always get a gene, then an mRNA, then the exons in this mRNA and the corresponding CDS, the coding sequences included in these exons. You have the position of each one, the strand, and some basic information afterwards: an identifier for the gene and for the mRNA, the relation between the gene and the mRNA, and then the identifiers for each exon and CDS.

As you can see here, we just know the position of the gene, with an arbitrary identifier, and the only information about this gene is hypothetical protein. You don't know the function of this protein. You can have a look at all the genes, they all have hypothetical protein. In the next step we will learn how to assign good names and functions to these genes.

If we continue to look at the datasets, we have an NCBI TBL annotation file, which is yet another format for the same information. And then we have three important files. The first is the mRNA sequences, so the full transcripts. It should come soon. Here we go.
So we have the mRNA sequences, beginning with an ATG most of the time. Then we have the CDS sequences. The difference between CDS and mRNA is that, if Funannotate was able to predict some untranslated regions at the 3' or 5' end of the gene, they will be in the mRNA sequence but not in the CDS sequence. Here the CDS all start with an ATG, and it's the same in the mRNA sequence file, which means that not many untranslated regions were predicted. We will be able to check that using JBrowse a bit later. And of course you also get the protein sequence of each gene that was predicted.

Another important dataset is the stats file. If you look at it here, you first have a trace of how Funannotate was launched to produce the result you get. That's not very interesting on its own, but afterwards you have the exact version, the date when it was launched, and also the exact version of all the databases and data that were used by Funannotate to predict the genes. It's very important to keep this information. Then you have some statistics on the assembly. We already knew all this before annotating: the size of the genome, the number of contigs, so that's not very surprising.

But the most important part is the annotation statistics here. By looking at it, you see that Funannotate predicted 30,000 genes in our genome, and you have the corresponding number of mRNA and tRNA. You have the average gene length, the number of 5' and 3' UTRs, the number of CDS that don't have a start codon and things like that, the number of exons, and the average size of exons and of proteins. All this information is very important to get an idea of how your annotation looks. If you had a very small average gene length, for example, it would mean that Funannotate had trouble finding genes on your genome, and it could be evidence of a bad result.
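To make the link between the GFF3 columns and these statistics concrete, here is a minimal Python sketch that counts the gene features in a GFF3 text and computes their average length. It is not how Funannotate builds its stats file; the function name `gene_stats` and the example coordinates are invented for illustration.

```python
def gene_stats(gff3_text):
    """Return (number of genes, mean gene length) from GFF3 text."""
    lengths = []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # column 3 is the feature type; only count genes
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    count = len(lengths)
    return count, (sum(lengths) / count if count else 0.0)


# Hypothetical example with two genes (identifiers and positions are made up)
example = "\n".join([
    "##gff-version 3",
    "\t".join(["scaffold_1", "funannotate", "gene", "1001", "2000", ".", "+", ".", "ID=FUN_000001;"]),
    "\t".join(["scaffold_1", "funannotate", "mRNA", "1001", "2000", ".", "+", ".", "ID=FUN_000001-T1;Parent=FUN_000001;"]),
    "\t".join(["scaffold_1", "funannotate", "gene", "3001", "3500", ".", "-", ".", "ID=FUN_000002;"]),
])
print(gene_stats(example))
```

Running the same function on two different annotations of the same genome gives a quick first comparison of gene counts and average gene lengths.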
You often use these numbers to compare two results and decide which annotation looks correct. And finally, you have three other datasets that are tbl2asn data. We will not look at them in detail now; we will see them a bit later, after the functional annotation.

All right, so now we have a structural annotation, we have the positions of all the genes. Now we want to know the function of these genes. For that, we will use two tools, eggNOG-mapper and InterProScan. So let's launch them.

We start with eggNOG-mapper here. We run it on the protein sequences that were predicted, dataset number 16. We say they are protein sequences. You take the latest version of the eggNOG database. You can leave all the other options as they are, except in the output options here. There's an option that says exclude header lines and stats from output files. We don't want to activate this option, otherwise we won't be able to use the output file in the rest of the tutorial. So let's execute it.

Then we run InterProScan here. Take this one; the other one is an older version that is not up to date. Once again, we select the protein sequences, and we tell InterProScan that they are protein sequences. We select the latest version of the InterProScan database here. Then we can select which applications in InterProScan we want to launch. You can select them all, unless you are in a real hurry. InterProScan can also use a few other programs that are not free software, which means they require a manual installation by the admin of your Galaxy instance. Here we will not use them, but you could choose to use them by selecting them here, if you know that they are installed on the Galaxy server. It is the case on usegalaxy.eu, if you're interested. The other options can stay as they are. In the output format here, we just select the XML format as well.
And that's it. Just a few words to explain what each tool will do. InterProScan is a tool that looks at each protein sequence and tries to identify the presence or absence of motifs and patterns from a huge database. This database is named InterPro, and it contains a lot of protein family signatures and catalytic site signatures. InterProScan will take all these signatures and try to identify them in all the sequences of the annotation we generated before.

eggNOG-mapper will compare each protein sequence to a database of protein orthologs from a lot of different species. It will try to recognize your protein as a member of an orthology group found in a lot of other species, more or less related to the species you're annotating.

Based on the presence of a motif for InterProScan, or the correspondence to an orthology group for eggNOG-mapper, each tool will be able to assign some information to each protein of your annotation, like a name, a function, or even gene ontology terms. We'll see that in the output files.

Let's have a look at the eggNOG-mapper results. First, you have a seed orthologs output, which is not so interesting. It just tells you that this protein from your annotation matches this protein from the eggNOG database, and you have a lot of information on which part of each sequence matches which part of the other sequence. It's interesting, but not that much. But if you look at the annotations output here, it's a tabular file, and there you have very interesting information for each sequence of your annotation. Sometimes you don't get results. You can see there's no FUN_000001 line, because there's no match in eggNOG-mapper. But when you have results for a gene, you know which sequence of the eggNOG database matches it, and you have a score and an e-value to know if it's significant or not.
You have identifiers for orthology groups from the eggNOG database. And most importantly, you have a lot of information like a description, so a human-readable name for your protein, and a lot of other identifiers like short names or symbols, gene ontology terms or KEGG terms. These ones are EC numbers, for example, and there is information from other databases like CAZy or Pfam. So that's a lot of information for each gene in this file.

And you get similar information in the InterProScan output here. It's easier to read in the TSV output, but it's the same information as in the XML file, and here again it's given for each gene. Don't be afraid, it's not sorted, so it begins with 527, but the other ones come later. For each gene you can see, in this column, identifiers of motifs, or patterns if you prefer, that were found in the sequence, with the exact name of the source database where this identifier was found. You get a name for each motif that was found, scores of course, identifiers in the InterPro database, names again, gene ontology terms, MetaCyc information, and some Reactome information too, I think we can see a bit of it in this place. So a lot of information and links to other databases. That's very important information. Now we can say, okay, this gene has this function because we have found this motif, it matches data from the eggNOG database, and this information is coherent.

Fine. So now with Funannotate we have a structural annotation, and we have launched InterProScan and eggNOG-mapper to get a functional annotation. We want to merge all this data into a single file. A very interesting feature of Funannotate is that it can generate files ready to be submitted to NCBI. So at the end of this tutorial you should have data that you could submit to the NCBI database and reference in a publication, for example, which is really convenient.
Just don't try to do it during this tutorial, because you don't want to submit yet another annotation of the same genome: it has already been done for this genome. To do this, Funannotate needs a template file that you can generate on the NCBI portal. If you go to this URL, you need to log in and then you get a form, as shown in the tutorial here. It looks like this, and if you fill it in correctly by following the instructions here, you can download the file and upload it to your history. That's the file that is here, template.sbt. I've already done that beforehand, to keep this video from getting too long.

So the next step is to use this template file, the structural annotation and the functional annotation, and merge all this into a single annotation, in different formats that we can use for any other analysis later on. To do this you have one tool, which is Funannotate functional annotation, here.

First we select the structural annotation from the previous Funannotate predict run. It is the one in GenBank format. As usual, we take the latest version of the Funannotate database. Then we can select the NCBI submission template file, which is dataset number seven here. And there you can select the eggNOG-mapper output, the annotations output in tabular format, and the InterProScan output in XML format. For other analyses, you can also input antiSMASH output, but it's not very interesting for this kind of fungal genome. And Phobius too, but this can already be done by InterProScan if you have selected the restricted tools, so we don't need to add it again here.

Once again, we select the BUSCO model that is as close as possible to the species we're analyzing, so mucorales here. It's still the same strain name here, so muc1. And here we have a locus tag. That's the prefix of each gene name. Usually it's NCBI that assigns you a locus tag for your genome before submission.
Here we will assume we need to use this one, and then this can be left as it is. And finally, we select all the outputs. It's a long list. Now we just need to run it and be a bit patient.

So now this tool has finished, and we can have a look at the output. Here, once again, we have the annotation in different formats, for example the GenBank format here. It looks a lot like the previous output file. The only difference from the GenBank file we saw earlier is that now we have some functional annotation directly included in the GenBank file, like GO terms here, or eggNOG identifiers, or db_xref identifiers pointing to external databases, and symbols like this. So the file is very much enriched with important data on each gene.

As you can see, I still have the FUN prefix in this file, while in the form before I selected the MMUCEDO locus tag. That's a mistake only due to the way I recorded this video. If you follow this tutorial by entering everything as written, you should have MMUCEDO here instead of FUN, and the same in the other files that I'll present afterwards.

Let's have a look at the other output files. You have the GFF file here, I guess. No, it's not the GFF, it's just a tabular file with all the annotation. You have the sequence of the genome itself. We already had it before, so it's not very interesting. There is also the AGP file, which is the correspondence between contigs and scaffolds. That's information about the assembly too. It's not very useful for us, but it could be asked for when submitting the annotation and genome to NCBI, which is why it is generated. You also get the annotation in formats that are needed for submission to NCBI, so Sequin and TBL here. You have the scaffold sequences, the protein sequences again, like this, with all the proteins, the mRNA, the CDS, and the GFF3. That one is very important, as it's a format used by a lot of tools.
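Since most of the functional annotation in a GFF3 file lives in its ninth column, it can help to see how that column is decoded. The sketch below follows the general GFF3 conventions (pairs separated by semicolons, key=value, multiple values separated by commas, special characters percent-encoded). The attribute names and values in the example are hypothetical and are not taken from the actual Funannotate output.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn a GFF3 attributes string into a dict of key -> list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        # Several values for one key are comma-separated; decode %XX escapes
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


# Hypothetical attributes for one annotated mRNA
line = "ID=gene1-T1;Parent=gene1;product=hypothetical%20protein;Dbxref=PFAM:PF00001,InterPro:IPR000001;"
for key, values in parse_attributes(line).items():
    print(key, values)
```

With this kind of parsing, it becomes easy to check, for example, how many genes carry at least one database cross-reference.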
Just like the GenBank file, the GFF3 file contains the same information as earlier, plus the functional annotation that is now included in it. So you have the positions as usual, and here you can have eggNOG identifiers, sometimes different names, GO terms, and a lot of information directly in the GFF file.

You also get some statistics. It's the same kind of file as earlier, but this time, in the functional section, you have a lot of numbers: the number of results from InterPro, from eggNOG, from Pfam, CAZy, etc., which come from InterProScan and eggNOG-mapper, the number of GO terms, and so on. This can be useful to know whether you have a lot of functional annotation or not.

Those are the most important output files. You also have three files like this, need curating here, product must fix, and product new names, and here the summary report. These give you information about potential problems in the functional or structural annotation that you should have a look at before submitting to NCBI. We'll not go into much detail here.

Okay, so now with Funannotate, we have an annotation, we have some numbers on this annotation, and we have functional annotation, but we are still not sure our annotation is of good quality. There is one way to evaluate this, by using BUSCO. BUSCO is available in Galaxy, so we will launch it now. Here, we want to run it on the proteins that were predicted, from the last Funannotate functional run. We want to look at proteins, as written here. We don't want to auto-detect the lineage, because we know it's a genome from the Mucorales lineage, so we just select it. And the outputs we want to get are the short summary text and the summary image. I guess that's it, so we just execute.

This tool will look at all the protein sequences and check whether it finds all the proteins that are expected to be present in a single copy in a genome of this lineage. We just have to wait a little to get the results.
Let's have a look at the result now. The text summary looks like this. BUSCO searched for 2,449 genes that are expected to be found in this genome, and it found 2,312 of them complete in the annotation that we made with Funannotate. That's quite a good score. Among these 2,312 genes, 2,281 are found complete and in single copy. That's the most important number. A few are duplicated, 31, but that's not too many, so it should be okay. Even fewer are found to be fragmented.

There are still a bit more than 100 genes that are not found in the annotation we have generated. It could be because there is a problem in the assembly: some portions of the genome may not have been sequenced and then assembled properly. Or maybe the annotation itself didn't detect these genes, even though they are in the genome sequence. To check this, you could run BUSCO at the level of the genome itself. We did that previously to prepare this tutorial, and we know that by running BUSCO on the full genome sequence, it was able to find 2,327 complete genes, which means a few were not found by the annotation. But still, even if the annotation was perfect, you would have a bit more than 100 genes that would not be found by BUSCO. That's probably because they are not in the genome sequence, either because they were not properly sequenced, or maybe because these genes are not present in the species that was sequenced. That could be a real absence.

So now we have a good quality annotation, as we have just seen with BUSCO. We might want to visualize it, so we can use the JBrowse tool here. It's a genome browser, so we just select the genome we have annotated. It's this one, the sorted assembly. Then we insert a few tracks that we want to display. First, we have annotation tracks. So we add this annotation group and we add a GFF3 track. And we take the GFF3 output of the Funannotate functional tool.
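Going back to the BUSCO numbers quoted above, the reasoning is simple arithmetic, and it can be written down as a small Python check. The counts are the ones read from the summaries in this tutorial; the function name `busco_summary` is hypothetical, and treating the difference between the genome run and the annotation run as "genes missed by the annotation" is an approximation, since the two runs do not necessarily find exactly nested sets of genes.

```python
def busco_summary(total, complete, single_copy, duplicated):
    """Derive percentages and the non-complete count from BUSCO counts."""
    if single_copy + duplicated != complete:
        raise ValueError("single-copy + duplicated must equal complete")
    return {
        "pct_complete": 100 * complete / total,
        "pct_single_copy": 100 * single_copy / total,
        "not_complete": total - complete,  # fragmented + missing
    }


# Counts quoted for the Funannotate annotation (protein mode)
annotation = busco_summary(total=2449, complete=2312, single_copy=2281, duplicated=31)

# Complete genes found when BUSCO was run on the genome sequence itself
genome_complete = 2327

# Rough estimates, assuming the annotation finds a subset of the genome-level hits
missed_by_annotation = genome_complete - 2312
absent_from_genome_run = 2449 - genome_complete

print(round(annotation["pct_complete"], 1), annotation["not_complete"])
print(missed_by_annotation, absent_from_genome_run)
```

The second pair of numbers shows why most of the non-complete genes are attributed to the assembly or the species rather than to the annotation step.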
That's all we need for this annotation track. Then we add another track group, named RNA-seq, where we select the output of STAR, from when we aligned all the RNA-seq data onto the genome. We want to display it with a SNP track. That's it. We just execute and wait a little.

Let's look at the output, just by clicking the eye icon here. You can reduce the panels here and here. It works like any genome browser. You have the sequence of a scaffold, scaffold number one here, and we have zoomed in on this region, which is displayed from here to there. You have a few different tracks that you can hide or show. You can show the gene models by clicking on the GFF3 track here. You can see that there are regions with more or fewer gene models. You can open it in another tab, I think it will be easier. And you can have a look at the RNA-seq data.

Here on the GFF3 track, you can see the exons, which are these rectangles, and these are the introns, in this direction or on this strand. What is interesting is that here you can see the RNA-seq data. In light gray, you have the regions where some reads were aligned. In darker gray, you have the regions that correspond to split reads, where there is a splice junction. For example, if you have a read that maps at the end of this exon, maybe half of the read will map to the following exon, because it was sequenced from the spliced transcript. So in brief, light gray means exon and dark gray means intron.

Here you have a big dark gray region, which is an artifact, because some reads were mapped at one position in there and at another one much farther away. So it's not very relevant here. These ones are the most relevant, and they match the exons that were predicted by Funannotate. So this case was ideal for Funannotate, because there was a lot of RNA-seq data all along the gene.
If you look at the other gene next to it, you can see that there is almost no RNA-seq data at this position. This illustrates the fact that Funannotate is able to use RNA-seq data to predict genes when there is some. But for other regions where there is no RNA-seq data, it can use the alignment of proteins from Swiss-Prot, or even ab initio predictors, to predict that there is a gene at this position, because it matches the statistical model used by the ab initio predictors or because there is a match with a Swiss-Prot protein. So by doing this, you can learn a bit about how Funannotate predicted genes on your genome.
Okay, so now we can go on with the last step of this tutorial, which is comparing annotations. Here we have an annotation that looks good, but we may have generated another annotation using another method, and we don't know how to select the best one. We have some tools in Galaxy that can help with this. So we have our good annotation, and if you remember, at the beginning we retrieved the GFF3 and GenBank files of an alternate annotation, which is a bit different. It is just not exactly the same. The identifiers here are different, and if you wanted, you could display both in JBrowse and see the exact differences. But here we will use two tools to compare our two annotations. The first one is AEGeAn ParsEval, this one here. We have to choose a reference annotation, which will be the one we have generated, and a prediction, which is the alternate one, and we want to see whether this prediction is better than our reference. We can leave the rest as is; these are quite advanced options. For the output type, we just want the HTML output. And we execute it. Finally, the last tool we will use is Funannotate compare. Here, we need to choose the two annotations in GenBank format, so you click on both while holding the Control key.
You select the latest Funannotate database. And that's it. When it's finished, you can have a look at the HTML output. It's not in the same order, just because of the way I recorded this video, but you should have the same result as me. Let's have a look at AEGeAn first. Here you have the list of all the sequences that were compared between the first annotation and the other one. You can have a look at the whole file. You have some general metrics. You can see, for example, that there is this number of genes in the reference annotation, this number shared between the two, this number unique to the reference, and this number unique to the prediction. You have a few other metrics: you can see how many genes match perfectly between the two annotations, and you have some numbers on the CDS and exons. You can explore this on your own. And perhaps the most straightforward way to compare them is to look at a specific scaffold. If you look at scaffold 11 here, you will see all the loci that were used to compare the two annotations. Select the first one here and click on the plus, and you should see this kind of representation. What it means is that in the reference annotation of this portion of the genome, four genes were detected, with these identifiers, on different strands: three on the reverse strand and one on the forward strand. And what you can see is that in the prediction annotation, the one named alternate that we got from another source, there is only one gene model at this position, which includes all the portions of the other models that are in the reference annotation.
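The situation on this locus, one long gene model in one annotation covering several separate models in the other, can be detected with a simple interval overlap test. Here is a minimal sketch of that idea. The gene identifiers and coordinates are invented for illustration and do not come from the tutorial data, and this is not the algorithm that ParsEval actually uses.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int   # 1-based, inclusive, as in GFF3
    end: int     # 1-based, inclusive
    strand: str


def overlaps(a, b):
    """True if the two gene models share at least one base (strand is ignored)."""
    return a.start <= b.end and b.start <= a.end


def merged_candidates(reference, prediction):
    """Map each prediction gene to the reference genes it overlaps, if there are 2 or more."""
    result = {}
    for pred in prediction:
        hits = [ref.gene_id for ref in reference if overlaps(pred, ref)]
        if len(hits) >= 2:
            result[pred.gene_id] = hits
    return result


# Invented locus: four reference genes, three on the reverse strand and one forward.
reference = [
    Gene("ref_1", 100, 900, "-"),
    Gene("ref_2", 1100, 1800, "+"),
    Gene("ref_3", 2000, 2700, "-"),
    Gene("ref_4", 2900, 3600, "-"),
]
# Invented alternate annotation: a single long model over the same region.
prediction = [Gene("alt_1", 100, 3600, "-")]

print(merged_candidates(reference, prediction))
```

Strand is ignored on purpose here, so that a long model swallowing a gene on the opposite strand is also reported, which is exactly the suspicious case in this locus.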
So in this example, it is quite striking that the reference annotation is probably better than the alternate one, because one of the genes is on the other strand, and it's probably better to have these three genes separated. You can also compare with the RNA-seq data: just look back at it in the JBrowse output if you want to check whether it's better to have separate genes or one long one. In this way you can get a good idea of the key differences between the two annotations. Here I know that the reference annotation is the best one, because I made the alternate annotation myself by running Funannotate with very badly chosen parameters: I chose the wrong species in the BUSCO select list, and I didn't give enough RNA-seq data or protein sequences to help find good gene structures. So I know this alternate one is probably not a good annotation.
Now let's have a look at the other tool, Funannotate compare. That's another way to compare the data from the two annotations. You have some general statistics at the genome level, such as the number of genes here. If you look at the orthologs tab, you should see all the orthology relationships that were found between genes from the reference annotation and the alternate one. So you can see which genes look like other ones in the other annotation, or even in the same annotation if there are duplications, for example. You have some links to the eggNOG database if you want more information on a group. You can also have a look at the InterPro or Pfam tabs here, where you will find some numbers. On this one you have the identifiers of the different InterPro motifs that were found in either of the two annotations, and in these columns, you have the number of times they were found in each annotation.
The first annotation is the one you have generated in the tutorial, the reference one, and as you can see, for many or maybe even all the InterPro entries, many more proteins match in the reference annotation than in the alternate one. You can of course get more information by clicking on each term here. You end up on the InterPro page describing exactly what this motif is, what it looks like, and in which proteins it is found. It's much the same for Pfam, or MEROPS like this, or CAZyme like this too. And finally, we can have a look at the GO tab. On this tab you can see whether some terms were found to be enriched, under-represented or over-represented, in a specific annotation. For example, this term was found to be under-represented in this annotation while being over-represented in the other, which means that more genes carry this term in one annotation than in the other. Each piece of information on its own can help you decide which annotation is the best for the rest of the analysis, and in the end you will select one and stick to it for the other analyses.
Okay, so congratulations, you have arrived at the end of this tutorial. It was a pretty long one if you have followed all the steps, with a few long-running steps, but I tried to make this tutorial as close as possible to what you would do in real life to get a proper annotation for your genome. We used Funannotate, but of course you could use other tools. So that's it. Often, when you end up with an annotation of a genome, you will realize that there is no perfect annotation, and sometimes, on specific gene families, you might get some strange results while the rest of the annotation looks quite good overall. In this case, you might be interested in following the Apollo tutorial. It's in the genome annotation section here.
So you have a complete tutorial on how you can use the Apollo server at usegalaxy.eu to modify some gene structures based on all the available evidence, and how to modify the functional annotation of specific genes manually, in a sort of Google Docs for annotation. That's it. Just to finish, don't forget to have a look at the feedback form at the end of the tutorial and to fill it in as soon as possible, because it is very useful for us to improve the training material on the GTN. Thanks for listening, thanks for watching this, and thanks for doing the tutorial.