Hello, my name is Anthony Bretaudeau, from Rennes in France. Today, we are going to learn how to annotate a new genome using Funannotate on a Galaxy instance. We're going to use usegalaxy.eu, but it should work on other usegalaxy servers too. We are going to do this by following a tutorial from the genome annotation section, this one: genome annotation with Funannotate.
OK, so why do we need this tool? Funannotate is specifically designed to annotate eukaryotic genomes. At the beginning, it was written to annotate fungal genomes, but now it works on any eukaryotic species. We need this tool because eukaryotic genes are often quite complicated compared to prokaryotic genes: you have introns, and some signals at the sequence level are not very specific. Intron donor and acceptor sites, for example, are very short sequences of two letters, which are found in many places that are not always an intron donor or acceptor site. So Funannotate is a big pipeline that will align some evidence against the genome sequence and run some ab initio gene predictors, to predict gene structures taking all this information into account.
In our case, we're going to annotate the Mucor mucedo genome. It's a fungal species that was assembled in the Flye assembly tutorial, which is on the GTN too. We are going to use some RNA-seq data from an experiment that was made available in the public data banks, and we are also going to align some protein sequences from public data banks. We'll see all this a bit later. After we have run Funannotate to predict the gene structures, we will add some functional annotation, to know the function of each gene that has been predicted by Funannotate. For that, we'll run EggNOG Mapper and InterProScan.
And after that, we will integrate all this information, the structural annotation and the functional annotation, into a single annotation file that can be submitted as is to public data banks like NCBI. At the end, we'll also visualize our annotation using a genome browser.
So it's time to begin the tutorial. The first step is to get all the data that you will need during this tutorial. They are all available here, and we're going to upload them to your history. So I'm on usegalaxy.eu. I've created a new history that I will name Funannotate, which is not very original. Then I click on Upload Data here, then Paste/Fetch data, and I paste all the URLs here. And start.
OK, so the data sets are now uploaded. Let's have a look at them. The first one here is the genome sequence, as it was generated by the Flye tool in the Flye assembly tutorial. So it's the Mucor mucedo genome, where you only have the sequence. It has been masked using RepeatMasker with the human data set. You can see how it was masked by following the repeat masking tutorial, which is also available on the GTN.
The other data sets include RNA-seq data. It's paired-end sequencing data from an RNA-seq experiment, and it looks like this. It comes from a public data set from SRA. You have the R1 reads, the forward reads, and the reverse reads here. This will be used by Funannotate to see where there is expression along the genome sequence.
Funannotate will also use this protein sequence file, the Swiss-Prot subset. These are sequences from the Swiss-Prot data bank, which is a bank of very good quality protein sequences. It's a subset, which means we have only kept nearly 4,000 sequences from Swiss-Prot. We wanted a smaller data set to make the tutorial a bit shorter, with quicker processing by Funannotate. But for a real-life annotation of a genome, you will use the whole Swiss-Prot data bank.
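To check how many sequences a FASTA file such as the Swiss-Prot subset contains, counting header lines is enough. This is a minimal sketch, not part of the tutorial itself: the function name and the tiny example records are made up for illustration.

```python
import io

def count_fasta_records(handle):
    """Count sequences in a FASTA stream: each record starts with a '>' header line."""
    return sum(1 for line in handle if line.startswith(">"))

# Made-up two-record example standing in for the real protein file.
example = io.StringIO(
    ">sp|P00001|EXAMPLE_1 hypothetical record\n"
    "MKTLLV\n"
    ">sp|P00002|EXAMPLE_2 hypothetical record\n"
    "MAAGHK\n"
)
print(count_fasta_records(example))
```

On a real file, you would pass an open file handle instead of the in-memory example.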
And the last two data sets are alternate annotations. In fact, they are the result of running Funannotate on the same genome, but with different settings. We will see at the end of this tutorial how to compare the annotation that you will generate with this one, to see which is the best of the two.
OK, so now, to run Funannotate, the first step is to prepare the RNA-seq data: we align it along the genome to produce a BAM file that will then be used by Funannotate. To align RNA-seq data, we will use STAR, so RNA STAR here, which is a tool designed specifically to align RNA-seq data along a genome, taking introns and exons into account, as you would expect for this kind of data. This is paired-end data, and we only have one experiment, so we will use individual data sets. The forward reads file is R1, and the reverse one is R2. We will use a reference genome that is in the history: it's genome masked, a file you have just uploaded.
And we will set this value, the length of the SA pre-indexing string, to 11. Why do we do this? Because 11 is the recommended value for this genome size. You can find this value if you run STAR a first time on your genome and look at the output. That's what I did, and I saw that STAR was saying, hey, you should use 11 for this option. So that's what I did. That's it. We're going to run it this way.
And well, if you have multiple RNA-seq data files, you can align them all with RNA STAR. Each one will produce a specific BAM file as output. At the end, you can merge them all with the MergeSamFiles tool here, where you can select all the BAM files you have generated.
When it's finished, you can see that you have the BAM file that you will use for Funannotate. It's half a gigabyte of data in binary format that you can't really read like this. This other file contains the splice junction positions on the genome. We will not use it.
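The value 11 for the SA pre-indexing string can also be estimated without a first STAR run. The STAR manual gives a rule of thumb for small genomes, min(14, log2(GenomeLength)/2 - 1), which the sketch below applies and rounds down. The function name is made up, and the 40 Mb genome length is a hypothetical figure used only for illustration, not a value taken from the tutorial.

```python
import math

def sa_index_nbases(genome_length: int) -> int:
    """Rule of thumb from the STAR manual: min(14, log2(GenomeLength)/2 - 1), rounded down."""
    return min(14, int(math.log2(genome_length) / 2 - 1))

# Hypothetical genome length of 40 Mb, for illustration only.
print(sa_index_nbases(40_000_000))
```

For large genomes, the formula is capped at 14, which is STAR's default.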
And you have this file, which is a log file of what was aligned by STAR. If you look at the statistics, there is one that is very interesting. It's this percentage, which says that 96% of the reads were uniquely mapped onto the genome in the expected orientation. So that's a very, very good score. And it's normal because, in fact, the RNA-seq data here is not the raw data that was downloaded from SRA. We've reduced it to a subset of reads that map onto the genome, once again to make Funannotate and RNA STAR run faster during the tutorial.
Anyway, now we have this BAM file, and we will use it to run Funannotate. So now we will run the structural annotation prediction using Funannotate predict annotation. It is this tool here, and we'll see which settings to adjust. The first one is which genome we want to annotate. This one is easy: it's the genome masked sequence, still the same. Then you have this Funannotate database. In this box, you should select the latest version. For usegalaxy.eu, it's this one. You can have other versions on other usegalaxy servers. And if there is no version at all, you should contact the usegalaxy administrator to install it from the admin panel.
OK, then, perfect. After that, there is some information that Funannotate requires about the genome we are trying to annotate. The species is Mucor mucedo. We don't have an isolate name, but the strain name is muc1. And there is this box where you can select whether it's a fungal species or not. Here, we will keep no. You could choose yes, because Mucor mucedo is a fungal species. But the thing is, after testing it, you don't really get better results by selecting this option, and you might even get some errors during the process. So in this case we'll use no, and in most cases it's fine to use no. We'll consider that the ploidy of the assembly is 1.
So there are not multiple copies of the same genes because of multiple chromosome copies. The rest will be kept as is.
And now, the important part is the evidence, here. This is where you need to select the RNA-seq data you have just mapped in the previous step. Funannotate will use this data to train ab initio predictors, like Augustus and GeneMark, to recognize genes on this specific genome. Then, if you have full-length mRNA or EST sequences in FASTA format that you would like to align against the genome, you can provide them here. But in this tutorial, we don't have these sequences, so we just skip it.
Here, you can select protein sequences to align against the genome. In real life, you would use this default option, use UniProtKB/Swiss-Prot, which is already provided by Funannotate. It's the full set of Swiss-Prot proteins to align against the genome. Even if they are proteins from very distant organisms, Funannotate will try to align them. And if a sequence is too distant, in the worst case it will just not be aligned, and it won't pollute the results you get from Funannotate. In our case, I said at the beginning that we are going to use a subset of Swiss-Prot to make it faster. So it's this FASTA file. We'll leave the defaults here.
And then there are these options to tell Funannotate how it should use BUSCO to train the models, the ab initio predictors, to recognize genes on this genome. Here, we will select the taxon which is closest to our species. We have a Mucorales database, so that's cool, we select it. And we will also select an initial Augustus species that is quite close to our organism, which is another fungal species.
OK, so then we will not filter the output. We will keep everything, but you can choose to keep only the gene models that have a specific minimum intron length, maximum intron length, or minimum protein length. That's what you can select here.
Finally, you can tune each sub-command that is run by Funannotate, so Augustus, GeneMark, and EVM, for example. We will not do it here. Defaults are fine. And here, we want to keep all the outputs of Funannotate, so we select all, like this. OK, so now we are ready to run the tool, and we do it like this.
So you get all these data sets produced by Funannotate. Let's have a look at each one. The first one is the annotation in GenBank format. It's the list of all the genes that were detected by Funannotate on the genome, with their position, an automatic name, which is just FUN followed by 1, 2, 3, et cetera, and the protein sequence that was predicted for this gene. They all have a hypothetical protein name because it's just a structural annotation. There is absolutely no functional annotation at this stage. And that's it.
Then you have the same information in GFF format here, which is another very standard format. You have the exact same information: at this position on scaffold 1, from start to end, there is a gene with this identifier. In this gene, there is one mRNA with two exons, and two CDS segments that are part of these exons. And you have this for the whole genome. OK, that's cool. You have the NCBI tbl annotation file, which is quite similar. It's another format describing all the genes that were predicted.
And then you have three important files. First the protein sequences: it's a FASTA file with the protein sequence of each gene that was predicted. The same thing for the mRNA sequences, so the full mRNA, including any UTR, 3 prime or 5 prime. And the CDS sequences here, which are only the translated part of each gene that was predicted. So if there are no UTRs, the sequence in CDS is the same as in mRNA.
And then you have a few other output files. The tbl2asn error summary here will give you some statistics on potential problems that were found by Funannotate in the genes that were predicted.
So it tells you that there is one gene that is probably a partial gene, where a part of it is missing. There are 600 genes that have a very short exon, which is maybe problematic or maybe not. Funannotate is not sure of that. And there are more than 400 genes that have a rare splice consensus donor site, which means an atypical sequence for the donor site of an intron. OK, and here is the same thing, detailed for each of these hundreds of genes. Every time, it tells you that at this position on the genome, the gene that was predicted with this name has a short exon. So you can review this list if you're interested and check what to do with it.
And finally, there is a stats file, which is maybe the most important one, and which will give you a lot of information on the result that was generated. First, it gives you information about the Funannotate version, and exactly which tools were run by Funannotate, with which data and which exact versions. As you may notice, in the video I selected the 2023 version of the Funannotate database, but here it says 2022. It's just because I made a trick on the data sets here. In your history, it should be 2023, or whatever version you selected in the form.
Then you have a few statistics on the assembly that you analyzed: the number of contigs, the lengths, the N50, and the GC content. These should be the same as what is produced in the Flye assembly tutorial. And finally, what we were looking for at the beginning: the number of genes that were predicted by Funannotate. Here, it produced 14,000 genes, a bit fewer mRNAs, and a few tRNAs. So Funannotate is also able to predict tRNAs. It means that if you add these two numbers, mRNA and tRNA, you get the total number of genes.
Other interesting statistics are the average gene length. It tells you that on average, the 14,000 genes have a length of 1.5 kilobases. And then you have information on the structure of each gene.
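The gene, mRNA, and tRNA counts in the stats file can be cross-checked against the GFF output by counting the feature type in the third column. This is a rough sketch under assumptions: the function name is made up, and the example lines (identifiers, coordinates, source column) are invented to show the layout, not copied from the tutorial output.

```python
import io
from collections import Counter

def count_feature_types(handle):
    """Count GFF3 features by type (third tab-separated column), skipping comments."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

# Invented example: one protein-coding gene with two exons, and one tRNA gene.
example = io.StringIO(
    "##gff-version 3\n"
    "scaffold_1\texample\tgene\t100\t900\t.\t+\t.\tID=GENE_1\n"
    "scaffold_1\texample\tmRNA\t100\t900\t.\t+\t.\tID=GENE_1-T1;Parent=GENE_1\n"
    "scaffold_1\texample\texon\t100\t400\t.\t+\t.\tParent=GENE_1-T1\n"
    "scaffold_1\texample\texon\t500\t900\t.\t+\t.\tParent=GENE_1-T1\n"
    "scaffold_1\texample\tgene\t2000\t2072\t.\t-\t.\tID=GENE_2\n"
    "scaffold_1\texample\ttRNA\t2000\t2072\t.\t-\t.\tID=GENE_2-T1;Parent=GENE_2\n"
)
counts = count_feature_types(example)
print(counts["gene"], counts["mRNA"], counts["tRNA"])
```

In this toy example, the number of genes equals the number of mRNAs plus the number of tRNAs, which is the same relationship the stats file reports.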
So the number of genes that have UTRs or not: in this case, there is no gene with a UTR, which is often the case in this configuration of Funannotate. The number of CDS that have no start codon or no stop codon, which can happen sometimes if the scaffold is not complete, for example. The total number of exons; how many transcripts have multiple exons; how many have only one exon; what the exon length is on average, or the protein length. That's it.
So if you look at all these numbers, they give you a rough idea of what your annotation looks like. If you had 10 times more or fewer genes, you would be quite surprised. But you know that 14,000 genes for this genome is quite reasonable. It's a clue that gives you a rough idea of the quality of the annotation. But it's not enough. We'll see later how to get better information about that.
OK. So now we have an annotation, a structural annotation. And we want to add some functional annotation to it, to know which genes were predicted and what their function is. So let's have a look at the functional annotation tools. We will use two main tools to perform functional annotation. The first one is EggNOG Mapper. EggNOG is a big database of orthology data. The creators of this database ran a lot of orthology analyses to cluster similar genes, which are thought to have the same function, across many, many species. These groups are available in the EggNOG database, on this website, like this. You can see that more than 5,000 organisms were used, and they produced more than 4 million orthology groups. With this tool in Galaxy, all the protein sequences that were predicted by Funannotate will be compared to all the ortholog sequences from the EggNOG database.
And every time there is a match, the EggNOG Mapper tool will take all the functional annotation that is available in the EggNOG database for the corresponding orthology group, and assign it to the gene that was predicted by Funannotate. This is quite useful.
It's very simple to use. You just need to select the correct version of the EggNOG database. If it's not available, you should contact the usegalaxy administrator to install it. Then you have to select which sequences you want to analyze. It's just the result of Funannotate, the protein sequences, here. Most of the options can stay at their default values. But in the output options here, you can unselect this one, because we want to have some documentation, some comments in the output files, to understand what we are generating. So you run this tool, and it will generate two data sets.
While it runs, you can run the other one, which is InterProScan, here. So this one is a huge script. In fact, InterPro is a big database of protein motifs and patterns that were defined by a lot of researchers around the world, specialists of each gene family. When you use this InterProScan tool, it will take each protein sequence predicted by Funannotate and look into it to find any motif that is known in the InterPro database. When there is a match, it will take the functional annotation that is associated with this motif and assign it to the matching protein. That's it.
So here, once again, we select the protein sequences that were predicted. We are analyzing protein sequences here. Here, we need to select the latest InterProScan database version. Once again, if it's not available, contact the administrator of the usegalaxy server. And then we have to select which applications we want to run. This one is a bit tricky. I mean, you need to understand what you are doing.
Here, you have a whole list of different sub-databases of InterPro, with their corresponding tools. These ones are free to use, so most of the time, you will select them all. If you are in a hurry, you can unselect PANTHER and Pfam, somewhere here, because these two data banks take some time to analyze. But it's not recommended, because they also produce a lot of meaningful and useful results. So most of the time, we keep them and wait patiently to have good results.
The other option is this one, because there are a few sub-applications of InterProScan that are not free. They require accepting a specific restricted license, which means you can use them, but you acknowledge that you will not use them for commercial purposes, which is hard to define. You have to think about it for yourself. There, you can select them all or only a few. It's up to you. You just need to be aware that these applications are not installed on all usegalaxy servers. If you are on EU, they should work, and on .fr too. But on other ones, it's probable that they will not work. Just keep it in mind. And even if you don't run them, all the free ones already give quite useful results. So don't be afraid to unselect them.
OK, the only other option we might want to change is this one. By default, you only get a tabular file as output. Here, we also want the XML file. And let's run it and wait for the results.
It's finished. Now, let's have a look at the results. Let's begin with EggNOG Mapper. You have two results. The first one is the seed orthologs file. So yeah, it's displayed here. What it gives you is the similarity between each protein predicted by Funannotate and a corresponding protein in the EggNOG database that was used by EggNOG Mapper. You can see each identifier, and if you want, you can search for it in the EggNOG database to get more information.
And you can see how well it matched, with the E-value, which part of the protein matched, and so on. So this file might be interesting, but the other one is usually much more interesting: it's the annotations file. Here, you see the match between the query, so the protein that was predicted by Funannotate, and the matching protein in the EggNOG database, with the E-value. Matches with too high E-values were filtered out, and these ones were kept.
The most important part is the columns after that. EggNOG OGs is a list of the orthology groups in the EggNOG database that were identified for this specific protein. Every time, you get an identifier, in various formats. If you look for it on the EggNOG website here, you can get more information. We won't go into details, but it's just this identifier. If you go a bit further to the right, you get a lot of different fields. You get the COG category, the description, and a preferred name, which is a name that was assigned to this protein based on its similarity to an orthology group that is known to have this function. So for example, this name. It's a symbol, a gene symbol, which is linked to the description here.
And here, you have a list of Gene Ontology terms. So if you look on Google for this term... oh, wait. OK, no. This is funny. So a Gene Ontology term is just a number corresponding to a very standard vocabulary, describing a function, for example, or here, a cellular component. Saying that this predicted gene has the Gene Ontology term GO:0000123 says that it's part of the histone acetyltransferase complex. Each term like this, with a specific number, has a specific name and description that has been standardized. And you have a full list like this for each gene, based on the EggNOG database. So it's very long, like this. After that, you have information that you can get from the KEGG database.
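To work with the annotations file outside Galaxy, the table can be read by column name. This is a sketch under assumptions: it expects comment lines starting with `##`, a header line starting with `#query`, and `-` for empty cells, as in recent EggNOG Mapper releases, and the column names (`Preferred_name`, `GOs`) may differ in other versions. The example row, including the protein identifier and gene symbol, is invented.

```python
import io

def read_eggnog_annotations(handle):
    """Read an EggNOG Mapper annotations table into a list of dicts keyed by column name."""
    header, rows = None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # comment or stats line
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")  # e.g. "#query<TAB>seed_ortholog..."
            continue
        if header is not None:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows

def go_terms(row):
    """Return the GO terms of one row as a list; '-' means no annotation."""
    value = row.get("GOs", "-")
    return [] if value == "-" else value.split(",")

# Invented example with only a few of the real columns.
example = io.StringIO(
    "## made-up header comment\n"
    "#query\tevalue\tPreferred_name\tGOs\n"
    "PROT_1\t1e-50\tEXAMPLE1\tGO:0000123,GO:0005634\n"
    "PROT_2\t3e-12\t-\t-\n"
)
rows = read_eggnog_annotations(example)
for row in rows:
    print(row["query"], row["Preferred_name"], go_terms(row))
```

Reading by column name rather than by position keeps the sketch usable if extra columns are present.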
It's a bit the same principle as Gene Ontology, but this time it's based on pathways. If you look for this term in the KEGG database, you can find more information. And you have the same thing for other databases: BRITE, CAZy, BiGG reaction, Pfam, and so on. All those columns correspond to specific databases that you might be interested in or not, depending on what you're working on. Not all genes have information in all columns, which is absolutely normal. So it's nice: now our genes have names and functions, based on the EggNOG database, and that's pretty cool.
Now, the same kind of result has been generated by InterProScan using another method, which is very important, because at the end you will want to aggregate all the information that you get from EggNOG and InterProScan. We'll see that a bit later. So the output of InterProScan says that... yeah, let's look at this one. No, this one. For a specific predicted protein, this column is just a unique identifier based on the protein sequence. You're not very much interested in it. What you're interested in is this one, ProSiteProfiles. For each protein, you can have multiple lines, and each line corresponds to a match to a specific motif, with an identifier here, from a specific sub-database of InterPro. So it means this protein has a match with the motif PS50110 from the ProSiteProfiles database.
If you look for it in InterProScan... in InterPro, sorry, I always confuse the two. If you look for it, you should have a match here, and you can click on it and get a lot of information. It's a ProSite profile, with a full description of what this domain, found in the protein, corresponds to. This one means that the protein containing this domain has what is probably a response regulatory domain.
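The InterProScan tabular output can be grouped by protein with a few lines of code. This is a sketch, assuming the standard tab-separated layout without a header line (protein, MD5, length, analysis, signature accession and description, start, stop, score, status, date, then optional InterPro, GO, and pathway columns). The column labels are names chosen here for readability, and the example row (protein identifier, checksum, coordinates, score) is invented apart from the PS50110 signature mentioned above.

```python
import io
from collections import defaultdict

# Labels chosen for this sketch; the file itself has no header line.
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "stop", "score", "status", "date",
    "interpro_acc", "interpro_desc", "go_terms", "pathways",
]

def read_interproscan_tsv(handle):
    """Yield one dict per match line, padding optional trailing columns with '-'."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete match line
        fields += ["-"] * (len(COLUMNS) - len(fields))
        yield dict(zip(COLUMNS, fields))

def signatures_by_protein(handle):
    """Map each protein to the list of (analysis, signature accession) it matched."""
    result = defaultdict(list)
    for match in read_interproscan_tsv(handle):
        result[match["protein"]].append((match["analysis"], match["signature_acc"]))
    return dict(result)

# Invented example line; only the first 11 columns are filled in.
example = io.StringIO(
    "PROT_1\tmd5placeholder\t250\tProSiteProfiles\tPS50110\t"
    "Response regulatory domain profile.\t10\t120\t30.5\tT\t01-01-2023\n"
)
print(signatures_by_protein(example))
```

A protein with several matches simply gets several entries in its list.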
OK, so then you have some score information and some full-text description here, and here too. The IPR number is another identifier, from InterPro. In fact, each individual motif like this, from a specific database, is integrated into InterPro under a general IPR ID, here. This one is a more general description of the motif, and it can contain different motifs that correspond to the same function. This one. OK, and then you have all the GO terms that are assigned by InterProScan based on the profile matches, and a lot of external identifiers, like MetaCyc database identifiers, or Reactome, and so on.
OK, that's great. So we have a structural annotation and a functional one. And as I said, we'd like to integrate all this data into a single final annotation that could be ready for submission to NCBI. So let's try to do it. If you want to submit your annotation to NCBI, there are a number of things that you need to consider. Let's have a look at the submission to NCBI section of the tutorial.
First, you don't want to perform a real submission with the data from this tutorial, because you don't want to pollute the NCBI database. So please just generate the files, but don't really submit them to NCBI. Then, when you need to submit an annotation, you have to prepare all your raw data and a few other things beforehand. First, you need to create a BioProject and a BioSample on the NCBI portal. The BioProject describes which genome you're trying to sequence, assemble, annotate, and then submit to NCBI: which species, which environment it was taken from before sequencing, et cetera. The BioSample describes very precisely which biological sample was used for sequencing, and so for the assembly and annotation of the sequencing data. You should also already have submitted your raw reads, well, DNA-seq and RNA-seq data. This data should be submitted to SRA first and have identifiers associated with it.
And they should be linked to the BioProject and BioSample. Also, you should consider submitting the assembly to NCBI before submitting the annotation, because there is a good chance that NCBI will look into your assembly data, your genome sequence, and ask you to modify it if they find some problems in the sequence. So you should do it first, before really trying to annotate the genome.
The other thing that you need to have before submitting an annotation is a locus tag. This is the prefix that is used in the identifiers of the predicted genes, at least. For now, we are using FUN0001, which is the default prefix. But you should have a specific one, which is assigned by NCBI when you create your annotation submission on the portal.
When you have done all this, you have to prepare a specific template file, which is just a text file. We'll see its structure a bit later. You can generate it by using this form. So let's do it. Let's say we are John Doe. We have a wonderful email address, which is johndoe@example.org. We work for Foo, in the Bar department, which is located on Foo Bar Street, in Paris, in France. The first author is John Doe. So this field has changed since last time. Here you can write the name of a potential paper you would write on this, which is unpublished. And here you would write, if you have them, the BioProject and BioSample identifiers that you can get from the NCBI portal. Once you have filled in all these fields, you can create the template file and download it.
We will upload it to our Galaxy history now. Drag it from here. Oh, yeah, you haven't seen the structure. I will show it to you. So the template SBT file looks like this. It's not very pretty for the human eye. So now we have this template file. We have the results of InterProScan and EggNOG Mapper, and the first annotation predicted by Funannotate.
So we want to integrate all this into the final annotation, ready for submission. To do this, we have a specific Funannotate tool, which is Funannotate functional annotation, here. First, we have to select the Funannotate predict output. It's the GenBank format, the one that was predicted a bit earlier, here. Here, again, we have to select a specific Funannotate database. So choose the latest one, the same one that you used earlier in the predict step. Then you have to select the template from the NCBI submission portal, here. You have to select the EggNOG output, so the annotations output, here. We don't have antiSMASH output, which is more of a prokaryotic thing, so we don't have one for this genome. We have the InterProScan output in XML format, here. And once again, we have to select which BUSCO model we want to use. The one closest to our genome is Mucorales. That's it.
We have to use the same strain name as earlier, muc1. And here, you have to specify the locus tag, which most of the time will be provided by NCBI. In our case, it's MMUCEDO, underscore. And we select which outputs we want, so we select all. And that's it. You run the tool, and we wait a little.
So let's have a look at the different files. There are many. The most important is the GenBank output, here. It's, once again, all the genes that were predicted by Funannotate. But this time, you see that the prefix is different. Here, it's MMUCEDO, as we asked. You have the protein sequence. And for specific genes, you have all the functional annotation from EggNOG and InterProScan that is associated with the gene, in the GenBank format. So you have EggNOG data here, with different identifiers, and InterProScan data here. And yeah, you might even have some gene names that were changed for specific genes. This one is named Taf5.1, because EggNOG says that this is the name for this gene. That's great.
You have this file, which is the same kind of information, but in another format, which is just a tabular format. These ones, the AGP, TBL, Sequin genome, and scaffold sequence files, are all files that will be asked for when you submit your genome to NCBI. So you have them. They contain the same kind of information in different formats, so it's not very interesting to look at them one after the other. You also get the protein sequences here, and the mRNA and CDS sequences, just as at the beginning. But notice that the identifiers have been changed to MMUCEDO. And in the GFF format, here, once again the gene names have changed, and you have some functional annotation integrated.
You have a lot of reports here. A few more checks are done on the annotation, with specific warnings telling you to have a look at specific genes that might have a problem, even if it's not sure. So this is the general report, but you have other things, like the products must-fix list. Here you don't have any. Need curating: none. New names passed: here you have a few new names that are not known by Funannotate yet. It's not very important at this point.
And here you have statistics. It's just like in the Funannotate predict step, but here you have these lines that are different. You see how many GO terms were assigned to genes, how many InterProScan matches were integrated, how many EggNOG matches too, how many Pfam links too, and the same for the different databases. It's quite interesting to look at these numbers, to guess whether your annotation looks good and whether it was properly functionally annotated. That's it.
So we now have a good annotation ready for submission, but we want to evaluate it a bit better and to visualize it. There is one cool tool to evaluate an annotation, which is BUSCO. It is also used for assembly evaluation, so you will find more information about it in the assembly tutorials.
But the main principle of BUSCO is to look at your annotation and see if it can find a set of genes that are expected to be present in any species of a specific taxon. For example, if you take Mucorales, you know that all the species of this group are expected to have a set of genes in one single copy in their genomes. And it's probable that these genes are essential for the life of these organisms. So the BUSCO tool uses this set to guess whether your annotation contains all the genes that are supposed to be present or not. It's a good way to get a feeling about your annotation, to check if it looks good or not.
So how does it work? You select the protein sequences, the last ones that were predicted, and then you say that you want to run in protein mode here. And you select a specific lineage, which is Mucorales once again. Here, by default, when you run BUSCO, it will download everything it needs to run, and you don't have to worry. But if you use it a lot, you might get errors because it fails to download some data. In this case, you can use cached lineage data and select the latest one here, if the usegalaxy administrator has installed it. BUSCO will then not try to download anything, and it will run just a little bit faster and avoid potential download problems. Anyway, whether you select this option or this one, you will get the same results. I know that on EU it works with this one, so I select it. And finally, here I want the short summary text and the summary image. And I can run it.
When it's finished, you can look at a beautiful image here, which is a bit too big, like this. This is a summary of the results. So out of a total number of 2,281 genes that are specific to Mucorales species, BUSCO tried to find them all in the annotation that we have generated. And it found 2,250 genes in single copy, and... oh, wait. No, the total number that was searched is this one: 2,449.
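The counts read from the BUSCO image can be turned into percentages, which are easier to compare between annotations. This is a small sketch using only the three counts quoted so far (2,449 searched, 2,281 complete, 2,250 complete in single copy). The function name and dictionary keys are made up, and the last category lumps together everything that is not complete.

```python
def busco_percentages(total: int, complete: int, single_copy: int) -> dict:
    """Convert BUSCO counts into percentages of the total number of genes searched."""
    duplicated = complete - single_copy  # complete genes found in more than one copy
    return {
        "complete": 100 * complete / total,
        "single_copy": 100 * single_copy / total,
        "duplicated": 100 * duplicated / total,
        "not_complete": 100 * (total - complete) / total,  # fragmented plus missing
    }

scores = busco_percentages(total=2449, complete=2281, single_copy=2250)
for name, value in scores.items():
    print(f"{name}: {value:.1f}%")
```

Running the same function on the counts from another annotation, or from the assembly alone, gives numbers that can be compared side by side.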
And yes, 2,281 were found to be complete in the annotation we provided. And this number, 2,250, were in single copy, as expected, and 31 were duplicated, which is not supposed to happen, except if there's really a duplication in the genome of the species, or if maybe there is an assembly problem, with a part of the genome that is present in two copies in the assembly instead of one. It can happen, too. Then you have the fragmented ones. So there are a few genes that were found but incomplete, or in multiple chunks on multiple scaffolds. So this, again, can come from an assembly problem. And a few more, 140, were not found in the annotation, which means either they are not present in the assembly, or they were not detected by Funannotate. And probably it's a mix of the two. So you have the same numbers in text format, which is maybe more readable if you want to have a look at it. And that's it.
So is it good or not here? You can see that the vast majority were found in the annotation, complete and in single copy, which is good evidence that your annotation is in good shape. So on its own, it's interesting. But if you have multiple annotations, you can always run BUSCO on each one and see which one is the best. And what is important, too, is to compare these numbers to the result of running BUSCO on the raw assembly output. So just giving it the sequence of the genome and letting it find the genes in the sequence, prior to annotating the genome. So it's good to have the numbers before and after annotation and to compare them. So I know it's quite similar, which means the annotation is in good shape in this example.
For now, we have only seen a text output of the annotation. We might want to visualize it. So we have a cool genome browser available in Galaxy, which is JBrowse. So it's just a way to visualize the annotation and the assembly. So we will give it the genome. So the first dataset we uploaded, genome masked.
And then we have to select a few datasets that we want to display on this genome. So we will display them in two groups. We add two groups. The first one will be annotation. And the other one will be RNA-seq. In annotation, we add one specific track in GFF format, which is the GFF output of the last Funannotate functional run. And that's it. We leave all the other options as is. And in RNA-seq, we select the output of RNA STAR, the first tool we ran in this tutorial. And no, it's not this dataset. The option here is BAM pileups. And we select this to display a specific visualization. OK, let's run it.
So how does it look? So you can open it here. What I prefer to do with JBrowse is open it in another tab, like this, to get the full JBrowse. Wait, it's not doing what I expected. OK, never mind. It's OK like this. So here, you have a genome browser with a list of scaffolds you can see. So I just realized that it's not displaying the correct one. OK, never mind. I tried to be too smart, so I chose the wrong JBrowse history. I have uploaded a new one here. And that should display properly. In your history, it should display just fine the first time. It's my fault.
OK, so this is a standard JBrowse application where you can navigate on the genome. Here, you have a specific scaffold, which is displayed. And here, you can see the whole scaffold coordinates, from 0 to 1 million bases. And you have this region in red that is displayed, that is zoomed in in this section. And here, you have a list of tracks that you can display. So we can first have a look at the genes that were predicted by Funannotate. So you can see them like this. They are named hypothetical protein when they don't have a functional annotation, or they can have good names like this. So the big blocks are exons. And the thin ones are introns.
And if you look at them, here you can get all the functional annotation associated with each one, with the identifiers that you can use to search the InterPro, GO or EggNOG databases. Fine. And you can also display RNA-seq data like this. It can take a little while to load. But when it displays, you can see a coverage plot showing you how many reads were aligned to the genome at a specific position. The light gray means reads that were aligned at this position. And the dark gray means you have reads that were split between two exons. And these regions are the gaps between these exons. So you can see the specific reads one by one using this option. So you can see that Funannotate was able to use this evidence, this RNA-seq data aligned to the genome, to predict exons of specific genes. And you can go and look at other regions. There is a specific JBrowse tutorial if you want to look in more detail at how it works. Fine.
OK, so now the last thing you might want to do with an annotation is compare it to another one. And remember, you have in your history here an alternate annotation that was performed by myself using Funannotate, but with different settings, and bad settings, in fact. We'll see how it looks. So the first tool you might want to run is AEGeAn ParsEval. So here you have to select a reference GFF3 file. It will be the one you have generated. And the prediction GFF3 file, which is the alternate one that you uploaded at the beginning. And the output type you want is HTML. So you just run this. And the other tool that you want to use is Funannotate compare, where you will select genome annotations in GenBank format. So the one you have generated and the alternate one. And here you select once again the latest Funannotate database, like this. And that's it. You run it too. And you wait a little. Funannotate compare can take quite some time, and ParsEval too, I guess. So be patient.
Once it's finished, you can have a look at it. So AEGeAn will give you an HTML file showing you various details on the two annotations. Here you have a list of scaffolds where you can have more details on each gene. And you have some specific numbers after that. So if you go to scaffold 11, this one, as is explained in the tutorial, you click on the first locus here to get more information. And here you will see a specific region of scaffold 11 and a comparison of the genes that were predicted by Funannotate in the two annotations. So this one is the one that was predicted by yourself in this tutorial. And this one is from the alternate annotation. So you can see that, based on this, Funannotate with all the correct settings we used was able to predict four different genes in different orientations. And if you look at the alternate annotation, Funannotate only predicted one big gene with exons in the same direction and with huge introns. That's it. What it reflects is that this annotation is probably better. And in fact, it's exactly the real situation, because the alternate annotation was done using Funannotate without giving any RNA-seq data and choosing a wrong lineage for BUSCO, which was Insecta. So it had a lot of trouble predicting genes and predicted aberrant stuff like this. So yeah, this AEGeAn output can help you decide which annotation looks better.
And the other tool is Funannotate compare. Let's have a look at the report. So here you have a lot of information. First, the stats. You can see how many genes were predicted in each annotation. The one you have generated has, yeah, this number is wrong here, but it should be 14,000. And the other one has 9,000. It's wrong because, once again, I tried to be too smart. But in your case, it should be the correct number. If you look at the orthologs, you can see a list of orthologs between genes in the annotation that you predicted and other genes that look similar in the alternate annotation.
So you have a full list of orthology groups like this. And you can get more details like this. If you look at InterPro, for example, you can see how many genes from the first or the second annotation have specific InterPro numbers associated with them. So what you can mainly see is that the annotation that you generated has more genes with the corresponding InterPro terms than the alternate one, which is once again a good clue telling you that your annotation is better. It's the same for Pfam and other stuff like MEROPS and so on. If you look at GO here, you can see some specific GO terms that are under- or over-represented in each annotation. So in this case, it's not very representative. But in other cases, it can be useful.
OK, so we have now finished this tutorial. As you can see, it's a complex subject, annotation of genomes. If you want to give us some feedback on this tutorial, and it's very much appreciated, you can always use this feedback form here or contact us on the Slack channel if you are running it during the Smörgåsbord or another training session. And finally, you might be interested in another tutorial concerning annotation, which is the Apollo tutorial here, refining genome annotation. This one will show you how you can use the Apollo application, which looks like this, to manually curate an annotation. So if you have a look at your annotation, for example in JBrowse, and you realize that Funannotate predicted a quite good annotation, but there are some genes that are wrong and you want to correct them manually, this Apollo application allows you to do it easily. So it might be interesting to follow that tutorial. OK, that's it. Thanks for listening.
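A note on the BUSCO summary discussed earlier: the counts read out in the video can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic, not BUSCO's own code. The fragmented count is not read out in the video, so here it is derived under the assumption that complete, fragmented and missing genes add up to the total searched; the variable names are made up for the example.

```python
# BUSCO counts quoted in the walkthrough (protein mode, Mucorales lineage).
total_searched = 2449
complete = 2281
duplicated = 31
missing = 140

# Derived values. Assumption: complete + fragmented + missing == total,
# and complete == single-copy + duplicated.
single_copy = complete - duplicated
fragmented = total_searched - complete - missing


def pct(count: int) -> float:
    """Percentage of the searched BUSCO genes, rounded to one decimal."""
    return round(100 * count / total_searched, 1)


summary = {
    "complete": complete,
    "single-copy": single_copy,
    "duplicated": duplicated,
    "fragmented": fragmented,
    "missing": missing,
}

for category, count in summary.items():
    print(f"{category:12s} {count:5d}  {pct(count):5.1f}%")
```

The derived single-copy count matches the 2,250 quoted in the video. Running the same arithmetic on a BUSCO run of the raw assembly gives the before and after comparison recommended above.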