Hello, my name is Anthony Bretaudeau, I'm from Rennes, France, and today we're going to see together how we can use Galaxy to annotate the genome of a bacterium. If you go to training.galaxyproject.org, you can go to the Genome Annotation section, where there are several slides and tutorials for annotating different kinds of data. Here we go to the "Genome annotation with Prokka" tutorial, which is specific to prokaryotic genomes, so bacteria mostly. So I just click on the hands-on icon here, okay. I will run this tutorial on the usegalaxy.eu instance, which is the European instance of Galaxy. This tutorial should also work on other instances, like the Australian one or the main one at usegalaxy.org.
As you can see, the title of the tutorial is "Genome annotation with Prokka". Prokka is the main software we will use. It takes the sequence of the genome we want to annotate and looks into that sequence to find all the features it can, which means mostly genes, but also less common things like tRNAs or ribosomal RNAs, if it finds any. Prokka runs in two steps. The first step runs another tool, Prodigal, which tries to identify all the protein-coding genes in the sequence. The second step takes these predicted genes and adds functional annotation to them by comparing them to known sequences from other organisms that are available in international databases.
So the first step for this tutorial is to import the data. As you can see, we will use this file, which is available on Zenodo. You just have to copy the URL of this file, then go into Galaxy, create a new history if you haven't done so before, and upload this new file by clicking here and pasting the URL. You can say that it's a FASTA file here and click on Start. Okay, so now the file is uploaded to Galaxy, it's green. You can have a look at it by clicking on the eye.
As you can see, there are several sequences, which are DNA, of course. Each one is one of the scaffolds that were assembled, probably from short reads. It's the genome of a Staphylococcus aureus bacterium, and that's it. We'll first rename the dataset to make it more obvious what it is. We can simply write "genome" like this, and save.
Okay, and if you look at the tutorial now, the next step is to run Prokka and select the file we uploaded before. So we will do it. Prokka is here, in the Annotation section. We select the genome dataset, which is the only one here. There are a few options after that, and we will just leave them as they are. And so we just execute like this. Okay, now we just have to wait.
Okay, so Prokka has finished working. The first thing we can look at is the TXT output here. As you can see, six contigs were analyzed, the ones that were in the original FASTA file, which add up to around 180,000 bases. Prokka was able to find 149 genes coding for proteins, and two tRNAs. So those are the general statistics.
Now let's look at the GFF output here, like this. GFF is a file format representing an annotation on the genome. Each line represents a feature positioned on the genome. The first column here is the name of the contig where this feature is. The second column is how it was found, so here it's Prodigal, which was launched by Prokka. The third column is the feature type, here CDS, which means coding sequence. And the fourth and fifth columns are the start and end positions on this contig. So in this line, you can see that Prodigal found a CDS between positions 511 and 750 on this contig. Some software can write a score for the feature in the next column. Here Prodigal doesn't do it. Then you have the strand column, which here means this CDS was found on the forward strand of the genome. And on the rest of the line, after that, you have a lot of text written, and every time it's a key-value pair.
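If you prefer to read these columns with a script rather than by eye, here is a minimal Python sketch of the idea. It only assumes the standard tab-separated, nine-column GFF3 layout described above. The example line is made up for illustration: the contig name `contig_1` and the phase value are placeholders, and the source column is simplified to `Prodigal` (the real file also records the tool version). The names `GffFeature` and `parse_gff_line` are hypothetical, not part of Prokka or Galaxy.

```python
from dataclasses import dataclass


@dataclass
class GffFeature:
    """One feature line of a GFF3 file (nine tab-separated columns)."""
    seqid: str        # column 1: contig name
    source: str       # column 2: tool that produced the feature
    type: str         # column 3: feature type, e.g. CDS
    start: int        # column 4: start position (1-based)
    end: int          # column 5: end position (inclusive)
    score: str        # column 6: score, "." when absent
    strand: str       # column 7: "+" or "-"
    phase: str        # column 8: reading-frame phase for CDS features
    attributes: dict  # column 9: key=value pairs separated by ";"


def parse_gff_line(line: str) -> GffFeature:
    """Split one GFF3 feature line into its columns and attribute pairs."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Illustrative line only: contig name and phase are placeholders.
example = ("contig_1\tProdigal\tCDS\t511\t750\t.\t+\t0\t"
           "ID=KCBFIMOI00001;locus_tag=KCBFIMOI00001;product=hypothetical protein")

feature = parse_gff_line(example)
print(feature.type, feature.start, feature.end, feature.strand)
print(feature.attributes["product"])
```

This sketch does not decode URL-escaped characters in attribute values, which a full GFF3 parser would need to handle.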
So you have an ID, which for this gene is KCBFIMOI00001. This is a unique identifier of the gene on this genome. After this, you have the method, how it was predicted. Here it's an ab initio prediction, which means the software works only with the sequence of the genome. That's what "ab initio prediction" by Prodigal, in this version, means. Then you have the locus tag, which is quite similar to the ID. And the product, which is the name of the protein that is translated from this gene. Here we only see that it's a hypothetical protein. But for other genes, like number five here, we have more details. We have a name for the gene, which is ble. We have other external references, which we'll present a bit later. And we know that this gene probably codes for a bleomycin resistance protein, from comparison with external sequence databases. OK, so we are happy with this GFF.
If we look at the other results of Prokka, you can look at the GenBank file here, GBK. It's just the same information as the GFF, written in another format, which is widely used in international databases. So you should have the exact same content as in the GFF. And if you look at the tutorial, it says here that the FAA file contains the protein sequences of the annotated genes. If you look at it here, you can see that each gene, with the ID coming from the GFF, has the corresponding protein sequence. So you can find the 149 sequences here. And the last one is the FFN file here, which is, if I remember well, yes, the nucleotide sequence of each gene. So not the protein sequence, but the nucleotide one. OK, that's great.
So these are text files, which are somewhat hard to read. You don't want to print them, you want to visualize them to make it all a little bit more visual. In Galaxy, you have a tool named JBrowse, like this, which is in the Graph/Display Data section. And if you click here, you can select the genome of the species we are studying.
So it's the first dataset we imported at the beginning, the genome here. You can also select, as it says in the tutorial, the FNA output of Prokka, which is here. If you compare this one with this one, it's the same data. So we come back to JBrowse. We select the genome from the history. We can take the FNA, as it says in the tutorial, if you want. Then, we know it's the genome of a bacterium, so we tell JBrowse that the genetic code of this species is the bacterial one, here. The next step is to add tracks. We first add a track group, which we will name "annotation", for example. And then an annotation data track here, which will be the GFF file that was generated by Prokka. We leave all the options at their defaults, and we execute.
OK, so JBrowse has finished, it's green here. I can preview it here with the eye. You see the genome browser displayed inside Galaxy. But for a little more comfort, I'm going to open a new tab here, and we can see JBrowse in a full page.
OK, so this is a genome browser, which is named JBrowse. Here, on this part of the screen, you have the sequence of the genome from position 300 to 455. Here you have the list of all the contigs of the genome we are viewing. We have the full length of this contig here, we have zoomed in on this part of the contig, and we are displaying its content here. If we zoom a little more here, yes, we have colors as before, but now we have letters on these colors. In the middle here, you have the sequence of the genome, with the forward strand here and the reverse strand just below. Then you have the six-frame amino acid translation. The first three frames are for the forward strand, so this TAT sequence means a tyrosine amino acid. And the three below correspond to the reverse strand.
OK, so we can switch like this from this contig to a bigger one, like this one here. And on the left, you have a list of tracks that you can show.
We added one, and the reference sequence track is always there for you to show. If you click on Prokka here, it corresponds to the GFF that was generated by Prokka. If you look at it, you have several arrows with rectangles. If you want to zoom in on one of them, you just say that you want to zoom to this region, for example. And you can see the different genes that were predicted in this region of the genome.
You can click on one gene, for example this one, UDP-N, et cetera. If you just click on it, you have some details on this gene. You know the name that was predicted by Prokka here. You know that it's a CDS, a coding sequence. You know its position: it's on the positive strand of contig number one here, between this position and this one. And it's almost 1,000 base pairs long. Here you have the sequence of the gene. So if you want to BLAST it, you can do it just by copying it, or even by downloading it to your hard drive.
And here you have a list of the attributes that were in the GFF file that we saw a bit earlier. If you look at it, you have, for example, the EC number, which is the Enzyme Commission number. If you select it, copy it, and then search for it, maybe you will find it in Google. OK, so if you click on expasy.org, you have it here. This number is specific to this UDP-N-acetylglucosamine 4-epimerase enzyme, which is registered in this database. So it's a very standard way to represent the activity of the gene product. You have the standardized name of the gene, which is wbgU. You have an ID, which was decided by Prokka.
There is also the inference. So how did Prokka come to name this gene like this? Well, it says that it used Prodigal, in this version, to predict this gene on this genome. And it found the predicted gene to be very similar to the amino acid sequence of a well-known protein in UniProt, which is K, this one.
So if you copy this one and search UniProt for this ID here, you will find this page, which describes the function of this enzyme in a lot of detail, along with similar proteins in other species. OK. Then the locus tag is very similar to the ID. The phase is not very relevant here. And we know that this gene is predicted to be this UDP-N-acetylglucosamine 4-epimerase gene. And that's mostly it. So it's a lot of information, which is in the GFF file and is represented like this in JBrowse.
OK. So congratulations, you've just finished this small tutorial on annotating a prokaryotic genome. You can do exactly the same for any bacterial genome, for example, using Prokka. Thanks for listening.
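As a small follow-up, the general statistics from the TXT output (number of CDS, number of tRNAs) can be recomputed from the GFF itself by counting the values in the third column. The Python sketch below is one possible way to do it, assuming the standard nine-column GFF3 layout and assuming the file may end with a `##FASTA` section holding the sequences. The function name `count_feature_types` and the toy lines are made up for illustration and are not real Prokka output.

```python
from collections import Counter


def count_feature_types(gff_text: str) -> Counter:
    """Count how many features of each type (column 3) a GFF3 text holds."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


# Toy input: contig names, coordinates and IDs are placeholders.
toy_gff = "\n".join([
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t511\t750\t.\t+\t0\tID=gene_00001",
    "contig_1\tProdigal\tCDS\t900\t1400\t.\t-\t0\tID=gene_00002",
    "contig_2\tAragorn\ttRNA\t50\t125\t.\t+\t.\tID=gene_00003",
    "##FASTA",
    ">contig_1",
    "ACGTACGT",
])

counts = count_feature_types(toy_gff)
for feature_type, number in sorted(counts.items()):
    print(feature_type, number)
```

To try it on a real file, you could download the GFF from your Galaxy history and pass the file's content to the function; the CDS and tRNA counts should then match the ones reported in the TXT output.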