Hello, I'm Anthony from INRAE in France. Today I'm going to talk to you about one of the first annotation steps that you perform after sequencing a new organism. So it's in this section, Genome Annotation, and then this tutorial, Masking repeats with RepeatMasker. So this step is repeat masking. So usually when you sequence and assemble a new genome, you get a FASTA file containing the new sequences corresponding to the genome sequence. And if you are studying eukaryotic genomes in particular, a large portion of this sequence is composed of repeated elements. They can be very small sequences, just a few bases, like AT, repeated a large number of times one next to the other. Those are called tandem repeats. Or you can have slightly longer sequences that are repeated at different positions in the genome. Those are, for example, the short interspersed nuclear elements or the long interspersed nuclear elements. And they can come from specific things like transposons or viral sequences. So why do we want to mask them? So first, these repeated sequences can be interesting on their own if they come from specific transposons or viral sequences. But they can also have effects on the expression of genes. So you might be interested in knowing where they are in the genome. The problem with these repeated elements is that often when you have a new genome, you want to annotate it to find the gene locations, because you're often highly interested in genes. But the presence of all these repeated elements can be a problem for the annotation software that will try to detect the gene positions. So often the first step when you get a new sequence is to detect all these repeated elements and to mark them. We will see how later. And then the annotation tools will take this information into account to try to predict the gene positions more accurately. So there are two different ways of masking repeats in a genome. The first one is soft masking.
In this case, the result of this analysis will be your FASTA file with repeated elements written in lowercase instead of uppercase. And the other one is hard masking. And in this case, all the repeated elements found by the software in the genome will be replaced by the letter N, which means you will lose some sequence information from the sequence. So there are multiple software tools to perform this repeat masking step. Today, we are going to test Red and RepeatMasker. But there are other ones like REPET. And these tools sometimes use specific databases to detect repeated elements that have already been detected in other genomes. So the best known repeated element databases are Dfam and Repbase. And Repbase is a non-free database, which means you often have to pay to use it. So today, we're going to use Dfam. And to perform the tutorial, we're going to use a genome that was assembled following another tutorial of the GTN, which is the Flye assembly tutorial that you can reach by clicking here. So let's start our tutorial. The first thing is to get our data into a history. So just copy these URLs. I've created a new history on usegalaxy.eu. This tutorial should work on all the other usegalaxy servers, but we have tested it here. I'm going to rename it. And then I'm going to upload the data here. OK, so now it's uploaded. And you have three files. The first one is the genome sequence, in FASTA format, that was assembled by Flye. So you see all the letters are uppercase. And you have more than 1,000 different contigs. The two other files are repeat libraries. We will use them a bit later in the tutorial. So the first tool we are going to use is called Red. So you just type Red in the tool search box, like this. And you select it. And this tool is a very simple tool. It just takes as input the FASTA sequence of the genome you want to mask. And you just click on Run Tool. So this is a tool which uses machine learning techniques.
So it doesn't use any repeat library at all or any databases. But it will just look at the sequence. And based on the training data that was obtained from analyzing a lot of published genomes, the Red tool will be able to detect chunks of sequence in your genome that look like repeated elements. So it will run for a bit now. OK, Red is finished. Now you have two output datasets in your history. The first one, if you look at it, contains the genome sequence. But as you can see, parts of it have been written in lowercase, which correspond to repeated elements as detected by Red. So you can see there is a big proportion of the genome that is repeated. And the other output file is a BED file showing you the positions of all the repeated elements that were detected by Red. So you see on which contig, and at which position exactly. This is interesting, but you might want to have some statistics on this result. So you can get them from the icon here. In the tool standard output, this is what is written by the tool when it processes your input data. And at the end here, you see the basic statistics, which means, oh, sorry, yes, that it analyzed a genome that is 48 megabases. It found a total of 14 megabases of repeated elements, which represents around 30% of the genome sequence. So this is quite a usual number for this kind of genome, which is a fungal genome in our case. So if you look back at the FASTA file, you see that it's in lowercase, which means it's a soft masking operation that was done by Red by default. But you can, of course, as explained in the tutorial, convert it to a hard-masked FASTA file by using the MaskFastaBed tool from bedtools here. So here you just select the BED output file from Red, which is dataset number 10, and the FASTA file that you used at the beginning, which is the raw genome. And here, we don't want to soft mask. We want to hard mask, which is the default. And you click on Run Tool. And then you wait a little.
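To make the difference between the two masking modes concrete, here is a minimal Python sketch of what masking a sequence from a BED file amounts to. It is only an illustration, not how bedtools or Red are implemented. The function names and the toy contig are made up for the example, and it assumes standard BED coordinates (0-based start, end not included).

```python
def parse_bed(lines):
    """Group BED intervals (name, start, end) by sequence name."""
    by_name = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        name, start, end = line.split()[:3]
        by_name.setdefault(name, []).append((int(start), int(end)))
    return by_name


def mask_sequence(seq, intervals, soft=False):
    """Mask 0-based, half-open intervals: N for hard, lowercase for soft."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range intervals are harmless.
        for i in range(max(0, start), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


# Toy data (hypothetical): one contig with an AT tandem repeat at 8..16.
genome = {"contig_1": "ACGTACGTATATATATACGT"}
bed = parse_bed(["contig_1\t8\t16"])

hard = {n: mask_sequence(s, bed.get(n, [])) for n, s in genome.items()}
soft = {n: mask_sequence(s, bed.get(n, []), soft=True) for n, s in genome.items()}

print(hard["contig_1"])  # ACGTACGTNNNNNNNNACGT
print(soft["contig_1"])  # ACGTACGTatatatatACGT
```

You can see that the soft-masked version still contains the original bases, while the hard-masked version has lost them.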
OK, so now while it is still waiting in the queue, we're going to run another tool for repeat masking, which is widely used in the literature. It's named RepeatMasker. So it's installed on usegalaxy.eu here. You just select it. And you select the raw genome sequence from the beginning here. And here, we're going to try to perform repeat masking based on a repeat database, which is named Dfam. And we are going to select a specific species from Dfam. So all the repeated elements known by Dfam are associated with specific organisms. So when you run RepeatMasker, you have to select which organism is the closest to the one you're trying to mask. So in our case, we're going to use Homo sapiens, which is quite far from a fungal species. But this is just to show you what it will give you as results. And for the outputs, we want to select the GFF output. And we want to perform soft masking instead of hard masking. That's what we usually do. Hard masking is really not that helpful because you lose a lot of information from the sequence when you perform hard masking. So that's not really useful usually. So you just run this tool. So at the end of the RepeatMasker run, you get five different output files. The first one is the masked sequence, which looks just like the Red masking output. You get the output log here, which gives you a tabular list of all the repeats that were found in the genome. And you get the statistics. That's the most important part for our tutorial. And here, you can see that RepeatMasker is able to classify the repeated elements that were found into the different categories here. And for each one, you have the percentage of the sequence. And you can see that in total, this RepeatMasker run using the human data found only 2.41% of the genome to be repeated. So this is much lower than the 30% of the Red tool. But it makes sense because we used human data from Dfam to find repeats in our genome.
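If you want to check this kind of percentage yourself on a soft-masked FASTA file, a small sketch like the following can do it by counting the lowercase bases. This is just an illustration with made-up toy data, and the function name is hypothetical. It is not the way RepeatMasker or Red compute their statistics, and it does not count the N bases of a hard-masked file.

```python
import io


def masked_fraction(fasta_handle):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    total = 0
    masked = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy FASTA (hypothetical): 4 lowercase bases out of 20.
example = io.StringIO(">contig_1\nACGTacgtAC\n>contig_2\nGGGGGGGGGG\n")
fraction = masked_fraction(example)
print(f"{fraction:.1%} of bases are soft-masked")  # 20.0% of bases are soft-masked
```

On a real file you would pass an opened file handle instead of the `io.StringIO` object, and compare the number with the statistics reported by the masking tool.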
So it's important to choose the right repeat masking library. And as there is no specific library for the organism that we are studying, we won't be able to use Dfam directly to get good results with RepeatMasker, but we can use some pre-generated repeat libraries designed specifically for this organism. So if you remember, at the beginning, we uploaded two other FASTA files, the RM2 library and the EDTA library. These files are just FASTA sequences containing repeated elements that were found in the genome by different tools. For the first one, RM2 stands for RepeatModeler 2. And EDTA is another software that can be used to produce this kind of library. So these two, RepeatModeler and EDTA, work just by looking at the FASTA file and detecting which sequences are repeated at various locations in the genome. So let's see how it will perform if we use these specific libraries instead of Dfam. So we run RepeatMasker once again using, as usual, the raw sequence from the beginning. But here, instead of selecting Dfam, we select a custom library of repeats. And we will use RM2. You can do the same for EDTA later, if you want. And then I click on the GFF output. And that's it. I run it. OK, now that it's finished, you can have a look at the sequence, now masked using this new library. And you can see that there are much larger parts of the genome that are in lowercase now. And if you look at the statistics here, you can see that RepeatMasker was able to find 34% of the genome to be repeated, in various categories that are displayed here. So it's much more similar to what Red gave. And it's supposed to be much more accurate since the library that was used was generated specifically for this genome. You can do the same thing for EDTA. And you will get a very similar result, which will give you 32% of the genome. Now what if you want to analyze your own genome and you don't have a repeat library that was generated for you?
And you know your organism is not well represented in the Dfam library. In this case, you can do exactly what was done to generate this RM2 FASTA file here. You can use the RepeatModeler tool here. Yes, that's it. Space in the name. So this tool is quite simple. It just takes as input the FASTA sequence that you just assembled. And you run it like this. But the only problem with this tool is that it takes several hours to run. So for now it's running, but look at it now. It's magical. It's already finished. If you look at this dataset, consensus sequences, you'll see all the sequences detected by RepeatModeler that correspond to repeated sequences from your genome. So this is the file that you can use in RepeatMasker, like this one here. And this file, you can save it and use it to mask any new assembly of that same genome. It should give quite similar results. And if you have a species very close to the one you're studying, you can probably use it too. Or you may prefer to generate a very specific library by analyzing the whole genome with RepeatModeler. OK, so that's it. We've seen how to perform repeat masking. The usual way for a non-model organism is to just run RepeatModeler after assembly, get a new library, and then run RepeatMasker to get a cleanly masked genome. Once you have it, it's ready for the next step, which is the genome annotation to find genes in the genome. And for this, you can use various tools like MAKER, BRAKER, Funannotate, and so on. That's it. Thanks for listening.