Hello, my name is Anthony Bretaudeau. I work at INRAE in France, and today I'm going to show you how to perform repeat masking on a genome. Repeat masking takes place after assembling a new genome and before performing a full annotation of its genes.

When you assemble a new genome, you get a FASTA sequence of the genome, but these genomes often contain a lot of repeated elements. They can be tandem repeats of very short sequences, like AT, AT, AT, repeated one after the other. Or they can be interspersed repeats, which are sequences found in multiple places in the genome. These can be small ones, named SINEs (short interspersed nuclear elements), or long ones, the LINEs (long interspersed nuclear elements).

These repeats are interesting because they are part of the evolution of genomes. They can be produced by the presence of transposons in the genome or by viral insertions, and they can have a direct effect on the expression of genes. The problem with these repeats is that they confuse the software used to annotate genes on the genome. So you often want to find the exact position of all these repeats and mask them for the gene annotation software that will be used later on.

In this tutorial, we are going to perform repeat masking on a fungal species, Mucor mucedo. This genome was assembled following the Flye assembly tutorial in the Assembly section. I can show you here; if you go there, it's this one. If you follow that tutorial, you will get an assembly of this species. After that, you can follow the repeat masking tutorial, which is in the Genome Annotation section, here: Masking repeats with RepeatMasker.

There are multiple programs available to perform repeat masking. Today, we're going to use RepeatMasker.
There are other ones, like RepeatModeler or REPET. These programs use different strategies. Some perform de novo annotation of repeated elements, just by comparing different sections of the genome sequence to find all the subsequences that are repeated somewhere in the genome. Others, like RepeatMasker, use databases of known repeated elements that have already been found in other genomes. There are two widely used databases. The first one is Repbase, but it's not free. The other one is Dfam, which is free and which we will use today in this tutorial.

So that's it, we can start the tutorial. The first step is to get the assembly, the FASTA file, which is produced by the Flye assembly tutorial, or which you can get directly from Zenodo. We are going to get it from Zenodo: copy this URL, and then go to usegalaxy.eu. This tutorial should run perfectly on usegalaxy.eu, and probably on other UseGalaxy servers like usegalaxy.org or usegalaxy.org.au. I'm going to create a new history that I will name "repeat masking", and I will upload the test data here. I just paste the URL here and name it "genome". I click Start and it gets added to the history.

Okay, now that the dataset is green, you can see the FASTA sequence. If you look at it, you will see the whole sequence, which is in uppercase, as you can notice.

The next step is to run RepeatMasker on this genome. You can launch it by clicking here, or you can find it in the annotation section here. Here it is. We're using the latest version, and we just select the genome sequence. As I said, RepeatMasker uses a database, so we are going to use Dfam, the one that is bundled with RepeatMasker. It will be enough. Then we have to select a species in the list displayed just below. As you can see, there's no Mucor mucedo in this list, because it's a general database.
So today, we are going to leave the default, which is human. In fact, what we will do is not a perfect annotation of repeated elements. It will be a very light one, which will detect the most common repeats. In real life, you would want to use a species as close as possible to the species you're trying to mask. But for this tutorial, it will be enough to mask the most commonly found repeated elements, and it will give good results for annotation in the following tutorial with Funannotate.

Okay. There's an important option that we need to specify here: do we want to perform soft masking or hard masking? Here we want to perform soft masking. I execute. But just to show you the difference between soft and hard masking, I will launch the same tool again by clicking on this refresh button, which lets you re-execute the same job. Instead of doing soft masking, I will perform hard masking. So now we will get a result for soft masking and one for hard masking, and we'll compare the two when they are finished.

Okay, now all the datasets are green. It took nearly half an hour to compute this repeat masking on usegalaxy.eu, but it depends on the load of the server, as usual. We are going to take a look at the output of our two RepeatMasker runs. Each one has four output datasets. The first one is the masked sequence, here. So let's have a look. If you remember, the first four are for soft masking and the last four are for hard masking.

For soft masking, if you look at the sequence generated by RepeatMasker, at first it looks exactly like the original FASTA file. But if you look closely, there are regions like these that are in lowercase, and there are multiple ones. If you go further down in the file, you will find other ones. I can find one here. You can see another one.
That's what we call soft masking: the regions that are repeated are just put in lowercase, but you still see the sequence at this position in the file. Other annotation tools that we will use in the Funannotate tutorial or the MAKER tutorial will take into account this information that the sequence is repeated in the genome.

Okay, now if you look at the same output for hard masking, dataset number six here, you will see that the beginning of the sequence is the same, of course. But for all the repeated sequences, instead of having the sequence in lowercase, you notice that the whole sequence is replaced by Ns, with exactly as many Ns as the length of the repeated sequence. Here you get one, and of course you get other ones everywhere in the sequence where there is a repeated section. These are exactly the same regions as in the soft masking. It's just that the way they are written in the output FASTA file is different.

For hard masking, you just have to remember that you lose information in the genome sequence: if you have an N, you don't know which sequence was there, even if it was repeated. Often, when you want to annotate a genome to find the location of all the genes, you don't want to do hard masking, because annotation software often considers that genes are found outside repeated elements but can sometimes overlap them. For example, you could have a gene that starts here and continues there, and if you have Ns in the middle of the gene, it's really a problem because you can't guess the real sequence of the gene. If you have a soft-masked sequence, and your annotation tool afterwards detects a gene that starts here and finishes here, you will get a proper protein sequence even if there is a small repeated sequence in the middle of the gene. So most of the time you perform soft masking.
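
To make the difference between the two masking modes concrete, here is a minimal Python sketch. It is not how RepeatMasker works internally: the function names, the toy sequence, and the repeat coordinates are all made up for illustration, and the interval convention (0-based, end exclusive) is an assumption of this example.

```python
def soft_mask(seq, intervals):
    """Return seq with the bases inside each (start, end) interval in lowercase.

    Intervals are 0-based with an exclusive end (an assumption of this sketch).
    """
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, intervals):
    """Return seq with the bases inside each (start, end) interval replaced by N."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


def soft_to_hard(soft_seq):
    """Turn a soft-masked sequence into a hard-masked one (lowercase becomes N)."""
    return "".join("N" if c.islower() else c for c in soft_seq)


# Toy example: a hypothetical AT tandem repeat at positions 8 to 15.
sequence = "ACGTACGTATATATATGGCCA"
repeats = [(8, 16)]

soft = soft_mask(sequence, repeats)
hard = hard_mask(sequence, repeats)

print(soft)  # ACGTACGTatatatatGGCCA
print(hard)  # ACGTACGTNNNNNNNNGGCCA
```

Both outputs keep the original length, so coordinates stay valid. The `soft_to_hard` helper shows why soft masking is the safer default: you can always derive the hard-masked sequence from the soft-masked one, but you cannot go back from Ns to the original bases.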
Okay, that's the output, and it can be used in the following Funannotate tutorial or with any other annotation tool. We can have a look at the three other outputs of RepeatMasker. The first one is the repeat catalog. That's just a list of all the repeated elements that were found in the genome, with the position, the scores, and the exact differences for each repeated region. You also have the output log here, which is another way of telling you exactly where repeated sequences were found, with the exact sequence of the repeat and the kind of repeat, because there is a whole classification of repeats. It can be simple repeats or low complexity regions, but you will see a bit later what it can be. And you have the position, the score, and other things like that.

Probably the most interesting output, apart from the sequence, is the repeat statistics. There you will see very useful information. First, the total length of the genome that you are trying to mask, then the GC level, which you could compute with other tools but which is useful to know. And very important, the number of masked bases in the genome, that is, the number of bases that RepeatMasker found to be repeated and that were masked, either soft or hard masked. In this case, we have 48 million base pairs, and of this total only about one million were detected as being repeated, which is 2.41 percent of the genome.

If you look at this region of the statistics file, that's where you will find the classification of all the repeats that were found. You can have short elements, long elements, LTR elements, and other kinds of elements. Of course, if you look up this classification on the internet, on Wikipedia for example, you will get a lot more information. And at the end, you get a lot of satellites, simple repeats, and low complexity regions.
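
The masked fraction reported in the statistics can be cross-checked by hand on the soft-masked FASTA, simply by counting lowercase bases. The sketch below is one way to do it in plain Python. The function name and the two toy contigs are invented for illustration, and it assumes that lowercase letters only come from soft masking.

```python
def masked_fraction(fasta_text):
    """Count soft-masked (lowercase) bases in FASTA-formatted text.

    Returns (masked_bases, total_bases, fraction). Header lines starting
    with '>' and empty lines are skipped.
    """
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for c in line if c.islower())
    fraction = masked / total if total else 0.0
    return masked, total, fraction


# Toy soft-masked FASTA with two hypothetical contigs.
example = (
    ">contig_1\n"
    "ACGTACGTatatatatGGCC\n"
    ">contig_2\n"
    "GGGGccccGGGGTTTTAAAA\n"
)

masked, total, fraction = masked_fraction(example)
print(f"{masked} of {total} bases masked ({fraction:.2%})")
# 12 of 40 bases masked (30.00%)
```

With the rounded figures from the video, 2.41 percent of about 48 million bases is roughly 1.16 million masked bases, which matches the "about one million" read from the statistics. Doing the same count on a hard-masked file by counting Ns would be less reliable, because Ns already present in the assembly (gaps, for example) would be counted too.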
If you look at the percentages, these last categories represent the vast majority of the repeated elements that were detected. That's probably because we used the human entry from Dfam to perform repeat masking on this genome. If we had selected a species closer to ours, we would probably have more repeated elements in the other categories. These ones are very generic and found in any genome. Okay. So 2.41 percent is quite low, which is okay for annotation later on.

That's it. If you go back to the tutorial, we've performed this step of repeat masking. There's a question, and we just answered it: we have 2.41 percent of repeated elements. And now we're ready, if you want, to follow on with the Funannotate annotation tutorial, which will be in another video. Don't forget, at the end, to look at the feedback form and to give your opinion on this tutorial. It will be very useful for us to improve it in the future. Thanks for watching and listening.