Hello everybody, I'm Bérénice Batut, and today I will teach you about the assembly of metagenomic sequencing data using Galaxy. We will follow one of the tutorials available in the Galaxy Training Network, which you can access by typing training.galaxyproject.org in your web browser. You will be redirected to a page where you can find all the tutorials from the Galaxy Training Network; if you scroll down to the topic sections and go to Metagenomics, you will see several metagenomics tutorials, and in particular the assembly of metagenomic sequencing data, which you can find here. This is the tutorial we will follow today. The idea of this tutorial is to answer questions such as why metagenomic data should be assembled and what the difference is between co-assembly and individual assembly, so that at the end you are able to describe what an assembly is, explain the difference between co-assembly and individual assembly, explain the difference between reads, contigs and scaffolds, select appropriate tools to assemble your data, and so on. Metagenomics, as you probably know, involves the extraction, sequencing and analysis of genomic data from an entire microbiome sample. The idea is that you take a sample, extract all the DNA from all the organisms present in it, and sequence that DNA without knowing which organisms it comes from. For analyzing metagenomic data there are several approaches: you can analyze the raw reads directly to identify which organisms are present, or you can assemble the reads into longer sequences. Sequencing does not give you the whole sequence of each organism in one piece; you only get short or long reads, and these need to be combined into longer sequences that better represent the organisms themselves. 
To do that we use an assembly approach and an assembler, meaning a computational program that can stitch the fragments of DNA together. If you think about it, assembly seems quite intuitive: it's like a jigsaw puzzle, where you try to combine the different pieces that fit together. But this task is really not straightforward; it is more complex than we might think, because of the complexity of genomes, especially the repeats; because of the missing pieces, since DNA extraction and sequencing do not capture the whole sequence of every organism; and because of all the errors that can be introduced during sequencing. Metagenomic assembly is further complicated by the large volume of data produced, the quality of the sequences, and the fact that the organisms in a sample are not equally represented in the microbiome community. There can also be microorganisms that are closely related, for example belonging to the same species but to different strains, with very similar genomes; and because assembly is based mostly on similarity and overlaps, we can end up assembling together sequences that belong to closely related organisms from different strains. On top of the presence of several strains of the same organism, there is usually an insufficient amount of data for organisms present at really low abundance. So assembly itself is already a complicated task, but metagenomic assembly is even more complex. In this box, you can see the different strategies for assembly, the main strategies used not just for metagenomics but for any type of assembly: greedy extension, overlap-layout-consensus, and de Bruijn graphs. 
I will not go in depth here; you can take the time to read the paper mentioned here. For metagenomic assembly, several tools exist, for example metaSPAdes and MEGAHIT, and all these assemblers have different computational characteristics and performance that can also vary according to the type of microbiome. For example, with a microbiome that has only a few organisms (low diversity), or with a soil sample where you expect a high diversity of microorganisms, the different assemblers behave differently. Different benchmarking approaches have been used over the years to try to identify which assembler to use in which context, and I can recommend having a look at the Critical Assessment of Metagenome Interpretation (CAMI) initiative, where they evaluate different assemblers. In this tutorial, we would like to learn how to run a metagenomic assembly tool and evaluate the quality of the generated assemblies. To do that, we will use data from a study called "temporal shotgun metagenomic dissection of the coffee fermentation ecosystem". They did a temporal shotgun metagenomic study of the coffee microbiome over six time points: they extracted the DNA of the microbiome at each of these six time points and sequenced everything with Illumina MiSeq whole-genome sequencing. Based on these six original datasets, we generated a mock dataset for this tutorial, because the original data were quite big and do not fit the purpose of a training. Even so, you will see that it will take time to run everything, so please be patient, or do something else in parallel, or follow another tutorial, because the running times are quite long, even though we generated a smaller dataset than the original one. So, how to run an assembly? First we need to get the data into Galaxy. 
Any Galaxy analysis starts with its own history. So what you need to do is go to your favorite Galaxy instance. I use Galaxy Europe, so usegalaxy.eu; you can also use usegalaxy.org (the US one) or usegalaxy.org.au (the Australian one). The first thing to do when you start an analysis is to create a new history. The history, as you probably know, is on the right side. So you click on the plus icon to create a new history, and you give it a name: you can click on the small pencil icon here to rename it. I will rename it "Assembly of metagenomic data tutorial". Then I need to get the data. Instead of going back and forth between two browser tabs, I will use this icon that you can see in the Galaxy interface: if you click on "See Galaxy training materials", you will be redirected (it takes a bit of time to load the first time). There you can again find all the tutorials from the GTN, the Galaxy Training Network, directly embedded in Galaxy, so they are easier to see. It's exactly the same content; the only difference is that you have better integration and you don't have to move between tabs. If you click outside and come back, you get the tutorial again. So: I created a new history, I renamed it, and now I need to get my data into my history. For that, I need to import all the raw data, the fastqsanger files, from Zenodo or from a data library. Here I will use the Zenodo links: you click on "copy" here, and it will copy everything. 
Then you need to import them: at the top of the bar on the left side, you click on "Upload Data", then on "Paste/Fetch data" at the bottom, and in the box you paste everything that was copied. Then you click on "Start"; I will do it again to show you. Once it's done, you can close the window and all the data will be added to your Galaxy history. So again, what I did: I went to the tutorial directly inside Galaxy, clicked "copy" to copy the links, went to the "Upload Data" button at the top left, clicked on "Paste/Fetch data", pasted the copied content into the box, and clicked "Start". Then I close the window (I don't click "Start" again, because it is already uploading the data). In your history, you should then have 12 datasets, at least downloading via the links we gave. As we said, we have six samples but 12 datasets, because for each sample you have, for example, 67_1 and 67_2: this is paired-end data, so for each sample there are two datasets, where _1 is the forward reads and _2 is the reverse reads. So we have paired data for all six samples. To organize them, we can use what is called a paired collection. You can think of a collection like a folder on your computer: we try to put everything in one folder. 
A paired collection is a collection where the two datasets for each sample are already combined as forward and reverse, so we know which two belong together. To organize the data that way in Galaxy, you click on "Select items", then on the right on "Select all". A button "All 12 selected" will appear, with a dropdown menu, and there you say you want to build a list of dataset pairs: what I want to do is pair the datasets together, so I will have six pairs, and combine these pairs into a list, or collection. Again, what I did: my data is loading here slowly; I click on the checkbox icon, then on "Select all" on the right; I have "All 12 selected", and I can click on "Build list of dataset pairs". A dialog appears in the middle that tries to create a collection of paired datasets: by default it looks for _1 to assign the forward reads and _2 for the reverse, and it created six pairs. The names are currently the sample names plus ".fastqsanger", and I want to remove this ".fastqsanger" part, so for each of the samples I click on the name and delete it. I do it six times; it's okay, it's not a lot of datasets. Now I have six pairs, and to finish creating the collection, I need to give it a name. I will call it "Raw reads", because these are the original reads we will start from. Then I can create the collection. 
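As an aside, the "_1 = forward, _2 = reverse" pairing rule that Galaxy applies here can be sketched in a few lines of Python. This is an illustration only, not Galaxy's actual code, and the file names are invented examples following the naming convention of the tutorial data.

```python
# Sketch of the pairing rule used when building a list of dataset pairs.
# Not Galaxy's implementation; file names below are invented examples.

def build_paired_collection(filenames):
    """Group FASTQ files into {sample: {"forward": ..., "reverse": ...}}."""
    pairs = {}
    for name in filenames:
        stem = name.removesuffix(".fastqsanger")
        if stem.endswith("_1"):
            pairs.setdefault(stem[:-2], {})["forward"] = name
        elif stem.endswith("_2"):
            pairs.setdefault(stem[:-2], {})["reverse"] = name
    return pairs

files = ["67_1.fastqsanger", "67_2.fastqsanger",
         "68_1.fastqsanger", "68_2.fastqsanger"]
collection = build_paired_collection(files)
print(collection["67"])
# {'forward': '67_1.fastqsanger', 'reverse': '67_2.fastqsanger'}
```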
Now the 12 individual datasets are not really visible anymore as before; they are all in one item called "Raw reads". If you click on it, you have six elements, each a pair of datasets: you can see, for example, the forward and reverse reads for sample 77. If you want to come back, you click on "Raw reads" to go back to the collection, and at the top you go back to the history. I think the data is still downloading, so we need to wait a bit. In parallel, I can start to explain what we will do. As I explained before, there are many challenges in metagenomic assembly, including the difference in coverage between samples: the same organism in different samples will be covered differently by the sequencing. Add to that the fact that different species share conserved regions, and that there can be multiple strains of a single species, which complicates the assembly even more. To reduce the coverage differences between samples, one approach we can use is called co-assembly, where the reads from all the samples are merged together before doing the assembly. That reduces a bit the complexity of the assembly and the differences between samples. The advantages of co-assembly are that we have more data for the assembly, we usually get better and longer assemblies, and we can access low-abundance organisms because we have more data. The disadvantage is a higher computational overhead. 
We also have the risk of shattering the assembly and the risk of increased contamination. So it is not always beneficial to do co-assembly, especially if differences between strains cause the assembly to collapse. This matters when we bin, a step usually done after assembly, especially when we want to build MAGs, metagenome-assembled genomes, meaning full genomes built from the metagenomic data. When you want to build MAGs, you do the assembly and then binning: you take the contigs generated by the assembly, the longer sequences, and you try to bin them together, grouping sequences with high similarity that could belong to the same species, or even the same strain. When you do co-assembly, the binned contigs are more likely to be misclassified, yet they may be treated as population genomes. Co-assembly can be reasonable if the data come from the same sample that has simply been sequenced several times, or from the same sampling event at the same location and site, or when the samples are really related. In our case, we have different samples from the coffee microbiome, but at different sampling times of the fermentation. Is it meaningful to assemble the six time points together, if what we want is to know which microorganisms make up the coffee fermentation microbiome whatever the time point? The other approach we can use is individual assembly, where we assemble each sample one by one. 
Then we could have clearer information about the different organisms we can find at the different fermentation time points. So when we don't have the same sample, the same sampling event, or really related samples, we usually use the individual assembly approach, as I said. In this case there is usually an extra step of de-replication: we do the individual assemblies, then we compare the different generated assemblies to de-replicate them, looking at their similarity to identify, for example, the MAGs that belong to the same strain. Co-assembly is more commonly used than individual assembly followed by de-replication after binning, but in this tutorial we will really show how to run the individual assembly; running the co-assembly is a really similar approach, with just a few changes that I will highlight. Just a comment: sometimes it can also be useful to run both approaches, individual and co-assembly, and use both outputs to really get an overview of everything. As I mentioned before, several tools can be used for metagenomic assembly, but the two that are the most used, and the ones that seem to come out on top in benchmarks, are metaSPAdes, a short-read assembler designed specifically for large and complex metagenomic datasets, and MEGAHIT, which is what is called a single-node assembler, also for large and complex metagenomic reads. Both use a de Bruijn graph approach, but they have slightly different behaviors, so I really recommend looking at benchmarking papers to identify which one to use in which context. Both are good, and both are available in Galaxy. The only thing is that currently MEGAHIT is the only one that can be run in individual mode over several samples; metaSPAdes can be used, for example, for co-assembly, but not really for individual assembly. 
Today, for this tutorial, we will use MEGAHIT with the individual assembly approach. Let me go back to check if all the data has been downloaded; it's still loading. What I can do is check whether I can already set up MEGAHIT: even while the data is loading, we can probably already prepare the next step. One thing you can do is search for MEGAHIT here in the tool panel. Another thing you can do, thanks to the integration of the training material directly in Galaxy: if you click on MEGAHIT in the tutorial, it loads the tool directly in the middle panel. Then we need to select which type of input we have; in this case we have a paired collection, so we select the paired collection option. Let me go back to my history: I want to run it individually, and as you see, I need to wait a bit longer, because the collection is not detected yet. Once all the datasets are downloaded, the collection will be available here, so it should be okay. I will ask you to wait a few more minutes until it's done. Now I see all the data in forward and reverse; if I click on one, I have fastqsanger, so I really have a FASTQ file here. If I click on MEGAHIT, I can now select the paired collection, the raw reads. We definitely need to select "run individually". If instead you want to run a co-assembly, you select "merge all fastq pair-end", and that will run everything in co-assembly mode. Now we need to select the raw reads, and we want to specify the k-mer list. So what is a k-mer? MEGAHIT works by building a de Bruijn graph. Do we have an overview of what a de Bruijn graph is? I think it's at the top, in the strategy box. In a de Bruijn graph, the reads are divided into k-mers. 
A k-mer is a substring of length k, and we try to build a k-mer graph by linking the k-mers together: we take all the k-mers from all the reads, build the graph, and afterwards select paths through this graph. For example, here we see that this one and this one are connected, and this one and this one, because they overlap; and once you have that, you can also merge nodes because of the overlaps. You can merge this one with the one after it, because of this edge here, and then combine with the next one, and so on, so you can really build a path to create longer sequences afterwards. But the first step is to cut your reads into k-mers, substrings of length k. Here we want to select specific k-mer sizes, starting with the minimum k-mer size. MEGAHIT uses an approach where it tries several k-mer sizes to build the graphs and identifies the contigs based on that. Here we want a minimum k-mer size of 21 and a maximum of 91, with an iteration step of 12, so that is what we set up: we choose "specify min/max", enter 21 and 91, and a k-step of 12. I think that's everything, and once we have that, we can launch MEGAHIT. Again, MEGAHIT will take a lot of time to run: first, it needs a lot of resources, so it's a tool that may require a lot of memory, and we don't have many nodes where it can run, at least on Galaxy Europe, so you need to wait until those cluster nodes are available. That is why the job is grey, and it will stay grey for some time. After that, the run itself, because these are small datasets, should be okay, not that long. So it will take some time to run. 
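The k-mer and path idea above can be made concrete with a minimal sketch. This is a toy de Bruijn graph, not MEGAHIT's implementation: reads are cut into k-mers, k-mers overlapping by k-1 bases become edges, and walking an unambiguous path recovers a longer sequence. The reads and k value are invented toy data, and the walk assumes the graph has no cycles.

```python
# Toy de Bruijn graph: cut reads into k-mers, link nodes that overlap by
# k-1 bases, then walk unambiguous paths to recover a longer sequence.
from collections import defaultdict

def kmers(read, k):
    """All substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_graph(reads, k):
    """Edges between (k-1)-mers: prefix -> set of suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow the graph while each node has exactly one successor
    (toy: assumes no cycles)."""
    contig, node = start, start
    while len(graph.get(node, set())) == 1:
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]   # three overlapping toy reads
graph = build_graph(reads, k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCAAT
```

The three short reads are merged into one longer sequence because their k-mers chain together through shared overlaps, which is exactly the intuition behind the contigs MEGAHIT produces.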
It will produce a collection of outputs: if you already look at what is there, you will see one output per sample, a contig file, which is a FASTA file with the contigs assembled for that sample. This dataset can be used afterwards, for example for binning, for de-replication, for any step that comes after the assembly. As I said, these files contain the contigs: contiguous sequences of a certain length. metaSPAdes can also output scaffolds, which MEGAHIT does not. Scaffolds are like contigs plus: they are contigs that have been combined together, but with a gap in between. A contig is a contiguous sequence where we know every base pair; in a scaffold we combine different contigs, but with a gap of known (estimated) length in the middle, where we don't know exactly which base pairs are there. We can build scaffolds because of what we know about the gaps: for example, with paired-end data, we know that the forward and reverse reads usually don't overlap and that there is a certain distance between them, and we can use this information to join contigs into a scaffold, with an estimate of the number of bases between the two contigs. 
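Since each output is a plain FASTA file of contigs, questions like "how many contigs are there, and how long is the shortest one?" can also be answered with a few lines of Python outside Galaxy. The two records below are invented toy data standing in for real MEGAHIT output; the header format is only an approximation of what MEGAHIT writes.

```python
# Count contigs and find the shortest one in a (toy) contig FASTA.
from io import StringIO

def fasta_lengths(handle):
    """Return the length of every sequence in a FASTA stream."""
    lengths = []
    for line in handle:
        if line.startswith(">"):
            lengths.append(0)          # start a new record
        elif lengths:
            lengths[-1] += len(line.strip())
    return lengths

toy = StringIO(">k141_0 len=12\nATGGCGTGCAAT\n>k141_1 len=8\nTTGACCAA\n")
lengths = fasta_lengths(toy)
print(f"{len(lengths)} contigs, shortest = {min(lengths)} bp")
# 2 contigs, shortest = 8 bp
```

On the real tutorial data you would open the downloaded contig file instead of the `StringIO` toy, and get counts like the ones discussed below (around 122,000 contigs for one of the samples).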
A recommendation: because the assembly (I forgot to say) will take around one hour to run, one thing you can do is import the contigs we already generated for you, using the same approach as before. So again, sorry, I was too fast here: you go to the tutorial, you click on "copy" to copy the links, then you go to "Upload Data", "Paste/Fetch data", paste the links there, and start the download from Zenodo to your history, then close. Then you can create a collection again: you select the datasets here, because we want them organized as a collection, click on "All 8 selected", and build a dataset list; this time we have only one dataset per sample. Then we want to fix the names: I just want to keep the sample names, so I rename each one to remove the ".contigs.fasta" part, the same here again. Then, with all selected, I name the collection "MEGAHIT output" and create it; it should appear here, hopefully. The output of MEGAHIT, as I said, is a FASTA file per sample, and from this file we can identify how many contigs have been found for each sample, for example for 68 or for 72, and what the minimum length of the contigs is. It's again loading, so I will wait a bit until it's done and come back. So either you manage to get the Zenodo data, or you can wait until MEGAHIT is done and do something else in parallel: I recommend you either take a break, go for a coffee, or look at what the next tutorial is that you want to follow after this one. It's now uploaded, so the output from the 
assembly we already prepared for you is available here. If you look at the contigs, you have a FASTA file, and you can see the number of sequences: for example, for this sample we have 122,000 sequences, meaning 122,000 contigs, and for this one here we have double that number of contigs. The next question is: what is the minimum length of the contigs? To be honest, I don't know the answer off-hand; the sequences seem to be bigger than 200 base pairs, based on the length given in the sequence headers of the FASTA file. I don't see where this information is; it's probably in the report, so we need to wait until MEGAHIT is finished to be able to see it, sorry for that. Oh no, sorry, wrong place; so we are here. If you want to do the co-assembly with metaSPAdes, there is an explanation here; I won't run it, it's quite similar, with not that many differences. Once we have done the assembly, before going further and doing for example binning or other steps, it's good to check the quality of the assembly, and there are tools for that, for example MetaQUAST. QUAST is well known for reporting on assembly quality, for running quality control on assemblies, and it has a metagenomics mode called MetaQUAST. So we need to run MetaQUAST: we click on QUAST; why it's not loading every time, I don't know, it's not loading the tool itself. Okay, let's find it another way: you can search for QUAST here. When you have QUAST, you say which type of mode you have; here we are in individual assembly mode. Then you say you want to use the dataset names, the names in your collection, and you select the correct collection. I could already run it on the assembly 
from MEGAHIT, directly the one that we are currently running: QUAST will just wait until that job is done before starting, so I can do that. Then the question is: do you have the read options? I think we say yes, we want to give the reads; we have paired-end data in a collection. The idea is that when you give the read options, QUAST will map the original reads back to the assembly, to try to identify how much of the original reads has been used to build the contigs. Then, which type of assembly do we have? We have metagenomes here. It will also try to do a quick taxonomic assignment of the sequences, to see whether there are also differences in quality depending on the taxonomy: for example, for a specific bacterium the quality may be worse than for other organisms, maybe because of low abundance or other reasons. If you give a reference and select, for example, the SILVA database, it will map the contigs on the SILVA database and do the quality control only for the contigs that map to a specific taxon. So here we select the SILVA database. For all the other parameters you don't really need to care, and for the outputs, I think we want all of them: HTML report, PDF report, tabular reports, log file and everything. Once it's ready, you can run the tool. As I said, in my case I use the running assembly as input, so QUAST will wait until it is done before launching. And again, QUAST also takes an awfully long time to run, to be honest, so it's normal if it takes a lot of time; have a break, take a break there. 
Okay, it's running, it takes time. Another thing you can do is import the HTML report that has been generated by QUAST, if you want to go faster. On my side, I move to another history where I already ran QUAST, so I can show you the results; it's really the same thing: the raw data, the assembly, and then QUAST run directly. QUAST generates several outputs, and the most interesting one in our case is this HTML report. So let's open one: you click here, and it opens the report at the bottom. To see it better, you can click at the bottom left to hide the toolbar, because we don't really need it. So what is QUAST doing? QUAST creates an HTML report with some statistics at the top, then some plots, and some references here. We can go through the report, and the first thing we see is this genome fraction. The genome fraction, as explained here, is the percentage of bases of the reference that are covered by aligned contigs; here it's about 25%. The question is: what is the genome fraction for samples 68 and 72? Because I will usually have questions where I want to compare two samples, for example the reports of two samples, I use the Window Manager: I click on the Window Manager at the top to get this checkbox, then I click on the reports that I want, 68 and 72, so both open, and I can have the reports side by side; I can expand them to the bottom, here and here, so I can really compare things. 
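To make the two headline metrics of the report concrete, here is a toy computation. The definitions are a simplified paraphrase of QUAST's (genome fraction: percentage of reference bases covered by at least one aligned contig; duplication ratio: total aligned bases divided by covered bases), and the reference length and alignment coordinates below are invented.

```python
# Toy computation of genome fraction and duplication ratio
# (simplified paraphrase of the QUAST metrics; invented coordinates).

reference_length = 1000
# half-open (start, end) intervals where contigs align on the reference
alignments = [(0, 300), (250, 500), (700, 900)]

covered = set()
for start, end in alignments:
    covered.update(range(start, end))        # reference bases hit at least once

genome_fraction = 100 * len(covered) / reference_length
total_aligned = sum(end - start for start, end in alignments)
duplication_ratio = total_aligned / len(covered)

print(f"genome fraction = {genome_fraction:.1f}%")    # 70.0%
print(f"duplication ratio = {duplication_ratio:.2f}") # 1.07
```

Note how the 50-base overlap between the first two alignments pushes the duplication ratio slightly above 1 without changing the genome fraction: covered bases are only counted once.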
So here, in sample 68, about 30% of the bases are aligned; to be precise, it is the percentage of aligned bases in the reference, where a base in the reference genome counts as aligned if at least one contig aligns to it. We did not provide any reference, but MetaQUAST tried to identify references by aligning against the SILVA database, and then, for each identified genome, the genome fraction is given per reference. So if we take this organism here, 10% of the bases of its reference genome are found in the contigs; if we take this Leuconostoc pseudomesenteroides, 84% of the bases of that organism are found in the contigs; and if we look further, we have one at 90%, a Lactobacillus: 90% of the bases of its reference genome are found in the contigs. That is what is mentioned here, the Lactobacillus and the Leuconostoc, as we said (though I think there is a small mistake here). Then there is another metric called the duplication ratio: do we have some bases that are found several times? Here the duplication ratio is quite low. If, for example, we had many contigs covering exactly the same region of the reference, the ratio would be much higher than one; here we are close to one, so we are covering things evenly. Then we have other information, like the NGA50 and other metrics, but here we will focus on the read mapping: we took the raw reads, the ones used to generate the assembly, and mapped them back to the contigs (at least, QUAST did), and then we 
compute the percentage of reads that are mapped to the assembly, so in this case we have an almost 80% of the basis in the raw reads that are found in context, so it's mean a lot of the reads are being used for generating the context, here same, so we have a left only a small leftover here, ah I will not manage to do that, and yeah, so it's mean that if we have a percentage, it's mean that 90% of the reads has been used for generating the assembly, okay next point is the misassemblies, so when a misassemblies is when there is errors that has been done during the assembly, so Quest has the possibility to identify misassemblies by mapping the context to the reference genome of the organism that has been identified before, and using that three types of misassemblies can be identified, for example the relocation, so it's mean so for example we have this context in blue and that is made of two parts, the blue part and the green part, and we see that we have the chromosome one and chromosome two, so when we have relocation it's mean that when the mapping of two parts are the same chromosome but they are maybe separated by a map region here, or they map on the same chromosome but with overlaps between the two parts that they mapped, so for example they map ear and ear and there is an overlaps between the two parts that are mapped here, so that is the first part of the first things we can look, how many relocation can we found, and to get that we need to click on misassemblies, oh where is no I think we need to click on the extended and here we see there, so we have if you click here on the bottom again if you click on extended report here and you scroll up again you have the number of misassemblies so I need to, and we have no number of relocation, so we have 78 relocation, we have a lot we have 187 misassemblies, 78 relocation and more than 106 50 for for the order of some part, then there is another thing that could appears is what is called a translocation, so when 
when a context is mapped to several locations, so when part of the context can be mapped on different chromosomes or different organisms like ear and blue, so you see the blue part is mapped on the chromosome one and the and the green part on the chromosome two, it can be chromosome one or reference one, reference two, and how many relocation has been found for the sample, so we can see we have 80s 25, 65 and for which we are, for which organisms do we have that, I think we need to expand that, so we see a lot for leuconostoc here, and for the translocation sorry here we have a lot, so we have it's a mix between lactilobacillus and this VIX vaccine or blah blah blah whatever, and then we have some calls that are inter species translocation, so the inter inter species is when the translocation appears between organisms, so it's so the translocation can be on different chromosomes for example of an organism so of a plasmid and the chromosome, the inter species translocation is when the context maps on different organisms, so part of the context maps on different organisms, that is a difference, and another thing that can appear is the inversions, so here we see that the context is going in this direction with the error there in the same direction for both parts, but when we map we see that the blue is going in this direction and the green in the other direction, so it's an inversion of the things, and we see that we have a really low numbers of inversion in both cases. 
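The three misassembly types can be summarized in a small sketch. This is a simplified illustration of the definitions above, not QUAST's real algorithm: the tuple layout for an alignment is invented for the example, and the 1 kbp gap threshold mirrors QUAST's default cutoff for calling an extensive misassembly.

```python
# Simplified classifier for the misassembly types described above.
# Each flanking alignment of a contig is a hypothetical tuple:
# (organism, chromosome, start, end, strand) on the reference.

def classify(a, b, max_gap=1000):
    """Classify the breakpoint between the two flanking alignments of a contig."""
    org_a, chrom_a, _, end_a, strand_a = a
    org_b, chrom_b, start_b, _, strand_b = b
    if org_a != org_b:
        return "interspecies translocation"  # parts map to different organisms
    if chrom_a != chrom_b:
        return "translocation"               # same organism, different sequences
    if strand_a != strand_b:
        return "inversion"                   # parts map in opposite directions
    if abs(start_b - end_a) > max_gap:
        return "relocation"                  # parts far apart, or large overlap
    return "consistent"

# The blue/green contig example: both parts on chr1 but 4.5 kbp apart.
print(classify(("org1", "chr1", 0, 500, "+"),
               ("org1", "chr1", 5000, 5500, "+")))  # relocation
```

The order of the checks matters: a breakpoint between two organisms is always reported as an interspecies translocation, even if the strands also differ.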
Then another thing that is recorded in the report is the mismatches: these are mismatches in the alignments between the contigs and the reference, for example when a base changes from an A to a G. We see that the numbers are quite okay here, that there are differences between the different organisms, and that it is a bit lower for the other sample than for the first one. QUAST also generates statistics that don't need a reference, so even if we didn't provide any reference or any reads to map, QUAST would still report them. First the number of contigs: here we see a fairly high number, and you may notice the numbers are different from the ones we saw in the FASTA file. That is because QUAST only reports contigs above a certain threshold, by default 500 base pairs; if we want to see the full numbers, the row for contigs above 0 base pairs gives the same count as the number of sequences in the MEGAHIT output. Then we have information about the largest contig: here we have a contig of 63,000 base pairs, and in the other sample the largest contig is almost the same size.

Then we have information about the N50: the length for which the collection of all contigs of that length or longer covers at least half of the assembly. I really struggled a lot with this N50; it is a metric that is used a lot in assembly, and it somehow defines the assembly quality in terms of contiguity. If all the contigs of an assembly are ordered by length, from the longest to the shortest, the N50 is the minimum contig length such that the contigs of at least that length contain 50% of the assembled bases. So for example, an N50 of 10,000 base pairs means that 50% of the assembled bases are contained in contigs of at least 10,000 base pairs. There is another example described in the tutorial: assume we have nine contigs of lengths 2, 3, 4, 5, 6, 7, 8, 9 and 10. The sum of all the lengths is 54, so half of it is 27, and if we start from the longest contigs, 10 plus 9 is 19, plus 8 is 27, we reach exactly half of the sum, so the N50 is 8. In our case we have an N50 of almost 1,000, and more than 1,000 for the second sample, which means that at least half of the assembled bases are in contigs of at least that size. For the N90 it is the same idea but with a threshold of 90% instead of 50%: 90% of the assembled bases are in contigs of at least the reported length. It is a bit complicated. One caveat: when we compare N50 values for different assemblies, we need to have similar assembly sizes to be sure the comparison is meaningful, so the N50 alone is not really a useful measure of assembly quality; for example, two assemblies can have different length distributions but the same N50, while one is really much more contiguous than the other. Another thing you can look at is the L50, the number of contigs equal to or longer than the N50, in other words the minimum number of contigs that cover half of the assembly; in the example above the L50 is 3. Here it is this value for the first sample, and really lower for the second sample.

Another thing we have access to with QUAST is the Icarus contig browser: if you click at the very top of the QUAST result on "View in Icarus contig browser", you first get a
report, and then you can click on the contig size viewer at the top. There you can look at the contigs; I need to make the view bigger, and I can zoom in to a specific region instead of the full length. The contigs are again ordered by size, from the longest to the shortest, and you can inspect them one by one if you want. You can also check whether they are correct contigs, meaning that they align to a reference genome, and see which reference genome they align to by clicking on them. The ones in gray are not aligned, in red you have the misassemblies, for which you can identify the type, a translocation here for example, and the first one is in white because some parts of it are aligned and some are not. If you go back to the main menu, on the first page you can also look at the contigs aligned to each reference genome: for a specific organism you can open the viewer and look at its contigs, which are organized at the top by how they map to the reference, with the colors again explained on the right, and you can zoom in on certain sections. I think I took the wrong one; I wanted to take the organism that has the most contigs. So here you can zoom, look at the contigs themselves, see where they are in the reference genome and move around it. The different colors indicate the type of contig we have, with red blocks meaning a misassembly, and in the graphs at the bottom you can see the GC percentage on the right and the coverage, in blue, on the left side.

Once you have done that, another thing you can do is look at a visualization of the de novo assembly graph, using Bandage for that. There is an explanation in the tutorial on how to do it if you want to see what the graph looks like. We did it, and you can see all the contigs combined; it is quite hard to read these pictures, but you can see a long path for the longest contig, and then the different contigs and the different branches that appear, and there are some questions about it in the tutorial.

So I think we are at the end of this tutorial. I hope you learned a bit about metagenomic assembly and what can be done with it, especially to obtain the genomes of the species or organisms that are found in your samples. We know that metagenomic assembly is complex, but there are different approaches, like the de Bruijn graph approach, different strategies for doing metagenomic assembly, like co-assembly or individual assembly, and different tools, like metaSPAdes or MEGAHIT. Once your choice is made, you can start by assembling the reads into contigs, check the quality, and visualize the graphs afterwards. Then, once you are sure about the quality of your contigs, you can move forward to the next step, for example the binning, and there will be a tutorial available for that. I think I'm done, I hope you learned something, and I hope to see you around. Thank you!
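To recap the N50 and L50 metrics discussed in the report section, here is a small Python sketch (my own illustration, not QUAST code) using the tutorial's example of nine contigs of lengths 2 through 10:

```python
# N50/L50 as defined in the tutorial: sort contigs from longest to shortest
# and accumulate lengths until half of the assembled bases are reached.

def n50_l50(lengths, fraction=0.5):
    """Return (Nx, Lx): the smallest contig length such that contigs of at
    least that length contain `fraction` of the assembled bases, and how
    many contigs that takes."""
    target = fraction * sum(lengths)
    total = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return length, count  # (N50, L50) for fraction=0.5

# The tutorial's example: total = 54, half = 27, and 10 + 9 + 8 = 27.
print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # (8, 3)
```

The same function with `fraction=0.9` gives the N90 and L90, which is exactly the "same idea with a threshold of 90%" mentioned above.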