Welcome, everyone, to this tutorial on large genome assembly and polishing. Here is the tutorial material. While I'm presenting, I'll have this open in a different window, and I'll show you the steps that I'm following in the Galaxy window.

In this tutorial we're looking at what genome assembly is, so I recommend reading the introductory material here if you're not familiar with some of these concepts. We're also looking at some of the challenges, particularly in assembling these large genomes back into what we hypothesize are their original chromosomes. There are a lot of challenges here. We'll definitely see that with this test data set, and you'll no doubt see it with any real data sets that you try to assemble as well. That's just part of the current landscape of genome assembly, where we're still in the process of developing all of the tools and the backend computation to work with these huge volumes of data.

You should be able to use your own real research data with the workflows described here. We do always recommend, though, checking each tool separately on any data that you want to use, and also perhaps sub-sampling your real data sets so that you find any problems quickly, rather than setting a big job running and having to wait a long time. If there are any issues with the tools, or with the data running on those tools, contact the people who administer your particular Galaxy server.

Today I'll be showing these workflows running on the Galaxy Australia server. Again, if you are using your real data with these workflows, keep in mind that so many genomes have never been assembled before, so it's likely there will be particular settings in the tools that you'll need to change or customise so that they're relevant for your input data as well as your research questions.
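As a rough illustration of the sub-sampling suggestion above, here is a minimal Python sketch, not part of the tutorial workflows. It assumes plain four-line FASTQ records already read into a list of lines; the function name `subsample_fastq` and its parameters are hypothetical. Because it makes exactly one random draw per record, running it with the same seed on paired R1 and R2 files that list their reads in the same order keeps matching pairs.

```python
import random


def subsample_fastq(lines, fraction, seed=42):
    """Keep roughly `fraction` of the 4-line FASTQ records in `lines`.

    Assumption: every record is exactly four lines (header, sequence,
    '+', qualities). Trailing lines that do not form a full record are ignored.
    """
    rng = random.Random(seed)
    kept = []
    usable = len(lines) - len(lines) % 4
    for start in range(0, usable, 4):
        # One random draw per record, so the same seed selects the same
        # record positions in a paired R1/R2 file.
        if rng.random() < fraction:
            kept.extend(lines[start:start + 4])
    return kept


# Small made-up example: 100 identical toy reads.
reads = []
for n in range(100):
    reads += [f"@read{n}", "ACGT", "+", "IIII"]

subset = subsample_fastq(reads, fraction=0.1)
print(len(subset) // 4, "of", len(reads) // 4, "reads kept")
```

Testing each tool on a small subset like this first means a wrong setting shows up in minutes rather than after a long job.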
So in an ideal world we would be able to make perfect assembly workflows that you could just put your data into and run. But I think if you want to get the most out of your data and your research, you really want to be thinking about what those tools are doing and whether that's working well with your particular data type.

In this tutorial we're covering various steps. This is a picture of the analysis workflow that we'll be following. It involves quality control, counting the k-mers in the reads to get an idea of the genome characteristics, trimming and filtering reads, assembling and polishing the genome, and then trying to assess the quality of our genome assembly. We'll actually run these as separate workflows, and we may change some of the parameters in each workflow as we run them. So rather than running all of the separate tools, we'll be running a set of workflows. The tutorial also covers how you can run workflows in Galaxy and where you can import these workflows from.

Okay, so now we'll start with importing the data. Let's give our Galaxy history a name. In the tutorial document there is a list of the files that we need to import, so I'll copy that list, then go here into Upload Data, Paste/Fetch Data, paste those file names in there, then Start and Close. These files will now start to load into the current history.

Now our files have imported, and we just need to check that they're in the correct format. You'll notice you've got three eucalyptus files with the sequencing reads in them, and we need to check they're in the format fastq.gz. If they're not, click on the pencil icon for each of these three eucalyptus files and assign the correct datatype, which is fastq.gz. Sometimes Galaxy will upload them as fastqsanger.gz, but they need to be fastq.gz for this tutorial.
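If you want to sanity-check a reads file on your own computer before uploading it, something like the following Python sketch could help. It is only an illustration: it inspects the file contents (the gzip signature and the shape of the first record) and knows nothing about the datatype label that Galaxy assigns in the history. The function name `looks_like_fastq_gz` is hypothetical.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def looks_like_fastq_gz(data: bytes) -> bool:
    """Return True if `data` is gzip-compressed and its first record looks like FASTQ."""
    if not data.startswith(GZIP_MAGIC):
        return False
    try:
        with gzip.open(io.BytesIO(data), "rt", encoding="ascii") as handle:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
    except (OSError, EOFError, UnicodeDecodeError):
        # Corrupt or truncated gzip data, or non-text content.
        return False
    header, seq, plus, qual = record
    return (
        header.startswith("@")
        and plus.startswith("+")
        and len(seq) > 0
        and len(seq) == len(qual)
    )


# Made-up one-read example, compressed in memory.
example = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
print(looks_like_fastq_gz(example))
print(looks_like_fastq_gz(b"@read1\nACGT\n+\nIIII\n"))
```

For a real file you would pass in the bytes read from disk, for example the first few kilobytes are not enough for gzip in general, so reading the whole file or streaming it with `gzip.open` on the path is the safer choice.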
The next step is to check the read quality. For this step we're going to run our first workflow, which is the Data QC workflow, and this is going to report some statistics from our sequencing reads. The tutorial document explains how we can import this workflow, and I'll just show you that here. We copy the link from the tutorial and import it into our own set of Galaxy workflows, and it should now appear in our list of workflows. Your list will look different to mine, but it should load as the top workflow.

Once that workflow has loaded into your list, we can run it with the Run button. In this case I usually expand to the full workflow form, just so I can see exactly what's happening. We won't send our results to a new history; we're going to do all of our workflows in the current history. We need to check our input files are correct. Galaxy will guess, but we need to check: for long reads, we want to make sure it's the nano file, the nanopore reads, so that's correct; for our Illumina reads, we want to check that it's the R1 fastq reads; and then R2 for the second Illumina reads input. So those are our paired Illumina reads.

In this particular workflow all of the parameters have already been set to particular defaults. You can always look at those by looking at the workflow, either here or in the workflow canvas, and you can change anything that you want to change. We won't look at all these settings now, but we will run our workflow.

The workflow is now running. We can see the workflow invocation, which describes the inputs, the outputs and the steps that are being run, and it will fill out as the workflow runs. It shows us what inputs we used in this particular run of the workflow, what outputs we're hoping to get, and more detail about all of the steps in this workflow. To get a full idea of what's in here, have a look in the tutorial document and also at the workflow
itself, where you can see the tools that have been used, which in this case are the NanoPlot tool and the FastQC tool, with those results then combined by the MultiQC tool. We can see our jobs are starting to run here in the history, and when they've finished running they'll turn green.

Let's have a look at some of our results. First we'll look at the MultiQC results, and we'll just collapse some of our side panels so we can see a bit more clearly. This summarizes our Illumina read results from FastQC. For example, one of the things we can look at is the sequence quality histograms, and we can see that the reads are of quite good quality: there's some drop-off in quality at the end, but not very much at all. We could also look at our NanoPlot results. Let's have a look at that HTML report, and in particular at read lengths versus average read quality, and we can get lots of information there. From these QC results you might then make further decisions: firstly, whether your reads are of high enough quality for the particular analysis you want to do, and also whether you need particular settings for any filtering or trimming of reads according to their quality.

Now we'll look at some of the genome characteristics, and we'll do this by counting the k-mers in the sequencing reads. There's more about this in the tutorial document. I've imported the workflow for this one already. This is the k-mer counting workflow, and we can look at it in our workflow list; I imported it the same way as the Data QC workflow in the previous step. So here's our k-mer counting workflow. With all these workflows, one way of looking at them in more detail is clicking the Edit button, and then you can view them in what's called the workflow canvas. This is just a nice way of seeing how all the tools and the inputs and outputs are connected together. This one's not a
particularly complicated one. We'll collapse our side panels here. So this is a very simple one, but it gives you the idea of how things are connected: there's the input going into this particular tool, that tool's output is going into this tool, and then these are all outputs from the actual workflow. This is another place where you can see all of the settings that have been put into the workflow, and you can decide whether they're right for your data or whether you want to change them. For example, let's have a look at the Meryl settings by clicking on that tool. It brings up the tool over here on the right with all of the settings that have been put in, so these are some of the things that you may want to change for various reasons.

We'll exit back out of the workflow canvas, and again we've got our workflow here in the list. We're going to run it, again with the results going into our current history. All we need to do is check that we've got the correct input file here, just the R1 file, and click Run.

We have our results now. One of the things we'll look at is the GenomeScope transformed linear plot, with the eye icon; I'll just make it a little bit smaller. This is a graph of the results of our k-mer counting, and it tells us some of the estimates it's made about the genome size and the ploidy. This is discussed more in the tutorial, but it's quite a useful graph to get out of our k-mer counting.

Now we'll run the step for trimming and filtering the reads. I've already imported this workflow, so we can run it here directly. We need to give it the correct inputs, so check it's got R1 here, R2, and the nanopore reads here, and we can run that. In the tutorial I discuss a lot more about what settings have been put in there as defaults, but note that you will most likely want to change some or all of those existing settings if you're working with real data.

Let's have a look at some of our results. We'll have a look
at the fastp results for the Illumina reads. Click here on the eye icon, and we get a lot of information about the results of fastp, from both the trimming and the filtering that we did in this workflow.

Now we're up to the genome assembly stage. We're going to do that with an assembly tool called Flye. Again, I've already imported this workflow; you can see it here at the top of the workflow list, so import it into your own list of workflows. Then let's run it, and we'll expand to the full workflow form. For this one we need to give it the input long reads, and we'll give it the reads that went through the trimming and filtering step, so these are the fastp-filtered long reads. Then we can see what tools this workflow is using. The main tool is the Flye assembly tool, with all of its settings in there, and then we run some tools to look at the results of that assembly: Fasta Statistics, Bandage Image to get a visualization of the actual assembly graph, and the QUAST genome report. We can just run this directly from here, and it may take a little bit of time.

The workflow has now finished and all of our output files are at the top of our current history. We'll have a look at the QUAST results, so click on the QUAST tabular report. This gives us some really basic statistics about our assembly. You can see the total length here is almost 10 million base pairs, our largest contig is 235,000 base pairs, and we have 152 contigs. Your results probably won't be exactly the same, but they'll likely be similar. We can also have a look at the QUAST HTML report and look at the actual contigs, just to get a visualization of the sizes of the contigs and their distribution. And we can have a look at the Bandage image of the assembly graph. We can see that lots of our contigs have been joined, here in this top-left picture, and we also have a lot of unjoined contigs, some of which appear to be circular.

We're now up to the assembly
polishing stage. I have imported the workflow, and we can have a look at it in the workflow canvas. This is actually quite a complicated workflow because it involves subworkflows. In here we have three main steps, which are to polish the assembly with Racon, then with Medaka, and then with Racon again. But each of those Racon steps is its own workflow with its own set of steps, including mapping with minimap2 and iterating the polishing step several times, so it's much simpler to view when each subworkflow is placed in as a single box.

We can run that now. What we need to give it as input is the assembly that we want to polish, so this will be our Flye assembly, the consensus. Just double-check that's the correct one: it should be a FASTA file, and that's correct, so yes, that's the assembly we want to polish. Then we need to give it the long reads. We want the same set that we actually used in making this assembly, which in this case was the output from fastp, so we want the fastp-filtered long reads. We need to check this setting is correct: we have Oxford Nanopore data, so that's correct here. We'll also give it the Illumina reads, and we can run that.

Now we can look at our polishing results. First we can look at our original genome assembly and its length in the Fasta Statistics output: the original length was about 9.8 million base pairs. After all these polishing steps, our final assembly is here, the Racon polished assembly FASTA file, and the stats on that show the assembly is slightly smaller, about 9.6 million base pairs. We would expect that; it often happens with polishing, as some of the mistakes get removed.

Now we'll assess the quality of the genome that we assembled. We have a workflow for this that I've already imported, so let's run that. We need to give it the polished assembly, which is file 61 in
this case (yours might have a different number), but it's the Racon short-reads polished assembly. We're also going to give it a reference genome. When we originally imported our files, we imported a reference genome, Arabidopsis, in a FASTA file. That is not closely related to the data we're using, which is from a eucalyptus, but we're just going to use it here as an example to see what the results might look like. And we'll run that workflow.

Now we're looking at the results from assessing the quality of our assembled genome. Our BUSCO jobs are still running, but we'll have a look at the QUAST results. Looking at the HTML report, we'll open the contig browser, and this gives you an example of how well your assembled contigs have matched your reference genome. Ideally you would put in a closely related reference genome here. We haven't in this case, we've just used Arabidopsis, but in a real analysis you could. Following that, you'll also get results from BUSCO, which show you whether the expected genes were found in those assembled contigs.

So let's have a look at a summary of all the things that we've done in this tutorial. This is the flowchart again, showing all the steps: quality control using FastQC and NanoPlot; k-mer counting, which used the tools Meryl and GenomeScope; then trimming and filtering the reads using fastp. From there we assembled those trimmed and filtered reads using the Flye assembly tool, and we looked at some of the results; in particular, we used the Bandage tool to look at the genome assembly graph. Then we polished the assembly using Racon and Medaka, and finally we assessed the quality of our assembled genome using the tools BUSCO and QUAST.

Thanks for following today's tutorial on large genome assembly and polishing. We hope it's been useful to you, whether you're using this test data or your own research data, and if possible we would love
any feedback that you can leave by clicking on this link at the bottom of the tutorial. Thank you.
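As a recap of the k-mer counting step, here is a minimal Python sketch of the underlying idea: slide a window of length k along each read, count how often each k-mer occurs, then summarize how many distinct k-mers were seen once, twice, and so on. That summary is the kind of histogram a genome profiling tool works from. This is an illustration only and is not how the workflow's tools are implemented; the function names are hypothetical, and unlike typical k-mer counters it does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Two made-up, overlapping toy reads.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
print(dict(counts))
print(dict(kmer_histogram(counts)))
```

With real sequencing data, k-mers from the genome pile up around the sequencing depth while k-mers containing errors mostly appear only once or twice, which is why the shape of this histogram carries information about genome size and ploidy.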