Hi everyone, welcome to today's tutorial on chloroplast genome assembly. This is a tutorial in the Galaxy Training Network and we'll be following it today. So what we're going to look at is how we can assemble a chloroplast genome. To do that, we'll assemble a genome from long reads. Then we'll polish that assembly with short reads. Then we'll annotate the assembly and look at the annotation. And then finally we're going to take a subset of the reads and map them back to the assembly and see how they look.

So let's have a look at the chloroplast itself. The chloroplast is a photosynthetic organelle found in plants and algae. Here are some: they're the little green circles within those larger cells, and you can see this under a microscope. It's thought that these chloroplasts were originally a type of photosynthetic bacteria that was ingested by another cell, and that has now become the chloroplast within the plant cell. A similar thing is thought to have happened with the mitochondria.

So because these chloroplasts are thought to have originally been bacteria, the chloroplast genome is somewhat similar to bacterial genomes. It's usually circular, but it's much smaller than typical bacterial genomes. So in plants a chloroplast genome might be about 160,000 base pairs. And they're usually in these four parts: a large single copy region, a small single copy region, and then two areas which are the same or very similar and in different orientations. Those are called the inverted repeats.

So there might be lots of chloroplasts within a plant cell, so there might be lots of these chloroplast genomes. Then the plant will typically have a very large nuclear genome. And there might also be a lot of mitochondrial genomes in the mitochondria.
So this is just to get an idea of the relative proportions of those three genomes within the plant cell: the large nuclear genome, the several medium-sized mitochondrial genomes, and then many quite small chloroplast genomes. That's an oversimplification, but I just wanted to get across this idea of there being multiple genomes. And today we're going to look purely at the chloroplast genome.

So today we'll do genome assembly. This is the process where we piece together lots of smaller DNA fragments and join them into longer pieces, hopefully representing the original chromosome or chromosomes from which those reads came. In this tutorial today, we're using long reads from Oxford Nanopore sequencing and short reads from Illumina sequencing. The dataset we're using is a real dataset from a sweet potato sequencing project, but I've cut the data down into a much smaller subset so that it's easier to use in this tutorial setting. So the chromosome that we're assembling today will be about 160,000 base pairs long. And that's obviously going to be an easier thing for us to do than to try and assemble a nuclear genome, which is very much larger.

Let's start with uploading the data. What I'm going to do is keep the tutorial information open in one tab and a Galaxy window open in another tab. So here is my Galaxy window and here's the tutorial information. This tutorial should work on all the major Galaxy servers. It should work on Galaxy Main, Galaxy Europe and Galaxy Australia.

Okay, so to start with, let's open up Galaxy. Log in, or register first if you haven't used Galaxy before. Then we will name this history, and now we'll get our data files in. We're going to import those from a repository on Zenodo. The tutorial lists where they are, and we can just copy those links with the copy button. Then back in Galaxy we'll go to Get Data, Paste/Fetch Data, paste in those links, then click Start and then Close.
And we'll see those files start loading here. They're in progress now, and when they've loaded they'll turn green. The data we're loading here is one set of Nanopore reads in FASTQ format and one set of Illumina reads, also in FASTQ format. They may take a few minutes to load depending on how busy Galaxy is.

Let's have a look at our files now that they're in the history. We can do this with the eye icon next to the file, just to have a look at what's in that file and check it's how we expect it to be. So this is our Illumina FASTQ reads. We can see that this looks as we would expect. We have the read header. Then we have the read itself, so the DNA sequence. Then we have a separator line with just a plus sign, and then we have this line of quality scores, which are encoded by symbols and letters. So that looks correct. We've got lots of reads there.

Let's have a look at our Nanopore reads. And again, that looks correct. We've got read headers. We've got the read sequences. Then the separator line, and then the read quality scores for each of those positions. And then the next read starts, and so on. So that looks correct.

One thing that we'll do here is have a quick look at the Nanopore read quality. That's the second section here in the tutorial. We're going to do that with a tool called NanoPlot. So I'm going to type that tool name into the tool panel search, and we've found it here. All we really need to do is give it the correct file. So make sure we're giving it the Nanopore file, because that's the one we want to look at here, and click Execute. That job has now loaded in Galaxy. There are five output files that are going to be produced from the NanoPlot tool, and the tool is now running.

There are lots of things you can do at this step where we're looking at read quality. There are various other tools you can use, and you can look at your Nanopore reads, your Illumina reads or whatever data you have. So that tool is finished.
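As a rough illustration of the FASTQ layout just described (header, sequence, separator, quality scores), here is a minimal Python sketch. It is not what Galaxy or NanoPlot runs internally; the function names, the two example reads, and the assumption of Phred+33 quality encoding with exactly four lines per record are all illustrative choices.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a stream of four-line FASTQ records.

    Assumes every record is exactly four lines; stops at end of input
    or at the first empty line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (Phred+33 is assumed)."""
    return [ord(symbol) - offset for symbol in quality]


# Two made-up reads, purely for illustration.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\n!!!!!!\n"
records = list(parse_fastq(io.StringIO(example)))
read_lengths = [len(sequence) for _, sequence, _ in records]
mean_length = mean(read_lengths)
print(f"{len(records)} reads, mean length {mean_length}")
```

Mean read length is one of the summary statistics a quality-control report gives you; a real tool computes many more, and handles compressed and very large files.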
So let's look at just one of the outputs today. We'll look at the HTML report with the eye icon. We won't go through this in a lot of detail, because there are other tutorials available in the Galaxy Training Network that go into a lot more detail about quality control. I just wanted to give you a brief overview today of how we might use one tool at this step.

So NanoPlot has looked at those Nanopore reads, and it's calculated a lot of statistics and made some plots. We can get some general information here. We can see the mean read length is almost 7,000 base pairs. That's quite good. We can see the histogram of read lengths and so on. For us, that's a fairly good read length to be using in this assembly process today. But if you were doing a particular type of assembly or had some other requirement, you may need different characteristics in your read set. This quality control step is where you can look at read length or read quality, or whatever it is you need to know, to make sure it's adequate for your analysis.

So we've just had a brief look there, mainly at the read lengths in this case. But again, there are other tools that we can use. For example, we can use FastQC for the Illumina reads. And we can also use a nice tool called FASTQE, which will give you your quality scores represented as emoji. That's a fun way to have a quick look at the quality of your reads.

Now let's assemble our reads. We're going to assemble the long reads, so the Nanopore reads. To do that, we'll use a tool called Flye. So I'll type that again into the tool panel, click on Flye, and just collapse that side panel there so I can see a bit better. We want to give it these Nanopore reads as input. We'll give it an estimated genome size, 160,000 base pairs. And we will click Execute. And now Flye is running. You can see it's making five output files. Flye may take some time to run. It may take five minutes, or perhaps 15 minutes if there's a very big queue in Galaxy.
So I'm going to pause this recording now while I wait for Flye to finish.

So Flye has now finished assembling those reads. We've got five output files. I've clicked here on the eye icon for the log file, and we'll scroll to the end here and zoom in just a little bit. We can see that the total length of the assembly is almost 160,000 base pairs, which is what we would expect. And there are two fragments.

So let's have a look at the assembly graph with the Bandage tool. In the tool panel, we're going to search for Bandage. We will skip the Bandage Info part today and just go to Bandage Image. We want to give it the Graphical Fragment Assembly file. We won't look at the node length labels today. And we'll click Execute. I'm just going to collapse this side panel again so we have a bit more room.

And then that's finished. So let's click on the eye icon, and let's zoom out a little bit so we can see that assembly. So this looks like we'd expect. We have a large part, then a sort of collapsed part here, and then a smaller part here. This seems to reflect what we would expect. This is probably the large single copy region of the chloroplast genome. This is probably the small single copy region. And then this is probably the collapsed inverted repeat region. There's a bit more discussion about that in the question and solution section in the tutorial, which you can look at by clicking on the plus next to the solution.

So this was just one tool we've run here today, Flye. But you can repeat this with different versions of Flye. Sometimes the newer versions will assemble these reads into just a single circular chromosome, so some of the changes in a new version can affect the assembly. You can also try different assembly tools, such as Canu or Unicycler.

Now let's polish that assembly. Because we have a set of short reads, we can use those to correct some of the errors that might be in our long-read assembly.
That's because the long reads typically have a higher error rate than the shorter Illumina reads. So let's do that now. Let's zoom back in a bit to make our Galaxy window bigger.

First of all, we need to take our set of Illumina reads and map them to this Nanopore assembly. So let's look for the tool BWA-MEM. It asks: will you select a reference genome from your history or use a built-in index? We're using a genome in our history, so it's one we've already made. And which shall we use? We'll use the Flye assembly, so that's the consensus sequence here. Leave that as auto. These are single-end reads. Then we need to give it the correct set of reads, so this is the Illumina FASTQ reads. Those other settings are correct. And then we will execute that tool.

Okay, so that's given us an output which is a BAM file. I'm going to rename this file just so it's easier to see what we've done. Let's call this Illumina BAM. I did that with the pencil icon here in the history panel.

We're going to use that BAM file now in the tool Pilon. So let's search for the tool Pilon. Again, we're using a genome from our history, and we're using our Flye assembly here. We need to give it the Illumina BAM, which it's already found. We will create a changes file, so set that to yes there. And then we can click Execute.

So we've got two output files here. We've got the actual corrected assembly. This is the FASTA file, and we can look at it with the eye icon. But we've also got the changes file that we wanted to have a look at. This shows us all the changes, or the polishes, that Pilon has made to the assembly using information from the short reads. That can be quite a useful thing to see.

So we've polished our assembly with the short reads. Let's get some statistics on that polished assembly and compare it to the unpolished assembly. We'll do that with a tool called Fasta Statistics. So I'll search for that again here. First let's run it on our original Flye assembly.
Execute that. And let's run it as well on the Pilon-polished assembly.

Okay, so for our original assembly, we can see the length of our assembly was 158,857 base pairs, which is about what we thought. But with the changes made by polishing, we now have a much bigger assembly. It's moved up to more than 172,000. So roughly four or five thousand changes have been made in that polishing step. With Pilon, you might want to do several rounds of polishing, so each time you'll polish the latest assembly. And there are other polishing tools as well. You can explore all of that in Galaxy, and some of them have a different effect in the polishing step.

Now we will annotate the assembly. That's putting information onto the polished assembly about the genomic features that are on there. There isn't yet a tool available in Galaxy specifically for chloroplast genome annotation, so today we'll use an approximation. We'll use a tool called Prokka, which is designed for bacterial genomes. It's not going to be perfect for the chloroplast genome, but just as an example to show you how we would do the annotation step, we're going to use Prokka today. I have put some information here in the tutorial about how you might want to use a more specific tool for chloroplast genomes. There's a web-based tool called GeSeq, which is quite good. That's what you might want to use if you were doing this in a real-life situation with a chloroplast genome. But today we'll just use Prokka.

So let's type that in here. We want to give it the polished assembly, so that was this FASTA from the Pilon step. We'll leave all the settings as they are and click Execute. Prokka makes a lot of output files, and you can explore those by clicking on the eye icon once the tool has run. Today we'll just look at the text file output for a very quick and brief overview of the annotations.
And after that we'll view those annotations in a genome browser called JBrowse.

Okay, so Prokka has finished. Let's click on the text file. We can see it's found lots of coding sequences, some ribosomal RNAs and some transfer RNAs. So that's just a very brief example of how we might do an annotation.

Let's view it now using JBrowse. So type in JBrowse. We're using a genome from our history, and we're going to use our polished assembly, which is from the Pilon step. We will use the bacterial genetic code. Now we're setting up a track group. Click Insert Track Group, then Insert Annotation Track. Then it asks what type of track we want. Here we want a GFF track, because that's the annotation file that we're going to look at. And it's found the GFF file here, which is from Prokka. Then we will set that to run.

So JBrowse has finished. Let's have a look at the output. I'm going to click on the eye icon here, and I will collapse both the side panels this time so we have a little bit more room. Make sure our tracks are ticked. We're looking at one track which is the reference sequence, and then we're also looking at this Prokka annotations track.

Let's zoom in a bit here to get an idea. If we zoomed in all the way, we would get right to the nucleotide level. In fact, we'll make that a little bit easier to see. Let's only show the forward strand, and let's not show the translation this time. So this is our reference sequence. And then, zooming out again, various parts of it have been annotated with features. Some of them will be hypothetical proteins. Some of them will have names, and you can click on those names to learn more about those annotations. So we did this today with Prokka. As I said, you can also do this with other tools, such as GeSeq.

So now we're going to view our sequencing reads mapped against our assembly, just so we can get a bit of an idea about how those reads look.
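Going back to the annotation step for a moment: the kind of summary we read from the Prokka text file (how many coding sequences, ribosomal RNAs and transfer RNAs) can also be tallied from a GFF file. This is a minimal sketch, not how Prokka produces its report. It assumes a standard nine-column, tab-separated GFF3 layout with the feature type in the third column, and the contig name, source names, coordinates and attributes in the example are made up.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count the feature types (third column) in GFF3-formatted text."""
    counts = Counter()
    for line in gff_text.splitlines():
        # Some GFF files append the genome sequence after a ##FASTA marker.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        counts[columns[2]] += 1
    return counts


# Hypothetical feature lines, for illustration only.
example_gff = "\n".join([
    "##gff-version 3",
    "contig_1\texample\tCDS\t100\t400\t.\t+\t0\tID=feature_1;product=hypothetical protein",
    "contig_1\texample\tCDS\t900\t1500\t.\t-\t0\tID=feature_2;product=hypothetical protein",
    "contig_1\texample\ttRNA\t2000\t2072\t.\t+\t.\tID=feature_3",
    "contig_1\texample\trRNA\t3000\t4500\t.\t+\t.\tID=feature_4",
    "##FASTA",
    ">contig_1",
    "ACGTACGT",
])

feature_counts = count_feature_types(example_gff)
for feature_type, count in sorted(feature_counts.items()):
    print(feature_type, count)
```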
So we could use the original set of sequencing reads, but instead I've cut those datasets down even further, just so it's a bit easier to visualize here. Again, we need another set of files from Zenodo. We can copy the links from the tutorial, go to the Get Data box, Paste/Fetch Data, paste in those links, and click Start and Close. Those files are now loading here at the top of our history.

What we need to do is make two BAM files. We need to align the Illumina reads to the assembly, and then we need to align the Nanopore reads to the assembly. We can start setting up that tool to run even though these files are still loading. We'll do it the same way we did in our polishing step, when we mapped the reads to the assembly. So we want to look for BWA-MEM. First of all, we're using a genome from our history. We will use our Pilon-polished assembly. We'll use single reads, Illumina reads, and this is the Illumina tiny FASTQ set. So select that there and execute.

We can repeat that job with the Nanopore reads. One way you can repeat jobs in Galaxy is to click on the rerun button underneath an output file. So we can repeat this job for the Nanopore tiny read set. Make sure you tell it to use Nanopore mode, and now that job's running. We'll rename these output files so it's a bit easier to see what we've done. This was the Illumina mapping, and this is the Nanopore mapping, so Nanopore tiny.

Okay, so we have these two new BAM files, and we want to look at them in a genome browser. Again, we're going to use JBrowse. We're using a genome from our history. It's the Pilon-polished genome. Insert Track Group, then Insert Annotation Track. Last time, when we did this with the annotations, we used a GFF file here. You can see that in the track type. But this time we're using the BAM track type. So click here for BAM. Let's look at our Nanopore BAM file here. And let's insert another track. Again, a BAM track.
Let's make this the Illumina mapped reads here. And we'll set this to run. We'll pause here while that runs.

Okay, so that's finished running now, and we can view the output. Again, I'm going to collapse these side panels so we have a little bit more room. Make sure everything's ticked here. And let's zoom all the way in again, right to the reference sequence level. Again, we'll make this a little bit easier to see: just show the forward strand, and don't show the translations. So this is our genome along the top.

Zoom out again, and we can see the sets of mapped reads. You can probably tell by looking, but the set at the top is our shorter Illumina reads, and the set at the bottom is our longer Nanopore reads. I just think this is a nice way to get an overview of how the reads are different. We can see that there are lots more potential errors in the longer Nanopore reads, with all these little colors representing differences from the reference sequence. And there are far fewer differences in the Illumina reads.

So we can zoom out and get an idea of how those reads look against the genome. This is one way you can check your finished assemblies. If there are large areas where no reads are mapping, or there are twice as many reads mapping, or other variations like that, you can start to see whether there may be a misassembled part in your assembly, so something that is perhaps incorrect, or where there is perhaps a repeat region. If there is twice the read depth in some areas, perhaps it's a repeat region in the genome. In the real genome it might be two areas, but your reads have all been mapped to one single area. So it's a way you can start to explore the correctness and some of the features of your sample genome. There's a nice tutorial in the Galaxy Training Network where you can learn more about JBrowse, and that's linked here in the tutorial.
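The read-depth check described above can also be sketched in code. This is only an illustration of the idea, not a tool from the tutorial. It assumes the mapped reads have already been reduced to (start, end) positions on the assembly, and the window size and the "low" and "high" thresholds relative to the median are arbitrary choices that you would tune for real data.

```python
from statistics import median


def window_depth(alignments, genome_length, window=1000):
    """Mean read depth per fixed-size window.

    `alignments` is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    n_windows = (genome_length + window - 1) // window
    covered_bases = [0] * n_windows
    for start, end in alignments:
        position = max(start, 0)
        end = min(end, genome_length)
        # Split each alignment across the windows it overlaps.
        while position < end:
            index = position // window
            stop = min((index + 1) * window, end)
            covered_bases[index] += stop - position
            position = stop
    sizes = [min(window, genome_length - i * window) for i in range(n_windows)]
    return [bases / size for bases, size in zip(covered_bases, sizes)]


def flag_windows(depths, low=0.25, high=1.75):
    """Flag windows whose depth is far below or above the median depth."""
    typical = median(depths)
    flags = []
    for index, depth in enumerate(depths):
        if depth < low * typical:
            flags.append((index, "low"))
        elif depth > high * typical:
            flags.append((index, "high"))
    return flags


# Toy data: ten reads over positions 0-4000, ten more stacked on 2000-3000,
# and nothing at all mapping to the last 1000 bases.
alignments = [(0, 4000)] * 10 + [(2000, 3000)] * 10
depths = window_depth(alignments, genome_length=5000, window=1000)
flags = flag_windows(depths)
print(depths)
print(flags)
```

A window flagged as high is a candidate collapsed repeat, and a window flagged as low is a candidate gap or misassembly. Either one is only a prompt to go and look at that region in the genome browser.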
So one of the things that you can do next is to repeat the steps we did in this tutorial on a different dataset. I found a dataset from a Eucalyptus sequencing project and I've cut that down in size a bit, so it's more easily used in the tutorial. There's a link to that here in the tutorial, and you can upload those files to Galaxy. Then you can either run the same tools that we ran here with the sweet potato data, or you can use some different tools as well and learn more about what the tools do and what the results would be.

So in conclusion, here is what we did today. We assembled the chloroplast genome with the long reads. We polished it with the short reads. We looked at the assembly graph and had a bit of a think about the structure of the genome. We annotated the genome and had a look at that in a genome browser. And finally, we took these raw reads and we mapped them back to the assembly, and started to get an idea about how those reads differ and how we could use them to investigate the correctness of the assembly, or some of those other features like repeat regions.

So that's the end of the tutorial today. Thank you very much for coming to this tutorial. We hope you found it useful. We'd encourage you to fill in the feedback form at the end of this tutorial if you can. That's really useful information for us. And don't forget, there are lots of other really useful tutorials available in the Galaxy Training Network. Thanks very much.