Hi, welcome to the practical session for quality control in the Galaxy Training Network. My name is Andrew Lonsdale, and I'm a bioinformatician based in Melbourne, Australia. We're going to get started straight away, so I'll just start by sharing my screen. And here we have my screen. I'm on the Galaxy Australia server, and your server may look a little bit different, but use whichever Galaxy you're comfortable with or have access to. Now the prerequisites for this course are a basic knowledge of Galaxy; that's not much more than knowing how to log in and run some tools. And what we're going to cover today is a fairly standard and straightforward part of bioinformatics, which is looking at the quality of next generation sequencing reads. I'll just quickly switch to the notes, which you may have in front of you; we'll be referring to these throughout. There are challenges in the tutorial that you can try. I'll answer a few of them, but I'll leave quite a few for you to do in your own time. This lesson has expanded quite a bit from previous years, and so we're going to be covering four different kinds of sequencing technology and how to assess the quality of each, whereas in previous years only about two of the technologies were covered. There are some parts of the tutorial where the previous years' tutorial video goes into quite a lot of depth, in particular the FastQC output, so if there are any bits where you want more information and more description than accompanies the notes, I encourage you to look at the 2021 version of this tutorial video. For this one, we've got four kinds of technologies: as I said, there are single-end reads, paired-end reads, long reads from PacBio, and long reads from Nanopore, and we're going to go through all four of those. We'll break it into about two sections: we'll do the single-end and the paired-end first, and then there'll be time for a break, which will also be convenient as the PacBio data takes a little bit of time to download.
Okay, so this tutorial is all about sequencing reads and understanding how the reads that come off the sequencer have some quality associated with them. No sequencing technology is perfect, and so there are different kinds and amounts of errors depending on the technology. It's important to understand what these errors are and to see what you need to do depending on your analysis: whether you need to filter, exclude, or just identify errors that may affect your downstream analysis. So the best way to get started is to look at some sequencing reads and see what they actually look like. If you haven't already, let's create a new history in Galaxy. I'm going to call my history "quality control 1". And I'm going to download a file. In the notes, you'll have a link to a FASTQ file, and we'll inspect what a FASTQ file is in a second. What I want to do is import this; it's hosted on Zenodo. So I'm going to go and upload some data. I'm going to use Paste/Fetch data because we've got the URL, and we can just paste it in here and then click Start. It's quite a small sequencing file, so it shouldn't take too long. We can see it's finished, and if we click Close, we should see that our history updates quite soon, with the file downloading there. And there it is. Now, this file name is specific to the study that it came from, so we're just going to rename this and call it "reads" — and I'll try to type it correctly. So now you can see it's been changed to "reads". We've just imported a FASTQ file into Galaxy; this is very similar to what you'd get from a sequencing facility. So let's click on the view icon and have a look at what is actually in this file. It can look complicated, but the FASTQ format is almost 20 or 30 years old, and it's relatively easy to understand once you have it explained to you. Each read that comes off the machine is encoded by four different lines.
So the first line always begins with an @ symbol in the FASTQ format, followed by some information — that's the first line there. The second line is the actual sequence that has come off the machine: A's, C's, T's and G's typically. The third line is a spacer line; it always starts with a plus, and sometimes it repeats the information from the first line, but sometimes it's just the plus like this. And the fourth line is probably the most complicated: it's the quality information string. For each corresponding base in the sequence, the character in the same position is a score to let you know how confident the machine is that the base is correct. The machines from the various companies have changed how they encoded this over the years, so sometimes a letter meant one quality and later on it meant another. Fortunately, things have standardized somewhat nowadays; unless you're looking at historic data, it will often be in the Illumina 1.8+ format. But if you're ever dealing with a mixture of data, you should always check what its origin is. So those notes are there, and we'll go back and keep looking. There are a few questions in the tutorial, so take a quick second to have a look and answer them: see which character corresponds to the worst score, and what is the quality score and accuracy of the third nucleotide? Okay, so looking at those questions, I'll quickly go through the answers. In terms of the worst score, zero is the worst score, and up here that corresponds to the exclamation mark in ASCII. For the other questions, when we look at the first sequence, the third nucleotide is a G, and if you look at the third position in the quality string, it's also a G — which, admittedly, can be a bit confusing at times. In terms of the quality score, if we go back to our notes, a G is there, and you can see that it actually corresponds to a Phred score of 38. So the final question is about what the probability actually is.
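As an aside, the encoding just described is easy to check for yourself. Here's a minimal Python sketch (an illustration only, not part of the tutorial's Galaxy workflow) that converts a FASTQ quality line into numeric Phred scores:

```python
# Convert a FASTQ quality string into numeric Phred scores.
# Assumes the modern Phred+33 ("Sanger" / Illumina 1.8+) encoding,
# where '!' (ASCII 33) encodes the worst possible score, Q0.
def decode_quality(qual_string, offset=33):
    return [ord(ch) - offset for ch in qual_string]

# '!' is Q0 (the worst score); 'G' (ASCII 71) is Q38.
print(decode_quality("!G"))  # → [0, 38]
```

Remember that the offset of 33 only applies to the modern standard encoding; historic data may use a different offset, which is why checking the origin of mixed data matters.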
And what is the accuracy of this? You can look at the Phred quality score Wikipedia page for more information, but essentially each time the score goes up by 10, you gain about an order of magnitude in accuracy. So 38 is somewhere between 99.9% and 99.99% accurate. And as you get more into bioinformatics, you'll see that a quality score of around 20 to 30 is usually the rule of thumb for where you want things to be, at least in the short-read era. As we'll see later on with long reads, currently their quality is generally a bit lower, which is a bit of a trade-off for the longer reads you get. But there you go — that's looking at a basic file and seeing how it works. Admittedly, those ASCII characters can be a little bit confusing: whether you're looking at a file on the command line or inspecting it like we do in Galaxy here, it's not really that intuitive. So we're going to use a Galaxy tool now that one of the authors of this tutorial wrote, called FastQE, which is a bit of a fun tool that translates those weird characters into something a bit more visually appealing and instantly recognizable. This is the first tool we're going to run on our data. So we're going to go here and search for FastQE, and it comes up down here — it's the only one with emoji in it, so it's easy to find. As input, it takes a FASTQ file, in this case ours, called "reads", so let's click on it here. For the score types to show, it can tell us the quality at any position in the read for all the reads in a file: the minimum, the maximum, or the mean. We're just going to choose the mean, so let's get rid of minimum and maximum and click Execute. That shouldn't take too long, and when it's finished, there'll be an HTML file that we can look at. Okay, after a while, you can see the results of FastQE appear here, so we can click on the icon to see what the report says. And hopefully you agree it's a bit more visually appealing.
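The relationship between a Phred score and accuracy described here can be written down directly. This is a small illustrative snippet, not something you need for the Galaxy exercise:

```python
# A Phred score Q encodes the error probability as P = 10^(-Q/10),
# so every +10 in Q means 10x fewer expected errors.
def phred_to_error_prob(q):
    return 10 ** (-q / 10)

def phred_to_accuracy(q):
    return 1 - phred_to_error_prob(q)

# Q38, the score from the question above, is ~99.98% accurate,
# i.e. between 99.9% (Q30) and 99.99% (Q40).
print(round(phred_to_accuracy(38) * 100, 3))  # → 99.984
```

The same formula explains the common short-read rules of thumb: Q20 means a 1-in-100 error rate, Q30 a 1-in-1000 error rate.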
What this does is summarize across the first position, the second position, et cetera, across all the reads. So rather than looking at one read at a time, it looks at all of them, and the emoji is essentially meant to convey how high the quality is. We can see at the start of the read the quality is generally quite high, with the smiling faces, and towards the end of the reads, on average, the quality starts to diminish. In the notes, we've got how the Phred score, the ASCII code and the emoji correspond. As an aside, you can use FastQE on the command line as well, and you can customize the emoji for whatever purpose you want to convey. So a quick question here is: what's the lowest mean score in this dataset? Using the notes — if you go down to the table here — take a few seconds to quickly work out what the lowest quality is. Okay, so have a quick look and check; pause now if you need more time to answer this. If you look at the guide, even just using the emoji, we can see that the sad cat is actually the lowest score in here, and that corresponds to a Phred score of 13. As you can see, it's towards the end of the sequence, and that diminishing quality as the machine goes on is relatively typical. We'll see later on what we would do with this diminishing quality and whether there's anything you need to do to modify the data. Now, FastQE is a fun little tool, but it's typically not used in production. One tool that precedes it, and actually has a long history in bioinformatics, is FastQC. This is what you probably would use if you're doing day-to-day bioinformatics, and it's a really standard part of many pipelines. Just like FastQE, we can search for it — FastQC, with a C instead of an E; FastQE is named as a kind of riff on FastQC. And it will do the same thing, but as I said, much more comprehensively.
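The per-position mean that FastQE displays is straightforward to compute. Here's a sketch, assuming for simplicity that all reads have equal length:

```python
# Mean Phred score at each position across a set of reads —
# the number each FastQE emoji summarizes.
def per_position_mean_quality(quality_strings, offset=33):
    n = len(quality_strings)
    length = len(quality_strings[0])
    return [sum(ord(q[i]) - offset for q in quality_strings) / n
            for i in range(length)]

# 'I' = Q40, '?' = Q30, '+' = Q10: the last position averages to Q20.
print(per_position_mean_quality(["IIII?", "IIII+"]))
# → [40.0, 40.0, 40.0, 40.0, 20.0]
```

Real tools additionally handle reads of varying length by tracking how many reads cover each position.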
So I'm going to choose some reads from our history — "reads" is the only thing there. We can leave the defaults; the notes are quite comprehensive about what some of these settings do, but we'll leave the defaults and just click Execute. Hopefully it won't take too long to run, and you'll get some results; we'll look at the HTML in a second. Okay, my FastQC run has finished. There are two parts: the raw data and the summary web page. Let's click on the summary web page to see what's there — this is a typical thing you would do in bioinformatics. I might just make my screen a little bit bigger so we can see the full plot, or I might take advantage of the Galaxy overlay: if I click on that, we can look at this in its own window. So one of the questions is what kind of encoding the file uses, and FastQC will basically work this out with some heuristics. It can tell that it's Sanger / Illumina 1.9, which again is going to be the most common nowadays. Then some basic statistics: 112 sequences in there, and the length is 296 base pairs. And this is the main FastQC plot that you'll see in many bioinformatics talks. Comparing to FastQE: FastQE shows the mean at each position, which is essentially this blue line here, so you can see that FastQC is more comprehensive and gives you the whole range of values at each position. Although FastQE summarizes it at the emoji level, what we really see here is that the quality is very strong and high towards the start of the sequences — it shows positions one to ten individually and then starts binning them — but towards the end of the reads, you can see that quality is quite variable. There are still some very high quality bases, but also some very low quality ones, and it diminishes over time.
As I mentioned at the start, the 2021 version of this tutorial goes into a lot of depth about the FastQC outputs, and both that and the notes are quite comprehensive. So if you're not familiar with the tool, I would really recommend going through those materials, as I won't cover them in as much depth during this recording. One thing I will look at — these are basically the summary reports here — is the adapter content, which is interesting. If we look at the adapters, we do see some adapter signal from this Nextera transposase sequence. The rest of the FastQC output we're going to skip for this recording — again, look at the 2021 video — and we're going to skip straight to trimming and filtering, which is about how we deal with the consequences of the quality scores or of other QC findings, such as adapters. A lot of the other items are often just warnings: you can see on the left-hand side here there are some warnings and some errors. They're not necessarily always critical, but they can represent a clue that there might be some technical issue with your sequence. As you get more experienced, you'll often look at these. You won't always act on every one of them in every experiment, but it's important to look at them and understand what kind of quality landscape you're dealing with. As I said, we're going to go straight down to trimming and filtering of short reads in the materials, which is about dealing with the results of this QC check that we've done. Okay, so for the filtering, what I'm going to do is minimize this report so we can compare: we're going to do some quality trimming and filtering — I'll explain what that is in a second — and we'll compare the reports before and after we do our next step. So let's just minimize that; I'll refer back to this later on.
The two kinds of quality assurance that you can do at this stage are trimming and filtering. Trimming is taking the actual sequences and removing the low-quality parts or adapters — keeping some of the sequence intact, but modifying it to remove problematic pieces. The second way is to filter: to not even include certain reads. They could be ones with a low mean quality score, or they could be too short, or have too many ambiguous bases, which are usually represented by an N instead of an A, C, T, or G. The tool we're going to use to do this is called Cutadapt. Like everything else, we'll go to our tools and start searching for Cutadapt, and it should come up shortly: Cutadapt — remove adapter sequences from FASTQ/FASTA files. The first question this tool asks is whether the data is single-end or paired-end. We'll use single-end files for the moment and look at paired-end reads later. And the reads are, again, probably the only reads in our history, so they come through there. There are a couple of things to set. If you look at the notes, there's a certain adapter sequence that, for the purposes of the tutorial, we know beforehand is in there, so you need to copy and paste the adapter sequence where it says to insert 3' end adapters: click here, choose to enter a custom sequence, and paste it there. We don't need to give it a name, since we're only giving it a single one. So we're going to look for these adapters to clean up our reads. We're also going to ensure that reads are at least 20 base pairs long: if we scroll down to the filter options, we can say the minimum length is going to be 20 base pairs. And looking at the read modification options, we have a quality cutoff: we're going to say we don't want bases with a quality score lower than 20, so again it's going to get rid of some of those really low quality stretches. And for our output, we're going to ask for a report from Cutadapt.
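To make the trimming and filtering steps concrete, here's a simplified Python sketch of 3'-end quality trimming in the style of the partial-sum ("BWA-style") algorithm that Cutadapt's quality cutoff is documented to use, plus a minimum-length filter. This is an illustration of the idea, not Cutadapt's actual implementation:

```python
# BWA-style 3' quality trimming: subtract the cutoff from each score,
# take partial sums from each index to the read end, and cut where
# that sum is minimal.
def quality_trim_3prime(qualities, cutoff=20):
    best_index = len(qualities)  # default: keep the whole read
    best_sum = 0
    running = 0
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            best_index = i
    return best_index  # keep bases [0, best_index)

def trim_and_filter(seq, quals, cutoff=20, min_len=20):
    end = quality_trim_3prime(quals, cutoff)
    if end < min_len:
        return None  # discarded, like a minimum-length filter
    return seq[:end], quals[:end]

# The low-quality tail (two Q5 bases) is trimmed; the first two remain.
print(quality_trim_3prime([30, 30, 5, 5]))  # → 2
```

Notice that trimming and filtering interact: a read that is trimmed down below the minimum length ends up being filtered out entirely.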
So once those three things are selected, click Execute, and we'll revisit this when the run finishes. Okay, mine has finished, so I'll turn off the window manager for a second so we can see this in the classical view. Cutadapt has given us two outputs: we've got the read output and the report. The read output is the trimmed reads themselves, whereas the report is text-based, telling you what happened to the files, and it's the more comprehensive of the two. There are a few questions in the notes about the percentage of reads that contained adapters, how many were removed, and how much was quality-trimmed, so take a few minutes to look through those. Okay, pause and finish those questions if you haven't already, but let's have a look at the report. If you click on the report and look at the questions, you can see the answers are essentially in here, at the top, in the summary: the percentage of reads with adapters is there; in terms of quality trimming, you can see the percentage; and in fact, no reads were too short. So again, this is the input and output information — you can see what's gone in and what's come out. The next thing we can do is use both of our previous tools for checking quality, FastQE and FastQC, on the modified data rather than our original data. So let's go back to FastQE — type "emoji", because it's the only tool with emoji in the name — and this time we'll use the Cutadapt output FASTQ rather than the original. We'll just do the mean again, so it's the same, and execute. That should finish pretty soon, and we can do the same for FastQC while we're waiting: just run FastQC on the Cutadapt output. So wait for both of those jobs to finish, and then we'll have a look; I'll pause until that's done. Okay, both of those are finished. We're going to use the scrapbook feature again, which is this little window manager icon here, and we had our original FastQC data.
So let's look at the FastQC on dataset six, which is here, and that's the web page, the HTML report. We'll move it across here — sorry about that — and just adjust these windows so you can see them side by side. Let's look at the basic statistics of both of them: let's put the original data on the left and the Cutadapt data on the right. You can see the file name is different, but it's the same number of sequences — remember, none were filtered out. But the sequence length now varies, and so does the GC content, because we've trimmed these: some of the sequences have been cut for quality or adapters. There are two main things we should look at. First, let's look at our per-base quality plot. There we have the new one; let's go to the original. Here's the original, where we saw the quality diminish towards the end, and when we look at the new one, we can see that it's a lot higher. It still diminishes towards the end, but we've clearly gotten rid of some of the low quality parts. We can also look at the adapter content: this is the new file and this is the original file. Remember we had adapter content here in the original, but after using Cutadapt, if we go down to the adapters, you can see it's no longer there — so it really has had an effect. And just for completeness, let's look at the FastQE: we can see the poor quality in the original, and then it's been improved on average. So our trimmed and filtered data is of higher quality than our original, and hopefully any inferences we make on it later on will be more robust. Now, that was just for short single-end reads, but what you'll predominantly see in the wild is going to be paired-end data. So now we're going to look at processing multiple datasets, having two FASTQ files of reads.
With paired ends, one read comes from one end of the fragment and one from the other end, and we'll see how Galaxy handles those from a quality point of view. If we refer to the notes, we're now at the processing-multiple-datasets part of the tutorial, and we're going to get two new datasets — a pair of ends from the same sample. I'll copy the links from the notes, go back into Galaxy, and download these in much the same way we did the original single-end data. So let's switch back: I'm going to upload data, again using Paste/Fetch data, paste in the two lines here, and click Start, and that should finish relatively quickly. Let's wait for that. Okay, now those are downloaded, and you can see they're just FASTQ files again. I've still got the overlay turned on; it's a large file, so it's only showing a few reads, but again we can see these are just short reads. Different to our previous example, this time the plus line does repeat the information from the first line, which I talked about — it wasn't like that in our first example. And you can see that the read IDs match up between the two files for each pair. So those are our files, but to get on with the quality: FastQC can read them both together. We'll click on FastQC, click on multiple datasets, and then Ctrl-click or Cmd-click, depending on whether you're on Windows or Mac, to choose both. Again, we'll leave the rest as defaults and then execute. And now we're actually going to look at another QC program, called MultiQC, which is really good at collating QC results from different kinds of programs. It understands the raw output of FastQC and can use its information and plots to collect things together, which is really nice when you've got paired-end reads and you want to see what happens to both of them at the same time. So we'll let this finish, and then we'll load up MultiQC and combine the outputs for the paired-end reads.
Okay, FastQC has finished on datasets 11 and 12. Now we'll use MultiQC, which again is available in the tools. MultiQC is good at aggregating results from different kinds of programs. Loading the tool, what we're going to do is choose both of those FastQC outputs. We don't actually need the HTML reports: MultiQC understands the raw data from FastQC, and it will pull in the data and the plots and create its own plots. So we're going to run that and then look at the output of MultiQC. Okay, so this is the interface. We're looking at FastQC results, so I'm going to say the results come from FastQC output, as raw data files. Then we're going to choose the two raw-data FastQC results on our GSM sample — that's the two there — and execute. We'll see what the web page from MultiQC looks like; you can see it's running now, so I'll just pause and come back when it finishes. Okay, now that's finished. Let's click on the web page view and have a look. So this is a MultiQC report, and you can see it starts with general statistics pulled from each of the FASTQ files, such as the GC content and the number of sequences. Again, it's just pulling information from FastQC, but it's a nice way of organizing it. And here it's loading, and it's going to combine both those samples together, so we just wait for it to finish. Okay, with the report loaded, we can now start to scroll through, as you can see. Our reads here are separated, so you can see how many unique versus duplicate reads there are. And then we get down to the plots; they're similar to what comes out of FastQC by default, except the two samples have been overlaid on top of each other.
The first thing we can notice is that the forward read (underscore one) has, in general, a higher average quality than the reverse read (underscore two). This is kind of typical for paired-end reads, just because of the way the chemistry works, so you can often see that the second reads are of a lower quality. A nice thing is that you can zoom in, which is a great feature of MultiQC — and MultiQC combined with Galaxy is a great combination. Looking at other plots, again each of the samples is overlaid, and if we had lots of samples in this scenario, you could see them all together. For the per-base sequence content, the orange here is a warning and red is a fail: before, we had a warning for the quality being a little bit lower, but the per-base sequence content in read two has actually failed, so there are some quality issues here. So just like with the single-end reads, we can use Cutadapt to modify the paired-end reads, and next what we're going to do is the trimming again. Let's bring up Cutadapt. Because they're paired-end reads, for almost all purposes you want to keep the pairs together: if one read of a pair is filtered out, you'll typically filter out the other one as well, and that's why it's important to do this in one program that knows about the pairing. So select paired-end, with the forward reads as read 1 and the reverse reads as read 2 — it's quite important to get these right here: underscore one for the forward and underscore two for the reverse reads. For the read 1 options: if adapters had been found in those plots, then you might want to set them up here, but if you look at the adapter content, both passed and no adapters were really found, so we don't have to do that for this data; the same goes for the read 2 options. Again, we're going to set a minimum length of 20 and a quality cutoff of 20, and for the output we're going to ask for a report, just as before. So let's execute. This way, Cutadapt knows about the paired-end reads and treats them accordingly.
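The reason for running both files through Cutadapt together can be sketched in a few lines: a pair is only kept if both mates survive, so the two files stay synchronized. A toy illustration (read length stands in for whatever filter you apply):

```python
# Keep a pair only if BOTH mates pass the filter. Dropping one mate
# but not the other would desynchronize the R1/R2 files, whose
# matching order paired-end aligners rely on.
def filter_pairs(pairs, min_len=20):
    return [(r1, r2) for (r1, r2) in pairs
            if len(r1) >= min_len and len(r2) >= min_len]

pairs = [("A" * 25, "C" * 4),    # mate 2 too short: whole pair dropped
         ("A" * 25, "C" * 25)]   # both pass: pair kept
print(len(filter_pairs(pairs)))  # → 1
```

Some tools instead write "orphaned" surviving mates to a separate file; either way, the main paired output files must stay in lockstep.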
We'll just wait for that to finish and come back shortly. Okay, that's finished. Let's look at the report. The read output is the result of the trimming, and the report tells us what happened: you can see 1.4% were too short and 98.6% were written out. With the quality trimming, about 2.5% was removed, and in the end we keep 96.7%. So we've done a bit of tidying up, which will hopefully make things more robust further down the line for other kinds of assessment. Again, there are more questions in the tutorial; I'll leave those for you to answer in your own time — have a look. One of the good things about Cutadapt is that its documentation is quite thorough, so it's worth reading to understand what the tool does. To summarize these first two parts: we could really see the difference in the quality when we did some quality assurance checks. We've got tools to help us visualize quality, and then tools to help us modify our data to get a higher-quality subset to use in further analysis. As with all reproducible research, you don't want to modify your original data: it's good to have the original data and then the processed data, so you can compare and go back if you need to revisit or change the settings. And so that's the end of the short-read section, and now we're going to move to the long-read section. Okay, so we've finished the short-read section, where we went through the idea of how the quality scores work and a general approach to cleaning reads up. Now, long reads are a bit different: as the name implies, they're quite a lot longer, but the quality is also different, so how we deal with them can be different. Let's start a new history, just so we've got a bit of space and don't get confused with our short reads. I'm going to create a new history and call it "quality control: long reads". So I have a nice blank history there, and if we go back to the notes, we're going to be at this section here.
So, long reads. We're going to use this data file here, and this download can take quite a long time. What we're going to do is go back to our history, click upload data, choose Paste/Fetch data, paste the link in there, and start the download, and this will appear in our history here — our first file for part two, long reads. As I said, this could take a while, so it's a great time to go get a drink and stretch your legs, and we'll come back for the second half, looking at quality in long-read sequencing. See you soon. Okay, now that's downloaded. If we click on it, just like with our short reads, it's a FASTQ file — compressed this time — but you'll see it looks a bit different, and that's just because of the length: you can see that we can scroll across, and these are very, very long reads, and again the quality information is here. Now, NanoPlot is a program a little bit similar to FastQC — it's a sequence quality checker, but geared more towards long reads, particularly Nanopore, though it works with PacBio as well. So we'll bring up NanoPlot; the file we'll choose will be this subsample of PacBio reads that we've got. The only options we're going to change are to customize the plots: we want a bivariate plot in the KDE format and also the dots format, and for the histogram with the N50 mark, we're going to say yes. So let's start that, and when it's finished, we're going to look at the HTML output. I'll pause for a second, and we'll come back when that finishes. Okay, now that's finished, and we can look at the outputs. There are individual stats and filtering outputs, but the HTML report collates it all together, and it's a bit more visually appealing, so let's have a look at that. There are some summary statistics — the number of reads, the N50, the mean quality — all in text here, but there are also some nice plots. We've got a histogram of the read lengths, so you can see about 18,000 here is the mean.
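The N50 statistic that NanoPlot reports (and marks on the histogram) has a simple definition: the read length such that reads at least that long contain half of all sequenced bases. A minimal sketch of the calculation:

```python
# N50: sort read lengths from longest to shortest and accumulate;
# return the length at which the running total reaches half of
# the total bases.
def n50(read_lengths):
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# The single longest read already covers half of the 20 total bases,
# so the N50 is 10.
print(n50([1, 2, 3, 4, 10]))  # → 10
```

Note that N50 is weighted by bases, not by read count, which is why it's usually larger than the median read length.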
And after a log transformation, we can see where it sits. Now, this isn't necessarily that surprising, because HiFi has a bit of a size selection step, so it generally restricts reads to the size range you're interested in. But again, it's quite a nice little program with interactive plots. And this is read lengths versus average quality; there's no real correlation there. The notes are quite comprehensive on these plots, so you can have a look through and see what each of them is used for. There are a couple of questions there about the dot plots, and there's something unusual with the Q scores, so have a look, take a few minutes, and try out the question there and see if you can get the answer. Another optional thing we might do now, to see how long it takes, is to run FastQC on this data and see what happens. I'll just load up FastQC, pick this dataset, and execute it. That will take a little while, so we'll come back when it's finished, have a quick look, and then we'll look at some other long reads, from Nanopore this time, and maybe we can compare and see how the quality changes between the two. Okay, FastQC has finished on that data, so we can see what FastQC tells us. FastQC still works — you can see the binning is now getting quite large — and overall the read quality is quite high, with again some variability. We have to adapt a bit here: FastQC was built a long time ago, well before long reads, but it's good to see it still holds up, although, as I said, you would typically use more modern tools that are designed for long-read data. I'll leave it to you, in your own time, to compare some of these results with the NanoPlot results and see which you prefer, and why or why not. The good thing about pipelines is that often you can include both, or if there's just one small part you prefer, include that.
Or, like I said in the previous part of the lesson, using MultiQC you could combine things. Okay, so that's the end of the long reads from PacBio. Now we'll do long reads from Nanopore, which is a slightly different technology. The HiFi reads, as you'll see in the notes, use a circular process to go over the read over and over again, and that's how they get such high quality. We'll now look at Nanopore, which doesn't have that, and so we'll see that Nanopore quality, depending on the chemistry, is at a much lower level than HiFi or short reads. So what we're going to do is create a new history for this part as well. Let's do that: new history, and we'll call it "quality control 3" — I hope you're a better typist than me — for Nanopore. Like before, we're going to get some sample data; in this case, we're getting two files for Nanopore. There are some links in the materials about Nanopore, and hopefully you're familiar with the way it measures electrical current as DNA or RNA passes through a pore, and that change in electrical current is what gets base called: the electrical signal is converted into A's, C's, T's and G's. So it's a totally different approach to Illumina sequencing. This data has already been processed into FASTQ files, and we're also going to have some summary data. Here in the notes, we're down at the Nanopore section, and these are the files that I'm referring to: one is the basecalled fastq.gz, so it's compressed, and the other is the sequencing summary text file. Our first step is to upload them, just as we've done a few times now in the previous sections — the files are pasted in at the bottom there — and then let's start downloading those. That will take a little while, though not as long as the HiFi data, so we'll pause and come back when that's finished for you. And then we're going to run pycoQC on that.
So I'll just pause for a second and come back when the data is downloaded. Okay, now that's downloaded, we'll use PycoQC. Now PycoQC is a little bit different: rather than using the raw reads that come off the machine, PycoQC is designed to work with the sequencing summary files produced by the Nanopore base-calling program. So it works a little differently to the other programs, and that means it's pretty easy to run: we just point it at the sequencing summary text file and leave all the defaults. Again, as with all these tools, have a look at the manual if you're interested in what the other options do. Let's execute it, and once it finishes we'll inspect the results. Now that's finished, let's have a look at the PycoQC HTML report. It's being a little bit slow this afternoon. I'm recording this on a Friday afternoon, so if you're just as tired from the week, pause the recording and come back, so you don't have to listen to any more of my awkward space-filling banter. Okay, now that's loaded, we can look at the PycoQC report. You can see how many reads there are, and you also get information straight from the sequencer, such as how long the run took. So it's not just about the reads themselves, but about the actual experimental process. There's no barcoding in this run, but if you had multiple barcodes, that would appear in here too. And we can look at the status of the reads: how many passed, the median read length, and the median Phred score. There are some questions in the material about how many reads you have in total and what the mean and median are, so we can come down here and have a look. Again, my text is a little bit squashed; you can use the Scratchbook feature if you like. I don't know why it's being this slow this time. I'll make it a bit bigger; it's taking a little while to load. Just leave that there.
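While the report loads, it's worth seeing roughly what PycoQC is computing from that summary file. This is a simplified sketch, not PycoQC's actual code, and it assumes Guppy-style column names (`passes_filtering`, `sequence_length_template`, `mean_qscore_template`); check your own summary file's header, as column names can vary between base-caller versions:

```python
import csv
import statistics

def summary_stats(path):
    """Summarize a Nanopore sequencing summary file (tab-separated).

    Assumes Guppy-style columns: 'passes_filtering' (TRUE/FALSE),
    'sequence_length_template', and 'mean_qscore_template'.
    """
    lengths, qscores, n_pass = [], [], 0
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["passes_filtering"] == "TRUE":
                n_pass += 1
            lengths.append(int(row["sequence_length_template"]))
            qscores.append(float(row["mean_qscore_template"]))
    return {
        "reads": len(lengths),
        "passed": n_pass,
        "median_length": statistics.median(lengths),
        "median_qscore": statistics.median(qscores),
    }
```

The key point is that none of this touches the reads themselves; everything comes from the per-read metadata the base caller wrote out, which is why PycoQC runs so quickly.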
It might just be a bit slow. So I'm going to look at the base-called read quality. Again, one thing you may notice compared to short reads is that the quality is generally a lot lower: the maximum here is about 12, and you saw before that we filtered out anything lower than 20. That's really part of the trade-off, at this current point in time, between length and quality. You get much longer reads, and we can do amazing things with them: resolving unique isoforms if you're doing RNA-seq, or putting together structural rearrangements, from a cancer point of view, that you might not be able to see with short reads. But everything in life has a trade-off, and this is one of them: the quality is a bit lower, and we don't have that circular process that we had with the PacBio HiFi reads. Here we get a plot of the read lengths, and it's an interactive plot you can zoom in on, showing either all reads or just the reads that pass QC, so you can see the reads are actually all down here. Looking at the Phred scores, a lot of them have been cut off here: nothing lower than seven seems to have been included. You can also look at the output over time. So again, there's experimental information in here, and you can even get down to the read quality over time: on average it stays pretty high, around 10, but with variability the whole way through. And you can see the channel activity: each pore on the Nanopore flow cell is a channel ID, plotted against experimental time, so you can see how long they were active for. So again, a really quite interesting program, and we'll see what else the Nanopore data shows us. The materials for this lesson have a bit more information about how to interpret some of these Nanopore plots, plus a few questions I'll leave you to do in your own time, as this video has probably gone on quite a long time already.
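A quick aside on those Q scores, since the Q12 versus Q20 comparison comes up again and again in this lesson: the Phred scale maps directly to per-base error probabilities via Q = -10·log10(p). A minimal sketch of the conversion in both directions:

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)

# Q20, the short-read cutoff we used earlier, means a 1% chance the
# base call is wrong; Q12, typical of this Nanopore run, means ~6.3%.
```

So each step of 10 on the Phred scale is a tenfold change in error probability, which is why a drop from Q20 to Q12 is a bigger deal than the raw numbers suggest.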
So we're almost there. One thing we could do, out of interest's sake, is run FastQC, comparing these specialized long-read quality assessment programs with the conventional, or old-school if you will, program FastQC. So I'll run that, come back when it's done for a look, and then we'll get to some conclusions and we'll be finished. Okay, FastQC is finished, let's have a look. I've still got the Scratchbook turned on, so we can view that, and it looks quite different to our short reads. From a short-read point of view, the reads are all in the danger zone because the Q scores are quite low. There's no adapter content, because we didn't necessarily have adapters in this dataset. And some of the other boxes give warnings, but we know it's a totally different kind of technology, and even the sequence lengths, you can see here, go up to 100,000 base pairs. So massively different ways of thinking about your data. And increasingly we're seeing a mix of the two approaches. We can still do some things with short reads quite well that we can't do with long reads. That may not always be the case, but for now a combination of technology, analysis and economy means that short reads and long reads both sort of have their place, I think. So if you've been following the materials, you can see we're almost at the end. To wrap up, we really used this tutorial to look at quality in FASTQ files. It's the first step in lots of different kinds of bioinformatics analysis: RNA-seq, ChIP-seq and lots of other things following NGS. And we looked at a few different quality assessment tools: for short reads we used FastQE; for short and long reads we used FastQC, which has sort of been the standard for a long time; and for long reads we looked at NanoPlot and PycoQC, which are more specialized for the long-read technologies.
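One detail worth knowing when comparing per-read quality numbers across these reports: averaging Phred scores directly overweights the good bases, so some long-read tools average in error-probability space instead (it's worth checking each tool's documentation for which convention it uses). A small sketch of the probability-space version:

```python
import math

def mean_quality(qual_string, offset=33):
    """Mean per-read quality computed in probability space.

    Converts each quality character to an error probability,
    averages the probabilities, then converts back to Phred.
    Assumes the standard Sanger/Illumina ASCII offset of 33.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual_string]
    return -10 * math.log10(sum(probs) / len(probs))
```

For a read that is half Q40 and half Q0, the naive arithmetic mean would say Q20, while the probability-space mean comes out around Q3, which is a far more honest picture of that read.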
We also looked at the effects of trimming and filtering on short reads using tools like Cutadapt. Some key points I'd like you to take away from this lesson: you want to perform quality control on every dataset before doing other analysis, so it's often your first step, one you'll do almost by default. It's always important to assess the quality metrics and, if you need to improve the quality, to do so in a sound way. It's good to use these kinds of tools to check the impact of quality control, looking at the before and the after, and there are also other tools out there. As always with these materials, if you have any feedback, there are FAQs here and some of the papers we looked at; please send any feedback through here. This lesson has expanded quite a lot in recent years with the advent of long-read technology, so whereas before we ended at the short reads, we're now covering a lot in this one lesson, and any feedback on whether you think it works having them both together, or whether we should separate them out, would be welcome. Having finished this tutorial, which is a good first step in the sequence analysis section, you can now move on to other, more advanced lessons in the Galaxy Training Network. So I'll stop sharing now and finish by saying thank you very much for listening to this video. If you have any questions, please reach out through the normal Galaxy channels, and have a good week. Thank you.