So, hello everybody. Welcome to the practical session for quality control. This is the hands-on material, which you can also find on the Galaxy training materials site. As a requirement, I already mentioned in the theoretical session that you just need to know the basics of Galaxy. It will be straightforward: we will apply a few simple tools to some example data to see how to perform quality control, and I will also explain a bit more about which quality parameters you have to look at and check. I will sometimes take a break, because it can take some time until the data is imported or analyzed. Whenever that happens, I will say: okay, let's take a short break here. You can pause the video at that point, follow the same steps yourself in the hands-on, and continue the video once the data import or the analysis has finished. So I will always announce when there is a break. Okay, let's start. You can read the introduction of the hands-on in the meantime, but let's go to Galaxy and log in. Once you are logged in, you can create a new history: on the right side there is a plus button, "Create new history". Click on it and you get a new history named "Unnamed history". If you click on the name, you can change it; let's call it something like "My first quality control". Then we switch back to the hands-on material. So, we created a new history and renamed it. Here is a link to our example data: just click on "copy", then go back to Galaxy, where you will find the "Download from URL or upload files" button on the left side.
Click on it, then click on the "Paste/Fetch data" button, paste the link into the box, and simply click "Start". We close this now; maybe it runs very quickly and I don't have to take a break. So we are now importing some data, and then we will also rename it. Okay, let's take a quick break here, and I come back once the data is imported. So, welcome back. If you downloaded the data, you will see it here on the right side, in green. We can now rename it: simply click on this "Edit attributes" (pencil) button, then go to the section with the name of the dataset. It is a bit long, so let's rename it to something like "reads_1". Then we click "Save", and this renames our data. Before we apply anything, let's first inspect this data. I mentioned in the theoretical part that raw data always comes in one of two formats, either a FASTA file or a FASTQ file, and here we have a FASTQ file. If you click on the "View data" (eye) button and wait a moment, you will see that there are lots and lots of sequences, or reads, in the data. A FASTQ file always has the format that each read takes four lines. The first line is the read ID, sometimes with a comment behind it; here the comment is the read length, but it could also be the organism or some other information. The second line is the sequence of the read. The third line is an additional separator line, always indicated with a plus sign; it can be empty or, as here, repeat the information from the first line. And the last line, the fourth line, is the quality string.
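The four-line record layout just described can be sketched as a tiny parser. This is only an illustration with a made-up record; a real FASTQ reader (e.g. from Biopython) handles the many edge cases.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from FASTQ lines, 4 per read."""
    it = iter(lines)
    for header in it:                  # line 1: '@' + read ID (+ comment)
        seq = next(it).strip()         # line 2: the read sequence
        plus = next(it)                # line 3: separator, starts with '+'
        qual = next(it).strip()        # line 4: quality string
        assert header.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual)   # one quality character per base
        yield header[1:].strip(), seq, qual

# A made-up record in the same style as the example data:
record = [
    "@read_1 length=37",
    "GATTACAGATTACAGATTACAGATTACAGATTACAGA",
    "+",
    "I" * 37,
]
```

Calling `list(parse_fastq(record))` returns one tuple with the ID, the 37-base sequence, and the 37-character quality string.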
As I mentioned, each character in the quality string stands for the accuracy of that base call: it gives you the probability that the base call is correct. This is encoded in ASCII, so let's go to the hands-on material to get a better idea. An ASCII character is, to the computer, just a number: the ASCII code gives us a number, and that number is basically the Phred score. Very importantly, the encoding depends on the sequencing device and also its version, like Illumina 1.3 or Illumina 1.8, which use different offsets, so the lowest Phred score, a Phred score of zero, corresponds to a different character: for Illumina 1.8 it is the exclamation mark, and for Illumina 1.3 the at sign. The Phred score, as also explained in the theoretical section, is basically a number that can then be translated into the base-call accuracy. We have three questions here. The first question: which ASCII character corresponds to the worst Phred score for Illumina 1.8+? I kind of already answered this. The second question: what is the Phred quality score of the third nucleotide of the first sequence? And the third and last question: what is the accuracy of that third nucleotide? Maybe stop the video and take a moment to answer these yourself. Okay, I continue. So, the first question: in Illumina 1.8 the worst Phred score is zero, the exclamation mark. The second question was: what is the Phred quality score of the third nucleotide of the first sequence? For the first sequence, the quality character of the third nucleotide is an I.
We know it is Illumina 1.8, and if you look at the picture here, maybe hard to see, the Phred score for J is 41 and for I is 40, so we have a Phred quality score of 40. The third nucleotide, a G, therefore has, if you look in the table, a base-call accuracy of 99.99%. How would I calculate this myself? I mentioned the conversion in the theoretical part: the Phred score is the negative logarithm of the base-call error probability, multiplied by 10. So if I have a Phred score of 40, I divide it by minus 10, which gives minus 4, and take 10 to the power of minus 4, which gives an error probability of 0.0001; taking the difference to one gives the base-call accuracy of 99.99%. Okay, good. Now we go to the actual next step. Type "FastQC" into the tool search field, then pick the tool "FastQC Read Quality reports". Here you can already apply it to your data, so let's just click "Execute", and I will briefly explain what the other options are. FastQC itself does not have many parameters to optimize. You can give it a contaminant or adapter list as a tabular file, if you want to search for contaminants or adapters which are not the standard ones. You can also provide a limits file, which specifies which submodules, so which plots, FastQC should run, and with which quality thresholds. Another option is called "Disable grouping of bases for reads >50bp"; what this means I will explain in a moment.
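Putting the encoding and the conversion walked through above into a short sketch (offset 33 applies to Illumina 1.8+; for Illumina 1.3 it would be 64):

```python
def phred_score(char, offset=33):
    """Quality character -> Phred score (Phred+33 for Illumina 1.8+)."""
    return ord(char) - offset

def error_probability(q):
    """Phred score -> probability the base call is wrong: P = 10**(-Q/10)."""
    return 10 ** (-q / 10)

def accuracy_percent(q):
    """Base-call accuracy in percent: (1 - P) * 100."""
    return (1 - error_probability(q)) * 100

q = phred_score("I")                  # the character from the example above
print(q)                              # 40
print(error_probability(q))           # 0.0001
print(round(accuracy_percent(q), 2))  # 99.99
```

The same functions also reproduce the encoding offsets: `phred_score("!")` is 0 for Illumina 1.8+, and `phred_score("@", offset=64)` is 0 for Illumina 1.3.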
Then you can set a lower limit on the length of the sequences to be shown in the report, which I will also explain, and you can set the length of the k-mers to look for. You know from the theoretical part that FastQC has a k-mer content plot showing you the overrepresented k-mers in your data, and here you can set the k-mer length; if you anticipate longer or shorter k-mers, you can change this here. For the moment, it is still running for me, so let's take a break here and come back when it is finished. Okay, welcome back. Now that we have run FastQC, you will see two new files in your history: one webpage file and one raw data file. The raw data is basically the data FastQC uses to generate the plots; there is nothing spectacular about it, but if you ever want to see the data behind the plots, you can look into the raw data file. We will now go step by step through all of the plots and results you get from FastQC. The first one is the basic statistics table. You see first the filename, so the file we have analyzed; the file type, which does not matter so much here; and the encoding. With the encoding we can check again which kind of sequencing device was used; here we see it is Illumina 1.9, so we also know which encoding was used if we ever want to look into the Phred quality string. Then we see we have 100,000 sequences in our data, zero were flagged as poor quality, and the sequence length is a constant number: every sequence has a length of 37.
And we have a GC content of 53%. So this is the basic information. The first thing I would always check is how many reads were analyzed, to verify that the coverage is the same as you would anticipate, and also the sequence length, although for that we have an extra plot. On the left side you immediately see every plot you get from FastQC, each with a symbol. A green check mark means it is totally fine, no quality issue. An orange exclamation mark means there is a warning: something is a bit odd, but it is not a heavy quality breach. And if there were a quality breach, you would see a red symbol, a cross, like a stop sign, so to say. Let's go to the first plot, the per-base sequence quality. We already covered this in the theoretical part: on the x-axis you see the position in the read, on the y-axis the Phred score, and each position has a box plot showing the distribution of the Phred scores of your read library at that position; the line is the mean Phred score, the mean quality so to say. You see, and that's why there is no big quality breach here, that the majority is in the green area, so a Phred score of more than 28. Only the end bases have, for a few reads, a quality between 20 and 28, and only very, very few, where the whiskers go a bit into the red zone, are outliers, very few reads of very poor quality. Now I want to mention something which I skipped. The read length, we said, is 37, and you have one box plot for every read position, so for every position we get the distribution of Phred quality scores. Let's go to the hands-on material: if you click on "non-uniform x-axis", what happens if your read length gets a bit longer?
Let's say 100: then FastQC actually starts to bin some of these positions into one box plot. You will then see only the first positions as individual box plots, and the other ones are binned, and you get this kind of scale at the bottom. FastQC does this because the size of the plots is uniform; with a uniform size, as the read length gets longer and longer, the box plots would get more and more squished, so at some point FastQC decides to bin some positions. Therefore, if you go back to FastQC in Galaxy, you have this "Disable grouping of bases for reads >50bp" option and can disable the binning. However, what can happen is that FastQC either throws a warning or the plots then look very ugly. Also, if you have really, really long reads and you want to look only at a certain window of these reads, or at a specific length, maybe to make them comparable to another experiment: say you have one experiment with only 50 bases and another with 200 bases, and you want to compare the 200-base experiment to the 50-base one, you can set the length limit for the 200-base experiment to only 50 bases. That is this option here. Okay, let's go back to the hands-on material. I mentioned in the theoretical part that the quality might drop at the end. You can of course see a generally poor-quality experiment, but the quality often drops specifically at the end because of certain things that happen naturally in sequencing, called signal decay and phasing. This has to do with how sequencing works: the fluorophores, for example, degrade over time (signal decay), and phasing has to do with what is called bridge amplification in Illumina sequencing.
There is a spot with a lot of molecules which you sequence at the same time, and they need to stay in sync to give a constant signal; but at some point, over a longer sequencing time, they lose this synchronicity, the signal becomes a bit more blurred, and therefore the base-calling quality generally decreases towards the end. This is called phasing. There are other things that can happen, which you can read up on. Here some other sequencing errors are also mentioned, if you want to check when you see something in your data and are unsure what the issue is: the material lists some potential quality breaches that might have happened, and quite often it comes down to sequencing errors. So now to the questions: how is the mean quality score changing along the sequence, and is this tendency seen in all sequences? Take a short moment to answer; maybe stop the video again. The solution is more or less what I already mentioned: the mean sequencing quality stays above 28, so this is really good quality. And is this tendency seen in all sequences? More or less. You see, because of the whiskers and the first and second quartiles, that some reads are a bit lower than 28, but again, this is just a few reads and not that big of an issue. Okay, I make a short break here, and let's come back. Next we have this plot which I mentioned is specific to Illumina sequencing: the per-tile sequence quality, where you can see if there was something heavily problematic with the sequencing device. There you would see it if you had, for example, an air bubble in the flow cell. Here it is very good: blue means good quality; big red stretches would mean poor quality.
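Coming back to the per-base sequence quality plot for a moment: per read position, FastQC collects the Phred scores of all reads and summarizes their distribution. A minimal sketch of that per-position summary, with made-up quality strings (Phred+33 encoding assumed):

```python
# Made-up quality strings for three reads of length 8 (Phred+33).
quals = [
    "IIIIHHHH",  # uniformly high quality
    "IIIHHGG5",  # quality drops at the end
    "IIHHGG55",
]

def per_position_scores(quality_strings, offset=33):
    """For each read position, the Phred scores across all reads."""
    length = len(quality_strings[0])
    return [[ord(q[i]) - offset for q in quality_strings]
            for i in range(length)]

def per_position_mean(quality_strings):
    """The mean Phred score per position (the line in FastQC's box plots)."""
    return [sum(col) / len(col)
            for col in per_position_scores(quality_strings)]

means = per_position_mean(quals)
print(means[0], means[-1])  # the mean drops towards the read end
```

In this toy example the first position averages a Phred score of 40, while the last position drops below 28, the kind of tail-end dip the plot makes visible.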
Then you have the per-sequence quality scores: on the x-axis the mean sequence quality, so the distribution of the mean Phred scores per read, and on the y-axis the number of reads with that mean sequence quality. You see, as you already saw from the per-base sequence quality plot, that a lot of reads have a quality higher than 28. So this is quite a good plot, the kind you would expect. Okay, let me check. The next one is the per-base sequence content, and here you see there is a warning sign. On the x-axis you see the position in the read, on the y-axis the fraction of reads with a given nucleotide at that position. Ideally you would expect a flat distribution, which would not generate a warning, but here we have one because of the beginning: we see a lot of bumps there, and this is probably due to a bias. That does not necessarily mean poor quality; it means something happened, whether anticipated or not. It is probably a bias from the protocol; for ChIP-seq, for example, I would assume something like this because of the protein binding to certain regions, or the extraction enzymes used, or a somewhat biased fragmentation. Mutations, adapters still in the data, or contaminations could also lead to a different nucleotide composition. The per-sequence GC content here also shows a warning sign. The blue line is the theoretical distribution, the red line is your actual distribution; on the x-axis you see the GC content in percent, and on the y-axis the number of reads with that GC content, so it is the distribution of the GC content. It shows a warning because the distribution is a bit shifted and a bit spiky.
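The per-read quantity behind that GC content plot is simple to compute; a sketch (the histogram FastQC draws is just the distribution of this value over all reads):

```python
def gc_percent(seq):
    """GC content of a read, in percent."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

print(gc_percent("GATTACA"))   # ~28.6 (2 of 7 bases are G or C)
print(gc_percent("GGCCGGCC"))  # 100.0
```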
A shifted, spiky distribution like that is again maybe an indication of some contaminating sequences, or a bias, in my read library. Then I have the per-base N content. It's perfect: we have no Ns in there. Maybe you remember from the theoretical part that an N is basically a base call for which there is no confidence anymore about which base it is: if the sequencing device, or the algorithm, cannot decide based on the signal whether it is an A, T, G or C, then instead of writing one of those it writes an N into your file, so you will then also see some Ns in your FASTQ. On the x-axis is again the position in the read, on the y-axis the fraction of reads with an N at that position, and here it is zero, so this is perfect. Then you have the sequence length distribution; I mentioned all of the reads have a sequence length of 37, so if you expect something like this, perfect. There could of course be sequences of different lengths, depending on your protocol; it does not always have to be constant. Then you have the sequence duplication levels. Here you see how many sequences are duplicated. How FastQC decides this: it just looks at the sequences themselves in your read library and checks whether it finds a sequence several times, so more than once. On the x-axis you see the duplication level, so how often a sequence occurs in your read file, and on the y-axis the fraction of reads with that duplication level. The blue line is your empirical distribution; the red line is the distribution you would have after deduplication. On top, the title also tells you the fraction of reads you would still keep if you deduplicated. Here it is 91%, so pretty good: a lot of reads are actually unique molecules, and only a few are duplicated, or appear three or four times in the data.
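What the duplication-level plot counts can be sketched with a few made-up reads: tally identical sequences, then tally how many distinct sequences occur once, twice, three times, and so on.

```python
from collections import Counter

# Made-up read sequences for illustration
reads = ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "GGCC", "ATAT"]

copies = Counter(reads)                # sequence -> number of copies
dup_levels = Counter(copies.values())  # duplication level -> distinct sequences

# Fraction of reads left if deduplicated (one copy per distinct sequence),
# analogous to the percentage FastQC reports in the plot title.
kept = len(copies) / len(reads)
print(dict(dup_levels), round(kept, 2))
```

Here two sequences are unique, one occurs twice and one three times, so deduplication would keep 4 of the 7 reads (about 57%).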
There is also a barely visible little bump where some reads appear maybe ten times. So this is a pretty good experiment, typical of what you can expect when you do DNA sequencing or RNA sequencing; for protocols like ChIP-seq or CLIP-seq you quite often actually see a lot of duplication. It is really important, if you ever see this, that you discuss what to do with your bioinformatician or with your PI, because it is not the case that you then always apply deduplication. I come back to this in a moment; I will just go over the rest of the plots first and then we talk about it again. Next you have the overrepresented sequences. There are none in there, and because there are none, the k-mer content plot, which normally would come after the adapter content plot and shows you the overrepresented k-mers in your data, is also missing. Here it just says there are no overrepresented sequences; otherwise you would have a table. The last one is the adapter content. If you have not provided FastQC with an adapter contamination list or something similar, it just checks for the standard sequencing adapters potentially in your data. On the x-axis you see the position in the read, on the y-axis the fraction of reads with a given adapter at that position. We can see we have no adapters, which is basically perfect. I would assume somebody already removed them; quite often this is done by the sequencing facility already, or you already gave the data to your bioinformatician and the adapters were removed. If you really have raw data, you would usually see some adapters. So let's go back to the hands-on material, where the per-base sequence content plot is covered again.
This was our plot here, and this is a plot which you would perhaps see for bisulfite sequencing data, or for example for RNA-seq data: a very heavy nucleotide composition bias at the start. It will throw a warning, but that doesn't mean it is a big quality breach; probably it is actually something intended. I would also expect this sometimes for CLIP-seq data, that I have a nucleotide composition bias. So even though it shows a warning, it is intended and everything is fine. Here was the question: why is there a warning for the per-base sequence content graph? I explained this: in the beginning there is a kind of bias, and that's why there is a warning. For the per-sequence GC content there is also a question of why there is a warning, and I talked about this too: it was shifted to the left and a bit more spiky. Then, PCR duplicates: this was our plot here, and this is actually a plot I would anticipate for, for example, CLIP-seq. For CLIP-seq I would do a deduplication, because for CLIP-seq it is very important to know where my proteins bind. CLIP-seq stands for cross-linking immunoprecipitation; it is a protocol where you analyze where a protein binds to your transcriptome. It is very important to know where this position is, and duplicated sequences generate a very heavy bias in your counts and therefore distort your later analysis; similarly for ChIP-seq. And it is actually also mentioned somewhere in this text that for RNA-seq or ChIP-seq these can be truly overrepresented sequences: if you analyze RNA-seq data, you simply have very abundant transcripts.
You have different isoforms of your transcripts, right? So this could just mean that you have a very abundant transcript: you have a lot of sequences coming from the same location, therefore a lot of sequences which are the same, and therefore a lot of duplicates. So if you ever see this for RNA-seq or ChIP-seq, then really talk about it with your bioinformatician or your PI and ask whether you should do the deduplication or not. I would only do a deduplication if I have enough material; it depends again on the analysis you want to do, but let's say for differential expression analysis, where you want to know which transcripts are differentially expressed between two conditions, you need at least around 10 million reads, and if you don't have enough and you do a deduplication, then you will influence your results quite heavily. So I would only deduplicate if I have enough reads in my read library. The per-tile sequence quality is also mentioned here again; this is what we kind of had, and this is what you might see when something went wrong with the sequencing device, because it is also very systematic here. There are no red tiles so far, but I would still ask why I see such a systematic bias, and probably it has something to do with the sequencing device. The per-base N content we had, and this is again not a big issue here, but maybe you remember from the theoretical part we had a plot where a lot of positions had a large fraction of Ns.
In that case you would really maybe just discard the data set, because something really went wrong in your sequencing or your experiment. The sequence length distribution we talked about. Adapter content: I just mentioned this was our plot, and here is a plot I would see if I still had adapters in the data; you see, okay, here I still have Illumina universal adapters in my data. If you see something like this, then you have to do an adapter trimming, and we come to this point later on. K-mer content: again, we don't have this in our report because there are no overrepresented sequences, but if you have this plot, and there is a kind of breach, you would see something like this. This could be linked to my adapters, because if I have adapters in the data, then I also have overrepresented sequences, and this also shows up in my k-mer content. Or, if I don't have adapters in there and still see this and don't anticipate it, then I have some contamination or some systematic bias in my fragmentation, so I would need to check it. And with that we come to the end of the FastQC part. I just want to mention that there are many, many more scenarios, many more things which can of course happen to the quality of your data. If you see something and you don't know what it might be, you can follow this link here; there you will find more descriptions of what kind of quality breaches might happen and what they look like, so you get more information about what might have happened to your data. Okay, we do a small break here. So, let's come back: we left off by discussing the results of FastQC, and we saw that the only slight issue with our data was that the bases at the end were of a bit lower quality. What we are going to do now is something called end trimming.
So, let's come back: we left off with the discussion of the quality control, and we saw there were one or two warnings, but no big quality breach. The only issue, maybe, was that the end bases of the reads were of lower quality than the rest of the read. We are now going to do something about this by trimming the end bases; this is called quality trimming. For that we will use a tool called Cutadapt. Let's go to Galaxy and search for the tool Cutadapt. If you click on it, the first option asks whether your data is single-end or paired-end; we have only one file, so it's definitely single-end. You can select your FASTA or FASTQ file here; if you cannot select anything, then probably your FASTQ or FASTA file is not correctly formatted in Galaxy, meaning its datatype states a question mark or something totally different from fastq, fastqsanger or fasta, and you have to change this. Then the next options: you can provide Cutadapt with adapter sequences, on the 3' end, the 5' end or both ends. If you click on it, you can provide a custom sequence yourself or one from the history. But we already saw in the FastQC plots that we had no adapters in our data, so let's remove this again. You can also cut a fixed number of bases before the adapter trimming; if you know that there are, for example, some molecular barcodes in your data, you can cut these off before the trimming. There are further options here, but I will not explain them in much detail; if you want to know about them, take your time and read through them. What we want to do now is go to the filter options and set a minimum length of 20.
So we now tell Cutadapt: please filter out any read which is smaller than 20 bases. You do this quite often, because when you do a mapping, small reads map very easily to your genome or transcriptome, but quite often they do not map uniquely: the shorter the sequence, the higher the chance it maps twice or more to your genome or transcriptome, and therefore you usually discard very short reads. This cutoff is a bit empirical; there exist some studies, and you see different values in different papers. Here in this example we use a cutoff of 20. Then go to the read modification options, where you have the option "Quality cutoff", and here we also say 20: we cut off every base with a Phred score lower than 20. Last but not least, in the output options we want a report, so say yes, and then we execute. Okay, we wait and come back when Cutadapt is finished. Okay, now that Cutadapt is finished, we have two new files here. The first one, if you click on it, is simply another FASTQ file; Cutadapt has simply done something. And if you check the size of both files, our initial file and the Cutadapt file, you see something definitely happened, because it was reduced in size. If we now click on the report, we get some information about what happened, and for that we have three questions: how many reads were found with adapters, how many base pairs were removed from the reads because of bad quality, and how many sequences were removed because they were too short? Take a short moment to answer; maybe stop the video if you want to think about it yourself. Otherwise, here are the solutions. So, how many reads were found with adapters? Actually zero, none, as you can see here.
We already confirmed this with FastQC: there were no adapters in my reads, and we also did not provide any adapter sequences. How many base pairs were removed from the reads because of bad quality? Here you can see the solution, and in the report itself we see it is 1.2%, around 44,000 bases, so the minority; the majority, 98.7%, we retained. So most of our data is still there, which is pretty good. And the last question: how many sequences were removed because they were too short, meaning smaller than 20 bases? You can see it here under "reads that were too short": we had 322, so 0.3%, and the rest were actually long enough to pass the filter, which is pretty good. Now we can rerun FastQC on our new data. So let's go back to FastQC and run it, not on the report, but on the Read 1 output of Cutadapt, and simply click execute. Let's wait until it's finished. I just want to mention: if you see bad-quality sequences, it is always good to apply Cutadapt; however, don't simply do it immediately. Maybe also ask yourself why there is bad quality, and check again whether something might have happened with your experimental setup, your library preparation or your sequencing before you do a quality trimming. Another thing to mention: here is an explanation of how Cutadapt does the quality trimming. It does not simply remove every base which is lower than our quality threshold; there is actually a more sophisticated method behind it, to take into account something like a false discovery rate. This is maybe worth reading up on, to see that Cutadapt has a bit more sophisticated method than just getting rid of the bases below the threshold.
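As a sketch of that idea: Cutadapt's documentation describes a BWA-style algorithm in which the cutoff is subtracted from each quality value, partial sums are taken from the 3' end, and the read is cut where that sum is minimal, so isolated good bases inside a bad tail don't stop the trimming too early. The example quality values below are, as far as I recall, the ones used in the Cutadapt documentation; treat this as an illustration, not a reference implementation.

```python
def quality_trim_index(qualities, cutoff):
    """Return the cut position: keep bases [0:pos], trim the rest.

    BWA-style: subtract the cutoff from every quality value, sum from
    the 3' end, and cut at the minimum of those partial sums.
    """
    s = 0
    min_s = 0
    pos = len(qualities)
    for i in reversed(range(len(qualities))):
        s += qualities[i] - cutoff
        if s < min_s:
            min_s = s
            pos = i
    return pos

# Good start, noisy tail with one decent base (quality 11) inside it:
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(quality_trim_index(quals, 10))  # 4 -> keep the first four bases
```

Note that the decent base at position 6 does not rescue the tail: the running sum still reaches its minimum after it, so the whole noisy stretch is trimmed.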
Let's check if FastQC has already finished. No, it's still running. While this is running: we will actually need a second dataset, because I basically lied when I said that we are dealing with single-end data. Actually we are dealing with paired-end data, and so far we have only analyzed the first mate of the read pair, the forward read, not the reverse one. For that, we want to do a quality control again, and therefore we can already import the data into our history. So go again to the upload button, click on 'Paste/Fetch data', paste the link, and then simply click start. In the meantime I'll simply do a break here, and we come back when the import is finished.

So, it should now be finished, and we can check. Now we have to think, because we also have to run FastQC. Yeah, let's do this first. So let's rename our new file: click here on this pencil icon and rename it however you want; I name it reads_2, since this is my reverse read, my second mate, and we save that. And now we also want to check the quality of this FASTQ file, this read library. Therefore we again search for FastQC, apply it to reads_2, and execute.

Now let's look at the quality of the data which we have processed with Cutadapt. Just as a comparison, let's go back to the raw data: there you can see that the bases here at the end had a bit lower quality. And if you now go to the processed file, you see it has much better quality than before. You can probably also see this in the average quality per read, so the per-sequence mean quality distribution. This is the new one, and the old distribution looked a bit different; it's barely visible, but I can tell you there are no more reads with a mean quality below about 28.
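The per-sequence mean quality distribution just mentioned is, conceptually, a histogram of each read's average Phred score. A small illustrative sketch of that statistic:

```python
from collections import Counter

def per_sequence_quality_counts(reads_quals):
    # Histogram of per-read mean Phred scores (rounded to whole numbers),
    # the statistic behind FastQC's "Per sequence quality scores" plot.
    return Counter(round(sum(q) / len(q)) for q in reads_quals)
```

After quality trimming, the low-mean bins of this histogram empty out, which is exactly the shift you see between the raw and processed plots.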
And those are the main things that happened. Of course, you now have a new warning, and this is what I also mentioned: if you apply Cutadapt, then of course you get reads with different lengths. You can see it already in the basic statistics: the length is now between 20 and 37. And if you go to the sequence length distribution, you see that most of the reads now have a length of 37 and a few are a bit shorter; these are the reads which we quality-trimmed. This is also what I would expect if I had raw data with varying read lengths: a distribution like this, where most of the reads have a certain length and only a few deviate from that value. Okay, you can pause the video now, but I see already that my FastQC run for the second read file went through.

So first, the questions here: how many sequences have been removed, and has the sequence quality improved? We actually saw this already in the Cutadapt report: we had 100,000 sequences and we are left with 99,678 now. And has the quality improved? As I mentioned, we saw quite well that some of the bases at the end now have better quality than before.

Now, let's look at the report of the second read file. First: we have paired-end sequencing, so the second read file should have the same number of reads as my first read file. And here we see 100,000 sequences; if this were not the case, something would definitely have gone wrong with the sequencing. The sequence length is also 37, and the GC content is also 53%. The first thing we observe here is a failure for per-base sequence quality: this time the quality is actually much worse than in our first read file. So a lot of bases are really far down here, with a Phred quality score below 20. The per-sequence quality is good.
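Coming back to the length warning for a moment: the sequence length distribution FastQC reports is simply a tally of read lengths, which after trimming peaks at the original length (37 here) with a small tail of shortened reads. Sketched as:

```python
from collections import Counter

def length_distribution(reads):
    # Count how many reads have each length, as in FastQC's
    # "Sequence Length Distribution" module.
    return Counter(len(read) for read in reads)
```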
There are maybe a few bars here, but it's not a big issue; most of the reads actually have a quality a bit above 20. What we now see here is the sign FastQC uses for a failed check. It is showing you: well, here something is definitely off, probably because the bias at the beginning is much higher, and also in the middle we see some bias; we kind of saw this already in the first sequence file. Per-base N content is good, no failure; sequence length is constant, also very good; duplication levels are also good; no overrepresented sequences and no adapters in my second file either.

Before I come back to this step, we go immediately and apply Cutadapt. So, let's search for Cutadapt, because we saw there was a quality failure with my second read file. And now it is very important: I told you we actually have paired-end sequencing data, so instead of single-end we have to select paired-end here. And now we have to select for the first FASTQ file our first read, so the first mate file, and for the second FASTQ file the second mate file. Then we set the same options as before: filter options, minimum length 20; read modification, quality cutoff 20; output options, yes, we want a report; and then we execute.

The question is now, and let's ask a very naive question: why do I have to choose paired-end sequencing, why not simply apply Cutadapt only to my second read file? And the simple answer is: if some of the reads are filtered out in your second read file only, and you later do a mapping step or some subsequent analysis with your files, then you will get an error or a complaint from the tools, saying something like your read files do not have the correct format, or your read IDs are incorrect.
Because, of course, with paired-end sequencing data the first and second mate read files should be in sync: they should have the same number of reads. And if you apply the filter to only one of the mates, then you filter out some reads which are still present in the other file. Therefore you have to tell Cutadapt: okay, we have paired-end sequencing data, so whenever you filter out a read from one of the mates, also filter out the other mate. This is why you have to apply it in paired-end mode. So, this run is through.

Now, before we look at the results of Cutadapt, let's apply something called MultiQC. MultiQC is a very handy tool which can combine one or more of your FastQC reports into one big report; it makes a summary of them, and this is very handy. If you have a lot of files, a lot of quality reports, and you want to see all of them in one big plot instead of as individual plots, then MultiQC is the tool to use. So search for MultiQC and click on it. Then, for 'Which tool was used to generate logs?', we select, or search for, FastQC. Yes, raw data. And then we apply it to the first FastQC reports, where we did the quality control check on the raw data. For that we have to select the raw-data outputs, not the web reports themselves; for me these are datasets 10 and 3. So I click on the first one, then press and hold Ctrl and click on the second one, so that both files are now selected, and I simply say execute. Okay. And the same thing we are now doing also, or do we have to? Let me think about it. We would first have needed to apply FastQC, but okay, we will come to this a bit later. Let's check out the results of Cutadapt.
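The in-sync requirement for paired-end trimming can be sketched like this: a pair is dropped as a unit, never one mate alone. A simplified illustration of the principle, not what Cutadapt actually does internally:

```python
def filter_pairs(mates1, mates2, min_len=20):
    # Keep a pair only if BOTH mates pass the length filter, so the
    # two output files stay in sync (same read count, same order).
    kept1, kept2 = [], []
    for r1, r2 in zip(mates1, mates2):
        if len(r1) >= min_len and len(r2) >= min_len:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2
```

Filtering each file independently would instead leave the two files with different read counts, which is exactly what downstream mappers complain about.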
And here are already the questions: how many bases have been removed from the reads because of bad quality, and how many sequence pairs have been removed because they were too short? I leave this up to you to figure out yourself, so you can stop the video. If not, let's go and see where we can find this information. From Cutadapt we now got, of course, three files: one file for my processed first mate, one file for my processed second mate, and the report file. The first question was: how many bases have been removed from the reads because of bad quality? Under total processed bases and quality-trimmed we see: for my first mate we have 44,164, and in the Cutadapt report for the second mate we have a few more bases, around 140,000. And the second question: how many sequence pairs have been removed because they were too short? This we see here, under pairs that were too short and removed: about 1.4%. So that is the minority; the majority, 98.6%, stayed in the read library, which is pretty good. So those are the answers to these two questions.

And as you can see, things are still running, so in the meantime let's apply FastQC to the processed data of my second mate. For that we search for FastQC and apply it; for me it's file 12, the Cutadapt Read 2 output, and we simply click execute. Okay, it seems like it takes a bit of time now, so I'll do the last break and then come back.

Okay, welcome back; I hope that was the last break. For the last step, I want to apply MultiQC now also to the quality controls of the processed files. So again, search for MultiQC, type in FastQC, and select the processed quality controls; it is maybe already worth it to actually rename your data now. For me, it's here dataset number 17, the raw-data output, and this one, FastQC.
No, wait, sorry, I see I already made a mistake. So it would be better at this point to rename your files, of course. So here you can see: we have the raw data, we did the quality control on the raw data, we ran Cutadapt, and then we did the quality control on the processed files. I want those here: hold Ctrl again and click on number 7, or, if you already renamed your files, if you're better organized than me, click on this file 4, and then simply execute.

Now let's take a look at the MultiQC report. You simply click here on the HTML report, and in MultiQC you will see, or you get, basically the same information as in FastQC, but presented a bit differently. Here you have the general statistics, and you can see listed what percentage of reads are duplicated, the GC content, and how many reads there are. It seems like we have to wait until it's loading; I hope it does it quickly. Please hold the line. Let's see, new data; I just had to click on it again. And here you see again a plot about unique and duplicated reads; here you now have the per-base quality plot, and this time for both files, as you can see. Then you have the per-sequence quality scores, so the per-sequence quality distribution; the per-base sequence content, so the nucleotide distribution, is shown like this here, but you can also click on it to get the plot which we also know from the FastQC report. Then you have the GC content, per-base N content, sequence length distribution, duplication levels, overrepresented sequences, and adapter content. And last but not least, the status checks: this is a plot. Remember, in FastQC you saw on the left side the names of the individual modules, and next to each a sign showing whether it is checked, so a satisfying plot, whether there is a warning, or whether it is a failure; and this is kind of the summary of that.
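That status summary is, in essence, a pivot of the pass/warn/fail flags that FastQC emits (for example in its summary.txt output) into one module-by-sample table. A small sketch of the idea, with hypothetical sample names:

```python
from collections import defaultdict

def combine_status_checks(per_sample_flags):
    # Turn {sample: [(status, module), ...]} into {module: {sample: status}},
    # similar in spirit to MultiQC's FastQC status-checks overview.
    table = defaultdict(dict)
    for sample, rows in per_sample_flags.items():
        for status, module in rows:
            table[module][sample] = status
    return dict(table)
```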
So for each of the individual files you provided, you can see, as we know from my second file, that there was a failure for the per-base sequence content. Okay, let's look into the FastQC report of the processed second mate; click here on the web page of the latest FastQC report. Yes, okay, great. And we can see here that the quality improved: there is no failure anymore for the bases at the end; even though there is one position which still dips a bit below a Phred quality score of 20, the rest have now improved in their quality. As a comparison, let's go to the unprocessed plot: this was the previous plot, and you can see the per-base quality definitely improved. And the MultiQC run for the processed files is also finished now, and you also see the improvement there. So let's go to the MultiQC report of the raw data: yeah, and you see that for the second read, or mate, it was much lower than it is now in the processed file.

Okay, with this we come more or less to an end; we have reached the end of this hands-on session about quality control. There are two questions left, and I leave them up to you to answer. So now you know how to do a quality control with FastQC, how to do a trimming with Cutadapt, and also how to combine several FastQC reports into one with MultiQC. If you think that this material needs a bit more improvement, let us know, and also rate this tutorial; if you found it helpful, then leave a comment. And with that, I hope you will enjoy the rest of the sessions, and I say to you: have a nice day and a good week. Bye bye.