Hello everybody, welcome to the quality control session. It is split into a theoretical part and a practical part. Here I will briefly cover the intention behind quality control for next generation sequencing experiments and then go deeper into how to actually do quality control, mainly with a tool called FastQC. In order to follow this session, you need to have listened to the introductory courses for Galaxy, so you need to know how to operate in Galaxy.

I will answer three main questions: how to control the quality of next generation sequencing data, what the main quality parameters to check are, and, if you see a quality breach, how you can improve your data. The main objectives of this session are that you can manipulate FASTQ files, that you can run a quality control on FASTQ files with FastQC, that you understand the output of FastQC, and that you can act accordingly if you see quality breaches and apply some quality corrections.

So why do you have to do quality control? It is fairly obvious: you want to check whether your experiment worked correctly. A next generation sequencing workflow has a lot of steps: the experiment itself, the library preparation, the sequencing, and additional steps in between. You want to check, for example: did my knockout work correctly? Did the sequencing actually work? If you send your sample to a sequencing facility, it is always worth checking whether something went wrong there. The same holds for your library preparation: some contamination could have happened, and you want to check whether there was a quality breach in your library. And there are many, many kinds of next generation sequencing experiments: ChIP-seq, CLIP-seq, RNA-seq, Hi-C, and so on.
They all have their own protocols and their own biases, which you have to look out for. Even within CLIP-seq, for example, where you investigate the interaction between the proteome and the transcriptome by cross-linking proteins to RNA, you have different protocols like iCLIP, eCLIP, PAR-CLIP, HITS-CLIP, and so on. And all of these protocols, I can tell you, have different biases which you have to check.

However, every next generation sequencing experiment follows roughly the same first steps. The samples are sequenced, of course, with Illumina, Ion Torrent, or Nanopore. What you get out of it is either a raw FASTA file but, more importantly, a FASTQ file. For Nanopore, for example, you sometimes first get a FAST5 file, which you then convert into a FASTQ file. This FASTQ file is what you use for quality control. After you have done the quality control, after you have checked whether your raw data has some quality breaches, you go on to the other steps like mapping and your general data analysis.

So now let's talk about the main formats: FASTA and FASTQ. What is FASTA, if you have never heard about it? A FASTA file mainly lists all of the reads which you sequenced. It always follows this format: one line for the read identifier, sometimes followed by some comments like the organism which was sequenced, and then a second line with the actual sequence, so A, T, C, or G, symbolized here with Xs. You sometimes also have Ns in there; what an N stands for, I will explain later on. After the second line follows the second read, so a second read identifier and then the sequence for that read, and this goes on and on with the third read, the fourth, and so on. This is the basic FASTA file.
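The FASTA layout described above can be sketched with a few lines of Python. This is a minimal, illustrative parser (the record names are made up; real tools like Biopython handle edge cases far more robustly):

```python
# Minimal sketch of a FASTA parser: one ">"-prefixed identifier line,
# followed by one or more sequence lines, repeated per read.
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):          # a new record starts
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)              # a sequence may span several lines
    if header is not None:
        yield header, "".join(seq)

records = list(parse_fasta([
    ">read_1 example organism",           # identifier plus optional comment
    "ATCGGCTA",
    ">read_2",
    "TTAGGCAN",                           # note the N placeholder base
]))
```

Note that the second record's sequence contains an N, the placeholder character explained later in this session.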
Then you have the FASTQ format, where the Q stands for quality. Again, you have one line for the read identifier with some comments, and a second line for the sequence. Then you have an additional third line, very often just a plus sign; this is a kind of filler line where you can again include some additional information. And then you have a fourth line with the quality string, symbolized here with Qs, although of course there are different characters in practice. Each character stands for the quality of the corresponding base call, so the accuracy of that base call from the sequencing device. After the fourth line follows the second read: a second read identifier, the sequence, the additional filler line, and then the quality string. And then again the third read, fourth read, and so on.

So what is this quality string, or what are these quality scores? They are called Phred quality scores, or Phred scores, and they give you an idea of the accuracy of the base call. Let's say you have a Phred quality score of 10. This means that your base call was 90% accurate, meaning there is a chance or probability of 10% that this base call was wrong. So if you have 10 reads, so 10 times the same read, then in one of these reads this base is wrong. Obviously, if the Phred score increases, let's say to 20, then you have a base call accuracy of 99%, meaning there is only a 1% chance that the base call, so this particular A, T, G, or C, is wrong.

How is this Phred score actually calculated? There is a conversion from the probability of your base call being wrong: you take the base-10 logarithm of this error probability and multiply it by minus 10.
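The conversion just described, Q = -10 · log10(p), and its inverse can be sketched in a couple of lines of Python:

```python
import math

def phred_score(p_error):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Invert the conversion: probability that the base call is wrong."""
    return 10 ** (-q / 10)
```

So an error probability of 0.1 (10%) gives a Phred score of 10, and 0.01 (1%) gives a Phred score of 20, matching the examples above.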
So let's say the error probability is 10%, so 0.1: the logarithm is minus 1, times minus 10 gives a Phred score of 10. And as this picture shows, it is very important to know which kind of sequencing device was used, because this calculation differs between technologies; for Solexa, for example, the conversion is a bit different. So always check which sequencing device was used.

I also mentioned that what you have in the FASTQ file is a quality string, not numbers, because with numbers the line would either become very long, or it would be very hard to tell which score belongs to which base, since quality scores can have more than one digit. So at some point it was decided that the Phred score is converted to a character using the ASCII code. All of the characters on your keyboard are represented in the computer as numbers; in ASCII, for example, the @ sign has the number 64. This can be used to convert the numeric Phred score into a character.

And again, it is very important to know which sequencing machine was used, because if you ever want to investigate these quality strings yourself, the offset where the Phred scores start differs by sequencing device, and can even change between versions of the same device. For example, Illumina 1.3 starts a Phred score of zero at the @ sign, so a Phred score of 40 ends up at the character h, whereas in Illumina 1.8 and higher the default changed: a Phred score of zero is at the exclamation mark, and a Phred score of 40 or 41 lands at I or J.
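The two encodings mentioned above differ only in their ASCII offset: 33 for Phred+33 (Illumina 1.8+, where 0 is `!`) and 64 for the older Phred+64 (Illumina 1.3, where 0 is `@`). A minimal sketch of decoding and encoding a quality string:

```python
# Sketch of FASTQ quality-string decoding. The offset is the crucial
# assumption: 33 for Phred+33 (Illumina 1.8+), 64 for Phred+64 (Illumina 1.3).
def decode_quality(qual_string, offset=33):
    """Turn each ASCII character into its numeric Phred score."""
    return [ord(c) - offset for c in qual_string]

def encode_quality(scores, offset=33):
    """Turn numeric Phred scores back into a quality string."""
    return "".join(chr(q + offset) for q in scores)
```

For example, `decode_quality("!I")` gives `[0, 40]` under Phred+33, while the same scores under Phred+64 would be written as `@h`. Decoding with the wrong offset silently shifts every score by 31, which is exactly why you have to check which device and version produced the data.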
So you always have to check this for your sequencing device. Okay, now we go a bit deeper into identifying potential quality issues, and for that we concentrate on the very important tool called FastQC. If you search for it in Galaxy, you will find it very quickly. You apply it to a FASTQ file or a set of FASTQ files (you can also apply it to BAM files), and you get an HTML report as a result.

When you open the report, the first page shows the basic statistics. It lists some general information about your read library, like the file name, the file type, and the encoding; there you can check again which sequencing device and version was used. Then comes the total number of sequences in your read library, here a thousand reads. Next are the sequences flagged by FastQC as poor quality, here zero. Then you see the sequence length: all of the reads in this read library have a length of 100 nucleotides. And you see the overall GC content, here roughly 50%.

If you then go to the first plot, this is one of the most important ones: the per base sequence quality. On the x-axis you see the position in the read. For each position you get the distribution of the Phred quality scores, plotted as a box plot, as you can see here. The blue line is the mean quality score across all positions. The red zone is very poor quality, the zone from around 20 to 28 is okay-ish quality, and 28 to 40, the green zone, is very good quality. A Phred score of 20, as you may remember, corresponds to an accuracy of 99%, so a probability of 1% that the base is wrong. Here you can see an experiment with already good quality.
And in this example you see an experiment with very bad quality: at the end of the reads in the read library you have very poor quality, so it sits in the red area. The report will also immediately flag this. Now you can make several decisions. You can get rid of the dataset entirely, if there are a lot of other quality breaches as well. But you can also decide to trim off the ends. This is actually quite common, because with every sequencing technology, even with Nanopore, the base quality at the end of the reads is always a bit worse than at the beginning, and the longer the read, the worse it gets. So quite often you do something called end trimming: you cut off the ends if there are quality issues there. You can do this with Trimmomatic or Cutadapt. For example, in this intermediate quality example, where only three positions at the end have poor quality, you would probably do an end trimming and, of course, keep the whole dataset.

After the per base quality, you have the per sequence quality, where you see on the x-axis the Phred score and on the y-axis the number of reads with this average Phred score. So it is basically the distribution of the average Phred score over the read library. You hope to see a very high peak at a Phred score of at least 20. If you see a little bump with some reads of lower average quality, you can again decide to get rid of them. Cutadapt, I think, has some options to filter out low quality reads in general, and some mappers already do this internally. And of course, if you do an end trimming, this plot can also change.
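The filtering idea just described, dropping reads whose average quality is too low, can be sketched in Python. This is only an illustration of the concept; real tools like Cutadapt use more refined, windowed algorithms:

```python
# Sketch of per-read quality filtering: keep reads whose mean Phred score
# clears a threshold. Assumes Phred+33 encoding (offset 33).
def mean_quality(qual_string, offset=33):
    """Average Phred score of one quality string."""
    scores = [ord(c) - offset for c in qual_string]
    return sum(scores) / len(scores)

def filter_reads(reads, threshold=20):
    """reads: list of (sequence, quality_string) tuples; keep the good ones."""
    return [(seq, q) for seq, q in reads if mean_quality(q) >= threshold]
```

With Phred+33, a read whose quality string is all `I` characters has a mean score of 40 and is kept at a threshold of 20, while a read of all `!` characters (mean score 0) is dropped.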
So if you do an end trimming with Cutadapt, the average quality of the trimmed reads increases, and this can already improve the overall quality of your read library.

Then you have this plot, which is specific to Illumina sequencing. If you are familiar with Illumina sequencing, you have a flow cell with different lanes; here, a flow cell with two lanes. On the y-axis you see the tiles of the flow cell, and on the x-axis the position in the read. Because it is sequencing by synthesis, it is a kind of plot showing you how the sequencing performed over time: the later the position, the later the time point. It is color coded with a hot-cold scale, meaning cold or blue colors stand for good quality, and hot or red colors stand for poor quality. Ideally this picture should be completely blue, but you can see there are some bars, some positions in both lanes, which are red. This does not have to mean that this was a very poor experiment or sequencing run; maybe just these positions had poor quality, and I would actually keep this experiment. However, if you see a very big stretch, or a lot of positions that are red, this is an indication that the sequencing went wrong. What might have happened, quite often, is that there was an air bubble in the flow cell; then the sequencing did not work out well enough, and you would probably get rid of this dataset.

Next we come to this plot, the per base sequence content, showing you on the x-axis the position in the read and on the y-axis the fraction of the reads which have a given nucleotide at that position, so the fractions for T, C, A, and G. Ideally, of course, there should not be any bias: the fractions of all nucleotides should be similar throughout the whole read library. This is, of course, not always the case.
You actually see bias quite often at the beginning or at the end of the reads: at the beginning because the quality is sometimes a bit worse there, or because of adapters, and at the end also because of adapters. However, there are also protocols like CLIP-seq, ChIP-seq, or even CUT&RUN which have internal biases, for example because of the restriction enzymes they use. Then you generally see some bias in this plot, and it does not mean poor quality; it is just a bias of the protocol. So, I have to say again, you really have to know what your protocol is doing and what kind of biases it induces in your data. If you do, for example, a CLIP-seq experiment and you see something like this, it does not immediately mean bad quality; it just confirms the bias the protocol induces. However, if you anticipate no bias and no skewed nucleotide composition in your data, then this is a quality breach. Either your fragmentation did not work so well, or you have done a very aggressive adapter trimming (I come to what this means in a moment), meaning you removed the adapters so heavily that you ended up with a bias from this step, or it comes from overrepresented sequences, so maybe contamination, or some repetitive RNA, some RNAs from repetitive regions; this could also be the cause of the nucleotide bias.

The next plot, the per sequence GC content, relates to the mean GC content you also get from the initial statistics. On the x-axis is the GC fraction in percent, and on the y-axis the number of reads with this GC content. In blue is always the theoretical distribution, which should be normally distributed, and in red you see the actual empirical distribution of your data.
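The quantity behind this plot is simply the GC fraction computed per read. A minimal sketch (here Ns are counted in the denominator, which is one possible convention; FastQC's exact handling may differ):

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a read (Ns count toward the length)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_histogram(reads):
    """Round each read's GC percentage and tally reads per percentage bin,
    mirroring the per sequence GC content plot."""
    bins = {}
    for read in reads:
        pct = round(gc_fraction(read) * 100)
        bins[pct] = bins.get(pct, 0) + 1
    return bins
```

For a clean single-organism library, plotting `gc_histogram` would give one roughly normal peak; contamination or a metagenomic sample would show up as extra peaks, as discussed next.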
This is again a plot where you have to know a bit more about your experiment, your organism, and the protocol you have used, because you can also legitimately end up with a bimodal distribution, meaning a second bump, or more than one bump. If you do a metagenome analysis, of course, you analyze a sample with different organisms; they have different GC contents, and therefore you end up with a distribution showing more than one peak. However, if you do not anticipate something like this, the distribution should be unimodal, so only one normal distribution, one peak, and a bimodal distribution may then mean that you have a contamination in the data. A contamination is probably also indicated if the distribution is very broad. Another thing: if the distribution has a very spiky peak, this could imply that you have a very heavy adapter contamination, which of course has a specific GC content and therefore produces this big spike. So look out for this GC content plot; it tells you a bit more about library contaminations.

Then you have the per base N content. Maybe you remember, when I talked about the FASTA and FASTQ file formats, that in your sequence string, where you have A, T, G, and C, there might be another character, an N. This is because the sequencing machine does the base calling: it decides which base you have at each position and also gives you the quality score, the Phred score, for it.
However, if the signal was not strong enough, or the algorithm could not really detect what the base is and the confidence gets really, really low, then it cannot decide with any confidence anymore whether this is an A, T, G, or C, and therefore it sets a placeholder, and this placeholder character is an N, showing you that the sequencing machine could no longer determine which base sits at this position. On the x-axis you see the position in the read, and on the y-axis the fraction of the reads in your read library having an N at that position. Ideally you hope to see nothing, so everything should be at zero. If you see something like this at the ends, so very poor quality, maybe you can get rid of it again by end trimming, removing these bases at the end or at the beginning. However, if you see a lot of Ns, also in the middle of the reads, then you would combine all of the plots which we already discussed and check whether something happened with the sequencing device, or whether there was a contamination problem; then you would probably get rid of the dataset entirely.

Then you have the sequence length distribution, simply showing you the lengths of the sequences in your read library on the x-axis and the number of reads on the y-axis. Here it means that all of the reads have a length of 75 bases. It does not have to look like this: if you already anticipate a variation in the sequence lengths, because of fragmentation or because of your experimental design, then you at least hope to see a very big spike at the specific length at which you designed your read library, dropping off immediately after.
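The per base N content plot boils down to counting, for each position, the fraction of reads carrying an N there. A short sketch (assuming, for simplicity, that all reads have the same length):

```python
# Sketch of per-position N-content, mirroring the per base N content plot.
# Assumes all reads in the library share the same length.
def n_fraction_per_position(reads):
    """Return the fraction of reads that have an N at each position."""
    length = len(reads[0])
    counts = [0] * length
    for read in reads:
        for i, base in enumerate(read.upper()):
            if base == "N":
                counts[i] += 1
    return [c / len(reads) for c in counts]
```

A healthy library returns values at or near zero everywhere; nonzero values concentrated at the read ends suggest end trimming, while Ns scattered through the middle call for a closer look at the run as a whole.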
Also, when you do an end trimming or adapter trimming and you investigate the quality of the reads again to check these steps, these reads then of course have different lengths, and you will see a non-constant distribution. So this is a general check where you can verify whether the design of your read library was correct.

And then we come to the sequence duplication level plot, showing you the duplication level on the x-axis. As you may know, because of PCR you will end up with sequence duplicates: reads or sequences which are simply duplicated because of the PCR and the sequencing technology. This happens, and ideally you would see no duplicates, so a duplication level of one, but quite often you see at least a bit of a spike at two, which you can see here. The blue line is your actual distribution, and the red line is the distribution you would get after deduplication. On the y-axis is the fraction of reads with the given duplication level. FastQC computes this by simply checking how often it can find the same sequence in the read library: once, twice, three times, and so on. Again, spikes at higher duplication levels, like nine, ten, or even higher, do not have to mean the experiment was bad. In ChIP-seq or CLIP-seq, for example, I anticipate seeing something like this, and probably in RNA-seq as well. However, if you see a really big spike at the higher levels, this is an indication that you have done too many PCR cycles, and you probably have to redo the experiment. If not, so for small bumps, or maybe a few more bumps at higher levels, it is not that big of a deal; you only have to consider whether to do a deduplication or not, so whether to remove these non-unique molecules.
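The tallying FastQC does here can be sketched in two steps: count how many copies of each sequence exist, then count how many distinct sequences sit at each copy number. A minimal Python version of that idea:

```python
from collections import Counter

# Sketch of duplication-level counting: sequence -> copy number,
# then copy number -> how many distinct sequences have it.
def duplication_levels(sequences):
    copies = Counter(sequences)          # e.g. {"ACGT": 2, "TTGA": 1}
    return Counter(copies.values())      # e.g. {2: 1, 1: 1}
```

So a library of four reads where one sequence appears twice yields `{1: 2, 2: 1}`: two sequences seen once, one sequence seen twice. The big spike at level one is what you hope for; mass at high levels points toward PCR over-amplification.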
This is where, I would say, you have to decide based on the experiment again, and on whether you have enough material. In RNA-seq, for example, I quite often see that there are some duplicated reads, and if I see that I only have, let's say on human, two million reads, then I probably would not do a deduplication if I have to do a differential expression analysis, because I would reduce the data too much, or remove too much important information: in RNA-seq these reads may come from different transcribed isoforms, so I would actually remove important information and not have enough data left. For RNA-seq, if you want to do a differential expression analysis, you need at least around 10 million reads; that is a good number. In CLIP-seq, however, you quite often do a deduplication, because it is very important to know where your protein is binding on the transcriptome; the reads are generally quite short, you therefore get a high duplication level, and you need to deduplicate.

This plot here shows adapter contamination. FastQC also checks for the standard adapters of well-known kits, Illumina, Nextera, and so on, and checks whether it can find these sequences in your data. If you already removed your adapters, hopefully it will not find anything; if you really have the raw data, you will see that there is something in your data. If you have a quality breach there, you do an adapter trimming with Cutadapt or Trim Galore; there are a lot of tools which will help you with that. And I have to point out: if you do an adapter trimming, really check this plot again. Do not assume that because you applied your tool to your data, the adapter is now gone; please check it. It is sometimes really important to check, because some read libraries are a bit more complex: you have maybe
an adapter at the beginning and at the end, and sometimes a double ligation of an adapter, so you have to do a double adapter trimming. So really check whether the adapter is gone. On the x-axis is the position in the read, on the y-axis the fraction of the reads having the adapter sequence at that position. As soon as some fraction is higher than two or three percent, FastQC will show you a warning that there are probably some adapter sequences in your data, and then you have to do an adapter trimming.

This goes more or less hand in hand with the next plot, the k-mer content plot, showing you on the x-axis the position in the read and on the y-axis the fraction of your reads with a given k-mer at that position. I say hand in hand because if you have adapters in your data, you quite often also see something in the k-mer content: you have a sequence which occurs more than once in your read library, and therefore an increase in some k-mers, which is also shown here. If you removed the adapters and you still see some k-mer content, this again does not have to mean poor quality. It happens quite often in an RNA-seq experiment, for example: if you have repetitive or satellite sequences in there, then of course you see something in the k-mer content.

Okay, with that we are more or less at the end. I already talked, for every plot, a bit about what you can do to improve the quality. Filtering of sequences: if you have a quality breach in the distribution of the average or mean quality score, you can filter out low quality sequences; some mappers include this already internally, but you can also do it with Cutadapt or some other tools. If the reads are too short, so you see that your read lengths are lower than maybe 30, it may also be worth filtering them out, depending on your experiment; for this you can also use Cutadapt or Trimmomatic. With
too many N bases, you can do an end trimming or, again, filter out these sequences. Regarding GC content: if you see a GC content bias which you have not anticipated, there is a tool to correct for it; if you search for the deepTools GC content correction in Galaxy, you find a tool which you can use to correct for GC content bias, but only use it if you are sure the bias should not be in your data. There are other quality measure tools as well which you can use to remove quality breaches. As I already said at every step: cutting and trimming sequences is possible, and end trimming and adapter trimming you probably have to do.

Okay, so I hope you now know how to run the quality control. In the practical presentation I will do the hands-on part, so we go a bit deeper and do it hands-on in Galaxy. I hope you now know which quality parameters are important for next generation sequencing data, how to run FastQC, and what kind of impact it has on your quality control. With that, I thank you very much, and I hope you enjoyed this presentation, which is part of the Galaxy Training Network; you find these slides there, and you can also join the practical part. Then we see each other there. Bye bye!