So, quality control. This presentation will mainly be about quality control of raw sequencing reads, and after that, because you will use it during the exercises, we will also briefly talk about databases where you can store read data. So why would you actually perform quality control? One major reason is that if you generate sequencing reads, you want to know the base quality. The base quality is a proxy for accuracy: how well the base caller actually knew that the base it called, the A, T, C, or G, was really that base. So it tells you something about the error rate, obviously. A second one would be: what is the read length? For Illumina sequencing you usually know that beforehand, because it is a setting of the machine. But for long read sequencing, for example, you do not really know that beforehand; it depends on the library prep and how well the sequencing run went. So there it is a very important quality measure, especially if you are interested in long reads, since you hope to get long reads, obviously. Another one could be: are there adapters in my sequences? Barcodes do not occur very frequently, but adapters do. Sometimes it happens, and you want to figure out whether that is the case and remove them if you can. Are there overrepresented sequences, so sequences that occur more frequently than you would expect? If you do whole genome sequencing, you usually do not expect a lot of overrepresented sequences. But if you do, for example, RNA-seq, you expect that some sequences occur more frequently than others, just because you have differences in gene expression. So those are four relatively important things you usually check when you look at Illumina data, but also at long read data. And how would you get that information? Well, luckily, there is a lot of software written that can help you do that. One would be the manufacturer's software.
All of the sequencing technologies provide some level of quality control with their machine. That is the case for Illumina, PacBio, and Oxford Nanopore Technologies, and especially for the long read sequencing methods it is recommended to have a look at those, because their quality measures can be quite specific. For Illumina, by far the most frequently used quality control software would be FastQC. That is also the software we will be playing around with during the exercises. In addition to the manufacturer's software, for Oxford Nanopore Technologies, pycoQC might be interesting. There is really a lot of nice visualization in pycoQC, for example about the number of active pores and their activity. NanoPlot is a nice general quality control software for both Oxford Nanopore Technologies and PacBio; it is relatively lightweight and easy to use. For a first glance at your data, that is very nice software. And as methods evolve, software evolves too, so there are quite a lot of different tools on the market, but these are, I think, the most frequently used ones. So when you have done sequencing, often through a sequencing facility at your university or your department, for example, what you usually get is a FASTQ file. For PacBio it can also be a BAM file, but probably the same information is stored in that BAM file as in a FASTQ file. So what is a FASTQ file? It is little more than a FASTA file. A FASTA file contains sequences, so you could store the actual called bases in a FASTA file, but together with those sequences you also want to store the base quality, so the accuracy of the base calling, and therefore it is called FASTQ. And this base quality is minus 10 times the log10 of the probability that a base is wrong. So if you have a low probability that the base is wrong, you have high accuracy, and therefore a high base quality.
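To make the FASTQ layout concrete: each record takes four lines (a header starting with `@`, the sequence, a `+` separator, and a quality string with one ASCII character per base). A minimal parsing sketch, assuming the plain four-line layout with no wrapped sequences and the standard Sanger/modern Illumina ASCII offset of 33:

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, phred_scores) records from a FASTQ stream.
    Assumes the plain 4-lines-per-record layout."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()                      # the '+' separator line
        qual = handle.readline().strip()
        # Quality characters encode Phred scores with an ASCII offset of 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]

example = "@read1\nACGT\n+\nIIII\n"
for name, seq, quals in read_fastq(io.StringIO(example)):
    print(name, seq, quals)   # read1 ACGT [40, 40, 40, 40]
```

Real-world parsers (e.g. in Biopython) handle more edge cases, but the four-line structure is the core of the format.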
So if you have a base quality of 20, you have a probability that the base is wrong of 0.01, so an accuracy of 99%. If you have a base quality of three, then your probability that the base is wrong is very high, and therefore you have a high error rate and a very low accuracy. In the graph below, you see how the error and the accuracy are related to the Phred score; this base quality score is called the Phred score. So as the Phred score increases, the accuracy goes up and the error goes down, obviously. What you are always looking for is a high base quality. This base quality is always represented as an integer. In principle you could also calculate it as a decimal number, but usually it is represented as an integer. This is a very typical plot that you first see when you look at a quality report in FastQC, and what this image depicts is the distribution of base quality along the read, for all your reads together. So for all your reads together, at the first base, there is a distribution of base quality between 30 and 34. We have seen that a base quality of 20 means 99% accuracy; a base quality of 30 is an even higher accuracy. So we are really looking at very high accuracy at the beginning of the read, on average. If we go further along the read, these reads are probably 250 base pairs, so if we are over here, let's say at 160 base pairs, there is a very wide distribution. There are some reads with high base qualities, but there are also many reads with very low base quality that are in this red zone. That means that the base quality, or accuracy, is below 99%, and therefore very high error rates are to be expected. So this is not a great quality of your reads. Banshee has a question. Yes, sorry. So this is a plot from Illumina, right? Yes, these would be Illumina reads.
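The relation between the Phred score and the error probability described above can be written out directly; this is just the formula from the slide, nothing FastQC-specific:

```python
import math

def error_prob(q):
    """Probability that the called base is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def phred(p):
    """Phred quality from an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

print(error_prob(20))        # 0.01  -> 99% accuracy
print(error_prob(30))        # 0.001 -> 99.9% accuracy
print(round(phred(0.01)))    # 20
```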
And is it from one particular sequencing run, or is it an average of many runs? Because technically we should have something in the range of around 150 to 300, which should be better, but here around 150 base pairs the quality is getting low. Yeah, so this is an example of a single sequencing lane. And here the sequencing apparently didn't go very well, because we get very low base qualities towards the end. And it also depends on the cluster density, right? Yes, it definitely depends on cluster density, because if you have a lot of overlapping clusters, the base calling becomes challenging. It can also, I guess, relate a bit to the quality of the DNA, for example; there can be many, many things that cause this. Usually the sequencing facility you are working with will try to make sure that you get high base qualities in the end. They do all the quality measures, and they have the experience to make sure that the sequencing goes well, so that you usually have high base qualities. But sometimes it can just happen that the sequencing didn't go well, or there were some other problems, and you end up with plots like this. And then it is of course very important to at least try to correct for this in the data you have, or maybe to redo the sequencing. Got it. Thank you. So if you would just align those reads... this is a visualization of an alignment; we will go into that a little bit deeper tomorrow. But if you would align those reads, this is paired end, so we align from five prime to three prime. The little line over here means the part of the fragment that is not sequenced. And then we have read number two over here. And we see a lot of colors, and a color means that there is a mismatch with the reference. So these are the reads we just saw, aligned over here without any trimming.
And you see really a lot of mismatches with the reference towards the three prime end of the read, many errors towards the three prime end. And that is just because the base quality there is very low. If you want to do, for example, variant analysis, that will of course interfere with the variant calling. So if it is as bad as we see over here, we definitely want to do some quality trimming. Another thing that can occur, irrespective of the base quality, is that you are sequencing adapters. These adapters are added to the fragment in order to, for example, start the sequencing. And what happens is that if the fragment is shorter than the read length, as over here... do you see my mouse? Okay, good. So let's say this would be the ideal case, where your fragment is longer than your reads. Then you have this part that you do not sequence, these parts that are actually reads, and we have the adapters over here, so no problems there. If the fragment is shorter than your read, then the read covers the whole fragment and then reads into the adapter on the other side. And then you are sequencing adapter. In principle that is not a huge issue. It is of course not very nice, because you are, for example, sequencing the same fragment twice, both forward and reverse, which is a bit of redundant data, but in principle not a huge issue. However, you do have adapter sequence in your read, and your adapter sequence probably does not occur in your genome; it is an artificial sequence, right? So it will result in issues with alignment. Therefore, what we usually do if we see adapter sequences (and usually you see the percentage of adapters increasing towards the three prime end of a read, which makes sense, because the further towards the three prime end of the read, the more likely it is that you are actually sequencing adapter):
we try to get rid of them, because they will result in alignment issues, and that is relatively straightforward to do. So the solution for both of these issues would be trimming: both the low base quality at the three prime end and the adapters at the three prime end. What you do with trimming is try to find and remove regions of reads with low base quality, as we saw in the first example, and adapter sequences, as we saw in the second example. Nowadays, you also try to remove poly-G sequences. The more modern Illumina machines use a two-channel system; it is a bit technical and I am not going to explain the entire thing right now. But systems that use two channels often produce poly-G sequences when there have been issues with the base calling. Basically, those bases should be an N, but they are represented as a G, and they are wrong, so you should ignore them. Therefore you should also remove poly-G sequences, for example if you have been sequencing with a NovaSeq 6000, one of the highest throughput Illumina machines. As for software to do that, there is a lot of different software. The most frequently used, I think, are Cutadapt and Trimmomatic. In the exercises, we will be using Cutadapt, mainly because the syntax, I think, is a little bit easier to understand than Trimmomatic's. What we will do is specify the adapter sequences and the minimum base quality we want to keep, in order to improve the quality of our FASTQ files. There are a lot of other things that can go wrong during sequencing and show up in quality control reports. We cannot cover all of them here, because they are very diverse. But there are some very nice articles on sequencing problems and quality control problems at sequencing.qcfail.com. So if you are running into an issue and you are not really sure why it is there, it is quite likely that there is an article on this website about it, very nicely explained with examples.
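Neither Cutadapt nor Trimmomatic works this naively (they allow mismatches, partial adapter matches at the read end, and smarter quality algorithms), but a toy sketch of three prime adapter clipping plus quality trimming makes the idea concrete. The adapter string below is just the start of the Illumina universal adapter, used as an example:

```python
def trim_read(seq, quals, adapter, min_q=20):
    """Toy trimmer: clip everything from an exact adapter match onward,
    then trim 3' bases whose Phred quality is below min_q."""
    idx = seq.find(adapter)   # exact match only; real tools tolerate errors
    if idx != -1:
        seq, quals = seq[:idx], quals[:idx]
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Read that runs into the adapter on the 3' side:
seq, quals = trim_read("ACGTACGTAGATCGG", [35] * 15, adapter="AGATCGG")
print(seq)   # ACGTACGT

# Read with a low-quality 3' tail:
seq, quals = trim_read("ACGTACGT", [35, 35, 35, 35, 10, 5, 3, 2], adapter="AGATCGG")
print(seq)   # ACGT
```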
That is about it for quality control, for now at least; we will go deeper into it during the exercises. Are there any questions so far regarding quality control? If not, then we continue to the databases. The databases we are talking about here are databases to store raw sequencing data, and there are three big ones: the DNA Data Bank of Japan (DDBJ), NCBI, which is the American one, and the ENA, which is the European database. Let's see, there are some questions in the chat. Okay, it sounds like that's good. The nice thing is that these are huge databases that can really store insane amounts of raw sequencing data, and they all have a very similar format. That format is decided by the International Nucleotide Sequence Database Collaboration, the collaboration between these three big databases, and through any of the three you can get all the data that is in either of them. So it is a very powerful system. Gabriela, you had a question? Yes, I have a question. I am sorry, I posted the question in Slack, but I guess it is difficult for you to catch up. Regarding quality control and the programs to remove adapters: I know that some companies can deliver the FASTQ files without the adapters. In that case, would you still recommend Cutadapt to do the quality control, to trim the sequences with low base quality? Not necessarily. In general, if you do not see any adapters in your quality control report, it does not make much sense to do adapter trimming; you can just run it again to be sure that all adapters that could be found are gone. As for trimming for base quality: a lot of the software you will be using downstream, for example in your alignment, will take base quality into account.
So for example, if you do variant analysis and you are calling a certain SNP with, say, GATK or FreeBayes, they take this base quality into account. If they see a read aligning with a base quality of five, they take that uncertainty into account in a model. However, if you have very low base qualities, as in the problematic example I showed in the presentation, then it is always a good idea to do at least some basic trimming, because a base quality below 10 might even mess up the alignment a bit. Then trimming for base quality really makes sense. If you have a nice report, so a nice base quality distribution over your entire read, then it might not even make sense to do trimming for base quality at all. Yeah, and for me it is also... I am not sure when I should do it or not, because in the FastQC report we have all these checks marked with a green check mark, a red cross, or a yellow warning. So for example, if we get all green check marks, does that mean we do not need to do any trimming beforehand? Yeah, so what FastQC does is just use a range for each quality measure and then say: okay, this is green, this is orange, this is red. But it depends a little bit on your application whether that also holds for you. For example, if you do RNA sequencing, then in the overrepresented sequences you almost always see a red cross, but that is actually something you would expect from RNA sequencing. But for adapter sequences, if you see a green check, you probably do not have to trim for adapter sequences. Also for base quality: if you see a green check there, you probably do not have to trim for base quality either. And what about the size of the reads? Because I know that sometimes you can get shorter reads than expected.
So isn't it good to remove those reads as well, or do you wait until the variant calling analysis, where they are just not taken into account? Yeah, so in principle, by default, the Illumina sequencer will, as far as I know, always provide you with reads of the same length as you have specified in the settings of the machine. Meaning, if you specified in the settings of the machine that you are going to sequence 150 base pairs, all of your reads are expected to be 150 base pairs long. They might have very low base quality, they might have adapters in there, but most likely they are all 150 base pairs long, unless your fragment was so short that you also sequence into the adapters; then you might expect shorter sequences, but that is quite unlikely if there was some QC during the library preparation. So in your case, shorter reads were probably caused by the adapter trimming. What you tend to do is set a threshold on minimum read length, because very short reads, for example 10 base pairs or even shorter, are very difficult to align, to map. However, usually the mapping software again takes this into account: by default, such reads get a lower mapping quality, and the mapping quality will also be taken into account by downstream analysis software. And one more question: what does a PCR-free library do, exactly? Because many companies offer that, and they always sell it as: your sequences will have higher quality, or purer reads, but I am not sure about that. What exactly does it mean? Yeah, so PCR does basically two things. You can get a fragment bias: some fragments get amplified more efficiently than others. That can cause a non-equal distribution over your genome, for example depending on GC content.
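The minimum-read-length threshold mentioned above is simple to sketch; the 30 base pair cutoff here is an arbitrary illustrative value, not a recommendation:

```python
def filter_by_length(reads, min_len=30):
    """Drop reads shorter than min_len, e.g. reads that became very
    short after adapter and quality trimming. Each read is a
    (name, sequence) pair."""
    return [(name, seq) for name, seq in reads if len(seq) >= min_len]

reads = [("read1", "A" * 150), ("read2", "A" * 12)]
kept = filter_by_length(reads)
print([name for name, _ in kept])   # ['read1']
```

In practice you would let the trimmer do this (e.g. Cutadapt's minimum-length option) rather than filter yourself.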
And secondly, you can have reads coming from the same initial fragment copy; we call them PCR duplicates. Depending on your application, those can violate the assumptions you make when you do, for example, variant calling. So if you go PCR-free, you usually need quite a high input, because you need to start with a certain amount of library, but then you should not have these downsides of PCR. Okay, great. Thank you so much. You are welcome. Sheena? Yes, sorry, a question related to the trimming of the adapter: do you need to know its sequence? Yes. So is that something that the sequencing company usually provides, or something you look into because there is a high abundance of it and you realize it is not part of your data? You can always ask your sequencing platform, but sequencing adapters are very universal; for Illumina, they are almost always the same. And the nice thing, by the way, about the plot I just showed is that it actually tells you what kind of adapters it finds, and the number of adapter types is limited. So what you see over here is that the color depicts the type of adapter found, and the Illumina universal adapter, which is kind of the standard adapter, let's say, we found quite frequently in this dataset. Then you know: okay, I need the Illumina universal adapter. The only thing you still need is the adapter sequence. You can just Google that; in the exercises, you will actually use the sequence, you can see there what it is, and you could also do that in your own work. Great. Thanks. All right. Good. Any other questions regarding quality control? No? Then I go on to the databases. So, databases. Lots of raw sequencing data is generated, and it grows exponentially. That data can be stored in any of those three databases.
Their structure is the same for all three, and it can be a bit challenging to understand at first, but it makes sense, I would say. It all starts with a project and samples. A project would be, well, the project you are working on. And a sample would be a biological sample that you are taking; for example, you take a lung sample from a mouse, and that would be a sample. You can register that in all three of those databases. Usually you register it, of course, with only one, but it can be registered in the same way in all three. If you combine a sample with a project, that gives an experiment, where you are actually going to do something with that sample. And if that something is sequencing, then you get a run ID. A run ID is typically the output of a single sequencing lane. Those run IDs usually contain the FASTQ files that you can directly download from, for example, the Sequence Read Archive; the data can also be stored in a BAM file, but typically it is a FASTQ file. There are command line tools that you can use to actually retrieve that data. They are quite efficient and quite nice, I would say. Because we are talking about very big data files, downloading is not always a trivial thing to do; you want to be sure you have downloaded the entire data files. So what you usually do is first fetch your data with prefetch, which downloads your data in an SRA-specific format, so you cannot really use it yet, and then you convert that to, for example, a FASTQ file. To retrieve sequences other than raw sequencing reads, for example assemblies or genes, you can use a very nice command line tool called Entrez Direct. You can search and retrieve sequence data with it using esearch and efetch. We will actually use Entrez Direct in the exercises; you will not learn it in depth, but I will show you an example.
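The two-step SRA retrieval described above would look roughly like the commands below. `SRR000001` is a placeholder accession, and the exact `fasterq-dump` options can vary between SRA Toolkit versions; this sketch just builds the command strings rather than running the tools:

```python
import shlex

def sra_download_commands(accession):
    """Build the typical two-step SRA retrieval: prefetch downloads the
    .sra archive, fasterq-dump then converts it to FASTQ files."""
    return [
        shlex.join(["prefetch", accession]),
        shlex.join(["fasterq-dump", accession]),
    ]

for cmd in sra_download_commands("SRR000001"):   # placeholder accession
    print(cmd)
```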
And if you're interested in using either SRA or Entrez Direct, there's a lot of information online.