an introduction to next generation sequencing. And I will show you the first steps of what we do for a typical pipeline during the practical. So perhaps I will go a little bit fast during the lecture, but don't worry: except for the introduction, almost everything we will see during the second part of the lecture we will see in detail during the practical. So today we will introduce NGS, try to understand what kind of data we are getting when we do sequencing, understand, for this module specifically, what the problems with alignment are and how to avoid some errors, and see what file formats we use when we do alignment. And, as I said, we will then run the practical.

So next generation sequencing is a major revolution in the field of biology. Because 10 to 20 years ago, when we started to do sequencing, it took whole consortia. No one was able to do it on their own; it was big consortia that did sequencing projects, and it took years and billions of dollars. Now, if you go with low coverage on this kind of small sequencer, you can have your low-coverage genome sequenced in a few hours for a few thousand dollars. So it is a major revolution, because now everybody is doing sequencing.

So what is the definition of next generation sequencing? It is really an improvement over the first generation of sequencing, based on the clone-by-clone approach, where people took each fragment, put it in a clone, and then did the sequencing in a capillary sequencer. The main issue with this type of sequencing was the number of sequences you could generate per run: you could only generate on the order of a hundred sequences per run, one per capillary, and it cost a lot.

So now, how does next generation sequencing work? Instead of using capillary sequencing, we use clusters of sequences and we use imaging. This is a representation of a part of a flow cell. When you do your sequencing, you take your DNA, you put it on the flow cell, you put your flow cell in the sequencer, and the sequencer just takes a picture of your flow cell at each cycle, at each base incorporation. That gives you a point like this, which is a cluster of identical sequences, and the sequencer is able to determine that at this position it had this base. And then, for each cycle, you know which base has been incorporated. So it is really reading, at the same time, hundreds of millions of clusters for one lane, and hundreds of millions times 8 lanes per run. So it is really another level of data throughput.

So how does it work in detail? Here I will only present the Illumina scheme, but after the practical, if you want to discuss the other types of sequencing, like PacBio or others, come to see me. So for Illumina, which is the technology most used currently: you take your DNA, double-stranded DNA, and you shear it randomly. Usually you select fragments of a given size; for Illumina, usually your fragment size is around 300 to 600 base pairs. You add two different adapters, one at each end of your DNA fragment. Then you denature your double-stranded DNA into single-stranded DNA, and you load your flow cell with the single-stranded DNA. When you load your flow cell, one of the adapters will attach to the flow cell. So what happens after that? It is what we call... I think I lost one slide somewhere. Ah, yes, like that.
So you take your fragment: there is a lawn of complementary adapters on your flow cell, and your fragment, with its two adapters, will stick to these adapters and create a bridge like that. Then the bridge is amplified to create a double-stranded molecule. This is what we call bridge amplification; it is really the basis of the Illumina technology. So when you have this bridge amplification, you have your two strands, then you denature them, and you have two copies of the sequence as single strands. And you repeat this process several times: you have two copies, then four, and you keep increasing the number of copies in order to obtain this kind of cluster of the same sequence in the two orientations. So you get this set of clusters, which is just a kind of PCR, the replication of the same molecule in the two orientations.

Then you start your sequencing. So how does it work? You provide some enzyme and some labelled nucleotides, which will stick and start to complement your molecules. All the molecules are complemented at the same time, and at each cycle a laser excites all the labels and the sequencer takes a picture. This is the kind of picture you get: all the molecules in the same cluster should flash with the same label, because it is the same sequence. And as you start from a specific adapter, you only sequence in one direction. So you take your picture and you are able to say: OK, I've got this cluster flashing yellow, and yellow for me is a G base that was incorporated, so I know that at this position on my flow cell my first base is a G. Then I take another picture at the second cycle and I say: OK, now I've got a blue flash for this cluster, so it's a C. Next cycle a T, then a G, then an A. So you are able to generate the sequence for this cluster, and you are able to generate sequences for all the other clusters at the same time. So it is really simple, and it allows us to generate millions and millions of reads at the same time.

So we talked about Illumina. What are the other players in the field? We have five major players; there are a lot of others, but five major ones. The first player was Life Technologies, which was a really huge player at the beginning of NGS, five to ten years ago. They developed the SOLiD, which was a short-read technology similar to what Illumina did, and they are now working on the Ion Torrent. But the amount of sequencing Life Technologies provides to the research community is really decreasing. The main player is currently Illumina, which is the one that provides the most sequencing to research. Both of them provide what we call short reads, so a read size of around 100 to 300 base pairs. Then we have medium-length reads, mainly produced by the Roche 454 and by the MiSeq from Illumina, where the read size is around 300 to 600 base pairs. And we have the other companies, which are more targeting long reads: Pacific Biosciences, which provides long-read instruments, and Oxford Nanopore, which provides the MinION and other instruments. Currently only the MinION is really used, and it is still in testing and development.
So, just to give you an overview of all these sequencing technologies, what is important is the read length. For example, if you take Illumina, it is around 50 to 600 base pairs; 454 is around 700 base pairs; and when you go with long reads, it is up to 50 kb for PacBio and up to 100 kb for Oxford Nanopore. What is also interesting to see is that for the short-read technologies, the advantage is a really low error rate, less than 0.1%, whereas the long-read technologies, because of the way they generate long reads, are really error-prone: they are between 13 and 15% error. The price is also something to consider: short-read technologies usually provide many more reads than the long-read technologies, so the cost per base is really lower for a short-read technology.

So what can you do with each type of machine? We have three categories of sequencers. The short reads, SOLiD and Illumina: you usually use these technologies when you do "whole" something, like whole genome, whole exome, whole transcriptome, whole genome bisulfite, so when you want to target the whole genome. The medium-read technologies were mostly used, at the beginning, to do small genome sequencing before the long-read companies arrived on the market; now people use them mostly to do amplicon sequencing, metagenomics, and validation, because that is where you do targeted sequencing with specific primers: you know that your amplicon is around 500 to 600 base pairs, so you can directly get the full sequence between your primers. The last two technologies, with the long reads, are used to do small and medium genome assemblies; they really pushed genome assembly forward and have really improved the quality of the genomes we are able to generate since they arrived on the market. We also use them for long amplicon sequencing, for some epigenomics, and also to validate structural variants: when you discover something specific, for example a translocation, you want to focus on this region to be sure that you really see the two regions of chromosomes stuck together, and you can use these long-read technologies for that.

So when you are thinking about doing next generation sequencing, there are some parameters you need to think about before choosing which technology, which machine, which company, which center, whatever. For the analysis, the four main parameters are: the size of the read, as I showed you, because the read size you need really changes depending on the aim of your study (if you want to do assembly, you go with long reads; if you want to do variant calling, you go with short reads); the type of library, so do you want single-end reads, paired-end reads, mate-pair reads, you need to think about which type of library; the error profile of the machines, because each machine has its own type of error (if you are interested, for example, in looking for indels, you won't go with technologies that generate indel-type errors, you will go with a technology that generates mainly substitution errors); and also the possibility of using barcodes, because you usually don't need a full lane for one sample, so how can I use barcodes to divide my overall set of reads into smaller sets of reads and be efficient in terms of cost?
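Just to illustrate what barcoding gives you downstream, here is a toy demultiplexing sketch in Python. The barcode sequences and sample names are invented for the example; in practice the demultiplexing is usually done by the sequencing center's software, not by hand like this.

```python
# Toy demultiplexing: split the reads of one lane back into per-sample sets
# using their barcode (index) sequence. Barcodes and sample names here are
# invented for illustration only.
BARCODES = {
    "ATCACG": "sample_A",
    "CGATGT": "sample_B",
    "TTAGGC": "sample_C",
}

def assign_sample(index_sequence):
    """Return the sample a read belongs to, or None if the barcode is unknown."""
    return BARCODES.get(index_sequence)

# A read whose index was read as "CGATGT" goes to sample_B.
print(assign_sample("CGATGT"))  # -> sample_B
print(assign_sample("GGGGGG"))  # -> None (unknown barcode)
```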
Then there are the other parameters, which are usually the only thing people want to discuss when they come to see me: the cost of the sequencing and the turnaround time. For most people these two parameters are the first ones to take into account, but for me, as an analyst and a bioinformatician, these should be the two last parameters, because it is better to do less sequencing on a small set of samples and get a better answer on that small set than to try to do everything with a small budget. A lot of people say "I need to stay within this cost", but for me that is not the most important thing. What is important is what your question is and how you can answer it.

So what are we doing with the data? In this practical I will mostly focus on DNA-seq analysis, and I will show you the main steps. When we talk about DNA-seq analysis, we can summarize the analysis like this: we start from FASTQ (I will explain what FASTQ is, but it is the output we get from the sequencer) and we want to get to the variants, to the file that gives us the variants, which is called the VCF. So the whole point is to start from the raw data from the sequencer and get to the file which gives you the variants you have in your DNA.

So the input data is the FASTQ. Usually, nowadays, most people do paired-end sequencing: they take the fragments of DNA and they sequence, for example, 100 base pairs at each end of each fragment. So for each fragment you have two sequences, and you end up with, for example, two FASTQ files, one with all the reads from one end and the other with all the reads from the other end. Your FASTQ file has this kind of format: it is a four-line format. The first line is a header, which always starts with an at sign, followed by the name of the machine, and then some positional information on the flow cell. Nobody really reads it, it is just a way to have a unique identifier for each read. Usually it finishes with either 1 or 2, which indicates whether you are on read one or read two. Then you have your sequence, like in a FASTA file. Then you have another header, which starts with a plus sign; in most technologies, like Illumina, this second header is empty, but you can put whatever you want there, and in some technologies it is just a copy of the first header with the at sign changed to a plus sign. And then you have another string, which represents the quality of each base in the sequence. This is what we call the base quality.

So the base quality is a score, but it is written as a single character. Why a character and not a number? Because we need exactly one character to describe the quality of one base. If I started writing numeric scores one after the other, like 33 then 1, I would not know whether my first base is quality 33 and my second is 1, or my first is 3 and my second is 31. So one character must represent one base. What we use is a conversion of the score into an ASCII character: each value maps to one ASCII character, and that is the representation we have here. From each ASCII character we are able to get back the numeric value, which is our base quality. The base quality is minus 10 times the log of a probability, Q = -10 log10(P), and the probability we are talking about here is the probability that the base the sequencer called, say an A or a C, is wrong.
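To make the format and the encoding concrete, here is a minimal Python sketch. The FASTQ record is made up for illustration, and it assumes the standard Phred+33 encoding used by current Illumina machines: the quality Q of a base is the ASCII code of its character minus 33, and the error probability is P = 10^(-Q/10).

```python
# A made-up FASTQ record: header, sequence, separator, quality string.
record = [
    "@MACHINE:RUN:FLOWCELL:LANE:TILE:X:Y 1",  # hypothetical read name, read 1
    "GATTTGGGGTTCAAAGCAGT",
    "+",
    "IIIIIIIIIIIIHHHHFFFC",                   # one quality character per base
]
header, sequence, separator, qualities = record

# Phred+33 encoding: Q = ASCII code of the character minus 33,
# and the probability that the base call is wrong is P = 10 ** (-Q / 10).
for base, qchar in zip(sequence, qualities):
    q = ord(qchar) - 33
    p_error = 10 ** (-q / 10)
    print(f"{base}  Q={q:2d}  P(error)={p_error:.5f}")
```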
So it is the probability that my base call is an error. As it is minus 10 times a log, the higher the number, the lower the probability of error. So when we talk about a base quality of 30, we know there is a 0.1% chance that the base given by the sequencer is wrong.

So usually, when we have our FASTQ, we generate some QC of the data, and one of the first QC steps is to look at this base quality. This kind of graph we will see later in the practical: it is a representation of all your reads, and each bar is the distribution of qualities over all your reads for one cycle. We see here that at the first cycle almost all the qualities are around 30 to 34, and we see that towards the end of the read the quality decreases. This is the normal shape of the quality distribution you see with Illumina. It is just because, as I showed you when I explained the sequencing, the signal comes from a cluster of molecules, and not all the bases are incorporated at exactly the same time. The longer you go in your sequencing, the more molecules are slightly ahead of or behind the global cycle, so the more noise you get in the signal: the sequencer has more and more trouble deciding whether it sees one clear color or a mix of two. That is why, for Illumina, you see the quality drop at the end of the sequence. It is important to take this lower quality into account and, if possible, to remove it, because if you keep bad bases at the end of your reads, you increase the number of false SNPs, the false variants you will discover later.

Yes? "The quality of the bases, is that assigned by the instrument itself?" Yes. And we will see later that we usually re-evaluate the quality, taking into account some genomic information, because the values are not totally accurate. "So do the current instruments have different formulas to assign the quality?" Yes, but it is always what we call a Phred score, the score of the probability of an error; it is how the error is estimated that differs between machines. "Once this score is given, it is comparable?" Yes, yes, it is the same Phred unit.

Another QC you can do on your data is to look at the per-position base content. It is the same representation as previously: each position here represents a base of the read, and it shows the global distribution of A, T, G and C in your reads at that position. For human, for example, you expect the proportion of each base to be roughly flat along the read. Here we have a specific pattern; it is just because, in the example I took, the data come from RNA, and for Illumina RNA data the way the fragments are generated is not fully random, so we get this kind of non-random pattern. But it is normal: the first few bases, when you do Illumina RNA-seq, will always look like that. For DNA we would have expected A and T around 30% and G and C around 20%, because for most genomes the GC content is not equal to the AT content.

Another type of QC we routinely do is to look in the data for adapters: if your fragments are shorter than the read length, you expect to read through the end of the fragment and to see the adapter arriving at the end of your reads. That is another way to check the quality of your data. We also look at the duplicates: how many times did we see the same read in the data? And another thing we do, as you will see, is that we take a few thousand reads and BLAST them. Yes, a question about the duplicated data?
"I understand that, depending on the type of sequencing, the expected percentage of duplication is not the same. But once you have done your analysis, do you remove them?" It depends on what you are doing. For RNA, no; for DNA, yes: you remove them, or you just mark the reads and then don't count them when you look for variants.

So here is what we do: we take a subset of about ten thousand reads, we BLAST them against the NT database, and we just look at the result and count the reads per species. We display the result just to be sure that, if we sequenced, for example, mouse, we see reads blasting on mouse. If we start to see reads blasting on human when we sequenced mouse, probably the person who made the libraries made a mistake, and we are able to catch this kind of error. And no, we don't BLAST all the reads: we take a random subset because blasting everything would take far too long. We don't look in detail at where each read hits; if there are multiple hits we just take the best one and its species, just to be sure we are really sequencing the right species.

So when we have checked all this quality, we usually see that at the end of the reads the quality decreases, and sometimes there are adapters in the reads. So the next step is trimming: removing all these bad bases and all these non-genomic bases. For trimming, you have your two reads; they could have lower quality at the end, and, as I said, if you sequenced a short fragment you expect to see adapter at the end of your reads. So what we do, usually, is this. First we remove the adapters: if we see adapter in the sequence, we cut it out. Then we take the read and check the quality of its bases, and we remove, starting from the end, all the bases whose quality is below a given threshold, usually 30 or 20 depending on the accuracy you want and on your project, but usually 30. And if at the end you have removed too many bases and your read becomes too short, less than 32 base pairs, we say OK, we can't use this read, it is too short, and we remove it from the analysis. To do that we will use Trimmomatic, but you have to know that there are a lot of other tools; for many of the steps we will see, there are always many tools that can do the job.

So when you have trimmed your data, you now have data with good quality in terms of bases. What you want to do next is to align your data, to position your data on your genome. So when should I use alignment, or mapping, versus when should I assemble my data? It is basically based on the reference. If you have a reference of good quality, go with alignment and mapping. Mapping means you take your read and you try to find the best location of your read in the reference sequence; it is the best location because sometimes the best is not the true location, as the reference could be missing some part covered by your read. If you think that your reference is not good enough, that there are a lot of gaps, a lot of pieces that are missing, you can choose to do assembly instead.
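Before we go further into mapping versus assembly, here is a small Python sketch of the trimming logic just described. It is only an illustration of the idea (in practice we run Trimmomatic); the adapter sequence shown is a placeholder, while the thresholds of 30 and 32 are the ones mentioned above.

```python
# Minimal illustration of the trimming logic (the real work is done by Trimmomatic).
ADAPTER = "AGATCGGAAGAGC"   # placeholder adapter prefix, for illustration only
MIN_QUALITY = 30            # trim trailing bases below this Phred score
MIN_LENGTH = 32             # discard reads shorter than this after trimming

def trim_read(sequence, qualities):
    """Return the trimmed (sequence, qualities), or None if the read is discarded."""
    # 1. If the adapter appears in the read, cut the read at that position.
    pos = sequence.find(ADAPTER)
    if pos != -1:
        sequence, qualities = sequence[:pos], qualities[:pos]

    # 2. Remove low-quality bases from the end of the read (Phred+33 encoding).
    end = len(sequence)
    while end > 0 and ord(qualities[end - 1]) - 33 < MIN_QUALITY:
        end -= 1
    sequence, qualities = sequence[:end], qualities[:end]

    # 3. Discard the read entirely if it has become too short.
    if len(sequence) < MIN_LENGTH:
        return None
    return sequence, qualities
```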
So, coming back to assembly: in that case you first try to regenerate the reference by placing the reads one after the other to build a set of contigs which recreate the reference sequence of your species. In our case we will only use a reference-based approach, but I think you have a session later where you will do assembly.

So when we talk about read mapping, what is the main issue, what is the challenge? We need to find the position of millions of reads on the genome. The genome is large: for human, it is three gigabases. And in many cases you will have reads that have many possible locations, and you cannot just look for an exact match, because otherwise you would throw away every variation you have in your genome; in most cases, when you are doing DNA sequencing, you are precisely trying to characterize this kind of variation. So you need to tolerate some differences between the read and the reference sequence, which gives you even more possible positions in the reference. We cannot use BLAST, because it is really too slow. There are many types of algorithms; the one currently most used is the Burrows-Wheeler transform, which gives the best balance between speed and accuracy, and one of the best-known aligners based on it is BWA. This is the kind of aligner used today. There are other aligners: this one, for example, is known as one of the best but is not free for academic use, so we usually don't use it; and this one, for example, is really dedicated to RNA-seq data. When you deal with RNA you have other challenges to take into account, because it is not like DNA where your sequence needs to be mapped in one block: with RNA you are dealing with exons that are spliced together, so you need to be able to split your reads into pieces and position those pieces in several blocks.

What is also important when you do your alignment: for DNA-seq you usually want whole genome sequencing. It is not totally true any more with the new HiSeq X machines, where you can get a huge amount of data from one lane, but if you are still working with a HiSeq 2500, to have a decent coverage for analysis, like 30X or 40X, you will need to sequence multiple lanes. It is important, when you have multiple lanes, to align each lane separately: first because it will be faster for you, since you can align them in parallel on a cluster, and then to be able to track where your reads come from. To do that, for each alignment we set an identity, which is called a read group, which is tagged onto each read and means "this read comes from this lane of sequencing". At the end, when you merge all your lanes of sequencing together, you are able to track back where each read came from; if you have an issue, if one lane gave bad data, you will be able to track that thanks to your RG tags, your read group tags. Another thing is that many tools you will use require you to have RG tags in your data.

So when you align your data, you generate another type of file, which is called SAM or BAM. SAM is the uncompressed format and BAM is the binary, compressed format of the same file. They are there to store the same information you had in your FASTQ, plus the information about the position of your reads.
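To make this concrete, here is a small sketch using the pysam library (assuming it is installed, and that a file named sample.bam exists; both are illustrative choices, not part of the lecture) that prints, for the first few reads, the fields described next: the read name, the flag, the chromosome and position, the mapping quality, the CIGAR string, and the read group tag.

```python
import pysam  # widely used Python wrapper around SAM/BAM files

# "sample.bam" is a placeholder file name for this sketch.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for i, read in enumerate(bam):
        if i >= 5:   # only look at the first few reads
            break
        print(
            read.query_name,       # read name, carried over from the FASTQ
            read.flag,             # numeric flag: paired, mapped, duplicate, ...
            read.reference_name,   # chromosome the read is mapped to
            read.reference_start,  # 0-based position on that chromosome
            read.mapping_quality,  # Phred-scaled mapping quality
            read.cigarstring,      # e.g. "76M" for a fully aligned 76 bp read
            read.get_tag("RG") if read.has_tag("RG") else None,  # read group
        )
```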
So you see, we have the same information as in the FASTQ: the sequence of the read and its quality. We still keep the read name, so we keep the identity of the read. We have a flag, a number which encodes information about the mapping: is it mapped, is it paired, is it properly paired, is it unmapped? The flag is encoded numerically, and we will see during the practical how to use it. We have two values to indicate the position: the chromosome where the read is mapped, and the position on that chromosome. We have the quality of the mapping, which is a Phred score like the base quality, but for the whole read. And we have a string, the CIGAR, which tells you how the read is aligned: for example, for a 76-base-pair read it may say 76M, meaning 76 matches. But it is not the same "match" as in BLAST: M does not mean the base is identical to the reference, it means the base is aligned to a position of the reference, so it can be a match or a mismatch. The CIGAR also tracks insertions, deletions, and bases that do not come from the genome. So 76M means that every base of the read is aligned to a reference position, match or mismatch, over the whole read.

So once we have done the alignment, unfortunately the alignment is not perfect, and we need to clean it. One of the first cleaning steps is indel realignment. Why do we do that? Because aligners work in a specific way: they tend to favor creating mismatches rather than opening gaps, rather than indels. So the idea is to look at specific regions, especially where you have a lot of mismatches or where you have known indels in the population, and ask: if I place an indel at this position, will the neighboring bases align the same way, better, or worse? The goal is really to evaluate whether the aligner missed a real indel. So the idea is to test a candidate indel location and see whether it increases the quality of the mapping at that specific location or not. And usually it works: you can see in this example the same set of reads before the realignment, where we have a lot of mismatches, especially these two positions that appear in many reads. They are not yet called SNPs, because we are not at the SNP calling step, but they are mismatches in my alignment. And just by adding an indel to all the reads there, most of my mismatches disappear and the quality of my alignment becomes really, really higher. So it is really important to go through this step.

Another type of refinement of your data is to look at duplicates. You can either mark duplicates or remove duplicates; at the end, for your analysis, it does not change anything which option you choose. We prefer to mark duplicates, just because we don't want to remove anything from the data, so we can come back to the initial data whenever we want. There are different types of duplicates. You can have optical duplicates, but they are really rare: it is when the bridge amplification gives you a really big cluster, and the machine tends to interpret this big cluster as two different clusters, so it gives you the same sequence twice.
But in fact, at the start, it was only one DNA fragment that was amplified. This is true for the old HiSeq flow cells, like the 2500; on the new machines, like the HiSeq X, the flow cell is different: each cluster has a predefined position, so this type of duplicate should not occur. Another type of duplicate is the PCR duplicate, which is the one you will see most often: when you do PCR, some fragments can be over-amplified, and you end up with two reads that come from the same initial fragment. Marking duplicates is really a way to see what your real initial DNA fragments were. The last type of duplicate comes from the new machines: the cluster duplicates. With the new flow cells, where all the cluster positions are predefined, if you don't load your flow cell correctly, if you don't put enough material on it, you will have some empty wells, and DNA fragments from the neighboring wells will tend to go and fill them during the amplification. That is what we call cluster duplicates. And chimeras are when you have two homologous sequences that merge together and give a kind of mixed sequence.

Another type of refinement you need to do on your data is what we call base quality recalibration. As we discussed previously, the quality assigned by the sequencer is not totally accurate. So the idea is to take the values the sequencer gave and try to estimate the real quality of each base of each read, taking into account some specific biases in the data: there are cycle biases and genomic biases, like the composition of the region around the base. If you are in a GC-rich region, your quality will be different than in an AT-rich region, because the sequencer does not behave the same depending on the genomic context. So the idea is to look at all these parameters and re-estimate the most realistic value of the base quality.

When you have done that, and really at each step of your pipeline, it is always good to collect metrics on what you have done, to be able to check that everything worked correctly. Metrics should be taken at each step, and you will find many tools that produce their own metrics, some specific to RNA, some specific to DNA, so depending on what you are doing there are many tools you can use; I will show you just a small set. The main tools we use to generate metrics and other analyses are SAMtools, Picard, and GATK. The most important things to look at are what your trimming step did, your alignment metrics, your coverage along the genome, and your insert size: does it correspond to what you did in the lab? There are a lot of metrics, but we will mainly focus on those in the practical.

So when you have done all that, what you need to do next is call your variants. I will not present that here, because it will be the topic of module five for single nucleotide variants and module six for structural variants. So, just to finish, as a kind of conclusion: you will see, if you look on Biostars and all these kinds of forums, that NGS has a lot of technologies and methods. You need to think about them when you design your experiment, because most of the time, since we are a center that offers bioinformatics services, people arrive and come to see us saying: this is what I have done.
Could you do the analysis for me and answer this question? Quite often we have to answer no, because the right things were not done at the beginning, and we tell people: come to see us before designing your experiment. There are a lot of methods and a lot of techniques to think about before you design. If you want to do the bioinformatics, if you want to do your own analysis, you will need a good understanding of what the errors are and which techniques were used, so you can take that into account in your analysis. You will also need reasonable mathematical and informatics skills, because it is not always easy: it is easy to do one sample, but when you do hundreds of samples you won't do everything by hand, so you need to be able to write your own scripts too. We also provide tools for people who do not have a high level of informatics skills, so there are resources, but to understand what is behind the scenes it is always good to understand how things work, to make sense of your data; so you will need some kind of informatics and mathematical background.

Actually, doing the analysis itself is not the most complicated part. Our group and others work on that: we develop methods, or evaluate existing methods, to know which method is better for a given task. The main limitation we have is the compute and the storage, because BAM files, for example, when you have a tumor and a paired normal, each BAM file can be 300 gigabytes of data: 300 gigabytes for the normal and 300 for the tumor, so 600 gigabytes for one sample pair. So with a number of samples you quickly reach a few terabytes, and you need to think about these resources in your budget. For computing, in Canada we are really lucky because we have a consortium called Compute Canada, which makes high-performance computing resources freely available to all academic people: you can register on their site and they will allow you to use their clusters. It is really a chance we have in Canada. So that is most of it. Now we can have fun and do the real job.