Okay, so module two is about the data. You have seen with Mike what you will do at the end of the day, when you have received the data, generated your variant calls, and want to get to the clinical aspect of the work. To really understand what you will have at the end of the day, it's important to understand how the data is generated and processed. That's what we will see in this module and in the lab afterwards.

So what are the objectives? I will give you a short introduction to the type of resources we use to do the computing on genomics data. You will learn the kind of data we use in genomics and how we analyze it, how we map the reads onto the genome, and the types of errors you can face in this kind of analysis. You will also learn some terminology and formats. And in the lab, we'll do the first steps of data processing.

So first, a small introduction to high-performance computing. When we work in genomics, we cannot use a traditional computer like your own laptop or personal computer. What we need are high-performance computers, called clusters, which are really pools of hundreds or thousands of computers linked together that you can use for your analysis. Why do we need that? Because the data are really big, and the throughput of data generated all over the world, and especially here in Canada, increases every year. Building such a high-performance computer is really costly, but you don't need to do it yourself, because there is an initiative in Canada, Compute Canada, which has built high-performance computing servers all over the country. In every province you can find at least one server, and you can use the one in your province or one in another province. The idea behind Compute Canada and these servers is that you ask for an account, and if you are a Canadian academic or researcher, you get free access to these resources. You ask for an account, they give you access, and then you get an allocation every year. An allocation means compute time, so a number of cores you can use for a given amount of time, plus a storage space. Each time you log in and run jobs, you use your compute time allocation, and each time you generate a file and store it there, you use your storage space. It's really that simple. How do you get an account? I won't go into detail, but you apply through Compute Canada and then through the regional consortium of your province, for a specific account on a specific cluster. Once you have your account, you can log in and do your analysis. What is important when you do your analysis, because usually you want your results as fast as possible, is the queuing time. A lot of people use this shared resource, so sitting in the queue can take more of your time than the actual analysis. There are some parameters you need to play with when you set up your jobs: the length of the job, the number of CPUs you need, and how much of your allocation you have already used. These are the parameters you can tune to try to avoid spending too much time in the queue. How busy the server is, you cannot control; if the server is almost full, you will have to wait.
So we have these resources available through Compute Canada, and we will use them during the practical after my talk. But when you do genomic analysis, you don't only want the hardware; you also want access to all the software people are using and all the other resources. So, in partnership with Compute Canada, the C3G has built a system based on CVMFS, the CERN Virtual Machine File System, which is a way to provide the same set of resources to every HPC site. The idea is that we take care of installing and maintaining the resources in one location, and then every site has the same resources available. So whether you use a cluster in Ontario, in British Columbia, or in Quebec, you will have access to the same software and the same data sets in terms of genome references and so on. We do that through an initiative called GenPipes, where we provide the bioinformatics tools, more than 90 different tools for genomics, and genomic resources, 20 genome builds for 14 species. We also provide ready-to-use analysis pipelines; what we'll do today is the beginning of one of them, the DNA-seq pipeline, but we have pipelines for cancer, RNA-seq, ChIP-seq, methyl-seq, metagenomics, so all these kinds of analyses are already set up for you if you want to use them on your data. It's a really big effort we have done in partnership with Compute Canada, and it's free, so feel free to use it. If you are not Canadian, you may not be able to use it; you could consider becoming Canadian, but it's not so easy. More seriously, everything about high-performance computing also applies to other types of resources, like cloud computing. So if you're not Canadian and you want to use Amazon, the analysis can be done the same way, and we are now working on a containerized version of GenPipes that you can use directly on that type of resource, with access to the same software and genomic resources.

OK, so now we have a better idea of where we do this analysis. What are the data and the analyses in genomics, specifically when you want to generate variants? So, what are the data? Most of you should know Sanger sequencing, the traditional sequencing method. Here we talk about high-throughput data, next-generation sequencing data, which is not clone-based as before, but cluster-based. The main difference is that instead of generating hundreds of sequences at a time, we now generate hundreds of millions of sequences at a time, so it's another scale of analysis. All of this is based on what we call solid-phase bridge amplification. When you do your library prep, you take your DNA, shear it into small pieces of a length you have chosen, put adapters at the two ends of each fragment, and load that onto the flow cell of the sequencer; then the amplification starts. The adapter at one end of the fragment hybridizes to a probe on the flow cell, the other end bends over and hybridizes to a second probe, creating a bridge, and then your sequence is amplified, so you have a double-stranded structure after the first amplification.
Then it is denatured back into single strands, and the same process is repeated again and again to generate a whole cluster of copies of your initial fragment. So at the end, the flow cell shows this type of profile, with a lot of clusters, each one representing one initial fragment you put on the flow cell. Sequencing then starts from the top of the sequence, and at each cycle, each fragment incorporates one base labeled with a fluorescent dye. At each cycle, the machine takes a picture like this one (it's a small part of the picture), and each cluster generates a dot on the picture, which corresponds to one base. So for the first cycle you have one picture, another for the second, third, fourth, fifth, and you can reconstruct the corresponding sequence of each fragment. If we take this cluster, we have yellow, blue, green, yellow, red, and it reads GCTGA. OK, everybody understands that? We do that for every cluster at once, hundreds of millions of clusters at a time, and we can read each base of each fragment for all the DNA fragments that were put on the flow cell. That's what the sequencer outputs. Based on that, we take the data and process it to generate variant calls, that is, to find what is specific to each of your samples.

OK, so in this module we'll only see how to go from the data we get off the sequencer to what we call a BAM file: an alignment file, a high-quality file that you can use to do variant calling. This is the full pipeline, and in this module we'll just focus on that part, which can be represented like this: we start from the FASTQ files produced by the sequencer and we want to generate alignment files ready for variant calling. Variant calling, and what we do with the variants, will be presented this afternoon in module three.

So when I talk about a FASTQ file, it's what you usually receive from your sequencing center. You will receive either one or two files per sample, depending on the design of your sequencing, but whatever the number of files, the format is the same. For each sequence, you have four lines. The first one is the header, which holds the name of the sequence and some positional information. It is followed by the line with the sequence itself. Then you have a second header; depending on which technology you are using, it repeats the name or is just a plus sign (in Illumina data, a plus sign). And then you have a second string, which is the base quality string: a score encoded as ASCII characters that tells you the quality of each base call in the sequence. How it works: you take each character, translate it into a number using the ASCII code, and this number gives you a probability of error. That's what we call the base quality. The number you get is a Phred score, which means minus 10 times the log base 10 of a probability, and in the case of base quality, it's the probability that the base was called wrongly. So, for example, with a base quality of 20, there is a 1% chance that the base called at that position is wrong. The higher the score, the better the base call and the more you can trust it.
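To make the encoding concrete, here is a minimal Python sketch of the Phred decoding just described. It assumes the standard Phred+33 ASCII offset used by current Illumina FASTQ files (older pipelines used Phred+64), and the example quality string is made up.

```python
# Minimal sketch: decoding FASTQ base qualities.
# Assumes Phred+33 encoding (Illumina 1.8+); older data may use Phred+64.

def phred_scores(quality_line):
    """Convert the ASCII quality string of a FASTQ record to Phred scores."""
    return [ord(ch) - 33 for ch in quality_line]

def error_probability(q):
    """Phred score Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

quality_line = "II?5+"  # made-up ASCII-encoded qualities
for q in phred_scores(quality_line):
    print(q, error_probability(q))
# A Phred score of 20 gives an error probability of 0.01,
# i.e. a 1% chance that the base call is wrong.
```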
So when we receive data, we do a lot of QC. One of the most important QC steps is to look at the base quality along your reads. This is the type of graph we generate, where each box represents one cycle, so all the first bases of all the reads of that run, with the distribution of base quality for that cycle; then the second cycle, the third cycle, and so on until the end of the reads. It's really important to look at this, because having high quality matters: low quality will bias the mapping, the SNP calling, and all the subsequent analyses you will do. So when we have this data, we look at the FASTQ, and we usually see this pattern of decreasing quality towards the end of the read; it's a normal pattern. So what we do is trim the data. By trimming we mean filtering out reads, or parts of reads, that we don't want to keep in the analysis. One part concerns adapters: if a fragment is shorter than the read length, we sequence through the fragment and then start to sequence the adapter, the technical sequence added for the sequencing. It's not genomic sequence, so you don't want it in your analysis, and if we find adapter sequence, we remove it from the reads. Also, as I said, if the quality is not good, we start from the end of the read and check the quality against a user-defined threshold; bases below this threshold are removed. At the end, you get reads of different sizes, and we only keep reads longer than a given length, because very short reads are not really informative and may create bias in your analysis. That's the trimming step.
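To illustrate the logic, here is a toy Python sketch of 3'-end quality trimming. It is not the tool we will use in the lab; real trimmers (Trimmomatic, for example) use more robust schemes such as sliding-window averages, and also scan for adapter sequences.

```python
# Toy sketch of 3'-end quality trimming: remove trailing low-quality bases,
# then discard reads that end up shorter than a minimum length.

def trim_read(seq, quals, min_quality=20, min_length=32):
    """Return (seq, quals) trimmed from the 3' end, or None if too short."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None  # too short to be informative, drop it
    return seq[:end], quals[:end]

seq = "ACGTACGTACGT"
quals = [38, 38, 37, 36, 35, 34, 30, 28, 25, 12, 8, 2]  # typical 3' decay
print(trim_read(seq, quals, min_quality=20, min_length=5))
# -> ('ACGTACGTA', [38, 38, 37, 36, 35, 34, 30, 28, 25])
```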
After the trimming, we can consider that we have the highest-quality sequence we can get. Now we want to do the alignment. Alignment means finding the best location of your sequence on the reference genome. I say best because, as the reference is not complete and may not be exactly your sequence, you really want the best location. It sounds easy, but it's not, because first you need to do it for millions of sequences, and you need to find a location in a space of billions of letters. You cannot use a traditional algorithm like BLAST; otherwise you launch your analysis, come back three months later, and maybe it has finished, maybe not. We want something fast and accurate. Another issue is that some sequences have many possible locations, so you need to find which one is the real one, or the best one. And you cannot just ask for perfect matches, because the whole point of the analysis is to find what is specific to your sample: you want to tolerate inexact matches to catch the biological variation in your specific sample. But when you do that, you also let errors and technical artifacts come into play, so you do inexact matching, but afterwards you will need to correct for possible errors. To do that, we use an algorithm called the Burrows-Wheeler Transform. I will not explain it today, but it's the one used by most mappers, and we will use the BWA mapper, which is one of the best mappers available for DNA-seq. If you use other types of data, RNA-seq or ChIP-seq, you should probably use another mapper.

When we do the mapping, there is something really important to do. With DNA, it happens quite often that we do several runs of sequencing for one sample. In that case, each run of sequencing should be aligned individually and then everything merged together. And when you do that, you need to put a tag, a read group, on each run of sequencing, to be able to identify your reads at the end. Because if you see something particular in your sample at the end of your analysis, you will want to go back and check whether this pattern is shared across all your runs or happened in only one of them, in which case it would be a batch effect or something like that. So it's really important for deciphering your signal, but also because a lot of the tools we use require these tags on the data.

When we do the alignment, we generate a file, either a SAM or a BAM. The SAM is a text file; the BAM is a binary compressed file. Most tools now use only BAMs, because they take much less space on disk. What's the format of this file? For each sequence, you have one line this time. The first column is the name of your read, the same as in the FASTQ. Then you have a flag that encodes information about how the fragment has been mapped: relative to its mate, relative to the strand, all that kind of information; we will see it in more detail during the practical. Then you have the position where the sequence has been located, so the chromosome and the position on the chromosome, and the score of your alignment, a Phred score that tells you the quality of the alignment. You have a CIGAR string, which tells you how your fragment has been matched to the reference, a kind of physical description of the matching. Then, if you have paired-end data, so both ends of the same fragment have been sequenced, you have the mate information in terms of position, and the insert size. Then you have the sequence in bases and the sequence in base qualities. And usually you get extra tags that I will not describe here, because they are really aligner-dependent: each aligner generates its own tags.
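Here is a small Python sketch of those mandatory SAM columns as they look when parsed. The record itself is invented, and in a real analysis you would use a library such as pysam rather than splitting lines by hand.

```python
import re

# Invented SAM line with the 11 mandatory columns, tab-separated.
sam_line = "\t".join([
    "read_001",  # QNAME: read name, same as in the FASTQ
    "99",        # FLAG: bitwise flag (paired, proper pair, mate reversed, first in pair)
    "chr1",      # RNAME: chromosome
    "10468",     # POS: 1-based leftmost mapping position
    "60",        # MAPQ: mapping quality, Phred-scaled
    "4M2S",      # CIGAR: 4 bases aligned, 2 bases soft-clipped
    "=",         # RNEXT: mate's chromosome ('=' means same as RNAME)
    "10630",     # PNEXT: mate's position
    "168",       # TLEN: observed insert size
    "ACGTAA",    # SEQ: the bases
    "IIII#!",    # QUAL: the base qualities, Phred+33
])

fields = sam_line.split("\t")
flag = int(fields[1])
# Two useful bits defined by the SAM specification:
print("reverse strand:", bool(flag & 0x10))
print("marked duplicate:", bool(flag & 0x400))

# The CIGAR string is a list of (length, operation) pairs; 'S' is a soft clip,
# i.e. read bases the aligner set aside (adapter, low quality) without using them.
ops = re.findall(r"(\d+)([MIDNSHP=X])", fields[5])
print("soft-clipped bases:", sum(int(n) for n, op in ops if op == "S"))
```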
So now we have generated the alignment. But alignment is a complex process, and to do it the aligner has to choose parameters, penalties and so on. No aligner does a perfect job. So we need to take this alignment and refine it, to get a high-quality alignment file before doing the variant calling. The first refinement is indel realignment. Why do we need it? Because most aligners tend to favor creating mismatches between the reference sequence and the read sequence instead of creating gaps. So in a region with indels, you quite often see this kind of pattern, where mismatches accumulate on your reads in really close proximity. And we know we don't expect variants that often in the genome: on average around one variant every kilobase. Depending on the region, it could be less or more, but you don't expect to see five or six variants in 100 bases. So what indel realignment does is look at your reads on the genome and, in a first pass, identify regions where there may be an indel that has not been called. Then it goes back to those regions, takes all the reads mapped there, and tries to insert an indel to see whether that improves the mapping. If it doesn't improve the mapping, the reads are left as they are; if the indel improves it, the reads are realigned. In this case, in this region, you see that adding the indel makes almost all the variants disappear. So all these variants were probably technical artifacts, false positives we would have called without the indel realignment.

Another type of refinement we want to do on the data is to mark or remove duplicates. So what are duplicates? When you do your sequencing, you have your DNA fragments at the beginning, and your goal is to get one sequence per initial DNA fragment generated at library prep. But due to the way libraries are generated, you can get several sequences that represent the same initial fragment, and you don't want to count that fragment several times. Because if there is an error at the beginning, you will see this error pop up in several reads and think, oh, it's probably a variant; no, it's just one error that you are seeing several times. So you really want to keep only one copy. Where do these duplicates come from? On old Illumina flow cells, a really large cluster could be read as two, three, four, five clusters, generating several copies of the same data. On the new Illumina flow cells, where all the cluster positions are already defined, if you have an empty well, a sequence can jump from one well to the next. They can come from PCR: if you do a lot of PCR cycles, you can generate many copies of your fragments, which is why many of the library kits we use now are PCR-free, removing the PCR steps in order to reduce the duplicates. Or they can be chimeric sequences, generated when two sequences in the flow cell merge together and propagate into a free cluster. There are different ways to detect duplicates, and we'll look at two different approaches during the practical. It's really important to deal with them because, as I said, they can create false positives in your data.
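To give the flavor of coordinate-based duplicate detection, here is a deliberately simplified Python sketch with invented reads. Real tools (Picard MarkDuplicates, for instance) use both ends of the fragment, clipping-corrected positions and library information, and distinguish optical from PCR duplicates; this only shows the grouping idea.

```python
# Simplified duplicate marking: reads whose fragments start at the same
# position on the same strand are treated as copies of one original fragment;
# the copy with the best summed base quality is kept, the others are marked.
from collections import defaultdict

# (read_name, chrom, pos, strand, summed_base_quality) -- invented records
reads = [
    ("r1", "chr1", 10468, "+", 3500),
    ("r2", "chr1", 10468, "+", 3650),  # same start/strand as r1: same fragment
    ("r3", "chr1", 20013, "-", 3400),
]

by_fragment = defaultdict(list)
for read in reads:
    _, chrom, pos, strand, _ = read
    by_fragment[(chrom, pos, strand)].append(read)

duplicates = set()
for group in by_fragment.values():
    group.sort(key=lambda r: r[4], reverse=True)  # best quality first
    for read in group[1:]:
        duplicates.add(read[0])  # mark all but the best copy

print(duplicates)  # -> {'r1'}
```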
As I said previously, base quality is really important when we do variant calling. Variant calling is usually based on a Bayesian approach that takes different types of parameters into the model, and one of those parameters is the base quality of the sequence at that location. So you want the base qualities to be as real and accurate as possible. And it has been shown that the base qualities generated by the machines tend to be inflated; the vendors like to say, oh, we are good. On top of that, these machines suffer from specific biases related to the genomic context: for example, which bases you sequence one after the other, or the position of the base within the read. So the idea of base quality recalibration is to take all your alignments, look at these known patterns of error, model them, and then correct for them, to get a better representation of your qualities. When we have done all that, we have variant-calling-ready alignments, so we are ready to do variant calling.

But what is also important all along the process is to generate metrics. Metrics are really, really important. At each step of your analysis, you need to generate metrics. Why? Because if you have any issue at the end of your analysis, by looking at the metrics you have generated, you will be able to understand where the error comes from and what happened in your analysis; otherwise you will have to restart step by step and check what has been done. So metrics are really important. There are plenty of tools that provide metrics, and we'll generate some during the practical. If I had to pick some of the metrics, the four important ones would be these. Trimming, as I said, with the graph that we showed. Alignment, so the alignment rate, to check that your reads align correctly and that you are using the right reference. Depth of coverage, because usually you target a given depth of coverage with your sequencing center; you want, for example, 30x, or 100x if you do exomes, and you want to be sure you received what you asked for. And also insert size: when you do paired-end sequencing, it's important to control the size of your fragments because, if you want to do structural variant calling for example, the insert size is really important for the detection of structural variants.
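As a toy illustration of two of those metrics, here is a Python sketch computing an alignment rate and a mean insert size from a few invented records. In practice these numbers come from dedicated tools (samtools or Picard, for example) run on the whole BAM.

```python
# Toy QC metrics from invented alignment records: (is_mapped, insert_size).
records = [
    (True, 310), (True, 295), (True, 305), (False, 0), (True, 480),
]

mapped = [r for r in records if r[0]]
alignment_rate = len(mapped) / len(records)
print(f"alignment rate: {alignment_rate:.1%}")  # low rate: wrong reference? mix-up?

insert_sizes = [abs(size) for _, size in mapped if size]
mean_insert = sum(insert_sizes) / len(insert_sizes)
print(f"mean insert size: {mean_insert:.0f} bp")  # outliers matter for SV calling
```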
Okay, so... [Question from the audience about whether the trimming step is optional.] Yes, I can comment on that. So the question was: is the trimming optional? If you have really high-quality data, with really good base qualities, you could skip the trimming part, if you use a mapper that supports what we call soft clipping. Soft clipping means that if the mapper finds a part of the read that is not in the genome, like adapter, it marks that part directly on the read as not being part of the genome, so you don't need to remove it beforehand with trimming. So it's true that nowadays, if you have really good quality, you can skip this step, but not all mappers support it, which is why I showed it. BWA does support it, so if your base qualities are through the roof and you have a really low level of adapter, you can go without trimming.

Okay, so, in conclusion. If you want to do this data processing yourself, you will need a lot of mathematical and informatics skills. If you don't want to do it, you will probably find a bioinformatician to do it for you, but it's important that you understand what they are doing to your data, because each time we do an analysis, we make choices, and every choice has a cost, I mean a counterpart. You need to understand the choices that have been made and what they mean for your samples at the end, because some variants could be missed because of these choices, and some patterns may be there only because of these choices. So you need to understand exactly how your data have been generated.

My main message is also: metrics, metrics, metrics. That's really, really important. If I had to summarize further: always try to generate the best-quality data at each step before going to the next one. And a general comment: the major challenge now in this data processing is not the methodology, because the methodology is largely there, except for really new types of analysis; for general processing, all the methods and tools exist. What is more challenging now is the compute resources and the storage capacity. That's why I started with the introduction on high-performance computing: you will need a lot of CPUs to do the analysis, and if you do whole genome sequencing, one sample can take more than a terabyte of space during the processing. So if you have a hundred samples, you need a hundred terabytes of space; it's really challenging. Okay, so that's it, do you have any questions? Yeah.

[Question: among the metrics, is there one problem that is more common than the others?] There are so many problems you can face. One problem that is really easy to see, for example, is that you didn't choose the right reference genome. Here it's less common because most of us work on humans, so we have the right one, but if you work on mouse, or if you think you are working with human samples, and then you align your reads and see that the alignment rate is low when it should be fine, then probably the wrong reference was used, or there was a mix-up in the data.

[Comment from Guillaume:] But even in human, a lot of the time, as we've discussed before, it's the assemblies: if you map one data set on one assembly, and then you use a different data set that was mapped using a different human assembly, that's a very common mistake.

Yes, especially for long-term projects; we have projects that started years ago, when the latest build was not the same, so it's a real issue. So I would not say there is one single metric you can rely on, but the main important thing you need to do, and it's related to what Guillaume is saying, is to harmonize your analysis all along your project. Because if you run, for example, 10 samples through one pipeline, and then you say, oh, a new pipeline has arrived, a new assembly, a new variant caller, and you run the other samples through the new pipeline, you will create technical artifacts, this kind of batch effect, which you may then mistake for biological signal. So really, when you do a large project with several samples, try to use the same kits, the same sequencing technology, the same pipeline; everything should be harmonized. Otherwise, we know that each pipeline and each kit generates its own false positives, which you could then interpret as genuine variants.

[Comment:] I would add that the sequencers have been getting better, but some metrics really reflect the sample itself, the quality of the sample, like the duplicate rate. So watching the duplicate rate is important, and marking duplicates, as you're going to be doing. Sometimes there are problems with the sequencing, but often there are problems with the sample itself, and some of these metrics really capture the quality of the input material.
I would say one of the really important steps is also the indel realignment, because it's one of the steps that removes a lot of false positives, and it's a pity, because the tool we used to do that, from GATK, the Broad tool, has now been deprecated: their new variant callers do the realignment directly at the variant-calling step. But we still need it, because we use their tool for single nucleotide variant calling, but if you want to do structural variants or other types of analysis, you don't have that built in; you need to have the BAM file with the indels already realigned. So for me it's a really important part, and when I talk to the GATK people, I always say: why did you remove it? Your tools are not the only ones that get used downstream. So, as I said, there's no one particular metric; it's really a set of metrics.