So, good morning everyone. I'm Mathieu Bourgey, and today I will present an introduction to how we do analysis of NGS data: what tools we use, what type of computing infrastructure we use, and what the data look like. I will give the presentation for this module, the lab for this module, and the lab for the next module, module three, so you will see that the two labs are really connected.

The learning objectives of this module are to introduce you to the computing resources you can use to do NGS analysis, to understand the type of data and what we do with it, and to learn the terminology and the file formats we will meet, all while looking at how we analyze NGS data.

First, I will present how we use HPC servers. HPC stands for high-performance computing. We use HPC servers because, when we analyze NGS data, the size and volume of the data we generate for the kind of projects we work on mean a traditional computer is not powerful enough for the analysis: if you ran it on your own computer, it would take months or years. What we need is a really high level of computing resources, and that is why in Canada we are lucky to have Compute Canada, which provides these HPC centres.

What is an HPC centre? It is a set of clusters. And what is a cluster? It is like a grouping of hundreds or thousands of individual computers in the same place; not really individual computers, but computing resources equivalent to individual computers. Each computer is called a node, and each node contains processors, which are called CPUs.

So, in Canada we are lucky to have Compute Canada. It is a national platform that merges the consortia of the different provinces that provide the HPC resources, so depending on which province you come from, you will usually go through one consortium or another. If you want to use Compute Canada for your analysis, you need a Compute Canada account. The reason is that the idea behind Compute Canada is that instead of the government financing researchers directly to build their own small clusters, it prefers to give the money to one group, Compute Canada, and so reduce the cost of managing all these clusters. So if you want to use the resources, you need to join the consortium to get your account, and once you have your account, you get free resources.

Free resources means, first, a storage space, a place where you can put your data. It is a limited space, but you can ask for an extension every year if the quantity of data you host has grown. Second, you get a compute-time allocation, meaning a number of core-years of compute time. It is a bit complicated to understand, but the idea is that it gives you an amount of time during which you can use the computers with high priority; as a rough example, a job that keeps twelve cores busy for one month consumes about one core-year. Once you have used all your allocation, you can still use the resources, but with lower priority compared to people who have not used up theirs. This is just to ensure fair use among everybody in the consortium.
So, when you are on a Compute Canada server, what you want to do is launch your jobs, and when these jobs execute, they consume your allocation. How do you get your account? You just go to Compute Canada; then, once you have your Compute Canada account, you ask for an account with the local provincial consortium you want to be part of, and within that consortium you choose which HPC server you need to run your analysis.

When you have everything set up and you can access your server, you log in and you arrive on a login node. There are two types of nodes: login nodes and worker nodes. The login node is where you arrive on the server; it is just an entry point for everybody, and its role is to allow you to submit your jobs to a scheduler. A scheduler is a piece of software that takes your jobs, finds the available resources on the cluster, and matches what you ask for to what is available, so that your job goes to the computing resources that fit its requirements. When you go there, be careful: some new users try to launch their jobs directly on the login node, which is really bad, because it is a very limited resource that is there for everybody. If you run your jobs there, the sysadmin will be on your back saying stop doing that, you are blocking everybody; it is a kind of general rule.

For the lab practical today, we will work on one HPC server in Quebec called Mammouth, but we won't be on the login node: it is set up so that everybody arrives directly on one big worker node, and we will launch our jobs directly, which is not exactly the way you would do it on a real HPC server.

Since we won't see it during the practical: when you launch your jobs, as I said, you specify to the scheduler the task you want to run, and the scheduler matches your task to the computing resources. So there is a period where you wait for the scheduler to find suitable resources for your job, which is the queuing time, and this time really depends on what you declare as your job's needs: the number of CPUs, the length of your job, and how loaded the cluster is. You can try to play with these parameters to reduce the queue time, because for really short jobs you can sometimes spend more time in the queue than actually running, so it is a kind of game to find the most optimal way to submit your jobs.
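To make that concrete, here is a minimal sketch of a job script, assuming a SLURM scheduler (which is what Compute Canada clusters run nowadays); the account name and resource values are made-up examples:

```bash
#!/bin/bash
#SBATCH --account=def-mylab    # hypothetical allocation account
#SBATCH --cpus-per-task=4      # CPUs requested
#SBATCH --mem=16G              # memory requested
#SBATCH --time=02:00:00        # walltime; shorter jobs usually queue faster

# the actual work, executed on whichever worker node the scheduler picks
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

You would submit it from the login node with `sbatch myjob.sh` and watch its place in the queue with `squeue -u $USER`.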
So, it is a good thing to have these resources where you can launch your jobs, but what can you do with them if you don't have any tools or reference data for your analysis? At C3G, in partnership with Compute Canada, we are working on a system called CVMFS, which stands for CERN Virtual Machine File System. The idea is to have one location where we maintain all the tools and resources, and it is replicated across many (not all, but a lot) of the clusters in Compute Canada, so that whatever cluster you use, you will have access to the same tools, installed the same way, and the same type of resources; I mean the same genome reference sequences and everything.

So, the idea is to reduce the management burden for users: you don't need to install the genomes, you don't need to install the tools, you just ask us and we install them for you. Yes? Good question: is Compute Canada only available to people who are in Canada? I will talk about that in two slides. Okay. But yes, the question was whether Compute Canada is available for Canadians only.

So, as I said, in CVMFS we have a set of bioinformatics tools; I think it is more than 90 now, and every time somebody asks us to install a new tool, we first test it, see whether the tool is interesting for the community, and if yes we install it, so the whole community has access to it. We also provide reference resources for many genome builds and many species, and we provide standardized pipelines for the classical types of NGS analysis.

All these resources are used through a module system. That means when you arrive on the HPC server you have nothing in your environment, but with the module system you can list all the resources that are available and load the tools you need into your environment. This only sets up your environment to use those tools; you can load three, five tools at the same time, however many you want, and everything will be set up for you to use them.

So, the question: what do you do if you are not Canadian? The first solution is to become Canadian, but no, more seriously: most of what I present here are really common concepts that are shared by most HPC servers and most other types of computing platforms, like clouds and other kinds of instances. The only difference, if you use another type of server, is that you won't have all the tools and pipelines that were pre-installed for you through the partnership with Compute Canada, so you will probably have to install the tools on your own. If you want to do that, you can go to our website, where we have a bunch of scripts: for every tool we maintain, we provide an install script, so anyone who wants to reinstall a tool somewhere else can just take our script and run it, and it will fetch the source from the internet and install the tool correctly.
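In practice, using the module system looks like this; a minimal sketch, where the module names and versions are hypothetical since they vary between installations:

```bash
# list all the tools and resources available through the module system
module avail

# load the tools you need into your environment (names/versions are examples)
module load mugqic/bwa/0.7.12 mugqic/samtools/1.3

# the tools are now on your PATH and ready to use
samtools --version

# clean your environment when you are done
module purge
```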
Okay, so now that we have seen where we should do the analysis, let's see what we do with the data and what type of data we are talking about. We are talking about next-generation sequencing data. This is really an evolution of an old technology, the old clone-based sequencing: in the older technology, you take a clone, put in your DNA fragment, and sequence around 96 sequences at a time. Here, the idea is to extend this technique to be able to sequence millions of fragments at the same time.

How does it work? You have a sequencer, and instead of doing standard sequencing, where you measure how each base is incorporated, the sequencer takes a picture at each cycle, and the pictures show clusters whose signals represent each base incorporation in each fragment. There are many technologies for doing the sequencing; I decided to focus only on the Illumina technology, because it is the most used at the moment and will probably stay the most used for the longest time.

The idea of the Illumina technology is to do bridge amplification. You take your DNA and shear it to a given fragment size, usually around 400 to 500 base pairs, and you add adapters, so you end up with a set of short fragments carrying a specific adapter at each end. Then you load this big set of fragments onto your flow cell; the flow cell is what you put into your sequencer, and in the flow cell your sequences are amplified. The adapters you put at each end of the fragments match free probes that are fixed on the flow cell, so the molecules bind the probes and are amplified to generate a double-stranded molecule; you generate the complementary molecule. Then the double-stranded structure is broken, and you have the reverse copy of your molecule attached to the flow cell. There are other free probes around, so your molecule, with the probe-matching sequence at its other end, binds a free probe in close proximity to the one you just used, and you get the bridge amplification: the amplification of the reverse of the first copy, that is, a copy of your initial fragment. You break this double-stranded structure again, and you have two copies, in the two directions, both linked to your flow cell. You repeat this process multiple times, and you end up on your flow cell with a cluster of molecules containing two sets of molecules: the same molecule, but in the two directions.

Then you start the sequencing. You start with one adapter, so with the molecules in one of the two directions: you hybridize your primer, and then you incorporate the corresponding bases into your molecules. The bases you incorporate carry fluorophores, and at each base incorporation you take a picture of the flow cell; for each cluster you will see the same fluorescent colour flash, and you will see the cluster as one dot on your flow cell, like this one. For example, if we take this cluster, we will see a yellow dot, then a blue dot, a green dot, a yellow dot, a red dot, and that is how, for this molecule, we can work out the sequence, because each colour is linked to one specific base. So the trick the sequencer does is to take these pictures and then, for each cluster, based on its position on the flow cell, regenerate the sequence of your fragment.

Yes? Each cluster, it's bidirectional, right? Yes, but if we look at it here: usually we do paired-end reads. The sequencing starts with the molecules in only one direction at the beginning, because the adapters at the two ends are different, so the first primer you add is complementary to only one of them and you only read in one direction. Then you strip the newly synthesized strands from the cluster, add the primer matching the other type of adapter, and read the second end in the other direction. That way you do not mix the signal from the two sets of molecules.
Sorry, no. The question is: is the bidirectional sequencing an error-reduction technique? No, it's just that the quality of the sequencing decreases with time, so you cannot go from one end of the fragment to the other in one shot; otherwise the quality of the sequencing at the far end of the molecule would be bad, and also the molecule would be too long, so the sequencer would not be able to sequence reads of that size. The idea is really to sequence one extremity and then the other extremity, to have the two ends of the original DNA fragment, so you can locate the two ends and try to find specific variation supported by this fragment. So it is more because we cannot go over the whole fragment in one shot: first because the fragment is too long, and second because the quality would drop. Other techniques, if you think about PacBio or Nanopore and these kinds of newer long-read technologies, do sequence one molecule straight through with much longer reads, but it is a totally different way of doing sequencing; if you are interested, we can discuss it at the next break.

Now, when you design your NGS experiment, what parameters do you need to take into account? Mainly the read length, the length of your reads, which depends on the size of the molecules you are sequencing; and the type of library: will you read in one direction only, or in two? For example, if you are interested in small RNA analysis, you don't need to read in two directions, because your molecule is only about 30 base pairs long, so you just read straight through it once. Also, the type of error you will face, depending on the sequencing technology: Illumina errors are mostly substitutions, while PacBio errors, for example, are mostly indels, so depending on what you are interested in, it is worth looking at the different technologies. Also, how many reads you can generate, and whether you can put several samples in the same lane: usually one full lane is too much for a single experiment, so you want to reduce the cost by multiplexing several samples in the same lane. The cost and the turnaround time are also parameters, but they should not be the ones that drive your design: most of the time, when people come to see us having designed their analysis based on cost and turnaround time, they have not made the right choices for answering their biological question.

So, we have this sequencer, and we have these sequences, and what we want to do, in the case presented by Mike in the previous lab, is to find the variants that are related to some specific disease or some specific phenotype. We want to go from the sequences generated by the machine to the list of variants specific to our sample. This is the pipeline we use to do that; it is the full pipeline, and in this lab we will focus on the first part, while in lab three we will focus on the second part. The idea of lab two is really to give you the steps needed to arrive at a file that is ready for variant calling. As shown here, we start from the FASTQ files, and we want to get to data that are ready for variant calling.
So, what is a FASTQ file? This is the file you will receive from your sequencing centre. It is a set of sequences representing, if you are doing single-end sequencing, one read for each DNA molecule, or, if you are doing paired-end sequencing, two reads for each DNA molecule; the two reads represent the two ends of your original DNA molecule.

If you open one of these files, the format looks like this: there are four lines for each sequence. The first line is the header of your sequence, which contains the name of the sequence; the name is based on the experiment, your machine and so on, and encodes the location of your sequence on the flow cell, and at the end you have an indication of whether you are on read one or read two, so all the read-one records go together and all the read-two records go together. The second line is the sequence itself. The third line is a second possible header; most Illumina data won't use this second header, but some other formats can put a kind of duplication of the first header here. And then you have the quality. Okay, I need to go fast: the quality of the sequence is stored as ASCII characters, and an ASCII character can be translated to a numerical value that tells you the quality of your data.

What is this quality? It is a Phred score: when you convert the ASCII character to a numerical value, you obtain what we call the base quality, and this base quality stands for Q = -10 log10(P), where P is the probability that the base you have sequenced is not the right one. So if we talk about a base quality of 20, there is a one percent chance that the base we are looking at is wrong.

What do we do with the data? Usually, when we have this FASTQ, we QC it. We look at the base quality at each cycle across all the sequences, and we make sure the quality is high enough, usually above 20 or 30, because in general low quality can affect your analysis afterwards. Sorry? Yes, the plot shows the mean and the median; that is because, when the bases are incorporated, the sequences of a cluster do not all have the same rate of incorporation: some molecules run ahead, some molecules are delayed, and that plays on the quality of your cluster. Among the other types of QC we do: we look at the base content; we look for adapters, because if your molecules are short, you will find adapter sequence, which is non-genomic sequence, at the end of your reads; we look at the duplication rate, that is, how many times a molecule has been artificially duplicated; and we also do a kind of QC where we take a set of random reads from your FASTQ and BLAST them against a sequence database, just to make sure that what you have sequenced corresponds to what you expect in terms of species: if you sequenced human and the result comes back mouse, you say, oh, I have a problem with my sequencing.
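To make the quality encoding concrete, here is a small sketch that decodes the quality line of the first read in a FASTQ file, assuming four-line records and the usual Phred+33 encoding (the file name is hypothetical):

```bash
# print the Phred quality of every base of the first read in reads.fastq
head -4 reads.fastq | awk 'NR == 4 {
    for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i  # ASCII lookup table
    for (i = 1; i <= length($0); i++) {
        q = ord[substr($0, i, 1)] - 33                     # remove the +33 offset
        printf "%d ", q                                    # e.g. "I" -> 40, "5" -> 20
    }
    print ""
}'
```

A quality of 20 corresponds to an error probability of 10^(-20/10), that is, one percent, which matches the rule of thumb above.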
Once we have generated these metrics and we are happy with the quality, or if the quality is not good enough, we usually do trimming. The goal of trimming is to remove the possible adapter sequence at the end of your reads: when you sequence a fragment that is too short, you will read into the adapter at the end, and you want to remove it because it is not part of your genomic data. You also want to remove the bases that fall below a given level of base quality, because you don't want to introduce errors into your data. And after cleaning the sequences this way, if a read has become too short, we discard it, because the mapping of very short reads will be poor; we keep only the longer reads. To do that we use Trimmomatic, but there are many tools you can use.

After trimming, we do alignment. Once your reads are clean, you have two choices. If you have a reference sequence, you usually do alignment: the idea of alignment is to find the best location of your reads on the reference sequence. If you don't have a reference sequence, you generally do assembly, where you build contiguous reference sequences, and then you align your reads onto those contigs.

Read mapping is a bit challenging, because you need to locate millions of short fragments in a really large space of possibilities, given the size of the genome. It is challenging first because many reads have several possible locations, and also because you don't want a perfect alignment: you want to tolerate mismatches and gaps in your mapping, because you want to identify variants. If you tolerated only perfect matches, you would keep only what is exactly the same as the reference, which is in fact what you don't need. There are many algorithms and methods to do this; we use the Burrows-Wheeler transform approach, which is implemented in the BWA aligner, but there are many other aligners you can use. BWA is a really good aligner, one of the best dedicated to DNA, but, for example, Bowtie is one that is really used for RNA, so depending on your type of data you will probably have to change aligners.

When you run your aligner, it is really important to include a kind of tag for each library you generated and each lane of sequencing you generated. The idea is to add an identifier to your experiment, because sometimes you will merge several lanes of sequencing together to generate the final dataset, and you want to be able to trace back where each read came from. If you have an issue, you can then check whether the issue is specific to one experiment or affects the whole biological sample; and also, you will see that many of the tools we use need this information.

When you do your alignment, you generate a file called SAM or BAM; this is the data type we use to store alignments. SAM is a text format and BAM is a binary format. Almost nobody uses SAM because it is too big: the size of a BAM file can already be between 10 and 500 gigabytes, so you can imagine how much larger it would be uncompressed; that is why almost nobody keeps SAM files.
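Here is a minimal sketch of the trimming and alignment steps just described, assuming paired-end data; the file names, adapter file, read-group values, and thresholds are hypothetical and would need tuning for a real run:

```bash
# trim adapters and low-quality bases, and discard reads that end up too short
java -jar trimmomatic.jar PE -phred33 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    trim_R1.fastq.gz unpaired_R1.fastq.gz \
    trim_R2.fastq.gz unpaired_R2.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:20 MINLEN:32

# align with BWA-MEM, tagging reads with a read group (sample, library, lane),
# then sort straight into a BAM file with samtools
bwa mem -t 4 -R '@RG\tID:run1_lane1\tSM:sample1\tLB:lib1\tPL:ILLUMINA' \
    reference.fa trim_R1.fastq.gz trim_R2.fastq.gz |
    samtools sort -o sample1.sorted.bam -
samtools index sample1.sorted.bam
```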
So, for each alignment in your BAM file, you have one line, one big line, which gives this information. You have the read name, the same as in the FASTQ; then a flag, which is a bitwise flag, a kind of way to encode how the mapping of your read went (we will see that in the practical). You have the position, the chromosome and the position on the chromosome where your read starts, and the quality of your alignment: this mapping quality is, like the base quality, a Phred score, but instead of giving the probability of error of your bases, it gives the probability of error of your mapping, the same approach. A CIGAR string tells you how your read aligned. Then the position of the paired read, if you have paired-end reads, and the insert size between the two reads. Then you have your sequence in nucleotides and your base qualities, and then extra tags, which are specific to each aligner.

Yes? Is this only for ungapped alignment, meaning the read maps to the reference genome in one piece, at one position? So the question is: does the alignment report only ungapped mappings? No, the read can be split. In case the read is split, one part of the read will be tagged as the primary alignment, because for the statistics you don't want to count the two parts of your read as two reads; so one part is marked as the primary alignment, and the other parts are linked to it. So you really can have gaps and distance between the parts; especially when you do RNA, you expect reads that map from one exon to the next exon, so you need your read to be split across the two locations.
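To look at these fields yourself, and to decode a bitwise flag, you can use samtools; a quick sketch, reusing the hypothetical BAM file from above:

```bash
# first alignment record: name, flag, chromosome, position, mapping quality,
# CIGAR, mate position, insert size, sequence, base qualities, extra tags
samtools view sample1.sorted.bam | head -1

# decode a bitwise flag into human-readable properties; 99 means
# paired, properly paired, mate on the reverse strand, first in pair
samtools flags 99
```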
When you have generated your alignments, the next thing to do before variant calling is to refine them, because alignment is really something tricky: no aligner is perfect. When you choose an aligner, you choose some constraints on how the alignment will be done, and you need to correct for these constraints to get the best alignment you can.

The first thing you do is indel realignment. We do that because most aligners tend to prefer mismatches over gaps: when there is an indel, they tend not to insert the gap in the alignment, and instead place several fake SNPs around the indel. So what you want to do is go over the sites where an indel is possible and realign the reads there, to be sure you don't create these fake SNPs. You usually see this kind of pattern: before realignment, after realignment, the indel is created and the fake SNPs disappear.

Also, as I said, in your data you can face duplicates, that is, several copies of the same original DNA fragment, and you want to count only one copy of each original molecule. If, for example, there is a sequencing error at the first step of your library preparation, and you then have several copies of that molecule with the error, it will appear to you as a SNP if you don't recognize that it is just copies of the same fragment. What you want to see is biological variation, supported by different original DNA copies. So we remove the duplicates and keep only one read per original fragment.

Also, the sequencer usually tends to inflate the quality of the data, and some of the quality scores are biased by a specific genomic context or by the position in the read. So we need to correct them to get a more homogeneous distribution of the base quality, and that is what base quality score recalibration does. When you have done that, you have a set of really high-quality alignments, and you are ready to do your variant calling.

One thing I want to add, which is really, really important: at each step of your analysis, you need to generate metrics. It is really important, because that is how you will be able to see when something goes wrong, and to understand what went wrong. Metrics at each step, and there are a lot of tools you can use to produce them: metrics at trimming, at alignment, depth of coverage, insert size, a lot of metrics.
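As a sketch of these refinement steps, assuming Picard and GATK are the tools used (the file names continue the hypothetical example above, and dbsnp.vcf stands in for a real known-variants file):

```bash
# mark duplicate reads so that each original DNA fragment is counted once;
# the metrics file reports the duplication rate
java -jar picard.jar MarkDuplicates \
    I=sample1.sorted.bam O=sample1.dedup.bam M=duplication_metrics.txt

# recalibrate base quality scores against known variant sites, then write
# a BAM with the corrected qualities (GATK4-style commands shown here)
gatk BaseRecalibrator -I sample1.dedup.bam -R reference.fa \
    --known-sites dbsnp.vcf -O recal.table
gatk ApplyBQSR -I sample1.dedup.bam -R reference.fa \
    --bqsr-recal-file recal.table -O sample1.recal.bam
```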
The next steps of the pipeline will be presented in module three. In conclusion: doing NGS analysis requires a lot of bioinformatics skills, just to run the pipelines and the analysis, and also a good amount of mathematical skill to understand what you are doing, because it is usually good to understand what you are doing rather than just clicking a button and trusting whatever the software does. Really important, as I said: metrics, metrics, metrics; the more metrics you have, the more confidence you will have in your data. Important when you generate alignments: alignment is not perfect, so refine the quality of your alignments. And, going with the mathematics, you need a good knowledge of how the data were generated to understand the artifacts you can face at the end of your analysis. The last thing: the most challenging part of doing NGS analysis at large scale is not running the analysis, it is having a sufficient amount of storage and a sufficient amount of computing resources to do it. Thank you. Any questions?

Yes: is this a process where you intervene at each step along the way, or is it press-button, you know, you upload your FASTQ file, press a button, and wait for the variants to pop up? So the question is: is it a process you run manually or automatically, in a pipeline way? For this kind of standard analysis, at the centre we have developed our own pipeline, so we press a button and hope everything will be fine until the end, which is not often the case: the cluster can fall over, or a file can be too large for the settings we use in the pipeline, so sometimes we have to redo steps and go through them manually. But in many cases it goes all the way through, and we just need to monitor how the process is going, watch the stats coming out of the pipeline: okay, we are at this step, we saw the metrics, the metrics look good, let's continue running the pipeline; if we see that the metrics are not good, we stop and try to understand what happened to the data. At the beginning we were doing it manually, but by now, with between five and ten years of experience doing this kind of analysis, we have automated the whole process. If you look at the pipeline I presented at the beginning, all of it is automated. It's not quite one click, because you have to set up the pipeline and set up your files, but then you just launch the pipeline, all the jobs are dispatched with their dependencies taken into account, and you just wait to receive the email telling you your jobs are finished.

Yes: how do you determine what good quality metrics are? Is that something you develop as in-house standards? So, there is no gold standard that has been published saying: these are the standard metrics. There are kinds of standards, but really, every time there is a new technology, every time Illumina releases a new version of their chemistry, we need to re-evaluate and see whether our standards are still okay, whether there are new metrics we need to add, or thresholds we need to adjust. That is why there is no gold standard: the technologies change so often that the standards have to adjust to each new change. But there are rules of thumb that we know from experience. For example, when we sequence human, given the quality of the reference genome, we expect the percentage of reads that align to the genome to be around 95 percent; if you start to see a lower alignment rate, like 80 percent, you start to think there is probably some other type of DNA in your experiment. So we have this kind of metric, but there is no gold standard set in stone saying this is the number you need to hit; it comes more from experience. For example, for base quality we use 30 as a standard, but if we want to be less stringent because we want to keep more data, we can adjust the threshold: when you do DNA you go around 30, but when you do assembly you don't want so much variation, so much error, so you can bump it up to 35 or more. You need to adjust depending on what you are doing.
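As a quick illustration of that mapping-rate rule of thumb, a sketch using the hypothetical BAM file from earlier:

```bash
# flagstat summarizes the bitwise flags across all reads, including the
# percentage of mapped reads; for human data you would expect around 95%
samtools flagstat sample1.recal.bam
```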