Good afternoon. I will talk about genome alignment. What I want to teach you today is a bit of NGS technology — but you already had an introduction this morning, so I will go quickly over that part. What is important is to understand what type of data you get when you do NGS sequencing, what problems you have to face with these data and how to resolve them, what types of error there are, how we name things, the file formats you will receive, and how we process the data.

First, a short introduction. As Jarod presented this morning, NGS is a revolution that started 10 to 20 years ago with the first genome projects. To understand why it is a revolution: at that time it took billions of dollars and years to generate the sequence of one genome, whereas now it takes hours or days and just a couple of thousand dollars. The improvement is one of scale: before, we did on the order of a hundred sequences at the same time; now we do millions at the same time. The idea is the same as in the clone-based approach: you look at the fluorescence of each incorporated base to know which base has been added to the sequence. But instead of looking at one specific vector, you take pictures of millions of sequences at the same time and resolve them informatically, so you know which dot corresponds to which specific sequence, and so on for each base. There are many techniques, and most of them were presented this morning.

One of the most used is the one from Illumina, called sequencing by synthesis. How it works: you take your DNA and shear it at a given size, so you know approximately the size of the DNA molecules to expect. You add specific adapters at the ends; these adapters will first allow you to link your molecules onto the flow cell and later to start the sequencing. So you attach your molecules to the flow cell, denature them so you have single-stranded molecules, and then you do what we call bridge amplification: a molecule has the adapter at one end linked to the flow cell, and the adapter at the other end matches a free spot on the flow cell, so it creates a bridge, and you use this structure to do amplification. Then you release the double-stranded structure, which leaves two molecules in the two orientations, and you repeat this process a given number of times to generate a cluster of molecules. So in your cluster you have molecules in both orientations. Then you start from one end, specific to only one of the two copies you have in your cluster, you incorporate bases one at a time and take pictures of each incorporation step. If you have more questions about that, feel free to ask at the end or during the practical — I can give you more detail, but I think it was really well covered this morning.

In terms of technology, we have five major players for sequencing: Life Technologies, Illumina — really the major player — Roche, whose 454 technology is actually dying, PacBio, and Oxford Nanopore. These technologies can be grouped by the type of molecules they produce.
We have the short-read technologies, usually with reads around 50 to 200 base pairs; medium-read technologies, around 600 to 700 base pairs; and long-read technologies, which can go up to 50 or even 100 kb per molecule. What is important is to understand the strengths and disadvantages of each technology. Here we have a summary — I won't explain everything, but you have it with you so you can look. When you choose a technology, it is really important to understand what it will bring you: what type of molecule, but also what type of error and what disadvantages, because when you design your project, depending on what you are interested in, it is important to choose the right technology. One thing that is really important is the error rate: the short-read technologies are less error-prone than the long-read technologies. There is also the read length and the number of read pairs. You need to take all these parameters into account, because if I need to sequence a bacterium or a human, I don't need the same amount of reads. So think about all of this before designing your experiment. Each technology is better suited to specific applications.

Here at the centre in Montreal we have almost all of these technologies except SOLiD, and this is how we use them. The medium-read technology we use mostly for small de novo genome sequencing — though not so much now that we have long-read technology — for metagenomics, and for amplicon sequencing. Amplicon sequencing is essentially validation: we amplify a specific fragment by PCR and we want the sequence of that fragment of five to six hundred base pairs.

The short-read technologies come with really high throughput, and they are used mainly to do all the "whole" omics: whole genome, whole exome, whole transcriptome, ChIP-seq — all the experiments where you want to interrogate the entire content of your genome, whether it is RNA, DNA and so on. Then you have the long-read technologies, PacBio and Oxford Nanopore. We use them mostly for small and medium genome analysis, and they are particularly efficient for genome assembly: if you don't have any reference genome for your species, they are really good for creating one. We also use them for targeted sequencing: if you want to characterize a full transcript of one or two kb, it is nice to have the full transcript in one molecule, so you see one read that contains your entire transcript instead of multiple reads.

We have also recently acquired at the centre a new technology called 10x, which is really cool. It is not a sequencing technology; it is a library preparation technology. The idea is that the library for each molecule, or set of molecules, is prepared on a bead which contains everything needed, and a specific barcode is added to each of your molecules.
The barcode makes it possible to link your molecules together: you still do short-read sequencing, like Illumina sequencing, but you have a specific barcode that ties together all the molecules coming from the same long fragment. So you can reconstruct longer fragments, 30 to 50 kb, with a low level of error — you know that all these molecules come from the same haplotype, so you can rematch things and do haplotyping. What is also really cool is that instead of linking molecules, you can do single-cell library prep: each cell is tagged with a specific barcode, then you sequence, and you can tell which cell the DNA came from. It is a really cool technology. (Question: what is MUGQIC? MUGQIC is the centre at McGill University.)

When it is time to design your experiment, this is my recommendation on what you need to think about: what read length do I need; what library type — RNA or DNA, paired-end or single-end; which technology, and therefore which error profile, because if I am interested in indels, long reads could be problematic; how much depth of coverage do I need per sample; do I need to barcode my samples or not — barcoding is cool, but it can also introduce bias in your data, as we know from the recent reports about index switching. There are other parameters to take into account, cost and turnaround time, but I put them in grey because for me they are really not the major parameters. Really often, people and PIs come to me to design a project and say: I want to do this and this and this, but I only have this amount of money — how should I adapt to do everything with that amount? This is not a good strategy. You need to design first; then, with your budget, you decide whether you can do the whole set of experiments or only a part of it. Don't try to fit everything into a small budget, otherwise you will just generate data that are unusable at the end.

OK, so that was the introduction about the technology; now, how do we analyze the data? In this workshop we are working on DNA-seq analysis, so we will only focus on DNA-seq. Here I show you the workflow of the DNA-seq pipeline we have developed at our centre, and in this module we will only talk about the first steps: processing the data to generate a high-quality alignment file ready for variant calling. We will show the rest of the pipeline tomorrow. To summarize: the question for this module is, how do I start from the FASTQ files and generate a BAM ready for variant calling? This can be divided into three steps.

I mentioned FASTQ files — so what is a FASTQ file? The FASTQ file is, in most cases, the file format you will receive from your sequencing centre. When you sequence your molecules, if they are long enough, you will do paired-end sequencing: you sequence one end of the molecule and then the other end, and you won't have the middle of the molecule. You get two reads — that is what we call a read pair — so for each sample you have two files, one with the first mates (read 1) and one with the second mates (read 2). In each file, each sequence takes four lines. The first line is the header of the sequence. It always starts with the at sign (@), followed by the name of the sequence, which usually encodes the name of the sample, the name of the machine, the physical position of the cluster on the flow cell, and, at the end, whether it is read 1 or read 2. Then you have your sequence as bases. Then there is a place for another header: for Illumina it is just a plus sign, but for other technologies the header can be repeated after the plus sign instead of the at sign. Finally you have a line with the quality value of each base. Notice that each base gets a single character, not a number: you can imagine that if I wrote a quality of 100 as digits, I wouldn't know which digits belong to which base. So each quality is one letter, and the ASCII value of that letter, minus a fixed offset, gives you the quality of the base — the character "4" does not mean quality 4; you need to go to the ASCII table, look up the value of the character, and subtract the offset.

So what does this base quality mean? The base quality is a Phred score: minus 10 times log base 10 of a probability, and for base quality, that probability is the probability that this base has been wrongly called — the probability of an error. So you want the highest possible base quality, which means the lowest probability of an error in the call.
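To make the encoding concrete, here is a minimal sketch, assuming the common Sanger/Illumina 1.8+ encoding with an ASCII offset of 33 (older Illumina pipelines used an offset of 64, so check what your instrument actually produces):

```python
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+; older Illumina data used 64

def char_to_qual(c):
    """Decode one FASTQ quality character into a Phred score."""
    return ord(c) - PHRED_OFFSET

def qual_to_error_prob(q):
    """Phred: Q = -10 * log10(P(error))  =>  P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# 'I' is ASCII 73, so Q = 73 - 33 = 40: one expected error in 10,000 calls
q = char_to_qual('I')
print(q, qual_to_error_prob(q))  # 40 0.0001
```

So a base quality of 20 means a 1% chance of error, 30 means 0.1%, and so on — each 10 points divides the error probability by ten.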
When we receive the FASTQ files with these base qualities, what we usually do is generate this kind of plot, which represents the distribution of base quality along the read, over all the reads. Each yellow box represents the distribution — the 95% interval — of the base quality for all the sequences at that position: the first base of the read, then the second, the third, and so on until the end of the read. It is important to look at this, because low base quality can bias your analysis at the end: the lower the quality, the more likely you are to have errors in your reads and to confuse errors with variants.

Another type of metric we compute on the FASTQ: as before, we take all the reads and look at the base content — each coloured line represents a different base, and shows the percentage of that base at each position. You see here that at the beginning it is a bit non-random, and then it evens out to close to 25% of each base. This is the plot you would expect to see in an RNA analysis, because the way the RNA is fragmented and primed to generate the cDNA is non-random, which is why we observe this non-random pattern at the start; and since the cDNA comes from genes, we expect around 25% of each base over the gene body. If we were looking at a whole genome with higher GC content, we would instead expect flat lines with GC elevated — for a GC-rich genome, say around 60% GC and 40% AT. So depending on what you are doing, if you are doing whole genome and you see this RNA-like pattern, you can say: they didn't do a whole genome, they did RNA. That is the kind of QC you can do on your data. Another type of QC we really often do is to look for known sequences in the data.
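The per-position base-content table behind this plot can be computed directly from the reads; a minimal sketch in plain Python (real QC tools like FastQC do essentially this over millions of reads):

```python
from collections import Counter

def base_content(reads):
    """Percentage of A/C/G/T at each read position, over all reads."""
    counts = []  # one Counter per read position
    for read in reads:
        for i, base in enumerate(read):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        table.append({b: 100.0 * c[b] / total for b in "ACGT"})
    return table

# toy read set: position 1 is 100% A, position 4 is mixed
reads = ["ACGT", "ACGA", "ACGT", "ACGC"]
print(base_content(reads)[0])
```

On whole-genome data you would expect every row of this table to hover near the genome's overall base composition; a strong position-dependent skew is the signature of primed cDNA or adapter contamination.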
By known sequences I mean: when you prepare your library you add specific adapters, and you use these sequences to generate the library. We know the sequence of these molecules, so we simply check: do we detect these sequences in the data? You always expect to see a bit of adapter, because sometimes a molecule is too short and you sequence through the whole molecule into the adapter. Seeing some is not a problem; seeing a lot of them becomes a problem.

Also directly from the FASTQ, we can make a rough estimate of the percentage of duplication in the data. Duplicates are reads that represent the same initial molecule, and you don't want them: you want the maximum number of reads representing different molecules. If you have 100 reads representing the same initial molecule, you don't need them — if there was an error in that molecule at the beginning, you will see that error 100 times. You want the largest possible diversity of molecules.

Something we also do, which we added only about five years ago: from the FASTQ we randomly take on the order of a thousand reads and BLAST them against the nt database, just to check what we have actually sequenced — because sometimes you can have surprises, contamination or sample mixing. If you are sequencing human and you see mouse as the first hit, you say: oh, there is a problem.

(Question: on the QC of raw sequences, what number of hits for a non-target species would cause you concern? For example, there are 65 human hits in that table for a mouse sample.) Those are probably due to sequence similarity between human and mouse. I cannot give you a strict rule of thumb — there is no clear threshold — but if I see less than one percent, or 0.1 percent, I would be fine, because I know that just from the quality of the reference assembly I will lose around one to ten percent of my reads anyway, depending on the species, simply because the assembly is not good enough to align all my reads. If 0.1 or one percent of my reads look contaminated and the contaminant is very close to my species, I might worry; but if it is, say, bacteria in a human sample at one percent, I will just lose that one percent and not worry. The lower, the better.

So, trimming. As I said, trimming is done to remove low-quality bases and adapters. Picture your fragment: you have the adapter at each end of the insert, and you do the forward sequencing from one end and the reverse sequencing from the other. You can imagine that if the insert is too short, the read will run over into the adapter. So when we trim, first we look for adapters, and if we find adapter sequence at the end of read 1 or read 2, we remove it. After that, we take each read starting from its 3' end. We know that, due to phasing issues in the cluster, the quality is really good at the beginning of the read and then decreases, because over time some molecules in the cluster get ahead of or behind the rest of the group and you start to see discrepancies in the sequence being read. So we start at the end of the read, look at the quality of the last base, and if it is lower than a given threshold we cut it and move to the next base, cutting until we reach a base with quality above the threshold. Most of the time we use 20 or 30 as the threshold. When that is done, we look at the length of each remaining read: some will be 100 base pairs, some 50, and some could be really short depending on how much was trimmed. If a read is too short we drop it, because a read of, say, 32 bases will give an alignment of bad quality. That 32 is my number — there is no golden standard; it comes from my experience, and you can adjust it depending on what you want. And I have to say: for every threshold you choose, for every piece of software, for everything you do in your experiment, the choice has consequences for what you will see. If I cut too much, I lose information; if I am less strict, I keep more errors. Every choice has an impact, so you need to be aware of the choices you have made, and if you see something odd at the end, you will know it could come from those choices. To do the trimming, today we will use Trimmomatic, but there are many other tools — Cutadapt is another really famous one — and most of them do the job well. If you have a trimmer that you prefer, just use the one you prefer.
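The 3'-end quality trimming described above can be sketched like this (a deliberately simplified model; Trimmomatic's sliding-window mode is more elaborate, and the Phred offset of 33 is an assumption about the encoding):

```python
PHRED_OFFSET = 33  # assuming Sanger/Illumina 1.8+ quality encoding

def trim_3prime(seq, qual, threshold=20, min_len=32):
    """Cut low-quality bases from the 3' end; drop the read if it gets too short."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < threshold:
        end -= 1
    if end < min_len:
        return None  # read dropped entirely
    return seq[:end], qual[:end]

# toy example: the last two bases are '#' (Q2) and '(' (Q7), both below Q20
print(trim_3prime("ACGTACGT", "IIIIII#(", threshold=20, min_len=4))
```

Raising `threshold` or `min_len` trades reads away for cleaner alignments, which is exactly the choice-with-consequences trade-off just described.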
When you have trimmed your data, you usually have FASTQ files of good quality, and the next thing to do is the alignment. To do the alignment you need a reference, so with your FASTQ you really have two choices: either you do the alignment — but for that you need a good reference sequence — or, if you don't have one, you will need to do assembly. You will see that tomorrow afternoon, so I will not give details here, but if you don't have a good reference because you are working on a particular species, you generate the data, do your own assembly, obtain your reference sequence, and then do your mapping.

The idea of mapping is to find the best location of your read on the reference sequence. Be careful: I don't say the true location, I say the best location, because your reference is not perfect. For most of your reads the best location will be the true location, but for some of them it will only be the best one. The problem of mapping is this: you have millions of reads, usually between 100 and 200 base pairs long, and you want to place them — it is a kind of puzzle — on a reference genome that is a few billion base pairs long. It is not so complicated for one read, but doing it for millions of reads gets complicated. Why? Because of the nature of the reference genome, a read can have multiple possible locations — there are regions that are identical at different places in the genome. And you don't want to look only at perfect matches, because in most DNA-seq analyses what you want is to find variants. If I look only at perfect matches, I will only find the same thing as the reference, so you need to allow some freedom, some differences, to be caught.

There are many algorithms to do that. We will use the Burrows-Wheeler transform, which is one of the most used and efficient algorithms, with the BWA tool, one of the top three mappers you can use. The top mapper is Novoalign; we won't use it because, until recently at least, it was commercial software, and I don't push for commercial software — I am really for open science and open-source software — and the difference is really, really small. You also have to know that BWA is an aligner dedicated to DNA-seq analysis: if you want to do another type of experiment, RNA-seq or ChIP-seq, you will need a specific mapper, because those molecules have characteristics the mapper must take into account. For example, for RNA you need to split your molecules across exons, so you need to tolerate big gaps in the alignment, whereas BWA will favour regions where the read aligns all along in one piece.

Something really, really important when you do your alignment is to use the RG tag — the read group; it is an option of the aligner. Especially when you do whole genome, you will usually end up with several sequencing runs, several read sets, for the same sample. It is really important that each read set that comes out of the machine is aligned separately and given its own read group. Why? First, because aligning each lane separately lets you gain time by parallelizing your work; second, because it lets you track where each read came from. At the end, when you merge all the lanes together and do your analysis, you may see a specific pattern.
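With BWA, the read group is typically passed with the `-R` option; here is a hedged sketch of building such a command, one per lane (the sample, lane, library, and file names are made-up placeholders — substitute your own):

```python
# Build a BWA-MEM command line with an explicit read group, one per lane.
# "sample1", "lane1", the library name and file names are hypothetical.
sample, lane = "sample1", "lane1"
read_group = rf"@RG\tID:{sample}_{lane}\tSM:{sample}\tLB:{sample}_lib1\tPL:ILLUMINA"
cmd = [
    "bwa", "mem",
    "-R", read_group,            # attach this read group to every alignment
    "reference.fa",              # your indexed reference
    f"{sample}_{lane}_R1.fastq.gz",
    f"{sample}_{lane}_R2.fastq.gz",
]
print(" ".join(cmd))
```

Note the literal `\t` separators (the raw string keeps the backslash): BWA expects them verbatim in the `-R` argument, and the `ID` must be unique per lane while `SM` stays the same across all lanes of one sample — that is what lets downstream tools merge lanes but still distinguish them.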
If that pattern is unexpected, it could be fine, but it could also be a problem — say you see way too much variation, way more variants than you expect. Then you can go and inspect it in IGV, read group by read group: if you see the variants in all your read groups, in all your different libraries, you can say it is something biological in your sample; if you see them in only one library, it is probably something technical coming from that library — maybe it is actually a library from another sample. So to be able to track back what you see at the end, it is really important to use read groups. It is also important because many tools now rely on the read group and will require it to be set.

When you do your alignment, what you end up with is a file called SAM or BAM. SAM is the uncompressed version and BAM is the binary version of the file — SAM stands for Sequence Alignment/Map format. You will have one BAM per read set or per sample, depending on your setup. The file has a big header that tells you a lot of things — we will see it in more detail during the practical — but what is important to know is that you have one line for each alignment that has been formed. The line looks like this: you have the read name, the same as in the FASTQ file; then a flag. The flag is a number that describes what happened to your read — whether the read is mapped, whether its mate is mapped, and so on. It is a bit field: each event has a specific bit value, they are all added up, and from the sum you can retrace what happened to the read. Then you have the reference name and position in the next two fields, and then the quality of your alignment, the mapping quality, which is a Phred score just like the base quality, but for the whole read.
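The flag bit field can be decoded like this (the bit values below are the standard ones from the SAM specification):

```python
# Standard SAM flag bits (from the SAM specification)
FLAGS = {
    0x1:   "paired",
    0x2:   "proper pair",
    0x4:   "unmapped",
    0x8:   "mate unmapped",
    0x10:  "reverse strand",
    0x20:  "mate reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
    0x400: "duplicate",
}

def decode_flag(flag):
    """List the events encoded in a SAM flag value."""
    return [name for bit, name in FLAGS.items() if flag & bit]

# 99 = 1 + 2 + 32 + 64: a typical "good" first mate of a proper pair
print(decode_flag(99))
```

This additive scheme is why you so often see values like 99 and 147 in a well-behaved paired-end BAM: each is just the sum of the bits that apply to that read.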
Then you have the CIGAR string, which describes how the read is mapped — we will see it in more detail during the practical, but here, for example, 76M means 76 matches: the read is 76 bases long and every base has been matched to a position on the genome. Be careful: this does not mean it is a perfect match. M means the base corresponds to a position, but it could carry a variant. Then you have the information about the mate, if you have one: an equals sign means the mate is mapped on the same chromosome, then you have the position where the mate starts, and the insert size — the distance between the two mates, measured not between their inner ends but between the outer ends of the two reads. Then you have the sequence and the base qualities, and then another set of fields that are aligner-dependent: usually the read group, plus various other tags which, as I said, depend on the aligner.

So you have done your alignment, and that is good, but there is no perfect aligner — the best ones are good, but none is perfect. So just as with the FASTQ, once we have the alignment we need to refine the file, take this alignment and try to make it better. The first thing we do is indel realignment. Why do we need that? Here is an example of indel realignment: you see the reads before and after, and before, in this region, the reads accumulate many possible mismatches against the reference — all the coloured bases, as we saw this morning. When you see that, you probably say: I have an issue in my reads, because I don't expect to see so many variants; I expect a variant roughly every kilobase, not five or six variants within a 10 or 15 base-pair window.
Why do we see this pattern? Because of the way most aligners work: in terms of algorithmic penalties, they are built to favour mismatches over indels, so they tend to place several mismatches rather than open an indel — it is just a question of the algorithm. So what we do in these cases, when we have too many variants or when some of the reads show an indel: we take the region and realign all the reads, asking, if I place an indel at this given position, does it improve my alignment? That is all it does. It is slow, it is complicated to do, but it will really save your life and clean up your data. I should say that some SNP callers — including the one we will use — redo a local realignment around the site when they call, so indel realignment is not strictly mandatory when you know you will use that kind of caller and will use your data only for variant calling. But from my point of view, whatever caller I end up using, I do the indel realignment, because I may use my data for purposes other than variant calling; I don't care if it is overkill to do it once here and have the caller redo it. If you really know you will use a caller that takes care of it and you will do nothing else with your data, you can skip this step, but I would not recommend it. (Question: so the aligner creates too many indels?) No, the opposite: it creates too few indels — the problem you see instead is too many mismatches. There is actually no perfect algorithm that generates the perfect balance, so it is either one or the other; the thing is that computationally it is easier to favour mismatches over indels.

The next improvement is to mark duplicates. As I said, at the FASTQ level we can estimate the level of duplication in the data, but what is important is to really go through the data, see which reads actually are duplicates, and either remove them or mark them. Where do duplicates come from? There are different possibilities depending on the technology. A classical one, mostly linked with the old Illumina flow cells, is optical duplicates: when you overload your flow cell, you get big spots — during bridge amplification you get a really large cluster, and when the sequencer takes a picture, if the cluster is too large it counts it as two separate clusters, giving you two different reads for the same cluster. These are called optical because it is really the optics of the sequencer that causes them, and they are really easy to target: we have the physical position of each cluster on the flow cell, so if we see exactly the same molecule at two neighbouring positions, one of them is an optical duplicate. On the new patterned Illumina flow cells, where all the wells and clusters are physically predefined in the cell, we have a similar type of issue: if you don't load enough material, or if a molecule is too long, the molecule can jump from one well to the next — a clustering duplicate. Then you have PCR duplicates: when you do the PCR at the beginning of library prep, your molecules start to amplify, and you also get artifact molecules created during amplification, but we won't look at those here. A PCR duplicate simply means you did too much PCR on the same sequence.
so you increase the number of copies of that sequence in your data, and chimeras simply create a kind of artifact in your data. How do we look for duplicates? We can do it before mapping, as we did on the FASTQ, with the k-mer approach: we take bases 10 to 20 of read one and bases 10 to 20 at the end of read two, we concatenate them into a k-mer, and we look at the k-mer frequencies; if two pairs give the same sequence, it probably means they are the same read. It is not perfect, but it gives us a good estimate. Most of the time, though, what we do is a positional approach after mapping: we look at where the 5' end of the first read maps and where the 5' end of the second read maps, and if two pairs have the same positions, we expect them to be the same fragment. We take the 5' ends because at the other extremity the reads could have been trimmed, and two duplicate reads could have been trimmed differently. When we find duplicates, we look at the quality of each read in the same duplicate group and keep only the one with the best quality; the others, as I said, we either remove from the BAM or mark them. Usually we mark them, because we want to keep everything in the BAM file so we can go back to the FASTQ if we need it for any other reason, but doing one or the other gives the same result at the end of the analysis. Another type of improvement you need to do on your alignment is base recalibration. Why do we do that? Because when you do sequencing, the vendors tend to inflate the base quality values, and also because the base quality is biased, first by the position in the read, where you see that you have a
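The positional approach just described can be sketched in a few lines. The hypothetical `reads` records below stand in for BAM entries; grouping is by (chromosome, 5' position, strand), and within each group the read with the best summed base quality survives while the rest are marked:

```python
# Minimal sketch of positional duplicate marking.
from collections import defaultdict

reads = [
    # name,   chrom,  five_prime_pos, strand, base qualities
    ("pair1", "chr1", 10_000, "+", [30, 35, 36, 37]),
    ("pair2", "chr1", 10_000, "+", [38, 39, 40, 40]),  # same fragment, better quality
    ("pair3", "chr1", 20_500, "-", [30, 30, 30, 30]),
]

groups = defaultdict(list)
for read in reads:
    name, chrom, pos, strand, quals = read
    groups[(chrom, pos, strand)].append(read)

duplicates = set()
for members in groups.values():
    # keep the member with the best total base quality, mark the others
    best = max(members, key=lambda r: sum(r[4]))
    duplicates.update(r[0] for r in members if r[0] != best[0])

print(sorted(duplicates))   # ['pair1']
```

Real tools group on the *unclipped* 5' coordinate for exactly the trimming reason mentioned above: two copies of the same fragment may have been trimmed to different lengths, but their 5' anchors agree.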
decrease at the end, and second by the genomic context, because the polymerase does not work at the same speed on different base contexts, so you accumulate more errors depending on the genomic context of your bases. The idea is to model this in your data: once you know where a read maps, you know the context and the position, so you can model and correct the base qualities to get a flatter distribution over the read and over the genomic context. You just correct for that. When you have done that, you have alignment files that are clean and ready for variant calling. Before you do the variant calling, what is really important is to collect metrics. You should collect metrics at each step; it is really important, because it is through the metrics that you understand what happened to your data and whether there is any issue with it. Sometimes we skip the metrics, do the analysis, see some weird pattern at the end, and only then go back to the metrics and see: oh, there was an issue at this step. So it is really, really important. Many tools provide metrics, and we will use several, because they all have metrics functions, so you will have plenty of occasions to generate them. The most important metrics, from my point of view, are these: all the QC at the beginning, then the trimming rate and how the trimming went, the alignment rate and how it went, the depth of coverage, the insert size, and so on. If I had to select only four, it would be those. Once you have done all that, you will be able to go on to what we will do in modules four and five, which is the variant calling, for SNVs or for structural variants. Oh, I don't have my conclusion slide, so I will do my conclusion without
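The recalibration model can be sketched as a simple counting table. This is a toy, BQSR-like illustration under stated assumptions: for each observed base we already know its reported quality, its cycle (position in the read), its dinucleotide context, and whether it mismatched the reference at a site with no known variant; the empirical error rate per bucket then replaces the reported quality:

```python
# Toy sketch of base quality score recalibration.
import math
from collections import defaultdict

def phred(p_error):
    """Convert an error probability to a Phred-scaled quality."""
    return round(-10 * math.log10(max(p_error, 1e-9)))

# observations: (reported_qual, cycle, context, was_error) -- invented data
observations = [
    (30, 5,  "AC", False), (30, 5,  "AC", False), (30, 5,  "AC", True),
    (30, 95, "GG", True),  (30, 95, "GG", True),  (30, 95, "GG", False),
]

counts = defaultdict(lambda: [0, 0])   # (qual, cycle, context) -> [errors, total]
for qual, cycle, ctx, was_error in observations:
    bucket = counts[(qual, cycle, ctx)]
    bucket[0] += was_error
    bucket[1] += 1

# the empirical quality per bucket replaces the reported Q30
for key, (errors, total) in sorted(counts.items()):
    print(key, "->", phred(errors / total))
```

With real data each bucket holds millions of observations, so the empirical qualities are stable; the point of the sketch is only the shape of the model: quality conditioned jointly on reported score, cycle, and context.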
the slide. My conclusion is that working in NGS is really something interesting, but it requires a good understanding of the biology, the mathematics, and the informatics: biology to understand what you are doing in your experiment, mathematics to understand the algorithms used by the tools, and informatics to know how to use and run those tools. What is really important: metrics, metrics, metrics; the more, the better. Also, one of the major limitations we have when working with NGS is informatics, in the sense of the size of the data. For whole genome, one sample, one BAM file, can be 500 gigabytes of data, and during processing you may need three or four times that amount of space. So you can imagine: if you process 500 samples for a large project, you can end up with terabytes upon terabytes of data. Informatics is really the limiting factor of NGS at the moment. OK, that's it for this lecture. Do you have any questions? [Question] I have a question about your duplicate marking. It is my understanding that the reason we use Phred scores is that they can be summed to combine confidences. For example, if you have paired-end reads with an overlap, then in the overlap region, when the two calls agree, you can sum the confidence scores, because you have twice the evidence for that base; and if they contradict each other, you take the difference, because your more confident call is reduced by your less confident call. So what I am wondering is: if you have duplicate reads, in the same way as with an overlap, rather than keeping only your most confident read, can you not sum the agreements and subtract the disagreements to get a more confident answer for that
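The Phred arithmetic the question refers to can be written down directly. This is a sketch of the common rule used for overlapping mates (the cap value of 60 is an assumption; real consensus tools choose their own caps): on agreement the qualities sum, on disagreement the stronger base survives with the difference of the two qualities:

```python
# Sketch: combine two base calls covering the same genomic position.

def combine_overlap(base1, q1, base2, q2, cap=60):
    if base1 == base2:
        return base1, min(q1 + q2, cap)   # agreement: evidence adds
    if q1 >= q2:
        return base1, q1 - q2             # disagreement: stronger wins, weakened
    return base2, q2 - q1

print(combine_overlap("A", 30, "A", 25))  # ('A', 55)
print(combine_overlap("A", 30, "C", 25))  # ('A', 5)
```

The question is whether the same rule could be applied across a duplicate group instead of keeping only the single best read; the answer below explains why standard duplicate marking does not do this.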
particular fragment? [Answer] We could, but the idea behind marking duplicates is... we could perhaps correct towards a better consensus copy, but it is quite complicated to do, because it means you may need to change the sequence, and the other thing is: when you have a discrepancy, how do you know which one is true? How do you correct it? If the copies match perfectly, you can say: OK, that's a perfect match, I keep it. But if I start to have two or three copies that do not perfectly agree, which one do you choose? That is the kind of question you face, and you do not want to introduce errors, so it is not easy to decide what the truth is when you merge. If everything agrees, yes, you could use that and say: OK, I could increase my quality. But if you have exactly the same sequence everywhere, the mapping quality should already be the same for all of them, so I don't see how you would really improve the mapping quality that way. [Question] I guess the piece I am missing is that you said you just keep the one with the best quality? [Answer] Yes, the best: it takes the read with the best base quality over the whole sequence, and the best mapping quality, so it takes the one that seems to be the better sequence. If you have one read with a drop of quality at the end, you say: OK, I trust more the one without that drop of quality, right? That is why we keep only that one. [Question] So what I was wondering is: if you have a contradiction, shouldn't you be taking your best quality read and subtracting the uncertainty introduced by the one that is
similar but has a disagreement? Because each of those calls has a certain probability associated with it; just because one has the highest probability does not necessarily mean it is correct. Doesn't the uncertainty introduced by the other one diminish the quality of your best call? [Answer] If the other copy, the one with the discrepancy, is of lower quality, then it is more likely that this other read brings you an error. That is why you remove them: you do not want to introduce the errors they may contain. Does that make sense? Usually, most of the time, all these duplicate reads will have the same sequence, because they are duplicates, so there is not really much difference between them. What can differ is the length of the reads, because some of them will have been trimmed at different points and others not, so that can make a difference, but you try to keep the best read overall. If you take the information from the others, which are of lower quality, the only thing they can bring is errors; that is why we remove them. The others won't give you a better answer, I think, because they are of low quality. [Question about paralogous sequences] Usually it is better to do it after mapping, but I am not aware of a specific program that handles paralogous sequences; do you mean haplotypes, or that kind of thing? So usually... no, it depends on the reference. At this point it is more a question of assembly: you will try to reassemble, and in that case the paralogous sequences really are the problem, because they will break your contigs. If your reads are long enough to span the paralogous sequence and you can generate a correct reference, you will map to it; usually, in most references, you try to collapse the paralogous
sequences to the lowest number of copies, and then all the reads from the different paralogs will map onto the same copy; you will see an abnormal rate of reads at this location, and you will be able to tell: OK, this region is a paralogous sequence. If you do not have the reference sequence, it is more problematic, because you will need to generate that reference first before mapping, and you will probably not be able to resolve the sequence at that location. Now, if we talk about human genomes: the new version of the human genome, hg38, added a lot of alternate haplotypes, especially for the HLA region; I think they added a hundred or a few hundred haplotypes for the HLA sequences. That is a really big issue for the aligner, because if I align a read in this region, it will see: oh, I can map on the primary haplotype, but I can also map on many of the other haplotypes; so it will be given a mapping quality of zero, because it has an equal chance of mapping at the different locations. So now, with the new reference, aligners like BWA have developed an alt-aware strategy where you inform the aligner of where you expect to see these haplotype patterns: the read is mapped normally, but if it falls in a marked region, a mark is added to the read; then you post-process your BAM file, and it specifically goes to the haplotypes and redistributes the reads between them, instead of leaving the reads with mapping quality zero. So the new versions are haplotype-aware, but you need to know the haplotypes in your reference. As I said in my first answer, it is really a question of the quality of your reference. [Question] Can I ask a pretty basic question? What do you mean by one direction? There is a forward and a reverse read, but if you've got this cluster, are you just getting a
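The mapping-quality-zero effect described above follows directly from the definition of MAPQ as -10*log10(P(the mapping position is wrong)). The sketch below uses a uniform-posterior model over equally good hits, which is an assumption for illustration; real aligners use their own, more elaborate formulas:

```python
# Sketch: why reads with many equally good placements get MAPQ 0.
import math

def mapping_quality(n_equally_good_hits):
    """MAPQ when a read has several equally good candidate positions."""
    p_correct = 1.0 / n_equally_good_hits
    p_wrong = 1.0 - p_correct
    if p_wrong == 0.0:
        return 60                      # unique hit: capped high confidence
    return round(-10 * math.log10(p_wrong))

print(mapping_quality(1))    # 60 -> unique placement
print(mapping_quality(2))    # 3  -> two equal hits: near zero
print(mapping_quality(100))  # 0  -> a hundred HLA-like haplotype hits
```

This is exactly why a hundred alternate HLA haplotypes would zero out every read in the region if the aligner were not told which contigs are alternates of the primary assembly.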
unidirectional read at this stage, or do you get both directions? [Answer] You get one read per strand. If you take this cluster, you can imagine that it will be a mix of molecules: some with the pink adapter down and the purple adapter up, and others with the purple adapter down and the pink adapter up. When an adapter is at the bottom, on the surface, you cannot use it for sequencing; sequencing always starts from the top. So you can picture the cluster as a group of forward molecules and a group of reverse molecules. The first read starts from the purple adapter: we bring in the primer complementary to the purple adapter, and all the molecules with the purple adapter on top start to sequence, because that adapter is available and its complement hybridizes to it. We sequence to the end of read one, after a given number of cycles, say 100; then everything is wiped, we bring in the primer complementary to the pink adapter, all the molecules with the pink adapter on top start to sequence, and you get the second read. [Comment] OK, that's very nice. [Answer] And that is why your two reads will be on different strands.
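The consequence of sequencing read one and read two off opposite adapters can be sketched on a toy fragment. This is a simplified model (adapters omitted): read one is the first bases of the fragment's top strand, and read two is the reverse complement of the fragment's last bases, which is why the mates land on opposite strands when mapped:

```python
# Sketch: why paired-end mates come from opposite strands of the fragment.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

def paired_end(fragment, read_len):
    read1 = fragment[:read_len]               # sequenced off one adapter
    read2 = revcomp(fragment[-read_len:])     # sequenced off the other
    return read1, read2

r1, r2 = paired_end("ACGTTTGCAAACGGGT", 6)
print(r1)  # ACGTTT
print(r2)  # ACCCGT -> reverse complement of the fragment's last 6 bases
```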