Hi everyone, so I'm Mathieu Bourget. I will be teaching module three, which is about genome alignment in this workshop. As a reminder, all the content of the workshop is under a Creative Commons license, which means that you can share, reuse and modify it, as long as the content stays under the same license and as long as you cite the people who have been involved in creating it.

So today we'll talk about genome alignment. The main idea of this topic is to teach you how we do alignment: what terminology we use when we do alignment, so you become more familiar with the files we generate. The idea is to go from a FASTQ file, which is what we get from the sequencer, to a high-quality BAM file, which will be ready for variant analysis. It's important in cancer analysis that you understand that there are cancer-related challenges that will impact what we are doing, especially how we align the data. All of these objectives are linked to how you can run the first steps of a tumor-pair pipeline, that is, the first steps of DNA-seq analysis of cancer samples.

My talk will be divided into four main parts. First, I will give you a small introduction about why we look at variants in cancer and what the main challenges are. Then I will present the pipeline we have developed, in a summarized way, and then we will go through this pipeline step by step. And after the conclusion, in the practical, we will do the same steps again, but this time you will really do them on data.

So let's start. When we work in cancer, we are mainly interested in discovering the abnormalities that we observe in a cancer sample compared to normal, healthy cells. What are these abnormalities? They are mostly genetic abnormalities; there are also epigenetic ones, but we will not cover those in this talk. What people also want to see when they study cancer is how these abnormalities evolve over the history of the tumor. For these two main points, it is really important to have an accurate way to detect genetic abnormalities, that is, genetic variants. That's why, when we analyze this type of data, it's really important to have good variant calling; but to do good variant calling, you need good mapping and alignment of your data, otherwise you will bias your results and create technical artifacts in what you observe after variant calling. You also want to have the best possible set of calls, in order to see which pathways and which biological functions are affected by the variants you will be calling.

So what are the sources of problems specific to cancer samples when we want to do variant calling? One of the first sources is clonality: when you take a sample to analyze, you take a slice of the tumor at a given time point of its history. If I take a slice here, I will mostly see variants that come from the light blue clone, and I will almost not be able to see variants that come from the dark blue clone. If I take a slice later in the tumor's evolution, the clone that was light blue will be almost gone, and I will only see the dark blue one, not the purple or the pink ones.
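As a quick numerical illustration of how this clonality effect, together with the sample purity discussed just below, shrinks the evidence for a variant, here is a back-of-the-envelope sketch in Python. The model and the numbers are simplified for illustration only; for instance, it ignores copy number changes, which real callers have to account for.

```python
# Back-of-the-envelope expected variant allele fraction (VAF) for a somatic,
# heterozygous SNV in a diploid region (illustrative assumptions only).

def expected_vaf(purity: float, clone_fraction: float) -> float:
    """purity: fraction of cells in the sample that are tumor cells.
    clone_fraction: fraction of tumor cells carrying the variant.
    A heterozygous variant sits on 1 of 2 alleles, hence the / 2."""
    return purity * clone_fraction / 2

# A germline heterozygous variant gives ~50% of reads as evidence:
print(expected_vaf(1.0, 1.0))   # 0.5
# 60% pure tumor, variant in a subclone present in 40% of tumor cells:
print(expected_vaf(0.6, 0.4))   # 0.12 -> only ~12% of reads support the variant
```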
So the fact that the tumor is not one homogeneous set of cells, but a mix of clones, causes trouble: we are no longer in the traditional setting for calling variants, because the amount of evidence available to call a variant depends on how the different clones are represented in your sample. That's what we call clonality, and it's a factor we need to take into account when we do variant analysis in cancer.

Another challenge we need to take into account is the purity of the data. When you sample a tumor, there is always a mix of tumor and normal cells; it is never 100% tumor cells in your sample. So you need to be able to go into your data and estimate the proportion of tumor cells, because purity will also reduce the amount of evidence you will see when doing variant calling.

Another source of problems, which is more general and not specific to tumors, is mapping and sequencing error. This is also increased in tumors, because the quality of the sample is usually a bit below what you get when you sample blood or other types of tissue, and that reinforces the amount of sequencing and mapping errors you will face. We usually see these errors in repeats, homopolymers and so on. As you can see in this example, when you have this kind of highly repetitive region, the mapping quality that we can observe in IGV drops drastically, which can make it harder to call variants.

It's also important to understand that when we talk about abnormalities in cancer, we are usually looking for heterozygous variants, because the interesting tumor-related variants are usually new variants that affect positions which are not already heterozygous: usually a homozygous-reference position that becomes heterozygous. That's why it's complicated, and why the amount of evidence needed to call a variant has a real impact: when you look for homozygous variants, the number of supporting reads is always large, but when you look for heterozygous variants, at most around half of your reads will support the variant. If you now add in clonality, purity and mapping issues, this drastically reduces the amount of evidence. That is really the main point and the main challenge when we do cancer analysis.

To address that, we have been developing a pipeline for many years now, because we have worked on several large-scale cancer projects. The latest version of our pipeline was built for the project called PROFYLE, where we analyze children, adolescents and young adults whose cancers are hard to treat. We are in contact with oncologists, and the samples come from patients who are resistant to standard treatment. The idea of this pipeline is to collect DNA from the tumor and from the blood, and to collect RNA; I will not cover the RNA part today.

In this pipeline, we have two approaches. The first is a fast approach where we focus only on regions of the genome containing genes known to be potentially involved in cancer development. The goal there is, within 24 hours of receiving the sample from the sequencer, to send a report to MD experts who will try to find actionable targets to adapt the therapy of the patient.
On the other hand, we do a full analysis of the sample in order to have a much larger view of all the variation and abnormalities in the tumor, and we do that more for research purposes. Today I will focus on that part, the full genome analysis.

So here is a simplified workflow of the pipeline. The idea is to start from FASTQ files, generate high-quality FASTQ files, do the alignment, get a high-quality alignment, then do the variant calling, then filter, then annotate. This is the full pipeline we are using. Today we will focus only on the first part, this square, where we go from FASTQ files to a high-quality alignment file. And to give you a better idea, this is the real workflow of the pipeline; it has a bit more context than what I showed you on the previous slide, and we will do its first eight steps. We will now look at them in more detail.

Just to let you know: when you do this kind of analysis, you need a real computing infrastructure, because we are talking about large files with hundreds of millions of reads. You cannot do that on your own computer. You either need access to an HPC cluster or to cloud services, but don't try to make it work on your laptop. You also need a large storage system, because the files are usually really large. As you can see, the final BAM can run from 250 to 500 gigabytes per sample, and you will usually work in paired mode where, for each case, you have a normal and a tumor sample, so it can take a really large amount of space. For example, a couple of years ago, when we did a kids' cancer project, we analyzed 100 tumor/normal pairs and our final storage was around 80 terabytes; during the processing, because each step generates temporary files, we needed between 200 and 400 terabytes. This is really mainly to show you that you should not try to do this on your own computer. You need access to high-performance computing services, and if you don't have that access, the first thing to do, before even thinking about the analysis, is to get it. In terms of time, our pipeline takes approximately 36 hours to process each sample.

As I said previously, the main point of this module is to show how we go from FASTQ files, the input files you will get from your sequencing center, to a high-quality alignment ready for variant calling. So let's talk about FASTQ files. FASTQ files are, as I just said, the files you usually receive from your sequencer or your sequencing center. These days, when you do an analysis, you will usually do paired-end sequencing, which means you take the DNA fragments and sequence the two extremities of each fragment, so you get two reads per biological fragment. For this type of analysis, you will receive two sets of files: one set containing all the reads that come from one end, and one containing all the reads that come from the other end of the DNA fragments. These two files are synchronized, so the order in which the reads are placed is the same in both files; you can easily tell that the first read in file one corresponds to the read from the same molecule in the read-two file. And in all these files, the structure is the same.
Each read that has been generated takes four lines in the file. The first line corresponds to the header, which gives you the name of the read; that's this part here. Usually the name of the read is made of the name of the sequencer plus other kinds of technical naming, and this part here gives you the position of the read on the flow cell. Then, it's not mandatory, but some sequencers or some pipelines will add, at the end, whether the read is read one or read two. The second line gives you the actual sequence the sequencer generated for this read. The third line is a plus sign, or a plus sign followed by another header, which is similar to the first one except that the @ is changed into a plus. The reason we have this third line is that, originally, each sequence was given in two files, one for the sequence and one for the quality, but a long time ago it was decided to put everything together in one file; the two formats were merged into one, which is why there is a remaining second header here. If you look at old data you will see this second header, but all recent data will just have a plus sign. And the last line, which is a weird-looking line with a mix of symbols (you can see here it could be numbers, symbols or letters), gives you the quality of each base.

You might ask how we can get a quality from that. Quality is measured as a number, but we cannot put numbers here, because we need exactly one character per base: if I wrote, for example, 40 then 30, I would get 4030, and I could not tell whether the first base has quality 4, 40 or 403. So, to have one character encoding the quality, we use ASCII characters, which can be translated into numbers. If you take each character, you can translate it into a number, and that gives you the value of the quality that was measured.

When we talk about base quality, as I said, it's a number: a Phred score. The Phred score is a type of measurement you will see frequently in genomics analysis. A Phred score is minus 10 times the log10 of a probability, and which probability depends on which value you are looking at. In the case of base quality, it is minus 10 times the log10 of the probability that the base that was called was wrongly called; so it's the probability that there is an error in your base call. To give you an example, a base quality of 20 means there is a 1% chance of error, a 1% chance that the base I called is wrong. A base quality of 30 means 0.1%, and so on. All the quality values you will see, like base quality and mapping quality, are given as Phred scores using this same approach of minus 10 log10 of a probability.

Once we have these base qualities, the fourth line of each read in the FASTQ file, what we can do is look at the quality across your dataset. It is really important to understand that the data you get from the sequencer is not perfect; there are errors, there are issues in the data, and you need to check your data first and only keep the high-quality data for your analysis. What we usually do is generate plots like this one, where each box represents the distribution of the base quality at each cycle.
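Before looking at those plots, here is a minimal sketch of the encoding just described: it takes one FASTQ record and decodes the quality characters into Phred scores and error probabilities, assuming the Phred+33 ASCII offset used by recent Illumina data. The record itself is made up for illustration.

```python
# Decode a FASTQ quality string (Phred+33 encoding, standard for recent
# Illumina data) into Phred scores and error probabilities.

record = [
    "@SEQ_ID flowcell:lane:tile:x:y 1",  # line 1: header (@ + read name)
    "GATTACAGATTACA",                    # line 2: the called bases
    "+",                                 # line 3: separator (old files repeat the header)
    "IIIIIIIIII?5+!",                    # line 4: one quality character per base
]

for base, qchar in zip(record[1], record[3]):
    q = ord(qchar) - 33          # ASCII code minus offset -> Phred score
    p_error = 10 ** (-q / 10)    # Phred: Q = -10 * log10(P_error)
    print(f"{base}  Q={q:2d}  P(error)={p_error:.4f}")
```

For instance, the character I decodes to Q40 (0.01% error) and ! decodes to Q0, a base the sequencer essentially has no confidence in.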
You have probably seen how this data is generated: with Illumina, we sequence in cycles, and at each cycle every read gains one base. So we take all the reads at the first cycle and look at the overall quality across all the reads in your sample, then the next cycle, and so on. As you can see, the quality usually starts quite high, then the machine calibrates so the quality increases, and then the quality slowly decreases over the course of the run. This is due to what we call phasing in Illumina data: the more cycles you do, the more molecules incorporate a base early or late, which lowers the quality of the cluster signal and of the bases being called. So this is the typical pattern with short reads: the quality starts good, increases a little, and then slowly decreases towards the end of the read; sometimes the end of the read still has good quality, but sometimes it is at a really low level. It is important to look at this and to take it into account, because if you leave low-quality bases in, you will create false-positive variants.

Another QC we do is to look at the base composition at each cycle: each line represents a specific base (A, C, G, T), and you look across all your reads at the composition. This example is not the best one for this workshop, because it is based on RNA-seq data, which is why we see a weird pattern at the beginning, and then everything sits at 25%: across all reads we find 25% C, 25% G, and so on. You should not expect that in whole-genome data, because the genome is not 25% GC; it should match the GC content of your organism.

Another QC we do is to look for known sequences. When we prepare a library for sequencing, we add non-genomic sequence at the extremities of the fragments. We don't want that in our data, because we don't want to include sequence that could create artificial variation. We know all the adapter sequences used by Illumina and other sequencer vendors, so we can search for these sequences, estimate the amount of adapter in our reads, and, if we detect some, remove it.

We can also look for duplication. It is important to look for duplication because you don't want the same original molecule to be represented several times in your data; I will come back later to why this matters. Here we generate an estimation to be able to say whether the data is good or not. In this case, 14% would not be good, and I would probably call the genome centre to ask what they have done, because that is a really high number of duplicates.

Another QC we usually do is to take a small number of reads, say a hundred thousand, BLAST them against the nr database, and look at which organisms they match. If you are working on human and you get a result saying that most of your reads match mouse, that's problematic: you don't have what you expected in your sample, and you can then check what you did during library preparation, or what the genome centre did, and whether there was a mix-up in your data.

So, we have looked at the FASTQ and evaluated the quality; most of the time, as I said, the sequencer is not perfect.
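Coming back to those per-cycle quality plots for a second: the summary behind them can be computed with one pass over the quality lines, along these lines. This is a simplified sketch of the idea, not the actual FastQC code, which also tracks medians and quartiles; the file name is hypothetical.

```python
# Per-cycle base quality summary, the idea behind FastQC-style box plots
# (simplified sketch; real QC tools also compute medians and quartiles).
from collections import defaultdict

def mean_quality_per_cycle(fastq_path: str) -> dict[int, float]:
    totals, counts = defaultdict(int), defaultdict(int)
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:                         # every 4th line is the quality line
                for cycle, qchar in enumerate(line.rstrip("\n")):
                    totals[cycle] += ord(qchar) - 33   # Phred+33 decoding
                    counts[cycle] += 1
    return {c: totals[c] / counts[c] for c in sorted(totals)}

# Example usage (hypothetical file name):
# for cycle, q in mean_quality_per_cycle("sample_R1.fastq").items():
#     print(cycle + 1, round(q, 1))
```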
So you will have to take some action on your FASTQ to remove the lower-quality regions of your reads; that is what we call trimming. The first step we do after checking quality is the trimming. How do we do that? There are many tools for this; we use Trimmomatic, and today we will use Trimmomatic.

First, as I said, we look for adapters, the known sequences at the extremities of the reads. When the size of the fragment is shorter than the read length you are sequencing, you will sequence through your fragment and then start sequencing the non-genomic sequence that was added during library preparation. As I said, we know these sequences, so the first thing we do in trimming is to look for them, and if we find them, we remove them from the read.

The second thing we do is keep only bases with high quality. For each read, after the adapters have been removed, we start from the last base of the read and look at its quality. If the quality is under a given threshold (in our case 30, that is, 0.1% error), we remove the base and move to the next one; if that base is still under 30, we remove it and move on, and so on, until we reach a base whose quality is above the threshold we set.

Sorry, on my slide we don't see the last point, but I will explain it. You can see that the first part removes part of your read, and the second part also removes some bases. So what we do then is look at the remaining length of the read, and if it is under a given threshold (we use something like 30 or 50, depending on the original read length), we discard the read, saying it is too short to be useful for us.

Excuse me, I didn't understand the necessity of removing those adapters. Why should we remove them? Because at the end of the read, here for example, you could have a few bases that are part of the adapter, and when the read is aligned, the genomic part will align, and the first couple of adapter bases might also get placed on the genome, looking like variants because they differ from the reference sequence. But that is expected: the real genomic sequence stops here, and after that it is a synthetic sequence we added. We don't want to keep that in the read, because it will appear as a difference from the reference, and what you want to identify in your analysis are the regions with real differences between the reference and your samples. So you would create fake variants with the presence of this sequence. Thank you. And I have another question: if you remove those bases with lower quality, can it cause a deletion, something like that? No, because we always remove data starting from the end of the read, so we never remove bases in the middle of the read. If you removed bases in the middle, without marking them as unknown bases, it could create a deletion, but that is not the case, as we only work from the end of the read inward until we stop trimming. It's the same thing for the adapters: you start from the end of the read, so the read keeps its biological integrity. Even if I go further in, because I removed a set of bases due to quality, I still have one continuous sequence that keeps its genomic integrity, and it will not cause an insertion or a deletion.
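Here is a simplified sketch of the trailing-quality step described above, using the same thresholds mentioned in the talk. It is not the actual Trimmomatic algorithm (Trimmomatic also offers sliding-window trimming, and adapter removal is a separate, match-based step); it just shows the walk from the 3' end plus the minimum-length filter.

```python
# Simplified 3'-end quality trimming, in the spirit of Trimmomatic's TRAILING
# step, followed by a minimum-length filter.

def trim_trailing(seq: str, quals: list[int], min_q: int = 30, min_len: int = 50):
    """Walk from the last base toward the start, dropping bases until one
    meets the quality threshold; discard the read if it ends up too short."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None                      # read too short to be useful
    return seq[:end], quals[:end]

# Example: the last three bases are low quality and get removed.
seq = "ACGTACGTAC"
quals = [38, 38, 37, 36, 35, 34, 33, 12, 9, 2]
print(trim_trailing(seq, quals, min_q=30, min_len=5))
# -> ('ACGTACG', [38, 38, 37, 36, 35, 34, 33])
```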
As I said, for the trimming we will use Trimmomatic. It's a choice, because we are used to it, but as with every step, many other tools exist and are really good at this. So if, in your future analyses, you prefer another tool over Trimmomatic, don't worry, that's totally fine.

Once we have done the cleaning, the next step is the alignment. When we do alignment, what we want is to find the best location for each of your reads. Why the best and not the true location? Because the genome is still not perfect. I know there is something new; they say there is now a really complete version, but I haven't had a look at that sequence yet. Until you use that, the genome is not perfect: there are missing parts, and there are repeated regions that are usually shortened or placed in only one location, which means the alignment will not be perfect. So you want to find the best location for your read.

The challenge is that your genome is a really long string of letters, 3 billion of them, and your reads are hundreds of millions of strings of 100 or 150 letters, and you need to place all these pieces together. It's a really big puzzle. And why is it complicated? Because you don't want to do only exact matching. You don't want to say: I will only keep the reads that exactly match my reference, because the differences between your reference and your samples are exactly what may be related to cancer. So you really need to tolerate inexact matching. When you do that, you will let in biological variants, but you will also let in sequencing errors and other types of errors. The main challenge is to be permissive enough to let real biological variants in, but not too permissive, and then to be able to distinguish between errors and variants.

To do that, many algorithms have been developed. The one that is mostly used is the Burrows-Wheeler transform approach, which offers the best balance of speed, memory and accuracy. Sure, something like BLAST would be better in terms of accuracy, but try to BLAST hundreds of millions of sequences and you will see: you can launch your job, go on vacation for a couple of weeks or months, and when you come back, if you are lucky, your job will be complete. So the idea is to find an equilibrium: a result in hours, with good accuracy, and the Burrows-Wheeler approach has been shown to be the most performant. The mapper we will use is called BWA, but as for the trimmer, many other tools exist and many of them are good aligners.

When you do your alignment, you will generate a file called BAM or SAM. This file is used to store the alignments. The BAM file is the binary version, which is what we usually use, and the SAM file is a text file. These are really large files: as I said, a BAM file can be several hundred gigabytes, and a SAM file is even bigger, which is why most people use BAM; but SAM and BAM hold exactly the same information. Compared to the FASTQ file, the BAM file stores an alignment, one read end per line: where the FASTQ file uses four lines, here we use one line. So here is the format used in the BAM file; let's see how this line is built.
The first column, this first part, gives you the read name, the same one you have seen before. Then you have a flag, which describes how the alignment went in terms of specific physical properties: for example, are my read 1 and read 2 properly paired and on the same chromosome? Is my read actually paired? Are the two reads on the same strand or on different strands? All this kind of physical description of your alignment. Then you have two columns: the first tells you which chromosome the read is aligned to, and the second gives you the position on that chromosome of the first base of the read. Then you have a quality value; it's a Phred score, as before, minus 10 log10 of a probability, in this case the probability that the read is not mapped correctly. You can see here that 60 is a really high mapping quality.

Then you have another column that describes how your read is physically mapped: that is what we call the CIGAR string. In this case, it tells me there are 76 matches, which means my read was 76 bases long and all the bases were placed on the genome. Careful: a match here means that the base was positioned on the genome; it does not mean the base is the same as in the reference. It means this base of the read sits in front of that base of the reference; it does not check at all whether the bases are identical. The information it does give you is about insertions and deletions: it will tell you, for example, I have a match for 30 bases, then a deletion, and then a match for the remaining bases.

Then you have two other columns that describe the mate, the second read (read 2 if you are working with read 1, or read 1 if you are working with read 2): they tell you the position of the mate's mapping. If you see an equals sign, it means the mate is aligned on the same chromosome; if there is another name, the mate is aligned on the chromosome corresponding to that name; and then it gives the position of the first base of the mate's alignment. The next field is the insert size, the distance between the two extremities of your read pair. Then you have the sequence, in bases, and then the base qualities.

After that, there are other fields that are aligner-dependent, which is why I will not go into detail on them: depending on which aligner you use and what you have done, you will see different fields, but all the fields are described in the header of the BAM file.

Question: so there is no mapping information for the second read here? You don't have any CIGAR string for it? No, the CIGAR string corresponds to the mapping of this read; you do have the coordinates for the second read, but the CIGAR string does not look at the second read at all. Later in the file you will find the second read: another line with the same read name, possibly with a /1 or /2 at the end, and there you will have the information for the second read, with another flag describing how that one was mapped, and its own quality and CIGAR string. Am I clear? Yes, thank you.
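To tie these columns together, here is a minimal sketch that splits one SAM line into its named mandatory fields and walks the CIGAR string. The alignment line itself is invented for illustration; the field names are the mandatory SAM columns.

```python
# Parse the mandatory columns of one SAM alignment line and walk its CIGAR.
import re

sam_line = ("read123\t99\tchr1\t10468\t60\t30M2D46M\t=\t10550\t158\t"
            + "A" * 76 + "\t" + "I" * 76)

fields = sam_line.split("\t")
names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
         "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
aln = dict(zip(names, fields))
print(aln["RNAME"], aln["POS"], "MAPQ =", aln["MAPQ"])

# CIGAR: pairs of (length, operation). M = aligned (match OR mismatch!),
# I = insertion vs the reference, D = deletion, S = soft-clipped, etc.
for length, op in re.findall(r"(\d+)([MIDNSHP=X])", aln["CIGAR"]):
    print(op, length)   # -> M 30, D 2, M 46: 76 read bases around a 2bp deletion
```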
What was the 119 again? Was that the difference between those two? Which one, the 119 at the end? This one: it's the insert size, the distance between mate 1 and mate 2. So shouldn't we get 119 by subtracting the reference position from the mate's position, subtracting those two numbers? No, not exactly, because those columns give you the first base of read 1 and the first base of read 2, whereas the insert size goes from the first base of read 1 to the last base of read 2. That's why it is not directly the difference between the two numbers, although it can be pretty close. I was going to ask what these values should be; they should reflect mapping against a normal genome, and in a cancer genome, of course, all bets are off. Yes, and we'll look at those later.

Now, a small note on the alignment. It's not always the case now that we have sequencers with really high throughput, but in many cases, and especially in cancer where we want really high coverage for the tumor, it happens that you will generate one library from your tumor sample and sequence it multiple times to increase the coverage of your data. When you do that, it is really important to align each lane, each output of the sequencer for this library, separately, and to add a tag called a read group. Why should we do that? First, to be faster: if I receive three different file sets, because I sequenced my library three times, aligning them in parallel saves time. But what is really more important is to track where each read comes from. When you merge everything together to get your full, final BAM, it's important that, if you see something unusual in your reads, you are able to trace the unusual reads back to where they came from: do they come from one specific read set you received from the sequencer, or from all the different read sets? If it's only one read set, it's probably an artificial pattern; if the weird pattern shows up in every read set you got from the sequencer, then you can say it is more likely biological. So it's really important to be able to track whether the evidence for what you have discovered is found in all of your read sets; otherwise, it may be a false-positive observation you are making.
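Concretely, a read group is just an @RG line in the BAM header, with each read carrying a tag that points back to it. Here is a sketch of assembling such lines, one per lane of the same library; all IDs and names are hypothetical, while ID, SM, LB, PU and PL are standard SAM read-group tags.

```python
# Build SAM @RG header lines, one per lane/run of the same library, so that
# every read can be traced back to the sequencing run it came from.
# (All IDs below are hypothetical examples.)

def make_read_group(rg_id: str, sample: str, library: str,
                    platform_unit: str, platform: str = "ILLUMINA") -> str:
    tags = {"ID": rg_id, "SM": sample, "LB": library,
            "PU": platform_unit, "PL": platform}
    return "@RG\t" + "\t".join(f"{k}:{v}" for k, v in tags.items())

# One @RG per lane: same sample (SM) and library (LB), different ID and PU.
for lane in (1, 2, 3):
    print(make_read_group(f"run42_L{lane}", "patient01_tumor",
                          "lib_tumor_A", f"FLOWCELLXX.{lane}"))
```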
Another important reason for doing that is that most of the tools you will use for your analysis now require read groups in your file.

After this step, we have generated an alignment, but just like the sequencer, the aligner, as I said, is not perfect: we try to combine speed and accuracy, so the alignment will not be perfect. The main thing we need to do next is to refine the alignment to make it better, again trying to reduce all the technical artifacts, all the errors, everything that could create a false variant in your data.

The first thing we do is indel realignment. Why do we do that? Because alignment is a kind of penalty game: having a mismatch, a difference between the reference and your read, costs a certain number of points, and having a gap costs another number of points, and in most alignment schemes a mismatch costs much less than a gap. So DNA aligners tend not to create gaps, and sometimes, in some regions, you will observe this kind of pattern: you see a difference between many reads and the reference, then another difference just a few bases after, and in that same region other reads show a gap. That is really suspicious, because you don't expect such a high density of variation in such a small space: in general you expect a variant roughly every kilobase in non-conserved regions, a bit less in conserved regions, but you don't expect three variants within ten bases. That is usually a sign that something went wrong in this region, and it happens because of the penalty game the aligner plays to find the best location.

Indel realignment goes over your genome and your alignment, tries to identify all the regions that look suspicious like that, and then realigns those regions using a different set of penalties for variants versus gaps, making the aligner more willing to open gaps. The goal is to fix the alignment of reads where false variants were created because a deletion was not included. So it's a really important step. Now, some people will tell me that maybe it's not useful to do it anymore because some variant callers do it themselves. That's true, but not all callers do it, so I think it's still good to do, because if you want to use a caller that does not do local realignment while calling, you will need to do it beforehand.

What is really important when we work in cancer, where we usually have a normal and a tumor, is to do it jointly, both samples together. In your data, the normal and the tumor come from the same individual, so they share, I would say, 99.9% of their variation, and therefore 99.9% of their indels. And sometimes, when you have a region with indels, it's harder to map, so you have less coverage; by giving the two BAM files together to the indel realignment tool (not merging them into one, just providing them together), the tool gets more information about the region and produces more accurate realignments. It's always good to do that, because when you have an indel in your BAM file, there are different ways to write it: you could put the indel here, or you could put it there; there are different ways to mark a deletion or an insertion in the BAM file. Giving the two BAMs together ensures that a deletion or insertion at a given position is marked the same way in both BAM files.
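To put rough numbers on that penalty game, here is a toy scoring comparison. The penalty values are in the spirit of BWA-MEM's defaults (match +1, mismatch -4, gap open -6, gap extension -1), but treat the whole thing as an illustration of the trade-off, not as what any specific aligner computes.

```python
# Toy alignment scoring, to show why an aligner can prefer a fake mismatch
# over opening a real gap (penalties in the spirit of BWA-MEM defaults).
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 1, -4, -6, -1

def score(read_len: int, mismatches: int = 0, gaps: tuple = ()) -> int:
    s = (read_len - mismatches) * MATCH + mismatches * MISMATCH
    for gap_len in gaps:
        s += GAP_OPEN + gap_len * GAP_EXTEND
    return s

# A 100bp read spanning a true 1bp deletion near its end:
print(score(100, gaps=(1,)))       # represent the deletion:  100 - 7 = 93
print(score(100, mismatches=1))    # hide it as one mismatch:  99 - 4 = 95
# The ungapped version scores higher, so the aligner reports a fake SNV;
# indel realignment revisits such regions with gap-friendlier penalties.
```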
The second refinement we do on the BAM file is marking, or removing, duplicates. I already talked about this a bit before. So what are duplicates? Duplicates are several representations, in your data, of the same initial DNA fragment. When you do whole-genome sequencing, you generate a lot of DNA fragments, millions of them, you sample them, and you want, as much as possible, only one representation of each fragment. That way, if you see the same difference from the reference several times in your sample, and all the reads come from different DNA fragments, it means the differences you see are real biological differences observed in your data. You don't want one fragment represented 20 times, because if there was an error when this fragment was created, you will see the same error 20 times, and what is really just repeated copies of one error will look like strong evidence. So you want one physical fragment to give one read. The idea of duplicate marking is to look for this duplication in the data and remove it. Mostly duplicates come from PCR, sometimes they are optical, but the ratio is, I would say, 99% PCR duplicates. There are many ways to detect them; the most efficient is the positional approach, and that is the one we will mostly use in the workshop.

Why is it important to look at duplicates? As I said, here is an example of what you can observe in your data. If I look at this position, I see 6 reads out of 8 supporting a difference between my reference and my sample, so I will probably call a variant there. Now, if I remove duplicates, I notice that all these reads are copies of the same initial fragment. If I remove all of them except one, keeping only one representation of each fragment, I now see the difference in only one read out of three, and it becomes much less likely that I will call a variant at this position. So this step is really important to reduce the number of false positives in your data.

The last refinement we do on the BAM is what is called base quality score recalibration. Why do we do that? Because the base quality is used for mapping, but it is also used for variant calling, and the different sequencers tend to inflate the base quality values in their reads. If you look at the data and compute metrics on where the real variants are, you can see common patterns that influence the reliability of base quality: the position in the read, the genomic context, technical covariates. The idea of base recalibration is to look at each base, see how it fits into this model of base quality variation, and reassign to the base what we expect to be a more realistic quality, in order to improve the base qualities for variant calling.

Once you have done that, you have a high-quality BAM file, ready for variant calling. I will not go further into what we do with the BAM file, because that is the next module, which will focus on variant calling; so I will stop there in the pipeline.
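To make the positional approach to duplicates concrete before we leave the BAM, here is a toy sketch of the idea. Real tools such as Picard MarkDuplicates key on both mates' unclipped 5' positions, orientation and library; this single-read version only shows the principle.

```python
# Toy positional duplicate detection: reads whose alignments start at the
# same (chromosome, position, strand) are considered copies of one original
# fragment; we keep the copy with the highest summed base quality.
from collections import defaultdict

reads = [  # (name, chrom, pos, strand, sum_of_base_qualities)
    ("r1", "chr1", 10468, "+", 2840),
    ("r2", "chr1", 10468, "+", 2790),   # duplicate of r1's fragment
    ("r3", "chr1", 10468, "+", 2910),   # duplicate, best quality -> kept
    ("r4", "chr1", 20031, "-", 2650),   # different fragment
]

groups = defaultdict(list)
for read in reads:
    name, chrom, pos, strand, qual = read
    groups[(chrom, pos, strand)].append(read)

for key, members in groups.items():
    best = max(members, key=lambda r: r[4])   # keep the highest total quality
    for r in members:
        print(r[0], "kept" if r is best else "marked as duplicate")
```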
But before stopping, I would like to stress that, all along the process, it is really, really important to collect metrics, because each time you see something unusual in your data, you will have to go back to your metrics files and check: do all my metrics look okay? If so, the unusual thing I am seeing may be a real, biologically unusual variant or phenomenon; or is it something due to all the processing I did on my reads? Each time you do a step, you make a choice: when you align, you make a choice about your penalties; when you trim, you make a choice about your base quality thresholds; when you refine the alignment, you make more choices. There is usually a cost to pay: some reads will be removed, some reads will be biased, and that will show up in the metrics. By looking at the metrics you might say: oh, I see that my insert size is really far too small, which is why I don't see, for example, insertions; or all my metrics show really bad alignment rates, which is why I don't see a lot of calls in some specific regions. So before judging an unusual phenomenon, always go back to the metrics to confirm that they are okay, and only then can you interpret the unusual observation as a variant.

In terms of metrics, we will see the different ones during the practical, but there is one tool we really like, called MultiQC, which is why I want to show it to you. We will not use it today in the practical, but it is a really efficient tool: you just run MultiQC on the folder where you ran your analysis, and it acts as a metrics aggregator that gathers all your metrics into one report, which makes it really easy to explore them afterwards. During the practical we will do it by looking directly in the files, but in general, when you have a real dataset with a lot of samples, using MultiQC is a real advantage.

To conclude: today I have tried to introduce the main challenges of cancer analysis, and to help you really understand the sources of artifacts. Some can be linked to the sample, and some of those are specific to cancer samples; some sources of error come from the technology, the mapping, the sequencer; and, as I touched on, some artifacts come from the analysis itself, which corresponds to what I told you: when you do something, you make a choice, and there is a cost to pay. It is really important to understand that you will have a lot of artifacts and a lot of errors in your data; you need to be aware of that, and you need to really understand how your data was generated and processed, in order to tell the difference between biology and artifact.

It is also important to keep in mind that cancers are complex. We talk most of the time about variants, and people think about single nucleotide variants, but there are also copy number alterations, and things like mutational burden. So it's really complex, and challenging in terms of analysis, but it's really fun, because once you get through all this complexity, you get a direct view of the impact of the variants, which is sometimes more difficult to see in, for example, rare disease, where you need a much larger dataset.
Here I focused mostly on the schema, on the design of using a tumor-normal pair for the analysis. This is what you do when you want to work on somatic variants, or when you want to look at LOH, loss of heterozygosity. But in cancer you may use different designs: comparing DNA to RNA, which we do when we want to validate gene fusions, or when we are looking at RNA editing or allele-specific expression; sometimes you do pre-/post-treatment comparisons when you want to measure the impact of a specific drug. All these challenges make things complex, and that is why, when we do the analysis, we need to use cancer-specific tools. One of the main take-home messages is: don't use standard tools, because otherwise they could give you really useless results. The second thing, and you will see it during the workshop, is that we tend to go more and more towards single-cell approaches in cancer: because of purity, clonality and heterogeneity, going down to the single-cell genomic level is a way to get around these cancer-specific phenomena. That is why you will see it during the workshop, and why, if you work in cancer genomics, you will probably be exposed to single-cell analysis.

That's all for my talk. I would like to thank the people from C3G who work with me on cancer, and the other collaborators who helped me develop the pipeline.