So today I will talk about genome alignment. To give you a one-phrase summary, it's how you place your reads over the genome, but we'll see in more detail what this means. What I will try to bring you as knowledge during this lecture: first, another small introduction to NGS technology, just a reminder of what we already saw this morning during Gerard's talk. Then I will try to make you understand the main characteristics of the data, the problems we have to face when we do genome alignment, and the possible sources of error, and also give you some more general knowledge about terminology and file formats. During the practical, we'll see how to run the first step of the DNA-seq pipeline, and tomorrow we'll see how to run the second and third steps of the pipeline.

So, as I said, I will start with the NGS technology introduction. I will try to be fast on that, because Gerard explained it really well this morning. The sequencing revolution happened because 10 to 20 years ago there was a batch of major genome-sequencing projects, and they took months or years depending on the size of the genome (years for the human genome) and cost a billion and a half dollars. Now, with the new technologies, for small genomes we can get the genome with a sufficient amount of coverage in hours, and it only costs thousands of dollars.

So what is next-generation sequencing? As Gerard explained, it started from the clone-based approach presented this morning: you took your fragment of DNA, you put it in a clone, then you put it in the old sequencer, and it gave you your sequence as a chromatogram. The problem is the number of reads you can do at a time: 96 sequences per run, and each run takes many hours, so it's really not suitable for large genomes. Next-generation sequencing works on the same idea, using a fluorescent code for each base to be able to retrieve the sequence, but the point is to read millions of sequences at a time. This is just a small part of an image of a flow cell: you see that on the flow cell, at each cycle, the machine takes a picture of this gigantic image of dots and tries to determine, for each cluster, which color that cluster shows at that cycle.

So how does it work? I will present here only the Illumina approach, which is called sequencing by synthesis. The main idea is to take your DNA and shear it to a given fragment size, so you know approximately the size of your fragments, and then add some specific adapters. When you have these adapters, you load the fragments onto your flow cell. On the flow cell you've got complementary adapters that will catch the ones you load, so you end up with DNA fragments linked to your flow cell. Then you do what we call bridge amplification: a neighboring adapter catches the free end of the DNA fragment, creating a bridge from one adapter to the other, and the amplification is done directly on the flow cell. After the amplification, you denature the double-stranded molecules and you have two copies, and so on, until you arrive at clusters of sequences that all represent exactly the same sequence: for one cluster of molecules, you have copies of the same initial DNA fragment that you put on the flow cell.
Then you do the sequencing. You start from the beginning of the fragment, you add fluorescent nucleotides, and at each cycle the fluorescence is captured in the same way for all the clusters at the same time. You end up with this kind of image, where for each cycle you can say, for a given cluster: I got a yellow dot, a blue dot, green, yellow, red, and from that you extract the sequence. It works really well and it's really efficient. There are some sources of error, as we discussed this morning, phasing and pre-phasing of the data, but we'll come back to that later.

There are actually five major players in sequencing: Life Technologies, which developed SOLiD and Ion Torrent, and Illumina, with the HiSeq and the MiSeq. Except for the MiSeq, which is more in between, these technologies do short-fragment sequencing, from 30 to 100-200 base pairs per read. With the MiSeq and the 454, we are in the medium read-size range, around 300 to 600: more like 300 for the MiSeq and 600 for the 454. What you have to know is that the 454 has now been discontinued; we still run some 454 sequencing at the center, but the technology will die during the next year. And we have the long-read technologies, with Pacific Biosciences, which developed the PacBio machines, and Oxford Nanopore, which developed the MinION and the GridION.

Just a quick comparison of the technologies. We talked about this this morning, but what is interesting to look at when you have your own project and try to decide which technology to use? First, the size of the reads, depending on what you are doing: if you want to study structural variants, you need long reads; if you want to do assembly, you need longer reads; if you want to do RNA-seq, or more generally classical applications, shorter reads may be enough. So depending on your experiment, you need to know the read size you want. Second, the type of error each technology makes. Long reads and medium reads generate more indels, and long reads especially generate indels around stretches of the same nucleotide: when you have a homopolymer of, say, 5 bases, sometimes you will read only 4 out of 5, or 6 instead of 5. Whereas for the short Illumina reads, you will mostly have substitutions, that is, mismatches. And what is the error rate? Really low error for the short reads, larger error for the long reads, but this is the single-pass error, what the raw sequence will give you. There are methods, as Gerard discussed this morning: if we do 2D or two-pass reads on Nanopore, the error decreases to around 3%, and the CCS approach from PacBio can also decrease the level of error, though it limits the maximum read size you can get. Then there are all the technical configurations, and as Gerard said this morning, all these numbers are probably already out of date, because the technology evolves really fast.

What is important is to think about advantages and disadvantages. The main advantages of Nanopore and PacBio are the size of the reads and the time to sequence. Nanopore is also attractive because it's really easy to use: you just need to open your device and load it. The main disadvantages are the low yield, the cost, the type of error, and the stability, although stability is improving more and more with Nanopore.
For Illumina, the main advantages are the high throughput, the lower cost, and the accuracy of the data. The main disadvantage, not so much for a center like us, is the cost of the equipment, and the fact that you cannot resolve certain types of variants due to the short read size.

Just to give you an example of what we can do with these machines, this is the set of machines we have at our center: three or four of each of these types of machine. For the machines that give you medium-range read lengths, what we used to do at the beginning was genomes smaller than what you would put on the high-throughput machines. We don't do that anymore; you still can, it's just not the best choice. Mainly we use this type of machine for amplicon sequencing in metagenomics and for validation. With the short-read, high-throughput machines, we do everything: whole exome, whole genome, whole transcriptome, ChIP-seq, whole-genome bisulfite, anything where you need full coverage of your genome and therefore the power of the high yield of these sequencers. With the long reads, what we do is small- and medium-genome de novo assembly, full-length isoform sequencing, targeted sequencing, epigenomics, and validation. And here, sorry, there was a problem with my slide: the 10x. The 10x is not really a sequencing technology, it's more a library-prep technology. What we do with it is a lot of haplotyping, especially for cancer, and whole-genome assembly, but what we really do a lot with the 10x Genomics at the center is single-cell RNA-seq analysis, because directly with the machine the cells are captured individually, and you also have barcodes to tag each molecule individually. So it's really efficient for single-cell analysis.

So now that we know a bit more about the technology, what are the main parameters to think about when you design your experiment? The type of application, the read length, the library type, the error profile, and barcoding, if you want to pool several samples in the same lane of sequencing. These are the main characteristics that need to drive your choice. When I meet people and discuss with them, they usually come with the cost as the major factor. To my mind, that is not the best way to design your experiment; but for sure, funds are not unlimited, so we still need to take it into account.

So now, what do we do with the data? When we do DNA-seq analysis, the idea is to go from the FASTQ files we get from the machine to a VCF. VCF is the Variant Call Format, the file where you will have your variants. So what do we have at the beginning? The FASTQ files. For each sample you will usually have two files, because you sequence one end of the fragment and the other end: each first read goes in one file and each second read in the other file. For each sample, depending on the coverage, you will end up with files that can go from 5 GB to 300 GB of data, so it's a massive amount of data. In your FASTQ, the data will be presented approximately like this. Each sequencer technology has its own flavor of the FASTQ format, but there are some rules you need to respect. The rule is that you will have four rows per sequence.
The first one starts with an @ sign and gives you the name, the identity, of your sequence. Usually it's built from the name of the run and the name of the machine, plus some positional information about where the cluster was taken on the flow cell, and usually at the end it tells you /1 if it is read one; for read two, the same ID will end with /2. The second line is the actual sequence called by the sequencer. Then you have a third line, which is a secondary header, starting with a plus sign; usually it's either empty or repeats the same ID. And then you have the quality of each base.

The quality, as you can see, is a numeric value that tells you the quality of each base call. The problem with writing plain numbers is that once a value needs more than one digit, you cannot tell the boundaries: if I write 32 for my first base, is it 3 for the first base and 2 for the second, or 32 for the first? So for the quality we encode the values: each numeric value corresponds to a specific ASCII character, and each character can be converted back to the numeric value it corresponds to. You won't have to do it yourself; every piece of software that takes this type of file as input will do the conversion for you.

So what does this quality mean? It's called the base quality. It's a numeric value Q, which is what we call a Phred score. A Phred score is Q = -10 * log10(P) for a given probability P; that is the definition of a Phred score. In the case of the base quality, the probability measured is the probability that the base that has been called is incorrect, the probability that the sequencer made an error. As it is -10 log10, the higher the numeric value, the lower the probability of error. So a base quality of 30 means a 0.1% chance of error.

This is how we usually represent base quality and look at our sequences: for each cycle of a sequencing run, the distribution of base qualities among all the bases, among all the millions of reads that we generate. As Gerard mentioned, the distribution usually starts high, reaches the top qualities the machine can deliver, and stays stable; then, as we progress along the read and accumulate cycles, we start to accumulate molecules that are out of phase with their cluster. The cluster becomes less well defined, and the quality starts to decrease. That's why at the end of the read we see a real drop in the overall distribution of the quality. Before going to the next step, it's important to look at the quality of your data, because if you have low quality, you are more likely to have errors in your reads, and therefore errors in your analysis, like false SNPs.
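To make the quality encoding concrete, here is a minimal Python sketch (not one of the tools of the practical, just an illustration) that decodes a Phred+33 quality string back into numeric scores and error probabilities; the quality string is made up for the example.

```python
# Decode a Phred+33-encoded quality line (the standard in modern Illumina FASTQ).
# Q = ord(character) - 33, and P(error) = 10 ** (-Q / 10).

qual_string = "IIIIHHHFFF##"   # hypothetical quality line from a FASTQ record

quals = [ord(c) - 33 for c in qual_string]
probs = [10 ** (-q / 10) for q in quals]

for base_index, (q, p) in enumerate(zip(quals, probs)):
    print(f"base {base_index}: Q={q} -> P(error)={p:.4f}")

# 'I' is Q40 -> P=0.0001; Q30 -> P=0.001; '#' is Q2 -> P~0.63, essentially no information.
```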
Another type of QC we do on the FASTQ file is to look at the positional base content: along your reads, what is the content of each base position? Each color here represents one base, and if the fragmentation were totally random, across all my reads I should have 25% of each. Do you think this plot is fine? Yes or no? Here it's fine, because it's a graph for RNA data: in genes, the base content is around 25% for each base. But for whole genome, I tell you, if you see 25, 25, 25, 25, there is probably an issue, because the bases are not uniformly distributed in the genome. The GC content is around, I think, 60%, or the opposite; so you should expect one pair of curves around 30 and the other around 20, two bands, if you do whole genome.

Another type of QC: do you find specific adapters, or other specific known sequences, in your data? And what is the amount of duplicate reads in your data? Yes? "Why do we see this pattern at the start of the reads?" Here, it's just a common pattern you see when you do RNA: because of the way the RNA library preparation is done with Illumina, the way the fragments are generated is not fully random, and that's why we observe this kind of pattern, with specific positions where the fragments preferentially start. DNA would be flat; well, not flat, but this type of pattern would be flat, with two lines around 30 and two lines around 20. Sorry, I don't get your question. "Is it related to the adapter structure?" What I can say is that this is what Illumina explained on their forums about this pattern. Some people keep these bases, but if you want to do, for example, SNV calling on RNA, here you are really more error-prone because of this specific pattern, so a lot of people doing SNV calling will just remove the first 12 bases of every read.

Another type of QC we usually do on the data is to randomly extract a given number of reads, here it was 1000 reads, blast them against the NR database, and look at what the top hits are, to be sure that it corresponds to what we were supposed to sequence. It's also a good way to flag contamination from other species.

So, we have our FASTQ, and we can assess the quality. When we have this kind of pattern, when we don't have a perfect quality profile, what we need to do is remove the data with lower quality, or the data that does not come from the original molecule. So what do we do? Trimming. When we do trimming, we usually apply three steps, three actions. The first thing to do is to remove the adapters. You have adapters at each end of your DNA fragment, and because the size selection cannot give you exactly one size, it's an interval of sizes, you will have some short fragments for sure. With these short fragments, the sequencing can run past the end of the DNA insert and start to sequence the adapter; if you catch adapter sequence, you need to remove it. The second type of cleaning: we start from the 3' end of the read and check the quality of each base, and if the quality is below a given threshold (it's your choice which threshold to apply; depending on your experiment, we usually use 30 or 20, depending on whether we want to be more stringent or less), we remove the base, and we keep trimming until the quality goes back above the threshold. And the third action: once we have applied these two filters, if the size of the remaining read is lower than a given size, we say, okay, this is not a read that I can use efficiently in my analysis, and we drop it.
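To illustrate steps two and three, here is a rough Python sketch of 3'-end quality trimming followed by the length filter, assuming Phred+33 qualities; the thresholds and the example read are invented, and real trimmers like Trimmomatic do this far more carefully (adapter matching, the first step, is not shown).

```python
def quality_trim_3prime(seq, qual, q_threshold=20):
    """Trim bases from the 3' end while their Phred+33 quality is below the threshold."""
    end = len(seq)
    while end > 0 and (ord(qual[end - 1]) - 33) < q_threshold:
        end -= 1
    return seq[:end], qual[:end]

def keep_read(seq, min_length=36):
    """Length filter: drop reads that became too short after trimming."""
    return len(seq) >= min_length

# Hypothetical read: 36 good bases, then a low-quality tail ('#' means Q2).
seq  = "ACGT" * 10
qual = "I" * 36 + "####"

trimmed_seq, trimmed_qual = quality_trim_3prime(seq, qual)
print(len(trimmed_seq))                              # 36 bases remain
print("keep" if keep_read(trimmed_seq) else "drop")  # kept: still long enough
```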
So just a quick comment: when you do this type of cleaning, it's important to start the trimming with the adapters. Trimming adapters is a sequence recognition: you have the sequence of your adapter, so if you start with the quality trimming, you will begin by removing bases that belong to the adapter, and it will be harder to find the real adapter afterwards. So it's important to do it in this order: adapters, then quality, then length.

Yes? "What do you do when one cycle in the middle has bad quality?" When we observe this kind of pattern, it's really rare, but it sometimes occurs. For example, about a year and a half ago we had a temperature problem in the room of the sequencer: the temperature went up, so as an emergency we stopped everything, brought the room back to the right temperature, and restarted, and we had one cycle where the quality went down. In these cases, what we do, with specific tools, is say: don't take that cycle into account, and turn every call at that cycle into an N.

"When you say that you trim, for example from this point here, do you do that for all the reads that the machine generated, or only those reads where the quality was bad?" Only on the reads where the quality is bad. Also, one thing: you have an option here. When I do it, I usually do it base by base, but you can use a sliding window: you take, for example, the five last bases, and if the mean quality over these five bases is lower than the threshold, then you cut them.

"Is that because of the spread of the distribution? Some of your reads will have good sequence until their very end." Yeah, but usually you cannot count on that, because in most cases, if the quality is going down, it means your reads are out of phase; a lot of sequences in the cluster are out of phase, and they cannot be re-phased at the next cycle. So usually, when your quality starts to decrease, it will not get better after a few cycles; it will just get worse and worse. "But it's different for every read?" Yeah, it's different for every read. I think there was a question there.

"Since the adapter is at the end of the read, where the quality is low, how can the trimmer still recognize it?" The recognition is not a perfect-match search, so you tolerate some errors. And even when you have low quality, you don't expect every base to be an error; the error rate is still only on the order of a few percent, so you don't expect that many errors at the end. I think you had another question? No? Okay.

And now, to tell you the truth about the way we work now: when we have good quality over the whole sequence, which is the norm these days, with a profile that stays around quality 30 at the end, we don't do anything, because we use an aligner that is able to catch the adapter and soft-clip it: it marks that part of the read as non-genomic. "Which aligner is that?" BWA-MEM, the one we will use during the practical.

"Why is the quality lower at the very beginning of the read?" It's just that during the first cycles the machine needs to calibrate, and especially with the older flow cells, it needs to define the clusters. With the new patterned flow cells, each cluster position is already defined on the flow cell: the machine knows where each cluster is from the beginning, and the quality starts higher and is almost flat.
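Since the sliding-window option came up in the questions just above, here is the same idea sketched in Python: scanning from the 3' end, cut while the mean quality of the last window is below the threshold. The window size and threshold are arbitrary example values.

```python
def sliding_window_trim(seq, qual, window=5, q_threshold=20):
    """Scan from the 3' end; shrink the read while the mean Phred+33 quality
    of the last `window` bases is below the threshold."""
    quals = [ord(c) - 33 for c in qual]
    end = len(seq)
    while end >= window:
        mean_q = sum(quals[end - window:end]) / window
        if mean_q >= q_threshold:
            break          # this window is good enough: stop trimming here
        end -= 1           # slide the window one base toward the 5' end
    return seq[:end], qual[:end]

trimmed, _ = sliding_window_trim("ACGTACGTACGT", "IIIIIII#II##")
print(trimmed)  # -> 'ACGTACGTACG': isolated Q2 bases survive when the window mean stays high
```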
"What is a soft clip?" So, soft clipping: when you align your reads, the aligner can do many things; among other things, it can break your read into different pieces. It can break the read and say: for this piece, I don't find any place for it anywhere in the genome, so probably it's not a genomic region. Instead of leaving it as part of the alignment, it just marks that part of the sequence, say the last 10 bases, as non-genomic: don't use it in the subsequent analyses, when looking for variants or anything else. It's a kind of additional information added to your alignment that says: these 10 bases, or whatever the number, don't use them, because they are not part of the genome. With soft clipping, you keep the sequence in your read, you keep it in your file, but you say "don't use it". With hard clipping, you cut the sequence out.

There are many tools to do the trimming. We will use Trimmomatic, but many other tools are available, and they do the same kind of job with the same efficiency, so it's really a user preference. We use Trimmomatic because we have used it for a long time and because it is easy to handle: at the beginning, the Illumina Phred scores were not encoded the same way as today, and Trimmomatic was able to convert from one encoding to the other. There is really no rule of thumb about which is the best one.

So, when we have done the cleaning of the FASTQ and we have a good set of data, the next thing is the alignment. When you have your cleaned reads, you have two possible strategies. Either you do alignment, or mapping: the idea is to take your read and find the best location on your reference sequence. It's important to see that it is the best location: most of the time, best will mean the true location, but not always; it's the best location given the reference sequence. Or you can do assembly, an idea which will be developed in detail tomorrow: take all the reads and try to reconstruct contiguous sequences, contigs, and regenerate your own reference sequence. After that, when you have regenerated your own reference sequence, you can either use it as is, or remap, relocate, your reads over your new reference sequence. But I will not go further with assembly.

So, what is read mapping? Read mapping is a bit challenging, because you need to map millions of short reads to the genome, so there is really a lot of work to do. It's as if you had a puzzle with three billion places and millions of pieces to put in place: it's a real computational challenge. Also, many mapping locations can be possible for each piece, and you don't want to keep only exact matches, otherwise you won't be able to extract the variants; you need to be able to tolerate some errors in the matching. There are many algorithms, but the most used now, and the one that seems to perform the best, is the Burrows-Wheeler transform. It's the one implemented in the BWA mapper, the one we'll use, but many other tools exist. Why do we use BWA-MEM? Because it has been shown to be one of the best mappers in terms of accuracy. The one that would outperform BWA is Novoalign, and we usually don't use it because Novoalign has licensing, it's not totally free, and we don't want to push you to purchase a license, because BWA is good enough to do the work.
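To give a tiny intuition for the Burrows-Wheeler transform I just mentioned, here is a toy Python sketch that builds the BWT of a short string by sorting all its rotations. Real aligners like BWA build it via suffix arrays and add an FM-index on top to search reads efficiently; this is only to show what the transform itself is.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: sort all rotations of text + '$'
    and keep the last column. '$' is a sentinel that sorts before any base."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))  # -> 'GC$AAAC', the last column of the sorted rotation matrix
```

The transform groups identical characters coming from similar contexts together, which is what makes the resulting index both compressible and fast to search.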
There are other good aligners: for example, Bowtie is really good, especially when we do RNA and when you want to extract exact matches. And there are reviews of the aligners that pop up every year or two.

When we do the alignment, it's really important to align each lane separately, for several reasons. The first one is speed, and because when you do a large whole genome, you sometimes have two or three lanes of data per sample. It's less the case now that we have the HiSeq X, because you can get a really good amount of coverage with one lane, but if you still use a HiSeq 2500, for a whole genome at good coverage you will need several lanes for each sample. So you align your lanes separately, and at each round of alignment you add the RG tag, the read group. The RG tag is a way to track where your reads come from: you align the different lanes, each lane has a specific RG tag that is associated with the reads coming from it, and then you merge everything. If you then see a suspicious pattern in your final set of aligned reads, you are able to split your reads by lane using the RG tag, and to see whether the pattern is spread over every lane or is lane-specific. If it's spread over every lane, you can say: okay, that seems to be something biological. If it's in one lane only, you probably have an issue with that lane. It happened to us: we saw one sample with an increased number of mutations in genes, and when we split by lane, we found that one lane was exome and the other was whole genome, and that can make a difference. So it's important to set your read groups: to track errors, to differentiate the origin of the reads in the final BAM, but also because many tools will not work if you don't set the read group.

So, when you align, you end up with a file in the SAM or BAM format, as we saw this morning. SAM: Sequence Alignment/Map format. The SAM file is a text file; the BAM file is a binary version of the same file. Almost nobody works with the SAM file, because it's so huge; everybody works with the BAM file. And for each read, this is the type of data you have in the SAM file. There are many fields; I kept only the mandatory fields here, but you can have extra optional fields at the end of the line for each read. The mandatory fields are: the read name; a flag, which is a numeric value that you can use to describe the mapping, for example whether the read is unmapped, or is not a primary alignment, all these kinds of properties of your alignment; then your reference position, so the chromosome and the position; the mapping quality of your alignment, which is a Phred score, as we saw for the base quality, so it relates to the probability that your read is not mapped correctly; then you've got the CIGAR string. The CIGAR string is a representation of your alignment: here it tells you 76M, which means that 76 bases are aligned to the reference, and M can include mismatches, you can be aligned with M and still have a mismatch. So you can have M, you can have I for insertion, D for deletion, S for soft clip, H for hard clip, all these kinds of operations. Then you've got the position of the mate, if you have paired reads: the equals sign means that the mate is mapped on the same chromosome; if it's another chromosome, you will have the name of that chromosome and the position on it. Then the insert size of your fragment. Then you have your sequence, and then your base qualities.
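To make the FLAG and CIGAR fields concrete, here is a small Python sketch that decodes a few of the standard FLAG bits (they are defined in the SAM specification) and parses a CIGAR string. The SAM line itself is invented; in practice you would use a library such as pysam rather than splitting by hand.

```python
import re

# A made-up minimal SAM line with the mandatory tab-separated fields:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
sam_line = ("read1\t99\tchr1\t10468\t60\t40M2I30M4S\t=\t10650\t258\t"
            + "A" * 76 + "\t" + "I" * 76)

fields = sam_line.split("\t")
flag = int(fields[1])

# A few standard FLAG bits.
print("paired:             ", bool(flag & 0x1))
print("proper pair:        ", bool(flag & 0x2))
print("read unmapped:      ", bool(flag & 0x4))
print("reverse strand:     ", bool(flag & 0x10))
print("first in pair:      ", bool(flag & 0x40))
print("secondary alignment:", bool(flag & 0x100))

# Parse the CIGAR into (length, operation) pairs.
ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", fields[5])]
print(ops)  # -> [(40, 'M'), (2, 'I'), (30, 'M'), (4, 'S')]
```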
So now we have done the alignment. What is important is to refine the alignment, because every aligner has its strengths and weaknesses, and no aligner is perfect. So we take this alignment and apply a set of filters and rearrangements to make the alignment better and more suitable for the variant calling.

The first thing we do, which was already mentioned this morning, is to realign around indels. Why do we do that? Because aligners tend to favor creating mismatches instead of introducing a gap. Aligning a read is just a question of penalties and scores to position your read, and for almost all aligners, opening a gap costs a lot of penalty while a mismatch costs much less. So when you have a stretch of the same letter and there is an indel, it's sometimes easier, especially near the end of a read, to tolerate one, two, three mismatches instead of creating a gap. So when you see these patterns where SNPs accumulate within a really short region, it can be the sign of an indel, of a missed indel. There's a tool in the GATK that does exactly that, and we'll use it later on. And when you realign, when you create the indel, you see that most of those spurious variants disappear.

Another type of refinement we do is to mark duplicates. What is a duplicate? It's when you have different reads that represent the same initial DNA fragment of your library. You want to count only one read, or one pair of reads, per real DNA fragment. So where do these duplicates come from? They can come from the PCR: when you do your PCR, you can have uneven amplification of a fragment, and you end up with several copies of that fragment. You can also have what we call optical duplicates. These only apply to the old type of flow cell: when your flow cell is not well loaded, when you don't put enough DNA, some clusters become giant because of the massive amplification, and when the machine takes the picture, it considers them as two different clusters. So it will call them as two different clusters, and we call them optical duplicates because it's just the optics that failed. On the new patterned flow cells, this type of duplicate does not exist anymore, but there is another one: if you don't load your patterned flow cell enough, a long enough molecule can jump into an empty well. Because you have adapters there, if no molecule went there during loading, a long enough molecule will jump during the amplification and also amplify in the other well. And the sister case is when two wells are too close, which creates a sort of hybrid sequence, which will usually be called as one of the two neighbors. So how do we detect the duplicates? We can detect them before mapping, using a k-mer approach: that's the FASTQ-level duplicate estimate I mentioned earlier. But most people use an approach where we detect duplicates after mapping, by looking at where the two reads map: you can imagine that if two pairs come from the same fragment, the reads will map exactly at the same positions.
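Here is a deliberately simplified Python sketch of that position-based idea: group read pairs by mapping coordinates and orientation, and keep one pair per group, for example the one with the highest summed base quality, which is roughly what Picard MarkDuplicates does. The records are invented; real tools work on BAM files and also handle clipping and optical distances.

```python
from collections import defaultdict

# Hypothetical aligned pairs: (name, chrom, pos_read1, pos_read2, strand, summed_base_quality)
pairs = [
    ("frag1_copyA", "chr1", 10468, 10650, "+", 2850),
    ("frag1_copyB", "chr1", 10468, 10650, "+", 2790),  # duplicate of copyA
    ("frag2",       "chr1", 20117, 20301, "-", 2900),
]

# Same (chrom, positions, strand) means same original DNA fragment.
groups = defaultdict(list)
for p in pairs:
    groups[(p[1], p[2], p[3], p[4])].append(p)

for key, members in groups.items():
    members.sort(key=lambda p: p[5], reverse=True)   # best-quality pair first
    keep, duplicates = members[0], members[1:]
    print(f"{key}: keep {keep[0]}, mark {[d[0] for d in duplicates]} as duplicates")
```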
Another type of refinement we do on the BAM is the base quality recalibration. Why do we do that? Because when the sequencer calls the bases, it tends to inflate the base quality values, so we need to re-estimate the real values. The recalibration tries to model the score based on specific covariates, for example the cycle of the machine: in an ideal world, we would expect to see something flat for every cycle and for every type of read, but we know there are biases in everything. So we look at specific sequence contexts and technical contexts, and we re-establish the true value of the base quality. When you have done that, you have refined the quality of the alignment, and then you are ready to do the variant calling.

What is important, at each step of this process, is to look at your metrics, because that's where you will find out whether you have a specific issue with your data. So I really encourage you to collect metrics at each step of your process. Usually each tool provides its own set of metrics, so it's not super complicated to get them, but there are also specific tools you can use to compute additional metrics. There are tons and tons of metrics. The ones I really look at when I do a project analysis: what are my trimming values; what are my alignment rates; what coverage did I obtain, and does it fit what we were supposed to have; what are the insert sizes, and do they correspond to what the lab did. All these kinds of metrics are really important to be sure that your experiment is correct before doing the next step, the variant calling. The variant calling: tomorrow I will give you more detail about that, in modules four and five.

Today, to finish, just a general conclusion, not specifically on alignment but more about working with this type of data. If you want to work on NGS, you need to be really aware of the technology and the methods, because every technology is different and implies different assumptions and different types of sequence, so you really need a good knowledge of the technologies and of the methods that can be applied to them. You need to know the errors and the technical artefacts you will face with the technology you have chosen; otherwise you won't be able to understand your results, or to distinguish something that makes biological sense from something that doesn't. You need to have both mathematical and informatics skills, because when we do the analysis, we use HPC clusters, so we need a Unix environment; if you're working only on Windows, that's problematic. So you need good informatics skills, and good mathematics to understand what you're doing, because many of the tools do a lot of mathematics, and now more and more machine learning and this kind of statistical approach.
And now, the major challenge for us as bioinformaticians working on NGS data is not so much the methodology, although there are still methodological challenges, but the capacity of the compute and the storage, because when we do large projects with hundreds of patients, in cancer for example, it can take several hundred terabytes of data to store and to process. That is the major cost. So, that's it.