Hello everyone, my name is Franziska Bonath. I'm the host of today's Bytesize talk, and with me are Sara Monzón and Sarai Varona from the Institute of Health Carlos III (ISCIII). They are going to talk today about the nf-core/viralrecon pipeline: updates and use cases. Over to you.

Thank you very much. Hello everyone. We are really glad to be here today to talk about nf-core/viralrecon. This is the second talk about this pipeline; the first one was also a Bytesize talk, I think. We want to cover some updates and new functionality we've added to the pipeline over the last year and a half or so, and also some use cases from our lab in which viralrecon has been the main character.

We're going to start with a brief development roadmap, followed by the major functionality in the latest releases. For the use cases, we're going to talk about the RELECOV network, the genomic surveillance network for SARS-CoV-2 in Spain, where we use viralrecon for data analysis; then about a paper we participated in studying a long-term COVID patient; and finally a little bit about the work we are currently doing at the ISCIII studying the multi-country monkeypox outbreak, where we are also using viralrecon.

As a roadmap: the first release of the pipeline was in June 2020, although we started development around March. The second major release came a year later, in May 2021, when the whole pipeline was rewritten using the DSL2 implementation and a whole new branch was added for Nanopore data analysis. Pangolin and Nextclade were also included for SARS-CoV-2 lineage and clade assignment. Just a few months ago, in February, we released version 2.3, which includes fixes for some problems and decision-making around how the viral consensus is generated; we're going to talk about this functionality in more depth in the next slides. Currently we are at version 2.4.1.
These are the major functionalities we've added. As I just said, the Nanopore branch of the pipeline allows us to handle both Illumina reads and Nanopore reads with viralrecon. For Nanopore data, the ARTIC network pipeline is used: variant calling and consensus genome outputs are generated, and Nextclade and Pangolin are run over this consensus genome for clade and lineage assignment.

One of the main new features in version 2.3 is that the user can now decide which combination of variant caller and consensus-generation software to use. Until this version, iVar variants was always paired by default with iVar consensus for generating the consensus, but now you can combine them: you can use iVar for variant calling and BCFtools for the consensus, or the other way around. This gives the user more flexibility and more control over how the consensus is generated. This is one of the main functionalities, and it is important because it changes the output, that is, the way the consensus is generated, compared to previous versions: the default is now iVar variants as the variant caller and BCFtools consensus as the consensus generator.

We took this decision because of some iVar behaviour that may not be the desired one for this use case, and some known issues of iVar consensus that are not yet addressed by the software. For example, here we see that iVar includes low-frequency deletions. When we use viralrecon, we select a threshold for including variants in the consensus; the default is to include only variants with an allele frequency above 0.75. In this case, we see that even with this criterion, a deletion at 0.43 allele frequency is included in the consensus where the reference should appear instead. Here you can see the reference, the consensus generated by BCFtools, and the consensus generated by iVar.
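To make the inclusion rule just described concrete, here is a minimal sketch (not viralrecon's actual code) of filtering variants by the allele-frequency threshold before consensus generation. The records are simplified hypothetical `(pos, ref, alt, af)` tuples rather than real VCF lines:

```python
# Sketch of the consensus-inclusion rule: only variants whose allele
# frequency meets the threshold (default 0.75) should reach the consensus.
AF_THRESHOLD = 0.75

def variants_for_consensus(records, threshold=AF_THRESHOLD):
    """Keep only variants whose allele frequency is >= threshold."""
    return [rec for rec in records if rec[3] >= threshold]

calls = [
    (11074, "C", "T", 0.98),    # high-frequency SNP: kept
    (21990, "TTTA", "T", 0.43), # low-frequency deletion: must be dropped
]
kept = variants_for_consensus(calls)
# kept -> [(11074, "C", "T", 0.98)]
```

The 0.43 deletion is excluded here, which is exactly the behaviour the slide shows iVar consensus failing to honour.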
iVar has included a low-frequency deletion that shouldn't be there if you don't want it. Another known issue of iVar concerns the calculation of the depth of coverage at insertions and deletions. Here we have a low-frequency deletion, as in the previous example, and again the reference, the BCFtools consensus, and the iVar consensus. We see the low-frequency deletion, but an N, a masked position, is added even though we have enough coverage in this area. So this is an issue with the depth-of-coverage calculation that is avoided by using BCFtools consensus instead of iVar.

Another reason we selected BCFtools consensus is not an error of iVar consensus; it is just the criterion, the behaviour, that iVar uses to build the consensus, and it may well be the behaviour the user wants. This is the main difference between BCFtools consensus and iVar: whether you want to include variants reflecting intrahost variability in your consensus. For example, in this case we have two positions where iVar includes ambiguous nucleotides. This is because at these positions, in order to meet the 0.75 criterion, iVar needs to add two nucleotides together, and that is why it adds the ambiguous nucleotide; here A and G are the two nucleotides that together meet the criterion of more than 0.75. Alternatively, you may only want the majority, most representative, nucleotide. So it depends on whether you want all the intrahost-variability information in your consensus, or you prefer not to include that noise. iVar includes ambiguous nucleotides because its behaviour is to include majority alleles until it meets a certain criterion, while BCFtools consensus only includes variants above the allele-frequency threshold. The last issue is just another example of the previous one.
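The two consensus behaviours just contrasted can be sketched as follows. This is a simplified model for illustration only (the real tools work on full pileups, and the IUPAC table here is truncated to two-base codes): one function mimics the iVar-style rule of adding alleles until the threshold is met, emitting an ambiguity code when that takes more than one allele; the other takes the plain majority allele.

```python
# Simplified model of the two consensus behaviours discussed above for a
# position with intrahost variability. Two-base IUPAC ambiguity codes only.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S",
}

def consensus_base(allele_freqs, threshold=0.75):
    """iVar-style: a single allele if it reaches the threshold alone,
    otherwise the IUPAC code for the top alleles whose summed frequency does."""
    ranked = sorted(allele_freqs.items(), key=lambda kv: -kv[1])
    chosen, total = [], 0.0
    for base, freq in ranked:
        chosen.append(base)
        total += freq
        if total >= threshold:
            break
    if len(chosen) == 1:
        return chosen[0]
    return IUPAC[frozenset(chosen)]

def majority_base(allele_freqs):
    """Majority-style call: the most frequent allele, no ambiguity code."""
    return max(allele_freqs, key=allele_freqs.get)

site = {"A": 0.55, "G": 0.45}   # neither allele alone reaches 0.75
consensus_base(site)  # "R" (A or G) -- keeps the intrahost signal
majority_base(site)   # "A" -- drops the minor allele as noise
```

Whether "R" or "A" is the right answer is exactly the user decision the talk describes: keep the intrahost variability or keep the consensus clean.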
This is also a deletion among low-frequency variants. We see that iVar is including Ns, masking the sequence, instead of including the reference or the deletion, which can cause problems when you upload to GISAID, for example. This is an area that is well covered, but iVar only includes Ns instead of the nucleotides or the deletion.

Next, we are going to talk about two new functionalities in the script that converts the iVar output to VCF format: codon merging and strand bias. By codon merging we mean the following. When the variant of concern B.1.1.7 appeared, new complex variants appeared for SARS-CoV-2, and we realised that for these complex variants, which change the three nucleotides of a codon, the variant callers (iVar and all the others) reported the variant as three lines, as three different changes. This is a problem because you don't get the correct annotation: these three changes alter the codon entirely, so the amino acid changes completely, and if they sit on three separate lines the annotation cannot be correct, neither for iVar nor for SnpEff. So we created a function that goes position by position through the VCF file from iVar and checks whether positions are consecutive; when we find two or three consecutive positions, we check whether they belong to the same codon. If they do, as in this case where the REF codon is exactly the same for the three positions, we collapse the three lines into just one, so the REF carries the three alleles and the ALT carries the three alleles. With this, SnpEff annotates the amino-acid change correctly, fixing the issue. This is included in the iVar variants script.

The other one: as we all know, NGS reads are prone to certain biases, and strand bias is one of them.
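A simplified sketch of the codon-merging logic just described (the real script operates on iVar's VCF records; here variants are plain hypothetical `(pos, ref, alt)` tuples and the function name is illustrative): consecutive single-base changes that fall in the same codon are collapsed into one multi-nucleotide record so the annotator sees the full codon change.

```python
# Collapse consecutive single-base variants that share a codon into one
# multi-nucleotide variant, so SnpEff can annotate the amino-acid change
# correctly. Positions are 1-based; orf_start fixes the reading frame.
def merge_codon_variants(snps, orf_start):
    """snps: position-sorted list of (pos, ref, alt) single-base changes."""
    merged, i = [], 0
    while i < len(snps):
        group = [snps[i]]
        # Extend while the next position is consecutive AND in the same codon.
        while (i + 1 < len(snps)
               and snps[i + 1][0] == group[-1][0] + 1
               and (snps[i + 1][0] - orf_start) // 3
                   == (group[0][0] - orf_start) // 3):
            group.append(snps[i + 1])
            i += 1
        if len(group) == 1:
            merged.append(group[0])
        else:
            # One record: REF and ALT each carry all grouped alleles.
            merged.append((group[0][0],
                           "".join(g[1] for g in group),
                           "".join(g[2] for g in group)))
        i += 1
    return merged

# Three consecutive changes in one codon (a B.1.1.7-style complex variant)
# collapse into a single REF/ALT record:
calls = [(100, "G", "A"), (101, "A", "T"), (102, "T", "C")]
merge_codon_variants(calls, orf_start=100)  # [(100, "GAT", "ATC")]
```

Note the codon check: two positions can be consecutive yet straddle a codon boundary, in which case they must stay on separate lines.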
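The strand-bias annotation described next can be sketched as a Fisher's exact test on the 2x2 table of forward/reverse read counts for the reference and alternate alleles. This is an illustrative stdlib-only implementation, not the pipeline's actual script:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    total = comb(n, col1)
    p_obs = comb(row1, a) * comb(row2, c) / total
    p = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        px = comb(row1, x) * comb(row2, col1 - x) / total
        if px <= p_obs + 1e-12:
            p += px
    return p

def has_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev, alpha=0.05):
    """Flag a variant when strand support for REF vs ALT differs significantly."""
    return fisher_exact_2x2(ref_fwd, ref_rev, alt_fwd, alt_rev) < alpha

# ALT supported almost only by forward reads -> flagged as strand bias:
has_strand_bias(ref_fwd=50, ref_rev=48, alt_fwd=40, alt_rev=1)   # True
# Balanced strand support -> not flagged:
has_strand_bias(ref_fwd=50, ref_rev=48, alt_fwd=20, alt_rev=19)  # False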
Strand bias appears, for example, when a variant is supported only by forward-strand or only by reverse-strand reads, which normally makes it more probable that the variant is a false positive. Strand bias is usually annotated, and most variant callers do this nowadays, but iVar still lacks this functionality, so we added the annotation in the conversion of the iVar output to VCF. What we do is build a contingency table of the forward- and reverse-strand reads for the reference and alternate alleles, calculate a Fisher's exact test, and mark the position as having strand bias when the p-value is less than 0.05. This formula is taken from a GATK tutorial.

Finally, a new output for reporting variants is also included in version 2.3. It is really useful because it combines the data from the variant calling, the annotation, and the lineage assignment, and it provides a good way to study, for example, metagenomic SARS-CoV-2 data from sewage. It is really useful for variant inspection, for studying co-infections, et cetera. And now Sarai is going to talk to you about the use cases.

Yes, now I'm going to explain three use cases of viralrecon here at the Institute of Health Carlos III. The first one is the RELECOV network, which is funded by the HERA Incubator programme and is a Spanish network that aims to establish SARS-CoV-2 surveillance at the national level based on genomic sequencing. In this network, the microbiology labs of the hospitals select the samples to be sequenced, based on criteria established by the public health authorities, and sequence them. Then they send the FASTQ files to the RELECOV platform here at the ISCIII, and we analyse those samples with viralrecon. This way we are able to follow the national evolution of the viral variants and lineages.
We also share the genomic data with databases such as GISAID and ENA, and the idea is that we give support and information to the different labs inside the RELECOV network. As you can see in this schema, at least one group in each of the autonomous communities of Spain is included in the network, so altogether we are building national surveillance of SARS-CoV-2, and we will probably learn from this approach in order to extend it to other pathogens.

This is a general schema of how the samples are sequenced and analysed here at the Institute of Health Carlos III. After two days of sequencing, the samples are stored in a hard-disk cabinet and processed on a high-performance computing server here at the institute using viralrecon, and then the results are returned to the microbiology labs.

The second example is about an immunosuppressed woman who had prolonged viral replication. She was receiving immunochemotherapy, and six months after the last cycle she was admitted to hospital after a positive SARS-CoV-2 RT-PCR. Nine months later, after being discharged and readmitted, still RT-PCR positive for SARS-CoV-2 and receiving antiviral drugs and convalescent plasma, the woman died. Over 237 days we collected 12 samples for sequencing, and what we saw is that the last sample had accumulated 29 nucleotide mutations and 22 amino-acid mutations, analysed with viralrecon's mapping approach against the Wuhan reference genome. For this we used viralrecon version 1.2, in its development version.

Something interesting is that we used the long variant table that Sara explained to create these plots, where we selected the low-frequency variants to see how they changed over time in this patient. On the X axis we have the date of sample collection, on the Y axis the allele frequency, and each line and dot represents one variant in the sample.
When no dot is shown, it means that position didn't have enough coverage in that sample. In this example we can see the ORF1ab mutations, most of which are present in the non-structural protein 3. Something similar happens with the S gene, where most of the variants accumulate in the S1 region of the spike protein. We also found, inside this woman, one of the variants that was afterwards considered a mutation of concern of the Delta variant, at a time when the Delta variant wasn't circulating in Spain. Something else interesting we found by tracking these low-frequency variants is that we saw patterns of different viral subpopulations competing within the host. So we think there was intrahost mutation and competition between the virus subpopulations, and also that the antiviral drugs were selecting resistant viruses.

The last example is the most recent one: how we at the Institute of Health Carlos III handled the multi-country monkeypox outbreak in non-endemic countries. Here we sequenced 28 samples and used the latest version of viralrecon to obtain the different FASTA genomes with both the de novo assembly approach and the mapping approach against three different monkeypox genomes. Using an Illumina NovaSeq with 2x150 reads, we obtained 33 samples with 100% of the reference genome covered at a depth of at least 10x. We used the mapping consensus FASTA files and the de novo assembly FASTA files to create multiple sequence alignments and compare the performance of both approaches. We saw that the ends of the reference genome couldn't be assembled with the de novo approach, but with the mapping approach there was enough coverage to obtain those sequences. Also, in the PlasmidID plots produced by viralrecon, we can see how in the de novo assembly the right and left ends of the reference genome are missing.
Also, the monkeypox genome has short tandem repeats, and we were trying to discover whether either approach was able to recover the exact number of short tandem repeats in our samples. We found that in the de novo assembly approach, when the short tandem repeats fell across different contigs, ABACAS introduces Ns between the contigs, so we couldn't reconstruct the real short-tandem-repeat scaffold. In the mapping approach, we saw that we had enough coverage to cover the reference's short tandem repeats, but we are limited to the number of STRs present in the reference. So, in order to discover the real number of repeats present in our monkeypox samples, we are trying to sequence the best-covered sample with a MiSeq at 2x300, which will also be analysed with the latest version of viralrecon, and with Oxford Nanopore Technologies. We are still working on that, so we can't tell you anything yet.

Well, this is everything. Thank you very much for your attention, and thank you to all the people who developed viralrecon with us, and to the reference laboratories at the institute and the genomics unit for all this work. Thank you.

Thank you very much. So now we have time for some questions, from anyone. No? If there are no questions, I also want to mention that you can always ask questions later on the Slack bytesize channel, and this video will be uploaded to YouTube later. Thank you very much again, and I would also like to thank the Chan Zuckerberg Initiative for funding these talks.