Hello, everyone, and welcome to this week's Bytesize Talk. The speaker today is Solenne Correard from the University of British Columbia in Canada, and she is going to talk about the variant catalogue. This is a Nextflow pipeline, but it is not part of nf-core yet. The variant catalogue is used for population analysis from whole genome sequencing, and specifically to identify variants and their frequencies. Since Solenne lives in Canada and due to the big time difference, we decided that it was best to record this talk. Therefore, if you have any questions, please ask them in the bytesize channel on Slack. As usual, I would like to thank Solenne for her time and the Chan Zuckerberg Initiative for funding the Bytesize Talk series. But now, without further ado, I hand over to Solenne.

Hi, everyone. Welcome to this week's nf-core Bytesize Talk. I am Solenne Correard. I'm a research associate at the BC Genome Sciences Centre in Vancouver, Canada, and today I'm going to talk to you about the variant catalogue pipeline.

First, I would like to acknowledge the lands on which I work, live, and play. Those are the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh nations.

First, what is a variant catalogue, or a variant library? When we talk about genomics and DNA, a variant catalogue is the frequency of the variants within a population. For example, in this population of five individuals, they all get their whole genome sequenced, and at a certain position in their DNA some individuals carry an A and some individuals carry a C. From that individual information, we can deduce the frequency of each allele in the population. In this example, the A allele has a frequency of 0.6 and the C allele has a frequency of 0.4 (a small sketch of this arithmetic follows below). This is the main information that is within a variant catalogue: the frequency of the variants within the population.

When do we use a variant catalogue? There are several ways to use it, but a very good example is the nf-core/raredisease pipeline. During the variant annotation and prioritization step of that pipeline, they use gnomAD. gnomAD is the biggest variant catalogue to date. The reason they use it is that a variant that is frequent in the population is unlikely to be responsible for a rare disease. When we are looking for the variant responsible for a rare disease in a child, we can already filter out all the variants that are frequent in the population (see the second sketch below).

As I mentioned, gnomAD is the biggest variant catalogue to date. It has helped tons of families get a diagnosis for rare diseases, but when we look at the ancestries of the individuals within gnomAD, we can see that most of the individuals are of European ancestry. Some populations are not represented at all, or are underrepresented. This lack of representation of some populations leads to an inequity in genomic care, because if a child affected with a rare disease is from an ancestry that is not represented in the variant database, then it is harder to remove the variants that are frequent in this population, and so harder to give a diagnosis to this child.

This is a known issue, so several variant catalogues were generated around the world, for example Iranome for the Iranian population or KOVA for the Korean population. The project I was working on is the Silent Genomes Project in Canada. It is a partnership with the Indigenous populations of Canada to build the Indigenous Background Variant Library.
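[Editor's note: to make the allele-frequency arithmetic from the five-individual example concrete, here is a minimal Python sketch. It is not from the talk; the genotypes are made up to reproduce the 0.6 / 0.4 result.]

```python
# Minimal sketch of allele-frequency calculation at a single position.
# The genotypes are illustrative, chosen to reproduce the talk's
# five-individual example (0.6 for A, 0.4 for C).
from collections import Counter

# One genotype (two alleles) per sequenced individual at this position.
genotypes = [("A", "A"), ("A", "C"), ("A", "C"), ("A", "A"), ("C", "C")]

counts = Counter(allele for gt in genotypes for allele in gt)
total = sum(counts.values())  # 2 alleles x 5 individuals = 10

for allele, n in sorted(counts.items()):
    print(f"{allele}: {n}/{total} = {n / total:.1f}")
# A: 6/10 = 0.6
# C: 4/10 = 0.4
```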
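[Editor's note: and here is a sketch of the frequency-based filtering idea behind the gnomAD use case described above. The catalogue lookup and the 1% threshold are illustrative assumptions, not the actual nf-core/raredisease logic.]

```python
# Sketch of frequency-based filtering during rare disease variant
# prioritization: variants that are common in a reference catalogue
# are removed before further analysis. The dict and the 1% threshold
# are assumptions for illustration only.

# (chrom, pos, ref, alt) -> population allele frequency from a catalogue
catalogue_af = {
    ("chr1", 12345, "A", "C"): 0.32,    # common: unlikely to cause a rare disease
    ("chr7", 55242, "G", "T"): 0.0001,  # rare: kept for prioritization
}

def keep_for_prioritization(variant, max_af=0.01):
    """Keep a patient's variant only if it is rare (or absent) in the catalogue."""
    return catalogue_af.get(variant, 0.0) <= max_af

patient_variants = list(catalogue_af)
rare = [v for v in patient_variants if keep_for_prioritization(v)]
print(rare)  # only the chr7 variant survives the frequency filter
```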
A very similar project is taking place in New Zealand with Genomics Aotearoa, where they are working with the Māori population.

When we were working on the Indigenous Background Variant Library, we needed a pipeline to process the data to get the variant frequencies. Some pipelines existed, but none of them satisfied the three constraints that we had. The first one is that we wanted the pipeline to rely on open access tools that had previously been benchmarked, because we did not want to develop anything new or use unvalidated software. We wanted it to be comprehensive, and by that I mean it had to include single nucleotide variants, but also mitochondrial variants, structural variants, short tandem repeats, and mobile element insertions. All those classes of variants are known to be potentially implicated in rare disease, so it is very important that all of them are present in a variant catalogue. And finally, we wanted it to be able to work on local servers or on the cloud, because different projects may have different constraints.

We developed the variant catalogue pipeline that you can see on the left here. This is just an overview, and I'm going to describe each part in more detail, but the essence is that it takes as input FASTQ files from participants, and it outputs VCF files, so variant call files, with information about the variants: their position, the alleles, the frequency of each variant within the population, which is the main information we want, the frequency by sex, as well as some annotation. The pipeline is divided into four sub-workflows that can work independently, or all of them can be run in parallel, or at least in the same pipeline.

The first sub-workflow is the mapping sub-workflow. It takes as input the short-read paired-end sequences for the individuals, as well as a reference genome. It has been developed so far for GRCh37 and GRCh38. The mapping tool is BWA-MEM, and it outputs one BAM file per individual (a command-level sketch follows below).

The second sub-workflow is the mitochondrial variant sub-workflow. It is very much based on the work from Laricchia et al. that was published in 2022, and it is therefore very similar to the pipeline that is used by gnomAD for their mitochondrial variants. It takes as input the BAM files previously generated. The variant caller for the mitochondrial variants is GATK Mutect2, and the reason why there is sort of a parallel section here is because the mitochondrial DNA is circular. To be able to map the mitochondrial reads against the reference genome, it is linearized with an artificial breakpoint around position zero. The reads that are supposed to map over that artificial breakpoint do not map correctly, so variants located around this region are not called correctly. To get around that issue, they developed a shifted reference genome where the artificial breakpoint is located on the other side of the circle, which allows the variants in this region to be called correctly (a toy illustration follows below). Then the variants are lifted over, and the information is merged into several VCFs. I will detail the steps at the bottom later.

The third sub-workflow is the single nucleotide variant sub-workflow. It is the most straightforward one. For the variant calling we decided to use DeepVariant, and we are using GLnexus for the joint calling.

The fourth sub-workflow, which is the structural variant sub-workflow, was mostly developed by Mohammed Abdallah, a postdoc within the Wasserman lab. It was decided to use Smoove and Manta as the structural variant callers.
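[Editor's note: before the structural variant details continue, here is a rough sketch of the mapping (sub-workflow 1) and SNV-calling (sub-workflow 3) steps as shell commands driven from Python. These are typical invocations of BWA-MEM, samtools, DeepVariant, and GLnexus, not the pipeline's exact parameters; paths, sample names, and thread counts are made up.]

```python
# Rough sketch of the mapping and SNV-calling steps. Typical tool
# invocations, not the pipeline's actual commands; verify flags
# against each tool's documentation.
import subprocess

ref = "GRCh38.fa"
sample = "sample1"

# Sub-workflow 1: map short paired-end reads with BWA-MEM, sort to BAM.
subprocess.run(
    f"bwa mem -t 8 {ref} {sample}_R1.fastq.gz {sample}_R2.fastq.gz"
    f" | samtools sort -o {sample}.bam -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", f"{sample}.bam"], check=True)

# Sub-workflow 3: per-sample SNV calling with DeepVariant (gVCF output)...
subprocess.run([
    "run_deepvariant", "--model_type=WGS", f"--ref={ref}",
    f"--reads={sample}.bam", f"--output_vcf={sample}.vcf.gz",
    f"--output_gvcf={sample}.g.vcf.gz", "--num_shards=8",
], check=True)

# ...then joint calling across all samples with GLnexus.
subprocess.run(
    "glnexus_cli --config DeepVariant sample1.g.vcf.gz sample2.g.vcf.gz"
    " > cohort.bcf",
    shell=True, check=True,
)
```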
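[Editor's note: the shifted-reference trick described above can be illustrated in a few lines: linearizing a circular genome creates an artificial breakpoint, and rotating the sequence by half its length moves that breakpoint to the opposite side of the circle. This is only a toy illustration of the idea from Laricchia et al., not the pipeline's code.]

```python
# Toy illustration of the shifted-reference idea used for circular
# mitochondrial DNA (after Laricchia et al., 2022); not pipeline code.

mito = "ACGTACGTAAGGCCTTACGT"  # stand-in for the ~16.5 kb circular mtDNA

# Linearizing the circle puts an artificial breakpoint between the last
# and first base, so reads spanning position 0 cannot map correctly.
half = len(mito) // 2

# Rotating the sequence by half its length produces a "shifted" reference
# whose breakpoint sits on the opposite side of the circle: variants near
# the original breakpoint now land in the middle, can be called there,
# and are then lifted over back to the original coordinates.
shifted = mito[half:] + mito[:half]

def shift_back(pos_in_shifted):
    """Lift a 0-based position on the shifted reference back to the original."""
    return (pos_in_shifted + half) % len(mito)

assert shifted[0] == mito[shift_back(0)]
print(shifted, shift_back(0))  # position 0 of the shifted ref is mid-circle
```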
Jasmine is used to merge the variants, and Paragraph is used to genotype the structural variants within the individual data. Then the information is merged with VCFtools. For the short tandem repeats we are using ExpansionHunter, and for the mobile element insertions we are using MELT.

All of the variant calling part is very similar to other pipelines such as nf-core/raredisease or nf-core/sarek. What is really specific about this pipeline is the steps at the bottom here: the sample quality control, variant quality control, allele frequency calculation, and also sex imputation. The reason for that is that quality control is performed differently if you have just one individual or a trio versus if you have a population. All of this is performed within Hail, which is a Python-based analysis tool that is also used by gnomAD and some other variant catalogue pipelines. So, as I said, it performs some quality control as well as the variant frequency calculation (a rough Hail sketch follows at the end of this part). Then the variants are annotated using VEP.

So that was just an overview of the pipeline. This is the actual complete pipeline. It is available on the Wasserman Lab GitHub, and it is described in more detail in this preprint. It was tested on a hundred samples, and it works. The details of the number of CPU hours, as well as the number of samples and the variants that are filtered out by the quality control steps, are available within the preprint. However, this version still relies on locally installed software, and that is an issue for two reasons. First, it is really hard for other projects to use. And second, it is impossible to test it as easily as we are used to testing other nf-core and Nextflow pipelines, with just one command line.

Therefore, the future for the pipeline is to move it to an nf-core-level pipeline. My goal is to move the mapping as well as the single nucleotide variant sub-workflow during next month's hackathon, so if anyone wants to team up with me for the code or for code review, please reach out. After that, we will have to move the mitochondrial and the structural variant sub-workflows to nf-core as well. This will first allow other people to try it more easily, but it will also force us to write better documentation, and that is very important to make sure that other groups can use the pipeline. If the documentation is good, then it is easier for other people to try and use this pipeline.

To test the pipeline, I actually needed to create a new dataset, because the ones that were available within nf-core did not fit my needs. I needed paired-end read FASTQ files that included part of an autosome, as well as parts of chromosomes X and Y to impute the sex of the individuals, and reads mapping to the mitochondrial chromosome to test sub-workflow 2. I also needed reads supporting the presence of structural variants to be able to test sub-workflow 4, and several samples, including XX and XY individuals, to be able to test the variant frequency calculation part. This will hopefully be available to others soon, in case you need it to test your tools or your pipeline. I will also include the reference genome for the same regions and additional files, such as the short tandem repeat catalogue, the mitochondrial reference file, and the shifted one I mentioned before.

In other future developments, I would like to test at least more reference genomes, including the T2T reference for humans, but also non-human reference genomes.
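[Editor's note: going back to the Hail-based steps described above, here is a minimal sketch of what population-level QC, sex imputation, and frequency calculation look like with Hail's public API. The API calls are real Hail functions, but the file names and thresholds are placeholders, not the pipeline's actual values.]

```python
# Minimal sketch of population-level QC and frequency calculation in
# Hail. File names and thresholds are placeholders, not the pipeline's
# actual values.
import hail as hl

hl.init()
mt = hl.import_vcf("cohort.vcf.gz", force_bgz=True, reference_genome="GRCh38")

# Sample QC: for example, drop samples with a low genotype call rate.
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)

# Sex imputation from genotypes (in practice run on chromosome X sites).
imputed = hl.impute_sex(mt.GT)
mt = mt.annotate_cols(is_female=imputed[mt.s].is_female)

# Variant QC: per-variant statistics, including allele frequencies,
# which are the core content of the variant catalogue.
mt = hl.variant_qc(mt)
rows = mt.rows()
freq = rows.select(AF=rows.variant_qc.AF)
freq.show(5)
```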
I would like to include more software, for example to give the user the opportunity to decide which mapper they want to use and which variant callers they want to use. We also want to make sure that it fits with the nf-core/raredisease pipeline. I know that they use slightly different callers for structural variants, and it would be interesting to make sure that there is a good fit. It is also possible to include additional metrics, such as ancestry inference, mitochondrial haplogroup assignment, or relatedness calculation. Those are metrics that are often associated with variant catalogue pipelines. They were out of scope for the Silent Genomes Project, but we understand their relevance for other projects, and it would be great to also include them and have them as an option.

I would like to acknowledge everyone within the Wasserman Lab, especially Wyeth Wasserman, the team leader, Mohammed Abdallah, who worked a lot on the structural variant sub-workflow and the rest of the pipeline, as well as Brittany Hewitson, the Silent Genomes team, and also all of the nf-core community. It has been a very welcoming community, and I have learned a lot.

Obviously, this is not live. If you have any questions, please reach out on the nf-core variant catalogue channel on Slack to spark a discussion and start threads on different things. If you prefer to reach out directly to me, you can do it through Twitter or GitHub. Thank you for your attention, and have a great rest of your day.