Hello everybody, and welcome to the last session of this hackathon, or the last presentation at least. Today we will present some of the featured nf-core pipelines: the pipelines that have been used the most or that have recently been incorporated into nf-core. First, Maxime Garcia will introduce us to Sarek, a workflow for detecting germline and somatic mutations in whole genome sequencing and whole exome sequencing data. Then Alex Peltzer will follow with an introduction to the RNA-seq pipeline, and Harshil Patel will introduce us to the viralrecon pipeline, the newly added pipeline that has been a contribution of nf-core to sequencing the new SARS-CoV-2 virus. We are excited about this and also about Björn Grüning's talk afterwards about BioContainers and Bioconda. So, welcome Maxime, and we look forward to Sarek's introduction.

Thank you. So, yes, I'm going to talk about Sarek. I would like to thank Gisela and everyone over there for giving me the opportunity to present Sarek here. I work at the Swedish Childhood Tumor Biobank in Stockholm, Sweden. I also sit all the time at SciLifeLab, which is a national center for molecular bioscience in Sweden, and I'm working closely with the National Genomics Infrastructure (NGI). We are also collaborating quite closely with NBIS, which is another infrastructure, but one specific to bioinformatics. For me as a bioinformatician, and I think for this whole community, reproducibility is very central, and that was very much on our minds when we decided to start Sarek.

So what is Sarek? It's a national park in northern Sweden with a very complicated landscape. We figured that such a landscape was a good analogy for the genomic landscape that we wanted to explore, so that's why our pipeline has the name Sarek. And every release of Sarek is named after a place within the national park.
So that's why we have these very fancy release names and not the usual release names that you find in the other pipelines. Sarek is an open-source pipeline that we started at NGI in collaboration with NBIS and with support from Barntumörbanken. Since we are in nf-core, we've got plenty of other developers helping us. I could talk about plenty of developers from QBiC and from other institutes as well. I'll mention that in my next presentation, but yes, this is another one, so sorry about that. It's written in Nextflow. I don't think there's any need for an introduction on that slide, so let's go to the next one.

Sarek has multiple flavors. You can analyze a normal sample with just the germline part of Sarek, or you can analyze a tumor-normal pair with Sarek somatic. Sarek works for whole genome sequencing, whole exome sequencing, and also targeted sequencing.

Within Sarek it's fairly simple. It doesn't seem simple when you look at the code, but basically, for the preprocessing we follow the GATK best practices. I think the latest release follows GATK 4.1.7, if I'm not mistaken. It means that we map the reads to the reference genome with BWA, then mark the duplicates with GATK MarkDuplicates, and do the recalibration with the GATK BaseRecalibrator.

After that we do some variant calling. For the germline part we use HaplotypeCaller from the GATK, Strelka2 from Illumina, FreeBayes and mpileup to resolve small indels and SNVs, and we use Manta and TIDDIT to find structural variants. For the somatic variant calling we use Mutect2 from the GATK, Strelka2 from Illumina, and FreeBayes as well. We discover structural variants with Manta. We assess sample heterogeneity, ploidy and copy number variation with ASCAT and Control-FREEC. And we get information about microsatellite instability from MSIsensor.
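The preprocessing and variant calling steps described above can be summarised in a short Python sketch. This is not Sarek code (Sarek itself is written in Nextflow); the names `PREPROCESSING`, `VARIANT_CALLERS` and `plan` are hypothetical, and the tool lists only restate what is said in the talk.

```python
# Illustrative only: a plain-Python summary of the steps named in the talk.
# None of these names come from Sarek itself, which is a Nextflow pipeline.
from typing import Dict, List, Tuple

# Preprocessing steps in order, each paired with the tool mentioned for it.
PREPROCESSING: List[Tuple[str, str]] = [
    ("map reads to the reference genome", "BWA"),
    ("mark duplicates", "GATK MarkDuplicates"),
    ("recalibrate base qualities", "GATK BaseRecalibrator"),
]

# Variant callers grouped by analysis type and by what they detect.
VARIANT_CALLERS: Dict[str, Dict[str, List[str]]] = {
    "germline": {
        "small indels and SNVs": [
            "GATK HaplotypeCaller", "Strelka2", "FreeBayes", "mpileup",
        ],
        "structural variants": ["Manta", "TIDDIT"],
    },
    "somatic": {
        "small indels and SNVs": ["GATK Mutect2", "Strelka2", "FreeBayes"],
        "structural variants": ["Manta"],
        "heterogeneity, ploidy and copy number": ["ASCAT", "Control-FREEC"],
        "microsatellite instability": ["MSIsensor"],
    },
}


def plan(analysis: str) -> List[str]:
    """Return the ordered list of steps for 'germline' or 'somatic'."""
    if analysis not in VARIANT_CALLERS:
        raise ValueError(f"unknown analysis type: {analysis!r}")
    steps = [f"{step}: {tool}" for step, tool in PREPROCESSING]
    for target, tools in VARIANT_CALLERS[analysis].items():
        steps.append(f"call {target}: {', '.join(tools)}")
    return steps


if __name__ == "__main__":
    for line in plan("somatic"):
        print(line)
```

Calling `plan("germline")` lists the same three preprocessing steps followed by the two germline calling groups; a normal-only sample would use that plan, while a tumor-normal pair would use `plan("somatic")`.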
Then, from the variant calling, we do some annotation using VEP and snpEff, which gives us access to plenty of different databases. And then of course we gather all the reports from all the different tools into a final MultiQC report. So this is our workflow in a more readable figure. I guess you've seen this one lately.

What we are also working on at the moment is prioritization. That will be one of the future developments. The idea is to be able to rank all the variants and to help the clinician figure out which variants could be relevant or not. That should be coming, I guess, next year or something like that. It's quite complicated.

So what is coming soon? We've been working a lot during this hackathon. We fixed a couple of bugs. We added BWA-MEM2 in the dev branch as well. And we are working with Gisela and Friederike on DSL2, which is actually quite fun. I'm very much looking forward to that, and I think it will be my big project for the summer. What is coming next as well will be more tools, more downstream processing of the final VCF, and maybe also a connection to Scout, which is a visualization tool that other teams are using here in Sweden.

We have used Sarek a lot in Sweden. This is what I know of, so if other people are using Sarek, do tell me. I would like to know how many people are using Sarek. For us within the Biobank, we are using Sarek to analyze all our tumor and normal samples. At NGI, we use it to process all the normal samples and the tumor-normal pairs. It was also used to analyze 1,000 normal samples from the SweGen dataset. And we are working with GMS, Genomic Medicine Sweden, to figure out some tools to analyze cancer samples as well. So that's a development that I'm looking forward to. We do have an article that we published earlier this year in F1000Research. It has been approved by two peer reviewers, so we were quite happy with that.
If you want to get involved, you know the drill. We're on Slack. We're on GitHub. You already know all the addresses for that, so that should be fairly simple. I would like to thank all the institutes I'm depending on, all the different institutes that are involved within nf-core, all the contributors that are working on nf-core as well, and also all the contributors that are working on Sarek. If you have any questions, I'm here to answer.

Thank you very much, Maxime. We got a really good idea about Sarek now. And I can definitely tell you that we use it at QBiC a lot. It's the only pipeline that we use for whole exome and whole genome sequencing. So thanks for the developments. It's also really fun contributing to it. So Phil Ewels is asking: can you talk about simple genomes? Can I use it on my E. coli?

Yes, so we started working on that, I think that was last year, maybe a little bit more. So actually you can use Sarek. The preprocessing will run following the GATK best practices, and as a reference genome you just need a FASTA file. So you can potentially run Sarek on any genome if you have a FASTA file. We are also getting a bigger and bigger community. So depending on what people ask us to do, we are adding multiple features, like skipping the mark duplicates step if you're working with amplicons, or being able to do just the mapping step and then do the variant calling from the BAMs that are only mapped. That is for when you want to work on an organism that hasn't been analyzed a lot yet, so, for example, you don't have any known variants. Then you can do another round of preprocessing with the variants that you discover. So we are working on stuff like that. If you have any other features that you want to add, don't hesitate to open some issues.
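The options mentioned in this answer can be sketched in a few lines of Python. This is only an illustration of the logic as described in the talk, not how Sarek implements it: the function `preprocessing_steps`, its parameters, and the file name `round1_calls.vcf` are all hypothetical. The one outside fact it relies on is that base recalibration needs a set of known variants, which is why that step is left out when none are available.

```python
# Illustrative only: which preprocessing steps would run for a given setup.
# Function name, parameters and file names are hypothetical, not Sarek options.
from typing import List, Optional


def preprocessing_steps(amplicon: bool = False,
                        known_variants: Optional[str] = None) -> List[str]:
    """List the preprocessing steps for a sample, in order."""
    steps = ["map reads to the FASTA reference"]
    # Duplicate marking is skipped for amplicon data.
    if not amplicon:
        steps.append("mark duplicates")
    # Recalibration needs known variants, so it only runs when they exist.
    if known_variants is not None:
        steps.append(f"recalibrate base qualities with {known_variants}")
    return steps


if __name__ == "__main__":
    # Round 1: little-studied organism, no known variants yet.
    print(preprocessing_steps())
    # Round 2: reuse the variants discovered in round 1 (hypothetical file).
    print(preprocessing_steps(known_variants="round1_calls.vcf"))
```

The two calls at the bottom mirror the two rounds described above: first map and call variants without recalibration, then preprocess again using the variants found in the first round.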
What about if we want to use another species that is not supported right now but is part of iGenomes?

I think we added most of the species that were within iGenomes, because you basically just need the FASTA file. So I think I added all the FASTA files that were on iGenomes to Sarek, but I haven't tested everything because I don't have data for every organism.

Perfect. That's great. So what if I want to use Sentieon? What is Sentieon again?

Okay. So Sentieon is a proprietary solution, and they claim that it can give you the same results as the GATK best practices and BWA, but with a faster response time, with faster preprocessing. We do use that in a clinical setting at the Biobank, so I enabled the usage of Sentieon within Sarek. Basically, we currently have two parallel chains of processes that can run with Sentieon or without Sentieon. But to run Sentieon, you need to have Sentieon installed on the machine that you're running Sarek on. For us, on our own server, it's installed with a module system. And we use the nf-core configs repository to have a specific config for Sarek for our own cluster that loads the module for Sentieon. It can be done if anyone else has access to Sentieon. I can help them write a specific config profile for that, and that way anyone can use it as well.

Perfect. So you just add it to the path and then you can just execute it?

That's fairly simple. But I'm guessing that with BWA-MEM2 we will see some improvement as well. And I'm hoping that the difference will be slightly smaller.

So Phil is asking: how do you test a pipeline? Is it only miniature test data sets or something more?

Okay, so currently we are only doing CI testing with mini data sets. I know that at NGI we are doing some testing as well, but that's not what I'm doing at the moment. I would love to be able to have some validation tests.
We are working on that with you, actually, and with other people at nf-core, to do some real-size tests with real-size data sets on AWS. But yes, for the moment we are still working on that. I think we are getting ready. We just need to figure out which data set to use, actually.

Yeah, that's a good reminder from Phil that we should finally find a big data set so we can run full-size tests.

I think maybe we can try Genome in a Bottle or something like that.

Yeah, exactly. Or 1000 Genomes.