Hi, everyone. Thank you for joining today's bytesize talk. I would like to begin by thanking our funders, the Chan Zuckerberg Initiative, for supporting all nf-core events. Some preliminary information: this talk is being recorded, and the video will be uploaded to YouTube and shared on Slack and on our website. The talk will be about 15 minutes, after which we will have a Q&A session; feel free to send your question in the chat box, where it will be picked up, or unmute yourself and ask it directly. Today we have Harshil Patel, Head of Scientific Development at Seqera Labs, presenting the nf-core/rnaseq pipeline, a bioinformatics pipeline used to analyse RNA sequencing data from organisms with a reference genome and annotation. Over to you, Harshil.

Good afternoon, everyone, and thank you for joining this bytesize talk in what is an awesome series. I'm Harshil Patel, Head of Scientific Development at Seqera Labs. I'm also one of the long-term contributors to nf-core and to various pipelines that we have on nf-core. If you want to know more about me, there's a link here; just click on it. It's a blog post I wrote recently when I joined Seqera Labs.

Jumping directly into some numbers: this is one of the oldest and most popular pipelines on nf-core, and the numbers are staggering; they always surprise me when I see them. We've got 400 forks, almost 60 contributors, and almost 700 people in the Slack channel. It's also one of the most active channels on Slack, where people reach out for help, join to ask questions, and use it as a forum to discuss the pipeline. Over the years this has really been one of the main pipelines on nf-core, and I would say a lot of that is a testament to Nextflow itself, the underlying language we're using. It has given us access to communities, infrastructures, and other things that we wouldn't normally have with a pipeline like this.

The pipeline has gone through various releases over the years. Phil from NGI initially pushed this when nf-core was first starting up; it was one of the main pipelines he pushed, and it then went through various rounds of updates, with Alex Peltzer getting involved in between for a while. Then there was a gap of about a year where someone really needed to sit down and update the pipeline, and that's where I got involved, mainly helping out with the implementation. Up to version 1.4.2 the pipeline was written in Nextflow DSL1, and some of you may know that Nextflow now has a new, more modular syntax called DSL2. For us that was the perfect opportunity to rewrite this pipeline essentially from scratch in DSL2, as a proof of concept for how it would work on nf-core, because we obviously want other pipelines to adopt similar syntax and principles. So I came up with the first iteration of DSL2 at that point, and we released version 2.0. Since then we've changed and evolved the way we use DSL2, partly due to updates.
Mahesh helped out with what you could call the second iteration of DSL2 that we've now got on nf-core. It's constantly improving and being adopted more and more across nf-core, and you'll be able to see that in version 3.5. This pipeline has really become the cutting edge, or the gold standard, in terms of what we're doing with Nextflow implementations.

As for RNA-seq itself, it's probably one of the most popular applications of next-generation sequencing, and most people doing experiments will have come across some sort of RNA-seq data, I imagine, especially bioinformaticians. What you're doing is quantifying the expression of the genes in a genome at a given time. This is typical of, say, bulk RNA sequencing: you want to quantify what the expression of your genes is like in one condition compared to another, figure out what is different, and then put that into some sort of functional context, like looking at pathways or doing further experiments to figure out how expression is functionally affected by whatever you're doing or however you're perturbing the cells.

The typical pipeline for this would be: you have raw reads; you do some cleaning of those reads by removing adapters and other artefacts from the sequencing technologies; and you do some QC. There's one step here we don't actually have in the pipeline yet, though it's probably something we may add later (I'm still thinking about how to do it properly): sampling reads to automatically infer strandedness, and plugging that directly into the alignment algorithms, which need this information. So you would do some cleaning and then align the reads to the genome. You can then get QC out of your genome BAM files, like looking at intronic rates or genomic contamination and all sorts of other really useful QC metrics. Most importantly, you also get the gene counts out: essentially a matrix with genes in rows and samples in columns. That allows you to plug the counts from tools like RSEM or Salmon, or other quantification methods, into the differential expression analysis between the conditions in your experiment.

The pipeline doesn't perform any differential expression analysis itself, and that's intentional: when you start getting involved with stats, that's generally where things start getting complicated. To do differential expression properly you need to factor in all of the various experimental factors in your experiment, and there isn't really a standardised way of encoding that information. So to keep things simple, the rnaseq pipeline gets you to the counts, and then it's up to you how you factor in the various sample conditions: whether you need to account for, say, the sex of mice, or for time points in days, and how those and other confounding factors would affect the differential expression. They really do need to be taken into account.
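To make that last point concrete, here is a minimal downstream sketch, not produced by the pipeline itself, of how extra covariates such as sex and timepoint might be folded into a DESeq2 design. The file names and metadata columns are placeholders for illustration only:

```bash
# A hedged sketch of downstream differential expression with covariates.
# "gene_counts.tsv" (genes x samples) and "sample_info.csv" (columns:
# condition, sex, day) are hypothetical inputs you would prepare yourself.
R --quiet --no-save <<'EOF'
suppressMessages(library(DESeq2))

# Counts are rounded because DESeq2 expects integer counts.
counts  <- round(as.matrix(read.delim("gene_counts.tsv", row.names = 1)))

# Per-sample metadata; row names must match the count matrix column names.
samples <- read.csv("sample_info.csv", row.names = 1)

# Sex and day enter the design as confounders alongside the condition.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = samples,
                              design    = ~ sex + day + condition)
dds <- DESeq(dds)
print(head(results(dds)))   # condition effect, adjusted for sex and day
EOF
```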
If you want an idea of some of the lower-level mapping considerations, last week's bytesize talk on the dualrnaseq pipeline explained mapping to various aspects of the genome and the transcriptome, and the complications that arise as a result, so I won't go into much detail there.

In terms of features, one of the biggest strengths of this pipeline is the fact that it is used so widely: we get bug fixes, we get feature requests, and we have contributors from all over the world. It's used on various infrastructures and clouds, which again is a testament to Nextflow itself, and on all types of input data: small data, large data, medium-sized data, whatever type of data you can imagine. That really is one of the biggest strengths of this pipeline.

In terms of the alignment and quantification routes, we've got three standard ones. We've got STAR and Salmon, which Rob Patro actually helped me add; it was really nice of him, he's on the nf-core Slack and we went back and forth a bit before I added this functionality. It may not be widely known, but Salmon also has the ability to take BAM files and quantify from those, and that's the route we use for the default option in this pipeline. There's also a STAR and RSEM route; RSEM is regarded as one of the most accurate quantification methods. In recent releases I've really tried to push making this pipeline as accurate as possible, to make it a gold-standard, best-practice pipeline. So we've stripped out things like featureCounts quantification, which doesn't have any statistical way of modelling where a read count belongs, for example; there is no featureCounts quantification in this pipeline. That's also why with HISAT2 you don't get any downstream quantification at the moment: there isn't an appropriate way to project the reads onto the transcriptome and then do the quantification, which tends to be the more accurate approach.

We also have a pseudo-alignment route. These routes skip the BAM file entirely: they go straight from a FASTQ file and use a quasi-mapping approach, based on k-mers, to calculate the counts directly from the transcriptome. One downside is that this doesn't let you get QC for things like genomic contamination, which you need a BAM file for, and that's why the main alignment routes above are arguably nicer. You can run an alignment route and the pseudo-alignment route together; it's up to you how you run the pipeline. There's an open request for Kallisto as well, if anyone wants to help out with that.

The pipeline runs on everything from bacterial genomes all the way to plant genomes, which have ridiculous amounts of duplication, so it supports most genomes. There's an in-built strand-specificity check, which lets you double-check the strandedness setting you've provided. That's quite important in RNA-seq, because if you get it wrong your quantification will be completely wrong: you'd essentially be counting reads mapping to the wrong strand.
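That strandedness setting is declared per sample in the input samplesheet, described in a moment, which also makes a good place to sketch how simple a run is. In this sketch the sample names, FASTQ paths, and genome key are placeholders, and the flags reflect recent 3.x releases, so check `nextflow run nf-core/rnaseq --help` for your version:

```bash
# Minimal samplesheet sketch. The third sample is single-end, so its
# fastq_2 field is simply left blank.
cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,ctrl1_R1.fastq.gz,ctrl1_R2.fastq.gz,reverse
CONTROL_REP2,ctrl2_R1.fastq.gz,ctrl2_R2.fastq.gz,reverse
TREATED_REP1,treat1_R1.fastq.gz,,reverse
EOF

# With iGenomes, a full run needs little more than a samplesheet,
# a genome key, and an execution profile:
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --genome GRCh38 \
    -profile docker

# Route selection, as discussed above:
#   --aligner star_salmon    # default: STAR alignment + Salmon quantification
#   --aligner star_rsem      # STAR + RSEM
#   --aligner hisat2         # alignment only, no downstream quantification
#   --pseudo_aligner salmon  # pseudo-alignment route, skips the BAM
```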
And so there's a warning currently generated that tells you whether you've got the strandedness right or wrong, plus a whole bunch of other features like UMI support, rRNA removal, and genomic contaminant removal, which I added recently. You can also chain this with the nf-core/fetchngs pipeline, another pipeline I've written, which allows you to download data from just a set of SRA IDs and generates a sample sheet that you can plug directly into this pipeline. So: various cool features.

The sample sheet is very simple, as in the sketch above: you've got sample, fastq_1, fastq_2, and strandedness columns. If you have single-end data, you literally just leave the second FASTQ column blank, and that's it. Then you have strandedness, which, as I mentioned, is quite important for the quantification. There's nothing complicated there.

In terms of reference genome options, you only need a FASTA and a GTF or a GFF; if you provide a GFF, it is converted to GTF for the downstream steps. You can also provide indices and the like, to save having to create them while running the pipeline; if you don't, they are created automatically during the course of the pipeline. There are various parameter docs as well. All of these links work, by the way; I'll make the slides available so you can use them as you go. We're looking to move to refgenie for the genomes: at the moment we're using the Illumina iGenomes, whose standardised organisation is really nice, but they're becoming quite outdated, so we'll hopefully be shifting from iGenomes to refgenie soon.

The results of full-size tests are available on the website. What's awesome about this is that you can literally run a proper full-size experiment with just two parameters: you just provide a sample sheet with your samples and the genome, and the pipeline will generate all of the downstream results for you. These are available on the website for you to browse, without going into too much detail here. Similarly, there are quite extensive docs about the outputs of the pipeline, with some really nice plots that you can look at; we're always looking for feedback on where we can improve them.

The implementation is Nextflow-native: it's all DSL2, and for each process we have one Biocontainer. This is really quite modular, and it allows us to update and maintain the pipeline much more easily, because each process essentially carries its own dependencies. And of course modules: 38 of the 55 modules in this pipeline come from nf-core/modules, so again it allows us to contribute back to the nf-core/modules repository we've created, a central repository hosting essentially Nextflow wrapper scripts for any nf-core pipeline. There's a massive toolkit we've built around this to help with maintaining modules and pipelines.

On configuration, one of the most commonly asked questions with this new syntax is: how do I change the process requirements, say? I've put some examples down here, but the first thing you need to do is look in the modules config for the process you want to change, and use exactly the name specified there. That's quite important, because you can have multiple processes with the same name used in the same pipeline if you use sub-workflows and the like.
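As a hedged sketch of what such an override looks like in recent versions of the nf-core template (the process name and the extra option below are placeholders; older releases used a different mechanism, so check the docs for your version):

```bash
# custom.config: append a non-mandatory option to a single process.
# Copy the exact process name from the pipeline's conf/modules.config;
# fully qualified names disambiguate processes reused via sub-workflows.
cat > custom.config <<'EOF'
process {
    withName: 'TRIMGALORE' {
        ext.args = '--quality 20'
    }
}
EOF

nextflow run nf-core/rnaseq -c custom.config \
    --input samplesheet.csv --genome GRCh38 -profile docker
```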
The logic to select exactly the right process is already defined in this modules config, so find the process you want (in this case I've just copied and pasted it out here), and then you can append the arguments you want; as long as they're not mandatory, you can append them. Here I just want to add a quality 20 argument, so I've created a small config file, as in the sketch above, with options that change the options for this particular process only. Similarly, I can change resource requirements if I want, or I can change a container, which is less likely because you want to use the containers shipped with the pipeline, but if you do, that's also possible.

Then differential analysis. As I mentioned before, you get all sorts of counts out that you can use for downstream analysis, but the pipeline doesn't do any serious differential analysis. It does generate some basic QC plots, like PCAs and heatmaps, that you can use to figure out straight away how your samples look, but it doesn't actually factor in any sample or experimental information, which you need to take downstream. We're really looking for someone who can give a talk on this, because what to do with the downstream results of this pipeline is one of the most commonly asked questions on nf-core, and it would make an awesome bytesize talk.

I've also added this pipeline to Nextflow Tower. Seqera Labs, which is the home of Nextflow now, has this product called Nextflow Tower, which is just an awesome way of monitoring, maintaining, and administering your Nextflow executions, and we're working hard with the nf-core community to make it even better. There's a community showcase area, the link is here, that you can join to get 100 free hours of credits to run this pipeline, amongst others, and get a flavour of what we're doing there. And if you want to come and chat with us, you can find us on Slack, create issues or pull requests on GitHub, follow us on Twitter, and all of these videos and other content are available on YouTube as well.

So thank you for your time. Thank you to the Nextflow community and the nf-core community, to BioContainers and Bioconda for the great infrastructure they've allowed us to use without reinventing the wheel, and to my awesome colleagues at Seqera Labs. Thanks also to some of the main contributors to this pipeline in particular, like Mahesh, Gregor, Jose, Phil, who first started it off, Alex in between, and everyone else who has contributed over time. We have a hackathon coming up; if you don't know already, there's a sign-up link I'll put in the slides, and you can find it on the website as well. One of the major themes is documentation, so if you think we're missing anything, please come and tell us and we will try to improve the documentation wherever we can. Thank you for your time.

Thank you for that comprehensive review of the rnaseq pipeline. Feel free to ask questions if you have any. One attendee actually had a question already: they just wanted clarification on whether we are aligning to the genome in this pipeline, not to the transcriptome.

Good question. It depends on how you want to look at it. Actually, we do align to the genome, you're right, and we then project those reads onto the transcriptome.
So, for example, with RSEM what you end up with is a transcriptome BAM as well as a genome BAM. The genome BAM is generally what you use for the QC, and the transcriptome BAM is what RSEM then uses to generate the counts. So strictly speaking, yes, we're aligning to the genome and then essentially filtering down to the transcriptome. That's quite an old slide; I hope no one noticed.

Okay, and in order of priority of questions: could you clarify how references are built? Do you need a FASTA file?

Yes, you would need your genome FASTA, and you would need some sort of annotation too. This pipeline doesn't do any de novo or guided assembly, and it doesn't take just a transcriptome FASTA as an input. So if you have a novel species for which you've got a transcriptome but no proper annotation, this pipeline won't work yet; there's an open feature request for that. What the pipeline essentially does is take your genome FASTA and your GTF annotation, extract the transcriptome from those two, and use that for all of the downstream analysis. Any indices and other information are then built from just the FASTA and the GTF.

Another question, from Felipe, asks: who is Rob Patro? Rob Patro is the main author and developer of Salmon and a bunch of other really cool tools that are used not only in bulk RNA-seq but now also in single-cell RNA-seq analysis.

We have another question from Michael, who asks: is it worth considering an R environment with a previous DDS object containing all the samples run? Sorry, I didn't understand that: does the pipeline generate a DDS object? Yes, whether it would maybe be worth considering an R environment within the pipeline, with the previous DDS object containing all the samples. So, there is a DDS object, I believe, that is generated at the end of the pipeline for the counts and all of that sort of information; it's a way for you to easily load things into your own R environment. But things start getting tricky there and start verging on actual downstream analysis, like Jupyter notebooks and all sorts of other RStudio-type things, where you need to take the results of this pipeline and load them into a more interactive environment. That's something we've been talking about for quite a while, actually, but it's not a trivial thing to figure out, especially when you want to factor in reproducibility and how to do it in a standardised way. It's an interesting question; at the moment we don't have anything that does it explicitly, but we do generate the DDS file, which you can load into your own R environment and do whatever you want with in terms of downstream analysis.

Okay, I think you alluded to this before, but Ramon asks: how difficult is it to do the conversion from DSL1 to DSL2? For this pipeline it was actually very tricky, because it was the first adoption of DSL2 on nf-core, so everything was starting from scratch, and I had to change things about a gazillion times to get to where I wanted in terms of functionality, testing, and so on. But now we have the awesome infrastructure we've built as a result of various people's learnings over the past year or so.
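As a taste of that tooling, a sketch assuming a recent version of the nf-core/tools CLI (exact sub-commands vary between versions, and the module name here is just an example):

```bash
# Install the nf-core helper tooling.
pip install nf-core

# From inside a pipeline repository: pull a shared module from the
# central nf-core/modules library straight into the pipeline.
nf-core modules install fastqc
```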
We can now install modules really easily, and we've got some really good examples of how to write DSL2 pipelines. I gave a talk about that recently as well, about how easy it would be and how you should attempt to tackle it; we can link to that, and if you follow up on the bytesize channel I can send you a link. It all depends, again, on the complexity of your pipeline, but in theory it should be a lot easier for you than it was for me a year ago.

Thank you for that answer. Oliver has a bit of a comment and a question. He says: great talk, the QC metrics are awesome. Something that would be super helpful for plotting and interrogating QC metrics would be to add the QC results as columns for each row, for example in the input sample sheet CSV. So could the columns in the General Statistics table of the MultiQC report be added to the sample sheet CSV? And he gave an example of how it's done with another tool.

I mean, if you have suggestions for how we can improve it, great. We actually have something similar that we recently added to viralrecon, which I released last week; that's used for SARS-CoV-2 genomic surveillance-type work, where QC and variant information is quite important. But if you have something functioning already, that's even better, and we can have a look at it. Pull-request contributions are always welcome as well, and any suggestions or contributions like that would be more than welcome. I don't see why we can't dump a generic sort of QC flat file, but I think you can export some of that from MultiQC already.

Yeah, maybe I can chime in there. MultiQC by default will export all tables, and quite a lot more, into flat files, specifically for this reason: downstream analysis. So if you look in your MultiQC folder, there's the HTML report, and you'll also find a folder called multiqc_data; inside there will be a whole bunch of files, and you can choose what format to have those in as well.

In fact, that's what I'm parsing for viralrecon. MultiQC dumps all of these files, and it's just really easy not to have to write another parser for every tool that has a log file, because MultiQC is awesome and does it for you. So I literally take all of the information from the tables that MultiQC generates and use that to generate the QC metrics and that report for viralrecon, for example. I've been meaning to do something similar for rnaseq, but I just haven't had the time.

Thank you, Phil, for chipping in. I don't know if anyone else has a question and would like to unmute. If that isn't the case: thank you for the splendid review and for answering the questions so well. We'll see everyone at the next bytesize talk next Tuesday. Thanks, guys.