Hi everybody, I'm going to talk a bit about the nf-core RNA-seq pipeline, which is an RNA sequencing analysis pipeline with multiple features for quantification of RNA-seq data. It is one of the largest nf-core pipelines and has been around quite a while. So I'm just going to briefly go over some of the features and give a basic introduction to what it is and what kind of tools it uses, and then you can ask your questions if you want to.
As I already mentioned, nf-core RNA-seq is a relatively old Nextflow pipeline. It was started in 2016, back then as NGI-RNAseq, by people in the SciLifeLab, actually. Phil Ewels was involved, and there was another person there working mostly on the RNA-seq pipeline back then; I don't recall the name at the moment, Rickard, I think it was. More or less you can say that RNA-seq is one of the pipelines that caught the interest of various other groups who were looking at Nextflow pipelines as a way of doing their analyses. It was motivating a lot of people working with Nextflow back then, for example me, when I started using Nextflow in 2017. And I think a lot of other people are using it, or have been using it, as a prototype to start developing anything in Nextflow.
So you can see it like this: RNA-seq is kind of the grandma and the granddad of the other nf-core pipelines, because it was there before quite a number of other pipelines, although I have to admit that at least the methylseq and the chipseq pipelines are also quite old and were started almost simultaneously with the RNA-seq pipeline. But other pipelines such as mag or eager or Sarek were only added to nf-core, I think, about a year ago and were not in the nf-core organization for quite a while.
So what else does the nf-core RNA-seq pipeline show as a distinct thing? Well, it's one of the most used pipelines that we have in nf-core in general. It's quite well established.
A lot of people are running this pipeline across multiple institutions, across multiple communities, with lots of end users. So this is one of the pipelines with a lot of traffic on the nf-core site. If you have a look at the nf-core stats page, for example at the repo traffic subpage there, which you can access publicly, you can see that there are a lot of committers, a lot of people who watch the pipeline and a lot of repository views as well. So apparently it's a well-defined use case, and a lot of people are actually using the pipeline to do their analysis. That's a good thing, because it means the pipeline is well tested and well established, and that is to everybody's benefit, I guess.
A bad thing about that is that it also involves a quite active development community. The pipeline is historically quite complex. There have been lots of additions to the pipeline recently, new mappers for example, and new special features. Adding new special features to an existing pipeline might actually break behavior that others were relying on, and unfortunately we have had a couple of cases where this happened in the past. So the hope is that this will all get a bit better once we jump on the DSL version 2 train and make the pipeline a bit more modular, because then we will hopefully have a bit less work maintaining it. At the moment I think the code base is quite complicated because of the legacy scripting involved.
The pipeline features a couple of steps for pre-processing. Basically that involves a quality check using FastQC and some trimming with Trim Galore. And then there is the main part of the pipeline. The pseudo-alignment, this part here, the Salmon part, was just added last year.
I think last summer, actually, in 2019, whereas these features here, STAR, HISAT2 and the featureCounts and StringTie part, have been around for a couple of years now. But a lot of people were requesting a pseudo-alignment method as well. I think Olga was mainly pushing this, and some other people were also involved in pushing this and implementing it in the pipeline.
With respect to quality control, we have a couple of methods available inside the pipeline as well: Qualimap, RSeQC, and a small script that does duplication rate detection with dupRadar. We have a couple of tools that do some basic edgeR analysis to check whether the samples, for example, cluster together or show some outlier behavior. We have sequencing library complexity estimation using Preseq. And of course, like most of the nf-core pipelines, we have a nice little MultiQC report at the end of the pipeline, which could probably also use some more detailed tuning nowadays with the new features of MultiQC.
So in summary, the pipeline can do the full path from reads to counts and transcript quantification and reports this in a standardized way, which makes it extremely easy to use the pipeline in various settings. For example, back when I worked at QBiC, we were using it for almost all RNA-seq projects. And I know that there are a couple of other institutions, NGI for example, that are also relying on this pipeline nowadays for their standard RNA-seq analyses.
Now some of the next steps that are, I think, somewhat in line. One thing that I personally thought would be nice to do in the next couple of months is working through the backlog and fixing and documenting longer-standing issues. We have a couple of these that have been open for years now.
Some of them might not even be up to date anymore, because they were bug reports for a previous version that has been phased out anyway, or they describe something that is no longer a bug in one of the newer releases. Some of the code we use in this pipeline is not up to the newest Nextflow features anymore, so there's some stuff that needs a bit of extra love, let's say.
There have also been discussions about allowing a design input file, like for example the atacseq, chipseq, Sarek and eager pipelines do, to enable more steps. For example, there was a discussion about running some preliminary differential gene expression analysis in the pipeline, a very basic one, or at least providing a script that can be used for that, and there are some discussions on actually integrating that as well. As RNA-seq is used so much across multiple institutions, there was also a discussion about integrating refgenie in one of the upcoming versions, which will probably have to wait until we push that forward in nf-core tools.
The other thing that is still missing is full-size testing for RNA-seq. Currently, we just test with a very small E. coli or yeast sample, I think. But the hope is that we can also run some full-size RNA-seq test data to identify potential issues in the pipeline a bit more easily in the future.
And finally, one of the biggest points, which I think will make the pipeline a bit less complicated, is to move to DSL version 2, so that we can strip out certain functionality and keep it modularized. That way users can help develop a certain aspect of the pipeline without breaking the overall pipeline, which will hopefully make the entire code base a bit less cluttered.
So whenever you want, of course, like with all nf-core pipelines, get involved. The code is all open source and on GitHub. You can join Slack, where there is a separate channel for RNA-seq.
And as you can see down there, there are already 29 people who have contributed in some way to the pipeline, some with documentation, some with a lot of code, some with just minor additions. No matter what you want to add, feel free to contribute. We're always happy to review. Unfortunately, there is no single main developer, but a couple of people are trying to keep up with open pull requests and to review and help out where possible. But as it's an open source project, please be aware that we are not doing all of this in our main work time. So that's an overview of the institutions using it and the contributors. If there are questions on the pipeline, please shoot. Thanks.
Thank you very much, Alex, for this introduction to the RNA-seq pipeline. It's a pipeline that many institutions are using. We are also using it a lot here at QBiC, and it's a lifesaver. So we do have some questions from the audience. Phil asks, could you please mention the UMI support?
Yes, so there is an open pull request adding UMI support to the RNA-seq pipeline. I think Gregor Sturm implemented it shortly before or during the hackathon, maybe both. And there are two other people, Kate and Alison, if I'm recalling the names correctly, reviewing and testing the code at the moment. Sorry if I don't have the names right. But we hope that this will make it into a new release of the RNA-seq pipeline so that people can also use UMI data.
Perfect. Thanks a lot. That will be great. And do you know if the pipeline also runs on AWS? Have you tested it?
We've been testing small test data on AWS, and that works quite well. Full-size testing, and by that I mean realistically big test data sets, has not been done so far. But we know that a couple of people have tried it on AWS and for them it worked. So I suppose it works.
We have another question from Leon.
I heard DESeq2 is also quite a common tool for RNA-seq. Do you have any experience with this one as well? I guess that refers more to a possible downstream analysis, differential expression analysis with DESeq2.
Yes, so there's a, I think, still somewhat ongoing discussion on the RNA-seq GitHub repository, in a separate issue, of whether we should also include downstream processing in the pipeline. I'm not sure what the current status of this is. There were a couple of people involved. I think Gisela was involved, I was involved, and I think Skiddle, Harshil and some others were also taking part there and discussing whether it would be useful to add some more downstream processing to the pipeline. At the moment, this is kind of blocked by the missing input feature, because you need some more metadata to do that in a useful way. Once this input feature is implemented, we could potentially add something using these scripts. That would be possible in general. But I think there's no conclusion from the discussion at the moment on whether this will be implemented or not. We're always happy to hear about use cases.
Yes, exactly. We were discussing it in an issue as well. I can also point people to this issue again. But we were even discussing having it as a separate pipeline, so as not to mix it with the RNA-seq pipeline, or, now with DSL2, even as another workflow within the RNA-seq pipeline. That could also be a possibility. But yeah, how to include it is still a bit of another discussion.
We have another question from Paul. Is there a similar miRNA-seq pipeline, so a micro-RNA-seq pipeline, in nf-core?
At the moment, I don't think we have something like that. No, I'm not aware of any. Maybe others from the core team can comment on that, but I'm not aware of any at the moment.
So, there is the smrnaseq pipeline, so the small RNA-seq pipeline, Phil commented.
So, one would need to see which kind of analysis it is needed for. But Phil mentions that they use this pipeline for micro-RNA analysis.
Okay.
All right, we don't seem to have any more questions so far. Oh, a fun fact from Phil: they have run 61,327 samples so far with the RNA-seq pipeline, and counting. So, it's definitely a well-tested one.
Yeah, I would think so. You should also have these numbers, I think.
Yeah. Cool. Oh, thank you very much, Alex.
Thank you.