Hello, and welcome everybody to this nf-core training, where we're going to go over a whole bunch of basic Nextflow material. This is the second of the trainings we're running at the moment, and each one is at a different time zone — so this is the EMEA one, and this is the first session of the EMEA edition. I'd like to point you to the materials: we've got a schedule and some information about all the courses. If you go to the nf-core website and look at the very top under Events, you'll find the information, and if you're watching this on YouTube, obviously you've found the right location for the video. We have a few other materials available to us over these sessions, including chat, as well as other material which will be made available, so if you go to that website you should be able to find most of the key information. So, starting off today: we are in the EMEA session, session number one, and you'll find most of the information on YouTube. If you want to come back to this in the future, the sessions are recorded and the recordings go up shortly afterwards, so you can always find them there. Today we're going to start with a short introduction to Nextflow — 25 minutes or so of background and welcome. Then we'll go into our first scripting section, where we'll look at a very basic Nextflow pipeline made up of a couple of different processes. We'll take a break, and after the break we'll look at a very simple RNA-seq pipeline, as well as how we can use containers with Nextflow. The sessions over the next couple of days start with a lot of the basic Nextflow syntax — understanding channels, processes, and operators — followed by some work on modularization, looking at the new DSL2 features as well as deployment. Finally there's a section on the community and nf-core: the modules work, as well as Nextflow Tower, which we'll see on the final day. It's spread across several different instructors, so hopefully you get a variety of styles. It's the first time we're doing this format, but hopefully it works out well; if you've got any feedback, it's always welcome, particularly on the nf-core Slack channels. On the subject of chat: if you've got any questions, note that there are three Slack channels inside the nf-core Slack which are specific to the training events. If you're not part of that Slack, you can join with the link here. We have four or five people online at the moment who will be able to help you with questions — with so many people watching the video at once, it's a little difficult for me to answer everything live as I go through, but I'll try to keep an eye out for any questions that aren't specifically covered. Final thing, on the materials: you don't need anything other than a browser and a GitHub account to run all of this.
We're going to be doing everything in a Gitpod environment. I'll show you how that works later, but essentially it's one click to access: it pulls down a container which has Nextflow installed, along with all our data and all our scripts, and you can come back to it in the future as well — so don't worry about having to set anything up. I'll also show you how you can install Nextflow locally, but we recommend the Gitpod environment for starting off. With that, I'm going to jump over to some slides now and kick off the main part of the training itself. Okay, a little bit of background on where this comes from. Think about the problem we're trying to solve with these workflows: if you have any experience in this, you'll know that it can be genuinely difficult to write these things, and that is really the problem Nextflow is trying to solve. You have multiple steps, different pieces of software that you're trying to link together; you typically have some input data you're trying to analyze, and some result you're trying to get out. In between is where all the messy details lie: our software is maybe a script you've taken from someone in your lab, maybe something you downloaded from GitHub, a Bioconda environment, et cetera, and you have to pull all of that together into something which can run — and not only run, but run in a distributed manner, submitted across a cluster or into a cloud, because a lot of the compute needed for this kind of analysis is obviously very large. There's also the idea that we want to share that work: take our pipeline, make it reproducible, run it in different environments. All of that presents a lot of challenges. Here's an example of an ancient-DNA analysis pipeline — and this is quite fitting, because if you saw the news from earlier this morning, the Nobel Prize in Medicine was awarded today for, essentially, ancient-DNA analysis. I'm not sure of the exact relationship between the software used here and the Nobel Prize, but I'm sure someone from the nf-core team who wrote this pipeline will let us know later on. Okay, summarizing the problem: we all have larger and larger datasets that we're trying to analyze — typically sequencing data, imaging data, or other large datasets we're trying to run through to get some answers. We've got the concept of embarrassingly parallel workloads: the idea that we can take our data and run it through the pipeline in parallel. That means running a particular process, a particular step, as many times as we have samples — or maybe splitting chromosomes into regions and running each region through the pipeline — all in a parallelized way. This means you can end up with tens of thousands of jobs, and each of those jobs typically has to run on some sort of distributed infrastructure. We've also got a whole bunch of different tools that we're going to be using.
These pipelines are made up of, as I mentioned before, a whole variety of software of differing quality — written in different programming languages, for different environments, et cetera. All of this creates a pretty difficult dependency tree; you end up in situations where some pieces of software are incompatible with each other, and you have to maintain all of it. Coming at this from a few years ago, the main way to deal with that was to install all those pieces of software individually — and that's the situation we started from. Now, looking forward, what are the goals of much of this, and what does it enable with regards to Nextflow? I'm sure you've all heard of FAIR data, the concept of making data findable, accessible, interoperable, and reusable. With Nextflow we like to extend that and take it a little bit further: not just the data, but the data analysis itself. So, making the data analysis findable — maybe available through GitHub — and accessible in a similar way. Likewise interoperable, so you can run the analysis on any cloud that's available. Reusable is obviously very tightly tied to the concept of reproducibility, and we see that in the use of containers and how they let us run things in a completely reproducible way. Then the idea that the analysis should be equitable, free of bias; scalable, able to go from a laptop all the way up to a supercomputer; and tested — using modern software engineering practices to isolate pieces of our code, run them in isolation with small test datasets, et cetera, really taking a test-driven approach to the development of the pipelines, just like we would with other pieces of software. So, as I mentioned before: how were things done before something like Nextflow came around? Here's an example of a parasite genome annotation pipeline that came out of the Sanger Institute some years ago. It was really one of the first big pipelines I was aware of that was written in Nextflow, and it gave us a glimpse into what people were doing prior. Each one of those circles is a process — a different tool being used — and you can see there's a real multitude of tools in that pipeline. Each of them had to be painstakingly installed on the machine you wished to run the analysis on. Obviously that presented a big challenge and a lot of time spent; it all got placed into a Nextflow pipeline, and I'll show you the results of that in a moment. What was also apparent at the time, and related to this, was how hard it was to reproduce a pipeline — how hard it is to reproduce even an academic analysis. There was a paper which looked at exactly that, and it estimated around 280 hours, nearly two months, of work: even when you've got the methods and the software, et cetera, just working through it and getting the same results was very difficult.
The problem is actually slightly worse than that, because even when you have all the right software versions and all the exact commands that were used, that's still not really enough to reproduce a result. This is what we showed early on in one of the Nextflow papers in Nature Biotechnology: even when you took a particular pipeline — like the pipeline I showed you before, or even much simpler pipelines — the results differed depending on the machine you used. For example, if I ran an analysis on a Mac, and then ran the same thing with the same versions on Linux, I got different results. We showed this both for gene annotation — the start positions and locations of annotated genes — and, in the case shown down below, for which genes were called differentially expressed using Kallisto and the R package sleuth: the calls differed depending on the platform. And these were just two examples of pipelines we picked up. We realized that this was obviously a big problem for reproducibility in itself, and it's something containers have gone on to be very helpful with — just the importance of being able to reproduce those things. So that's one small angle on the topics Nextflow was involved in early on. Stepping back a little bit: how does it work in practice? With Nextflow, we can write code in any language. Nextflow is itself a language — something like a script that you write. You take your code — maybe some R, maybe some Python, maybe some command-line tools — and you wrap each piece in a process block. Those process blocks get linked together with a dataflow programming model; this is where the "dataflow" part comes in. You define the containers — essentially the dependencies — and wrap all of that into a Git repository, and that is what becomes a Nextflow pipeline, or a Nextflow workflow. As well as being a language, Nextflow is also a runtime: it's the executor of the pipeline itself. You have a bunch of different ways to run a Nextflow pipeline: you can run it locally on your machine, you can submit it to a Slurm cluster, you can submit it to the cloud. In each of those situations, the Nextflow pipeline itself remains the same — the underlying code of the workflow is identical, and where you submit the pipeline really just depends on some configuration. It's that distinction — separating the definition of the workflow from where the pipeline runs — which has allowed communities like nf-core to come together from all over the world, work on the same code base, and really grow the community aspect from there. Diving a little deeper into some aspects of Nextflow that are maybe different from other approaches: it's a custom DSL — a domain-specific language, a language written specifically for writing workflows. The idea is that it makes it very easy to quickly develop your pipelines, with all of the key pieces which are required.
So, defining processes, for example, and linking those processes together — really the core pieces you would expect from a pipeline language. But when you're in a situation where something isn't available in Nextflow itself, something outside the box, you've got the ability to drop down to the underlying programming language. In Nextflow's case that's Groovy — the language Nextflow is written in. It's not really essential to know any Groovy to write a Nextflow pipeline, but you'll see there's a reference section later on that can be useful in cases where you're trying to do something that isn't possible in Nextflow directly. You have this concept of easy parallelization, and this is where the "flow" in Nextflow comes from: it's a dataflow model. You can think of the data itself as living in channels, and that data is essentially what drives the execution forward. It's reactive in the sense that the processes react to the data: as the data comes through, they react to it and execute. This is different from the pull model, which you'll see in make-like approaches. We also have a self-contained approach, where every single Nextflow task lives in its own working directory, almost independent of the other tasks. That means a task can easily be wrapped into a container or submitted to a distributed infrastructure like AWS Batch, and that idea of isolating individual tasks, treating each one as a unit of compute, is also very helpful for things like resuming pipelines. And finally, there's the idea of separating the workflow definition from the execution layer, like I mentioned a little earlier. Okay — what does it look like? I'm going to show you a very quick example in DSL1, and it's the only time we're going to see DSL1; it was the original syntax of Nextflow. Under the hood everything is exactly the same, there's just a slight syntax difference. The point of this example is that it shows very clearly the linking between the processes, which I'll get to in a moment. So, I have a task — maybe something you'd write on the command line or have in a script. It's just bwa mem, aligning this sample against this reference, and piping the output to samtools. To run this in Nextflow, you simply take the original task and wrap it in what's called a process block — the unit of compute, essentially the definition of the process. You can see I've got the inputs — in this case the reference, and the sample — then the definition of the output, which is sample.bam, and all of that wrapped around the script section, which remains exactly the same as before. This is simple enough; the channels may look a little different from things you've seen before, but it should be quite readable. The way you can read it is: for each task of alignSample, the task will start, it will receive a reference and a sample, the script section will run, and it will create a file called sample.bam.
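To make that concrete, here's a rough DSL1-style sketch of the process I've just described. This is my reconstruction rather than the exact code from the slides — the channel and file names are illustrative, and it assumes the reference has already been indexed for bwa:

```nextflow
// DSL1 style: channels are declared inside the process with 'from' and 'into'
ref_ch    = Channel.fromPath('ref.fa')
sample_ch = Channel.fromPath('sample.fq')

process alignSample {
    input:
    file ref from ref_ch
    file sample from sample_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem $ref $sample | samtools sort -o sample.bam -
    """
}
```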
That sample.bam then goes into the output channel — this bam channel here. Now, what happens if I want to use this downstream? What if I want to take the output of this process and use it somewhere else? This is where the channels come in. If I define a downstream task — imagine I have the output here that I wish to use downstream — I can create a new process where the bam channel is now the input. The output of the previous task becomes the input of the next task, and that implicit linking of the two provides the parallelization going forward. The way to think about it is that both of those processes are alive: when the first process has its inputs, it fires off the execution of the first task, and when that task is complete, it places the BAM file into the output channel, which drives the execution of indexSample, which is essentially waiting there for it. We're going to see quite a few examples of this in practice. The evolution of Nextflow has resulted in a slight syntax change now, where we define the processes exactly the same, except we leave out the definition of the channels, and the channels come in a new section which we call the workflow. On the left-hand side it looks very similar — you have the inputs, the outputs, and the script section — and on the right-hand side you've got the ability to define complete workflows, which can themselves have their own inputs and outputs. That allows a nice modularity in the language. Going back to the concept of dataflow, which is consistent across all of this: these processes, as I said, are alive, sitting there waiting for input, and as soon as they receive that input, they execute. The things that link the processes together are the channels, which are asynchronous first-in, first-out queues, and they are what drives the execution of the processes. As a process receives data, it creates a task, and a task is an instance of the process. Just to get the wording right: say we have a channel, and this channel has three elements — they could be files, they could be values, they could be whatever, but we've got three of them. Because there are three elements in the channel, the process definition will result in three tasks being executed — in this case, one for every file — and those tasks run purely in parallel at the execution level. What does that look like in practice? Imagine we had a channel containing a single sample, sample.fastq, and we defined a fastqc process that takes that single sample and runs it through FastQC. We would get one fastqc task, because we have one sample. If we simply change the channel definition and say: actually, I want all my fastq files — then every one of those fastq files would run through, and that's the result of the parallelization.
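Here's a sketch of that fastqc example in the DSL2 style, so you can see the parallelization and the workflow block in one place. The paths and output naming here are my own illustration, not the slide's exact code:

```nextflow
process FASTQC {
    input:
    path reads

    output:
    path "fastqc_${reads.simpleName}"

    script:
    """
    mkdir fastqc_${reads.simpleName}
    fastqc -o fastqc_${reads.simpleName} $reads
    """
}

workflow {
    // one matching file   -> one FASTQC task
    // many matching files -> one task per file, all run in parallel
    reads_ch = Channel.fromPath('data/*.fastq')
    FASTQC(reads_ch)
}
```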
You can visualize this in a slightly different way here, with the channel carrying the individual data files through and producing one task per file. Okay — a few comparisons between the Nextflow syntax and CWL. The obvious difference is that CWL is a specification, whereas Nextflow is both the language and the runtime; those two things are combined. Being a DSL is a big difference too: you're not strictly confined to a specification, and you can reach outside the language when you need to. Nextflow's syntax is also quite fluent, in the sense that it's easy to read — it should be quite clear what's happening most of the time just from the definition of the processes. And there's the single implementation: the language itself is tightly linked to the runtime, so you don't have multiple implementations or variances between them, which allows much quicker iteration and quicker development. Nextflow is definitely more similar to something like Snakemake — both really started off as command-line tools, and both are very popular in this area. The major difference between Nextflow and Snakemake is the push model versus the pull model. In Snakemake, when a pipeline starts executing, you have to wait for the whole graph to be generated — essentially working from the bottom to the top — and only then can execution start. In Nextflow, the data drives the pipeline from the top, which can be much quicker. I'm not sure how completely up to date all of these comparisons are, but Nextflow has certainly had much better support for container runtimes for a long time now — I think Nextflow supports five or so of them — as well as for clusters and clouds, and that cloud execution in particular has been a big driver of Nextflow adoption. So that was a little on the syntax itself. What about deployment scenarios? Today, and over the next few days, we'll mostly be running local execution. That's typically used when you're developing pipelines, or maybe running on a single virtual machine or a laptop, and it looks pretty basic: nextflow run, and the pipeline runs. You can do this with or without containers, but we'll see later that it's pretty much recommended to use containers everywhere. The idea is that each Nextflow task becomes a process, which runs on your operating system with local storage — a fairly standard setup for development and testing. What becomes more common when you have larger datasets, or when you're running in a university environment, is cluster execution.
Here, Nextflow takes each task and wraps it in, for example, a Slurm submission script, and submits it to the cluster or queue of your choosing, with the resources that you define. It's important here that the intermediate files, the generated files, and the input files are typically accessed via a shared file system. What's become really popular in the last few years are the managed services in the cloud — AWS Batch, but also Google Batch and Azure Batch now. In each case it's very similar: Nextflow submits the task to the batch API, and from there individual virtual machines spin up on demand and the tasks are placed on them; when there are no more tasks in the queue, the virtual machines shut down. This can be very cost-effective when you have, say, a large sequencing run you need to analyze, or resources you don't have available on your local cluster, or when you're just not running that much. With the use of spot instances and some of the optimization we're doing, it becomes a very effective way to run your analysis. Looking further ahead, there's obviously a move towards more cloud-native execution, and we believe Kubernetes is going to play a big part there. Nextflow has support for Kubernetes, there are quite a few large industrial use cases where it's been used in that setting, and we see this evolving as it expands. There's a bit more setup involved — you have to manage the Kubernetes cluster yourself — which is not something I recommend if you're just starting out. Okay, a little on portability now. What does it look like in practice, across the different compute environments I mentioned, if you want to run the same pipeline everywhere? Imagine we start with a Nextflow script: if I say nextflow run and run it locally, the script runs on my machine and the pipeline runs there. Then we've got the concept of Nextflow configuration, which is how you define the execution itself. If, for example, I have a configuration that defines an executor, a queue, some memory — eight gigabytes here — some CPUs, and a container image, then each of those tasks gets submitted to, for example, my Slurm cluster with that queue, and runs there. And notice, importantly, that the script section remains exactly the same — it's the same workflow — and again, this is the real clear separation between those two things. Next, if I want to take the same pipeline and run it on AWS Batch, for example, I can simply switch out the executor: the abstractions for queue, memory, CPUs, et cetera all remain the same. That's what enables portability without having to change the actual script itself. So just to stress the point: decoupling those things is important, and it's something we'll keep stressing over the next few days of this workshop.
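As a sketch of that idea — the queue, image, and bucket names here are purely illustrative — the same script could be pointed at a Slurm cluster or at AWS Batch with a nextflow.config along these lines:

```nextflow
// nextflow.config — Slurm example: every task is submitted as a cluster job
process {
    executor  = 'slurm'
    queue     = 'long'
    memory    = '8 GB'
    cpus      = 4
    container = 'my-registry/my-tools:latest'
}

// Switching to AWS Batch is mostly a config change, e.g.:
//   process.executor = 'awsbatch'
//   process.queue    = 'my-batch-queue'
//   workDir          = 's3://my-bucket/work'
//   aws.region       = 'eu-west-1'
```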
A quick couple of sentences on containerization. Containerization is really key to a lot of what we're going to see over the next days: it drives a lot of the portability as well as the reproducibility of these pipelines. Docker came out around 2013, really just a few months after we'd started the Nextflow project, and it fit very well with the model we were developing. Containers had been used for long-running web services for a long time; we took the concept and applied it to running data pipelines. If you compare them to virtual machines, containers are obviously a lot lighter — a lot smaller to move around — and they start up much quicker, so you don't have the long cold-start problem of waiting for a virtual machine when your task is only going to run for a few seconds. And you've got all of the tooling which exists around Docker: the ability to build containers, registries, really nice ways to manage all that software — which becomes really useful when we're trying to build, let's say, 60 or 70 different containers for all our pipelines. In terms of Nextflow's containerization support: when Nextflow submits a pipeline, you can think of each of those tasks becoming a job submitted to a cluster, and with containerization each job just becomes a container job. Instead of a job running bwa mem directly, it turns into docker run with bwa mem inside, or something similar. That really enables the whole portability across different systems. You'll see Docker being used; you'll also see support for Singularity, mostly in shared HPC environments; and things like Podman, where we're starting to see more adoption of other technologies built around similar standards — so there's a lot of interoperability between those container runtimes. As for when to use containers: I definitely recommend doing it always. Once you get the hang of it, it provides so many benefits — really helping yourself, and your future self, when it comes to reanalyzing things or coming back to a project after some months. And you can really build on containers and continue to evolve your pipelines with them. Okay, a few other things on the community. Nextflow has been growing pretty rapidly over the last couple of years. The figure is a little out of date now, but we have around 12,000 people reading the documentation every month, and a lot of that growth is really coming from the community developing pipelines. That's just highlighting the Nextflow runtime itself — the nf-core community, and the people developing pipelines on top, is even larger again. In terms of additional tooling available to us: if you want to write Nextflow, there are syntax highlighters, and we're going to use one of these — I'll show you in a moment how to set it up for the practical work.
And obviously there's nf-core — there's going to be a whole section on nf-core later on, but you can think of it as a way to get a collection of Nextflow pipelines which are available to you, plus a whole bunch of really good tooling and best practices for how you can set up your own pipelines, or contribute to the existing ones following those practices. With that, I'm going to stop on the introduction section, and we can jump over into the main part. You can find it at seqera.io/training, or training.seqera.io — it's all the same place. I'll type it into the chat now; if you search for something like "Seqera training" you'll find it as well. I'm going to jump over to walk you through it. So, the main training material is available at training.seqera.io. This is going to stay up permanently, so no worries about having to copy anything — it's always available. The main thing we'll be using is a Gitpod environment — follow section 1.2 there — which gives you all of the material pre-downloaded, so you can come back to it at a later date or jump in anytime. If you do want to install Nextflow locally on your own machine — maybe on a cluster later on, or when you want to develop things yourself — go down to section 1.1 for the main requirements. The top two are the requirements for Nextflow itself: you need Bash as well as Java. If you have a Linux machine, macOS, or Windows Subsystem for Linux, that's typically enough to install it; if you have any issues, check the Nextflow docs or come into the Slack channels. For this workshop we're also going to be using Git and Docker, which are obviously core, useful tools, and there's a little material in there on AWS, Singularity, et cetera, which is more optional and towards the end. But for most of this workshop we're going to be using Gitpod. As I mentioned before, Gitpod is an environment you can spin up, and the nice thing about it is that it lets you spin up essentially any GitHub repository inside a container. I'll show you what this one looks like: we're basically using this repository to build the container for all of the training. It's public — if you want to contribute or make suggestions, it's always available. If you go in, you can see there's a .gitpod.yml, if you're interested or have questions about it, which describes how the Gitpod training environment is built. You can see that we're installing Nextflow inside it, and adding in some extensions — for example the Nextflow extension, which is just a syntax highlighter, and a few other things. For our purposes, though, we're simply going to click this link, and when you click it, it should open up the Gitpod environment. It'll take 30 seconds or so, and once it opens you should be inside the environment.
On the left-hand side you've got a file browser, with everything pulled from that repo; then there's some space with a text editor if you want it, as well as a terminal down at the bottom. I'm going to zoom in a little so hopefully you can see my text better, and I'll close these windows so you can see everything I'm doing. I'll give you 30 seconds or so to get in there; in the meantime, I'll confirm that Nextflow is working by running nextflow info, which shows me the version and everything — and we are good to go. The first thing we're going to do — and it may take a few seconds for your environment to come up — is look at the basics of Nextflow. There's material in here covering a lot of what we've discussed: the basic concepts around processes, around channels. But a lot of this is best seen by example, and our first example is a script whose purpose may seem a little trivial, but which actually covers a lot of different aspects of Nextflow. There's the concept of parameters — essentially inputs to our pipeline. We're going to look at how we can create a channel from a parameter — our first look at those dataflow objects. We're going to define two processes: the first takes an input and creates some files — in our case multiple files, which will be treated in parallel by the second process. The second process handles those files in parallel, and finally we bring it all together and get a result out. All of this happens with software available on any operating system, in any Bash environment — no special tools, just things like printf, cat, and split, which should be available anywhere you run this, obviously including our environment. As I mentioned before, this pipeline is very basic. We take a string as input — in this case "Hello world!" — print it, and split it. The split tool is just a Unix tool: it takes the string and turns it into files, splitting every six characters, so in our case we'll have "hello" in the first file and "world" in the second. The next process down takes the output of the previous step as input and cats it — essentially printing the contents of each file — and then simply converts lowercase to uppercase. So in the end, we take the input string, split it into files, and convert it to uppercase. Let's go through that in practice — and to show exactly what I mean, I'm going to first do it without using Nextflow at all, so there's no confusion about what's going on.
I'll clear the terminal, and the first step is just printf, exactly like we have in the first process: I take the string and pipe it into the split tool. The "chunk_" argument is just a prefix — it says: create files prefixed with this. You could put whatever you want there, but in our case we'll use chunk_. Now if I say ls chunk_* — show me all the chunk files — you'll see this one-liner created two files. If I cat chunk_aa, you'll see I've got "hello", and cat chunk_ab gives "world". I know it seems pretty trivial, but you can imagine this as some process that creates files, and those files you then process downstream. And what do we do downstream? We take those files — chunk_aa and chunk_ab — and we cat them; this is the step which will happen in parallel in a moment. Then I simply transform them to uppercase: lowercase "hello" becomes uppercase "HELLO", and the same happens to "world" in parallel. Okay, let's go through the script now, before we run it, and look at some of the Nextflow-specific aspects. On the first line, we've got the shebang, something you'll commonly see at the top of a Nextflow script. It's not strictly necessary, but it's always good practice. You've got the greeting parameter. Parameters are special in Nextflow: they're defined with this params scope, and they're special in the sense that they can be set in many places — you can define them in the script here, put them in the config, set them from the command line, or use a parameters file. The whole purpose is that there may be many different ways you wish to change a parameter when running your pipeline, and we'll often provide a default value in the workflow script itself — the .nf file — so the pipeline runs successfully out of the box. The next line is our first definition of a channel: we're creating a channel called greeting_ch. It's an assignment, because we're saying greeting_ch equals, and Channel.of is a channel factory — our first look at how channels are created. This channel is very simple: it contains a single element, the string "Hello world!". You can just think of "Hello world!" being placed inside it. Going through the processes: we've got a process called splitLetters, which takes a single input and runs printf on it. That input is a value, not a file — we can tell from the val qualifier. It pipes the value into split, creating our two files prefixed with chunk_. The key thing is that the output of this process captures anything matching the glob chunk_* — which, in our case, we know will be chunk_aa and chunk_ab.
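Here's roughly what that first process looks like — a sketch that should be close to the hello.nf in the training repo:

```nextflow
process splitLetters {
    input:
    val x             // a value, not a file — that's the val qualifier

    output:
    path 'chunk_*'    // capture anything matching the glob: chunk_aa, chunk_ab, ...

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}
```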
The next process is convertToUpper, which takes a path. In Nextflow, a path is essentially a file — a file object — and Nextflow doesn't make a distinction between files and paths here; "path" is what we call the qualifier on that input, so we've got path y. The task itself does a cat of that file and converts its contents to uppercase. And then there's an interesting qualifier on the output: instead of capturing a value, or a path, it just takes stdout — whatever this task prints gets captured into the output. So far, that's the definition of the parameters and the processes. Finally, we bring all of it together in the workflow, which is where everything is executed. We define a new channel called letters_ch, which takes splitLetters called with greeting_ch as its input — it's an assignment, saying: take the output of splitLetters and give it the name letters_ch. Then we pass letters_ch into convertToUpper, which gives us a results channel, which we view. That's a lot of information, and we've pretty much covered everything — if you captured all of that, good effort. What we'll do now is go through it in practice and see what happens when we run it, and you can follow along. You should be able to say nextflow run hello.nf — if you look inside that directory, you'll see a file called hello.nf; it's inside the nf-training folder, and if you double-click hello.nf you'll see the file in the editor. I'm going to close that to make things a little more legible. So, here you can see it's run through. The first information you get is the version of Nextflow we're running; you can see we're launching hello.nf, which is a local script here, and that we've got three tasks executed on the local executor. The first process, splitLetters, produced just one task; the convertToUpper process created two tasks — we can see that from the "2 of 2" — and the resulting output of all of this is "HELLO WORLD!" printed to the screen in uppercase. Let's try running that again: keeping everything the same, I just press up and run nextflow run hello.nf — and notice that something a little different has happened. Instead of HELLO WORLD above, I've got WORLD HELLO, and when you do this, you'll get essentially random ordering. The reason is that those two convertToUpper tasks are running purely in parallel at the CPU level, which means there's essentially no way to know which of them will finish first — the output order just alternates.
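For reference, here's roughly the rest of the script we've just been running — the second process plus the workflow block, again sketched from the training's hello.nf:

```nextflow
params.greeting = 'Hello world!'

process convertToUpper {
    input:
    path y            // the path qualifier: a file object

    output:
    stdout            // capture whatever the task prints

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    greeting_ch = Channel.of(params.greeting)
    letters_ch  = splitLetters(greeting_ch)
    results_ch  = convertToUpper(letters_ch.flatten())  // more on flatten shortly
    results_ch.view { it }
}
```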
This is very powerful when you imagine it being run not across two files, but across thousands of files, where you get full parallelization and each of those tasks is maybe not running on this CPU but being sent out to a distributed compute infrastructure. Okay, so that's a brief look at that pipeline and how you execute it. What other information do we see there that could be useful? You'll notice convertToUpper says 2 of 2. To show what's actually taking place, I'm going to run it again with -ansi-log false — this is purely for teaching purposes — and when I do, you'll see each individual task written out one by one. When you have hundreds of tasks this isn't very useful, because it's difficult to keep track of, but here you can see splitLetters, which runs once, then the first convertToUpper task and the second convertToUpper task. The other important thing here is these hash numbers: the first task has something like 09/cd2..., the second 97/...18..., and so on. That number is what we call the task hash, and the task hash becomes very important when we think about resumability, and about modifying the pipeline. I mentioned that each Nextflow task lives in its own work directory — it's essentially an independent task — and that hash is the actual location where the task runs. The other nice thing is that the hash is composed of all the information of the task: for example, the input files, the script section, the container. Any change to any of those things changes the hash, so the working directory itself ties all of this together. And this becomes important when we come to resume a pipeline. Say I've run through this pipeline and I'm happy with it, but then I realize I've made a mistake or I want to change something — in my case, instead of doing the cat in convertToUpper, I want to reverse the file contents. I save that, keep everything else the same, and when I run it I add -resume — a single hyphen. When it launches, the first process, splitLetters, shows the same "1 of 1" but says "cached", whereas the second process, convertToUpper, has both tasks run fresh from the beginning. And you can see why: we changed convertToUpper itself, which changed the hash, which invalidates the cache, so those tasks are relaunched. If I then run nextflow run hello.nf -resume again, Nextflow looks up the work directory, sees what's available there, and caches those tasks: now the convertToUpper section shows "cached" as well, and notice the 09/cd... hash stays exactly the same — the task has not had to be rerun.
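The tweak from the demo looks roughly like this — per the training material, it's a matter of swapping cat for rev, which changes the script block and therefore the task hash:

```nextflow
process convertToUpper {
    input:
    path y

    output:
    stdout

    script:
    """
    rev $y | tr '[a-z]' '[A-Z]'
    """
}
// rerun with:  nextflow run hello.nf -resume
//   -> splitLetters reports 'cached' (its hash is unchanged),
//      both convertToUpper tasks run fresh (their hash changed);
// run once more with -resume and convertToUpper is cached too.
```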
There are more sections on this later, and some blog posts that go a lot deeper into exactly what makes up the hash and what happens behind the scenes. Nextflow's resume is very powerful when you've got a very large analysis — maybe one that runs for days — and you don't want to rerun everything from scratch. You can even share this across pipelines, or with colleagues using the same working directory: if they've run that task before with everything identical, you don't have to run it yourself. Okay, let's talk now a little bit about pipeline parameters. Our first pipeline parameter was greeting. Think about how you'd run this pipeline with some new input data. One way would be to go into the script itself and modify it, but that's maybe not the most efficient way — especially if our code were on GitHub, something we perhaps don't want to change. To change a parameter in Nextflow, we have the option of saying dash-dash followed by the parameter name — a double hyphen. So where a single hyphen, as in -resume, is for Nextflow's own options, a double hyphen specifies a pipeline parameter. In this case I'll say --greeting, and instead of "Hello world!" I'll pass "Bonjour le monde!". Running through that, remember what the first process is doing: it splits on six characters. Because of the longer string, instead of two generated files — chunk_aa and chunk_ab — we now have three, and again they all run in parallel, so we get three tasks of convertToUpper. If I passed in an even longer string, Nextflow would split it into more chunks and run each of those in parallel as well — here we end up with seven tasks being generated. I'm only showing you a string here, but the principles are exactly the same for real data. Okay, let's consider the next step. I'm going to revert my change from before — putting my second process back the way it was — and run through that again. There's one final thing I haven't mentioned so far, which is this flatten. To see what flatten is doing, I'm first going to copy this section out and have a look at what letters_ch actually is. Any time you want to see what a channel contains — this one or any channel — you can call view on it, and it will print the contents to the screen. Running through here, letters_ch in this case is made up of three files: chunk_aa, chunk_ab, and chunk_ac. But notice that all three files arrive together as a single element — the square brackets are the giveaway that they're all in the same element.
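Sketched out, the experiment looks something like this (the printed paths are illustrative):

```nextflow
workflow {
    greeting_ch = Channel.of(params.greeting)
    letters_ch  = splitLetters(greeting_ch)

    letters_ch.view()
    // one element holding all three files, e.g.
    // [/work/.../chunk_aa, /work/.../chunk_ab, /work/.../chunk_ac]

    letters_ch.flatten().view()
    // three elements, one file per line — so a downstream process
    // now runs once per file instead of once in total
}
```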
Because they're one element, anything downstream of that would essentially only run once. If I put flatten on the channel and then view it, launching everything exactly the same, you'll notice that it runs through, but each of the generated files now appears on its own line. Flatten is our first example of what we call an operator. Operators let you manipulate or modify the contents of a channel, or of multiple channels, so that you can get the output you want — and since it's the channels and their contents that drive the execution, this becomes really important. To illustrate: see if you can guess what happens if I keep everything exactly the same as before, but remove flatten. Now the channel passes a single element rather than three, so instead of convertToUpper running three times, it only runs once. It's this modification of the channel itself which changes how the execution is driven. We're going to see a lot more of this — when we define channels, when we create them, and how we can modify them. There are 50 or so different operators you can use to pull this together, and it's really one of the powerful parts of Nextflow. There's a bit more description in the materials if you want to look into it — section 2.4 — and below that you'll see a view of what that looks like as well. With that, I'm going to stop this section here, because we have a lot to cover afterwards. We're going to take a 10-minute break or so now, and when we come back we'll jump into the section on creating a real pipeline: a simple RNA-seq pipeline made up of four different steps, using some real data — we've got some FastQ files, we've got some FASTA files — and we'll run through it with some real software. So I'll stop here; let's come back in 10 minutes or so, at five past the hour. I'll see you all very shortly, and I'll be passing you over to Abhinav, who will be driving the next session. Thanks very much, folks. — So, welcome back everyone. The short break is over, so let's get moving with the rest of the session. In this session we're going to build a proof-of-concept RNA-seq pipeline, building upon the concepts that Evan has already shared earlier. This time we'll try to dive a bit deeper, starting from parameters and finishing with a series of processes to complete the pipeline.
For people who are joining just now, I'd like to highlight that the details about the event are on the nf-core website, and if you'd like to ask any questions, please feel free to use the channels on the nf-core Slack. If you're not a member of the Slack already, you can join using this link, and our team will circle back on any questions you might have. Before we move on, I'd quickly like to show the datasets we have. These are available to anyone who already has a Gitpod environment up and running: if you go to the left-hand side and click on the Explorer option, you'll see folders like asciidocs, hands-on, and nf-training. What's relevant for us today is the nf-training folder, and within it you'll see a folder called data. The data folder has a lot of subfolders, but the most relevant one today is the dataset for the chicken genome. We have a transcriptome file for the chicken genome — a standard FASTA file — though we don't have a lot of data; this is not the full dataset for the chicken genome, because we want to keep the pipeline and the iteration times very short and fast. Similarly, we have FastQ files from gut, from liver, and from lung, and these contain what you'd expect from any FastQ file: the read IDs, the quality scores, and the sequences themselves. With that said, for the rest of the session we'll make our way through a series of scripts and the relevant section of the training documentation — the link to the documentation has already been shared in the Slack channel; we'll cover section 3. In this section — let me just make my screen wider — we will create a proof-of-concept RNA-seq pipeline: we'll index a transcriptome file, do some quality control, do quantification against the transcriptome, and then combine the results of quantification and quality control into a MultiQC report. Keep in mind this is not the full-fledged nf-core/rnaseq pipeline — it's not to be confused with that pipeline, which is much more comprehensive. This is our own proof-of-concept, toy RNA-seq pipeline. So without further ado, let's get started. On the left-hand side, you'll see these scripts are already there in your Gitpod environment. Let's open script number one. This script contains a bunch of parameters and a print statement. Parameters, as Evan mentioned earlier, have a special role within a Nextflow pipeline, because they allow you to dynamically provide a value you want to use in your pipeline. Anything named or stored in a pipeline using the params scope can automatically be overridden on the Nextflow command line when invoking the script.
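script1.nf is roughly the following — a sketch from the training material:

```nextflow
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"

println "reads: $params.reads"

// override a parameter at launch time with a double-dash option:
//   nextflow run script1.nf --reads my_custom.fastq
```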
If we just invoke this script with its defaults — nextflow run script1.nf — it prints the value of reads, which is the location where the FASTQ files are stored, the ones I showed you earlier. But notice one thing: the printed path is slightly different from what we wrote. In params.reads we specified $projectDir, then data/ggal and the gut FASTQ files; however, when we run the pipeline and print the value stored in reads, it comes out as /workspace/nf-training and then data/ggal/gut and so on. projectDir is a Nextflow-specific mechanism for specifying the location of your data, relative to the script, across various scenarios. Right now the resolved value of projectDir is the directory from which I'm invoking the script — my working directory is exactly the path that gets prepended to data/ggal and the gut FASTQ files. Later on we'll see that this pattern is especially useful when pipelines are sourced from a Git hosting service like GitHub or Bitbucket: as pipeline developers we have no knowledge of a user's home directory or working directory, and in that case projectDir becomes the abstraction layer we can rely upon. So that's basically it about projectDir and params.reads.

As I mentioned, we can override any of these values on the Nextflow command line; anything encapsulated in the params scope automatically gains access to the command line. For example, I could override multiqc — but that would mean adding other parameters to this script, so let's override reads instead: if I pass --reads my_custom.fastq and run the pipeline, the value of reads is no longer the default defined here but the one I specified on the command line. So the standard best practice whenever you write parameters in a pipeline is to specify sensible default values, and the people running the pipeline can always customize those values when they invoke it. Now, you might wonder why we don't just use an ordinary variable, say temp or var, and use that in the script. The answer is that such a variable would not gain access to the command line, because Nextflow gives special status to the params scope. So for anything you want to expose on the command line, we recommend you use params. followed by that variable name.

A quick note about all of these scripts and sections: the first script is about parameters, and it has a corresponding section in the training documentation, which we've just covered. With almost every section there are also a couple of exercises, so let's take a minute to try them, and then I'll cover them as part of the next script. It's worth highlighting that the answers are available in the documentation itself, but you're highly encouraged to try applying the principles we're learning. The first exercise is about adding a new parameter called outdir and giving it a default value — for now, it could be anything.
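Picking up the params-versus-plain-variables point from above, a minimal sketch (the variable names here are just for illustration):

```nextflow
// Only values in the params scope are exposed as command-line flags.
params.outdir = 'results'   // can be overridden: nextflow run main.nf --outdir my_results

temp = 'scratch'            // an ordinary variable: no --temp flag is generated for this

println "outdir: $params.outdir, temp: $temp"
```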
The second exercise is about thinking from the user's perspective. When you make pipelines, you have default values, and some values can be overridden by the user. A best practice is to print out all of the pipeline's parameters when it is invoked, so that nothing stays hidden. Say a parameter x is required by the fourth step in the pipeline: that step would fail simply because params.x was not specified correctly, and that does not make for a good user experience. This exercise highlights a way to print out all of the parameters — whether overridden or left at their defaults — at the beginning of the run. For anyone who has not been able to guess or figure this out, it's not a problem: we provide the answers, and if you move to the next script the solution is there as well, since these scripts build upon each other.

So let's move on to script number two and close script one. In this script we can see there is already a parameter called outdir, and its default value is results; we can always override it from the command line. The most interesting part, though, is the use of log.info, which prints out all of the parameter values: if the pipeline is using a default, the default value is printed, and if a value has been overridden, the updated value is printed instead. Let's run this — and if you want to experiment with the previous script, you can just copy these two sections into it and run that again. You'll see this banner, and this is probably familiar to anyone who has been running nf-core pipelines: that green nf-core banner with the list of all the parameters is a good real-world, production use of this technique. You can see the updated value for reads showing up, along with the default values for transcriptome and outdir.
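The log.info idiom looks roughly like this (paraphrased from the training material; the banner text is illustrative):

```nextflow
log.info """\
    R N A S E Q - N F   P I P E L I N E
    ===================================
    transcriptome: ${params.transcriptome_file}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent()
```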
Now, staying in script number two, what we're going to focus on is the first real building block of this proof-of-concept pipeline: the process block. It might help you to think of a process as a blueprint for all of the tasks that your pipeline will run. A process consists of specific sections, which we call directives in the Nextflow DSL: input is a directive, output is a directive, script is a directive, and these are the most common ones you'll see in a process definition, but there are a lot more. If you open the docs and move to section 3.2, you'll find the documentation for process directives; there are around 30 of them, used in specific cases depending on whether you want to control the parallel execution of tasks, run on a cluster with specific options, run in a cloud environment, or in a Docker container. But the principal ones are input, output, and script. You can also think of a process as a blueprint that specifies a contract. Why do I say contract? Because any other process — technically, any other channel, any other part of your pipeline — that wants to interact with this process has to satisfy that contract. There are internal contracts and external contracts; let's start with what's inside the process scope itself.

Within the process scope we see the input block, and its content is path transcriptome. What do we gather from this statement? It means the input of the index process is supposed to be a path-type value — a file path that Nextflow can stage and build upon. The word path is what we call the qualifier, and transcriptome is the name of the variable, which you can use within the script section. If I change the name here, say to my_transcriptome, then I'm breaking the internal contract, because the identifier used in the script no longer matches what gets staged. So we need to make sure these contracts are satisfied.

Let's focus on the script block for a moment, because as soon as the process starts, it takes care of the input, runs the script, and takes care of the output; so the next logical directive to discuss is script. In this script we run the tool Salmon, to do the indexing of the transcriptome file. Between the triple quotes we are technically mixing the Nextflow domain-specific language with a shell language; you can put in any scripting language you prefer — Python, Ruby, whatever — but by default it is expected to be a Bash script. When this blueprint receives a specific file, it creates a task specialized for that file: if you go into the task directory — the hashed process directories Evan mentioned earlier — you will not see these variables; you will see the specific values and paths they resolved to. Then, when the task runs, the output block — in our case the salmon_index folder — tells the index process that as soon as the script completes, salmon_index should be staged out. So inputs are staged into a process, and outputs are staged out of it; that's the overall flow within a process.

Now, I haven't mentioned task.cpus yet, but this is the time. task is a special value within a process; it has fields, and these fields correspond closely to the process directives you'll see in the documentation. Once the process blueprint has been specialized into a task, you can use values specific to that task. cpus, for example, is a directive we could specify within the process; since we have not specified it here, it defaults to one, so by default any of your processes will require and make use of a single CPU. We'll run this pipeline next and try to understand how all of this works out in practice.

The workflow section — the workflow keyword — is new to DSL2, and what it helps us do is connect the invocation of one process to a chain of other process calls. In this version of Nextflow, DSL2, the definition of a process is separate from its invocation.
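Put together, the index process we've been dissecting — including the workflow block we're about to discuss — looks approximately like this (paraphrased from the training's script2.nf):

```nextflow
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"

// Blueprint: index a transcriptome with Salmon.
process INDEX {
    input:
    path transcriptome

    output:
    path 'salmon_index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i salmon_index
    """
}

workflow {
    index_ch = INDEX(params.transcriptome_file)
}
```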
The closest analogy — though not a perfect one — would be a function. In a language like Python you might define a function add_nine, and everything it does is defined inside that function; but to call the function you use a specific syntax: you write add_nine and provide an argument, say 1. Similarly, to invoke the index process we defined earlier, we provide an argument; in this case the argument is params.transcriptome_file, whose default value is the path to the transcriptome file. And that's how you invoke a process.

Let's quickly run the script. This time we run into an error, and to understand it we have to focus on the "Command executed" section — you can see the command that was executed within the task — and the command error, which highlights that the salmon tool was not found. And that is true: we have not installed Salmon in our Gitpod environment. To run with Salmon you need to add the -with-docker option on the Nextflow command line, and then the Nextflow runtime will make use of the available Docker container, which has Salmon installed, to run this script. This time the process completes, and again, if we look into the work directory under the process hash and check the .command.sh file — the script file — that is the final task definition which Nextflow ran. With this we are done with the second script, and I'll give you a quick 30 seconds for the exercises.

Now, it is not very effective or command-line friendly if you always have to specify -with-docker for all of your runs. What we can do instead is simply add docker.enabled = true to the nextflow.config file; then we don't need to specify -with-docker at all, and Nextflow will automatically enable the Docker configuration and use Docker to run the script. You'll also notice another option there, process.container: this is how we tell Nextflow which particular container to use to run the pipeline. We'll talk more about this in the upcoming session.
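The nextflow.config change just described, as a minimal sketch (the container name follows the training material; treat it as illustrative):

```groovy
// nextflow.config
process.container = 'nextflow/rnaseq-nf'
docker.enabled    = true
```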
For now we're going to move on to the next script — and feel free to ask any questions you might have on the Slack channel; the team will be very happy to answer them. In the third script, what we're going to do is understand the concept of channels more deeply. We've talked about processes, which are like blueprints for tasks that have to follow contracts — a contract on the input, a contract on the output, and a contract on how those are used within the script section. In script 3, however, there is no process at all, so we're going to focus purely on channels. The first part is already familiar to us — we've seen it earlier — so let's focus on one line, line 18 of the script. It invokes a channel factory, which is what we call Nextflow's native facility for creating channels. There are many channel factories; so far we've been using Channel.of, which Evan mentioned in the context of the hello.nf script, but this one is fromFilePairs, and it's very relevant to bioinformatics, because FASTQ files often come as a forward read and a reverse read, where the names are otherwise identical and the orientation is denoted by a 1 and a 2.

If we run script number three, we won't see anything of substance yet — just the log.info output — because nothing else is being done. What we want is to understand the contents of this channel. Since we are saving what the channel factory generates into a channel called read_pairs_ch, we can append view, and this time when we run Nextflow we see a slightly different result: the content of read_pairs_ch. Let me paste it here and highlight a few things. What fromFilePairs generates is a collection of values: the first value is the sample name — gut, since we're currently focused on the gut files — and the second value is itself a collection containing the full paths to the forward read and the reverse read. That is the shape of the output of the fromFilePairs channel factory. If you go back to the documentation, you'll see we link to a lot of other channel factories — create, of, from, and so on. You can even directly specify an SRA ID, and Nextflow will download the samples and give you results in a similar shape: the SRR identifier and then the paths. With that we conclude script number three and section 3.3, and next we'll quickly discuss the exercises.

The first exercise of section 3.3 is essentially a stylistic point. Instead of assigning the value generated by this channel factory using the = assignment, we can say .set { read_pairs_ch }: this is semantically the same, but perhaps easier to read — you start off by creating the channel, provide the input, and then store the result under a channel name. If you run it again with read_pairs_ch.view(), you'll see the same result. The second exercise is something quite useful when you're not sure whether all of the samples you've specified actually exist. If you go back to the documentation for fromFilePairs, you'll see that options are available for every channel factory, and the one we want here is checkIfExists. Right now, if a file specified in the channel does not exist, Nextflow will not fail explicitly; but sometimes you want Nextflow to fail as soon as a file you are expecting is missing from the file system. As a practical example: suppose instead of gut I say heart; I just need to set checkIfExists to true, and if I run the script now, it should fail, because the expected pattern and files are not found within the data folder.
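As a standalone sketch of what we just did (the paths follow the training data layout):

```nextflow
// Create a channel of [sample_id, [read1, read2]] pairs and fail fast on missing files.
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"

Channel
    .fromFilePairs(params.reads, checkIfExists: true)
    .set { read_pairs_ch }

read_pairs_ch.view()
// Prints something like: [gut, [/workspace/.../gut_1.fq, /workspace/.../gut_2.fq]]
```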
So checkIfExists is a pretty useful option, and it highlights that all of the channel factories can be customized — their behavior can be tweaked using options. There are options for including hidden files when Nextflow does its globbing, and followLinks for when you have collected symbolic links into a single directory and want Nextflow to traverse that directory. As soon as you start using Nextflow more and playing around with channels, these options become very handy.

With this I'll move on to script number four, which adds another process. I'm going to fold some things away so that we can focus on the main content. In script number four we have added a new process called quantification, and I would say the most interesting thing — the one that should stand out — is the shape of the input expected by this quantification process. Let's go back to the terminology of blueprints and contracts: the quantification process blueprint requires its input channels to fulfill a contract of a particular shape. These are positional arguments, and it is the second one that needs to have the tuple shape. Let's jump down quickly and see how this translates into usage. We create a channel with fromFilePairs and store the result in read_pairs_ch. We invoke the index process with one file only — note this is a separate file, not the same file we used to create the read-pairs channel — and then we pass the channel we created earlier as the second argument to quantification. As we discussed previously, its contents have exactly the shape that satisfies the input contract of the quantification process, and that is the reason we are able to specify this channel as the second argument.

So how do we know the shape of a channel — how do we know its contents are in the shape we expect? We can always rely on the view operator (operators are something I will cover properly as well). We don't have Salmon installed locally, but we've already enabled Docker, so this just works — and viewing is again how we check the contents of a channel. Let me also quickly mention the script section of the quantification process. The CPU setting, as we've already discussed, defaults to one, but we could also hard-code it to two or four — anyone using the Gitpod environment should have access to rather more CPUs than that. The interesting part is here: the content of the channel is a sample name and then a collection — technically we call this a tuple — of two file paths. In the salmon quant command we pass the first value of reads, reads[0], as the first reads argument, and the second value, reads[1], as the second. If we look in the work directory at the task's .command.sh file, we can see that the file paths and names have been substituted in: reads[0] and reads[1] have been replaced by what we sent into the process.
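Putting it together, the quantification process and its invocation look roughly like this (paraphrased from the training's script4.nf, and assuming the INDEX process sketched earlier):

```nextflow
process QUANTIFICATION {
    input:
    path salmon_index
    tuple val(sample_id), path(reads)

    output:
    path "$sample_id"

    script:
    """
    salmon quant --threads $task.cpus --libType=U \\
        -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    """
}

workflow {
    read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)
    index_ch      = INDEX(params.transcriptome_file)
    quant_ch      = QUANTIFICATION(index_ch, read_pairs_ch)
}
```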
That's a quick overview of all the changes in this script. We don't have a lot of exercises here, but there is one more thing we can quickly do: extend the analysis to the other samples. Right now we have limited the analysis to only the gut samples — what if we want to run it on all of the FASTQ files in that directory? In this case we use -resume, because we don't want to rerun the analysis we have already done; Nextflow will use the cached results from previous runs. You can see the first process is cached, and Nextflow ran two more tasks: the gut samples were already analyzed, so only the liver and lung samples were analyzed now. The caching mechanism is very handy: you can start with a small number of samples while developing your pipeline, and then, when you're more comfortable and sure about how it behaves, simply update the patterns — you don't have to limit yourself to a handful of samples forever. And with this we conclude script number four.

In script number five the most interesting change — I'm just going to fold the other processes away — is the addition of a new process, fastqc. At the beginning, when we talked about the design of this pipeline, we said we would index the transcriptome, quantify against it, do quality control, and produce a final summary using MultiQC; so we have now added the fastqc process. Notice that this time we add another directive, called tag, and in this directive we use the value the process receives from its input. What this does is that, instead of showing just a bare process name and counter, we are telling Nextflow to enrich the information shown on the command line, and in other locations as well, with this tag. This means that if we run script number five, we'll see tags — even on quantification, because we added a tag there as well. And indeed, this time you see "Salmon on gut" and "FASTQC on gut". If we also disable ANSI logging — because, as mentioned in the previous session, with ANSI logging enabled Nextflow keeps updating the overall pipeline status within one single line, and the trade-off is that you don't get to see all the tags on all of the tasks — we can update the pattern to include the lung and liver samples too, and resume. This time we can see FASTQC running with tags for liver, lung, and gut, and the same is true for the quantification step. That is how you can make use of tagging.
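The FASTQC process with its tag directive looks approximately like this (paraphrased from the training's script5.nf):

```nextflow
process FASTQC {
    tag "FASTQC on $sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    """
}
```

A run along the lines of `nextflow run script5.nf --reads 'data/ggal/*_{1,2}.fq' -resume -ansi-log false` then shows one tagged line per task.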
Now, we have skimmed over one very important directive, and that is publishDir. As of now, all of the analysis we've done resides within the work directory; if we run tree — say, tree with a depth of three — we can see the results of the analysis buried inside these hashed task directories, and it's not very practical to store results in such a nested layout. That is the reason we want to make use of the publishDir directive. If I want to publish the results of quantification into a specific location, I can specify it here — say, my_quantification_results — and just run the Nextflow command again, with -resume so it doesn't actually spend any time computing anything and just uses the cache. This time the cache was used, and there is a new folder, my_quantification_results: all of the outputs generated by that process are now stored there. So you don't even need to run the pipeline again — you can use the cached results to publish the files you want into a separate directory. If I want to publish the FastQC results into a different directory, I can likewise say my_fastqc_results. Let's let this one run first — and again, no processes actually ran, everything was cached, but the FastQC results are now also published in my_fastqc_results. You don't need to rerun the pipeline to save the results in a different location.

What I also wanted to highlight is the use of params.outdir. Since we want to allow users to publish results into a directory of their choice — not ours — we should point publishDir at params.outdir rather than hard-coding a specific directory name. Here the default value was used, so the results are published into the results folder, and that's how you can provide customizable publish locations.
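As a minimal, self-contained sketch of the directive (the process and file names here are made up for illustration; mode: 'copy' is one common choice — by default Nextflow symlinks outputs into the publish directory):

```nextflow
params.outdir = 'results'

process MAKE_REPORT {
    publishDir params.outdir, mode: 'copy'

    output:
    path 'report.txt'

    script:
    """
    echo 'pipeline report' > report.txt
    """
}

workflow {
    MAKE_REPORT()
}
```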
With this we move on to script number six. Here we see a lot of familiar things already — the log.info block, the index process, the quantification process, the fastqc process, which I'll fold away — but this time we have another process, for MultiQC. We are almost at the conclusion of the pipeline, because the goal was to have indexing, quality control of the FASTQ files, quantification against the transcriptome, and a MultiQC report. Whenever you want to add a new tool to your pipeline, you create or wrap that tool in a process, and everything we have discussed about publishDir, input, output, and script applies within that process itself. But for MultiQC we have an interesting use case, so let's scroll down and focus on one line — the rest of the lines we have seen already. MultiQC can summarize a wide variety of files, and we want to use it to summarize the FastQC results as well as the quantification results. As we've discussed previously, the only way to pass content into a process is through a channel, so we have to create a channel that combines the results of multiple runs of fastqc and multiple runs of quantification, put all of those results together, and push them into MultiQC. This is the acrobatics we need to do with channels.

What we can see here is that we have three main channels, which are the outputs of the individual processes: indexing has index_ch, quantification has quant_ch, and fastqc has fastqc_ch. Since we want to combine the contents of quant_ch and fastqc_ch, we have to make use of something called operators. We've actually been using operators already — we just didn't highlight it: set is an operator, and similarly, mix is an operator. Let's untangle this a bit. We can write it out step by step — this is perfectly valid Nextflow syntax — and assign the result to, say, multiqc_input_ch, the channel name that comes to mind right now. We combine the contents of the quant_ch channel with the contents of the fastqc_ch channel using mix, and then collect them together, because we only want to provide a single value — a single collection of values — to MultiQC; that combined result is what we store in multiqc_input_ch. Operators are essential if you want to do advanced channel manipulation. Just like channels, operators have a separate section in the Nextflow documentation, and everything we have discussed — and most of what we haven't — is described there: you'll see repeated use of view, mathematical operators, forking operators, combining operators; we have just made use of a combining operator, mix. The Nextflow docs are very good here, because they show you the behavior of each operator and construct right there.

So this is how we combine the results together, and if we run this script with a view added, we'll see the overall contents of the channel that is pushed into the MultiQC process. Let's run script six: this is the content of the channel — we have the FastQC results and the quantification results, but these relate only to gut. Maybe we want to run this with all of the other samples as well, so let's do that, and this time go for cleaner terminal output; let me make this bigger. This is the content of the MultiQC input channel. And if we inspect the MultiQC task directory, we will see symbolic links to all of the results of the previous tasks — focusing on those links, it has the FastQC results as well as the quantification results for liver, lung, and gut. This is how you can combine the outputs of various processes together. It really helps — at least it helps me — to think mostly in terms of channel manipulation: channels have a shape, and if you want to change the shape of a channel to match the contract of a process, operators are your friends.
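Here is the mix-and-collect pattern as a standalone sketch you can run on its own (the toy string values stand in for the real result files):

```nextflow
workflow {
    quant_ch  = Channel.of('quant_gut', 'quant_liver', 'quant_lung')
    fastqc_ch = Channel.of('fastqc_gut', 'fastqc_liver', 'fastqc_lung')

    quant_ch
        .mix(fastqc_ch)   // one channel emitting the elements of both (order not guaranteed)
        .collect()        // gather everything into a single list element
        .set { multiqc_input_ch }

    multiqc_input_ch.view()
    // e.g. [quant_gut, quant_liver, quant_lung, fastqc_gut, fastqc_liver, fastqc_lung]
}
```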
With that we have concluded the sixth script, and we'll move quickly to script number seven. Here again there are many familiar things we have already talked about — I'll just fold them away — and the only genuinely new thing I'd like to discuss is the workflow.onComplete section. This block is special because it works at the level of the workflow: we define a workflow as the interaction of channels and various processes, and when we run it, Nextflow of course tells us on the command line that it has completed — and that's it. Let me run script six again just to show the current behavior before we move on to script seven. Nextflow tells us that it has processed all of the samples as per the pipeline design — but nothing more. If you want Nextflow to explicitly say "your analysis has completed, and here are the most relevant results you should look at", you can do that with the onComplete section. Here, once again, is our good old friend log.info; we use it to dynamically tell the user where they need to look. This is a ternary expression: if the workflow fails, we want to tell the user that message; otherwise we tell the user the location of the MultiQC report and invite them to go have a look. And that's pretty much it for script number seven. Technically, this is called the completion event of a workflow, and you can use this event for things like triggering an email: if something fails, trigger an email; if the workflow passes, trigger an email to such-and-such a person. You can customize a lot around this, but to begin with, I think having an onComplete event handler is a great place to start.
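The handler looks roughly like this (paraphrased from the training script; the messages are illustrative):

```nextflow
workflow.onComplete {
    // workflow.success is true only if every task completed without error
    log.info ( workflow.success
        ? "\nDone! Open the following report in your browser --> $params.outdir/multiqc_report.html\n"
        : "Oops .. something went wrong" )
}
```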
So folks, before we jump to the next section, let's have a quick five-minute break, and then we'll talk about custom scripts and move on to running remote pipelines and Docker. We'll be right back.

Welcome back. Before we get started, I'd quickly like to highlight that whatever content we're covering in these workshops, we will dive deeper into all of it in later sections; right now we just want to cover the big picture first. Nextflow can do a lot more — there is far more customizability of pipeline design than we can show here. And if this felt fast, or you didn't completely catch everything we discussed, that's completely okay — it's even expected. Feel free to go through the training content again, make use of the Gitpod environment to revisit these concepts, and ask questions on the Slack channel; our team is more than happy to respond.

Let's kick off this section: I'll quickly cover custom scripts, and then we'll move on to Docker and containers. Let's talk again in terms of script number seven — I'll undo all of my changes here — and look at the FastQC process. You'll see this process is a little different from the others, because it has multiple lines of code within its script section; MultiQC is basically a single line, and the same is true for the index and quantification processes. Nextflow natively has no limit on how big the content of the script directive can be, but you can easily imagine that it becomes cumbersome to maintain and read a very large script within the context of the Nextflow DSL itself. So perhaps a better way is to create a separate shell script — any ad hoc scripting language would do, but let's use a Bash script. What I'll do is create a new file, paste in the content — I copied it from the custom scripts section of the documentation, section 3.9 — save it inside the bin folder, and call it fastqc.sh. This file now exists on my file system, within the bin directory. The next thing, as you might be used to doing with all shell scripts, is to make it executable, which we do with chmod. Now fastqc.sh is executable, and it does the same thing we were doing inside the FastQC process — but it gives us the benefit that we can remove all of those lines and just write fastqc.sh in the script section, and the pipeline works the same. You can see how much flexibility Nextflow gives you in the way you structure your projects: how you organize your scripts, how you connect processes together, what happens after the workflow completes — there are a lot of extension and configuration points you can rely upon.

Notice that the FastQC process ran again here: because I changed the content of its script, the fingerprint — the hash — of the process changed, so it had to rerun. And because FastQC ran again and generated new outputs, MultiQC, which depends on the output of FastQC, had to run again as an after-effect. So far, all of these scripts live locally, and with script number seven we have essentially concluded the development of our proof-of-concept RNA-seq pipeline. But there are a lot of pipelines available in the community — nf-core is of course a great example, and many other groups and people have published pipelines in their own GitHub repositories — and you can run a remote project directly. You can give a full HTTPS GitHub link, but if you just specify the project in the short organisation/repository form, Nextflow's default understanding is that it lives on GitHub. You can also select a revision — say you only want to run the dev branch, or the master or main branch, in your environment — by using the -r option. Whenever you run a remote pipeline, Nextflow pulls it, stores it in its local cache, and then runs it. The pipeline I'm running here is the full version of our toy proof-of-concept RNA-seq pipeline, which runs on many platforms, and it's a good pipeline to experiment with and customize on different platforms. I'm going to cancel this run — it will work, and you can try it in your own environment — because in the interest of time I'd like to move on to the next section: dependency management for your pipelines. As Evan mentioned earlier, dependency management is a big problem when it comes to having reproducible workflows, and there are two major solutions for it.
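Two quick reference sketches for this part. First, the custom script setup described above (paraphrased from the training docs; the argument handling is illustrative):

```bash
#!/bin/bash
# bin/fastqc.sh -- the FastQC commands moved out of the process script block.
set -eu

sample_id=$1
shift                        # remaining arguments: the read files

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q "$@"
```

After chmod +x bin/fastqc.sh, the process script block shrinks to a one-liner such as `fastqc.sh "$sample_id" $reads` — Nextflow automatically adds the project's bin directory to the task PATH. Second, running a remote pipeline, using nextflow-io/rnaseq-nf (the repository this toy pipeline is modelled on) as the example:

```bash
# Run a pipeline straight from GitHub; Nextflow pulls and caches it locally first.
nextflow run nextflow-io/rnaseq-nf -with-docker

# Pin a specific revision (branch, tag, or commit) with -r:
nextflow run nextflow-io/rnaseq-nf -r master -with-docker
```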
The first is managing and installing the packages in a specific environment, and the most popular answer there is Conda; the second is packaging everything up into a shareable unit, and the answer there is containers. In this section we'll see how Docker and Conda work — we can use each of them individually, or combine them effectively to have a reproducible toolset for our pipelines. For this section it's easiest to just rely on Gitpod, and the simplest command for experimenting with Docker is run — coincidentally the only Nextflow command we've used so far as well, although Nextflow has a lot more commands, like pull, clone, and drop. So let's focus on docker run, and say docker run hello-world. What Docker does — similar to Nextflow, actually — is check whether the hello-world image exists locally; if it does not, it goes to Docker Hub, pulls the image, runs it, and exits. This is effectively what Nextflow was doing when I ran the GitHub-hosted RNA-seq pipeline: it pulled it locally, ran it, and exited.

So we've already pulled the hello-world image; this time, let's pull something more substantial: debian:stretch-slim. Pulling these images takes a little time, but now the Debian stretch-slim image is on our file system. Next, let's check what images we have locally with docker images: we have four — salmon, rnaseq-nf, hello-world, and debian — along with their tags. The next thing we want to understand is what is visible and available inside a container. Technically you can make the analogy that you are logging into a container, but in Docker terminology, you run the container interactively. When we do, you can see we are inside the container: the prompt has changed, and pwd gives just /, whereas outside the container I was somewhere like /workspace/gitpod. So we are now in the containerized world of debian:stretch-slim. Let's confirm whether we have the salmon tool, or maybe fastqc — we don't have any of those tools here. So what we want to do is create our own custom container and bundle in the Salmon tool, so that we can use it in the pipeline.

The way we create a Docker image is with a Dockerfile, which you can think of as a recipe. It starts off from a certain base — in our case, debian:stretch-slim, because we've already pulled it, confirmed it exists locally, and logged into it interactively, so we are pretty sure it works. I'll combine a couple of steps here and create the Dockerfile: it updates the apt-get cache and then installs two tools, curl and cowsay. curl is quite useful; I'm not so sure about cowsay, but who doesn't like a talking cow?
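The finished recipe looks approximately like this (paraphrased from the training material; the Salmon version and download URL are illustrative, so check the Salmon releases page for a current one):

```dockerfile
FROM debian:stretch-slim

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/

# Download a Salmon release tarball and put its binaries and libraries on the PATH
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz \
    | tar xz \
    && mv /salmon-*/bin/* /usr/bin/ \
    && mv /salmon-*/lib/* /usr/lib/
```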
So we add the instructions to download and set up Salmon: I go back to the Dockerfile, append the RUN curl command, and then build the image locally. Docker now works through the recipe I specified: it starts off with debian:stretch-slim as the foundation for everything it does next, updates the apt cache, updates the PATH environment variable, and then downloads and sets up the Salmon tool. With that done, the image is built, and we can even log in and poke around: if we run the image interactively — docker run -it my-image — we have access to salmon inside the containerized world, and if we come out, we no longer have access to salmon. It's a tool we have bundled into a container.

The next thing is to give the container access to our data. If we try to run Salmon on a file from the host, we can't, because the container does not have access to the file system of the host: containers are isolated by design, so by default they don't get access to the host file system on their own. We have to run the container with a volume mount, so that the Salmon inside our container has access to the file we want to use — and now it analyzes that file and completes. So instead of using a tool from our own host environment, we can containerize the tool, mount our data into the container — technically, give that tool access to our data — and then run whichever tool we want inside it. This is a great way to make your entire analysis reproducible, and you can publish your containers to Docker Hub and other registries.

Let's move on to running one of the scripts from earlier in the session with the custom container we just created. This time it's script number two, which has the single Salmon index process; we want to run it with our custom container rather than the pre-built rnaseq-nf container the team has provided — and it works as expected. The same approach holds for Singularity. For people coming from an HPC background: you may be aware that Docker requires a daemon running in the background with root access, which is why it's not widely accepted in the HPC community; HPC sites tend to use Singularity or other container runtimes instead. As Evan mentioned in the presentation, Nextflow has long had support for many kinds of container systems — Charliecloud, Podman, and more — so you can apply the same principles: build your own container, bundle your dependencies, and run them within your workflow. With that, I'll jump quickly to the Conda section, because this is the second part of the equation.
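The commands for this part, roughly as typed (the image name and mount paths are illustrative):

```bash
# Build the image from the Dockerfile in the current directory:
docker build -t my-image .

# Run the image interactively to poke around inside it:
docker run -it my-image bash

# Mount the host data directory so Salmon inside the container can see it:
docker run --volume $PWD/data/ggal:/data my-image \
    salmon index -t /data/transcriptome.fa -i index
```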
The way we installed Salmon inside the Docker container — I can open it again quickly — was with curl, followed by moving files from one location to another: essentially setting it up by hand, following the manual instructions. But today we have tools like Conda, and Conda has the wonderful Bioconda community behind it, which has made all of these bioinformatics software tools available in the Conda world. So instead of writing manual installation instructions into a container, we can combine what Conda provides with what Docker provides and create our containers from Conda recipes. Similar to a Docker recipe, a Conda recipe is a YAML file in which you specify a name, channels, and dependencies; you can then take a Docker image that already has Conda in it and build a container from that environment file. You get the best of both worlds: an automated, verifiable, and reproducible way of installing your dependencies, and a verifiable, reproducible way of running them. Creating a container out of this Conda YAML file I'll leave as an exercise for the viewers.

We have covered a lot of content today, but there is one more thing I'd like to highlight before we call it a day and reconnect tomorrow: the BioContainers project. Since the Bioconda community already maintains reproducible recipes for all of these software packages, they automatically build Docker containers for each of them — for FastQC, for example, they do something very similar to what we just did, using the Conda recipe to create the container — and they publish these containers through the BioContainers project. So instead of building your own containers all the time, you can simply pull and run these pre-built ones. That said, you can also build on top of these containers, but that is again more advanced usage. With this, we can conclude the session for today. I think we're a little bit over time, but thank you again for attending. We have covered a lot of things, and rest assured we'll talk in more depth about each of these concepts; if you have followed through the entire content, I think you should give yourself a pat on the back. Thank you so much for watching this live stream, and have a wonderful day ahead.
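For reference, a Conda environment file of the kind shown looks roughly like this (the environment name and tool versions are illustrative):

```yaml
# env.yml -- a Conda recipe for the tools used in this session
name: nf-tutorial
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5.2
  - fastqc=0.11.9
  - multiqc=1.11
```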