Hello, everyone, and welcome to this Nextflow and nf-core online community training event. My name is Chris, and I'm a developer advocate at Seqera Labs, and I'll be presenting the training material over the next four sessions. As part of this event, other presenters will be delivering the same content in one of five different languages: I'll be presenting in English, but we also have presenters for Portuguese, Hindi, Spanish, and French. This training is really a starting point for anyone who is new to Nextflow and nf-core. It is best if you already have a basic understanding of the command line; however, you don't necessarily need to know the bioinformatics tools we'll be using as examples. A lot of what we show is really there to demonstrate concepts and ideas that are part of Nextflow and nf-core, so if you want to apply what we're teaching to other fields, we hope you can do so relatively quickly and easily. Each session will be around two hours long, though there may be slight differences day to day, as well as between presenters in different languages. In session one, we'll cover this welcome, as well as an introduction to Nextflow; we'll get you started with a simple example and then expand on it using a proof-of-concept RNA-seq pipeline. In session two, we'll have a brief introduction to nf-core. We'll go through some of the tools available for nf-core users and developers, and really dig into what modules and sub-workflows are and how they can be used across different Nextflow and nf-core pipelines. In session three, we'll expand on ideas touched on in sessions one and two: how you can manage dependencies and containers, the Nextflow ideas of channels, processes, and operators, an introduction to Groovy, as well as modularization. In session four, we'll continue into more detail on things you will have been exposed to in the previous days, such as the configuration of pipelines, different deployment scenarios, how you can use cache and resume, troubleshooting strategies, and how you can get started with Nextflow Tower. To access the training material, you can follow this link here, training.nextflow.io, and the material is free to access at any time. You will need a GitHub account to access the Gitpod training environment; you can create one quickly and easily on the GitHub website, and I'll expand on this very shortly. Recordings of these sessions will be available after the event as well, so if you can't attend an entire session or want to come back and watch something again, you will be able to do so. For asking questions during this training, please direct your questions to the Slack channels for the different languages. Community volunteers will be available to help you during the event; please be patient if your question isn't answered immediately. Around 800 to 1,000 people have registered at the time we're recording these videos, so during the event there may be more questions than people available to answer them, but we'll do our best to make sure everyone's question is answered in a timely fashion. So let's get started.
This is the training material website, but first we'll have a short introduction presentation, which should take about 20 minutes. Okay, so we'll start off with this introduction to reproducible data analysis with Nextflow and nf-core. When you think about Nextflow and nf-core, you might think about open science: the idea that pipelines, data, and our research should be transparent and accessible to others, to make sure that what we're doing follows best practices. You might also think about open source. Nextflow is an open source tool, so if you're interested in what's happening under the hood, you can go and look at all of the code, because it's all available online, and that's really important for reproducibility and for open science, because you can go away and really understand what's happening inside these tools. You might also think of open community. Both Nextflow and nf-core have open communities where people can contribute, collaborate, and work on software or pipelines together rather than working in isolated silos all over the world. We also know that genomics and pipelines can be complicated. As mentioned in the introduction, we'll be using a lot of genomics examples as part of this training material, but a lot of this can be applied to fields outside the life sciences. One issue is that genomic workflows, like many other workflows and pipelines, can involve a lot of data: a genomic pipeline can produce terabytes of data across samples and across the different stages of processing. You might also find that a lot of different tools are involved: a particular piece of software or script might be written in Python, R, MATLAB, Perl, or just Bash. All of these might need to work together and have different dependencies, and it can be really difficult to combine them all into one large pipeline. We also know that the way data is passed around may be dynamic: a piece of software might pass an output into a different piece of software, but based on some logic it might be routed somewhere else as well, so the way data flows through a pipeline can be dynamic and complicated. To visualize this in a slightly different way, here is a pipeline that was created to annotate a parasite genome. This pipeline has 70 different tasks, including 55 custom scripts and 39 software tools and libraries. Every little circle is one of those tasks, involving a custom script and/or a piece of software or a library, and all the lines show data being passed between these processes or tasks in quite a dynamic way. So pipelines can be incredibly complicated, and reproducibility can be very difficult. By that I mean that if you built your pipeline on one computer and then tried to reproduce it on another using exactly the same software and tools, you might actually get different results. Here is a paper from the early days of Nextflow showing, in panel A, that when you run the gene annotation pipeline I've just shown in different running environments, you can get different results.
So here, for example, you have the same pipeline being run on Amazon Linux with Nextflow and Docker, on Amazon Linux natively, and on macOS natively, and you can see from the overlapping Venn diagram that different genes are annotated in each of those environments. Down the bottom, in panel C, there's a gene transcript quantification for differential expression, and you get the same effect between running environments. However, you can also see over to the right that when you run Nextflow with Docker on two different environments, you get exactly the same results, showing that with the right workflow manager and with containers, you can produce identical results across different running environments, which is really important. We also know that when you read a paper or try to replicate someone else's results, the methods described in the paper or in an online repository don't always contain all the information you need. There's a nice representation of this as an iceberg: underneath the waterline there is a lot of extra information that typically isn't described properly, things like the actual language a tool was written in, the OS, the version, the metadata, the file formats, the availability of software, the parameter options, and the references for all of these. There's a really nice reference at the bottom of that slide if you want to read more. The point is that even if you have, for example, the version of a piece of software, a lot more is required to actually reproduce a pipeline across different sites, different environments, and different researchers or groups. As I already suggested, this is where Nextflow comes in: it can help you with reproducibility. This is probably the best way to describe it. Nextflow is a language. As part of that language we have processes, the single units I'll refer to throughout the rest of the training, which can be thought of as the execution of a single task. We also have the concept of channels, which are used to pass data between processes: a process might turn one file format into another, and we can pass that result into a separate tool using channels. As we string processes together with channels, collectively we refer to the result as a workflow. So our workflow is written in Nextflow, and it is made up of processes and channels. Here is a very simple example of what a single-process pipeline might look like in Nextflow. At the top you can see a process block, which we have named FASTQC. We have given it an input, in this case a path, and an output, which is a path to the FastQC zip and HTML files. At the bottom of the process we have a script block, which is the actual execution of the tool we're running, in this case FastQC. For those outside bioinformatics, FastQC is a tool used to check the quality of an input sequencing file. Below that we have a workflow block, in which a channel brings in the FastQ files and pipes them into the FASTQC process.
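To make that concrete, here is a minimal sketch of the kind of single-process script being described. The file glob, output pattern, and channel construction are illustrative rather than copied from the slide, so treat the details as assumptions:

    #!/usr/bin/env nextflow

    // A single process that runs FastQC on each FastQ file it receives
    process FASTQC {
        input:
        path reads

        output:
        path "*_fastqc.{zip,html}"

        script:
        """
        fastqc $reads
        """
    }

    workflow {
        // Build a channel of FastQ files and pipe it into the process
        Channel.fromPath('data/*.fq') | FASTQC
    }

Each file emitted by the channel becomes its own FASTQC task, which is also the first hint of the implicit parallelism mentioned next.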
As I say, this is a really simple, single-process workflow written in Nextflow, but it shows that the language is reasonably simple, with clear definitions of inputs and outputs as well as a script block. Nextflow also gives us something called implicit parallelism. With the workflow block, multiple input files, channels, and processes, Nextflow can divide the work and run tasks in parallel. This can be really powerful, because you can run multiple tasks of the same process at the same time, which speeds up how your pipeline runs, and it also means you can distribute your pipeline and resources across a cluster or the cloud quickly and easily. In the example at the bottom here we have some BAM files, another common file type in bioinformatics; that data is passed into a process and is split out into different FastQ files. As well as parallelization, we also have re-entrancy. With Nextflow you can use a resume flag, which allows you to re-enter a pipeline at a particular stage, and this is something we'll explore later in the training. Using the resume flag, if your pipeline has failed, or you want to tweak something halfway through, you can jump back into the pipeline at the part that failed or that you have edited, meaning you don't need to re-run the entire pipeline. With bioinformatics as an example, processes can take a long time and use a lot of compute, so being able to re-enter a pipeline partway through is much better, because you don't have to re-analyze or re-compute something you've already done. Nextflow also gives you a lot of reusability. A lot of what we'll talk about in the rest of this training feeds into this idea: because a pipeline is portable, and you can have all your software and versions contained as part of Nextflow, it can be very reusable across different sites and by different people. We'll talk about different aspects of this now: first code, second software, and finally compute. For managing your code, Nextflow has built-in integration with all of your favorite version control platforms, so you can use things like Git with GitHub, Bitbucket, GitLab, or Gitea, which allows exact version control of a pipeline as it develops over time. As well as this, Nextflow integrates with all of your favorite software managers, including Docker, Podman, Singularity, and Conda, and all of this is really important both for portability and for reproducibility. Finally, for compute, Nextflow allows you to scale your pipeline across all of your favorite HPC schedulers and cloud providers. We have a few listed here, but there are many more: Sun Grid Engine and Slurm as examples on the HPC side, and Microsoft Azure, Google Cloud, and AWS as cloud providers. Reproducibility is really important, and Nextflow really helps with making your code and pipelines reproducible. As already mentioned, you can quickly and easily change your software manager by adding something called a directive at the top of your process block.
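As a rough sketch of what that looks like in practice, here is a process carrying both a software directive and an executor directive. The specific package, image, and scheduler names are examples I've chosen, not the ones on the slide:

    process FASTQC {
        // Software management: swap a single line to change how dependencies are provided
        conda 'bioconda::fastqc=0.12.1'
        // container 'biocontainers/fastqc:v0.11.9_cv8'   // Docker/Singularity alternative

        // Compute: hand the task to a scheduler instead of running it locally
        executor 'slurm'

        input:
        path reads

        output:
        path "*_fastqc.{zip,html}"

        script:
        """
        fastqc $reads
        """
    }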
So here we've added a conda directive, but you can easily change this to Docker or Singularity, for example; all of this can be changed in a single line, simply by swapping conda for container. And when we're talking about execution platforms, here we've added executor 'slurm', meaning this could run on your HPC scheduler, and it can just as easily be pointed at a cloud provider, so the same pipeline scales to HPC or cloud very quickly and easily. So Nextflow really helps with reproducibility between runs, it is portable between systems, and it is scalable everywhere. That's where I'll stop with Nextflow for now, and I'll transition to talking a little bit about nf-core. nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. It has been heavily rooted in the life sciences, particularly genomics, but there's really nothing stopping it from evolving into other fields as well, if you have a background outside the life sciences. Currently there are 75 different pipelines available as part of the nf-core community. nf-core also provides various tools that have been developed to help with writing pipelines and with testing and automation as pipelines are developed; this is something we'll explore heavily in session two, tomorrow. nf-core also hosts a lot of modules and sub-workflows. Modules is the word we use for processes, most frequently for analyzing bioinformatic data using different pieces of software; these have all been packaged up, written and tested in a way that they can be shared between different workflows. We also have sub-workflows, which chain different modules together, and again these can be shared between different pipelines; using the nf-core tools, both modules and sub-workflows can be installed, adapted, or removed from pipelines very quickly and easily. nf-core has some core ideas that make it more than just a repository. It has an ethos that we should be developing with the community: we all work together on larger pipelines or initiatives and make our code available for others, so that everyone benefits from this larger community. Pipelines start from a common template, which is really important because it helps you adopt best practices from the start of your development process; it also means you can automatically integrate template updates as they are developed by the nf-core community, which helps keep your code up to date as your pipeline develops and needs to be maintained over time. There's also probably one of the most important concepts: collaborate rather than duplicate. nf-core discourages creating more than one pipeline for the same type of analysis, which stops there being four different versions of the same pipeline. Instead, we work on the same pipeline together and improve it by adding features and keeping it maintained, rather than duplicating the effort and creating more work for everyone by having more than one version of a pipeline to maintain. Here are some numbers about the nf-core community: as it stands, there are a little over 4,000 Slack users and more than 500 GitHub organization members.
There are over 1,500 GitHub contributors as well as more than 3,000 Twitter followers. These numbers evolve quite rapidly, and I anticipate that as part of this training event we'll have many more Slack users and GitHub organization members. If you're not already part of these, I really encourage you to join during or after the training. The nf-core community is very global: we have a very strong user base in Europe and the Americas, and we're also expanding into other areas such as Latin America and Asia-Pacific. If you are from those regions and want to talk about anything, we have dedicated developer advocates in both, so we're really happy to hear from you about what you'd like to see happen in your region. Here is a paper describing the nf-core community; it's a 2020 paper covering a lot of what I've just talked about. Because it's from 2020 and we're already in 2023, some of the numbers will be a little out of date, but it gives you a really good overview of the core ideas behind nf-core and some of the amazing things you can do as part of the community. Finally, I'll finish by talking a little bit about Nextflow Tower. Nextflow Tower is a product of Seqera Labs. It gives you an intuitive launchpad interface for launching your Nextflow or nf-core pipelines, and it allows you to launch, manage, and monitor your pipelines either locally or in the cloud. You can share runs with your team so they can be monitored by others in the group, and, as already mentioned, you can create cloud infrastructure and access it very quickly with a single click. At the bottom there is a link to tower.nf if you want to know more. Nextflow Tower has different tiers: a community open source version, a cloud version with free and paid tiers, and a commercial enterprise version. If you want to know more about these, you can get in touch with the Nextflow Tower team and talk about what might be available to you at your institute or workplace. Just a few final pointers: if you want to know more about Nextflow, the talks from the Nextflow Summit are all online and are a great resource for anyone who is developing with Nextflow or wants to learn more. Down the side there is some information about the current training, which is happening right now, as well as the nf-core hackathon, which is happening in a couple of weeks. If you haven't participated in a hackathon before, it's a really great community event where everyone works together on nf-core pipelines, modules, tooling, and documentation. Okay, so now we will jump to the training material. We'll go back to the website I showed you previously and start working through it. Here we are on the Nextflow training website; as you can see, it's training.nextflow.io, up there in the browser. What I'd like everyone to do is scroll about halfway down the page to the available workshops. We'll be using the basic Nextflow training workshop content just here, and we can access it by clicking on the green button that says start the Nextflow training workshop.
That takes us to this Nextflow welcome screen, which is a welcome to the training workshop material and an outline of what you should be able to do and understand by the end of the course. By the end, you should be proficient in writing Nextflow pipelines, you should know the basic concepts of channels, processes, and operators, you should have an understanding of containerized workflows, you should understand the different execution platforms supported by Nextflow, and you'll be introduced to the Nextflow community and ecosystem. As part of this, in session two we'll talk about nf-core and look at some of the tooling and documentation available from that community initiative. First things first: we need to set up an environment so you can participate and follow along with the coding I'm doing during this workshop. Over here on the left we have environment setup, and we'll click on that. This takes us to a page with instructions for installing everything you need, either locally or through Gitpod. While you could install everything locally, I really recommend that you use Gitpod. Gitpod is a free virtual environment that you can use for up to 50 hours per month on the base plan, and it already has everything you need installed: the data, all the scripts, and all the tooling. However, if you do want to install things locally, the instructions are listed here. For example, here is a list of the requirements: Bash, Java, Git, and Docker, plus some optional extras. Just beneath that is how to download Nextflow: there is a wget command as well as a curl alternative, so you can pull Nextflow onto your system, and then you'll need to make it executable, as shown down here. As well as that, you'll need to download, install, and register for Docker, which you can do by following this link; we don't have all of that information on this website as part of the course content, but the link will take you to everything you need. To get the training material, you'll need to clone it from this GitHub repository, which will download all the scripts as well as all the data we'll use during the training. But like I said, I really do recommend using Gitpod, and you can access the Gitpod environment quickly by following the link under 1.2.1. If you click on that, it should create a new Gitpod environment for you. I have used Gitpod before, and I reopened it earlier today just to check that it was all working, so for me there was a message asking whether I wanted to open a new environment, and I said yes. If this is the first time you're using Gitpod, you might need to authenticate your account using your GitHub credentials, and it may take a little longer for everything to load. While that is loading, we can explore what is appearing in my Gitpod environment and training window. As you can see, it is pulling the container image; this is the image that contains everything we need to run this training. What you should see is a window that looks like this appearing, effectively like a VS Code window. Down the side, we have all of our file options over here.
So, for example, under File you can open files or folders and things like that. Down here we have the extensions that are installed; some of these are used as part of the training material, and you don't need to worry about installing them yourself. Here we have the Explorer, where we can see this nf-training folder. Inside it we have a data folder, with more folders of data inside, as well as a number of .nf scripts. These are the Nextflow scripts we'll be using at different points in this material, along with a config file and a YAML file, which are also used during the training. In the main panel we have the Nextflow training material itself; this is effectively the same content we came from in the environment setup page. You can either keep this open and follow along here, opening scripts alongside it and flipping between them, or close it and just follow along in your browser window. Down the bottom we have the terminal, which is just the nf-training terminal where we'll do our coding, and you can see we already have all the normal commands you'd expect on a Linux system available there. Hopefully everyone has now managed to load a Gitpod environment; if not, feel free to pause this video and take a little longer to make sure everything has loaded properly before you continue. Okay, so the first thing I'll do is check that I have Nextflow installed. If you just run nextflow, you'll see a list of Nextflow's functionality and the options available to it. Next we'll check the version: you can use -v to view the version of Nextflow we're running, and you can see it is Nextflow version 22.10.4. What I would like to do, though, is pin a version of Nextflow, so I'm going to type export NXF_VER (short for Nextflow version) equals 22.04.5. This just exports that variable to my environment. Now when I run nextflow -v, you'll see that it downloads something, in this case a slightly older version of Nextflow, and the output now shows 22.04.5. The nice thing about this is that it's a really good way of making sure the version of Nextflow you're running is reproducible. Nextflow is a rapidly evolving tool with new features always being added, but to make sure everything behaves the same on my system and yours, we can pin the version like this. Okay, next I'll go through some of the introductory material. Back in the training material we're now going to click on introduction, which is underneath environment setup over on the left. What we have here is 2.1, basic concepts. This reinforces some of the things I went through in the initial slides, but I want to keep coming back to these core concepts throughout the training, because they are fundamental to being able to write efficient pipelines with Nextflow.
Nextflow is an orchestration engine and domain-specific language (a DSL) that makes it easier to write data-intensive computational pipelines. Its core features include workflow portability and reproducibility, scalability of parallelization and deployment, and integration with existing tools, systems, and industry standards; that is how you can integrate Git, for example, and use different software management tools like Docker, Singularity, and Conda. Nextflow has core concepts that will keep coming up, and these are processes and channels. In practice, your Nextflow pipeline is built up of lots of different processes, which are joined together by channels. Processes are executed independently, so they are isolated from each other; this will make a bit more sense when we talk about the work directory, but they do not share a common writable state, and they communicate via asynchronous first-in, first-out queues called channels. As I mentioned earlier, all of these processes can run in parallel as separate tasks. Any process can define one or more channels as an input or an output, and by connecting processes with channels you can string them together into a web of processes that communicate through these input and output declarations. Here is a diagram to help explain this: we have a channel carrying data z, y, and x; all of that goes into a process and is split out into tasks one, two, and three, which can all run in parallel. The data is taken as an input, and some output is produced, which in this case has been labelled x, y, and z. Something I haven't spoken about much yet, but which will be a recurring theme, is execution abstraction. What this means is that your Nextflow pipeline is abstracted from the underlying runtime. You can write the individual tasks in any scripting language, so you can have Python, Bash, Perl, MATLAB, and many others; you orchestrate these tasks using dataflow programming, meaning you tune how the scripts interact using inputs and outputs; you can define software dependencies via containers or other software management such as Conda, Docker, and Singularity; and there is built-in version control integration with Git, so you can pull straight from GitHub, GitLab, or Bitbucket and use versioning quickly and easily as part of your Nextflow pipeline. All of this is abstracted from the Nextflow runtime, so you can take the entire pipeline, which already carries the information defining the software, its versions, and the code to run, and orchestrate it on different systems: you can move it, scale it, and reproduce it on AWS, Google Cloud, or Microsoft Azure, as well as any of your favorite HPC systems such as Grid Engine or Slurm. Finally, something I wanted to come back to is that Nextflow is a DSL, a domain-specific language, which in practice is an extension of the Groovy programming language.
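To give a tiny taste of what that means in practice, here is a hedged example of the kind of Groovy you might sprinkle into a workflow; none of it comes from the training material, it just shows a closure and a string method being used on a channel:

    workflow {
        Channel.of('gut', 'liver', 'lung')
            // A Groovy closure: transform each item as it flows through the channel
            .map { sample -> sample.toUpperCase() }
            .view { "sample: $it" }
    }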
You don't need to know a lot of Groovy to write good pipelines with Nextflow. A little bit does help, but by no means do you need to be an expert in Groovy or Java, or anything close to it, to write your pipelines. Okay, that's all I wanted to cover as an introduction, and I think the best thing to do now is jump straight into some coding. Down here under 2.2.1 we have our first script, hello.nf, and I just want to highlight that there are little plus icons you can click on to get more information about each of these lines of code and what they're doing. To start off, I'll also open the script over here in my Explorer: hello.nf is exactly the same script that is included in the training material, I just have it open in my editor now. I'll quickly outline the lines of code, what they're doing, and how they interact with each other. Starting with line one, we have a shebang declaring that this is a Nextflow script. On line three we have the string 'Hello world!' assigned to params.greeting. params.greeting is a little bit special because it starts with params., meaning it is a parameter, and it can actually be overridden on the command line; I'll show you how to do that shortly, after we've run this script once or twice. On line four we use Channel.of, which is a channel factory, to turn params.greeting into a channel, and we call it greeting_ch. You might find that I refer to this either as ch or as channel, because ch is largely just shorthand for channel. So again: the string goes into params.greeting, params.greeting is turned into a channel by the Channel.of channel factory, and that channel is called greeting_ch. On line six we have the start of our first process, and down on line 18 the start of our second process. Each process has an input, an output, and a script block. Some script blocks start with the script keyword, just like this, but it isn't strictly required for the block to be detected, because it is already wrapped in these triple quotation marks. So for this first process, SPLITLETTERS, we have an input, an output, and the script block. The name SPLITLETTERS is somewhat arbitrary; you can call your processes anything you want, but I recommend calling them something that is recognizable when they are part of a larger workflow script. For the input we have a value: there are different input qualifiers you can use in Nextflow, one of which is a value (val) and another is a path. Here we have a value called x; x is also named somewhat arbitrarily, and you can call it almost anything you want, any letter, name, or string, with a few exceptions for reserved names. For the output we have a path, chunk_ followed by a wildcard, so anything with the prefix chunk_ will be picked up as an output. Paths are generally used for files and folders, whereas values are more for strings and other simple values. Down in the script block we have the printf function, which is a Bash command some of you may already be familiar with.
What it's doing is printing x, the variable we defined as our input, and we use the dollar sign to treat it as a variable. That output is piped into split, another command you'd expect to find on most Linux systems; it splits the printed input into chunks of six characters and writes them to files with the prefix chunk_. You can see that this chunk_ matches what we have in the output declaration, meaning the files written by split will be picked up as the process output, whatever comes after the prefix, because of that wildcard glob pattern. Down in the second process, which we call CONVERTTOUPPER, we again have an input and an output. The input is a path we've called y, and the output is stdout, which is effectively just printing to screen. In the script block we have cat, which prints the contents of y, our input file; that is piped into tr, which turns all of the lowercase letters into uppercase letters, as described by the '[a-z]' '[A-Z]' arguments. You'll notice that both processes have been closed off, and down on line 30 we have the workflow block. On line 31 we call the SPLITLETTERS process, and it takes greeting_ch, which is what we defined up on line four: a channel of params.greeting, which in this case is 'Hello world!'. So SPLITLETTERS accepts 'Hello world!' through the greeting_ch channel, and its output is assigned to letters_ch, the name of the output channel. That is then passed into the CONVERTTOUPPER process, the second process we just described, and we've also added .flatten() to the end of it. This is new; we haven't talked about it yet. flatten is an operator, and operators are ways we can manipulate a channel so that it is shaped the right way, with the right parts in the right places, so that the output of one process matches what the next process expects as an input. If a channel is shaped wrong, with too much or too little information in each item, things won't necessarily line up, and you can use operators to merge, separate, flatten, or mix outputs and inputs so that they match. In this case SPLITLETTERS emits multiple chunk files together, because splitting 'Hello world!' creates multiple chunks, and flatten splits them out into separate items so that each file is fed through CONVERTTOUPPER separately. The output of this is results_ch, and down on line 33 we use a different operator, .view, with it inside curly brackets, which simply means we take each item in the channel and view it. We'll come back to operators in more detail in session three, but for now it's enough to know that the view operator is used to look at what is in the results channel.
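Putting the pieces together, the script we have been walking through looks roughly like this. It is reconstructed from the description above, so treat it as a sketch rather than a verbatim copy of hello.nf:

    #!/usr/bin/env nextflow

    params.greeting = 'Hello world!'
    greeting_ch = Channel.of(params.greeting)

    process SPLITLETTERS {
        input:
        val x

        output:
        path 'chunk_*'

        script:
        """
        printf '$x' | split -b 6 - chunk_
        """
    }

    process CONVERTTOUPPER {
        input:
        path y

        output:
        stdout

        script:
        """
        cat $y | tr '[a-z]' '[A-Z]'
        """
    }

    workflow {
        letters_ch = SPLITLETTERS(greeting_ch)
        results_ch = CONVERTTOUPPER(letters_ch.flatten())
        results_ch.view { it }
    }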
Okay, I think the easiest thing to do now is to show you how to run this, and then we can look at the outputs and discuss what we're seeing. To run a Nextflow script you use nextflow run, and in this case I'm going to execute the hello.nf script we've just talked about. On line one of the output we can see the version of Nextflow being run; on line two we can see that we are launching hello.nf and that the run has been given a name. Every time you run a pipeline it is given a different name, made up of two parts, an adjective and a name. It says DSL2; for reference, there used to be DSL1, but Nextflow now runs DSL2 exclusively unless you use an older version of the software. On line three we have the executor, which in this case is local; if you're running this on your own device it should also say local, and it would change to the name of your HPC system or cloud provider if you were using one of those. On line four we have a hexadecimal number. This is something we'll discuss in more detail: it is generated based on the script you are running and the files you are feeding into it, and it is effectively a way to identify where a process has been run, which we can explore as part of the work directory. For now all you need to know is that we have run the SPLITLETTERS process to 100%, so it has completed, and that was one of one task. On the next line we have CONVERTTOUPPER, which also ran to completion, and that was two of two tasks. Below that is the standard output, which is what was actually printed to screen, and we have HELLO and WORLD. If we run this multiple times, we may actually get WORLD then HELLO; I can never guarantee whether this will happen, but it is simply because of the asynchronous first-in, first-out queues created by the channels, and occasionally WORLD will reach the standard output before HELLO does. In this run it happened to come out as HELLO then WORLD, so we'll take that and move on. I'll just quickly touch on these hexadecimals again. As I said, a hexadecimal hash is generated every time a process is run as a separate task, and you might find, oh, we have WORLD HELLO this time, that each task is given a different hexadecimal name, and these are created and stored as separate, isolated folders in your work directory. What you might have noticed already is that we now have this work directory, over here in blue on my screen. If you look inside it you'll see a number of two-character names made of digits and letters, all generated from the hexadecimal numbers we see next to the processes that have been executed. For example, we could go into this one here, which starts with 7e, and list the contents of work/7e; yours might be completely different, and if you don't see a 7e it's because that hash was generated on my system and yours will have generated something else. You can also see the end of the hash, cd1bd6, which is the start of the rest of that long string. If you go in and look at that folder, you can see chunk_aa and chunk_ab, so these were both generated as part of SPLITLETTERS.
If you were to cat those files and look at their contents, a bit like what the second CONVERTTOUPPER process does, you would see that chunk_aa contains 'hello ', which is what gets converted to uppercase by the next process. This is really just to show that each task has happened in its own separate work folder, and we can use these hexadecimal hashes to identify them. For the sake of this training, I'm going to remove my work directory with rm -rf work, so it's no longer there; there's no other magic happening behind the scenes apart from a few log files over here in my Explorer that were created as part of the execution, which are just used to keep track of what's going on and which I could also reference if there were errors. Next I'm going to run this again, nextflow run hello.nf, but this time I want to turn off the ANSI log, so I'm adding -ansi-log false. There are a number of different flags you can use with Nextflow to control what happens, and there's a long list of them in the documentation; this is just one of them. What it does is split the output so that every task is now printed on its own line. Previously, the second task of CONVERTTOUPPER would have been drawn on top of the first, so we wouldn't have seen this 4c hash at all, whereas now, along with this 03, each task gets its own line. If we looked in the work directory we'd see a folder created for each of those, and of course, much like before, we could go in and look at the contents, which in this case would again be chunk_aa and chunk_ab. Okay, what's next: let's look at the resume function. To demonstrate re-entrancy in Nextflow, we'll modify our script: in the CONVERTTOUPPER process of hello.nf, we're going to take y, designate it as a variable using the dollar sign, and pipe it into rev instead, which reverses its input. This modifies the script of the second process, and then we'll use the resume function: nextflow run hello.nf -resume. What we should see now is the word cached appear next to our processes. Because everything Nextflow does happens in those isolated folders in the work directory, when we re-run with -resume Nextflow can look at those folders and work out whether a particular process has been run before and whether it has been modified in any way, and then resume from the point where something changed. In this case SPLITLETTERS hasn't changed, so its cache can be used, but because the CONVERTTOUPPER process has changed, it cannot be cached and had to be re-run. Down here in the output we have 'hello world' in reverse: the letters are no longer converted to capitals, they're just printed backwards. If we run this again with -resume, we should now find that both processes are cached.
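Before we look at that second run, here is roughly what the edited CONVERTTOUPPER process now looks like; only the script block has changed from the original:

    process CONVERTTOUPPER {
        input:
        path y

        output:
        stdout

        script:
        """
        cat $y | rev
        """
    }

Because the task hash is derived from the script as well as the inputs, changing this one line is enough for -resume to re-run CONVERTTOUPPER while still pulling SPLITLETTERS from the cache.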
Cool. Okay, so what you can see here is two of two again, but instead of only having cached next to SPLITLETTERS, we also have it next to CONVERTTOUPPER, so Nextflow knew that all of this had already been computed and didn't have to run it again. This is a really powerful feature: if something goes wrong with your pipeline, or you want to edit something or add another sample, you can quickly resume from a point without having to re-compute everything you had already processed. What I'd like to show next is how to override parameters from the command line. Up on line three we have params.greeting, which is a little bit special in that we can override it on the command line; this works for anything that is a parameter and starts with params. in a pipeline. So we're going to run this again; I'll just clear the screen to make it a little easier to see. We have nextflow run hello.nf and then dash dash greeting, because we use a double dash to access a pipeline's parameters, and I'm going to paste in a greeting; this is the same greeting used in the introductory training material, but you can use any other string you like. Except that should have been --greeting, not --greetings. A typo like that doesn't actually produce an error: the run still goes ahead, because there are no checks to make sure the parameter I specified really exists, and it would simply rely on the 'Hello world!' default up here rather than being overridden by my string on the command line. With the correct --greeting we get a slightly different output: there we go, it still runs through SPLITLETTERS, but the string is split into three chunks this time because it's a bit longer, then through CONVERTTOUPPER, which we modified to print the string in reverse, and we can see the three reversed strings printed out here. We can also go into the work directory and see what happened: if we look inside one of the new task folders we can see chunk_aa, chunk_ab, and chunk_ac, and if we looked at those like we did earlier, each would have slightly different contents, being the pieces of the string that was split. Okay, that is all I wanted to show you with this hello.nf example. As I said, there is more information about all of this in the training material, and you can click on the little plus marks to look at each line in more detail. We've worked through running the script, we've had a look at the ANSI log, we've modified the script and resumed it, and we've looked at overriding the greeting parameter from the command line. Right down the bottom, in 2.4.1, some of you might be used to looking at DAGs; this is just another representation of what we've already done, showing the inputs going through a channel, passing through the two processes, and the outputs coming out at the bottom.
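To recap the override mechanic, the only part of the script that matters is the params declaration; anything passed with a double dash on the command line replaces the default for that run. The long greeting below is just an example string, not the one from the training material:

    // In hello.nf: a default value, used when nothing is supplied on the command line
    params.greeting = 'Hello world!'

    // On the command line, --greeting overrides params.greeting for this run, e.g.
    //   nextflow run hello.nf --greeting 'A much longer greeting than the default'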
Okay, so the next thing we'll do is move on to this simple RNA-seq pipeline. This is really a demonstration of a real-world biomedical scenario, and we call it a proof of concept because it isn't a real RNA-seq pipeline; it's merely an example using some commonly used bioinformatic tools, which some of you will be very familiar with, to show how this could look as a real-world application. If you are unfamiliar with RNA-seq, particularly if you're coming from outside the life sciences, please don't worry too much about what these tools are doing. I encourage you to think about how you could apply your own data to these situations, and if you use similar tools in a different field, to focus on the big concepts rather than exactly what happens in the execution of each script and piece of software. So we'll jump over to the Gitpod environment again, close hello.nf, go to the Explorer over on the left, and click on script1.nf. The first thing we can do here is add a comment. Single-line comments are just a way to annotate your code: if you want to add a comment you can do so easily with two slashes, like this, 'this is my code', and that line won't be interpreted by Nextflow; it's just a typical comment. As you'll see in some of the later scripts, there is also a multi-line option, using a slash and star to open and a star and slash to close, and anything in between, across as many lines as you want, is treated as a comment. That's all I wanted to show there: you can add comments in a couple of different ways. Now, what we have inside script1.nf, the unmodified file, is params.reads, params.transcriptome_file, and params.multiqc. All of these are strings, in this case file paths, and they use projectDir as a base. projectDir is a built-in variable that specifies the project directory, the directory that contains your project, which in our case is this nf-training folder. So these parameters point at the data folder, inside that the ggal folder, and inside that the gut, liver, and lung files, each appearing as _1 and _2 with the .fq file extension. For those outside bioinformatics, these FastQ (.fq) files are a very common file format for sequencing data. So we have these three parameters, each defined as a string that happens to be a file path, and down on line five we have println, which is a Groovy function, printing a line containing the variable params.reads that we declared up above. If we run this with nextflow run script1.nf, we again see the Nextflow version, which script was run, and the run name, and then the output, which in this case is 'reads' followed by the string specified up in params.reads. You can add as many different parameters to a pipeline as you want.
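For reference, the top of script1.nf that we have been looking at is roughly this; the paths are reconstructed from the description, so check the file in your environment for the exact values:

    params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
    params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
    params.multiqc = "$projectDir/multiqc"

    println "reads: $params.reads"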
The exercise included in this part of the training material is to add an output directory as a parameter: a new parameter, params.outdir, in this case with the value 'results'. Later on we'll use this in one of our pipelines, but for now we're just specifying it as another parameter. You could, for example, substitute it into the println and print any of the parameters we've included; here it's just going to be params.outdir. We can run that again, and that's quite nice: we see the reads as before, and in this case params.outdir is results, printed to screen. This is all well and good, and you could go away and build some quite complicated strings using Groovy functionality to print all of this to screen every time you run a pipeline, but Nextflow also has built-in log functionality. As shown in the training material, we can add a log.info statement with a multi-line string. Here, for example, we have some information saying this is an RNA-seq pipeline and then listing the transcriptome, the reads, and the output directory, which we've already specified up above, and there is also this other little bit, stripIndent, which strips the indentation when we print this to screen. I'll quickly run this again with nextflow run script1.nf, and what we should see is nicely formatted log information printed to screen. This is just to show that Nextflow has built-in support for nice logging, which helps us keep track of what actually went into the pipeline, what it was called, and potentially where it came from.
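The logging block being described is along these lines; the banner text is illustrative, and stripIndent() just removes the leading whitespace that comes from indenting the multi-line string in the source file:

    log.info """\
        R N A S E Q - N F   P I P E L I N E
        ===================================
        transcriptome: ${params.transcriptome_file}
        reads        : ${params.reads}
        outdir       : ${params.outdir}
        """.stripIndent()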
Okay, now we're going to move on to script2.nf, which we can open here. As you can see, it already has a multi-line comment at the top (you can edit these as much as you want; I'll just add a capital letter in there), the params.reads, transcriptome_file, multiqc, and outdir parameters, the log.info statement we've already added, and, further down, another multi-line comment. What has been added in script2, compared to script1, is a process block and a workflow block. For the process block, much like in hello.nf, we have named it, in this case INDEX, and it has an input, an output, and a script. As the input we have a path, which we've called transcriptome; as I said earlier, you can call this anything you want, but it's nice to call it something identifiable, and we're declaring it as a path rather than a value. The output is also a path, in this case salmon_index, and it is a fixed name: this isn't a variable name or anything like that, the output will have to be called salmon_index specifically, because we haven't put any variable information in there and it's written as a plain quoted string. Down in the script block we have the salmon tool. Salmon is commonly used in RNA-seq analysis and has lots of functionality for different steps; in this case we're using its index function. We specify the number of threads, we specify the transcriptome file, and we specify what we want the output to be called. Going back through it again: we call salmon index; for the threads we actually have task.cpus, which is something we haven't really spoken about, but task is used a little like params in that you can access task-level settings for this particular process, in this case the number of CPUs allocated to it, and I'll come back to this very shortly. Then we have the transcriptome, again referenced with a dollar sign, which refers to the transcriptome input path declared at the top of the process, and then -i, which sets the output name to salmon_index, matching what we specified as the output. Down in the workflow block we call the INDEX process with params.transcriptome_file, so that string we defined at the top, pointing at transcriptome.fa, is what gets passed in as the transcriptome input path and used inside the script. After all of that explanation, let's just run the script: nextflow run script2.nf, and we'll see what the output is. This is actually an error, which is expected, and it shows what an error looks like when running Nextflow, so let's dissect the message. We can see there was an error executing the process INDEX, and it was terminated with exit status 127. It shows the command that was executed, the command output, which was empty, and the command error, which says that on line 2 the salmon command was not found. Let's take a little detour here and look at the work directory again; we've already talked about it a little with hello.nf. If we go into the work folder for this task, the same hexadecimal shown next to the process, we can see a number of things have been created; I've used ll here, which is pretty much just ls -la, so we get a long listing with all of the details for this folder. You can see a number of files, such as .command.begin, .command.err, .command.log, .command.out, .command.run, and .command.sh, all created as part of the Nextflow execution. As I've already alluded to, the entire task runs in this folder, and everything it needs has to be in, or accessible from, this folder. In this case the salmon tool was not actually available, so the task failed, but you can go in and look at things like .command.sh, the actual command that was run, and you can see it's the same command shown in the error message under 'command executed'; when the error refers to line 2, it's actually referring to line 2 of this .command.sh file. Anyway, all of that is an aside; the next question is how we resolve this.
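Before we get to the fix, here is roughly the process and workflow block we just walked through, reconstructed from the description; the exact output name in script2.nf may differ slightly:

    process INDEX {
        input:
        path transcriptome

        output:
        path 'salmon_index'

        script:
        """
        salmon index --threads $task.cpus -t $transcriptome -i salmon_index
        """
    }

    workflow {
        index_ch = INDEX(params.transcriptome_file)
    }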
As already mentioned, there are a lot of really easy integrations with different software package managers; in this case I want to use Docker. What I haven't shown you yet is that over here, in this nextflow.config file, we've actually already specified a container with the software we want to use as part of this pipeline. The problem is that we haven't told Nextflow that we want to use Docker in the first place to manage this container and provide the software. By adding the -with-docker flag we can run this again, and what we should see is a successful run. So again it launches script 2, it goes away and runs this process, 1 of 1, success. That's great. We could add -with-docker every time we execute this run, but that would be a little tedious and easy to forget. What you can do, much like we've already specified the process container and some Docker run options here, is include this in the nextflow.config file: docker.enabled = true is the equivalent of adding -with-docker on the command line. So I'm going to save that and run again, and what we should see is that this is also successful. Fantastic. This is just an example of how you can add information into a config file and let Nextflow use some smart logic to include things like containers and automatically enable Docker to run them.

What I will do next is quickly show you how you can use the view operator to help view the contents of a channel. In this situation I want to use index_ch.view(). You might remember view from the hello.nf example, where we used it to view the output of that script, the "Hello world" output. Here I've just added it into the workflow block, .view() with the brackets at the end, and when we run this again, remembering that Docker is now enabled in the configuration file so it will automatically use the image specified there, the .view operator lets us view the channel, which in this case is actually pointing back to the work directory where this file was created. We can see the path to that file, and if you want to see what's inside it, in this case a folder, you can run ls on it. So this has all happened within the work directory; we haven't asked Nextflow to store this anywhere else for us just yet, but it has been created and is sitting there in index_ch.

Okay, as I said earlier, I was going to go back and revisit what's happening here with task.cpus. For every process you can add things called directives, and directives are used to help control the execution of that process. There are many, many different directives you could use up here to help manage this. By default Nextflow is going to use one CPU for this process, but you can allocate a different amount of resources; here, for example, we're putting cpus 2, so it will run with two CPUs rather than one. Before we go any further, I'll show you what the previous run looked like: we can cat the .command.sh file in that work directory location, so again this is the script that was actually run, and you can see that threads has a 1 next to it. Now that I've updated this, we can run it again: nextflow run script2.nf.
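As a rough sketch, the nextflow.config being described might contain something like the following (the container image name is illustrative; the training repository ships its own):

    // nextflow.config
    process.container = 'nextflow/rnaseq-nf'    // image providing the pipeline tools (name is illustrative)
    docker.enabled    = true                    // same effect as passing -with-docker on the command line

With the cpus 2 directive added as the first line of the index process body, the $task.cpus interpolation in .command.sh changes from --threads 1 to --threads 2.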
Looking at the run again, this is actually the output from the view, because we've still got that view statement in there. What I want to do, though, is look at the command that was run, so again we can cat the .command.sh in that work directory folder. We know this is the path to the right folder because we've got the start of the hexadecimal hash up here next to the process, and if we look at it again you can see threads is now 2. So this was modified as part of this task execution: it used the cpus directive up here, changed it to 2, and this is now running with two CPUs rather than one. That is everything I want to say about script 2, and we will now move on to script 3.

So this is script 3. As you can see, we've got some of the same things we've already talked about: we're setting our parameters and we've got the log.info there, but down here on line 18 you'll see something a little bit different. On line 18 we are using another channel factory. Similar to Channel.of, which we used in hello.nf, we are using fromFilePairs, and this is going to create a channel from the file pairs identified by params.reads. What I will do first is run you through a few executions of this. I'll quickly clear my terminal, and I'm going to clear my work directory so it's nice and clean and you know there's nothing else happening in the background. Then we can do nextflow run script3.nf. This should run, but we won't see anything, because to see the output we need to add a view statement, much like we did previously, so that the contents of the channel are printed to screen. So we run this again with .view at the end, which is going to view this read pairs channel. And what do we see? We see one channel element, with "gut" as the first part, and the second part being a list of two files, gut_1.fq and gut_2.fq. So what do we think has happened? This channel factory, fromFilePairs, takes params.reads. params.reads, up here on line 4 of script 3, is the project directory, the data folder, the ggal folder, and inside that the gut files with either a 1 or a 2 at the end, .fq. Using these curly brackets, allowing for either 1 or 2 in that position, with gut at the start and .fq at the end, it identifies both of these files, which in this case are gut_1 and gut_2; these are the contents of those files, and it's not really important what's actually in them. What fromFilePairs has done is identify that those two files are a 1/2 pair, identify that both of them are paths, find the common base name of the two files, which is gut, and create another element for that at the start. What that means is that these two files are now getting passed around as a pair, and we have a first element in this channel which is a value, gut, like a string, and we still have the two paths to gut_1 and gut_2 as files.

Okay, so we can use a similar logic to what we've done previously, using a wildcard, a glob pattern, to identify more files as part of params.reads. What we will do is overwrite this on the command line, so we can use --reads; again, this specifies the reads parameter.
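A minimal sketch of the channel factory being described, with the view operator added (the path is illustrative):

    params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"

    workflow {
        Channel
            .fromFilePairs(params.reads)        // emits tuples like: [ gut, [gut_1.fq, gut_2.fq] ]
            .view()
    }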
Because reads is a parameter, params.reads, we can point it at the data folder, the ggal folder, and the gut_{1,2}.fq pattern. If we ran it like this, it would be exactly the same as what we're doing in the script; the only difference is that the version in the script has the project directory at the start, whereas here we can use a relative path from the current directory, so it handles that sort of automatically. But what I want to do is change this to include everything, not just gut. Now we're not specifying that it has to be gut; it could be any of the other tissue types, in this case anything with an _1.fq or an _2.fq at the end, and what we should find is that this populates for gut, liver and lung. This wildcard allows us to use any of the three tissue types: it has found all three of them in that folder, it has found the base name, being the tissue type, and the two associated files, so it has created file pairs from the files that were available in that folder and fit the pattern. Up here it was just gut; down here, by overwriting on the command line, it was any of the tissues, because we had that wildcard in the pattern. Just as easily, if you only wanted to run it on the liver, you could overwrite it with that.

Okay, so to move things along, there are a couple of exercises here under 3.3.1, the first of which is to use the set operator. set operates in a very similar way to the equals sign that we use here to name our output channel. What I will do is copy this, get rid of that, move this down like this, add in .set and then some curly brackets for read_pairs_ch. This is the name I want to set the output channel to, and then we can add a line break there to help with readability. So we've got this channel factory, fromFilePairs, we're still taking params.reads, but instead of using the equals sign to set the output channel name, we're just going to use set for read_pairs_ch. We can run this again and see what the output looks like. Fantastic, so that's still running; we've got liver and the two file pairs that we've created through that fromFilePairs channel. The only difference here is that we've used .set rather than the equals sign. Something else you can do with a channel factory like this is add some additional options. This is another exercise under 3.3.1: in this case we're adding the checkIfExists option, which checks whether these file pairs exist. If they do, the pipeline continues; if they don't, it fails and spits back an error message. So this is just a way to check that the files exist, as suggested by the name. You can go and look at the documentation for this online, but if you run it again, nothing else will really change, apart from the fact that it has checked that the files exist, and it would have killed the pipeline if they weren't there.

Okay, so we will keep moving on, because I am conscious of time. We're still in the nf-training folder, and we have a work folder which would be full of the different tasks that have been executed; actually you can't see that because nothing has been put in there yet, but rm -rf will clear it anyway, so we've got a nice clean directory again and a nice clean terminal, so everyone can see what's going on. Jumping down to script 4, I've just opened it up.
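In code, the exercise being described might look roughly like this, with the output channel named via set and the glob guarded with checkIfExists:

    workflow {
        Channel
            .fromFilePairs(params.reads, checkIfExists: true)   // fail early if nothing matches the pattern
            .set { read_pairs_ch }

        read_pairs_ch.view()
    }

Overriding the parameter on the command line would then look like, for example, nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq', quoting the glob so the shell doesn't expand it first.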
What you can see is that we've jumped back to building our main pipeline. We've still got our parameters at the top and our log.info, we've got the process index, and now we have a process called quantification. Quantification is another process that we've created and added here, again with an input, an output and a script. For the input we have two positional inputs: the first one is a path, which is going to be the salmon index, and the second is this tuple. A tuple is a way that you can have different elements as part of the same channel element: the first part is a value, which we've called sample_id, and the second is a path, which is the reads. Again, val and path are just ways to specify the different data types, and this will be expanded upon later in the training. Here we have the output as well, which is a path to sample_id; in this case sample_id will be the same as the value supplied as an input. This is really nice and dynamic, in that you can reference things that come in from your input as an output, for example as a variable. Down here in the script block we have salmon quant; quant is another salmon subcommand. We still have threads with task.cpus, very similar to what we did previously up in the index process, we have a library type, we have the salmon index (this is the input salmon index, which we know because it has the dollar sign), we've got -1 for read 1 and -2 for read 2, which are the reads supplied as the second part of this tuple, a path, indexed as the first and second elements of that list, and then we have sample_id as the output directory, which is the named variable we have up here as the value. So the output will be sample_id, which was taken as an input here as part of this tuple. That's all quite complicated and probably a little bit hard to digest, so I think the easiest thing to do is just run it, look at the outputs, and then go back and look at the workflow block. So nextflow run script4.nf, hit enter, and this will run; we can see index running, then quantification running, and both ran successfully.

Okay, so what has happened here? First of all, in this workflow block we have the channel factory, which is exactly what we just looked at in script 3: fromFilePairs, params.reads, checkIfExists: true, set to read_pairs_ch. We also have the index process, which has taken params.transcriptome_file, processed it, and created index_ch as its output channel; again, this is the salmon_index folder that we specify up here as the output. Down here we see the quantification process, and it is taking two positional arguments. The first one is the index, which we know is represented by this one because it's in the first position, and then we have read_pairs_ch, which is a tuple whose first part is a value, the sample_id, and whose second part is the path to the reads. So it has taken the read pairs channel created with the channel factory up here, and it is also taking index_ch, which is the output of the index process. What has happened is that we've taken an input from a channel factory as well as from a previous process, so two different channels have been supplied to this process, and then the quantification process has acted on them.
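A sketch of the quantification process and workflow block described above (it assumes the index process from the earlier sketch, and the --libType value and exact flags follow the standard training example, so treat them as assumptions):

    process quantification {
        input:
        path salmon_index                           // output of the index process
        tuple val(sample_id), path(reads)           // e.g. [ gut, [gut_1.fq, gut_2.fq] ]

        output:
        path "$sample_id"                           // output folder named after the sample

        script:
        """
        salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
        """
    }

    workflow {
        read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)
        index_ch = index(params.transcriptome_file)            // index process as defined earlier
        quant_ch = quantification(index_ch, read_pairs_ch)     // two channels feed one process
    }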
The result is a channel called quant_ch. If we wanted to view its contents we could look at the work directory, or we could use publishDir, which is something we'll explore very shortly. But first I wanted to show you some extra functionality with resume, which I talked about previously. Say you're using resume because you've found that you've got some more samples and you want to include them in your pipeline. What we can do is add in something like this, putting that wildcard back in: we're going to run script 4 again, we're going to resume, but we're changing the reads, overwriting the reads parameter from the command line, and including everything that matches this wildcard pattern in the folder. Okay, so that's what we're executing, and that's all good, but as you can see, up here we've got "cached: 1", and down here quantification has actually run three times now, not just once. Previously we only specified gut; because we've changed this to the wildcard, it picks up the liver and lung as well, so it's been executed three times, but one of those was cached. If you ran this again, you'd find that the cached count becomes three. So the difference here is that this was cached 1 and this was cached 3: Nextflow identified that this had been run once before up here, and down here it identified that they'd all been run before, so it didn't have to recompute everything; it could just detect that they'd been run and push straight on to the results.

Under 3.4.1 we have a couple of new exercises. The first is adding a tag. A tag is another type of directive that can be used to help keep track of your processes; in this case we're going to add it to the quantification process. tag is quite simple: you specify tag, and in this case we're going to say "salmon on" followed by the sample_id. Again this is dynamic; it's picked up from the value included as part of this input tuple. If you rerun that, and I'm going to run this with -ansi-log false so you can see each task for each process, this will split the quantification out onto three separate lines, so they don't just stack on top of each other where we can't see them. What you can see is that we've got this tag: salmon on liver, salmon on lung and salmon on gut, so we can tell exactly what each of these tasks was, and what's inside those folders if we were to go and look at them, because we've got this tag. It's quite a nice way of keeping track of what's happening in your pipeline. As well as that, you can add additional directives such as publishDir. This is quite a useful one to know about, because it's how you specify where you want result files to be stored. At the moment everything is happening within the work directory, and we haven't told the pipeline where to put the outputs. I'll just copy this straight out of the exercise material: we have this publishDir, which again is another directive, we've got params.outdir, which we've specified up here, and we've called this folder results, or we will call it results, and down here all this is saying is mode: 'copy'. This is another option, much like the options we've used in the past, such as checkIfExists: true; it is just another option.
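In code, the two directives added in exercise 3.4.1 sit at the top of the quantification process body and would look roughly like this (the tag text follows the narration; params.outdir is the "results" folder defined earlier):

    tag "salmon on $sample_id"                      // label each task with its sample
    publishDir params.outdir, mode: 'copy'          // copy this task's outputs into the results folder

A resumed run with the wildcard override and per-task logging might then be launched as, for example, nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq' -ansi-log false.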
Okay, so what I'm going to do is just clear my work directory, to show that there's nothing there, and then we will run this. Again this is script 4, and we'll use -ansi-log false. Cool, so we see index run once and quantification run three times, on three separate lines for the gut, liver and lung. What you will now see, though, is that we have this results folder, and this is new. If you look inside it you will see these sample IDs; this is the same as what is here, the sample ID, that common base name we got from the channel factory, and we have created a separate folder for each of them, and inside those there's a lot of information about the actual quantification, which was generated using salmon quant. Just to summarise that in a slightly different way, to hopefully make it more digestible if what I've already said hasn't made sense: we've used the publishDir directive to tell Nextflow where the output files should be stored, in this case params.outdir, we wanted to copy the files there, and the files are getting put into this outdir, which is results. We can see that when we look at our file directory, and we can also look at it over here in the explorer: results, with gut, liver and lung. Cool.

Okay, so that is really the end of that script, and we're going to jump on to script 5. Moving from script 4 to script 5, this script isn't massively different, apart from the fact that we've added another process, in this case called fastqc. fastqc has a tag, it takes a tuple with a value, the sample_id, and the path to the reads, and it has an output which is just a path to this logs folder. Down here in the script block we're making a directory called fastqc_ followed by the sample ID; again this is dynamically inferred from the input. We're then running fastqc, and fastqc is installed as part of that Docker image as well, which is why this will run. So if you were to run this, it should all work. What you will see down here is that we also have fastqc as part of the workflow, so this is the process name, and it is taking in read_pairs_ch. This is the same name that we set from our channel factory, the one using fromFilePairs, so again we'll have that base name at the start, either gut, liver or lung, as well as the two files as the second element. This is really just to demonstrate that you can reuse a channel: you can see that we have read_pairs_ch used first as part of quantification, and then again here as part of fastqc. So what we're really trying to do is build up this idea that you can use these channels in dynamic ways, and share or reuse them between different processes depending on what you're trying to do. I'll just run this again; I'll clear all of this, and then we go nextflow run script5.nf. What we should see is that this will run, and in this case, because I haven't changed it (earlier we only overwrote gut to a wildcard on the command line), it will only run for the gut samples, and we can see each process run once. If you were to overwrite this, and for this example I'm just going to edit the script up here and add the wildcard rather than overwrite it on the command line, and run it again, what you will see is that it runs once for index and then three times each for quantification and fastqc, because we have files being passed through from that channel factory for each of the three tissue types.
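A sketch of the fastqc process and its place in the workflow (the logs folder name and fastqc flags follow the standard training example, so treat them as assumptions):

    process fastqc {
        tag "fastqc on $sample_id"

        input:
        tuple val(sample_id), path(reads)

        output:
        path "fastqc_${sample_id}_logs"             // dynamically named logs folder

        script:
        """
        mkdir fastqc_${sample_id}_logs
        fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
        """
    }

    workflow {
        read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)
        quant_ch  = quantification(index(params.transcriptome_file), read_pairs_ch)   // processes from earlier sketches
        fastqc_ch = fastqc(read_pairs_ch)                                             // the same channel is reused here
    }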
Okay, fantastic. There's not a lot else to really explain about this, so let's jump on to script 6. For script 6 we've added a new process again, this time called multiqc, so up here on line 68 you'll see process multiqc. We have a publishDir, which is the directive we introduced earlier, with params.outdir, and we're going to use the copy mode as well; of course, as I said, you can read more about this in the documentation. For the input we're just taking everything, so this is that wildcard: everything that is staged into that process's work directory (staging is a concept we'll discuss more in session 3), everything that has been passed into that process as an input, will be picked up, because it's just taking everything. The output is going to be this multiqc_report.html, and the script being run is simply "multiqc .", so everything in the current directory will be included. What you'll see down here in the workflow block, when we actually run multiqc, are two new operators: the first is mix and the second is collect. The mix operator is mixing the quant channel with the fastqc channel. So you have the quant channel, which is the output of the quantification process up here, and the fastqc channel, which is the output of the fastqc process; they are being mixed together, so they are all being put into the same place, and then the collect operator merges all of those into one element.

Before we actually look at this, I think we should inspect some of it using the view operator. Again, the view operator prints the channel, so it prints what is being passed around between the processes. So here we're launching script 6; it's all running, and it's all running once, because we didn't specify the full wildcard pattern, but I think it's better if we do, so I'm going to go back up here, add that in, save it, and run this again with resume. This should be a little bit quicker, because index and quantification should be cached, or at least some of it, and then quantification and fastqc for the liver and lung tissues will need to run again. So this is the channel that we have viewed: the quant channel mixed with the fastqc channel and everything collected into one element. What this would look like if we removed .collect is quite different. You are still mixing the two channels together, which is why you can see the outputs of both channels with the same view operator, but here you see that they're printed on separate lines, meaning they would be pushed through as separate tasks. Thinking about this in a slightly different way, something potentially more visual or tangible: if you run it like this, with the original script, it can use all of that cache and multiqc gets run once, which is great. But if you remove the collect and run it again, as we showed before, it doesn't collect everything, meaning it gets pushed through as separate tasks, and you see it has run six times. So the multiqc process, because these are all separate channel elements, gets pushed through as six separate tasks.
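A sketch of the multiqc process and the mix/collect step in the workflow block (again not the verbatim training script):

    process multiqc {
        publishDir params.outdir, mode: 'copy'

        input:
        path '*'                                    // stage everything that arrives on the input channel

        output:
        path 'multiqc_report.html'

        script:
        """
        multiqc .
        """
    }

    workflow {
        // ... index, quantification and fastqc wired up as in the earlier sketches ...
        multiqc( quant_ch.mix(fastqc_ch).collect() )   // one combined element, so multiqc runs once
    }

Without the .collect(), every element of the mixed channel would trigger its own multiqc task, which is the six-task behaviour shown in the demo.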
So that's really just a demonstration that, with the right operators, you can manipulate the way your channels are shaped, merge files and move them around so that the right files get to the right place at the right time, and use your channels and processes in a dynamic way, meaning you can fine-tune how everything is run and make sure everything is passed to the right process in an appropriate way.

Okay, so the final script is script 7, and this one is reasonably straightforward. The only difference here is this workflow.onComplete block, which again uses log.info. It's similar to what we've got up here, but in this case, when the workflow is successful it will print a success message telling us where the MultiQC report was stored (if you're not familiar, the MultiQC report is just a place where you can collect all of your QC data and view it in a really nice viewer in the browser), and if it has failed, it will just tell you that something went wrong. So you can just run this again; the resume is probably not overly useful here, but what I wanted to show is that processes in a workflow all happen asynchronously: everything is pushed through as data becomes available, essentially first in, first out, but you can also use this workflow.onComplete handler, so that once the entire workflow is finished you can run something at the very end; it is only executed once everything is done. In this case it has just told us that the run was successful and that we can open this file if we want to.

Okay, we're running out of time, so I will move through the last bits quite quickly. Section 3.8 of the training material covers email notifications; we can't really set that up too well in the Gitpod environment, but it's an example of how you could have an email, say a run-successful message with your report, sent to you after the workflow has finished. Section 3.9 is quite interesting: it's an example of a custom script. As I've already alluded to, Nextflow is quite interoperable: you can use lots of different scripting languages in the script blocks, and, like I said, Python, R, Perl, Bash, whatever, as long as it runs on Linux you can include it as part of your Nextflow pipeline. So what I will do is code fastqc.sh, so I'm creating a fastqc shell script, and I'm going to copy in the code from the training material. All this is doing is saying: this is a Bash script, take positional arguments, and feed them into the fastqc tool, like we do in our main script block. I'm just going to close that and then copy in some commands to make it executable. So here I'm making fastqc.sh executable, I've made a directory called bin, which is up here now, and I'm moving this fastqc.sh into that bin folder. You'll see now that we have a bin directory, inside it we have fastqc.sh, and it is executable. What we can do now is modify our script 7: this is the script we finished off with, with the workflow.onComplete, but instead of having the mkdir and fastqc commands in the process, the fastqc process will simply call our script, which is in our bin directory. Nextflow will automatically mount the bin directory, I guess that's one way to describe it, so that when you execute your pipeline it will automatically be there, and any scripts you have in your bin will be available for you to execute.
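A minimal sketch of the completion handler being described (the message text is illustrative):

    workflow.onComplete {
        log.info ( workflow.success
            ? "\nDone! Open the report in your browser --> $params.outdir/multiqc_report.html\n"
            : "Oops .. something went wrong" )
    }

And once the helper lives in bin/, the fastqc process's script block typically reduces to a single call along the lines of fastqc.sh "$sample_id" "$reads", with the helper name and arguments here being an assumed example of the pattern rather than the exact training code.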
So what I will do is nextflow run script7.nf, and we will run this again, remembering that I have modified the script, so it will look a little bit different to previous runs. Again, it's running Nextflow version 22.04, it's been given a name, there's all the log info that we asked it to print, and then we have index, quantification, and now we're running fastqc on gut, but it is actually running this fastqc.sh, the shell script we created and made executable so that Nextflow can run it automatically. Down here is just the onComplete statement that we asked for. Okay, so all we've done there is add the shell script into the bin folder, just to demonstrate that you can include custom scripts in a situation like this, put them in bin, and Nextflow will be able to find and run them automatically.

Something else we can do, which is quite nifty, is capture some extra reports generated by the pipeline. This is just some options I've copied out of the training material as well: -with-report, -with-trace, -with-timeline and -with-dag, which we've pointed at a file called dag.png. We'll just run this again (I probably could have used resume here to speed things up), and this will create a report, a trace, a timeline and a DAG, which will populate over here in my explorer. A lot of this can be quite nice just to help keep an eye on your pipeline; the timeline is an HTML file, which doesn't render too well here, but I think we can click on the DAG and we should get a nice image. This is just a representation of our pipeline all working together: the different processes are the circles, and we've got the channel factory up here, as well as some of the operators, showing how these channels are manipulated to be passed from one process to another.

Okay, so the last thing we have in today's training, and something we'll explore in more detail tomorrow, is running our code directly from GitHub. Because of Nextflow's built-in version control integration, you can effectively pull your pipeline directly from GitHub or another Git repository. In this situation it's a repository under the nextflow-io organisation on GitHub; you can go to GitHub and find it. We're going to use Docker, and we can run this, and instead of having this locally (it doesn't exist anywhere here), it goes away to github.com, to nextflow-io, and pulls it in. There does appear to be an error here, and I'm not sure why, but what I'm going to do is try a different version using the -r flag, which lets us refer to a different revision. So I'm going to use the revision dev, which is a development branch, and hopefully this one works. Okay, so this is probably the issue: the default branch was still written in DSL1, which won't always run on the newer versions of Nextflow now, so we should update that example. We've actually gone to the repository here, which opened automatically, awesome. So this is the repository; it defaults to master, and the master branch is still DSL1 and hasn't been updated in some time, which explains the error, so I used the dev branch instead. Okay, so believe it or not, that still ran.
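For reference, the run options being described might be invoked roughly like this (the repository and branch names in the second command are placeholders, not the ones used in the demo):

    # capture execution reports alongside a normal run
    nextflow run script7.nf -with-report -with-trace -with-timeline -with-dag dag.png

    # run a pipeline hosted on GitHub, selecting a specific revision with -r
    nextflow run <organisation>/<pipeline> -r dev -with-docker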
Despite being at least six years old, it still worked, which is pretty cool: it had all the tools and images available, so I could just pull it straight from GitHub. If anything, it really shows that if you write your pipeline in a good way, with all the containers and everything available in a repository, then even years and years later your pipeline is still executable just by pulling it directly from GitHub, which is pretty cool. Okay, so I think that's where we will finish today; this gets us to the end of section 3. Tomorrow, for session 2, we will actually be diverting away from this training material and talking about NFCore in more detail. For that we use a slightly different training environment, and we'll really explore the different tools and documentation that are available as part of NFCore. With that, I'd really like to thank everyone for attending today. Hopefully you managed to learn something and put up with me mumbling my way through, and I look forward to seeing you all tomorrow. Thanks very much.