Hello, and welcome back to this Nextflow and nf-core online community training event. My name is Chris, I'm a developer advocate at Seqera Labs, and I'll be taking you through the training material again today. To start things off, we'll have a quick recap of sessions one, two, and three, and then outline what we'll do today in session four. In session one, we started with a welcome and an introduction to the Nextflow ecosystem. We got you started with Nextflow by looking at the hello.nf pipeline, and we finished the session by developing our own proof-of-concept RNA-seq pipeline. In session two, we introduced nf-core, looking at it from the perspective of both users and developers, along with some of the tooling and documentation that is available. We finished that session by looking at nf-core modules and subworkflows and how they can be used and shared between different pipelines. In session three, we revisited some of the things introduced in sessions one and two in more detail: managing dependencies and containers, channels, processes, and operators. We gave you more of an introduction to Groovy, and we looked at modularizing processes. In session four, we'll continue to re-examine topics introduced previously, such as configuration, deployment scenarios, and cache and resume, and we'll talk about ways you can troubleshoot errors in your pipelines. We'll finish the session by getting you started with Nextflow Tower. If you have any questions during this event, we have a number of Slack channels available, and we have a great group of community volunteers who will be there to help you.

Okay, so let's get started. I'll head back over to the training material at training.nextflow.io. Here we are on the Nextflow training website. What I'd like everyone to do again is open up a Gitpod environment that we can use for this training today. I've already opened a new window over here just to keep things moving, and one thing I will do is export the Nextflow version so we're all using the same one. As you can see, it's downloading everything again; as a reminder, that's because this is a whole new environment. I haven't been reusing the same environment from day to day; every day I've created a new one. You're very welcome either to reopen one that you've already used or to start a new one just like me.

Okay, back in the training material we're using this basic Nextflow training workshop, so we can click on start and then go down to the configuration page. What you might remember from session two is that Nextflow looks for configuration files in a number of different places, and all of these are interrogated in a hierarchy. This is because Nextflow deliberately decouples the workflow from its configuration, so you can quickly and easily modify how your pipeline runs without needing to touch the code base, making it very portable and, hopefully, easy to configure. As a reminder, this is the configuration page in the Nextflow.io docs; if you search for "Nextflow documentation configuration" in your favorite search engine, this is the page that should come up.
Here is the hierarchy: the different configuration sources and the order in which they are applied. For example, parameters you set on the command line with flags override values from a params file or from a configuration file you specify with -c; those in turn rank above the nextflow.config in your current directory, then the nextflow.config in the workflow project directory, then the config file in your home directory, and finally anything defined in your main.nf. Again, this is all so you can decouple what you've written in your workflow from the actual configuration.

Going back to the training material, 10.1.1 is about the configuration syntax. You might remember the dot notation from our configuration file already: in process.container and docker.runOptions, process and docker are two different scopes, and the documentation lists a number of scopes you can use to configure your pipeline. If we look at the docker scope, for example, these are all the different settings you can configure as part of it. Similarly, the params scope has a number of options, with more information about how they need to be written. Back in the training material, the configuration syntax is simply the thing you're configuring plus its value: here we have process.container, so we're configuring the container for the process scope, written with an equals sign and the name of the container. Likewise the docker scope with its runOptions: scope, option, equals sign, value.

10.1.2 gives a few examples: one property, another property derived from it, and a custom path. Some of them reuse a property with the dollar sign as a variable, and one uses the dollar sign with curly brackets to wrap up an expression, so that's an example of using variables within the configuration itself. 10.1.3 is a note about comments, which we've seen a couple of times already: you can add comments with a double slash, or use the multi-line /* ... */ approach.

One thing we haven't really discussed is writing configuration scopes as blocks. When you have multiple settings to configure within one scope, instead of writing docker.enabled and docker.runOptions on separate dotted lines as we've done previously, you can group them under the scope using curly brackets: everything wrapped up under docker, with runOptions equals this and enabled equals true. Note that runOptions needs a capital O. So the dotted lines and the curly-bracket block, lines four and five versus seven through ten in the example, are effectively the same; we could delete one set and everything would still run identically. That's all 10.1.4 is showing: you can write these either as individual lines or by grouping them in the same scope using the curly-bracket notation.
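To make that concrete, here is a minimal sketch of the two equivalent notations; the runOptions value is the one used in the training material.

```groovy
// nextflow.config — dot notation, one setting per line
docker.enabled    = true
docker.runOptions = '-u $(id -u):$(id -g)'

// the same thing written as a scope block
docker {
    enabled    = true
    runOptions = '-u $(id -u):$(id -g)'
}
```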
Okay, something we've also talked about previously in slightly different ways is parameters: how they can be configured in different places and overridden on the command line. Here we have an example of two parameters, params.foo and params.bar, which have been set both in a configuration file and in the workflow script. This demonstrates how a configuration file overrides what's in the script, and then, further down, how you can override both of those on the command line as well.

As a reminder, let's set this up. I'm just going to overwrite my existing snippet.nf rather than create a new file. So here we have params.foo and params.bar, and we're just printing those values to the screen. We also need a configuration file; I'll put the params into the nextflow.config we've already been modifying, so it now has params.foo and params.bar set there too. Now at the command line I'll do nextflow run snippet.nf. This is a little different from the training material, where the script is called params.nf; I've just reused snippet.nf. And what we can see is the output Bonjour, le monde!, because the configuration file is overriding what was already in snippet.nf: it sits higher in the hierarchy. Similarly, we can override this again by specifying the parameter on the command line. Remember, two dashes means a pipeline parameter, whereas a single dash is a Nextflow option. So we'll override foo with something else and run it again. And you can see that foo has been overridden: the command-line value overrides the configuration file, which in turn overrides the value in the script itself. So you can stack your configurations on top of each other and modify things at different levels. It's really important to remember this hierarchy and how values are brought in, so that if something is coming through with the wrong value, you can work out which file it's being overridden in.
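Roughly what that demo looks like on disk, assuming the same example values as the training material:

```groovy
// snippet.nf — defaults defined in the script
params.foo = 'Hello'
params.bar = 'world!'
println "${params.foo} ${params.bar}"

// nextflow.config — overrides the script's defaults
params.foo = 'Bonjour'
params.bar = 'le monde!'
```

Running `nextflow run snippet.nf` prints "Bonjour le monde!", and `nextflow run snippet.nf --foo Hola` overrides foo once more from the command line.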
Okay, we also have an option here for configuring the env scope. This allows the definition of one or more variables that will be exported into the environment where the workflow tasks are executed. Again, here we have an example, and I'm going to include these lines in the configuration file I've been adding to already; the existing settings can stay, they won't cause any problems for us. We now have this process foo. It's been written in lowercase, which should probably be updated to uppercase the next time these documents are edited. We've got a script block; remember that script blocks use the three single quotes rather than double quotes. We've also got echo true: echo has actually been deprecated, so jumping over to snippet.nf I'm going to change that to debug = true, since that little bit of the documentation looks to be out of date. Strictly speaking this one is a shell block rather than a script block. I'll fix a few typos, and I'm going to add a workflow block down here that calls foo. We won't give it any inputs, because the process doesn't declare any; all it has to work with are ALPHA and BETA, which we specified in the env scope.

So when I run this, again just nextflow run snippet.nf, the foo parameter is now irrelevant because it isn't used anywhere in the script. What we can see is that BETA has been set to a path under the Gitpod home directory, and ALPHA has been set to 'some value'. Going back to the config, you can see why: BETA was defined relative to $HOME, which in this environment is the Gitpod workspace home, so that's what gets exported; if you were to echo $HOME you'd see the same thing. So when we run this, those variables are being injected from the env scope. There's more information about this in the documentation, including a note and a warning, and the training material shows essentially what I've just run, viewed slightly differently, along with the expected results.

Okay, so 10.1.7: configuration for processes. Process directives allow the specification of settings for task execution, such as cpus, memory, and container. This is something we've already touched on a little as part of nf-core. As an example, if you go and look at a module in an nf-core repository, I'll just pick one that's sitting at the top, you can see that some of this is specified as part of the process: here, for example, there's already a container, and while this one doesn't set a CPU allocation directly, it has been given a label, which is something we'll talk about again very soon. Here's the training example, setting all of this in your process scope: cpus, memory, and container, so that when your processes execute, each task gets 10 CPUs and eight gigabytes of memory, and runs in that container. There's also a note that process selectors can be used to apply configuration to a specific process or group of processes; that's what I've just mentioned in the context of nf-core, and we'll discuss it a little later when we talk about labels. Then there's an example where the process foo has its memory set to four gigabytes times task.cpus. Remember, we used task.cpus when developing our RNA-seq proof-of-concept pipeline: in the script blocks we wrote $task.cpus, which automatically picks up the number of CPUs allocated to the task and includes it in the actual command that gets executed. We actually modified this in session one.
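As a sketch, here's roughly what those two scopes look like side by side in a config file; the env values mirror the training examples, and the container image is just illustrative:

```groovy
// nextflow.config
// env scope: variables exported into every task's environment
env {
    ALPHA = 'some value'
    BETA  = "$HOME/some/path"
}

// process scope: default directives for task execution
process {
    cpus      = 10
    memory    = 8.GB
    container = 'biocontainers/bamtools:v2.4.0_cv3'
}
```

With debug = true set on a process, something like env | egrep 'ALPHA|BETA' in its script block will print the two variables, as we saw in the demo.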
So here, as the material shows, you can generate these numbers dynamically: you can define a config setting using a dynamic expression inside a closure. For example, four gigabytes of memory per task CPU, so the memory scales with the number of CPUs you've allocated: if you were to set cpus = 2, that process foo would be allocated eight gigabytes of memory. There's also a little note that if you're requiring more than one value, such as pods for Kubernetes, you can express it as a map object.

Similarly, we've got process.container: you can set a particular container for a particular process. This is something we've seen already, for example process.container pointing at the nextflow/rnaseq-nf image, a container with everything already in it. The documentation also shows that you can choose the container dynamically, with a little bit of logic deciding which particular image should be used, and whether you should be running it with Singularity or Docker. The way that's written is potentially a little more advanced than what we've shown in this training; it's a really common pattern in nf-core, but I think it's just a little outside the scope of what we're doing here, though you'll get quite familiar with it if you do a lot of nf-core pipeline development.

Down here, there's an example of process.container pointing at a Singularity image file rather than a Docker container; the way you actually write the two is very similar. It's quite a hard one to demonstrate live because we don't have Singularity installed in the Gitpod environment, but it's something else you can do. There's also an option to use Conda. Going back to the nf-core example I pulled up, you can see the logic: if params.enable_conda is true, the Conda environment is used; otherwise, there's logic choosing between the Singularity image and the Docker image depending on which engine you want to use.

Okay, so that's everything that's included in the configuration part of this training material. Like I said, I really encourage you to go and browse around the documentation; there's a lot of information about how you can write this and include it in your processes, and of course in your wider workflow or pipeline, depending on what you want to call it. It's really valuable information, because there are a lot of ways to modify the execution of your pipeline to make it very portable, which I think is really important.
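Before we move on, here's a sketch of both dynamic patterns: a directive computed in a closure, and the nf-core-style engine-dependent container choice. The image names and tool are illustrative placeholders:

```groovy
// nextflow.config — a directive computed in a closure, evaluated per task
process {
    cpus   = 2
    memory = { 4.GB * task.cpus }   // 8 GB here; scales with the CPU count
}

// in a process definition — pick the image based on the active engine
process foo {
    container "${ workflow.containerEngine == 'singularity'
        ? 'https://depot.galaxyproject.org/singularity/sometool:1.0'
        : 'quay.io/biocontainers/sometool:1.0' }"

    script:
    'sometool --version'
}
```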
Okay, so that's the end of the configuration section, and we'll now jump over to deployment scenarios. This is where some of the configuration we've just covered becomes a little more interesting, in a slightly more applied way. Putting this in context: in real-world genomic applications, you might need to execute thousands of jobs using very large files. In this scenario, a batch scheduler is commonly used to run a pipeline on a large computing cluster, because you wouldn't have the resources locally on your laptop, for example. It allows the execution of many jobs in parallel across many compute nodes: all of those tasks, all of those processes, get farmed out across the compute nodes of your cluster. Nextflow has built-in support for all of the commonly used batch schedulers, such as Grid Engine, Slurm, and IBM LSF, and a lot of cloud platforms are supported now as well.

Thinking about cluster deployment: you have your script, which is decoupled from your config. Both feed in together, and based on the config you can specify whether your script should run locally or on a batch scheduler; there are cloud options included here as well. To run your pipeline with a batch scheduler, all you need to do is modify the nextflow.config file, specifying the target executor and, if needed, the required computing resources. For example, process.executor = 'slurm' sends it to the Slurm scheduler. There are a lot of examples in the Nextflow documentation: if you're using Grid Engine, for instance, you can specify the executor as well as the queue and clusterOptions, and there are lots of different settings you can apply in the process scope depending on which executor you're using.

Going back to the material, there are plenty of examples: you can use queue, cpus, memory, time, and disk to specify the resources you want to allocate to a pipeline. So, looking at it again: process with executor slurm, on a short queue, with 10 gigabytes of memory, a maximum of 30 minutes, and four CPUs. These would be the requirements for all processes in your workflow application, meaning they'd be applied to everything. That might be fine for some simple pipelines and simple workflows, but in reality you might need to allocate different resources for different parts of a pipeline, and for this you can use selectors like withName or withLabel. Looking at withName: every process is sent to the Slurm executor on the short queue with 10 gigabytes of memory, 30 minutes, and four CPUs, but using withName you can select the processes named foo or bar and say that foo should use two CPUs with 20 gigabytes of memory on the short queue, while bar gets four CPUs with 32 gigabytes of memory on the long queue. So you can choose the resources for each process, which means you can come up with really nice ways of deploying a pipeline: a simple process that doesn't require many resources can be done quickly on a short queue, while something particularly complicated that requires many resources can be sent to a long queue with more resources.
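A sketch of that withName pattern, with the resource numbers from the example just described:

```groovy
// nextflow.config — defaults for every process, then per-process overrides
process {
    executor = 'slurm'
    queue    = 'short'
    memory   = 10.GB
    time     = '30 min'
    cpus     = 4

    withName: foo {
        cpus   = 2
        memory = 20.GB
        queue  = 'short'
    }
    withName: bar {
        cpus   = 4
        memory = 32.GB
        queue  = 'long'
    }
}
```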
Here, this is actually quite a cool exercise, where we allocate the process named quantification two CPUs and five gigabytes of memory; note that the example won't quite work as written if the process is named differently in your script. To show you how to do this, I'm going to work in my nextflow.config. Under the process scope I'll add withName: quantification, and inside it I'm going to set cpus to five and turn up the memory, which is more than it had previously. All of this is wrapped up in withName: quantification inside curly brackets, which is the same as the documentation example; I've just moved it inside the process scope that was already part of my config file. Then I can run nextflow run script7.nf; Docker is enabled in the config, so the command-line flag isn't needed and can be removed. You'll see all of this running, which is great. What I will do, though, is interrogate what was actually run for the quantification process by going into that task's work directory: cat the .command.sh file under the hash directory generated for that task, to see what was executed. What we're expecting is that five CPUs were allocated, and you can see it here: threads 5. Going back to the script, that's exactly what we'd expect: salmon quant with --threads $task.cpus, because we've defined five CPUs for the process named quantification. This is again quite a simple example, but you can imagine how it could be built on, and how, based on what you're trying to provide for a particular process or task, these resources could be set quite dynamically as well.

Next is an example of using labels. We've used labels quite extensively as part of nf-core. You might remember: if we go over to a pipeline, I'll just pick this one because it's at the top, and go to its Git repository, we can look at the modules. In the main.nf of an nf-core module, such as gunzip, we've got these labels, and they're based on what has already been included as part of the nf-core template. In the base configuration file, you can see the different labels that have been defined to allocate different resources, and in this case they scale: for every task attempt, the number of CPUs, the memory, and the time allocated to these processes increase. So this is where dynamic execution can come in. Anyway, going back to the training material, we've got process task1, for example, which has the label 'long', versus task2, which has the label 'short', meaning the two get different CPU, memory, and queue resources from the process configuration. That's exactly what's happening in the nf-core pipelines as well. This is very cool, because it means you can create one configuration, one label, and apply it to a lot of your processes, which saves you from having to specify all of this for every single process: you can just say, okay, this process has high memory consumption, I want to apply my high-memory label.
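A minimal sketch of that label pattern, with illustrative resource values:

```groovy
// in the pipeline script: tag processes with a label
process task1 {
    label 'long'

    script:
    'sleep 60'   // placeholder for a long-running command
}

// in nextflow.config: allocate resources per label rather than per process
process {
    executor = 'slurm'

    withLabel: 'short' {
        queue = 'short'
        cpus  = 1
    }
    withLabel: 'long' {
        queue  = 'long'
        cpus   = 8
        memory = 32.GB
    }
}
```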
Similarly, something we do as part of nf-core is provide specific containers for specific processes. Here, for example, we have the processes foo and bar; these would probably be uppercase if I were naming them, but each one sets its own container, some image X for one and another image Y for the other, and those containers will be used because Docker is enabled. You can see this happening in nf-core as well: back in the modules folder, if I pick the nf-core gunzip module's main.nf, we've got the Conda environment defined, and alongside it the information about the containers that would be used with Singularity or Docker for this particular process.

Okay, so down here under 11.3 we get to configuration profiles. We've used profiles a little bit already, especially in session two, when we were exploring the nf-core pipelines and tooling; we were using the Docker profile. You can create profiles that have set information. With nf-core, for example, we have test profiles: there's a test.config, included from the nextflow.config, which already has input information and parameters set, and it can be activated as a profile. Similarly, there's the base config with all the information about the different labels, and a modules config with set information, using withName, for each of the modules included in the pipeline. All of this can be pulled in through profiles. What I'm trying to show here is that you can build these really interesting and important webs of profiles and configuration parameters that can be included quickly and easily using -profile on the command line. It's a little harder to demonstrate here because we would execute this on the cloud, for example, which isn't set up in the Gitpod environment.

On the topic of cloud deployments, we have some information under 11.4 about how you might set up AWS Batch: a little bit of information on what you'd need to include, as well as how you'd use volume mounts for your data. There's also some information about custom job definitions; that isn't really demonstrable in the Gitpod environment, and custom images and launch templates are a little hard to show as well. But if you are interested in any of this, it's a really nice place to start for setting up cloud deployments.

Here's a quick note on hybrid deployments, which I think is worth discussing. In the example we have process executor slurm and queue short, which would be applied to your whole pipeline, but then withLabel: bigTask sends any process carrying that label to the awsbatch executor, with its own queue, my-batch-queue, and its own container, my/image:tag. Basically, what this is showing is that you can have a hybrid deployment: say you were running locally, or on your cluster, and you got to a process that required much more computational resource than you have available; you could have that one process sent to the cloud just by adding the label. That's cool, because it means you could even run something locally on your laptop and have a particularly difficult task or process sent off to the cloud, or to an HPC system, relative to your local device.
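Here's what that hybrid setup looks like as a sketch; the queue and image names are placeholders:

```groovy
// nextflow.config — hybrid deployment: everything on SLURM by default,
// but processes labelled bigTask are shipped to AWS Batch
process {
    executor = 'slurm'
    queue    = 'short'

    withLabel: bigTask {
        executor  = 'awsbatch'
        queue     = 'my-batch-queue'
        container = 'my/image:tag'
    }
}
```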
Okay, moving on: execution cache and resume. This will help bring together a few things I've mentioned at different points when I was poking around the work directory. It will come up again a little in troubleshooting, but understanding how the cache and resume functionality works really ties together several things we've talked about at different times.

The Nextflow caching mechanism works by assigning a unique ID to every task that is executed. You have your process, and as the channels fill with information, as each different file comes through, a new task is generated. Each task gets this unique ID, a 128-bit hash number composed from the task's input values, input files, and the command string, as written here. As we've seen previously, everything is generated under the work directory, in a folder named after that hash, with everything staged inside: everything coming in from your channels, everything included from your parameters, is there, staged and made available for the task execution. The task runs, everything is contained in that folder, effectively a self-contained environment, and as such the hash number incorporates all of that information.

There's a little bit of information here about using tree to inspect this. I could show you, but it would be a bit of a mess because I've run lots of different processes, so there are lots of files already in my work directory. Actually, it's not too bad: here, for example, you can see that this d6... hash directory has all of the files needed for the execution of that task incorporated into it; the input files have been staged with symbolic links, and some outputs have been generated in there as well. We'll touch on this a little more when we talk about troubleshooting, but for now it's important to remember that everything is brought into that task directory, which is named by that 128-bit hash.

So how does resume work? The -resume command-line option allows the continuation of a pipeline from the last step that completed successfully. Practically, what's actually happening is that Nextflow starts from the beginning again, but for each task it checks whether the work directory for that hash already exists: if nothing has been modified in any way, so the hash comes out exactly the same, it marks the task as cached, skips the re-run, and moves on to the next task or process in the pipeline. The material words this a little more precisely: before launching the execution of a process, Nextflow uses the task's unique ID to check whether the work directory exists and whether it contains a valid command exit status with the expected output files. So it isn't going to miss anything: it actually checks that the exit status wasn't an error. If the check is satisfied, the task execution is skipped and the previously computed results are used as the process results. And as I've just mentioned, all of this is generated in the work directory, which by default is simply created inside the directory you're launching from.
For example, I'm actually sitting in a slightly different directory here, one level up from the nf-training folder. If I run nextflow run nf-training/script7.nf, everything will run, and we'll see a new work folder created in the current directory, because everything is relative to where you launch the Nextflow script from, not where the script lives. So if we look again, we have this fresh work directory containing just the files generated by that execution. Alternatively, you can use the -w command-line option to set a different location for your work directory. I think it's quite common for people to point this at a scratch area, for example, so the intermediate files live on scratch rather than in a folder higher up in your directory tree where files would need to be passed up and down more frequently.

Cool. Here's another note about how the hash code is generated for input files: it's based on the complete file path, the file size, and the last-modified timestamp. This is actually quite an important point: even just touching a file will invalidate the related task execution. For example, I can go back here, remembering that this is the work directory outside the nf-training folder, the one I've just executed, and touch one of the staged input files; let's pick the gut FASTQ file. Now if you resume, that task shouldn't be re-used from the cache, because the touched file has a different timestamp, so the resume functionality won't accept it. You can see that the index task was still cached, and that's fine, because the file I modified doesn't feed into the index process: I only touched the gut .fq file, and the index is built from the .fa transcriptome file, which is completely separate from the quantification inputs. So only the tasks downstream of the touched file re-ran. I'll move back into the nf-training folder now, as you can see, and we'll keep moving through the training material.

13.3 makes quite an important point: it's good practice to organize each experiment in its own folder, with the experiment's main input parameters specified in its own Nextflow config file. That makes it easier to replicate the analysis over time, because the configs and everything else are stored alongside your data.

Something we haven't spoken about at all is the nextflow log command, which is quite cool. It's a way to look at all of your previous executions, and potentially to resume a run you've done previously. I've just shrunk my terminal text a little, so it's a bit smaller now, but it's nice to see everything on the same line. By default, it shows you the timestamp, the duration, the run name, the status, the revision, the session ID, and the command for each run, all listed in order. Note that runs launched outside of this folder aren't included here;
only the runs launched from inside the nf-training folder are listed. What you might want to do, going back to the example in the material, is something like nextflow run rnaseq.nf -resume followed by a run name. For us, we can pick up most of that information from the log: nextflow run script7.nf, exactly the same script we ran down here, and the run name it was given, drunk_tesla in my case. What I can do is pass -resume with that particular run name. But let's say I now have new samples; I don't want to redo the work that's already done, I want to rerun this pipeline with extra input. I can override the reads parameter: going back here, you can see where reads is set, so I'm going to copy it, make it a relative path, and change it to a glob that matches all the samples. When this was run previously, it was only for the gut samples; now, when I resume drunk_tesla with reads including the extra samples, what we should see, or what I hope to see, is that the run continues. After I fix a missing dot in my glob, let's try that again: it runs, and it's picking up the cached information. The gut sample here, because I potentially modified it earlier with touch, hasn't been picked up from the cache. So it's not a perfectly clean example, but it does show how you can go back and reuse the start of a run: it uses the files that were cached previously, the ones that already exist in your work directory, and just redoes the processes it needs to, while anything that has been modified can't be cached or re-included.

Next, we have a little bit of information about execution provenance. The log command, provided with a run name or session ID, can return many useful bits of information. So again, just using nextflow log with the name of an execution; actually, we'll do this one right here, which might be a little more interesting. Okay, that was a bad example because that run was cancelled, so let's replace it with the name of a run that completed, and it gives us the list of all the different work directories that were generated as part of that run. What you can then do is build on this and ask for specific pieces of information about each task by adding a fields option, something like -f 'process,exit,hash,duration': that asks for the process name, the exit status, the hash, and the duration of each task, and you can see that information printed for each of them. You can also ask for the output in a different format if you're trying to do something else with it, like putting it through some sort of plotting.

This next part is quite a useful resource for anyone trying to troubleshoot: 13.5 is about resume troubleshooting, a list of things that might have changed and impacted the execution of your resume.
So: the input file has changed; the process has been modified; inconsistent file attributes, which can happen on shared file systems with slightly more complicated file attribute handling; and race conditions in a global variable. That last one is an example where a variable, X for example, races between closures, and you might find the values appearing in different orders. I can show this; I'm just going to put the example into snippet.nf. Here we have a channel of one, two, and three, and we're mapping each value: in the first channel, X is set to the value and then X += 2; in the second, X is set to the value and then X *= 2. Because X isn't declared, both closures share it, so we can effectively get race conditions; that was a pretty rough explanation, so let's just run it: nextflow run snippet.nf. What I expect for channel one is three, four, and five, since for inputs one, two, and three we add two; and for channel two, doubling gives two, four, and six. You'd expect that, with the closures interleaving, these numbers could get scrambled between the two channels. But you can see here that we got channel one equals three, four, and five, and then two, four, and six: the expected values.

Now, if we use the def keyword to declare the variable as local, so the two closures don't interfere, we should reliably get the correct results. Again, all we've done is add def, defining X as a local variable, meaning it can't be stomped on by the other task executions. Oh, what's happened there? That code block hadn't been updated properly either, so we change that to view and look at it again: for channel one we have three, four, and five, and for channel two we have two, four, and six, which is exactly what we'd expect for those modifications of one, two, and three. I guess it's not a particularly good example so far, because the race conditions happened to work out the way you might have hoped. Just to demonstrate this properly, I'll run the version without def once more; I just love it when an example works when it shouldn't: two, four, and six, still working.
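For reference, here's the snippet in question as a sketch, with the fix shown in the comments:

```groovy
// snippet.nf — X is undeclared, so both map closures share one global
// variable and can interfere when they run concurrently (a race condition)
workflow {
    Channel.of(1, 2, 3).map { X = it; X += 2 }.view { "ch1 = $it" }
    Channel.of(1, 2, 3).map { X = it; X *= 2 }.view { "ch2 = $it" }

    // the fix: 'def' makes X local to each closure, so they can't interfere
    // Channel.of(1, 2, 3).map { def X = it; X += 2 }.view { "ch1 = $it" }
    // Channel.of(1, 2, 3).map { def X = it; X *= 2 }.view { "ch2 = $it" }
}
```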
Okay, this time we've got six, four, and six, so it hasn't worked out: because there is a race condition, the first couple of executions just happened to come out right, which made for a poor example, but this run shows the problem. What I was trying to show is that adding the def keyword, declaring the variable as local, fixes it.

The final item is non-deterministic input channels. This is when you have two or more channels, each of which is the output of a different process, and the overall input ordering is not consistent over executions. That just means that unless there's something to tie the items together, to make sure they're emitted and matched in the right place, you can get non-deterministic outputs, which is shown here as well. What's really happening in the example is that you've effectively got a pair coming in, a BAM file and its BAI index, the reference index, but because there's nothing to join or group them, such as a tuple with a shared key, the index might not belong to the file you'd expect. I was going to check whether this example would actually run; probably not, without a workflow block. But the point is this: the two outputs, the BAM file and the BAI file, will be emitted at different points if you have multiple files, with no guarantee that they arrive in the same order, meaning they might just turn up at the gathering process mismatched, which isn't a good thing.
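One way to make that pairing deterministic is to key each item by a sample ID and use the join operator, which matches items on their first element. The channels below are hypothetical stand-ins for the outputs of two upstream processes:

```groovy
workflow {
    bam_ch = Channel.of(['sampleA', 'A.bam'],     ['sampleB', 'B.bam'])
    bai_ch = Channel.of(['sampleB', 'B.bam.bai'], ['sampleA', 'A.bam.bai'])

    // join matches items by their first element (the sample ID),
    // regardless of the order in which they were emitted
    bam_ch.join(bai_ch).view()
    // -> [sampleA, A.bam, A.bam.bai]
    // -> [sampleB, B.bam, B.bam.bai]
}
```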
Okay, moving on to troubleshooting. I have demonstrated some of this in slightly different ways through the last three and a half sessions; what I'll do now is go back through it in more detail, talk about what's actually happening, and look at those hidden Nextflow files that are generated in the work directory.

When you have an error, you might get an error message like this one; here's an example of the different lines you can expect to find. This is the error message we got from one of the earlier scripts in session one: the process that failed, a description of the error cause, the command executed, which is what actually ran in the .command.sh file we've looked at a couple of times, the exit status, the command output, labeled as empty if there was none, and the work directory of the task that caused the error. So, as we've done previously, you can go into the work directory and look at the individual task execution directories; they contain these hidden command files, and we can look at them to see what was actually run and try to understand why a particular process failed.

What I will do is clear my work directory, just to make it a little cleaner, and then run the script outright, no resume, for the first time. What we can see is all of the processes, all the tasks, being executed, each creating its own work directory; the first one here, for example, is this 97/256..., and we can go into it. What we have are all these different files that tell us about the command: how it was executed, how it began, whether there were any errors, the log, everything you might need to know. There's nothing in .command.begin; .command.err holds the error output; .command.log has a lot of information about what would normally be printed to your terminal, all kept in the log file, and the same content is given as the output in .command.out; .command.run is the wrapper with everything Nextflow adds for how the task was actually staged and launched; and .command.sh is what was included from the script block. If you do have failing processes or tasks, this is a really great place to find out what was executed and actually try to debug it. It's probably a slightly more advanced way of working, but everyone needs to start somewhere, so I do encourage you to explore it. One more thing you can see, using tree, is the structure of all these files: we have symbolic links, which is the data being staged, brought into this folder and linked in, so that when .command.sh executes, the files are there to be used by the process. That was all quite a roundabout way of saying there's more information in the material about looking at the different exit codes, .command.begin, and so on.

You can also ask Nextflow to ignore errors: if something is failing and you just want to skip that process and carry on with the rest of your pipeline, you can do that with errorStrategy 'ignore', set either as a directive on an individual process or in the configuration file for every process. Another directive, which you could likewise include in your configuration, is errorStrategy 'retry': if a task fails, it will be tried again. You can combine that with maxRetries, so it keeps retrying up to a certain number of times and then gives up. A situation where this helps is when resources on your HPC are allocated elsewhere or unavailable when the job tries to launch, or something is taken away mid-launch and it fails; simply retrying would make it work. You can also set up evolving behavior, such as retry with backoff, where the wait grows between attempts, or change the amount of resources allocated based on the error strategy: it failed because it didn't have enough memory, so retry with more resources. That's the dynamic resource allocation we touched on earlier: two gigabytes per task attempt, so if something fails, it's retried straight away under the retry error strategy with more resources, to see if that gets your analysis over the line.
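Pulling those directives together, a minimal sketch of the retry-with-more-memory pattern:

```groovy
// nextflow.config — retry failed tasks, asking for more memory each time
process {
    errorStrategy = 'retry'   // or 'ignore' to skip failures entirely
    maxRetries    = 3

    // task.attempt is 1 on the first try, 2 on the first retry, and so on:
    // 2 GB, then 4 GB, then 6 GB
    memory = { 2.GB * task.attempt }
}
```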
So that's about everything in this part of the training material; what we'll do now is go and look at Nextflow Tower in slightly more detail. For the next 30 minutes or so we'll talk about how you can get started with Nextflow Tower, starting with some basic concepts. What is Nextflow Tower? It's a centralized command post for the management of data and pipelines that brings monitoring, logging, and observability to distributed workflows, and simplifies the deployment of pipelines on any cloud, cluster, or laptop. Some of the core features of Nextflow Tower include launching preconfigured pipelines with ease, programmatic integration to meet the needs of organizations, publishing pipelines to shared workspaces, and managing the infrastructure required to run data analysis at scale. What all of this is alluding to is that a lot of the things we've already talked about doing with Nextflow, managing your source code, scaling to the cloud, managing your software with things like Docker, can be done through Nextflow Tower as well, in a slightly more intuitive and easy way.

What I would like everyone to do is access the Tower website; you can click on the sign-up link here and it will take us there. On the overview page there's a lot of extra information about Nextflow Tower and how it might be able to support the execution of your pipelines: for example, improved productivity, removed complexity, reduced cost, and simplified compliance. There's a lot of really valuable information on this main web page; you can scroll down and read it in your own time, but I do encourage you to come back, browse around the website, and understand all of the features and exactly how they work. For example, there's information about the compute platforms, the automatic provisioning and management of scalable compute environments in the cloud, which you can do through Nextflow Tower, as well as managing your container technologies and your source code. Right down at the bottom of the page is a full list of Nextflow Tower's features, and I also want to draw your attention to the Learn section, where there are a number of other resources for understanding Nextflow; the blog in particular has some really nice extended pieces about Tower features and how you can integrate Nextflow and some of Seqera's other products, such as the Fusion file system, into your pipeline and run it all through Nextflow Tower.

Okay, scrolling back up to the top, I also wanted to show you, under Pricing, that what we're doing today falls under Tower Cloud Free: it has all of these features and you don't have to pay for any of it, though there are some limited quotas and a few things missing compared to Tower Cloud Professional and Tower Enterprise; if you're interested in what those might offer, I'm sure Seqera will be very happy to hear from you.

Okay, so what I would like everyone to do is sign up. Here in the top left-hand corner of the page we have the sign-up button; click on that and it takes you to the Tower login page. You have two options, or I guess three: sign in with your GitHub or Google account, or sign up with your email. As most of you will be using GitHub for the Gitpod environment, I think the GitHub link is easiest for everyone. Once you're in, you should land on the community showcase: a launchpad with access to a lot of the nf-core pipelines, which we can launch and play around with to understand what Nextflow Tower can do, and we'll do that very shortly.
However, what I would like to start with is the training material, under 12.2 Usage: we'll look at how you can launch your pipeline from the command line and then monitor it using Tower, since monitoring is one of Tower's core features. Back in the Nextflow Tower window, in the top right-hand corner you'll see the profile picture you have on GitHub or Google or whatever else you used to sign in. From the drop-down menu, go to Your tokens. Here we have the access tokens, and over on the right there's a button that says Add token; I want you to click on that and create a token name. I'm just going to call mine "my demo" so it's easy to identify, click Add, and it creates this token that allows us to connect Nextflow to Tower using the TOWER_ACCESS_TOKEN environment variable in our Gitpod environment.

Going back to the training material, scroll down a little and you'll see how to export this token into your environment: export TOWER_ACCESS_TOKEN followed by the value. I'm going to copy that into my Gitpod terminal, then go back, copy the entire token, paste it in, and hit enter. One thing, please: don't export the Nextflow version shown in this part of the material; it's a little old, and I think things would now break if you tried to use it. If you have done this by mistake, the environment setup page, right at the bottom, shows the version of Nextflow that I'm using, which I think is the best one to use for this training. So we've now exported this token, and TOWER_ACCESS_TOKEN is in our Gitpod environment.

We're also going to use a Tower workspace ID, so what I'll do now is set up an organization and a workspace within it. Again in the top right-hand corner, we can go down to Your organizations; I'm part of community, and also part of Seqera Labs. I would like to add an organization, so I'll give it a name; I'm going to call it "my demo". You'll see the name field is formatted with an underscore automatically, and I'm going to change it to lowercase as well. There are options for a full name, description, location, website, and logo, so if you were setting up an organization where you work, for example, you could include a lot more information to make it more identifiable for you and others. I click Add, and you can see the my-demo organization has now been created under the organizations window, and I can enter it. Now I'll create a workspace, by clicking the Add workspace button. I'll name it work1, with the full name Workspace one. You can imagine that if you were in a group or an environment with different workspaces, different projects, different jobs, this would be quite a powerful way to create organizations and organize the workspaces you need access to, or to allow others access. I'm going to mark it as shared so that others can see it. So now we have work1, full name Workspace one, and it has been given an ID; why that matters will become apparent very shortly. Back in the documentation, there's the corresponding export for the workspace ID: I'm going to copy that into my Gitpod window, then copy the ID out of Tower and paste it in. So now I've exported both my TOWER_ACCESS_TOKEN and my TOWER_WORKSPACE_ID, and as a reminder, you can set both of these things up by creating an organization and workspace here, and by creating a token here.
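As an aside, the same settings can live in your config instead of your shell; here's a sketch using the tower scope, with placeholder values:

```groovy
// nextflow.config — an alternative to exporting TOWER_ACCESS_TOKEN and
// TOWER_WORKSPACE_ID in your shell; both values below are placeholders
tower {
    enabled     = true
    accessToken = 'eyJxxxxxxxxxx'      // treat this like a password
    workspaceId = '000000000000000'
}
```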
So this is where things get really cool: we can run a pipeline from the command line, from Gitpod, in this environment. We can use nextflow run hello, and all we have to do is add -with-tower, and the run information will be sent to Tower so we can monitor it there. Back in Gitpod I'm going to run hello.nf with -with-tower and hit enter. What you'll see is Nextflow running; it has given the run a name, modest_franklin, and printed a URL where we can monitor the execution with Nextflow Tower. I'm going to click on that and open it. What you see now in Nextflow Tower, and it could find my account because of my access token and workspace ID, is the run inside my demo, Workspace one, with all the information about it. Down here in General we can see what the command line was, all the information you can normally access in the terminal, but organized and neatly stored in Tower. We can see that all of the submitted tasks succeeded, which processes they were, and all the aggregated stats of the run: the wall time, the CPU time, the total memory, the read and write I/O, and the estimated cost, which becomes more relevant if you're running on the cloud. We have the load, so the cores and tasks, the memory efficiency, the CPU efficiency, and the list of tasks that were executed: here splitLetters ran once and convertToUpper twice, because those were two separate tasks. We have all the information about each task, the chunk_aa one here and chunk_ab down there; everything you'd normally dig out at the command line you can access here, in a really intuitive and easy way to find and interact with.

Further down we have more information about the metrics: the raw CPU usage, what was allocated and what was used, the memory, and the job duration. So if you're running a larger pipeline, especially at scale, and you're trying to understand what's happening where and when and what was done, this is a really powerful and easy way of measuring and viewing what happened in your pipeline. We can also view the parameters, which is a really nice view if you have a lot of them; we have the configuration, so all the configuration files; and we have datasets as well, because you can actually store and include dataset information in Nextflow Tower, along with reports. We'll come back and view these when we run one of the community showcase pipelines, but what I wanted to show was just a very quick introduction.

Now let's go back to the Gitpod environment and run something a little more complicated. As you'll remember, one of the scripts we've revisited a few times is script7.nf, that proof-of-concept RNA-seq pipeline we developed. We're going to run it again, with -with-tower. This will print another monitoring link like the last one, but we can also just go back to our workspace, my demo, Workspace one,
This will create another link like the one before, but we can also go back to our workspace, my demo, work1, and look at the runs. You can see that this run has already appeared and is running, so again we can see what has run. Because we ran with Tower, we have all of the same information that we've already viewed for hello.nf: this has already run, and we've got new aggregated stats as well as information about all the different pieces of this run. Again, as I already showed, you can go in and access all of that run information, so for every task, what was there and how it was executed. What I'm hoping you're starting to see is that Nextflow Tower is a really nice way to manage and view, if not outright control, the runs you've launched from the command line. Okay, going back to the training material, you'll see that we've basically reached the end of 12.2.1, and if anything we've gone a little into 12.2.2. There is a lot of extra information here, and you can work through it as well if you want, but I think the easiest thing to do is to go over to Nextflow Tower, explore the community showcase, and then go back to our new organization and look at how you might set some of this up for yourself. So here we have the community organization and its showcase workspace, with all these pipelines that have already been added. You can quickly and easily launch these using the launch button. I'm going to use RNAseq as an example, so we hit the launch button, and that brings up a full list of parameters. These are the same kinds of parameters that we showed as part of session two when we were looking at the NFCore pipelines and their parameters JSON file. All of that information is included here, and you can quickly and easily pass it into the pipeline: you can set your output directory to "results", for example, and add an email and a title for your MultiQC report. All of the parameters that the pipeline accepts are included in this form, with the descriptions that were generated from the pipeline's schema accessible here. If you are happy with that, you can very quickly and easily launch. What this will do is launch the pipeline using the compute environments and credentials that are already available here. You can see that I launched essentially this same pipeline previously, and that run completed in 24 minutes. While this one is still submitting, you can see that it's a slightly more complicated example compared to what we ran from the command line in Gitpod. This was run directly from GitHub; we have the GitHub address here, the name, a parameters file, and with Tower we have the version, the revision that we're using, the test profile, as well as the latest flag. We have all the information about where the work directory was; in this case it ran on AWS Batch using the AWS Batch executor. All of the different processes are listed here, and we again have all the information about the wall time, the CPU time, the memory used, the read and write, and the cost. The cost is only an estimation, but it gives you an idea of how much this would have cost based on the resources that were used for this run; real billing is more complicated than that, so I just wanted to point out that this is an estimate rather than an exact cost. Here again is all the information we've already looked at.
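To connect that back to the command line: the Tower launch form is assembling roughly the same invocation you could type yourself, along the lines of the sketch below (the revision shown is illustrative, and the test profile and output directory mirror what we set in the form):

```bash
# Roughly what the launch form builds: a pipeline pulled from GitHub,
# a pinned revision, the test profile, the latest flag, and pipeline
# parameters such as the output directory.
nextflow run nf-core/rnaseq \
    -r 3.4 \
    -profile test \
    -latest \
    --outdir results \
    -with-tower
```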
One other thing we can look at here, for example, is filtering for information. I can filter for, say, the wild-type samples and go in and look at exactly what happened for a particular sample as part of the pipeline. You can also see, for every process, the CPU usage as well as the memory usage, and the point of this is that you can actually understand where your resources were allocated: what processes took longer, what took less, and where those resources went. That's all very cool. Again you can see the full list of parameters and the configuration file, and in this case a dataset; you can store information about datasets in Nextflow Tower, and we have that information here, in this case the sample sheet used for this run. You can also have your compute environments; we'll look at how you can set these up very shortly, but here we've already got some pre-configured compute environments, both on AWS. When a pipeline is executed you can specify which environment you want to use, so you can have all of this pre-configured and loaded and just quickly and easily click on it when you are in the Launchpad. The same goes for credentials: you can store credentials here, which might be the keys that you use for your GitHub account, for example, or for a private repository for your Docker images. All of this can be stored here, and you can also do that for secrets, so if you're accessing things like NCBI you can have pipeline secrets stored here in a really safe way, because everything is encrypted and hidden away, and you don't need to worry about security (I'll show a small command-line sketch of the secrets idea at the end of this section). Right at the end we have the participants in this workspace: here, for example, you can see a number of different team members that have been added. They have different permissions, and based on those permissions they have different privileges for creating and running pipelines and workflows. Going back to the runs, you can see that ours is still in the submitted state; because this is being sent to AWS, it is still asking for those resources and will take a little more time to run, so you might just leave it running in the background; you can see that it is ticking over as we speak. Going back to my workspace, so my demo, work1, you can see all of this in a slightly different way. This is a completely new environment, and it's probably what you would be faced with if you were setting this up for the first time by yourself. You can see that you can add pipelines, much like those NFCore pipelines that have already been added to the showcase: you can give a pipeline a name, a description, and any labels that you might want (you can create labels to tag these pipelines and make them easy to find). You can set a compute environment, which are the environments that you can also set up here; we'll come to that very shortly. Then there is the pipeline to launch, so here you can specify a pipeline that you would like to run, straight from GitHub; if you're doing it from a private repository you would need the credentials, but you can add your pipelines here, with revision numbers, work directories, config profiles and pipeline parameters. Everything you can imagine can be controlled here as part of Nextflow Tower. Here we can see the runs that have already been executed, so this is where the pipelines have run.
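Coming back to those secrets for a moment: Tower manages them through its web interface, but recent versions of Nextflow also have a local secrets store built on the same idea, which gives you a feel for it from the command line. A minimal sketch, where NCBI_API_KEY is just an illustrative secret name and value:

```bash
# Store a secret outside your pipeline code and configuration
# (Tower does the equivalent through its UI, encrypted at rest).
nextflow secrets set NCBI_API_KEY 0123456789abcdef

# List the names of stored secrets; the values themselves are not printed.
nextflow secrets list
```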
If I had launched a pipeline from here, it would end up in here as well, and as you can see, we can quickly and easily access these previous runs, see what was run and how it was run. This helps us keep track of the runs we've done in the past, especially if we come back to something later and have forgotten where something is, how it was run, or what parameters we used; it is a really great way of storing that information. Here we have actions, which are basically automatic executions from third-party integrations. For example, you could have event triggers such as code commits or webhooks. As an example: if your pipeline has been updated in your repository, you could have an action here that says, because it's been updated, please run it again with this test dataset, which you could also have stored here in the datasets. This is quite a common and really nice way of checking that your pipelines still run and that nothing is broken as you update them in your code repository (a rough sketch of what triggering such an action could look like follows at the end of this section). As mentioned, you can include datasets here: you can add a dataset with a name, a description, and some sort of sample sheet that specifies which samples you would like to include in a run. Then there are the compute environments, which I've mentioned a couple of times already; this is where you set up the compute environments that you want to use with Nextflow Tower. Here we have a full list of options: a lot of cloud providers and HPC options that you can use. If you click on one of these, you can see that you've got everything that you need to fill in, so it helps you provide all this information without having to guess or do it all on the command line. Here, for example, you could change the region that you're running in and set a work directory, and you can also enable things like Wave containers and Fusion, which are other products that you might want to include; all of these are covered if you go looking around on the website. Fusion and Wave are relatively new, but they are both really worthwhile exploring. There are, of course, more options for how you can configure this, such as using Batch Forge, spot instances, and a number of other settings you might be interested in, as well as how you want to stage the environment, with advanced options down there too. Moving along the top to credentials: again, you can add credentials to allow you to run these environments. For example, if you were setting up an AWS environment, you'd probably want your credentials stored in here, which makes things a lot easier: you can simply link your environment with your credentials and run quickly and easily. Of course, as already mentioned, we've got the pipeline secrets, and you can add participants, which is how you would add other group members so they can access and view your runs, as well as some other workspace settings. Going back to the community workspace, here we have the showcase, and under runs you can see what has already been run. You can see that our run is now running: five jobs have been submitted, but none have succeeded yet, and we have a lot of information about their status; of course we can always go in and view what has been run and how it is going. Okay, so that's a lot of information quite quickly.
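Picking up the actions idea from above: an action in Tower is exposed so that a third-party integration, say a CI job that fires on a code commit, can trigger it. The sketch below is purely hypothetical; the endpoint path and action ID are invented placeholders, so check the Tower API documentation for the real shape of this call:

```bash
# Hypothetical sketch of a CI step triggering a Tower action after a commit.
# The URL path and <action-id> are placeholders, not a documented endpoint.
curl -X POST "https://api.tower.nf/actions/<action-id>/launch" \
     -H "Authorization: Bearer $TOWER_ACCESS_TOKEN" \
     -H "Content-Type: application/json"
```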
One feature I would like to go back and show you, though, is the optimization of a workflow. Here, under runs in the community workspace, what you might notice is that some of these runs have an "optimization available" flag. As part of Nextflow Tower, you can look at the optimization of a particular pipeline. If you were to look at this run, for example, you'll see that it has this optimization available; what this gives you is essentially a configuration that has been generated with optimized resources. Based on the run that was done previously, Nextflow Tower compares the resources that were allocated with what was actually used, and then gives you an optimized configuration for running the pipeline again, which can deliver some pretty serious cost savings. If you were to run this again, you could copy this out and use it yourself, or simply apply the optimized configuration as part of the relaunch; you can always go back and look at the optimization there. (A small illustrative sketch of what such a configuration looks like is included at the very end of this material.) Apologies, there is one thing I've missed: the reports. If we go back to runs, something I haven't shown you is that, besides the execution log, you can have reports. These are reports that you specify in a file in your repository: you tell Tower which reports you'd like to be generated and included. Here, for example, is the MultiQC report; this is where it lives, and all the information about your run is in there, so you can jump around and look at it all quite quickly and easily. You'll keep hearing me say "quickly and easily", but that's really what Tower does for me: it makes it quick and easy to view my runs, and easy to go back and find out what I did and what parameters I used; I think it is a fantastic tool. Okay, that run will keep going in the background; it will probably take around 24 minutes, like I said. You can go back through some of these pipelines and optimize them if the resources were set too high or too low: if you go back and relaunch, you can take the optimization and use it as part of the relaunch, like I've just shown, simply by specifying that you want to use the optimization. I think that's all I wanted to show you of Nextflow Tower. I do really encourage you to come back here and have a look around. There are some features I haven't talked about extensively, such as labels: for example, if you were to go to the Launchpad and open this ChIP-seq pipeline, you can do things like adding labels, which can help with costing and charging things back. You can't do this here in the community workspace, but in your own organization, with high enough privileges, you can. Okay, like I said, I think that is where I will leave it today. Please do come and have a look around, and if you have more questions about Tower, or if you're thinking this would be useful at your institution, I'm sure the Seqera team would be very happy to hear from you. Just to finish off, there is also the API, so you can build on top of Tower if you want to; there is a lot of information here about how you can integrate Tower with other software. We've talked about workspaces and organizations, and, yeah, that's the end of the training material.
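As the final illustration promised above: the optimized configuration that Tower produces is, in essence, ordinary Nextflow process configuration with resource requests trimmed to what the previous run actually used. A made-up sketch; the process names and values here are invented for illustration only:

```groovy
// Illustrative shape of an optimized configuration: per-process resource
// requests tightened to match observed usage (all values are invented).
process {
    withName: 'FASTQC' {
        cpus   = 2
        memory = 3.GB
    }
    withName: 'STAR_ALIGN' {
        cpus   = 8
        memory = 30.GB
    }
}
```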