Hello, and welcome back to session two of the Community Foundational Nextflow Training. My name is Chris, I'm a developer advocate at Seqera, and I'll be taking you through the rest of the material in this workshop. What we'll be doing today is expanding on what was covered in session one, so we'll revisit some of the concepts and ideas that were introduced there in a little more detail. I really encourage you to take your time with this material. It will be available on YouTube from now on, so you're very welcome to pause, come back, and re-watch things at your convenience. I'll be doing a lot of demo-style walkthroughs today, and while I'll do the demonstrations with you, what I'd encourage you to do afterwards is go back and try to break them: break them, fix them, put them back together, do something strange, and see how far you can push Nextflow. That will give you a really good understanding of how all of this fits together, so that when you come back and try this yourself, with your own data and your own pipelines, you have a much better understanding of it too. For today, we'll open up a new Gitpod environment. Back on the training.nextflow.io website there's a button to open Gitpod; if you click it, it takes you to a window where you can create a new virtual environment for today's exercises. I've already created one, so just to speed things up I have it open in my window; it might take you a couple of moments to load, especially the first time. If you didn't attend session one: Gitpod is a virtual environment that we use because it comes preloaded with all of the data, tools, and software we need. To access it, all you need is a GitHub account, a little bit of registering, and a short wait while the container loads and pulls for the first time. In terms of material, we'll start down here at Channels. Largely we'll just work our way down this list. We've already looked at Groovy, and we'll otherwise follow this order down to Seqera Platform, which we'll skip for now, then do cache and resume and troubleshooting, before coming back to Seqera Platform to finish at the end of today's session. So, Channels. We've spoken about channels quite a lot already. Channels are the key data structure in Nextflow: they allow the implementation of reactive, functional computational workflows based on the dataflow programming paradigm. What that really means is that channels let you connect different processes and pass data between them. Because a channel is a data structure, you can choose how the data is shaped and how it is passed along: you can ask for it to be grouped, flattened, or reorganised in different ways, and you can choose how it passes between different processes and tasks. While we have spoken about channels to some degree, we haven't really covered the different types. There are two types of channels, queue channels and value channels, and there is one big difference between them that is quite important.
It's a point I'm going to labour quite hard, especially at the start of today's session: queue channels are consumed, while value channels are not. So what does that mean? Queue channels are asynchronous, unidirectional, and FIFO. In other words, the operations are non-blocking, data flows from a producer to a consumer, and the data is guaranteed to be delivered in the same order it was produced: first in, first out. Those are the properties of a queue channel. Sometimes queue channels are created implicitly; for example, the output of a process will implicitly be a queue channel by default. But you can also create them with a channel factory, such as channel.of or channel.fromPath. So here is an example of a queue channel: channel.of(1, 2, 3). One, two, and three are three different elements, and they would spawn separate tasks; if you were to view this channel, you'd expect the output to be one, two, and three. A value channel, on the other hand, is also known as a singleton channel. These are not consumed, so they can be read multiple times. To create a value channel, you can use the value channel factory, or use operators that return a single value, such as first, last, collect, count, min, max, reduce, and sum. What all of this means is that a value channel is effectively a single element: it could be a single number, a single file, or a single list, and it can be consumed multiple times, as long as it is a value channel and not a queue channel. I've said a lot of words there, so what does this actually look like? Here we have a snippet, which I'm going to copy, jump over into Gitpod, and paste into my snippet.nf, replacing what's there. I've made the text a little bigger, so hopefully you can see it. What I have here is Channel.of(1, 2, 3) and Channel.of(1): channel one has three elements and channel two has a single element, but both are queue channels because we're using the Channel.of channel factory, exactly as explained in the material. Then we have a single process with two inputs, x and y, which are summed together in the script block and printed to the output. Down in the workflow block, the process takes channel one and channel two as its two inputs, so channel one becomes x and channel two becomes y, and there's a view operator on the end so the output is printed to the terminal. I'm going to run this, and while I type the command in, try to think about what we expect to see. So snippet.nf has just run, and we see there has been a single task and a single output, which in this case is the number two: the single element of channel two, the number one, was added to the first element of channel one and printed to screen. Why did this happen only once? Because they're both queue channels: element one of channel one was consumed at the same time as the only element of channel two, meaning there was nothing left to be matched with the two remaining elements in channel one. In other words, channel y was exhausted.
Both channels needed to have data available for the process to be triggered, and once channel two was exhausted there was nothing else to drive the workflow forward, so it was a single task: the process was executed once. If we change channel two to a value channel, which can be consumed multiple times, we see quite different behaviour. I've just changed Channel.of to Channel.value, nothing else in the snippet has changed, and now we see four, two, and three: the process has been executed three times, with each of the three elements of channel one added to the single element of channel two. Sometimes inputs or outputs are typed like this automatically. I mentioned earlier that the outputs of a process are implicitly queue channels. On the flip side, whenever you pass in a parameter, params.something, it behaves like a value channel. So instead of having the channel factory produce channel two, I'm going to use a parameter set to the number one and pass that to the process instead. If we run this again, we should see three separate numbers printed to screen, and we do. The final kind of manipulation is using operators, which are another way you can change a channel's type. Turning the snippet back to what we had before, with Channel.of(1, 2, 3) and Channel.of(1), you'll find an exercise down here, and I'll give you a wee minute to think about it: you're asked to add the first operator to create a value channel from channel two, so that all three elements of channel one are consumed. Have a quick go at it; you can copy and paste the snippet back out into snippet.nf in case you've accidentally changed something. The answer is simply to add first as an operator to channel two, in this case inside the workflow block. So instead of the process executing once because the single element is consumed, it is now executed three times. Great. That really is the point I wanted to labour: there are these two types of channels, queues and values. If you have a process that should run multiple times, say files being mapped back to a reference genome, and your reference genome is accidentally a queue channel rather than a value channel, it will be consumed once and you'll only see one task executed. We see this come up quite a lot in Slack, so it's one of the things to watch out for.
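To make that concrete, here is a minimal sketch of the kind of snippet I've been running; the process name SUM and the echo arithmetic are my own stand-ins rather than an exact copy of the course file, so treat it as an illustration.

    ch1 = Channel.of(1, 2, 3)   // queue channel with three elements
    ch2 = Channel.of(1)         // queue channel with a single element

    process SUM {
        input:
        val x
        val y

        output:
        stdout

        script:
        """
        echo \$(( $x + $y ))
        """
    }

    workflow {
        // with two queue channels this runs once; swap ch2 for
        // Channel.value(1), or use ch2.first(), and it runs three times
        SUM(ch1, ch2).view()
    }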
Okay, moving on to channel factories. Channel factories are the different commands for creating channels, each with its own expected inputs and behaviour. What I'll do here is work through some of the more common ones; there are others you can find in the Nextflow documentation. The first is Channel.value, which of course creates value channels. What you'll notice is that each of these examples is a single element, and the factory takes an optional, not-null argument: here you've got 'hello there', which is a string, so the string is the single element, and for channel three we've got a list, one through five, where the list itself is the single element. Next is Channel.of, which creates queue channels; we've just been using it in snippet.nf. In this case we've got one, three, five, and seven, and if you were to copy and paste this in, you'd expect those values to be emitted on separate lines. You can do some quite cool things with Channel.of as well. This next example uses a little bit of Groovy: the range 1..23 produces all the numbers between one and 23. If I copy this one across, what we see is the numbers one to 23 printed to the terminal, along with the X and Y. You could change this to a different range, and it would produce all the numbers between the two values either side of the dots. fromList is another channel factory you might come across; it allows you to use a list as the input. I'm just going to copy this across: the list is defined at the top, and the factory uses it a few lines below. I'll add an extra word to the list, click run, and this prints the words out as separate elements in my terminal. fromPath is another quite common channel factory, and it's a way of bringing in files. In this case, if you go into data and then the meta folder, you'll see these patients_1 and patients_2 files, and we're using a glob pattern here to try to pick them both up. Running that, hopefully we see both files listed; the .view on the end just prints them out one by one. So this is a way you could bring in a large number of files from a folder by using a glob pattern. Something we haven't spoken about to any great extent so far is that a lot of channel factories have options, and those options give you choices about how the factory behaves. For fromPath there are options such as glob, type, hidden, maxDepth, followLinks, relative, and checkIfExists. All of these can be used to help you choose which files to bring in as input: the file type, whether hidden files are allowed, or, if you're searching multiple directory levels, how many levels the factory is allowed to descend to find your files. These are all ways you can modify the behaviour of a channel factory. If you're trying to do something a little bit creative and it isn't doing quite what you want, I do encourage you to check the options available for the factory you're using, because there's often one that will give you the behaviour you're after.
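As a rough sketch of how those options plug in (the exact file names under data/meta are as I remember them from the training repo, so double-check the path yourself):

    // pick up files with a glob pattern, failing early if nothing matches
    // and skipping hidden files (which is also the default)
    Channel
        .fromPath('data/meta/patients_*', checkIfExists: true, hidden: false)
        .view()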
We've already used fromFilePairs; that's what we used as part of the RNA-seq proof-of-concept pipeline to bring in the ggal data, the paired _1 and _2 files, most likely paired-end reads. I do quite like this example, so very quickly we're going to add it in, because there's quite a cool exercise just below it which asks you to apply one of these options to this channel factory and then see the different behaviour. In this case it's asking you to compare flat: true and flat: false. flat is an option which, when true, means the matching files are produced as separate elements in the emitted tuples, rather than as a list like you would have seen yesterday. Just to show you what that looks like: let's start with flat: false, which is the default behaviour. When we run this, we get the same output as we've seen previously, with the two read files as a single element, a list, inside the channel output. However, if we change this to true and hit it again, they are now separate elements in the channel. So where these files were previously grouped into a list, they're now emitted as individual elements, and you can probably already imagine scenarios where one or the other would be more applicable and useful. The idea is that you have choices about how you create these channels, and as you'll find out later, you can also use operators to manipulate or reshape how these channels are structured after the fact. Okay, the final part of this large section on channels is fromSRA. This is a channel factory that allows you to query the NCBI SRA archive and return channels emitting the FASTQ files matching specific selection criteria. An example would be if you had a few of your own files and were trying to run them alongside some publicly available data; you might bring that data in using this channel factory. Unfortunately the Gitpod environment doesn't have the necessary credentials loaded, but there are instructions here so you can go away, set it all up for yourself, and implement it as well. Okay, that's the end of the channels section, and what we'll be doing now is moving on to processes. We've already used processes a lot, both in session one as part of the RNA-seq proof of concept and in hello.nf, so you might be quite familiar with some of these words already: directives, inputs, outputs, when (which is an optional clause), and the script block. These are the five main parts of a process, and they are always structured in the same way; while some of them might be missing, you should always expect to see them in the same order. So you have the process with its name, most commonly in uppercase letters as I mentioned in session one, followed by the directives. There are lots of different directives, used for things like containers, labels, and resource allocation, and they can be applied to each process independently. After that come the inputs, which always have a qualifier followed by the input variable name. Similarly, the outputs have qualifiers followed by the output names or definitions. After that comes when, where you can have optional conditional statements, followed by the script, shell, or exec block.
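Put together, the anatomy looks roughly like this; it's just a schematic skeleton of mine, not a copy of anything in the training repo.

    process EXAMPLE {
        // directives first: containers, labels, resources, and so on
        cpus 2
        tag 'example'

        input:
        val sample_id            // qualifier + variable name

        output:
        path "${sample_id}.txt"

        when:
        sample_id != null        // optional conditional clause

        script:
        """
        echo $sample_id > ${sample_id}.txt
        """
    }

    workflow {
        EXAMPLE(Channel.of('sampleA'))
    }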
Most typically the script will be included in triple quotes, as in that sketch. In its simplest form, a process only requires a script block, with the script just being a string that defines the command to be executed. That can be wrapped up into a process and called straight from the workflow; if it doesn't require an input, it can simply be executed without causing any extra fuss. Something we haven't mentioned, though, is that while the script block is interpreted as bash by default, you can actually change this. By adding a shebang to the top of the script block, you change how it is interpreted. Of course, you'd need Python installed on the system where this is executed, unless you're providing a container with a Python environment; there are lots of possible setups. But here, for example, the whole script is Python code, and the only real differences are the Python shebang at the top and that the process has been renamed, to something like pystuff. You can copy this across into snippet.nf, and since Python is installed in your Gitpod environment, you can see that it executes, which is cool. You could just as quickly change this to R with Rscript, or Perl, or whatever language you want: if you could normally run it in a Linux environment, you can substitute it in here by adding the shebang to your script block. There's quite a nice tip here as well, something I won't demonstrate because you did it in session one: if you have lots of different scripts, you can store them in a bin folder, which is automatically made available on the path, so instead of including all of the code inline you can make the script executable and just call it from the script block, keeping everything nice and readable. Script parameters can be defined dynamically using variable values. Here we've got params.data set to 'world', and down in the script block we use the variable params.data. Even though it's defined up at the top, outside the process and outside the workflow block, it is made available to the process, so when you run this it prints hello world. You can think of parameters as being a little bit special in that they cross boundaries: you don't need to worry as much about staging with parameters, and they are available within script blocks whenever you want them. Okay, the next little block is about bash syntax and how it can be a little bit tricky sometimes. Since Nextflow uses the same dollar-sign syntax as bash for variable substitution in strings, bash environment variables need to be escaped using the backslash character. The escaped version is resolved later, inside the task directory. In this case we're using PWD, which without the escape would otherwise show the directory you're running Nextflow from. Just to show you this behaviour, because it is something to be mindful of even though you probably won't need a lot of it in your own workflows: with the backslash, \$PWD gives the work directory; without it, we see slightly different behaviour, and we just get the training environment directory that we're actually launching from. So the backslash controls whether bash or Nextflow resolves the variable.
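Here's a small sketch of that escaping rule; the echo lines are mine rather than the training snippet, but the escaping behaviour is the point.

    params.greeting = 'world'

    process WHEREAMI {
        debug true

        script:
        """
        # \$PWD is escaped, so bash resolves it at runtime inside the task work directory
        echo "task directory: \$PWD"
        # no escape here: Nextflow interpolates params.greeting before bash ever runs
        echo "hello ${params.greeting}"
        """
    }

    workflow {
        WHEREAMI()
    }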
If you are doing a lot of this, though, for whatever reason, you might find it more beneficial to use a shell block. The shell block uses slightly different syntax: instead of using the dollar sign and escaping the bash variables, we now use an exclamation mark with curly brackets to include the Nextflow variables. You'll also notice that the block is declared with shell rather than script, and the string uses single quotes rather than double quotes. Again, just to show you this behaviour, I'll launch this with debug added so we can actually see the output, and we get Bonjour le monde: the bash variable X is Bonjour, and le monde is coming in from the parameters. If instead you just tried to include params.data with a dollar sign, like we've done previously in a script block, you'll see that it doesn't run particularly well and gives us an 'unbound variable' error, meaning bash doesn't know what it is. So the shell block kind of flips the script: bash variables are treated in the plain sense, and if you are going to include parameters or other Nextflow variables, you need to add the exclamation mark with the curly brackets. Again, this is an example I think is worth going back to and breaking a little bit, seeing what you can get away with and what doesn't work, because most of you won't need this in your pipelines, but it's worth knowing about, especially if you're escaping a lot of bash variables or trying to do something a little bit tricky with naming files, folders, or locations. Moving on: conditional scripts. With conditional scripts you can basically have if-else statements in the script section. As an example, we have params.compress set to 'gzip', so that parameter is set to gzip, and we have a file to compress, which is data we've used previously: the ggal transcriptome.fa file. In the process we have one input, the file, and then based on the parameter, if it's gzip it does one thing, if it's bzip2 it does something else, and otherwise it throws an error. Down in the workflow we just have the process taking that file-to-compress parameter, the transcriptome file, as its input, while the compression parameter is used directly in the script section, because parameters are available there. Again, just to show you this behaviour, I'm going to paste this in. By default it's been hard-coded to gzip, so it will run the gzip command. However, if we try to mix it up by setting a different value for the parameter on the command line, we'll just say chris, and run it again, it throws an error because that value isn't known, which is cool.
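A rough sketch of that pattern looks like this; the parameter names match what I described, but the process body is my own simplification of the training example.

    params.compress = 'gzip'
    params.file2compress = "$projectDir/data/ggal/transcriptome.fa"

    process COMPRESS {
        input:
        path file

        script:
        if (params.compress == 'gzip')
            """
            gzip -c $file > ${file}.gz
            """
        else if (params.compress == 'bzip2')
            """
            bzip2 -c $file > ${file}.bz2
            """
        else
            throw new IllegalArgumentException("Unknown compression tool: ${params.compress}")
    }

    workflow {
        COMPRESS(params.file2compress)
    }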
So that's a way, or another way, to add conditionals to your script block and run different scripts depending on what the inputs are. There might be scenarios where you want to do different things in different situations: the file is too big or too small, or it carries a particular label. There are lots of options available by using this as well. Okay, moving on to inputs. Of course, we've talked about inputs a little already. Each input has a qualifier as well as a name; the name is just an arbitrary name given to the input. Inputs most commonly have either val or path as the qualifier, but effectively it's always the qualifier and then the variable name. In this example, you'd expect one, two, and three all to be taken as elements passed into the process, which just echoes out 'process job' one, two, or three. I had the wrong script in there for a second, sorry, so let me get rid of that line; running it like this, it prints job one, two, and three. If you get the qualifier wrong, it will generally throw an error, because Nextflow is looking for the wrong type of data. In this example I've just changed val to path while still passing in plain values, and it's basically saying this isn't a valid path and throws an error. That's good as a little bit of checking. Sometimes you might get away with it accidentally, depending on whether something is treated as a path, a value, or a string, but you should be aware of what the qualifiers are, because the wrong type can cause you issues. Generally, path will always be for files, while val covers pretty much everything else; things like booleans, for example, are treated as values. This next example has a file input, and it shows some interesting behaviour, so I think it's a good one to show you quickly. Here we have the channel, with the data coming from ggal again, and we know there are multiple .fq files available. When we run this, we see sample.fastq printed multiple times. That's because we are not referring to the input file by a variable name; all we're doing is saying, for every element that comes through, print this fixed string. As it says here, the process is executed six times and prints the name sample.fastq, the name used in the input declaration, six times, despite the actual input file names being different in each execution: lung_1 and lung_2, and gut and liver 1 and 2 as well. However, if we give this a variable name, in this case sample, rather than the fixed name we had before, we'll see that the real file name is actually used. So the quoted version is effectively a fixed name: it has the single quote marks around it, it can't be changed, it is what it is, and it isn't a variable. The unquoted sample, on the other hand, can be used as a variable name. This is quite a subtle but important difference. Okay.
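Here's a minimal sketch of that difference; the process names and the echo commands are placeholders of mine.

    reads_ch = Channel.fromPath('data/ggal/*.fq')

    process FIXED_NAME {
        debug true

        input:
        path 'sample.fastq'      // every incoming file is staged as sample.fastq

        script:
        """
        echo sample.fastq
        """
    }

    process VARIABLE_NAME {
        debug true

        input:
        path sample              // staged under its real name, available as $sample

        script:
        """
        echo $sample
        """
    }

    workflow {
        // in DSL2 the same channel can feed both processes
        FIXED_NAME(reads_ch)
        VARIABLE_NAME(reads_ch)
    }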
So, moving on a little; we do need to move fairly quickly through some of this to get through everything in a reasonable amount of time. We've already seen examples of combining input channels and the effect of different channel types: with Channel.of(1, 2, 3) and Channel.of(1), because both are queue channels the process is only executed once, one task, whereas with a value channel it can be executed multiple times, as shown here as well. And of course all of this also applies to files, not just values. In this example we've got two different sets of files: the reads are brought in through the fromPath channel factory, so they end up as a queue channel, and the transcriptome comes in as a parameter, which behaves like a value channel. When we combine them in the process, the transcriptome is applied multiple times, while each element of the reads channel is consumed only once. Just to show you what that looks like: we've got liver_1, lung_1, and gut_1, each used once, and transcriptome.fa alongside each of them, so the transcriptome file has been reused multiple times while each read file has only been used once. You can start to play around with this, and it's something you might want to try yourself: create the transcriptome channel with Channel.of and .set, give it a slightly longer name than it probably needs, and supply that to the process instead. Because I've changed it into a queue channel and added that to the process down here, it's only been executed once. These are the kinds of things you can do to find the limits, to find out how these things fit together, and to see how changing the channel types actually affects the output, so I really do encourage you to play around with examples like this. Okay, input repeaters. This is another, more interesting way you can use Nextflow: the each qualifier. I've talked about val and path; each is another qualifier, although it isn't used as much. It allows you to repeat the execution of a process for each item in a collection, every time new data is received. So, for example, we have sequences coming in from a fromPath channel, and then we have two methods, regular and espresso. With the each qualifier, the execution is repeated for each item in that collection, and in the output you can see that for each gut file we have regular and espresso, for each lung file we have regular and espresso, and the same for the liver files: everything has been done twice, once with each of the two modes. You can imagine, if you're trying to apply different parameters or labels in different ways, this can be quite a nice way to set up a series of executions. There's an exercise here asking you to add an additional coffee type: you can add it into the methods list and see what the outputs look like, though be careful not to add too many, or this will become quite large quite quickly.
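Something like this is what I have in mind; the file glob and the process body are my own stand-ins for the training example.

    sequences = Channel.fromPath('data/ggal/*_1.fq')
    methods = ['regular', 'espresso']

    process COFFEE_ALIGN {
        debug true

        input:
        path seq
        each mode            // repeat the task for every item in this list

        script:
        """
        echo "aligning $seq with $mode coffee"
        """
    }

    workflow {
        COFFEE_ALIGN(sequences, methods)
    }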
Okay, so outputs. Outputs are very similar to inputs in that you have the qualifiers with a name, but there is one addition here: emit. You can add a comma, followed by emit and a colon, and give the output a name; we'll see very shortly how that can be applied. Again, we've got quite a simple example: a greeting, 'Hello world', which is passed into the process as a val input and comes back out as a val output. The main points are that the output block goes beneath the input block and above the script block, and that you need to make sure the qualifiers match the data type. When the output is a file, as in the example down here under 6.3.2, the path qualifier specifies the file that is going to be produced, and when it's a fixed, quoted name, that is the exact file name expected: the file has to be named exactly that or it won't be collected as an output. This is actually quite a neat example: the script takes a random number, and notice the backslash here to escape the bash variable, and writes it into results.txt, which is declared as the output path. When we execute this, we see a path to that file in our output, and if we cat it, we can see that a random number has been written to the file. The next bit is an extension of what I've just shown you: multiple output files. Much like we saw with the hello.nf example in session one, when a process produces multiple files you can use a glob pattern in the path output to pick them all up. You can imagine that if you're trying to collect all the BAM files, or all the index files, produced by a particular tool, you could use a glob pattern like this to capture all of them as a single output channel. The example here also applies an operator, and we haven't really spoken about operators in enough detail at this point to dig into it, so this is potentially an exercise you might want to come back to later.
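Pulling those two ideas together, a sketch along these lines shows a fixed-name output and a glob output side by side; the commands themselves are placeholders of mine.

    process RANDOM_NUM {
        output:
        path 'results.txt'            // must be produced with exactly this name

        script:
        """
        echo \$RANDOM > results.txt
        """
    }

    process SPLIT_FILE {
        input:
        path infile

        output:
        path 'chunk_*'                // the glob picks up every chunk produced

        script:
        """
        split -b 100 $infile chunk_
        """
    }

    workflow {
        RANDOM_NUM().view()
        // SPLIT_FILE would need a file channel, e.g. SPLIT_FILE(Channel.fromPath('data/ggal/*.fa'))
    }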
This next section, dynamic output file names, is another important example that I think is worth highlighting. In this snippet we've got a few different things going on, so I'll copy and paste it across. We've got species: cat, dog, and sloth, along with some sequences. We've got a fromList channel factory here, so this will be a queue channel, and it's being set as species_ch. Then we have this align process, which takes two different inputs, val x and val seq; these are both variable names, and you can see they've got the dollar signs down in the script, so you know they're variables. The output is named dynamically using x, the same variable used in the input, and we can do this because it's written with the dollar sign and curly brackets inside double quotes, which means we can put it right up against a suffix, the file extension, and use it both in the output declaration and down in the script block, so we get these dynamic names produced. So here I'm just going to save this again and run it, and what we'll see are the outputs from this particular process: sloth.aln, cat.aln, and dog.aln. So the value inputs, which in this case are the species, have been used to dynamically name the file outputs. Again, you can get really creative with how you do this: you could have a couple of these in here if you had other names or other file types, and you can really mix and match, so you have lots of different options. This is another good one for you to come back and play with: work out what you can get away with, what breaks, why it breaks, take your time to think about it, and keep improving your understanding. Great. Next is something we've alluded to a little already: combinations of inputs and outputs. In reality you can have a sample ID used as a value in a tuple for both the input and the output. A sample ID is something you'd commonly want to carry through every process with you, so you can keep track of what the files are even as they continuously change name and type, and you might have different pieces of metadata carried along in there as well. Keeping the files paired to something like a sample ID can be really helpful. Something you can see here as well is that even though all of these output files are called sample.bam, which is fine because each task is executed in its own isolated task directory, so it doesn't matter that they're called the same thing, you still want to make sure they're labelled properly using something like an ID. So again, there's an exercise here asking you to modify the script so that the output file name is based on the sample ID, and you can see that the output has been updated to use the sample ID; the same sample_id variable is now used in three separate places, which is pretty cool.
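That pattern looks roughly like this; the channel contents and the "aligner" command are only placeholders of mine for whatever the real exercise runs.

    // assumes these fastq files exist next to the script; they are placeholders
    reads_ch = Channel.of(
        ['sample1', 'sample1.fastq'],
        ['sample2', 'sample2.fastq']
    )

    process ALIGN {
        input:
        tuple val(sample_id), path(reads)

        output:
        tuple val(sample_id), path("${sample_id}.bam")   // same ID carried through, dynamic file name

        script:
        """
        cat $reads > ${sample_id}.bam
        """
    }

    workflow {
        ALIGN(reads_ch).view()
    }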
Okay, so output definitions. You can also explicitly refer to a process's output channel using the out attribute, which is something that hasn't been shown to you in much detail just yet. Down in the workflow we've got the process name followed by .out.view(), which takes the output of the process and shows you what it looks like, so I'm just going to paste all of this in again to demonstrate. Here, for example, the process has two different outputs, so we may need to index into out to choose which of the two we're using. In this case we're going for one, which is the second position, because the indexing starts at zero: with index one we get the .bai files, and changing it back to zero means we pick up the BAM files. So these are positional, zero and one, BAM and BAI. The alternative, and the way that is more commonly used, is to add an emit, which is basically how you can name these outputs. Here, for example, we're going to add bam and bai to the two output declarations, and now instead of having to use square brackets with an index, we can use .out.bam or .out.bai when we're trying to specify the exact output channel from the process. This is especially important when you have multiple outputs. I could also just create a whole new output here as well; it doesn't have to have the same structure, there's a lot of flexibility in how this can look, and if you're like me you might want to line the emit statements up neatly just because you can, and then add the new one into the workflow as well. Again, you can keep playing with these examples to find out what does and doesn't work, try to work out why, and think about how this might relate to your own data; there are a lot of cool things you can do here, and sometimes you just have to play around enough to find them.
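As a quick sketch of named outputs (the file names and touch commands are just placeholders):

    process FOO {
        output:
        path 'sample.bam', emit: bam
        path 'sample.bai', emit: bai

        script:
        """
        touch sample.bam sample.bai
        """
    }

    workflow {
        FOO()
        FOO.out.bam.view { "bam: $it" }
        FOO.out.bai.view { "bai: $it" }
        // positional access would be FOO.out[0] and FOO.out[1]
    }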
Okay, so when. The when declaration allows you to define a condition that must be verified in order for the process to execute. Here, for example, we've basically got a test that looks at the file and makes sure it's the right type. You'll find these are quite common in nf-core pipelines, for example, just checking a few things to make sure the input is what you think it is. Ultimately this might not be something you need on every process, and a lot of people don't add it unless they're really checking for things, though of course it can be good practice to make sure you are executing what you think you're executing. Cool. Finally for this section, we have directives. Directives are optional settings that affect the execution of a process. Here we have an example with cpus 2, memory 1.8 gigabytes, and a container. You don't need to add these, but they can be really helpful on a process, especially things like labels, because you can then flexibly apply configuration options to specific sets of processes; you can do all of this using directives. We've seen cpus before, and containers yesterday as part of session one. A little bit about resource allocation, though: the big ones here are cpus, time, memory, and disk, and there are some instructions about how you can actually write these and which units you can use. So here, for example, we've got another snippet with cpus 2, memory 1.8 gigabytes, time 1 hour, and 10 gigabytes of disk. One of the directives you'll probably use most often is publishDir. This is where the results will actually be published: as you might remember, these processes are executed inside the work directories, and you need to tell Nextflow to publish the results somewhere if you want to keep a copy outside of those work directories. In this case we've got publishDir set to results, the folder name, and then the files to pick up defined using a glob pattern. So, for example, the BAM file outputs here get collected and put into the results folder; just to show you what this looks like, we have the results folder with these three BAM files that have been picked up by the pattern. You can change this as well, or even add multiple publishDir directives depending on where you want to store things. I'm just going to split the bam and bai files into two different directories, line those up, and run it again, and now we've got the two different folders, each collecting a different file type. So you can be really clever about how you do this: you can add lots of different file and folder patterns to fine-tune where things are stored, and you can get really creative, as I said, using variables from the inputs and outputs to name these directories and put results in different places as well. You have lots of options, and it can actually be quite a bit of fun coming up with creative ways to organise them.
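Here's roughly what that multiple-publishDir pattern looks like; the directory names and the touch command are my own placeholders.

    process ALIGN {
        publishDir 'results/bam', mode: 'copy', pattern: '*.bam'
        publishDir 'results/bai', mode: 'copy', pattern: '*.bai'

        output:
        path '*.bam'
        path '*.bai'

        script:
        """
        touch sample.bam sample.bai
        """
    }

    workflow {
        ALIGN()
    }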
So that's everything for processes; what we'll do now is move on to operators. Operators are methods that allow you to manipulate channels, and every operator, with the exception of set and subscribe, produces one or more new channels, allowing you to chain them to basically fit your needs. When I say chain, I mean you can keep stacking operators on top of each other, as we've already seen to a degree. There are seven main groups, even though the last group, other, is really just mopping up everything else: there are operators for filtering, transforming, splitting, combining, forking, maths, and, as I said, other, the catch-all. What we'll do is work through a basic example first, and then some of the commonly used operators, just to give you an idea of what's out there and how they work. Map is a really common operator, and we'll come back to it under the commonly used operators in a few minutes. It allows you to apply a function of your choosing to every item emitted by a channel. In this example we have a channel of one, two, three, and four, called nums, and we use the map operator to take each of these numbers and square it, multiplying it by itself. Each number is emitted as a single item, that item gets multiplied by itself, and we've got this little arrow in the closure saying: this is what's coming in, and this is what I want you to do with it. Finally, we just add the view operator to actually see the result. There's another schematic representation of this: the numbers one, two, three, and four come in, and out comes one times one is one, two times two is four, three times three is nine, and four times four is sixteen. As I said earlier, operators can also be chained; in reality this is already a chain, with the map and view operators one after another. Starting off with the commonly used operators, we have been using view pretty much since the first minute of this workshop. The view operator prints the items emitted by a channel to the console standard output, appending a new line character to each item, which is why we see them separated across multiple lines. Here we have Channel.of with .view: the channel factory is .of, the operator is .view, and we print the items to screen, in this case foo, bar, and baz on three separate lines. An optional closure can be used here as well. You'll notice that we have round brackets in one case and curly brackets in the other: when you use the curly brackets it's treated as a closure, which means you can add some customisation to the output, which you can't with round brackets. Just showing you this as an example, every item gets prepended with a little dash at the start, and you can see that if you tried to use round brackets for this it throws an error; you can also take it back to the simple version again. You can mess with this and be really creative: it's quite a nice way of reporting, or of annotating what you're viewing, especially while you're developing and debugging. The map operator is another one you'll most likely use almost every day if you're developing with Nextflow every day. Map applies a function of your choosing to every item emitted by a channel and returns the resulting items as a new channel. In this example we've got a channel of 'hello' and 'world', and each item gets passed through a little bit of Groovy: we're applying a function to each item, which in this case reverses the strings for us. Again, I'm just going to show you that: for hello and world, it prints the reverse of each to the screen, and of course you could just remove the map and it would go back to producing hello and world. You can also get a little bit creative and do more interesting, dynamic things: you can associate a generic tuple to each element and decide what you want to add to it. In this example we take the same two words, hello and world, and use map to create a tuple where the first position is the string itself and the second is the size of the string. And in reality you can keep adding things in here: you don't have to stop at size, you can keep adding as much as you want, and you can go a bit crazy and do strange things like that, but the point is you have options and a lot of flexibility in how you use map. Map is sometimes referred to as the Swiss army knife of Nextflow, and closures, particularly with map, are how a little bit of Groovy lets you do all of these interesting things.
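A small sketch of those two map examples together (the strings themselves don't matter):

    workflow {
        words = Channel.of('hello', 'world')

        // reverse each string
        words.map { it.reverse() }.view()

        // build a tuple of the word and its length
        words.map { word -> [word, word.size()] }.view()
    }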
Okay, so here's just an example where you can use the .name method to get the file name and print that out when you're using the ggal fastq files as an input. Again, this might be quite a nice exercise to go back and try when you've got a little bit more time, because I think it's quite a cool one, and of course you can get creative and do different things with dynamic naming, or whatever takes your fancy. Okay, so the mix operator helps you combine items emitted by two or more channels. In this example you have one channel, with .mix, and another, so that everything is combined into one streamed output. As an example, let's add that in here, and I'll change it to set a mixed channel and then view it: we're mixing channels one, two, and three, all three of those are mixed together, and I've then used set to make this into a brand new channel. So all of those items are now combined as a single channel, and we could feed all of this into one process as a single input. Flatten is almost the opposite. In this case we have two lists, which I'll show you with and without the flatten operator. Looking at them by themselves, we've got these two inputs, one-two-three and four-five-six; without flatten you see both of them separately as lists, but if we add flatten back in, fix my typo, and run it again, you'll see quite a different output: the items in these lists have been flattened out into one stream of output. As we did with the last one, we could make this into a brand new channel with .set, something like flattened_ch. I find it quite nice to look at outputs like this sometimes, just so you can see that you could in fact set this as a new channel and feed it into a pipeline separately, with every one of those items treated as a different element, flattened out and merged together into that single channel. The opposite of this is collect: if you had a series of elements and wanted to collect them into a single list for whatever reason, you could use the collect operator, which does literally the opposite of what we've just done and merges the items back up into a list. Things can get quite complicated when you have lots of different tuples, especially if you're trying to use things like matching keys. Thinking of a real-life example: if you had a sample and you wanted to split it out into all the different chromosomes, process all of those individually, and then merge them back together based on the sample ID, you need to think about how that would work with tuples. Here we have a channel of [1, A], [1, B], [2, C], [3, B], [1, C], [2, A], and [3, D], and then we can group these with groupTuple: it basically groups them based on the first part of each tuple, the key, and everything else is merged into a list after that. Again, this could be an example where you're collecting your fastq files or your BAM files back up after they've been processed individually, and merging them into one group per sample; there are lots of different scenarios where you might want to do this, and this is quite a nice example. In this exercise, which is something you might want to try yourself, we use fromPath to create a channel emitting all of the files in the data/meta folder, and you can see the solution here.
The map operator is then used to associate each file with its baseName, so this is a little bit of Groovy coming in again, and each element becomes the base name plus the file itself. In this case the solution uses tuple with round brackets, though we could also have used square brackets. Just to show you what I'm actually talking about, we can run this, and it takes the base name from these files, groups them, and gives us back the structure here. I could also have tried something like file.baseName directly to create my own key, and something like that might have worked, but as it's informing me, the way I've typed it isn't going to work. So you can get a little bit creative here, but you do need to be careful that you're not just adding in too much extra admin and creating issues for yourself, like I am right now. Anyway, you can get creative about how you want to do this as well. Similarly, much like groupTuple, there's also join. Join works in a similar way but with one distinct difference: the join operator creates a channel that joins together the items emitted by two channels that share a matching key, and by default the key is the first element in each item emitted. You can see here we've got X, Y, Z, and P in one channel, and Z, Y, and X in the other; because P didn't have anything to match with, it's effectively missing from the output, since there was nothing for it to join on. It has to join, so it's not quite as flexible as groupTuple. Branch is quite a cool operator as well. Branch allows you to forward the items emitted by a source channel to one or more other channels, based on some sort of test or logic. In this case we've got a channel of a series of numbers, and we're branching them into big and small numbers based on whether they are bigger or smaller than 10, and then we print those out just to show you what this looks like. One of the questions we quite often get is: what happens if you have the number 10 itself? I'll let you make a quick prediction before I run that again... it's completely ignored, because it's neither bigger nor smaller than 10, so it doesn't match either branch. If you write the test as less than or equal to, rather than strictly less than, it does get picked up, so you just have to be careful about how you set up these tests. And there you go, there's already a note about it in the material; I've clearly done this one before.
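The branch example looks something like this; the numbers are arbitrary, and I've written the small test with less-than-or-equal so that a value of exactly 10 isn't dropped.

    workflow {
        Channel.of(1, 2, 10, 30, 40, 50)
            .branch {
                small: it <= 10      // items go to the first branch whose test they pass
                large: it > 10
            }
            .set { result }

        result.small.view { "$it is small" }
        result.large.view { "$it is large" }
    }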
Okay, so that's really just a very brief introduction to a number of different operators. As I've been preaching throughout, I do think it's worthwhile going back and playing around with these in your own time. What I wanted to show you here is that there are lots of different types of operators: do go back to the listing at the top of this section, dig into some of them, and think about what they do and how you might want to apply them. You might not know that you need one until you're in some weird situation where you think, okay, I need to do this crazy manipulation, and then you'll be glad these exist, so do go and check them out. There is also a series of operators for dealing with text. In this case we have a file of random text, coming from the random.txt file, and the splitText operator, which allows you to split multi-line strings into chunks containing a given number of lines, or however else you want to define it. Here, for example, we could split this text file out into separate lines to be fed through a different process, if that's what we're trying to do. You can also choose how the operator is applied, so you can modify these operators as well: using by as part of the operator groups the text by a number of lines, so where we previously saw single lines, adding by: 2 groups them into chunks of two lines, which is quite neat as well. You can imagine situations where you've got something like a sample sheet, or some other input you want to validate or do something with, and you want to read in the text file; that's where you might think about using an operator like splitText. You can also see here that you can put a closure on afterwards to modify the result: in this case we've got a closure that takes each item and converts it to uppercase, so this is again where you have that intermixing of a little bit of Groovy and a little bit of Nextflow working together to do cool things. The next operator is splitCsv, for situations where you might have a CSV file. Over here in the data folder we've got, for example, the patients_1 file, a CSV with comma-separated values, and I'll point out as we go past that it does have column names; that will be relevant soon. When you split this out you can first of all use indexing, so you can say row[0] and row[3] and choose what the output should look like: we're still splitting the CSV, but we're only pulling out two columns, in this case the very first column and then the fourth, indexed by zero and three, for each of the lines. row, in this case, is just the name we've given to each of the elements created by splitCsv. The reason I pointed out the header before is that when the CSV begins with a header you can specify header: true, which allows you to reference each value by its column name. So instead of using the indexing we've just done, we can set header: true and refer to the values using those header names, and this gives us the same output, which is a really nice feature: if you do have a labelled sample sheet, say one you might be putting onto an Illumina sequencer or another platform already, you can refer to the columns straight away rather than having to work out what the indexes are, which makes life a little easier and leaves less room for error as well. And if there isn't a header already, you can provide a custom one by specifying a list of strings in the header parameter, as shown in this example, and then use those names when you're actually trying to view the values. You can also process multiple CSVs at the same time: we just change the path to a glob pattern, and they all get picked up in the same way, so you can handle them all at once. You can imagine a situation where a lot of different files have been generated, or you're trying to work through a series of sample sheets and do something with them, and this is an option that might be relevant to you.
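A sketch of the header-based version might look like this; the file path and the column names are only illustrative, so match them to the real sample sheet.

    workflow {
        Channel
            .fromPath('data/meta/patients_1.csv', checkIfExists: true)
            .splitCsv(header: true)
            // hypothetical column names; with header: true each row is a map keyed by column
            .view { row -> "${row.patient_id} - ${row.num_samples}" }
    }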
Again, you can imagine a situation where you've got a lot of different files that have been created, or you're trying to go through a series of sample sheets and do something with them — this is an option that might be relevant to you.

This next one is quite a comprehensive example, one that I won't do live, but I think it's a cool exercise if you want to challenge yourself. The exercise here is to create a CSV file that can be used as an input to script7.nf: what would you include in the sample sheet, and how would you apply it to actually bring the samples into the proof-of-concept RNA-seq pipeline in script 7? The example given here is just one possible solution — you might have another one. Again, this is an opportunity to explore and work out what you're trying to do and the limits you can work within.

It's worth noting here as well that, while we have been looking at CSVs, you can also use TSV files. They're handled in pretty much exactly the same way; you just need to specify the different separator as an option for the operator. It's quite straightforward, and there's a nice exercise and explanation with an example there.

Finally we have splitJson, so you can parse the JSON file format using the splitJson channel operator. Again, you might not need to do this, but it is available to you, so you don't need to go away and write your own parsing code — there's an operator here to help you with that straight away. And there's a nice, long example here of how this might look with JSON files as the different inputs and outputs, which is quite nice.

Okay, so after this we're going to be moving on to modularization. Modularization is actually one of my favourite sections in the training, because I think it's a really important part of pipeline and workflow development: we stop writing these big monolithic scripts and start to compartmentalize our code, and think about the best way to make things readable, accessible and — as part of Nextflow — reusable in our pipelines. Modules, or modularization, really kicked off in Nextflow with the introduction of DSL2, which allowed for the definition of standalone module scripts that can be shared across workflows. As part of this, you can use the include statement, which allows you to effectively bring in some code from another file and include it as if it were part of your main Nextflow script.

What we'll be doing is using this hello.nf example and modularizing it in a way that allows us to move the processes outside of the hello.nf script. Over here I have my hello.nf; this is exactly the same as what you used yesterday as part of session 1, with nextflow run hello.nf. If you have carried on with the same environment from session 1, you might need to take some of this with a grain of salt — I'd actually recommend starting a fresh Gitpod environment so that you don't have any weird artifacts left over from anything you might have done in the previous session. So with, in this case, a clean, starting-from-scratch hello.nf example, it is still a working pipeline that takes a string, splits it out into separate files, and then converts them into uppercase letters. What I'm going to do, though, is move these processes out of this file — this is the modularization I have been talking about. For that, I'm going to create a new file.
Let me just check whether we've actually called it anything specific in the material — modules.nf. So here I'm going to create it: just go up here, create a new file and call it modules.nf; or equally I can just run code modules.nf and open it up in a new tab here. Make sure you do save it. What I'm now going to do is take both of these processes completely, put them in here, and then delete them from hello.nf. If I was to try and run this right now, it is going to fail, because the processes are not included — even though they are in this file sitting just next door. To fix this, what I need to do is add in these includes: include split letters from './modules.nf'. This is a relative path, so if you have created a new file it needs to be right next door in the same directory. I'm now going to import both of these processes — the named processes, convert to upper and split letters — and make them available. So now when I run it like this, it works just the same, which is great: I've managed to modularize it, and my main workflow is now much smaller. That's great because, if you can imagine having 30 processes in here, the last thing I want is an extra 15 to 30 lines per process at the top of this single file. Now they can all be externalized across multiple files just like this, and I'm a lot happier as a developer.

So that's effectively what I've just done there with those same files. Now, because I have both of these processes in the same modules.nf file, I can actually shorten this, which means I can remove an entire line: because both of these are in the same file, I don't need to have them as two separate include lines — I can list multiple names in the one include, separating them with a semicolon — and again, it still works. I don't think this next bit is in the training material, but just to show you what it would look like if you had multiple module files — oh gosh, I've completely forgotten what I called these already — there we go; I like to have things nice and square like that; let me close that again. So what I've done here is effectively just move the processes out into two files. I didn't have to do this; I just wanted to show you that you could have include statements one after another, going up and down here as well, if that's what you wanted to do. I've now just added everything back there and reverted the changes I made.

The next thing that can be quite useful, and also quite important, is aliases. You can include a module multiple times, but only if you use an alias. For example, if I'm trying to include both of these processes — split letters and convert to upper — twice, and I try to include them without aliases, we're going to have some trouble. Just to show you what the bad situation looks like: I'm trying to execute both of these like this, the same processes executed multiple times within the same workflow block, and you can see here that I can't, because the name has already been used. This is partly because of how the processes executed by Nextflow are named. To work around this, you can use aliases: you include the same module, the same process, multiple times, but give each inclusion a separate name. So here, for example, I'm just going to run this again, and what we should see is that this runs, and the tasks are actually known by their aliased process names: split letters one, convert to upper one, split letters two and convert to upper two.
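As a rough sketch of what those aliased includes look like — assuming the processes in modules.nf are called SPLITLETTERS and CONVERTTOUPPER, as in the training's hello.nf:

```
// Sketch: include the same processes twice from modules.nf under different aliases.
include { SPLITLETTERS as SPLITLETTERS_one; SPLITLETTERS as SPLITLETTERS_two } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_one; CONVERTTOUPPER as CONVERTTOUPPER_two } from './modules.nf'

workflow {
    greeting_ch = Channel.of(params.greeting)

    // first copy of the little pipeline
    letters_one = SPLITLETTERS_one(greeting_ch)
    upper_one   = CONVERTTOUPPER_one(letters_one.flatten())
    upper_one.view { it.trim() }

    // second copy, reusing exactly the same module code under new names
    letters_two = SPLITLETTERS_two(greeting_ch)
    upper_two   = CONVERTTOUPPER_two(letters_two.flatten())
    upper_two.view { it.trim() }
}
```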
So aliases allow you to include a module or a process multiple times, just by giving each one a different name to be known by.

The next thing here is output definitions. Nextflow allows alternative output definitions within a workflow to simplify your code. Here you can see that we have effectively the same script, but on this line, instead of creating a channel from params.greeting and setting a named channel — greeting_ch, with an underscore — we can just include the process call like this, and then, instead of creating or setting a new named channel for its outputs, we can use .out to take the output from this process and feed it straight into the next process, without using all of those equals signs or set calls. Of course, as already shown, you can use indexing — 0, 1, 2, depending on which output it is — and if you've used emit, then you can refer to outputs by name as well. Jumping down here to this example — this is of course going back to a single file, so we're no longer using modules.nf — what we've done is add in this emit: the standard output emit is going to be called upper, and you can see that we've referred to the output as upper down here. Just to show you that this is a working pipeline, you can see that it does run like this. We could also just use .out in this instance, because we only have one output from this process — it doesn't necessarily have to be named when you have a single output.

Something that is less common — I don't do this much myself, though I know others like it a lot — is that you can pipe. When you have these single outputs, instead of applying .out you can just put in this pipe to feed them from one step to another. Some people find it easier because it's more intuitive, but you'll see here how you end up with this blend of a channel factory, a process, an operator, another process, another operator, all used quite interchangeably — this can take a little more time and practice to get your head around, and like I said, it's not something I do very often.

So that was a bit of a quick whirlwind of modularization. The thing I wanted to point out again is that you can really externalize these modules, these processes, and also give them aliases if you need to because you're using the same process multiple times. You shouldn't need to completely duplicate an entire process just to use it twice — just give it an alias and save yourself a lot of time and energy.

Something that will be new are these workflow definitions. The workflow scope allows the definition of components that define the invocation of one or more processes or operators. Here we have the entire workflow wrapped up in this named workflow definition, and that is getting called down here in this second workflow block. The biggest difference is that the workflow definition up here has a name, whereas the one down here doesn't — it's just invoking the workflow that has been given a name above. There's a slightly different way of writing this: workflows have take rather than inputs. In this case it is taking greeting, and you can see it's basically fed into the main block, much like you'd expect to see in a workflow, in the same way as previously. So this greeting, which can be thought of as a channel, is fed into split letters; the output of split letters is then used by convert to upper; and convert to upper's output is viewed with .view.
If you're trying to run the script, just make sure that you do have these includes there, and that the modules.nf is still in working order — my modules.nf still looks okay there. However, this doesn't work, because I need to actually supply an entry, which is something I haven't spoken about yet. When you try to run a workflow like this, you need to define an entry point, which in this case needs to be the name of the workflow you're trying to execute — so what I would do is -entry my_workflow and jump in here like this... except I don't have this down here. Apologies — I don't think we talk about entry just yet; sorry, I got ahead of myself there and wasn't paying attention. So we have this named workflow block here, but I actually needed to add in a second workflow down here that actually calls my workflow. We've also got a Groovy issue here: my workflow taking a channel of params.greeting looks fine, but it's because this isn't added into that block — that's better. Just to clarify what was going on there: basically this wasn't the complete code block. I also needed to add the workflow down here to actually call the named workflow, and also add in what is being taken — what is being used as the input. The other thing I hadn't updated is that, because I used this .upper, I had never updated my modules over here to include emit: upper, which meant that it failed. So I just have to keep an eye on the rest of this block as well, in case I need to add that in too.

Okay, so workflow outputs. Much like inputs, a workflow can declare one or more output channels using the emit statement. Here this is the hello.nf workflow again: we've got the two processes, we've got the workflow blocks, we've got the take, which is greeting, and also the emit, which is the output of convert to upper — this will be given as the main output for the workflow. So this is more or less identical to the workflow from before — I'm no longer using modules.nf — but we now have this emit, which is effectively what allows us to use .out.view: the .out is coming from my workflow, from the workflow where the emit is the output, and .view of course is going to view this as well. Again, this is a convoluted example, but you can imagine that if you had a series of different steps in here as part of main, you could start to add and step these up as well. You can see here as well that we've got named outputs for the emit — we're just going to call it my_data — and you can see down here we've got my_data, and here again; so what I'm going to do is just add in my_data.view. Much like we added emit to the output definition of a process previously, here we've done it using emit for the workflow definition, and we can then call that down here in the actual workflow block as well. Again, this does seem like there is more to learn and it's more complicated, but what I'd really encourage you to think about is how this will help across an entire workflow if you have everything organized like this. When your pipeline isn't as simple and clear as this example — when you have lots of different tools doing lots of different things, with lots of really long scripts — it's nice to have all of this laid out, because it means it's much more reusable. You will thank yourself if you have your code compartmentalized.
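Pulling those pieces together, a minimal sketch of a named workflow might look like this — again assuming the two module processes from the hello.nf example:

```
include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'

// A named workflow with explicit inputs (take), a body (main) and named outputs (emit)
workflow my_workflow {
    take:
    greeting                      // input channel supplied by whoever calls this workflow

    main:
    letters_ch = SPLITLETTERS(greeting)
    upper_ch   = CONVERTTOUPPER(letters_ch.flatten())

    emit:
    my_data = upper_ch            // available to the caller as my_workflow.out.my_data
}

// The unnamed entry workflow that actually invokes it
workflow {
    my_workflow(Channel.of(params.greeting))
    my_workflow.out.my_data.view { it.trim() }
}
```

If there were several named workflows and no unnamed one, you would pick which one to run with an entry point, e.g. nextflow run main.nf -entry my_workflow — which is what comes up next.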
So this is the little bit of code that I got a little bit confused about before, about calling named workflows. When you have multiple named workflows that you're then calling from your main workflow, you can pick an entry point by using the -entry flag with the workflow name. As an example of this, I'm just going to paste this in here: we've got split letters one and two, convert to upper one and two, as well as these two named workflows, which call all the ones and all the twos separately, and then down here in the main workflow we have workflow one and workflow two. When you use -entry you can choose which of these workflows you would like to run. As an example of that, I'm just going to change this back to hello and run it again: what this will do is give us all the split letters ones, because I used workflow one as the entry, but if I change this to two, you'll see that all the twos have been executed, because I used entry point two. So you have a little bit of choice about which of these you want to use in your main.nf file, and you can execute them selectively using an entry point.

Okay, so that covers the standard modularization, and we're going to move on to configuration. The next few sections get a little bit hypothetical; I'll try to keep things moving quite quickly, and I'm going to use as many examples and demos as I can, although some of this is a little bit tricky to demonstrate inside the Gitpod environment. What I want to talk about now in more detail is configuration. Configuration in Nextflow is decoupled, which basically means that the configuration files and the settings themselves can be spread out across a number of different places, and how these work together is all based on an order of priority. There are actually several places you can add configuration options to your pipeline, and in terms of the order of priority: parameters specified on the command line using a parameter flag — much like the greeting parameter that we've been using with hello.nf — are top priority. What that effectively means is that whenever you use --greeting 'Hello world', it will be applied on top of that parameter being supplied anywhere else; it's at the top of the priority list, so it's what will be applied. Underneath that, we've got the parameters file option: this is something you can add at the time of execution, where you include a JSON file with a series of parameters as part of it. Under that, we have a config file provided with the -c option at the time of execution; this is a configuration file that can have a series of different scopes — scopes are used to collect a bunch of settings under the same banner, which is probably how I'd describe it, and we will talk about scopes very shortly. Under that, we've got two more config files: the nextflow.config in the current directory — the current directory being where you're launching from, so if you've got a nextflow.config in there, all the settings inside it will be applied — and the nextflow.config inside your pipeline project directory, whether you're calling the pipeline locally or from GitHub; if there's a config file there, it will also be applied. Underneath that, we've also got a config file hidden away in your home directory.
This is a really good place to store things that might be specific to you as a user — it could be your username and email, things like that, as an example. And finally, right at the bottom of the priority list — the thing that will be overridden by every other configuration level — is anything that's been hard-coded into main.nf. This is partly why we have Hello World in the main.nf, or the hello.nf file in the example we've been using; if we add anything on the command line, so we override it using the parameter flag, it will take priority over that. So this is effectively what is meant by decoupled: all these files are decoupled from each other, they can be supplied at the same time, and depending on where the settings are being supplied there's an order of priority, with the ones at the top taking precedence.

Digging a bit more into parameters: parameters are pipeline-specific settings, and they can be defined in a number of different places — I think they're all listed here. You can of course include parameters on the command line, using this dash dash, hyphen hyphen. This is the biggest difference between parameters and options: parameters take two dashes, whereas Nextflow's own options take a single dash. Parameters can also be stored in a parameters file — a JSON file in this scenario, but it can also be a YAML file — and you can supply that as a file at the time of execution. Just to demonstrate this, I'm going to revert my hello.nf right back to the original, so this is just the hello.nf that came out of the box in this environment. What I'm going to do is create params.json, load that in there and save it; then I'll just go back here, and you'll see we have this params-file option, which has been supplied on the command line, and I'm just going to run this again. Inside this params.json you'll see that we've got this greeting parameter — the same parameter we've been using the last couple of days — and that it has been supplied. I also wanted to show you that if I was to add the greeting flag on top of this again, it would take precedence over the params.json, because it's at a higher order of priority. So these can be stacked, and it's only the one at the top that will be used by Nextflow.

Moving on a little bit to configuration files: configuration files are for more than just parameters. As I said, there are a number of different places they can be included — the nextflow.config in your current directory or the workflow project directory, the home config file — and one can also be added on the command line using -c. So if you were adding a custom config at the time of execution, you'd just add it like this: -c with the custom.config, or whatever you've called it, supplied at the same time. The syntax for a config file is a little bit different: it can just be a simple text file, and you just add the name of what you're trying to configure along with its value. This can be quite simple — something like a propertyOne equal to world, and another property that is hello plus propertyOne. These expressions can be stacked on top of each other, and you can use the curly-bracket syntax if you need to isolate a variable inside a string. Much like the rest of Nextflow, you can use comments; I won't dig into those again.
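A sketch of what such a simple config file might contain — the property names are just placeholders:

```
// custom.config — basic syntax: name = value, with optional interpolation
propertyOne     = 'world'
anotherProperty = "Hello ${propertyOne}"   // curly brackets isolate the variable inside the string

// comments work the same way as in the rest of Nextflow
```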
Something that hasn't been talked about yet is scopes. Scopes allow you to organize different properties in your Nextflow configuration. In this example here, in the nextflow.config, we've got alpha and beta, and we've got x and y; the way these are written — with the curly brackets versus this dot notation — is 100% equivalent, there is absolutely no difference between the two. Just to show you an example of this — just getting myself together — you could change this, and these are completely equivalent. It really just depends on your preference, and on whether you find it easier to organize all of these things together or have them separated across different lines. They're equivalent, and it's up to you to decide what you want to do.

params is a scope as well, so you can have your parameters as part of a configuration file. This is an example here — it's quite a simple one. Let's make sure we use the snippet... wherever that's gone... we've got hello world, but we could also have a nextflow.config — and we do still have the nextflow.config in this folder. What I'm going to do is just add these parameters in here, so they can be supplied from the config. Then we go nextflow run snippet.nf and you'll see we've got Bonjour le monde, which is different to what we have here — hello world — as part of the main script, because this came from the nextflow.config file. And at the same time, as I've shown you previously, I can override it again on the command line — "Chris is cool" — so again, you can stack these and choose how you want to apply them. What I'm really trying to demonstrate here is that there are all these different levels and all these different file formats, and it's really going to be up to you as a user to work out how you want to put these together. However, I would highly encourage you to have your parameters externalised from your main script — try not to define stuff in your main script. It's much easier to have these things defined in a nextflow.config file like this, or even in a params.json or something like that, just so you're not relying on settings that you might forget about at some point. I think it's much nicer to have it externalised, where you have much more visibility.

There's also the environment scope. The env scope allows the definition of one or more variables that will be exported into the environment where the workflow tasks will be executed. So here, for example, we have env.ALPHA and env.BETA, which have been set to some value and to some path under home, and here we've just got a script that will echo these. Executing the snippet above produces the following output, which is basically just printing these out from the environment, showing that they have been made available.

The process scope is one you should be quite aware of. Process directives allow the specification of settings for task execution — such as cpus, memory and container, the directives we've talked about as part of the process — and other resources in the workflow script. Setting them in the script is really useful when prototyping a small workflow; however, it's always good practice to decouple the workflow execution logic from the process configuration settings, as I was alluding to previously — it's nice to have this decoupled. The process configuration scope allows you to set any process directive in the Nextflow configuration file, which essentially means you could have a nextflow.config with the process scope, and inside that you set your directives — in this case we've got cpus, memory and container — and these would be applied to every process in the workflow.
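A sketch of that process scope in a nextflow.config — the values and container name are placeholders:

```
// nextflow.config — directives in the process scope apply to every process in the workflow
process {
    cpus      = 2
    memory    = '4 GB'
    container = 'some-registry/some-image:1.0'   // placeholder image
}
```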
There's also an example here of how you can supply processes with dynamic expressions using closures. This is potentially a little bit more advanced, but you've got this process through here where the memory can scale based on how many CPUs were made available to it, so you can mix and match and choose how you want all of this to fit together. We don't have the time to dig into exactly what this would look like, but note that while you would have the directives as part of the process in snippet.nf, this block here should actually be in snippet.config or nextflow.config, because here we have the process scope. As part of the process scope we have a selector — in this case it's withName, and it can also be withLabel. These are ways you can choose particular processes, or a series of processes that have been tagged with the same label, to supply resources to. This is quite a common thing in the nf-core community as well: by using process selectors, you can selectively apply configuration options — these directives — to your processes without having to hard-code them into every process itself. This is really part of the foundation for the nf-core shared components, in that all of this is externalised from the processes themselves.

Moving on down a little bit to configuring Docker execution: of course, alongside the support for Docker there's Singularity and there's Conda, and there's support for each of these individually — or you can have them all set up and choose which one you want to use at the time of execution. Here as well you can use the process scope and choose the container that you want to use; in this case that's one container for the entire pipeline, but as we've just talked about, you could also include your containers as part of the config file using process selectors. This is all pretty much the same across the configuration of Docker, Singularity and Conda. To actually get these to run, you can add them to your nextflow.config using process.container, and then the docker scope — docker.enabled = true — to supply them to your pipeline using configuration in a decoupled way. Without going into a lot of examples: you can choose different images with specific IDs, you can choose different images depending on whether it's Docker or Singularity, as I've just mentioned, and you can also supply these to different processes individually. There's a huge amount of flexibility here in terms of how you want to configure your pipelines. Actually doing this requires a bit more time and effort, but if it's something you're interested in, I would consider jumping back to dependencies and containers, spending a bit more time on some of those exercises, and then thinking about how you could externalise this using process selectors and adding them to what should be the nextflow.config, not the snippet.nf file — there's a typo there in the material.
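To make that concrete, a sketch of a nextflow.config that enables Docker and attaches containers via process selectors — the process name, label and image names are all placeholders:

```
// nextflow.config — enable Docker and assign containers per process, decoupled from the script
docker.enabled = true

process {
    withName: 'FASTQC' {                                    // placeholder process name
        container = 'some-registry/fastqc:placeholder-tag'
    }
    withLabel: 'aligner' {                                   // placeholder label
        container = 'some-registry/salmon:placeholder-tag'
    }
}
```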
Okay, so deployment scenarios. This is again kind of an extension of configuration, in a way. When deploying Nextflow, you have the option to quickly and easily choose where you want to send your execution. If you have written a pipeline that is containerized and written in a really reproducible way — shared on git, say — you can basically take your Nextflow pipeline and send it to wherever it needs to be: instead of running it locally on your computer, you can send it to your SLURM scheduler and have it all orchestrated from there, or you can send it up to the cloud. It's all a little bit decoupled — that word comes up again — in that Nextflow doesn't really care where it's being run; you tell it where you want to run it, and if you've written the pipeline using good reproducible practices, it should scale very quickly and easily as well.

How quick is it to actually swap where the executor should be? Incredibly quick. Inside the nextflow.config file, if you haven't set anything it will be local, but you can just change this to process.executor = 'slurm', and then what will happen is that Nextflow will try to submit your jobs as SLURM jobs. So what does this actually look like? This is a little bit harder to demonstrate here, so don't be alarmed when this fails, but what I'm going to do is just run nextflow run snippet.nf; I've saved this setting into my nextflow.config, so it's part of the configuration settings and it will get picked up by Nextflow. Hmm, I wasn't expecting that to run quite so easily — let's try hello.nf instead. Okay, that's a bit better: it fails. So what's actually happening? I don't think I saved it before, which didn't help matters. What's actually happening here is that Nextflow is trying to send sbatch jobs — sbatch being what the SLURM scheduler uses to actually send your jobs to your cluster. If you go and dig into these tasks, you can look at the run command — the .command.run executed by Nextflow — and you'll see that the code at the top is actually the same code you would use to submit your job to SLURM yourself; then, if we scroll down a little bit, we see the actual .command.sh, the script that it tried to run, launched via this nxf_launch command. So Nextflow is actually managing the execution of your jobs for you: you don't need to write a series of sbatch commands, you can just launch from your head node, or whatever node you're working on, and as long as it's got access to the scheduler, Nextflow will be able to manage the execution of these jobs for you.

Jumping back to the material here: of course, there are a number of different cluster resources you might want to manage. As part of the process scope you can choose the executor, and also things like the queue, the amount of memory, the time and the CPUs that you want to allocate to your job — and again, Nextflow can manage all of this for you, so you don't need to manually go through and think about how to edit sbatch commands. You can supply all of this as part of a configuration file using the process scope, and Nextflow will manage it for you. You can also submit Nextflow as a job itself: you could write a launcher script to submit the job for you, so the entire Nextflow run is itself submitted to your cluster, and then the rest of the tasks are submitted from there, rather than you sitting on your head node and submitting the jobs sequentially as different sbatch jobs. You have a little bit of flexibility as the user there as well, and the code here will help you do that, with this custom launch_nf script.
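As a sketch of what that kind of launcher script might look like — all of the SBATCH values and the module line are placeholders for whatever your cluster actually needs:

```
#!/bin/bash
#SBATCH --job-name=nextflow-head      # placeholder values throughout
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00

# Submit the Nextflow "head" job itself to the cluster; with process.executor = 'slurm'
# in the config, Nextflow then submits each task as its own sbatch job from here.
module load nextflow    # or however Nextflow is made available on your cluster
nextflow run main.nf -resume
```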
As I've mentioned a couple of times recently, and will mention again here because it's quite a cool feature that I think is very valuable in real-world applications: different tasks need different amounts of computing resources. Just like you can supply different containers to different processes, you can also choose to supply different resources. By using the process scope with these different process selectors — withName for these two named processes — you can supply different cpus and memory, and send them to different queues at the same time. You could also have labels for all of your processes — a directive on each of the different processes — where you say, I want this to be, you know, a big job, a small job, high memory, low memory, whatever you want them to be; and then, when it comes to writing your configuration file, you have all these labels available, and you can choose which resources you want to allocate to each of them. These can of course be used in combination, in parallel with things like the withName selector, which might specify the container. So you can think: okay, I want labels to be exclusively for resource management — I'm going to set up all of my labels so that I can apply resource settings like cpus, memory and queues — and then with my withName process selectors I'm just going to worry about the container images. You might want to go in and edit one and not the other, but you have them separated out, so you have this flexibility and control over how you want to do it.

Something else you can do is establish profiles as part of these configuration files. Profiles are effectively groups of configuration options that you can supply at the time of execution. With the -profile command line option you could say -profile standard, in which case those would be the configuration options supplied; similarly you could say -profile cluster — dash profile, hyphen profile, however you want to say it — in which case those would be the cluster settings you want to supply. You can also use these in combination and say -profile standard,cluster — you can pick, choose, match and have multiple, depending on what you're trying to do. The point I really want to make here is that with profiles you can supply them on a whim at the time of execution, and you can group together a bunch of different settings that should be applied in specific situations, or for specific profiles of users.

Cloud deployment is especially hard — if not completely impossible — to demo from Gitpod, but there is a lot of support for cloud deployments with things like AWS Batch. There are some nice notes here about the different settings you might want to include, how things like volume mounts work, and the settings you might want to look at there. Right down the bottom here I did want to point out one last thing, which is hybrid deployments. Hybrid deployments basically mean that you can run your entire pipeline across multiple different executors. You could say: I want to start off locally for all of my small jobs, while I'm just dealing with a sample sheet; as the jobs get bigger I want to send them to SLURM; but for anything with this label, big task, I want to send it to AWS Batch. All of this is possible with Nextflow — you have really great control over the different processes and labels, you can do all of this in a really dynamic way — and because there are all these different configuration options that I've shown you previously, you can have all of this abstracted out across different files.
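A sketch of what that hybrid setup might look like in a nextflow.config — the labels, queue name and region are placeholders, and AWS Batch needs extra settings (such as an S3 work directory and a compute environment) that aren't shown here:

```
// nextflow.config — sketch of a hybrid deployment using labels to pick executors
process {
    executor = 'slurm'                 // default: send tasks to the local SLURM cluster

    withLabel: 'small_task' {
        executor = 'local'             // tiny housekeeping jobs run where Nextflow was launched
    }
    withLabel: 'big_task' {
        executor = 'awsbatch'          // heavyweight jobs go out to AWS Batch
        queue    = 'my-batch-queue'    // placeholder AWS Batch queue
    }
}

aws.region = 'eu-west-1'               // placeholder region
```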
I don't think there is one single best practice for how this should be arranged. I would suggest checking out some of the nf-core configs and how they have been set up, especially if you're thinking about how real-world pipelines are structured for things like this. Seeing how nf-core has used labels and also split things into different configuration files — modules, base, and so on — gives you a really nice idea, or at least a suggestion, of how you might want to separate these out yourself as well.

Okay, so what we'll do now is move on to cache and resume. I'll try to keep this section, and my next couple of sections, quite short, partly because they are quite hard to demonstrate, but also because a few of these ideas have come up multiple times already and are covered in more detail as part of the advanced training. As you've already seen us use a few times, Nextflow has a caching mechanism that works by assigning a unique ID — the name of one of these task work directories — to each task execution; that's where the task is executed and its results are stored. These IDs are effectively 128-bit hash values computed from the task's input values, its input files and its command string. So here, for example, we've got this work directory starting with 12, followed by the rest of the hash; a structure like this is really just used to help separate out the files without creating any clashes.

So how does the resume functionality really work? Adding the -resume command line option allows the workflow to resume execution from the step that was last completed successfully. In practical terms, the workflow will still "execute" — big air quotes there — from the beginning, but before a process is executed Nextflow will check the task ID, and if it thinks the task about to execute is the same as a task ID that already exists, then that step will be skipped and the results pulled from the cache, from that existing work directory. Again, all of the work directory files are basically kept unless you clear them. This is quite an important point, because if you don't clear it out often, this directory can get quite large quite quickly, so at some point you'd need to go in there and clear it out — which will of course invalidate the cache.

The work directory itself is created in a folder called work in the launching path by default, but especially on a large system it's recommended that you put this on scratch storage. You can choose to do this by using -w at the time you execute your script, or you can include it as part of your configuration settings — this might be an example of something you could store in the config file in your home directory.

And this is just re-stressing a point that's already been made: the hash for the input files is computed using the complete file path, the file size and the last modified timestamp. This is quite important as well, because it means that just touching or opening a file can invalidate that task, because the timestamp has changed. So you need to be really careful about how you test, or if you are troubleshooting and trying to go back through to work out what's happened: by touching a file all the way at the top of the workflow, the top of the pipeline, you invalidate the cache for everything beneath it — anywhere that file filters down to will be invalidated, and you'll have to start again.
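In practice, the resume and work-directory pieces from above look something like this — the scratch path is a placeholder:

```
# Re-run a pipeline and reuse cached results for any task whose hash hasn't changed
nextflow run main.nf -resume

# Point the work directory at scratch for this run...
nextflow run main.nf -w /scratch/$USER/work

# ...or set it once in a config file (for example the one in your home directory):
#   workDir = '/scratch/myuser/work'
```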
So, some advice for organizing your experiments: generally I would advise organizing these as separate folders. When you're executing your pipelines, Nextflow is creating logs, and those logs are kept in hidden files that you might have missed so far. You can tell this is a bit of a mess of a directory, but if you actually look at the hidden files — using ll — you'll see there's a huge number of hidden .nextflow.log files. We've also got this .nextflow folder in here, where a bunch of other assets and things are being stored as well. If you go and look into all of this, it's all information that Nextflow is keeping from my executions so that I can go back and interrogate them again. You can actually use the nextflow log command — we haven't talked about this a lot so far, if at all. nextflow log allows you to view your previous executions in this working directory, so you can see here all the different scripts that I've run today, with some basic details about them: when I launched them, the duration, the run name, the status, the revision and the session ID, as well as the command that was actually executed. All of that has been kept in this directory. If we were to move into a different directory and try that again, you'll see that nextflow log is empty, because nothing was executed there. But if I had been smart and been separating out each of these different exercises throughout, I wouldn't have one big series of logs to look through — I would only have the logs specific to a specific project in each folder. For that reason, what I would do is make a my_experiment folder, cd into my_experiment, and then just launch everything from there, so that all of your experimental runs and all that data is stored locally to the project. If you were to do this, you can have your work directory stored in here as well, however I would probably still preferentially advise you to use scratch instead. So, just taking a step back again — cd — I'm just going to do a run very quickly, assuming that I haven't broken this again; I'm just going to run script2.nf.
Ah — and that fails. That's because I've added that executor setting to my config; this is another good reason to be careful about where you store your configuration options, because if you're like me, you can forget about the things you've added. Now I'm going to try that again.

Okay, so what I'm building up to here is wanting to actually take this run name and then just go nextflow log: as you can see here, you can supply the run name and view specific details for that specific run — in this case all the different work directories — so I can go through, work out what these work directories were, and go and have a look inside them as well, because there are the paths to look inside. As an extension of this, there are actually a lot of different fields being collected, so instead of this I'm just going to add -l, and these are all the different fields that you might be interested in looking at or characterizing. Taking this again, instead of using -l to list all of the options, I'm going to use -f to select just the fields I want: you can see here that I can pull out the process, the exit status, the hash and the duration — how long these took to run. Again, if you're trying to interrogate your run and work out why things might have failed, or go back and resume a run, these are the types of things you might be interested in having a look at. At the same time, one of the things you can do — and I think this is quite cool as well — is save a template. If there are specific provenance fields that you're interested in and want to report on quite often, you can save them as template.html. I'm just going to save that in there — all I've done here is copy this code out and save it as template.html — and now I'm going to run this again, but instead of -f I'm going to use -t, and what this will do is create quite a nice provenance report. Wherever it's gone... this is a provenance report that's taken the fields that were part of that template. Okay, it hasn't rendered very well here for me, but you can see that this has actually created a nice report — or at least the start of a nice report — that could be used and shared with collaborators or friends, just showing the outcomes of your runs.

Okay, so moving on a little bit again: troubleshooting resume. Going back to the resume functionality, there are three quite nice articles produced by Seqera and Nextflow about this functionality and some of the different tips and tricks you might need to work out what's going on. If you are having trouble with resume and you're trying to work out what's happening, I highly recommend these as a good resource. Here are some of the more common reasons that can cause your resume functionality to be invalidated. Input file name changes: you need to make sure there's no change in your input files, and just be reminded — don't forget — that your task hash is computed taking into account the complete file path, so if you've moved the files around, or touched a file and modified its timestamp in any way, it will be invalidated. A process modifying its inputs: by accident you might have modified an input based on how you've used your variables; if you're using good practices this shouldn't be such an issue, but if you've accidentally overwritten something — modified a file that is a required input of some processing step — this would invalidate your cache as well.
Some file attributes on some shared file systems, such as NFS, may report an inconsistent file timestamp; to help prevent this you can use a lenient caching strategy — you can change the way the caching works and make it more lenient. Generally I would say don't make it more lenient unless you have to: lean into Nextflow's stringency, trust what you're doing, trust that you haven't accidentally resumed something and are now getting weird results because you didn't realize, and trust that Nextflow is taking care of the file attributes for you. Race conditions on global variables: you might have some channels that are effectively causing race conditions — that might be access to a shared resource, or some type of counting mechanism, it doesn't really matter what it is — but it can basically happen when using global variables with two or more operators, where the variable is defined in the global scope and could be used in multiple places. This is explained in really good detail as part of the advanced training as well, if you're interested in that. Finally, non-deterministic input channels: while dataflow channel ordering is guaranteed — data is read in the same order in which it's written to the channel — there's no guarantee the elements will maintain their order in the process output channels. So if you're just stacking things on top of each other and expecting them to stay in the same order, they might not, especially with files of different sizes, so you need to be really careful about how you're splitting and rejoining your files together; this can definitely be helped with things like a matching key.

Okay, finally — well, second to last, and certainly not least — is error handling and troubleshooting. There are lots of suggestions here for different ways you can interrogate your runs; of course there is execution error debugging, so depending on how your process or task was executed you'll get some sort of error, and with some luck you'll be able to go back through and work out what it means. One thing that has been shown a little bit here and there, but that I'll go back now and show you in detail, is that there are actually a lot of dot-files created with every execution. All of these are inside the work directory, inside the task directory where the process is being executed. So here, for example, I'm just going to take this first process — let's do a long listing, we'll go into work — and there are all of these files here: .command.begin, .command.err, .command.log, .command.out, .command.run and .command.sh. The .command.run is a nice one, because it contains a huge amount of information about the run of the command — this is what Nextflow is actually trying to execute — and with experience you can get an idea of what's happened from it; however, this probably isn't your first port of call. The one I think you probably want to look at first is the .command.sh: this is where you can see the command that was actually executed by Nextflow, executed as part of that run command. If you see that variables haven't been replaced properly, that is a sign that something hasn't worked, potentially with your variables or your inputs and outputs, so this can be a really nice file to go and dig into. Then there's the .command.log: if something has gone wrong, you might have some nice output in the log which could give you some nice clues.
With .command.begin, again, you don't really expect to see much if everything has gone right, but if something has gone wrong it can be quite enlightening to see what's happened in there; the same with .command.err — you'll see that when things have gone well, it doesn't necessarily give you anything extra. Jumping back over to the training material, these are the descriptions — probably better descriptions than the ones I've just given you — and you'll see that all of this is listed here, as well as a reminder that the input files are symlinked in. You can change this to copying if you want, but then you're just copying files when you don't need to. Any files that are created as outputs are created here as well. Just to show you what that looks like, again using this hello.nf as an example, because it's a file we're quite familiar with: you'll see here that we don't actually have any output files, but we can see that this input file has been symlinked in. If we were using one of the scripts — scripts 1 through 7 — you would see output files produced here as well. And just because I do think this is an important point: nextflow run script7.nf with Docker — what I want to show you here is just that the files that have been produced are sitting locally in this task folder too, so it's nice and quick to have a look at what's in there. You can see here the folder and report that were created by MultiQC, whereas these other logs that were actually supplied to it were symlinked in; here we have the files that were created as outputs, and it looks like there was a log generated by MultiQC as well, so we can actually jump in there and look at that too. Again, these are all files that you can interrogate to try to understand if something has gone wrong.

You can also add in directives to choose how you want to deal with errors — error strategies. Here is an example of just using ignore: if there's an error, you can say ignore it and just keep moving the pipeline forward. You can also do things such as retry — stuff that has failed, you can try again — and you can use some sort of custom logic to choose how you want to scale this: if you're going to try it again, do you want to try it with more resources, and how many max retries should we attempt before we just fail out? You can add all of this in as directives. Something that is quite common is dynamic resource allocation: if you are going to be retrying, then as the task attempt increases you could try again with more resources. These are the kinds of features you can think about and apply as you scale up your pipelines; you probably won't launch into all of this straight away, but I think it's nice to know about it, so that when you think, okay, now I want to move this to the next step, I want to send this to the cloud, I want to do more, I want to deal with these pesky failures — at least now you'll be aware of it, so you know how to come back and think about this, and know what some of the options available to you are.
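As a sketch of those directives together — the process name, its script and the resource numbers are placeholders:

```
// Sketch: error-handling directives with dynamic resource allocation on retry
process EXAMPLE_TASK {
    errorStrategy 'retry'                 // or 'ignore' to just keep the pipeline moving
    maxRetries 3
    memory { 2.GB * task.attempt }        // 2 GB on the first attempt, 4 GB on the second, ...

    script:
    """
    echo "this is attempt ${task.attempt}"
    """
}
```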
Okay, so that's the last part of this block of training; however, we do have one section left, which is Seqera Platform. I'm just going to jump over here, clean this up a little bit, and then we're going to start talking about Seqera Platform and how you can access it for free. So finally, to close things out, what I'll do is give you a bit of an introduction to Seqera Platform. What I will try to point out is everything that you can do for free as part of the platform — you pretty much have full access; there are some limitations on how many records you can keep and how widely you can share your runs, but largely this is a fully functioning platform that you have access to for free.

Seqera Platform, previously known as Nextflow Tower, is a centralized command post for data management and workflows. It really helps bring together monitoring, logging and observability for distributed workflows, and simplifies the deployment of workflows on the cloud, a cluster, or your laptop. Some of the core features include: launching pre-configured pipelines with ease; programmatic integration that meets the needs of organizations; sharing with others by publishing pipelines to shared workspaces, so others can see the pipelines you develop and also deploy them themselves; and management of the infrastructure required to run your data analysis at scale, with managed cloud computing environments.

What we'll start off by doing, though, is exploring the platform through the online GUI. To do that we just have to set up a couple of things first, and the first is this token. What we all need to do is jump on over here to seqera.io and hit this login button at the top of the page. Once you've done this, you'll need to sign in with your GitHub account; if you haven't registered before, it will take you just a couple of clicks to register, but it isn't a big, long, difficult registration process. Once you have access, you'll land on something that looks a little bit like this — you might be landing on a community showcase, which is a nice showcase full of some of the nf-core pipelines that you can launch, run and monitor using an AWS environment we have available here, just for you to explore, with preloaded pipelines as well.

Jumping back to the training material, you'll see here that we have a list of instructions: you need to first generate a token, and then also your workspace. I'm going to do this live in front of you, but if you can't keep up, or if you want to stop and look at any of this stuff again, you can follow through these instructions here on the website as well. What I'm going to do is go over here to the top right-hand corner, next to my profile avatar, and down here we have Your tokens; I'm going to click on that. If I was to add a new token, I would add in a short name here, click add, and it would generate this access token for me. I'm not going to do that today — I've already pre-made a community demo token here which I'll be using — however, if you are following this through yourself, you can see that you just add the token name here, and after you've hit add it'll give you your personal access token, which you need to copy and keep. For now, what you need to do with it is actually export it using this line of code here: you'd type export TOWER_ACCESS_TOKEN= and then paste whatever your token is on the rest of the line, hit enter, and it'll be exported into your environment.
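Together with the workspace ID we're about to generate, the two exports end up looking something like this — both values are placeholders for your own token and ID:

```
# Placeholders — paste the values from your own platform account
export TOWER_ACCESS_TOKEN=eyJ...your-token-here...
export TOWER_WORKSPACE_ID=0123456789
```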
After that, you'll also need to export a Tower workspace ID. To set up a workspace ID, I'm just going to click here on Seqera to go back, and then go to this drop-down menu; first of all, you will need to add an organization. You can just call this, you know, myorg; mine comes out as chrisorg — chrisorg, with the description "an org for chris". I could give it a location, a website and a logo if I wanted to as well, but I'm just going to add it as is. You can see here that I now have chrisorg listed under my organizations. I can click on that, and now I need to add a workspace, which could be my project — my_project. You can see here that I'm not giving a lot of thought to these names, but you could of course give different projects different names and different descriptions, depending on what you're trying to do. I'm just going to make this nice and public, so if I did add anyone else into this — anyone else in my organization — they'll be able to see it as well. What I'm really interested in here, though, is this ID. Going back to the training material, we have this export for the workspace ID; I'm just going to copy that, move over here, and then paste in this workspace ID, just like this. Great, so now I'm all set up — that's what I needed to do for now. As you'll see, at the moment there's actually nothing happening inside my organization or my project: we have an empty launchpad, we don't have any runs to look at, there's nothing here under actions, datasets, data explorer or compute environments — it's really an empty space apart from the participants, which is just me, though I could add some friends in here if I wanted to as well.

Going back to this section, which is about using the online GUI but also launching from the CLI: what we can do is add this with-tower flag to our runs, and what that will do is send our runs to Seqera Platform to monitor. So what this looks like is nextflow run script7.nf -with-tower, and I'm also going to add in -with-docker here, because I've been playing around with my configuration file. What this is doing is sending all of my logs and monitoring for this pipeline to my project inside my organization. So here, under runs, I can see this whole run in progress — I can see what I've just launched running here. You can see here the command line — the command line that was actually showing back over in my Gitpod environment — I can see the parameters that were applied, and the different configuration options that were applied, even those configurations that were left over from earlier in the session. If I'd used datasets they would have been in here as well, and if I was generating reports they would have been here too. Down here we can see this general overview of the run: if this was still running you'd see the different tasks submitted and succeeded, and if anything had failed we would have seen it there. We can see all the processes ticking over here, some aggregated stats about how long it took, how much memory was used, the wall time — all of these metrics that you'd be interested in when running large production pipelines. Down the bottom here we have some of the CPU usage, which allows you to estimate the resources that were allocated; I'll come back to that, but you can see here that I can actually monitor the run.

What I'm going to do now is show you what this looks like with an nf-core pipeline, so I'm just going to quickly run the nf-core RNA-seq pipeline with the test and docker profiles. This will launch the test profile for the nf-core pipeline — a minimal test dataset used to test the pipeline in continuous integration and other things like that. I'm just going to kill that for a second, because I forgot to add -with-tower.
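For reference, the commands end up looking something like this — assuming the nf-core/rnaseq pipeline and its built-in test profile:

```
# Monitor the small local example in the platform
nextflow run script7.nf -with-tower -with-docker

# Launch the nf-core RNA-seq pipeline with its minimal test profile, monitored in the platform
nextflow run nf-core/rnaseq -profile test,docker -with-tower
```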
Now, as this is launching, what I'll hopefully see very soon, once it's finished pulling, is that I'll be able to track and monitor this pipeline in Seqera Platform as well. It's really close to being fully launched... there we go. You can see how it's been updated, and we can see that this is the run: all of the real parameters that have been included, the configuration options, all of this is listed here. We've got a job running and two succeeded, and all of the different jobs are listed down here as well. You can see that in a real pipeline, when you actually want to go in and look at some of this information, these are the execution logs and other things that might be of interest to you as a user or as a developer, and you can start to explore all of this here using the platform.

Just while that's running, I'm going to jump back over to this community workspace. This is just the showcase area that has a bunch of different pipelines that are already a part of nf-core; for example, this is the nf-core/rnaseq pipeline. What you can do is set up your org, add your Launchpad here, add in your pipelines directly from GitHub, and start adding in different parameters, resource labels and advanced options; you can specify all of this here. So, much like over in the community showcase, you have a list full of different pipelines that you want to launch, and you can just go in here, browse them, find out what they do and which compute environments are available to them, and then once you're happy you can just say launch. What this does is bring up this nice interface that allows you to basically fill out everything that you need to launch this pipeline in your environment. What you'll see is that here are all the parameters, including all the different hidden parameters as well. We haven't really talked about nf-core pipelines in a lot of detail, but as you can imagine, for real production pipelines these are quite big, with a lot of different settings that you might want to choose. There are also options here to upload parameters files, which we talked about briefly very recently, but you can just launch it; I shouldn't have needed to add anything else there. You can see that all of this is executable, and it can be launched and monitored from here as well. As this has been submitted, it's gone away to try and spin up some instances on AWS, so it will take a wee while to launch, but it will run quietly in the background.

While it's doing that, I just want to point out a few different features. You can have Actions, which are automated workflow executions that can be triggered based on different webhooks, for example. You can establish your own Datasets, so you can upload datasets in here; this could be data that you've got stored somewhere, such as an S3 bucket, and you can upload your datasets through here and it will go away and pull in versioned TSV or CSV files, if that's how you want to store things like patient metadata. We've also got Data Explorer, so if you're storing your data in different buckets, for example, you can go away and explore those, and explore publicly available buckets at the same time; the idea here is that you can go in there and touch the data, rather than just referring to it by a file path. You can establish your own compute environments, so if you do use AWS, for example, you can have these compute environments available and pre-established for a specific project, which would be really helpful, especially in combination with things like labels.
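As an aside, the parameters form in the Launchpad is doing much the same job as a parameters file does on the command line. Here's a small sketch of that idea; the parameter names below (input, outdir) are typical nf-core-style examples rather than anything taken from the demo, so check the pipeline's own documentation or schema for the real ones.

    # Sketch: write a small parameters file instead of typing every --param on the CLI
    printf 'input: samplesheet.csv\noutdir: results\n' > params.yaml

    # Launch with the parameters file and send monitoring to the platform
    nextflow run nf-core/rnaseq -profile docker -params-file params.yaml -with-tower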
What I didn't show you back here is that you can actually add different labels to your runs, so that you know where the costs are attributed back to, and things like that, which can be really helpful. For credentials, you have lots of options in here, so you can store different credentials for different agents or different GitHub accounts or whatever else you want to store in here. Very similarly, you can have different secrets in here as well, so you don't expose anything by accident. Finally, you can add in your teams, so that you don't have to do science alone; you can share this with colleagues so they can track and monitor your runs as well. At the same time, depending on your administration level, you control who has access to do what, so you might want to just share it with someone, or you might want to give more privileges to someone else, and you can do all of that through participants and their different roles.

Coming back here to the run, it's probably still just spinning up, it's taking a wee bit of time, but you can see that the wall time has started ticking over, and we'll get an estimated cost based on all of this as well. Something I wanted to show you as well, quite a cool feature, is that you can actually optimize some of these runs. Based on previous runs, you can optimize the execution of a pipeline: the platform will work out what you actually utilized of your requested resources, and it comes up with a configuration file using the process scope, with a series of withName selectors for every different process in the pipeline (there's a sketch of what that kind of configuration can look like just below). This is all based off the run metrics: if you were to go back in here and actually look at one of these pipelines, you'll see down the bottom that there's a comparison of the resources that were requested against what was actually utilized. So when you've requested a lot more than you've actually used, and the usage is very minimal, you can go away and optimize based on this, so that when you're requesting resources it's much more intelligent than just guessing about what you actually need. At the same time, for different runs you can have reports that have been generated, and you can go in here and access all of those through the platform as well.

So this is kind of a scattershot of all the different functionality that's available as part of the platform. Jumping back to the actual training material, all of this is explained in slightly more detail, but you can see that you've got options to configure your compute environment; there are lots of different cloud providers as well as schedulers available, and what you can do with the platform is hook it in so that as you're launching these runs you can run them on your cluster or in the cloud and then keep an eye on them through the platform. You've also got options here to set a default compute environment, so you can choose what to make the default every time you launch, which is really helpful because you can pick something a little bit conservative rather than something that might cost you a lot of money, especially if you're going through the cloud. There's a little bit more about the Launchpad there, how you can click on these pipelines and launch them, and there's also the pipeline parameters form, which I showed very briefly as well.
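To give you an idea of what that optimized configuration can look like, here's a small made-up sketch in the style described above, using the process scope with withName selectors. The process names and resource values below are purely illustrative; the real file is generated by the platform from the usage it actually observed in your previous runs.

    // Illustrative only: per-process resource requests, trimmed to match observed usage
    process {
        withName: 'FASTQC' {
            cpus   = 2
            memory = 4.GB
            time   = 1.h
        }
        withName: 'STAR_ALIGN' {
            cpus   = 8
            memory = 32.GB
            time   = 4.h
        }
    }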
With each of these pipelines you can have, effectively, a schema file, which is used to render that form, and this helps you input everything in the platform. That means it's a lot easier for people who aren't bioinformaticians to access this, and if you're running a lot of pipelines regularly it's going to be a lot nicer to input everything you need rather than having to do it on the command line. There's a big long description here about how you can add a new pipeline; I won't go into it here, but there's some example code that you might want to consider, and some steps that you need to go through to add that pipeline as well. Finally, there's a little bit of a section here about the API, so you can also use an API to start triggering some of this based on different events, and you can also choose to supply different resources, different users and things like that as well. This is all part of the workspace and organization setup, one of the final points I touched on there.
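Purely as a taster of that API, and not something from the demo itself, a request along these lines would list the runs in a workspace; treat the base URL and the endpoint path as assumptions and check the current Platform API documentation before relying on them.

    # Assumed endpoint: list workflow runs in a workspace via the Platform API
    # (base URL and path are assumptions - check the current API docs)
    curl -H "Authorization: Bearer $TOWER_ACCESS_TOKEN" \
         "https://api.cloud.seqera.io/workflow?workspaceId=$TOWER_WORKSPACE_ID"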
However, ultimately, much like I've been saying with other parts of this workshop material, I really encourage you to come in here and have a play around. It's not going to cost you anything, and you can't do anything that's going to break the platform, but the showcase is really nice to come into and see what it would be like to have all of your pipelines set up in Seqera Platform, and then just jump in and launch a pipeline using compute environments out of the box. Being able to track your runs is really nice, so even going back to my project, the one that I've just created here, you can see that while this is still running I can see what's been running, what's been requested, what's been utilized, how much memory was used, all of these things that we could access a little bit through the logs, but it's really just made so much easier through the platform. You can jump in here and just view all of this, and again, this is really useful for debugging and understanding your code as well. Of course, in this environment nothing's been set up, this is all kind of an empty slate, but if you do go in and set all of this up, it can become really valuable really quickly for you as the user. Of course, I'm sure the Seqera team would be interested in hearing from you if you want to know more about this as well, but I think the first step is jumping in and trying this out for yourself, seeing what you think, and seeing if it improves your experience of launching, monitoring and running pipelines.

Okay, so that is where I'm going to leave it for the training. Thank you very much for attending. I know that this session in particular is quite long; there's a lot of talking and a lot of listening, and there isn't a lot of opportunity for engagement or exercises. But what I do want to stress is that, whether you realise it or not, you've been exposed to a lot of new ideas. A lot of the concepts that you learned as part of session one have been expanded, and we've learned more about different operators and different features of channels, as two very low-hanging examples, but there's been a lot covered. It's okay not to understand everything at once: go back, try and break things, pull out the examples, try them on your own compute, and try to take what has been shown here and apply it to your own work. I hope that it has been helpful. From Marcel and me, thank you again for attending. Please don't be a stranger in the community; if you need anything, we of course have the nf-core Slack, and there's also the Seqera community forum. Please check those out, they're great resources. Thanks so much.