Okay, hello everyone. Today is the second session for our time zone, and the idea is to go a bit deeper into some concepts. We're going to cover a few sections of the training material: channels in more detail, processes, operators, then a break, and after that modules, the Nextflow configuration file, and some deployment scenarios. Yesterday was a bit intense: we built a full RNA-seq pipeline and touched on many concepts, like channels, processes and operators, and at the beginning I warned you not to worry too much about those details, because they would be explained later. That's basically what we're doing today. We'll talk about channels again, but much deeper, processes again, much deeper, and then operators. We'll try to slow down; this part will probably take more than an hour, and I'll try to make sure it's really clear to you. Then we'll talk about modules, Nextflow configuration and the deployment scenarios, okay?

So we're back in Gitpod. It's taking a bit longer to load today; as I mentioned in the lecture, it's a good idea to start it before the training so it's ready when the training starts. A good thing to do first is probably to look at the final RNA-seq pipeline script we wrote yesterday, which is this one. I'll increase the font size and go over it quickly as a review, so we don't lose too much time.

We have some parameters at the top. Whenever we write params.something, params acts like a keyword: it means we can pass that value to our Nextflow pipeline with two dashes on the command line. We have the reads and the transcriptome file, and we're using the Nextflow variable $projectDir, which points to where the script file is located. We have the output directory too. We use log.info to print some nice information to the command line, so when users run our pipeline they know which transcriptome file is being used, which reads, the folder for the results, and so on.

The first process is INDEX, which uses Salmon to generate an index from the transcriptome file we provide; the result is stored in a folder called salmon_index. The next process is QUANTIFICATION, which performs quantification, again with Salmon. It takes a tuple, a file pair: an ID like gut, liver or lung, plus the reads, two reads for each sample. We save the output to a folder named after the sample ID. Because we want this to happen dynamically, with each task creating a different folder depending on the sample, we use double quotes and the dollar sign so the sample ID is treated as a variable, okay? Then the script block calls Salmon. After that we use FastQC to run quality control on our samples, and MultiQC to generate a full report of everything the pipeline did.

At this point the processes aren't connected; they're just defined. We use a workflow block to connect everything. We create a channel with a channel factory called fromFilePairs for the reads, and we check that they exist.
If they don't exist, if there are no reads at that path, Nextflow will print a nice error message about it. We store the result in read_pairs_ch with set. We learned yesterday that this is the same thing as assigning with the equals sign, but a lot of people use this style of adding, at the end of the chain, the variable where the result will be stored. In the R community, for example, it's very common to use an operator at the end of a chain to save the whole computation to a specific variable, and that's essentially what we're doing here in Nextflow.

Then we call the INDEX process and save its output channel; then we call the QUANTIFICATION process with the output of the previous process and the samples; then the FASTQC process to check the quality of the samples; and finally we pass everything to MULTIQC. We take the output channel of QUANTIFICATION and mix it with the output of FASTQC. What does the mix operator do again? It simply puts different channels together as if they were one. We'll talk about it properly soon, so it's better to save that for later.

Then we have this handler: when the workflow is complete, with no failure and no error, it writes "open the following report in your browser" and so on. It's a ternary expression: if the condition is true, do this; otherwise, after the colon, print "Oops, something went wrong".

Now that we've gone over the whole pipeline we wrote yesterday, let's start looking at things in more detail, beginning with channels, which is section 5 of the training material. The idea is that we always have tasks at some point; at the beginning we only have the channels, right? We provide inputs through channels to our processes. Again, the process is the definition; when it runs, it's a task. A task, say task alpha here, produces an output, which is a channel with one or more elements; in the figure it's files X, Y and Z. That output of task alpha becomes the input of task beta, and it keeps going like that.

But what is a channel? As we said before, a channel is basically a first-in, first-out queue: the first element added to the channel will be the first one passed to a task, to a process. Channels are non-blocking and unidirectional. But we don't only have queue channels; we also have value channels, okay? Here's one example: we create a channel using the of factory. What are channel factories? They're the functions that help you create a channel. Here we want a channel with three elements: 1, 2 and 3.

So let's create a script, test.nf, and open it here; I'll split the screen so you can see everything, with section 5 of the training material, on channels, on the other side. We create a channel with three elements, 1, 2 and 3, and store it in a channel named ch. One thing we can do is use the println function to print it as a variable, because channels are variables. But we can also use the view operator to really see the channel, which is more high level, let's say. So let's do nextflow run test.nf and see what happens.
What println shows isn't very understandable to us, because that's how Nextflow handles these things internally. What we actually care about is what's inside the channel, and that's what the view operator shows: three elements, 1, 2 and 3. They appear on separate lines because view prints every element of the channel to your standard output, one at a time. If instead we put the values between brackets, that's a single element: the channel has one element which is a collection of three values, 1, 2 and 3. If we run it, we get one line, "1, 2, 3", because it's one element.

One interesting thing we can do is use the option -dsl1. Again, because it has one dash, we know this option is for Nextflow itself; if it had two, it would be for the pipeline. It tells Nextflow to interpret the script with the older version of the language. That's not what we should normally do, but there's one very interesting thing here that helps us understand what a channel is. If I run the same script with DSL1, the println output reveals a very interesting structure: a queue, which is a queue channel, with one element, [1, 2, 3], and a poison pill. The poison pill tells the process that the channel has been consumed and it's over; there's nothing else to consume. If instead we have 1, 2, 3 without brackets, a queue channel with three values, we see something similar but with more elements: a queue channel with one element which is 1, one which is 2, another which is 3, and again the poison pill at the end.

What if we want a value channel? Instead of the of channel factory, we use value. Value channels are singletons: they hold only one element. I'll put the brackets back, since there's going to be one element, set it to ch again, and let's see what's inside now. Something interesting appears: when we had a queue channel, there was a poison pill at the end; now there's no poison pill anymore, just the value. This means processes consume this kind of channel differently.

Usually you won't use value channels explicitly, because implicitly they're already used. When we were using script seven yesterday, we had, for example, this parameter for the transcriptome file: just a string with the path to it. In the workflow we gave it as input to the INDEX process. But if we go to the INDEX process, it expects a channel that contains a path, and we didn't create a channel; we just sent a string. What happens is that Nextflow automatically, behind the curtains, creates a value channel, stores the value in it, and passes that value channel to the process. So even if you don't use value channels, they are used by Nextflow automatically. But there are some situations in which you have to use them explicitly, and I'm going to give you one example.
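Before the example, here's roughly what we've typed so far, as a minimal sketch (the values are just illustrative; with recent Nextflow versions you may need to wrap the statements in a workflow block):

```
// test.nf: queue channel vs value channel
ch_queue = Channel.of(1, 2, 3)       // queue channel: three elements, each consumed once
ch_value = Channel.value([1, 2, 3])  // value channel: a singleton holding one element

ch_queue.view()  // prints 1, 2 and 3 on separate lines
ch_value.view()  // prints [1, 2, 3] once
```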
Back in test.nf, I'm going to create two channels: ch1 with 1, 2 and 3, and ch2 with 1, 2 and 3 as well. These are two queue channels, right? We're using the of channel factory, which creates queue channels. Then I create a process, SUM, which takes as input a value x and a value y. The output is stdout, the standard output; it will just print to the screen, okay? I put debug true because I want that output to be seen. I have a script block, although we don't even need to type script: if it's the only block, it's treated as the script by default, so let's take it out. And this is just a bash script summing the two variables, okay? Then I create a workflow block and call SUM with ch1 and ch2. Let's see what happens.

The catch is that we have to escape the dollar sign; otherwise Nextflow will try to interpret it, and we want bash to do that. And again, any questions you have, you can ask on Slack. Rob and so many others have been helping us yesterday and today, so feel free to ask anything during the training, in the breaks, or even after; for these three days the Slack channel is there for you.

So what we did here was basically sum the pairs: one plus one is two, two plus two is four, three plus three is six. The lines don't come out in the exact same order, because there's no guarantee the tasks will finish in the same order or take the same amount of time, but we have the right outputs. Okay, that's good. But what happens if we give ch1 four elements and keep ch2 with three? When this happens, something interesting is going to happen, because of the poison pill, as you saw before. The SUM process expects two values, one from each channel. When it's finished with 3 and moves on, ch1 sends 4 as a value, but ch2 sends a poison pill. On receiving a poison pill, the SUM process ceases to work. That's why we get one plus one, two plus two, three plus three, and then nothing else.

Sometimes you may want things to keep working; you don't want the poison pill to stop the execution of your tasks just because one of the channels has no values anymore. One way of doing that is to use a value channel. If ch2 were created with of and a single element, it would end right there: after sending the value, there would be a poison pill and it would be over. I think it's more dynamic to type it out, so let's do this: with of we get only the first sum, because when ch1 sent its second value, there was already a poison pill waiting. But if we use value, we create a value channel, which has no poison pill, as you saw. Then all the elements of ch1 get summed with 1, and we have 2, 3, 4 and 5. So this is one case in which you want to use a value channel. But usually you don't do that; you just go with queue channels, okay?
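Reconstructed roughly from what we typed (treat it as a sketch; the names are as I remember them):

```
ch1 = Channel.of(1, 2, 3, 4)
ch2 = Channel.value(1)  // value channel: reused for every element of ch1

process SUM {
    debug true

    input:
    val x
    val y

    output:
    stdout

    // \$ is escaped so bash, not Nextflow, evaluates the arithmetic
    """
    echo \$(( $x + $y ))
    """
}

workflow {
    SUM(ch1, ch2)
}
```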
So, we were talking about channel factories. We have value to create a value channel, and we have of, the most common one to create queue channels. In yesterday's session and today, you've probably seen this thing with the curly braces and wondered why we have them. They can be useful for many things. Let's create a new channel: 1, 3, 5, 7. And again, spaces or no spaces, all these things are just a matter of style; you're free to format it the way you want. If we view it like this, we just get the contents: 1, 3, 5 and 7. If we use the curly braces, we refer to each element as it, which means something like "take the channel, and for every element, just give it to me", and we get the same output. When you have the curly braces, the parentheses are optional; I'm using them here, but I'll remove them afterwards, since we don't need them.

The nice thing is that with the closure you can play with every element of your channel. Here I take every element of my channel and multiply it by two. Or you can build a string with it: here we put the word "value" before each element of the channel, and because we're inside a string, we have to use the dollar sign. Then we get this output here.

Another nice thing is that the of factory also allows ranges: you can use two dots to create a range. So I can ask for a channel with all the numbers from 1 to 23, then x, then y, then a collection of values, and so on. You can do everything here.

fromList is a not very used channel factory: basically, if you have a list of values, a collection like that, fromList turns it into a queue channel, okay? fromPath we used in the pipeline yesterday: it creates a queue channel of paths. Here you can use globs. This asterisk means I want all the files inside data/meta. You can also use two asterisks: with the two stars you get not only every CSV file in here, but also the ones inside folders within meta, so it goes through subdirectories. There are options you can provide to the fromPath channel factory: hidden files, the type (files, directories or any), the depth of subdirectories to visit, whether to follow links, whether to check that the files exist. Many of the channel factories take options like these. Here, for example, I want all .fq files within this path, including the subdirectories, and I want to include hidden files.

fromFilePairs we saw yesterday too. You give it a glob, files ending in .fq with a 1 or a 2 in the name, and it automatically collects them in the form of a tuple: the first element is the sample ID, taken from the file name, and the second is a collection of files, two in this case. Here's an example of what we saw yesterday, so this shouldn't be new to you: liver, two samples; gut, two samples; lung, two samples. Again, there are options for the fromFilePairs channel factory, in this case flat: if you set it, the tuple is flattened, so instead of the files being grouped in a sub-list, the ID and the files sit side by side as plain elements. There are many things you can do.
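A couple of quick sketches of those factories; the paths follow the training data layout, so adjust them to whatever you have:

```
// all CSV files under data/meta, including subdirectories and hidden files
Channel
    .fromPath('data/meta/**.csv', hidden: true)
    .view()

// read pairs collected as [sample_id, [file_1, file_2]] tuples
Channel
    .fromFilePairs('data/ggal/*_{1,2}.fq')
    .view()
```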
It would be very nice if, after the training, you came back to the training material, read it again to get more details, and also checked the official documentation, which has many more operators, channel factories and other tools you can use with Nextflow. One very interesting factory is fromSRA. You have to configure Nextflow to use your NCBI API key, but once that's set, you can just give it an SRA ID and Nextflow will download the data and automatically create a channel with the paths to the samples of the ID you provided. This works, but there is a pipeline specifically for this called fetchngs, and you'll probably see it tomorrow in the nf-core training, since it's an nf-core pipeline. So maybe fetchngs is preferred over fromSRA, but the channel factory is there. And again, you can pass many IDs and get a channel with the IDs and the locations of the samples for each one. Here's one example of how you'd use it: we have only one process, FASTQC, doing quality control on the samples, and the workflow block starts right away by collecting the samples with fromSRA and calling FASTQC. So it's much simplified.

Here's a question that I think someone asked yesterday, maybe in the other time zone's session: you have a text file and you want to play with it. Here we can use the splitText operator. We use the fromPath channel factory, and once we have the channel, we use splitText to split the TXT document. By default it splits so that every line of the document becomes an element of the channel, and then you view it. But sometimes you don't want that, so you use the by parameter to say by how many lines you want to split the file. Here we split every two lines. So we can run that; oops, of course we need an input file. We run it and we have an element for every two lines.

But if we just view it, we'd end up seeing the whole document. So what we do here is use the operator called subscribe, which basically does something with every element of the channel. What we decided to do here is print: we print the element, which is two lines of the document, and then we also print a new line with "end of chunk" to make the separation clear to us. So you see: two lines, end of chunk, two lines, end of chunk, two lines, end of chunk, and then one line, because the number of lines in this text file is odd. You can use subscribe to do many interesting things; your creativity is the limit. You can think of anything and use the subscribe operator.

Here, for example, we use by: 10: I want every ten lines of the document to be an element in my channel. And we use a closure, the same way we used one with view before; we can use closures with any operator, and here we use one with splitText. We want every element to become uppercase. If we replace that and run it again, we get the same thing, but with uppercase characters, and in chunks of ten lines rather than two. As you see, everything is uppercase. You could do math if there were numbers, you could reverse the text, and so on.
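Roughly what that looked like; the file name is just an example:

```
Channel
    .fromPath('data/meta/random.txt')  // any plain-text file will do
    .splitText(by: 2)                  // each element = a chunk of two lines
    .subscribe {
        print it
        print '--- end of chunk ---\n'
    }
```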
You can do many, many things with subscribe and with closures. The nice thing about subscribe is that the closure can have many lines of whatever you want to do, whereas inside view it's a bit more limited, or not so readable, let's say. Here we're using a closure with view, as we did before, but instead of printing just a label and the value, we keep incrementing a count. It starts at zero and becomes one for the first line, two for the second, and so on; we didn't pass any by parameter here, so it's one line per element. For each line we take the count, make the text uppercase, and trim it. If we run this, we get our count variable incremented on every iteration, a colon, and then the value, uppercased and trimmed.

You can also use the file function. It's not really a channel factory, but you can create a file object in the Nextflow language by providing a path, and you can still use splitText and everything on it. Having a file like this is not very common, but one thing that is probably a lot more common is files with values separated by some symbol. CSV, which is comma-separated values, is a very common text data format. Here we can just call the splitCsv operator, okay? We get a channel whose elements are the rows with their fields, and we can consume them based on the position of the columns: for each row, say, column zero and column three. If we open this patients CSV, we see the header line, with patient_id and the other fields, and the rows below. If we run it, we get every row with the first column and the fourth column: the patient ID and the number of samples.

But addressing a column by position is not very nice. If we have a header with names for the columns, I may want to use the names. That's what we do with this code: we pass the option header: true, and then instead of the numbers we use the names of the columns, patient_id and num_samples, the same two columns we were just getting by position. So now we see, again, all the rows, but with the patient ID column and the number of samples column selected by name. We can also supply a header ourselves if the file has none. There are many things we can do, and again, the documentation is very extensive; you can check it for many more.

Another thing we may want is to work with tab-separated files. Here we were just rewriting yesterday's code to use the things we've learned. We still use the splitCsv function, but the difference is that we pass this option, sep, for separator, and instead of a comma we give it backslash t, which is a tab. Whenever we have tab-separated values, we can use splitCsv with this option to read them. Let's look at the regions file: you see it's not a comma separating the values anymore, and it's not a space either, it's a tab. By providing the backslash t here, we can run this, and we get every row with everything. In the file they're tab-separated, but internally Nextflow handles them just like comma-separated values, because it's the same splitCsv function; we've simply told it which separator is being used.
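Sketched out, with the file names assumed from the training data:

```
// comma-separated, addressing columns by header name
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv(header: true)
    .view { row -> "${row.patient_id}: ${row.num_samples}" }

// tab-separated: same operator, different separator
Channel
    .fromPath('data/meta/regions.tsv')
    .splitCsv(header: true, sep: '\t')
    .view()
```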
If it's yet another separator, we can provide that here too, and in the end Nextflow treats it internally as a CSV. Again, we can pass header: true to refer to columns by name. And there are many other formats we can use: JSON, YAML. I'll show the example with JSON. We have this regions.json, another way of organizing your data in a text file, with brackets and curly braces and everything. We can use the JsonSlurper class with its parse method to parse the JSON file, and then use a loop: for every record, we get the patient ID or the feature, which are the fields we have here: patient_id, region_id, feature, pass_flag. By consuming the parsed entries, we get the values. You can run this and see it working: the IDs and features, just the way we asked. I don't want to lose time here with YAML, because it's just a different format and very similar: you have a parser class with a parse or load method, and the same kind of loop over the entries to get the fields.

This part of the material also teaches you how to create a module with these parsers and then include it, to use it in your pipeline script without having to write the parsing code inline. I'll skip that, because we'll talk about it when we get to modules later today.

So now let's go to processes, the second section we're doing today. We've finished channels; after processes I think we'll probably have a break, maybe before operators or during operators, and then come back with modularization, configuration and deployment scenarios.

Processes, again, are something we saw yesterday, like many things we'll see today, but the difference is that now we'll see them in more detail. Yesterday we basically saw processes that were very simple, with an input, an output and a script block, sometimes even less than that, like this example: a simple process with no input, no output, just a script block saying hello world. But a process block can contain much more. It has a name. It can have directives, which are optional; we saw some yesterday, like debug, publishDir, cpus and container, and they can go here or in the config file. We'll talk about that today, but for now just bear in mind that they can go here. It has the input block, with all the channels that are the input of the process; the output block, with all the channels that are the output of the process; and a when block, which tells Nextflow when a task of this process should be run.
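Putting those pieces together, the general shape of a process is something like this; every block except the script is optional, and the names here are made up:

```
process EXAMPLE {
    debug true                   // directives come first

    input:
    val sample_id

    output:
    stdout

    when:
    sample_id != 'skip_me'       // the task only runs when this is true

    script:
    """
    echo Processing $sample_id
    """
}
```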
For example, one very interesting thing about Nextflow is how dynamic it can be. Say you have samples, and whenever a sample doesn't reach a certain level of quality, you don't want its sequences to be aligned to a reference genome. You put all your samples in a channel and call the ALIGN process, and every sample, every element of the channel that arrives at this process, gets checked against something, maybe the output of another process that measured the quality of each of these sequences. If the quality is okay, the process runs a task the way it should, with the script block. If the quality is bad, that sample is simply skipped and we go to the next one, and the next. This is a simple example, but you can think of more elaborate things. The point is: whenever the when condition is false, the task will not be run for that element. It's a way of applying, or not applying, a process to the elements of a channel.

And then you have the script block, which is basically what the process is going to do; we've seen that a lot already. Here's an example with a lot of things happening. By default, if you don't say which language the code in your script block is in, Nextflow understands it's a shell script, and here it is indeed: we echo something to a file, we take the content of that file, the first line, the first five characters, we save that as chunk_1.txt, and we compress it. So we do a lot of things. There are situations in which your process really does do a lot, but usually you have to apply some common sense and maybe split it into more processes. The point is that you can have many lines of code in a single process.

If you want to use another language in the script block, you have to tell Nextflow which language it is. If it's Python, for example, you have to put the shebang; as I told you yesterday, a shebang just says which software should be used to interpret the code. So I say it's Python, and the script block runs as Python. The thing is, sometimes this gets too long. You won't write a whole piece of software in Python inside this block; it would be trouble to read, and you should leave the pipeline file just for the pipeline details. If it's configuration, you do it in nextflow.config; if it's scripts, you create script files, you put them in the bin folder, and then you can call them the way we were calling Salmon yesterday: we just write the script name and the arguments.

Parameters we also saw yesterday: we write params. and the name, and then, by providing two dashes and that name when calling the pipeline, we can give it a different value. I'll run this again, just in case someone forgot or feels a bit lost. If you don't say anything, it uses the default value that's provided, which is world, so it prints hello world. To actually see the output, either we put debug true in the process, or we run with -process.echo; or, if we had an output channel, we could use view to see its content. I can also run the pipeline again, but this time providing the pipeline parameter with two dashes, and instead of world I say Bob. The pipeline runs, overrides the default value, world, with Bob, and prints hello Bob on the screen.
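A minimal sketch of that parameter demo; I'm calling the parameter greeting here, which is my own choice of name:

```
params.greeting = 'world'  // default; override on the command line with --greeting

process SAYHELLO {
    debug true  // so the echoed line shows up in the console

    """
    echo 'Hello ${params.greeting}!'
    """
}

workflow {
    SAYHELLO()
}
```

Running nextflow run hello.nf --greeting Bob would then print "Hello Bob!" instead of the default.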
There are many things we can do. Now, one thing I want to show you that is not so evident, but I think is important to know. Let's run this. You won't see anything printed, but something did run: we have the task hash here. So let's see what's inside the work directory for that hash; I want to see the command.sh file. This is the file Nextflow creates to be run on the local machine, in the cloud, on your HPC cluster, wherever your compute environment is. Nextflow creates it and gets it executed, submitting it as a job to the HPC cluster or whatever else. It's useful to know that the working directory will be the one where the job actually runs. If we didn't have the escaping here with the backslash, something different would happen. Let's see. Again, nothing is printed, because we're not using -process.echo or debug, but I want to see command.sh again. You see that now the variable is gone: because we didn't escape it, Nextflow interpreted it and replaced it. PWD stands for "print working directory", so it captured the directory where we launched the run. If that's what you want, you leave it like this; but if you want it evaluated where the script block is actually run, you have to escape it with the backslash. That's the difference.

There are many different things you can do. Probably in some script you've seen something like this: say we have an input, val x, and in the script we print the content of x with the dollar sign. That's a Nextflow variable. But maybe the script itself also sets a variable named x. So how do we differentiate between variables that belong to the code of our script block and variables that belong to the Nextflow script, the pipeline? Instead of script, we can use shell. By using shell, we can no longer access Nextflow variables with the dollar sign; we have to use the !{} format instead. This way Nextflow will know, and the interpreter will know, that the dollar-sign one is a variable of the code in the script block, and the !{} one is a variable of the pipeline. That's one use case for using shell instead of script.

Another interesting thing: usually we say you can only have one script block per process, and that's correct, because a process can only execute one script per task. But there's one thing you can do, which is to have conditional scripts. I like this example; I think it's very useful and close to a real-life scenario. Basically, we have a pipeline parameter, compress, whose default value is gzip, gzip being a tool to compress files. And a second parameter, the file to compress: a path to the file we want compressed. The process gets called with the file to compress, and then there are different script blocks inside it. If you don't provide the compress parameter, it keeps the default, gzip, and in that case this is the script that runs: it compresses the file using gzip. However, if you provide the compress parameter and it is bzip2, a different script block runs: the same idea, compressing a file, but using bzip2 to do it.
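Reconstructed roughly from the training material (the parameter names and the error message are from memory, so take this as a sketch):

```
params.compress = 'gzip'
params.file2compress = "$projectDir/data/some_file.txt"  // illustrative path

process FOO {
    input:
    path file

    script:
    if (params.compress == 'gzip')
        """
        gzip -c $file > ${file}.gz
        """
    else if (params.compress == 'bzip2')
        """
        bzip2 -c $file > ${file}.bz2
        """
    else
        throw new IllegalArgumentException("Unknown compressor: ${params.compress}")
}

workflow {
    FOO(params.file2compress)
}
```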
And if the method you provide is neither gzip nor bzip2, it throws an error. So let's test it now. I won't provide a file to compress, because I'll use the default one, but I'll ask for a different method: instead of gzip, I want to compress with bzip2. This works, and we can even check: because we didn't set publishDir, we didn't choose a place for this file to go, but we can just go to the work directory and see that it's compressed there with bzip2. It's an intermediate file; it's not published anywhere, it just lives in the work directory we're pointed to here. But now let's say I don't pass bzip2, I pass Marcel. Because Marcel is neither gzip nor bzip2, we fall into the else condition and throw an IllegalArgumentException: it says there's an unknown compressor, and the run ends in an error.

Now that we've more or less understood the whole structure of the process, what we can do (oh, I should have closed that) is try to understand more about the inputs. So let's talk about the input block now. There's a figure here that I like a lot: you have a channel, you have every element of the channel, and more elements can keep arriving. In every example we've shown so far, the channel was ready: when you call the process, the channel is there. You create a channel with ten elements, or with all the files in a folder, and then you call the process. That's not always the case. There is, for example, something called watchPath: you can use this factory to watch a folder. The channel exists and the process is waiting, and whenever a new file appears in the folder, matching some glob you define, it's added to the channel and then delivered to the process, and more may be arriving. There are other ways to do this too. What I want to say is that sometimes your channel is being created or filled while your pipeline is running. And whenever a process has that channel as input, as soon as there's an element in the input, even if the channel isn't fully filled yet, the process runs a task with that element. That's how input parallelization works: a task can run for every element as it arrives. And again, this is requested from the operating system, or the HPC cluster with its job scheduler, or the cloud provider's service: Nextflow may ask for a thousand tasks, and maybe not all of them will take place at once. You can also control the rhythm, the rate at which tasks are issued. The point is that tasks get issued, output channels get filled, the next processes receive those elements, and so on.

We've seen the input block many times already. We have an input qualifier (val, path, tuple and so on) and the input name, which is just a name you give to the input so you can reference it later. For a value, we have val x, for example, and then we can use it as $x and see what's going on. And I see a question from Omar about the spaces between the parentheses and the values in the channels: it's just style. Some people use spaces, some people put everything together.
It's just a matter of taste. Okay, the error here is because there's no workflow block: it complains that this style was for DSL1, and now we have DSL2, so we do have to have a workflow block. If you got this error this morning, that's why; I had forgotten to paste the workflow block. Now we have 1, 2, 3, the elements of this channel. Again, the order is not guaranteed: the tasks are launched in order, but due to many things in the operating system, some tasks finish sooner and get printed to the standard output first.

Here we have a path input. Because it's between quotes, it's not a variable or anything; it's specifically this path that we're going to use, and we reference it here. We can use a path in the output again. I'm skipping some parts that are just repetitive; and again, you can ask questions on Slack, so we can focus here on the things that are different from what we've already seen. Here I take the input path, and the output is the top lines of that file, saved as a file whose name says how many lines we kept.

We can also combine input channels, in the sense that a process can have more than one input. Here, for example, this FOO process takes two channels as input, a value x and a value y, and we see FOO being called with channel one and channel two. We already worked through examples like that today. So a process can have many inputs, and it can also have many outputs, which is the next section. And we already saw what happens with queue channels of different sizes: the poison pill stops it there; and then we have the version with a value channel.

Then we have input repeaters, which are a very interesting thing. I don't really like the example in the material, so I think there's a better one. Let's say you have some samples and your transcriptome reference file, just like we did before with Salmon to create the index. But Salmon isn't the only software that does that; there's another one called Kallisto. Say I want my process to create the index both with Kallisto and with Salmon. Basically, I'm going to call the process with the channel that has the files, the reads... okay, one second, I got lost here; this is confusing, so let's come back to this example specifically. We have each, okay, it's here; yes, each, that's what I was looking for. So here we give the reads, the transcriptome file and the methods; the process receives the methods and calls them mode, which is basically salmon or kallisto. By using the each qualifier, what we're saying is: run everything, but once for each mode. So we take all the reads with the transcriptome file and use salmon as the mode, and at the same time, not waiting for those samples to be over, tasks are issued to run everything again with the kallisto mode. So this one call runs the command, creating indexes and quantifying all the reads, both with Salmon and with Kallisto. It's nice to run this example, because you see it really isn't waiting for anything: it doesn't wait for one mode to end before starting the other.
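The shape of that example, roughly; I'm just echoing instead of really running the tools:

```
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
methods  = ['salmon', 'kallisto']

process QUANT {
    debug true

    input:
    tuple val(sample_id), path(reads)
    each mode                          // repeater: every sample runs once per mode

    script:
    """
    echo Processing $sample_id with $mode
    """
}

workflow {
    QUANT(reads_ch, methods)
}
```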
Actually, the tasks are just like any other Nextflow tasks being run. Here we have three modes: regular, espresso and psycho coffee. And you see the output interleaved: sometimes psycho coffee, regular, espresso, and then regular, regular, espresso, espresso. Tasks just get issued, the operating system runs and schedules them, and maybe one ends sooner than another. And you see we have all the samples: here, beautifully at the end, the same sample with the three modes, though sometimes it's a bit more messy. This is what we call an input repeater, and the magic happens because of the each qualifier.

For outputs, as we already saw, we have an output qualifier and an output name that we use in the script block, but we can also use emit if we want to give the output a better, more readable name. Here we have a simple case: an output that is a val. Whatever happens in this script is captured into x, which becomes the output channel; we save it to a channel, and we view it using a closure that prints "Received:" and the content, the same thing as we saw before. It's going to print prot, dna and rna. We can also have a path here: we write the path between quotes, a file named result.txt. It's not a variable that we create; this really is the name of the file. And just like we can have multiple input files, we can also have multiple output files. Here's the example Evan showed you on the first day, where a process produced different chunk files with parts of "Hello world": many files being created. So we have one qualifier and one glob, chunk with an asterisk, because it will be chunk followed by a suffix, which is the way the split tool names things. Then we can view it here.

We can also have dynamic output file names that depend on something else. Here we use x, which is the species. Let me see: when we call the ALIGN process, we give the species and the sequences; so x is the species: cat, dog, sloth. By using this variable in the output path, the name of the output file of each task becomes the species name with the .aln extension. So you can have dynamically named output files too.

Another thing you can have is composite inputs and outputs: tuples, basically. We already saw that when we were working with the file pairs, we had the ID, like gut, and the two samples, which were paths. That's what we have here: a tuple of a value and paths. The output is a tuple of the sample and a path, in this case the BAM files, as if we had aligned our raw sequences with some aligner and gotten BAM files out.

The when block we've already discussed a bit. Here we say: fasta is an input path, so we can use .name to get its file name as a string, and we require it to match this regular expression. Don't worry too much about the regex; you could just check a name or an extension, but in the example they decided to use a regular expression. And type, another input channel of this process, has to be nr. If those don't hold, the process doesn't run for that element; it's just ignored.
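Going back to the tuple pattern for a second, here's a sketch of tuple in, tuple out; touch stands in for a real aligner:

```
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process ALIGN {
    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path('*.bam')

    // placeholder for a real alignment command
    """
    touch ${sample_id}.bam
    """
}

workflow {
    ALIGN(reads_ch).view()
}
```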
Then we have directives, which go at the very beginning of the process. We saw a few of them yesterday, like cpus, container, publishDir. One thing to pay attention to, and soon you'll understand why: a directive is just the name, a space, and the value. When you use directives in the config file it's a bit different: there you use the equals sign. Be careful not to confuse the two; it's a very common mistake. So here, just a space and the value: we're telling the system that when we run a task of this process, we want to request two cores and one gigabyte of RAM, and that the container will be this one.

There are many different directives; here are just a few of them, and there are many more you can see in the official documentation. You have cpus; time, the maximum you want to allow for a task of that process, so if a task takes longer than, say, one hour, it gets terminated, because presumably something went wrong; the amount of memory; the amount of disk; and tag, as you saw before: if we have tasks for many different samples, you can have a tag saying this run is for gut, this run is for liver, this run is for lung. We saw that yesterday. publishDir we also saw: we use it to say where we want the output files, the results of our pipeline, to be stored. You can give it a mode, saying whether you want a copy or a link; you can use saveAs to store a file under some other name that you want; you can use variables for the subdirectories; and you can use patterns. Say a process creates a lot of files in the end, but you only want the output files ending in .fq, or with some suffix, or starting with something. You don't have to publish every output file of a process; you can take the final process in your pipeline and publish just part of the output files it generates, with the pattern option. And as I said, saveAs lets you save under a different name.

We've now been talking for an hour, so I think it's enough that we have a break now. We'll take 15 minutes. Again, get some water, stretch a bit, feel free to go to Slack and ask questions; I'll also be there to reply, and to hear any feedback you have on this part of the course. After the break we'll start operators, then modules, Nextflow config and a bit of deployment scenarios. Thank you for your attention, and see you in 15 minutes.

Okay, let's come back. We've finished processes; now we go to operators. We have about one hour and some minutes to cover operators, Nextflow configuration and deployment scenarios. Let me close this, okay. So for operators: if you go to the official documentation, there's a very long list of operators, which are basically functions that apply to channels. At some point I was looking at the questions asked in the Slack channel, and someone asked what the difference is between a Groovy variable, a plain Nextflow variable, and a value channel, and Rob gave a very good answer. You need channels to use these operators: these Nextflow operators are specific to channels, and you need channels to work with them. If you just had a Nextflow variable, a Groovy variable, the operators wouldn't apply, or they would behave differently.
So that's why we have these operators: they're not just functions, they're functions that work on channels. Here are some of the ones we'll see today: filtering operators, transforming operators, splitting operators, combining operators, forking operators, maths operators and many more.

One very basic example; I like this figure. You have a channel called nums, defined with the values shown, so in Nextflow we have nums = Channel.of(1, 2, 3, 4). Then you create a new channel, squares, which is basically the nums channel with the map operator applied to it, plus a closure saying what you want to do with every element of the nums channel: multiply it by itself.

Marcel, are you sharing your screen? Oh my God, I'm so sorry, I forgot this tiny detail. Okay, fortunately I hadn't shown much yet. Thanks, Rob. So here we have the operator I was talking about: the basic example with the nums channel, defined here in text, and the map operator, which takes every element and multiplies it by itself.

There was also another question I saw on Slack, people asking: what is this "it ->" thing? If you're happy to call the element of the channel it, you don't need it; you can just write, between curly braces, it times it. The "it ->" is the explicit definition, and it's useful if you want to call the element something else. Let's do it here so you can see. This is the explicit definition, and because it is the default name, I can even remove that first part and everything works just the same. But say I want to call the element element instead of it. If I just use element in the body, Nextflow complains: what's element? There's no such variable. it is the special, implicit name it knows. What we do then is the explicit version: we say "element ->", meaning I want to call every element element, and then I multiply each of them. And by doing that, it works again. So if you're going to use it, you don't need the first part; if you want some other name, you do. And it doesn't hurt to write the explicit version anyway; it's easier to read: every element will be called it, and I want it multiplied by it again. Then I use the view operator to see the content of the channel. Because nums is a channel and I'm applying an operator, the output is also a channel; it doesn't work like this for all the operators, but for most of them it does. So that's a very basic example of using operators.
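Both forms side by side:

```
nums = Channel.of(1, 2, 3, 4)

// implicit closure parameter: each element is called `it`
nums.map { it * it }.view()

// explicit, equivalent version with a named parameter
nums.map { element -> element * element }.view()
```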
Let's talk about the basic ones. view we've seen a thousand times already: it basically shows us what's in a channel. And again, why don't we just print it? Because, as we saw, the content of a variable that is a channel is not for us to read directly; it's a dataflow variable for Nextflow to handle. If you just println it, you see the DataflowBroadcast representation and so on, but with the view operator we see every element. And if we run this with -dsl1, we can even see the underlying values, though it's not very readable: we see value 1, value 2, value 3, then the poison pill, which view ignores. What the view operator does is pull the information out of those value wrappers. So if you want to see the content of a channel, remember: the operator you want is view.

Just like with any other operator, we can use closures with view. We saw a slightly different example before; now we have something slightly different: every element is printed as a string preceded by a dash. map we've seen already too: it does something to every element. Here, with map, we call it.reverse() to reverse the order of the characters in the string. There are two elements, so there are two emissions: hello reversed and world reversed. There are many other things you can do. Here, for example, we create a collection in which the first part is the word itself and the second is its size. This is another nice example, because we can just delete the named parameter and write it and it.size() everywhere. And why don't we need the explicit "word ->" anymore? Because we're using it. If we kept word, we'd need that first part. So: hello contains five letters, world contains five letters. We can apply the size() method because each element of the channel is, in this case, a string.

Here's a very nice example. We have fromPath, the channel factory, getting a lot of .fq files. We map each of these files: instead of .size(), we use .name to get the file name, together with the path. And then, even though the channel now holds that pair of information, we view its content formatted, because we know it's a name and a file: the file name, a colon separating the values, and the path. So we get the file name and the file path.

One nice operator is mix. It's very straightforward: it just combines channels, in the simplest sense. Here we have one channel with 1, 2 and 3, one channel with a and b, and one channel with c, and we can mix however many channels we want. I'll do it slightly differently: I'll make the first one a single element with brackets. By doing that, we get a channel with fewer elements, one per line, which is what view shows, because the bracketed values count as a single element. If I remove the brackets, then we have six elements again. So there's no secret to mix. It's very simple: you have many channels, you want one channel with all the elements of those channels, you use mix. Simple, straightforward, the first definition of "combining" that comes to mind.
We've already seen flatten too. In this case we could have two elements here, foo and bar: we create a channel that has two elements, one two three and four five six. But because we use flatten, it becomes six elements. If we didn't have the flatten, we would have two elements, and we'd see that as two rows, since view prints one element per row. But with flatten, every value becomes its own element in the new channel, and that's why here we have six lines, six elements.

As we've been saying, all these operators take channels and give back channels of elements; collect is a bit different. collect takes a channel and turns it into a list, a collection of values, so in a sense it's not really a stream of elements anymore. Here we have four elements, and after using collect we have a single element. You can see that because a single row, one line of view output, comes out.

groupTuple is very interesting. If you have tuples with a key, you can group them. Here you have one channel with seven elements, tuples whose first element is a key, the ID, and you can group them into tuples based on that ID. It's not just merging them: it really groups them as tuples by ID. For ID 1 we have a tuple with 1 as the first element and, as the second, the collection of all the second elements that went with 1: here A, here B, and here C. For 2 we have C here and A here; for 3, B here and D here. So that's what groupTuple does: it takes the first element of the tuple, the ID, looks for all the other tuples with the same ID, and gathers their second elements together.

join is a bit similar, but where groupTuple works on a single channel, join takes two: a left channel and a right channel, joined on the key. Let's see what happens. We have x with 1 in one channel and x with 4 in the other, so we get x, 1, 4 in the output. For z: z has 6, and z has 3 up at the top, so z gives 3 and 6. And y is 2 and 5, so y, 2, 5. Sweet. So if you want to group by a key within a single channel, it's groupTuple; if you have more than one channel, you use join.
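Minimal sketches of both, with values echoing the figures in the material:

```
// groupTuple: one channel, grouped by the first element of each tuple
Channel
    .of([1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'])
    .groupTuple()
    .view()  // [1, [A, B, C]], [2, [C, A]], [3, [B, D]]

// join: two channels, matched on the key
left  = Channel.of(['X', 1], ['Y', 2], ['Z', 3])
right = Channel.of(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()  // [X, 1, 4], [Y, 2, 5], [Z, 3, 6]
```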
Another very interesting operator is branch. The idea of branch is that the way your workflow behaves can be dynamic, conditional, just like the when directive we saw before that decides whether a whole process runs at all. But here it is not about a process: within the flow of data, we want elements to go one way or the other, or even more ways than that. So here we have, for example, a channel with the values 1, 2, 3, 40 and 50, five elements, and we use the branch operator to say: for every element of this channel, if it is smaller than 10, I call it small, one branch of my reasoning, let's say; if it is larger than 10, I call it large. I create the channel, apply the branch operator, and store the result in a variable called result. Then, using the dot and the name that I gave, small or large, I can apply an operator or pass that subset of the channel to a process. So here, when I use the view operator, it is only applied to the elements of the small category (let's think of them as categories, I think that's easier to understand), and separately to the ones in the large category. And again we are using a closure here to add some text: not just `it`, but "it is small" or "it is large" as a string. Let's see what happens: 1 is small, 40 is large, 2 is small, 50 is large and 3 is small. So this is the branch operator; a sketch of it follows below. As I said, there are many more operators that you can look up in the operators section of the official documentation at docs.nextflow.io. There is also a section about Groovy in the training material, which we decided to skip because of time limitations, okay? But there are many things you can learn there, like how to create closures, and many other things.
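A sketch of that branch example, reconstructed from the description above:

```nextflow
workflow {
    result = Channel.of(1, 2, 3, 40, 50)
        .branch {
            small: it < 10      // elements smaller than 10
            large: it > 10      // elements larger than 10
        }

    // each branch is addressed by the name you gave it
    result.small.view { "$it is small" }
    result.large.view { "$it is large" }
}
```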
Let's go to the next section, which is modularization. So modules, and modularization in general, were brought by DSL2, the latest version of the Nextflow language. I'm not sure it was clear at the beginning, but Nextflow is two things: it is a language, a domain-specific language for writing pipelines, and it is also the engine, the software that runs those pipelines, okay? Nextflow as a language sits on top of Groovy; it is a superset of the Groovy language, so it is very, very powerful. It is not just a dry description language for pipelines: it is a real programming language, used to describe the pipeline but also to do a lot of things around it. The idea with modularization is that you don't have to keep one huge script file with everything in it. We already decreased the size a bit by moving a few things, like configuration, to a different file, but a pipeline can have many processes, and it is not only about making the file smaller and easier to read and understand: other people may need to use part of your pipeline, and it would be nice for them to import that part easily. So let's go back to an example we saw at the beginning, where we talked about parsing a JSON file in your pipeline, and also a YAML one. You don't want that code in your pipeline script, because you want the script to focus on what your pipeline does, not on how part of it parses a particular file format. What you can do is create a file called parsers.nf, for example, and put in this Nextflow file all the code for parsing YAML, JSON or whatever you want, or some very specific helpers that someone might want to add to their own pipelines. With that in a file, you just use the include expression, naming the process or function you created and the file it comes from, and automatically you can call that process or function in your code. So here we have Channel.fromPath, then a flatMap where every element is passed to this parsing function, and then we do what we did before, getting the entries of the JSON file by field name. That was just one example of how you can use modularization, okay? So, like the first example we had at the beginning, with the split letters and convert to upper processes that Evan was teaching: we can save them in a different file. And then, in the pipeline file that you are writing, you can just include them and use the processes without having their definitions in your current file. Here is one example of what the final version would look like: we don't have any process in this workflow file, all the processes are imported, and you just have your workflow block; and in the modules.nf file you are importing from, you have just the processes, in this case split letters and convert to upper. You also don't need two separate include lines, one for split letters and one for convert to upper from the same file: you can import all these processes from that file in a single include. And even though a process has a certain name in the module, you don't have to use the same name: you can use an alias. So you import split letters as some other name, in your own language for example, or another name that you think is easier to understand, and then you use that. Sometimes, and this question also came up, I think this morning, or maybe earlier in this training, I'm not sure, someone wanted to run a process in two slightly different ways. What you can do is include split letters and call it split letters one, and include it again as split letters two. For example, it could be a mode where you want to run once with Kallisto and once with Salmon, and you could have a branch so that, depending on the quality of the sample, or the size of the reference, and so on, things go one way or the other. So this is how you create new names for a process that you are importing from a module. Then you have your workflow block calling all these guys, and as you can see, there is no process block in this script file: all the processes are imported from the modules. The include variants look roughly like the sketch below.
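As a sketch, assuming a module file called modules.nf that defines the processes splitLetters and convertToUpper (the alias names here are illustrative):

```nextflow
// main.nf -- no process definitions here, everything is imported

// two processes from one module file, in a single include
include { splitLetters; convertToUpper } from './modules.nf'

// the same process under an alias, e.g. a name in your own language
include { splitLetters as dividirLetras } from './modules.nf'

// the same process twice, under two names, to call it in two different ways
include { splitLetters as splitLettersOne; splitLetters as splitLettersTwo } from './modules.nf'

workflow {
    greeting_ch = Channel.of('Hello world!')
    letters_ch  = splitLetters(greeting_ch)
    results_ch  = convertToUpper(letters_ch.flatten())
    results_ch.view { it.trim() }
}
```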
For the output definitions, there are many ways to get the output of a process. So far what we have been doing is this: we create a channel and save it to a variable; then we call a process with that channel and save the result to another channel variable; then we call another process with that variable and save the result to yet another channel variable; and then we use view to see what is inside. I think that is very clear, easy to read and understand, but it is very verbose, if you know what I mean. What we can do instead is use .out to get the output. I don't have to save the result of every process I call to a variable: I can just use .out. So here, for split letters for example, instead of saving to letters_ch we could drop that variable and just refer to the process followed by .out; I call the process, and .out automatically gives me its output. And when I call another process afterwards, again I can just use .out to get that output too. It works like that: you write less code, it is simpler, but again it is a matter of taste whether you like to be more explicit or not. Then you might say: okay, depending on my process I can have one output, but sometimes I have many outputs. With .out, how can I choose which output I want to view, or apply an operator to, or give as input to another process, and so on? Because you could have a process A with two outputs, where one output is the input for process B and the other output of process A is the input for process C. In that case you can use brackets, .out[0], .out[1] and so on, to say which of the outputs you want to work with. Another thing you can do is give an output a name. Here, convert to upper writes to the standard output; it prints to your screen if you have debug true or run with -process.echo. stdout is a reserved keyword, but you can use emit to give the output a nice name. Then, for convert to upper, I don't want .out[0] or .out[1], I don't want to use positional numbers: I want the name of the output. By using emit you get .out.upper, which is much easier to read and understand. Another thing: instead of all these parentheses inside parentheses inside parentheses, we could write one very long line of code with all the calls combined, or make it simpler by splitting it across lines; but another option is to use pipes. I create a channel, it gets passed to split letters, that gets passed to the flatten operator, that gets passed to convert to upper, and that gets passed to the view operator. You can do that on one line, but it gets a bit long, so usually people write it with one stage per line, which is easier to read and keeps the lines short. Below is a sketch of the emit naming and the pipe style.
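A sketch assuming the splitLetters and convertToUpper processes from before; the emit name upper is illustrative:

```nextflow
process convertToUpper {
    input:
    path letter

    output:
    stdout emit: upper      // name the stdout output instead of using .out[0]

    script:
    """
    cat $letter | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    // explicit style, using the named output
    splitLetters(Channel.of('Hello world!'))
    convertToUpper(splitLetters.out.flatten())
    convertToUpper.out.upper.view { it.trim() }

    // pipe style (equivalent; shown as a comment because a process
    // can only be invoked once per workflow):
    // Channel.of('Hello world!') | splitLetters | flatten | convertToUpper | view
}
```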
Another thing you may not know is that you can have different workflow blocks in your pipeline script. You can have only one unnamed workflow, which is always the one called by default, but you can also have named workflows, subworkflows. Here we created one called my pipeline. You could have many of these subworkflows, with your implicit workflow, the unnamed one, calling all of them. And the nice thing is that with nextflow run you can use the -entry option (again, one dash, so it is a Nextflow option) to choose which named workflow you want to run. Sometimes you may want to run one or the other, a longer version, a shorter version, and so on. For inputs: because you have subworkflows and you are calling them from your unnamed workflow, you may want to pass them an argument, like a channel. If you do that, you have to add a take block to your named workflow. So here we are saying that it takes an input called greeting, and then in your unnamed workflow you pass something in when you call it. Whenever you use take, you also have to have an explicit main block saying which processes are called, what is done in the subworkflow. Here we have everything together: the subworkflow emits an output which I name my_data, and that is why, in the unnamed workflow that calls it, I can use .out.my_data. This is what I said a while ago about -entry: you can have a subworkflow my pipeline one, a subworkflow my pipeline two, and a workflow calling everything, but maybe today you just want to run one of them, and then you use the -entry option of Nextflow, like here. Then there is the params scope when you import. Say this is your modules.nf and this is your main script, and you are including the process say hello. In your main script file you have "Olá mundo" as the default value of the parameter, while in the modules file the default is "Hello world". Sometimes you want the included process to use yet another value, not the default from your script. What you can do is use addParams: I include say hello from modules.nf with addParams, setting the parameter to the value I want, "Olá" for example, maybe because I want to keep "Olá mundo" for something else in my script. A sketch of a named subworkflow with take and emit, and of addParams, follows below.
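A minimal sketch, assuming the splitLetters and convertToUpper processes are available as before; the names my_pipeline and my_data follow the description above, and the addParams line with sayHello and greeting is illustrative:

```nextflow
// override a parameter default just for this import (illustrative names):
// include { sayHello } from './modules.nf' addParams(greeting: 'Olá')

// a named subworkflow with explicit take / main / emit blocks
workflow my_pipeline {
    take:
    greeting                 // an input channel provided by the caller

    main:
    splitLetters(greeting)
    convertToUpper(splitLetters.out.flatten())

    emit:
    my_data = convertToUpper.out
}

// the unnamed workflow runs by default and calls the subworkflow
workflow {
    my_pipeline(Channel.of('Hello world!'))
    my_pipeline.out.my_data.view()
}
```

To run only a named workflow, you would use something like `nextflow run main.nf -entry my_pipeline`.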
So now let's go to the next subsection, not the last one, but almost: Nextflow configuration. One second to have some water. So we have already seen that it is ideal to keep the configuration of your pipeline in a configuration file, so that your pipeline script contains only the things that are really about what is going to be run, and how. Usually we have the nextflow.config file next to the pipeline, but there is also .nextflow/config in your home directory. As I said earlier, inside a process you write the directive name, like container, a space, and then the name of the container; but in the configuration file you have to use an equals sign. And you actually saw that already, because yesterday, when we worked with containers, we set the container in the config, and because container is a directive related to processes, it needs the prefix process., as in process.container. That line says that every process in this pipeline, in this folder, will use this container image. So you can create config settings like that, always using the equals sign, not only a space. You can write comments in your configuration file too: a single-line comment is two slashes, otherwise it is slash-star and star-slash for a block comment. Configuration also has scopes. I think someone asked about that on Slack, and I think Ben replied with a link showing all the priorities, whose scope comes first and so on. We already saw the params scope. And there is something interesting here: you can set environment variables in your config file with the env scope, env dot and the name, instead of setting them in the process itself. You can also load an extra config file with -c. Actually, if you run nextflow -h you will see there are two options, an uppercase C and a lowercase c. With the lowercase -c, the file you pass is added to the configuration that is already there; it is not a fresh configuration, a lot of things are already set and you want to add more. If you use the uppercase -C, you override the defaults. So be careful between the uppercase C and the lowercase c, because they are different. We also saw that we can set directives like cpus, memory, container and time in the process itself, but also in the configuration file. So in the config I could have process.container = some image, or process.cpus = 2, meaning all the processes in this pipeline will run with two CPUs. And instead of typing process. a lot of times, process.cpus, process.memory and so on, I can use the process scope with curly braces and put everything inside: the cpus, the memory in gigabytes, a container image. It is much easier to have that one structure for all the processes. But then the one thing you may think is: okay, I don't want all my processes to be configured the same way. For that you can use process selectors, which we will get to right after the sketch below.
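Putting those pieces together, a minimal nextflow.config could look like this (the image name and values are illustrative):

```nextflow
// nextflow.config -- single-line comments use two slashes
/* block comments use slash-star ... star-slash */

params.outdir = 'results'            // params scope: pipeline parameters

env.MY_TOOL_OPTS = '--quiet'         // env scope: environment variables for tasks

process {                            // process scope: directives for every process
    cpus      = 2
    memory    = '2 GB'
    container = 'nextflow/rnaseq-nf'
}

docker.enabled = true                // enable Docker so the container is used
```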
You can also use variables like task.cpus to set things dynamically. You could set the memory to a fixed value, say always one gigabyte, but that doesn't always make sense, because if other things change, you may also need to change the memory. Instead you can say: I want four gigabytes of memory times the number of CPUs the task requests. So if cpus is one, it is four gigabytes; if it is two, it is eight. You can also use the attempt number. I'm not sure there is an example of that here, but there is in the official documentation: basically you can say, run this task with one gigabyte of memory, and if it fails, retry with two gigabytes, then three, and so on. Then there are process selectors, for example by name. This is a very nice example: you have the process block in the config file saying that all your processes will request 10 gigabytes of memory, 30 minutes and four CPUs, except processes with the name foo, because for those I want two CPUs, 20 gigabytes of memory, and a queue (the queue directive I'll talk about soon, not now). And for the process with the name bar, a different CPU count and 32 gigabytes of RAM, and that's it. Another thing that may be a bit more interesting is to use withLabel instead of withName, which selects by the name of the process. That is more interesting because in your process definition you can use the label directive and say, for example, long or short, and you can give the same label to many processes. So I may have 20 processes, and 18 of them are quick, so I give them the label short; the remaining few are very long, so they get the label long. Then in the config: I always want 10 gigabytes of memory, 30 minutes and four CPUs, but when the label is short I configure things one way, and when the label is long I configure them another way. So this is another way to configure processes in your config file, and there are many other ways besides. Let's go back to where we were. Another thing you can do is set the number of times you want your tasks to be retried. You have your process definition, then a channel with some elements; a new element comes in, your process creates a task and runs it; for some reason it breaks, there is an error or something, and Nextflow can automatically retry with the same sample. You can set how many times you are willing to retry with that sample; here it is three, so if it still breaks the third time, the pipeline will fail and stop, depending on your error strategy, but for now let's assume it stops. There is another interesting directive for processes, which we have already seen many times: container, with Docker enabled by default. Instead of just the namespace and the name of the container image, we can also add the hash ID of the image, to make sure we are always using that specific version; for reproducibility, this is recommended. You can do the same for Singularity, which is another container technology, and for many others it works the same way. For Conda, which we also used before, we can tell a process where the environment with the packages you installed is. A sketch of the selectors and dynamic resources follows below.
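A sketch of selectors, dynamic resources and retries in nextflow.config (the process name, label and numbers are illustrative):

```nextflow
process {
    // defaults for every process
    cpus   = 4
    memory = 10.GB
    time   = '30 min'

    // select a single process by its name
    withName: foo {
        cpus   = 2
        memory = 20.GB
        queue  = 'short'
    }

    // select every process that carries the label 'long'
    // (set in the process definition with: label 'long')
    withLabel: long {
        time = '12 h'

        // dynamic values: scale memory with the CPUs requested,
        // and grow it on each retry attempt
        memory        = { 4.GB * task.cpus * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 3
    }
}
```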
Okay, I'm going to do a ten-minute break. I think this config part was a bit intense, and probably there are some questions. A ten-minute break, and then we come back for the last part of today's session, which is deployment scenarios, okay? So, ten-minute break. Okay, so let's go back to where we were. Let me share my screen. So now we are going to talk about deployment scenarios, okay? The idea of deployment scenarios is that a lot of the time people will develop their pipelines on their local computers, their laptops, desktops and so on, but they don't always want to run them locally. Usually you will run in the cloud, or on a local cluster at your company or university, an HPC cluster, or maybe in some hybrid format; usually you want to deploy your pipeline to run somewhere else. Maybe the most common setup for universities, although this is changing, is an HPC cluster, probably with SLURM or PBS, something like that, as the job scheduler: you connect to your cluster, you submit the job, and whenever there are resources available, the cluster manages the execution of your job. Nextflow can run locally, as we have seen, but it can also work with all the main technologies; if you go to nextflow.io, there is a picture showing this somewhere. Nextflow works with the main cloud providers and also with HPC clusters: PBS, SLURM, the most common ones are supported. So you can just connect to the head node, the login node, run Nextflow, and without you having to do anything it will submit the jobs to the local job scheduler, like SLURM. However, that is not really good practice: ideally you would submit Nextflow itself as a job, like you would do with any other job, and then Nextflow, running as a job, will manage and submit all the other jobs. So you have the HPC cluster, the batch scheduler, SLURM or PBS for example, a shared file system for storage, and everything is done there. To tell Nextflow that you don't want it to run locally, on your machine or on the login node, but on the HPC cluster, you set the executor for all the processes to slurm, for example, so it knows how to work with SLURM, or PBS, or any other one, Kubernetes, or in the cloud AWS, Google Cloud, Microsoft Azure and so on. Then you also have to specify the resources you need for each job, just like we did for a local machine or have to do in the cloud. Sometimes you have queues, in HPC clusters and even in the cloud, depending on the provider, and again there are the CPUs, the memory, time, disk and so on, everything we have been talking about so far. As we saw in the config section, which was the previous one, we can set the memory for a process, the time, the CPUs, and now also the queue on the cluster and the executor we want to use. We also saw withName: all the processes with this name get these directives, and all the other processes get those. For the pipeline we built yesterday, we could say just that the quantification process should have two CPUs and five gigabytes of memory, and not say anything about the rest; by default cpus is one, and so on. You can use labels, as I showed before: for the short ones just four CPUs and 20 gigabytes, but for the long ones I want much more, okay? We can also have multiple containers, which is something I think someone asked about yesterday: for the process foo I want this container image, but for the process with the name bar I want that other image. So you can set which container each process uses, and having one container per process is in most cases the right thing to do; there is some explanation in the material about the pros and cons of fat containers versus simple ones. You can also have profiles, and this is very common in nf-core pipelines, where you have a test profile and so on. You could have, for example, a pipeline you wrote where, when you run it on your machine, you use the standard profile. With the standard one you don't even have to say it: you would just run nextflow run my_pipeline.nf and automatically the profile is this one, with the local executor and the reference genome at this path; I don't have to say where the reference genome is because it is set there, it is always there. But I can also say I want the cluster profile, and then it uses SGE, which is a batch scheduler, with the long queue, 10 gigabytes of memory, a Conda environment for the packages, and the reference genome files over here. If I say cloud, in this case I want AWS Batch, with Docker as the container technology (it could be Singularity or another one), and the genome reference file at an S3 path. Of course you also have to set your AWS access keys and so on, but this is how setting up the profiles would look. And then you can just say -profile cluster, or even combine profiles, like standard and cloud, where the later one overrides values from the earlier one; here standard basically just sets where the genome is. Queue, container, workDir, the AWS region, the CLI path: there are many things you can set in these profiles, like in the sketch below.
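A sketch of such profiles in nextflow.config; the queue name, paths, image and S3 bucket are illustrative:

```nextflow
profiles {
    standard {                           // used when no -profile is given
        process.executor = 'local'
        params.genome    = '/local/path/ref.fasta'
    }

    cluster {
        process.executor = 'sge'
        process.queue    = 'long'
        process.memory   = '10 GB'
        process.conda    = '/some/path/env.yml'
        params.genome    = '/data/shared/ref.fasta'
    }

    cloud {                              // a real AWS Batch setup also needs
        process.executor  = 'awsbatch'   // a Batch queue and AWS credentials
        process.container = 'nextflow/rnaseq-nf'
        docker.enabled    = true
        params.genome     = 's3://my-bucket/ref.fasta'
    }
}
```

You would then select one with `nextflow run main.nf -profile cluster`, or combine several, comma-separated, with later profiles overriding earlier values for the same setting.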
If you go to the Nextflow blog, nextflow.io and then Blog, you have many nice blog posts: how to deploy a Nextflow pipeline with Google Cloud Batch, how to use Nextflow on Windows or with SQL, many tips for using Nextflow on HPC clusters, a post about Azure Batch, and one about AWS Batch somewhere as well. Anyway, on the official blog you have many tutorials, with code and pictures and everything, telling you how to use Nextflow in the cloud, plus tips for using it on HPC clusters. You can also have volume mounts, like EBS volumes on AWS: in the aws scope of your configuration you can set the region, the volumes, custom job settings and so on. There are many, many things you can do with cloud providers. And the thing is, we support the main cloud providers, but everything is still a bit different between them, so even though something is supported, you should check the blog posts or the official documentation. I keep mentioning the official documentation: it is docs.nextflow.io. There you have everything for Amazon Cloud and Amazon S3 storage, everything for Azure Cloud, Google Cloud, things like Apache Ignite, Kubernetes, many, many things. It is a very long document covering a lot of cloud providers and also the other things we have been talking about: we mentioned projectDir at some point, but you will see there are many other workflow variables, objects you can consult, like when the run was launched and so on. So the official Nextflow documentation is very rich material, and you should definitely check it. The training material is very nice, but again, it couldn't be that long, so we had to focus on a few things. A lot of the interesting things you can do with the cloud you are actually going to see tomorrow: Evan is going to talk about Nextflow Tower, which is a platform that lets you run things on the HPC cluster from your own computer. Usually you have to connect to the HPC cluster of your institution and run your pipeline from there, but with Nextflow Tower, as you will see, it is much easier: from your computer you can run something that runs partially on your computer, partially on the HPC cluster, partially on Azure, partially on Google Cloud; you can really make hybrid deployments. Here, for example, whenever a task has the big-task label, it goes to AWS Batch, and otherwise it runs somewhere else; there are many different combinations you can think of. So I think that is it for today. If you have any questions, you can still ask in the Slack channel, even now, or tomorrow, during the training, after the training; feel free to ask all your questions, we will be delighted to help. I hope you enjoyed the second session of the training. Feedback is very welcome: if you have anything to share, something that you think could be done better, don't hesitate to get in touch. And it was great to have you all here. So have a great day, and keep learning Nextflow with the documentation and the training material, going over what we covered today. See you.