Okay, hi, everyone. Welcome back to day two of the Nextflow and nf-core training. Today we'll kick off pretty quickly. I'm getting a bit of feedback here. Grace, you're muted if you're talking. Yeah, sorry about that, just fixing all the noise and feedback issues. Hello again everyone, and sorry about that. We're going to jump straight back in.

Okay, so straight back into it with day two. Today's session will be another big one, so we have a lot of content to get through quite quickly. I wanted to start with a thank you to James, who has already corrected me once today for being on mute, and who will be answering a lot of the questions in the background on Slack today, so a special thank you to him. Also to Marcel, who is sticking around a little bit this morning and might take off around half time, just because it's quite late for him over in Brazil.

So today we're jumping straight back into the training material. You'll remember this from yesterday: the material you can access through the Seqera training site, training.nextflow.io, with the table of contents at the top. And you'll remember that we accessed this content through Gitpod, or most of us did. We have a few options for reviving yesterday's working environment. The first is to go to gitpod.io and open the dashboard. It's hooked up through your GitHub account, and you'll be able to find the workspaces you've brought up previously. That's one of the really cool things about Gitpod: you can go back to things you've done before. You can see here, this is what I was doing yesterday. It was given an arbitrary name and shows 30 changes, which was all the material I was playing with yesterday, relative to the initial base repo. This one is what I was playing around with last night, just four changes; I was testing out a few little bits of code. What I'm going to do today, just for proof of concept, is start a new session like this. It spins up a new working environment for me, opening a new workspace. And as you can see here, we've got this brand new environment, exactly the same as what we started off with yesterday. We've got all our files down the side here again, which we can open and close with this, and we've got the simple browser panel. It's asking whether I want to do some pull requests, which is a no. Here's the material; I'm going to close this because I've already got it sitting up here, but you're more than welcome to keep it open as well.

Before we start on today's content, I want to summarise some of the things we talked about yesterday and touch on Docker and containerisation again, just because I skimmed over that really, really quickly yesterday, and I don't think I did it justice in communicating the importance of Docker and how you can make these really nice images that can be used in Nextflow. Thinking about script7.nf, which is roughly where we got up to yesterday, we're going to go through and talk about some of those words I used in passing. Some of them are quite specific to Nextflow, and they're probably quite unfamiliar.
So we're going to work through the script really slowly, to make sure that we refamiliarise ourselves with some of those words and concepts, and then today we'll really dig into them in more detail, particularly operators, channels and processes.

When we think of a pipeline, there are obviously all those steps we talked about yesterday, but generally we start with data coming into the pipeline, we do some sort of manipulation, and then we have some sort of output. Thinking about this, and about how it might apply to your own data now or in the future: the first thing we do is read in the data and set some information that will be relevant for defining the outputs of the pipeline. That's what we're doing up here with params.reads, params.transcriptome_file, params.multiqc and params.outdir. We're setting these fairly arbitrary strings, which will act as file paths, using $projectDir, the project directory you're executing from, so that everything is a relative path. We set these at the top so they can be used throughout the pipeline. Then we have log.info, which is just a way of getting a nice printout of some of the information we've put into the pipeline, followed by .stripIndent(), which is a little bit Groovy-ish but basically says remove the leading whitespace.

Then we have the process definitions. Process is one of those keywords I've mentioned already, and it will keep coming up; it's a core concept in Nextflow. A process is the part of the code that actually does the work. Our processes have three main parts: the input, the output and the script. The inputs can be set slightly differently; the qualifier could be a path, or it could be a value. Here the path is called transcriptome. That's just an arbitrary name we've given it, but we can reference it in the script using this variable down here. The same with the output: we've given it a fixed name, which we reference down here, but because it isn't a variable we're not using the dollar sign. And we're doing the same thing with this INDEX process: a process given the name INDEX, where everything inside the brackets is part of the process. A similar thing is happening for the QUANTIFICATION process: everything within these brackets is part of the process, and again we see the input, the output and the script. You'll see we're using salmon_index, which is something we define up here, but technically you could call it something different; we'll touch on that a little later. These names are reasonably arbitrary. What matters is that what you call them in the inputs and outputs is consistent with what you call them in the script as variables. We'll also see a tuple here. We'll come back to this a little later, but it basically creates a channel where each item has multiple elements, so for example a value together with the paths to the reads. And down here is FASTQC: again input, output, script.

What I haven't talked about so far are these tags. These are directives, and directives are little bits of extra information that you can use to effectively tune your process.
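For reference, here's roughly the shape of that top-of-script section and the INDEX process. This is a sketch from memory of the course's script7.nf rather than a verbatim copy, and the salmon command assumes the course container (or a local salmon install) is available:

```nextflow
// Pipeline parameters, built as relative paths from the project directory
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
params.outdir = "results"

// A nice banner of the run settings; stripIndent() removes the leading whitespace
log.info """\
    R N A S E Q - N F   P I P E L I N E
    ===================================
    transcriptome: ${params.transcriptome_file}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent()

// The INDEX process: input, output and script blocks
process INDEX {
    input:
    path transcriptome

    output:
    path 'salmon_index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i salmon_index
    """
}
```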
Here, for example, we've got tag "Salmon on $sample_id", which gives us that nice readout next to each task on the command line when we're executing. We've also got the publishDir directive here, which is a little different from what you see in the other processes; I won't go into that now. The interesting thing is that once you've got all of these processes defined, you can set up a workflow block down here, where you import the data we talked about using something called a channel factory, which is another concept we'll cover today. Here we're using fromFilePairs. That was the one that created the tuple: the base name of the fastq files, plus the two fastq file elements we talked about yesterday. We're calling this read_pairs_ch, and we can reference it down here in the workflow. Working through the main workflow, we've got FASTQC and all those processes we've just defined, and then for MULTIQC we start to use operators. Operators are something else we'll be talking about today. We're using mix and collect, which are probably two of the key concepts for channel manipulation: we're effectively stacking these different channels and then collecting them all together into one item, so we can feed all of it into MULTIQC at the same time. And just down here we've got workflow.onComplete. After everything else is finished, we log some more information using an effectively Java-ish, Groovy-ish if-else statement: if it worked, print this; if it didn't, print that.

I wanted to cover that again briefly because we went through a huge amount of information yesterday and came out with this proof-of-concept RNA-seq pipeline. What I really wanted to reinforce is that it was okay if some of those ideas and words were a little hard to follow and understand, because that's exactly what we're going to be covering today.

Just as a reminder, we can execute this again with nextflow run script7.nf. Because this is a new environment, I haven't actually enabled Docker in my configuration file. That was something we did yesterday. That's the wrong one... where are we... nextflow.config. You might remember that yesterday we set docker.enabled = true. We added that to the config so that I didn't have to type -with-docker every time. So what we should see now is that I can re-execute this without any problems. Yep, no problems.

The other thing we talked about very briefly was the idea of different processes running in parallel. This is one of the real strengths of Nextflow: it can parallelise processes, so execution isn't strictly linear, with everything having to happen one after the other. As a reminder, if we change this to a glob, it will include all of the different files we have in data/ggal, so it will pick up the gut, the liver and the lung samples. We can run this again, but this time with -resume, which is another thing we talked about. -resume looks in the cache, and anything that hasn't changed can be reused from a previous run. So we can see here that one of three tasks was cached.
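And here's roughly the shape of the workflow block and the onComplete handler we just walked through, again a sketch of script7.nf rather than a verbatim copy:

```nextflow
workflow {
    // Channel factory: emits [ base_name, [ read_1, read_2 ] ] tuples
    read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)

    index_ch = INDEX(params.transcriptome_file)
    quant_ch = QUANTIFICATION(index_ch, read_pairs_ch)
    fastqc_ch = FASTQC(read_pairs_ch)

    // mix stacks the two channels; collect gathers everything into one
    // item so MULTIQC runs once over all of the reports
    MULTIQC(quant_ch.mix(fastqc_ch).collect())
}

// Groovy-ish if-else once the run has finished
workflow.onComplete {
    log.info ( workflow.success
        ? "\nDone! Open the report in your browser --> $params.outdir/multiqc_report.html\n"
        : "Oops .. something went wrong" )
}
```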
But because we're now including two new files, which have been picked up as part of that glob, the process is going to run an extra two times, for those extra two sets of data. So that's a very quick summary of what we covered yesterday, and today we're really going to dig into more detail.

Going back to the training material: everything I've just very quickly summarised is part of the simple RNA-seq pipeline section. One thing I didn't cover yesterday, and would like to touch on very quickly, is that you can actually run pipelines straight from GitHub. This is really common if you're running an nf-core pipeline, for example. Here, nextflow-io is actually a GitHub organisation, so if you have your own GitHub repository, you could replace this with the name of your actual repository. Hmm, it doesn't like that; there's some sort of directory issue, so I'll scroll back up and look at the code. Probably should do that. What we've done here is run Nextflow using a GitHub repository. It could be any major Git provider; Nextflow will automatically pull it, so you don't necessarily have to download a local version of a pipeline. You can pull it straight from Git, and this is really popular for collaboration. rnaseq-nf here is the name of the repo, under the nextflow-io organisation, and it's basically the same simple pipeline we've been using; here we're just running it with Docker. As an example, we could change this to my own GitHub account or something similar (I've spelt that wrong), and if I had a pipeline in there that I was playing around with, I could execute it just like that. I've also shown the version control you get with GitHub and Nextflow: say you had a revision or a release called v2.1, you could use that. You could use a branch called dev, or the main branch, or the master branch, or a feature branch. Basically, anything you've named in Git you can also use in Nextflow to execute against; I'll put the exact commands just below. If that's something you're interested in, make sure you go back and check out the training material in more detail; it's sitting in section 3.11.

Because we are ripping through the time quite quickly and I haven't even started on today's content, I do encourage you to go back and look at the Docker section in more detail, especially if you're not familiar with Docker and how to make Docker images. It's a really nice example of how you can create a simple Docker image and how that can be integrated into Nextflow, so that all of your data analysis is completely reproducible. The example starts with a really simple Hello Docker image, then adds Salmon, and then uses it with the same RNA-seq example we've already created.

Okay, today's content. I've already mentioned today's concepts of channels, processes and operators. Channels are the key data structure in Nextflow, and that's what we're going to start with. Then we'll talk about processes. We've covered these a little already, with the different parts of the process block, the script and the outputs, and there are a couple of others, like when and more of the directives, that we'll delve into as well.
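To pin down that run-from-GitHub example, the invocations look roughly like this. The v2.1 and dev revisions are stand-ins for whatever tags and branches actually exist in the repo you point at:

```
# run the pipeline straight from GitHub, no local copy needed
nextflow run nextflow-io/rnaseq-nf -with-docker

# -r pins a revision: a release tag, a branch name, or a commit hash
nextflow run nextflow-io/rnaseq-nf -r v2.1 -with-docker
nextflow run nextflow-io/rnaseq-nf -r dev -with-docker
```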
We haven't really talked about those yet, and we'll also cover how to organise outputs. And then of course operators, which are really for manipulating... manipulating is a bad word; they help reshape data, to make sure that channels are made up of the parts we want them to be, so that when we feed them into a process, everything is as we expect.

Okay, starting off with channels. What's important to understand is that there are two different types of channels, queue channels and value channels, and we're going to dig into those straight away. The definition here is quite a nice explanation, but it's still quite hard to grasp; I find it hard sometimes even after a few years of experience. A queue channel is an asynchronous, unidirectional FIFO queue that connects two processes or operators. What that means: asynchronous means operations are non-blocking, unidirectional means data flows from a producer to a consumer, and FIFO means first in, first out.

To look at this simply, we can jump back over here. Down the side you'll see snippet.nf, which is where I've pasted the first code block from the training, this one here. We're going to run it straight away and see what comes out: nextflow run snippet.nf. Again, this assumes we're sitting in the nf-training directory; if you're not, you might need to cd up or down a level so the relative paths work. We get this error message... well, it's not really an error message, but it's basically saying that println, which is a Groovy function, doesn't like what we've given it, and that's because it's a channel. If we were to change this to a straight string, just 'Hello', and println that instead, it's fine; I didn't say that properly the first time, but basically, because this is just a straight string, it's a lot easier to print. If you try println on a channel, it won't like it much. That's the subtle difference when you try to actually view something on the command line: view is a really good way of looking inside a channel, which is what we're trying to demonstrate here. Because this channel is created with Channel.of, it's a queue channel, the unidirectional, first in, first out type. We've got these different parts of the channel, 1, 2 and 3, and you can see that when it's printed out, they come out as three different items. The 1, 2 and 3 are treated separately because of this channel idea. If we were to do something like a collect on it, we'd see it a little differently, but we'll come to that soon.

Value channels, being the other type of channel, are a little bit different. We can use this example: we've got Channel.value, and value is a different channel factory, in comparison to the Channel.of we had just previously. If we run this, we can see down here in the command line: hello, hello, hello. That's one for each of these .view() calls, view being an operator that prints the channel to the terminal, and effectively you can keep doing this as many times as you want. We've set the value once, and we could change it to anything, 'Hello world!' say. Save that, and there we go, it's printed out as many times as we want.
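Written out, the snippet looks something like this (a minimal sketch of what's on screen):

```nextflow
// Queue channel: three items, consumed first in, first out
ch = Channel.of(1, 2, 3)
ch.view()

// Value channel: a single value that can be read again and again
ch2 = Channel.value('Hello world!')
ch2.view()
ch2.view()
ch2.view()
```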
I mean, technically when you do this with Channel.of, you will still get some type of output here, but you'll see that it's stacked a little bit differently, because it isn't in any particular order. That's the first in, first out idea: as soon as the information came through and the channel was filled, Nextflow just started printing it out. Keeping things in a guaranteed order is something you can do with things like lists, but not necessarily across channels. All of this is quite hard to follow, and I don't think I can explain it particularly well, but what I want you to take away is that there are different types of channels depending on how you actually bring the data in. Here, for example, we had value, and up there we had of, and it will depend on what type of data you're bringing in. It might be different if it's a path, if you've got a list, or if it's a list of paths; there are lots of combinations of how you might want data interpreted. Down here there's a note saying that value channels can be created using the value factory, but you can also use certain other operators to produce them; these are mostly reduction-type operators that give you that single element that can be reused. A lot of this will make more sense soon, I hope.

Alright, we're going to slightly change tack and talk about channel factories. Up until now it's been a demonstration that there are different types of channels, and this now moves on to channel factories: how Channel.whatever will produce these channels from the information you're inputting. I'm just going to copy this in again as an example, and we'll talk about it as we go. Here we're going to produce three different channels. Channel 1 is just an empty channel. This is quite a common thing to do: create an empty channel that isn't filled until later in your pipeline, so you can backfill it once something has been produced. Then we've got Channel.value('Hello there'), and Channel.value with the list 1 to 5. That's an example of how you can have, effectively, a list inside a channel, and it will be treated as a single item, a single element. We'll run this to show it. Actually, what I should do first is put a .view() on each one; without that we won't see any output. As I said earlier, view can be used to view a channel. If we tried to print these with something like println, it wouldn't work as well, because these are channels, not straight values; if we had just the plain 'Hello world' string without it being a channel, that would be different. The output is a little sparse here because channel 1 is empty, so nothing is printed for it, but we can see ch2.view() giving us the 'Hello there', and 1, 2, 3, 4, 5 being this channel here, effectively treated like a list. There are ways to split a list out, and there are different channel factories for that, but because we've imported it as a single value, it's been kept as one whole list.

So now we're going to change to a different type of channel factory, and work through these different types.
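Here's that three-channel snippet, roughly as it appears in the material, with the .view() calls added:

```nextflow
ch1 = Channel.empty()                 // an empty channel to backfill later
ch2 = Channel.value('Hello there')    // a single value
ch3 = Channel.value([1, 2, 3, 4, 5])  // a list treated as one element

ch1.view()   // prints nothing - the channel is empty
ch2.view()   // Hello there
ch3.view()   // [1, 2, 3, 4, 5] comes out as a single item
```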
As we go, try to think about how, if you were writing a pipeline, you might use these different channel factories in different situations: importing your data, or some metadata, or some type of variable or flag you want to integrate into the pipeline to control what happens at different steps, tuning a particular parameter for example.

So, Channel.of. This produces a queue channel, of course, as opposed to a value channel. We've got 1, 3, 5 and 7, and we're going to view it. This is also an example of how you can mix raw string values with the variable you're pulling out of the channel: it is kind of a reserved name for addressing the current element in a channel, so the view will print 'value:' followed by the item. Let's run it and see what we get. Okay, as you might have guessed, we've got 1, 3, 5 and 7, and because we've printed with the view operator, we see 'value:' in front of each of them, items one to four coming in from this Channel.of. You can do a lot of other cool things with Channel.of: bring in big long lists of data, or different bits of information you might be interested in. Here, for example, we use Channel.of to bring in the numbers 1 to 23, where .. is kind of a Groovy way of saying one through twenty-three, plus X and Y. That's obviously something you might use to specify all of the chromosomes of a human genome, and technically you can change it to anything you want: say I want to add the mitochondrial genome, MT, like that... and change it back. Cool. So here, just because I changed it to 1 to 10, it's printed the numbers one to ten, plus the MT and the X and Y I added, in exactly the order I specified in the channel.

These next two are probably some of the more useful channel factories, and I think they're definitely worth digging into. If we want to import a list, the list being written with these square brackets, we could have ['hello', 'world', 'yay']. We import it as a list, use the channel factory fromList, passing the list we named up top, and view it. Let's check out what that does. Here you can see each element of that list has been brought in as a different item. And we could make this list as big as we want, and include different pieces of information depending on what we're trying to do. These don't have to be fixed strings; they could be numbers, and you can mix and match. What I'm really trying to demonstrate is that if you have a list, you can bring it in, and it will be treated a little differently by fromList than by the Channel.of factory, and there are lots of ways to string these together and do different manipulations.
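Both of those look something like this, a minimal sketch with the chromosome range and the hello/world/yay list as on screen:

```nextflow
// Channel.of flattens ranges, so this emits 1..23 and then X and Y -
// handy for a human chromosome list
Channel
    .of(1..23, 'X', 'Y')
    .view { "value: $it" }   // $it is the current element

// fromList emits each element of an existing list as its own item
list = ['hello', 'world', 'yay']
Channel
    .fromList(list)
    .view()
```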
Okay, fromPath. This is probably the most commonly used channel factory, or one of the most commonly used, and probably the best way to bring your data into a pipeline. Previously we used fromFilePairs; here we're doing fromPath with a single file (if you need more files, you'd use a glob or a list). Ultimately, if we just want to bring this in, let's call the result ch1 and do ch1.view(), because we want to see the contents of that channel. This brings in the path of that file, and the channel basically becomes the path to that file. That's not the file itself, but the path to it, which might be a little confusing initially. But when you think about it, when you're feeding a path or a file into a pipeline, you want to designate the path to the file, not the file itself; the tool and the script will go away and look at that file, or bring it in. What you're effectively doing is just directing it to the file. Here, for example, we've used data/meta with a glob, everything ending in .csv, and if we go and look at the meta directory, you can see there are two matching files. If we only want one file, we can change the pattern and point at it directly, like that.

There are lots of options when you're bringing in files like this. We touched on one of them yesterday: checkIfExists, which is what we were using when looking for those files, to make sure they actually exist. You can definitely add that here. There's a quick exercise here that I'm going to skim over because we're quite behind on time already. Another example here is hidden files: with hidden set to true it includes hidden files, but by default it won't look for them. So depending on what kind of file structure you have, and whether you're after hidden files, relative paths, or checking existence, you can really tune how files are brought in using fromPath.

Again, because we're behind, I want to do this quite quickly: fromFilePairs is the same channel factory we used yesterday. You'll remember we pointed at the chicken data, data/ggal, using a glob with a 1 or a 2 in that position, *_{1,2}.fq, plus a view. When we run that, we get these tuples returned, with the first part being a value, the base name, followed by the two file paths. As you can see: gut, liver and lung, the same files we saw up here. Now, say I look at this and think: great, my files are in, but the tuple isn't quite how I want it. The two paths are still grouped together as one part of the channel element rather than separate parts, and I don't want that, for whatever reason. Again, there are options to reshape this, and the first is flat. flat is a really cool example of how we can reshape things straight away, on the fly, as the data comes into the pipeline. This is pretty much exactly the same bit of code we've just used, fromFilePairs with the {1,2} glob on the file path, but we add the option flat: true. I'll run this quickly and give you a couple of seconds to see if you can spot any differences. Okay, so what you've hopefully noticed is that up here, in the first example, we didn't have flat.
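In code, the two factories and the flat: true option look roughly like this (the paths assume you're sitting in the nf-training directory):

```nextflow
// fromPath emits the matching path(s), not the file contents
Channel
    .fromPath('./data/meta/*.csv', checkIfExists: true)
    .view()

// fromFilePairs emits [ base_name, [ file_1, file_2 ] ] tuples;
// flat: true flattens the inner list into separate tuple elements
Channel
    .fromFilePairs('./data/ggal/*_{1,2}.fq', flat: true)
    .view()
```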
If I move this one up, just to compare: you'll see that in the first output everything sits inside one outer pair of brackets, but there's also this other set of square brackets, meaning those two paths are effectively one element within the channel item. With flat, it's split out into gut, gut_1 and gut_2: the gut is the value, and over here we've got the two paths with only one set of brackets. It has effectively flattened each item so that everything inside it is its own element, its own part of the channel. So that's just flat, which is one option. Of course there are others, like type and maxDepth, all these different things you can use to really specify what you're including.

We don't have time for this today, so I'm just going to skip over it, but you can also bring data straight in from the NCBI SRA archive: you just use an API key and specify which data you want. This is especially useful if you want some supplementary data for your analysis, or something like that; you can bring it in straight away using the fromSRA channel factory. Again, like fromPath, you just bring it in, and it automatically splits things out into a value and the two paths, much like what we saw with fromFilePairs. It makes all these nice little channels that you can use in your analysis and feed straight into your pipeline, no extra work required, which is really, really cool.

Okay, I think this next part is important, so I don't want to skip it. You can think of situations where you're not bringing in a data file, but a spreadsheet. This is particularly relevant if you're using an Illumina sequencer, for example, where you might have the sample sheet that was fed into the sequencer, and now you think: okay, I want to use that same sample sheet to import this data, so I can feed it through my pipeline. What you can do is use the channel factories in combination with other operators to really control what's being brought in and how, but this time reading from a text file and addressing different parts of that file. You can imagine that each line is a different row of a sample sheet. Here we've just got some random text, so let's quickly look at it: data/meta/random.txt. You can see it's just a bit of a paragraph, separated out into seven different lines with some text in them. We'll run this from the snippet again; you could have created a separate .nf file and executed it from that, but I'm keeping everything in one place for the sake of similarity. Cool. So what's happened here is: fromPath with this file, which is a relative path to the data/meta/random.txt file we've brought in, and then splitText. What splitText does is separate the file out so that every line becomes a different item, and we've just viewed it. That's effectively what's being said here: it splits each item into chunks of one line by default. Of course, you might have a slightly more complicated sample sheet, or want to do something a little out of the ordinary, and much like the other channel factories we've talked about, you can chain different operators together and use different options to really control how all that information comes in.
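A minimal sketch of that splitText snippet:

```nextflow
// splitText emits one channel item per line of the file
Channel
    .fromPath('data/meta/random.txt')
    .splitText()
    .view()
```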
Again, I'm just going to run this quickly. Here it's split the text, and we've used the subscribe operator to say: print each item, and after each item, print an end-of-chunk marker, which is what's happened here. We're splitting the text two lines at a time, which is the by option for splitText, and subscribe controls how it gets printed; here we've said to print the marker line at the end of each chunk, which is what we're seeing.

This is where it gets a little bit more complicated again: we can keep adding on different operators, and tuning different parameters, or different options rather, to keep compounding the manipulations we might want on our data. So again, I'm just going to keep demonstrating different ideas, and I'd like you to keep thinking about how you could use this in your own analysis. Here, for example, and I can't think of a perfect real-world case for this, but maybe you're dealing with sequence data that you want in upper or lower case for whatever reason: you could say, bring in my file (my FASTA file, say; it doesn't really matter what it is), split it into chunks of 10, so we've got the ten lines or ten parts here, and then change each item to uppercase. Again, it really depends on what situation you're in and what you're trying to do with your data, but Nextflow has a lot of really powerful ways to modify data as it comes in, or once it's in and you want to change it between two channels. There are lots of different ways to do that.

Here's a simple example using the same random text: you can also keep a count, or ascribe a number to each line as you work through. Here we use count++, which is a little more code-y than some of you might be used to, but again it's that compounding idea: Channel.fromPath, then splitText, here without any other options, so one line at a time, and then a view that prints a count, starting at zero and adding one each time (this might be unfamiliar, so don't panic if it looks strange), converts the line to uppercase, and trims the text so we don't have any extra whitespace. Again, just an example of how powerful these manipulations can be.

Just for time, we're going to skip over quite a bit of this, but another common application is a .csv file. Up until now we've been using a text file, that random text; a CSV might be a much more familiar example that you'd use in the real world. Here's one of the CSVs used in the example, patient one and patient two, with the kind of fields you might expect from real-world biological data: patient ID, number of samples, and a few other comma-separated columns. Because it's a CSV, we'll just pick one example and choose this one. Clear. So again, we do fromPath, and just as there is splitText, there is splitCsv, which is obviously slightly different. Here we're actually specifying header: true.
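The two-lines-at-a-time version and the counting version look roughly like this; the count++ closure is the slightly more code-y bit I mentioned:

```nextflow
// Two lines at a time, with a marker printed after each chunk
Channel
    .fromPath('data/meta/random.txt')
    .splitText(by: 2)
    .subscribe {
        print it
        print "--- end of the chunk ---\n"
    }

// Line by line, numbering, uppercasing and trimming each line
def count = 0

Channel
    .fromPath('data/meta/random.txt')
    .splitText()
    .view { "${count++}: ${it.toUpperCase().trim()}" }
```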
That's an option that might be useful when you're importing your own data; it's just saying that the header row exists and should be used. Then we have rows: each line, each row, is imported separately. Here we've called it row, but you can call it anything you want; you could call it tt. And we're using view. So here we're just looking at the two rows, effectively remapping them (we'll talk about mapping very soon), and we're saying: use patient_id and num_samples from the header. You can use that to pull out just the columns you're interested in from that CSV. There are obviously other options too: you can specify your own column names, so you could say, I'm going to call them column one, two, three, four and five, and then in the view, when remapping, just refer to column one, or whatever you've called it yourself. You could call them columns A to E and pull out column A and column D here. Basically, there are lots of different ways to bring in a sample sheet and really choose which parts of it you want to look at at any one time. Down here there's an exercise, which you might want to come back to later: if you were to write a sample sheet, sample_sheet.csv, how would you bring it in using splitCsv, and how would you check the data? As a quick note, you can also import tab-delimited files using splitCsv by adding the option sep: '\t' to separate the data. Again, it just really depends on what your data looks like and what you're trying to bring in.

What I wanted to stress here is that there are lots of ways to bring this data in, and it's a really powerful mechanism. You don't have to go away and write a custom script to pull out just the bits of data or names you want from those files. There will probably be a way to do it in Nextflow. I'm not going to guarantee there's always a way, but for just about any use case we've come across, there is, and it's a really effective way of doing it if you use these channel factories the right way.

Unfortunately, I won't have time to go through this next part in detail either, just because I want to keep things moving, but if you have more complex file formats, you can bring in JSON files in different ways. It's quite a popular way of doing it, but it's also a slightly more advanced one, which we just don't have time to cover in the detail I think it deserves. If that's something you're interested in, or if I've said something that doesn't quite make sense, go back and check out that section. And of course the Nextflow documentation has information about the channel types and the channel factories. We touched on of, value, fromList, fromPath and fromFilePairs; we skimmed over fromSRA; and we didn't really touch on binding values or watching for events. There is a huge amount of documentation, and a huge amount of functionality in Nextflow, so we have to skim over some of it right now, but I do really encourage you to come back and check out the documentation in more detail.
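A sketch of the splitCsv pattern. The patient_id and num_samples column names are the ones from the course CSV, so swap in whatever your own header row says:

```nextflow
// header: true lets you address columns by the names in the header row
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv(header: true)
    .view { row -> "${row.patient_id} - ${row.num_samples}" }

// for a tab-delimited file, you would add the sep option:
// .splitCsv(header: true, sep: '\t')
```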
So we're going to jump into... not scripts, excuse me, into processes. This is section 6 in the training material, if you're following along. As we've already touched on, if you're bringing in a tool and trying to do a task, you wrap it up as a process and give it a name. Here, for example, the process is INDEX, which is what we've called it, and we take the transcriptome file and produce salmon_index as the output. As we go through this, keep tying it back to how we did it in our proof-of-concept example.

Okay, this is probably a much nicer way of saying what I've just said, or tried to say: a Nextflow process is the basic computing primitive for executing foreign functions, that is, custom scripts or tools. Like I said, it's really just a way to execute a script, and in its absolute simplest form a process can be just a script block. You don't always need every part of the process block. Up until now we've talked about directives, inputs, outputs and the script, and there's also something we haven't talked about that we'll touch on here a little, but ultimately you can get away with just the script block, which is what's happening here. You could just write process SAYHELLO with script: echo 'Hello world!', and in the workflow block, which is the same as this one down here, just call SAYHELLO() with round brackets, and it would work; it would just echo Hello world.

Taking a wee sidestep: when we think about how to structure a process, we've already seen that you can have these directives at the top. A lot of this ordering is more convention than necessity. You do need the script block at the bottom, but some of the others you can move around a little; not always, and it can bite you if you don't know what you're doing. When we call a process, again the simplest unit we use to actually execute a script, we give it a name. You might have noticed that we've capitalised all of the names here. That's more convention than requirement, but I find it quite useful, so that you can easily separate out what's a process and what's potentially a channel, especially if you're not using a naming convention like quant_ch for every channel. If you were to have both without capitalisation, they'd be very hard to tell apart.

Okay, so this just explains loosely what I've already said: you can have zero, one or more process directives, and you can keep stacking them if you've got different things you want to control: cpus, publishDir, some examples we've already come across. You can have zero, one or more process inputs, so you don't actually need an input, and the same goes for the output if you've just got a simple example like echo Hello world. There's when, which we haven't used yet; it's basically saying, if this condition has been met, then run this process. And then there's the script block. Up until now we've just been using script, which we'll look at here. Strictly, you don't even need the script keyword, because a bare block defaults to script, but you can change it out for something like shell if you want a shell block instead; I'll touch on that in a second, actually. We'll start with the script block, because it is probably the most important.
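That bare-minimum process, written out:

```nextflow
// The simplest possible process: no input, no output, just a script
process SAYHELLO {
    script:
    """
    echo 'Hello world!'
    """
}

workflow {
    SAYHELLO()
}
```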
I started rambling a little there, but: a script block is a string statement that defines the command to be executed by the process. In this example we've got a process we've called EXAMPLE, with a block of commands that isn't anything out of the ordinary: it's echoing hello world and a few new lines of other hellos into a file, we cat that file, use a couple of heads to take out a chunk, and then gzip it. So in a script you can have multiple lines of code and multiple tools. As a rule of thumb, though, you should try to make each process as simple as possible; don't try to cram too much into one script. It's much nicer to separate things into different processes, so that when Nextflow executes the workflow it can parallelise them if it needs to, and you can assign different resources using directives, which is something we'll come back to as well.

Anyway, just to show this as an example, let's have a quick look at the work directory. Okay, so the workflow has run the process EXAMPLE, executing the script block; you can see it has no input or output, so this is just a really simple example. And as you can see, there is nothing, excuse me, nothing new: we don't see an output file or anything like that, partly because we haven't actually set an output directory or publishDir. But what we can do is go into this process's work directory and look at what happened. As I mentioned right at the start of day one, everything we do is cached in this work directory, and we can use this hash number to access it; here it's 37616. In there we've got chunk_1.txt and the gzipped chunk archive, as well as the intermediate file, which are effectively the three outputs of the script. We can also look at what was actually run, which is .command.sh. Oops, that's not quite right; we want to cat it. You can see here, this is exactly what was executed as the command, and all of this is archived in these files in the work directory. Everything is staged there, so we can go back in and look at everything that was run. Probably a better way to look at it: you can see we've got some hidden files, the .command.sh that ran, the log, the begin file, and the error file if there was any error. If you've ever had trouble with what you've actually run in a script, look at all of this; I find it really useful for debugging when something has failed.

Okay, so here we've just used a script block. This will make a little more sense very soon, but you can also use different languages at this point. Say you want to use an R script or a Python script: you can write the Python shebang at the top, within the script block, much like you would in Bash or another script, and execute it in the same way. If you want, you're welcome to paste this in and try it at home as well: script, Python shebang, a bit of Python code saying hello world, printed to the screen. nextflow run snippet.nf... cool, and it's executed that. We can do a very similar thing with parameters: you can take in a parameter, which you see here using params.data, and use it as a variable in the script block. We're taking it from outside the process and using it inside the process.
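The Python version looks something like this. A sketch: PYSTUFF is just an illustrative name, and debug true is there so the print lands in your terminal:

```nextflow
// A script block doesn't have to be Bash - a shebang swaps the language
process PYSTUFF {
    debug true   // print the task's stdout to the terminal

    script:
    """
    #!/usr/bin/env python
    x = 'Hello'
    y = 'world!'
    print(f"{x} - {y}")
    """
}

workflow {
    PYSTUFF()
}
```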
It can then be specified, or accessed directly, as part of the workflow script. The main thing in this next little block of text, and there are lots of different examples you could use to show this, is that sometimes you need to escape things in the Nextflow script. Say you're trying to use a directory path from your local environment, or trying to get at your working directory with $PWD: you need to escape the dollar sign, and you can do that using a backslash. This is where Nextflow meets your local system, in a way. So let's run this without the backslash, without escaping, and see what we get. Still runs. Debug... cool. What we've done here is add debug at the top. debug is quite a nice way of printing out everything that's happening in the script; we'll use it a little later as well, but here I'm just using it to show what's happening as part of the script. So let's compare what's happening here and here... and I'm comparing them the wrong way round, aren't I. What you'll see is that the outputs are a little bit different, and that's because we've escaped the working-directory variable, \$PWD. With the escape, the working directory that prints is the task's work directory, the same location we were just looking at. Without the escape, you just get the working directory the command was executed from. This is really to demonstrate that, depending on whether you want the directory you're in or what's happening within the Nextflow environment, you may or may not need to escape the variables you would otherwise have used in your Bash terminal. It's similar if you're trying to create a new line, for example: you have to escape that part of it too.

Up until now we've talked about the script block, but you can also use a shell block. You can use a shell statement instead of the script statement, with a slightly different syntax that allows you to use both Nextflow and Bash variables in the same script, which is what's demonstrated here. Basically, because this is shell rather than script, the Nextflow variables are, in a sense, secondary to the Bash variables, meaning we mark them with this exclamation mark and wrap them in curly brackets, !{like_this}. I really encourage you to play around with some of this stuff. Escaping, or using a Bash variable as part of Nextflow, is probably one of the more common things people have problems with, and there are lots of good reasons to do it, especially when dealing with a full data pipeline. Just remember that it's going to depend on whether you're using a shell block. You'll also notice here that these are single rather than double quotes, which is to deal with the escaping behaviour.

Yes, so, very quickly: conditional scripts. This is basically creating if-else statements inside the script section. It's saying: if the parameter we set up here says the file is gzip-compressed, do this; else if, rather, it's bzip2, again a parameter you can set up here, do this; otherwise throw an exception saying it's an unknown compressor, reporting whatever params.compress was set to.
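The conditional script looks roughly like this; I've sketched it with a compressor parameter as in the training example, and the exception fires for anything params.compress doesn't recognise:

```nextflow
params.compress = 'gzip'
params.file2compress = "$projectDir/data/ggal/transcriptome.fa"

process FOO {
    debug true

    input:
    val file

    // pick which script to run based on a parameter
    script:
    if (params.compress == 'gzip')
        """
        echo "Command to compress $file with gzip"
        """
    else if (params.compress == 'bzip2')
        """
        echo "Command to compress $file with bzip2"
        """
    else
        throw new IllegalArgumentException("Unknown compressor: $params.compress")
}

workflow {
    FOO(params.file2compress)
}
```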
So if you put 'Chris' or something else in there, the message would be a little bit different. And of course, there's the workflow block at the bottom. I'm just going to power through the inputs and the outputs before we take a wee break, and then we'll come back and talk about when, directives and organising outputs.

Okay. Nextflow processes, as you know, are isolated, and that's how everything can run in parallel. But because they are isolated, communicating between them can be a little bit challenging, and because of that, we send values between processes through channels. When we specify an input, we need both the input qualifier, which might be something like path or val, and the input name. We had a question yesterday: is there a good way to know exactly what you've got in these channels, and whether it should be a path or a value? The short answer is probably no, but you can usually work it out from what the output of the last process was and what the next one takes as input, and chances are it will throw an error about an unknown value or path type, depending on quite what you're doing. I think the easiest way to demonstrate this is just to look at some examples.

So here we've got Channel.of(1, 2, 3); remember, this is a queue channel. We've got this process, BASICEXAMPLE, with debug again. debug is a little bit like echo, in that it prints out everything the script produces. The input is val x, and we're echoing 'process job' followed by the variable x, which is what was specified. It doesn't have to be x; it could be val hello, and this is to demonstrate that you can call it pretty much anything you want. In the workflow script we're just going to call BASICEXAMPLE with num, which is the channel that was set up here. Oh yeah, that's my bad, I didn't name that properly, and you can see it's given me a nice error message, basically saying that this is unknown and I need to check the script at line 10, which is just the start of the script block, because that's where it fell over. Fixed, it runs three times, saying process job 2, process job 1, process job 3. That's the queue channel being consumed, and because the tasks run in parallel, the printouts don't come back in the order you might otherwise expect. The point is that each element of this channel is being treated as a value, even though the channel itself is a Channel.of, a queue channel, not a value channel. It's a subtle distinction, but an important one; otherwise you'll probably run into a few errors along the way.

If we're going to use an input file, an actual path, we specify path up here, and we can see what happens. Again it's debug true, so we print things out; I've specified the qualifier as path and given it the fixed name sample.fastq. We could change this to whatever we want; we could just call it fastq. We can change it as much as we need. In the workflow, we execute the process using FOO, the process name, and reads, which we specified in the channel at the top. Because we've got this glob pattern, it's actually going to run, and print, six times; if we narrowed the glob to gut_{1,2}.fq, we'd only expect to see it twice, because we'd only match two files.
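That val-input example, with the channel named properly this time:

```nextflow
num = Channel.of(1, 2, 3)

process BASICEXAMPLE {
    debug true   // show the script's stdout in the terminal

    input:
    val x

    script:
    """
    echo process job $x
    """
}

workflow {
    BASICEXAMPLE(num)   // runs once per item in the queue channel
}
```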
Okay, what do we have here? I've mostly covered that, yes. So this is probably the next thing: you can use the same idea for dynamic file naming. In that first example we used the fixed name sample.fastq, so every file was staged as sample.fastq regardless of what was actually imported; the name wasn't dynamic. But you can also take the input path, call it sample, and then use that, effectively as a variable, to help name things in the output. There's a collect operator down here just to pull all of this together. So here, for example, we use ls -lh with the sample, giving us a slightly different output for each one. And to build on that, like I said, collect has stacked all of these together; they're still separate elements, but in a kind of list format. And this next one is pretty much the same thing: it asks for the dynamic name of the genome, which comes from the input, which was set up at the top as a parameter. So it's taken a parameter, taken the path as the input (that's the qualifier), and pushed it through the script, using the genome variable to execute the process on that file.

Now, it's very rare that you would only use one input in isolation. In the example we've been using, it's been just the transcriptome file; in reality you'd probably have multiple inputs at a time. So here we've specified two channels of values, x and y, and we're just going to echo those out, again with debug to actually show it on screen. Without that you'd need a view, but we don't need one here because we've got debug, which again is basically an echo. As you can see, it's taken them and spat them out pairwise. What happens is that all the input channels have to have an element for the process to run: when you have 1, 2 and 3 in one channel and a, b and c in the other, you get three nice outputs, 3 and c, 1 and a, 2 and b; these are queue channels, so they come out in no particular order. But what happens if we take out one of those values, so there are two in one channel and three in the other? You'll see it only runs twice, and this is because the channels weren't both filled. We can do this in the other direction and add 1, 2, 3, 4, and we'll see something similar but opposite: we'll see a, b and c paired up, but no trace of the 4. That's because these are queue channels, and each element can only be consumed once. But if we change one of them to a value channel, just Channel.value(1) say, you can see the 1 gets used multiple times, because a value channel can be consumed again and again. This is why I was saying at the start that understanding the subtle differences between a value channel and a queue channel will influence what you can do later on in your pipeline.

Yes, so this is basically what I've already shown you, using a slightly different example: a value of 1, used to generate slightly different outputs. Okay, this next one is a slightly more advanced example, which for time I'm not going to go into too much, but you can use something called each. each says: run this for each of these. In this case it's saying run this process, which aligns sequences, excuse me, once for each of the modes.
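Here's a sketch of that value-channel-plus-queue-channel behaviour; the value 1 gets reused for every item in the queue channel:

```nextflow
ch1 = Channel.value(1)          // value channel: can be consumed repeatedly
ch2 = Channel.of('a', 'b', 'c') // queue channel: each item consumed once

process FOO {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    FOO(ch1, ch2)   // runs three times: 1 and a, 1 and b, 1 and c
}
```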
modes is the second input we're adding to this process; it stands for the methods. sequences refers to the paths, which we've added in here, and each mode repeats the process for each of the methods. When you run this, it runs once for every one of those methods, and that's a really powerful pattern: say you had different aligners, or different chromosomes, or different anything that you want to run multiple times, even different sets of parameters, you could specify all of it and use each to drive the process through multiple times.

Okay, outputs. For everything that comes in, quite often you want an output, and it's specified in much the same way as inputs: a qualifier, plus what you're actually calling the thing. There are different qualifiers, and val and path are probably the two major ones, but we also have something here called emit. You add emit after a comma and give it a name; emit is basically used to name your output, and this becomes more relevant when you have multiple steps, or different outputs that you want to name slightly differently, so that you can bring them into a different channel or a later process. This will become more obvious in a second.

So again, a very simple, excuse me, simple workflow. We've got methods, specified as this list; we've got the input, which is a val, so that's how it's treated as it comes in; and using FOO we pass in Channel.of(methods). It looks like a list, but because we've used the Channel.of channel factory, it becomes a queue channel. Again, this is where these concepts get a little bit blurred and muddy, but I assure you it does become clearer with a little practice. Then we've made a channel called receiver_ch and viewed it via the emit name, received. So the whole list came in as a single item, even though it is itself a list up here, and it's just printed that out, which is what we've seen: it received it as a list. I guess we could try changing this to something like fromList... right. I wasn't sure that would work, but because I've used fromList rather than .of, it's actually split this out into three separate items. It's just a slightly different way of doing it, and again it demonstrates that you can manipulate these things based on what you're trying to do.

Okay, so here's an example where we haven't actually got an input, but we've set the output as results.txt and used echo $RANDOM to basically pump something into results.txt. Just for time, I'm not going to go into that, but what I did want to show is that you can still use glob patterns to specify multiple outputs. This is much the same as when you use a glob pattern with your inputs: it's just saying anything matching this pattern is fine. And that's what we see here: we've printed 'Hola' and split it out into chunks named chunk_ with a letter suffix at the end, so chunk_aa, chunk_ab, chunk_ac. We've run that; we've also used a flatMap just to separate those out, and then a view to look at the result.
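A minimal sketch of the emit naming: received is the emit name, and the whole list arrives as a single item because of Channel.of:

```nextflow
methods = ['prot', 'dna', 'rna']

process FOO {
    input:
    val x

    output:
    val x, emit: received   // name the output so we can grab it later

    script:
    """
    echo $x
    """
}

workflow {
    FOO(Channel.of(methods))                    // the list comes in as one item
    FOO.out.received.view { "Received: $it" }   // access the output by emit name
}
```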
Okay, so we're going to jump into a slightly more advanced process, just to check out what it does, and we'll dissect it. I'm hoping it will cover a few of the concepts I've briefly alluded to, and, since that run just failed, it also gives us the opportunity to talk about some of those concepts in more detail. The cool thing about Nextflow is that you quite often get a really nice error message that tells you what's failing. Here I was missing an output file: it was expecting a particular file and couldn't find it, and the error points at our process align, so we know something's gone wrong in that process. Let's debug it. We've got this process align with two inputs, x and seq; these are arbitrary names, so I'm not too worried about them. The output is a dynamic file name built from the input, "${x}.aln". I think that's the mistake: the species list was passed in with Channel.of, so the whole list arrived as a single item, almost as if it were one string, and the process tried to do them all at once, building the output name from the entire list, which is why the expected output was never found. The fix is the same as what I did just previously: this should come from Channel.fromList, so that each of the species is treated separately rather than collectively as a single list. That splits them out so they can be used one at a time, and I think that's worked. This is an example we'll need to go back and fix in the material.

Okay, a slightly more advanced option: we're going to use tuples here. Just a reminder, a tuple is when we have multiple parts to a single channel item. Thinking back to what we did with fromFilePairs, we had the value at the start, like gut or lung, as well as the two fastq files; we're doing the same thing here. The first element is the sample ID, which is a val, and the second is the pair of file paths joined together inside those brackets. We take that as an input, and the script just echoes a placeholder command with the sample ID into sample.bam, so nothing is really happening to these files; we're just using this as a demonstration of dynamic naming. nextflow run snippet.nf, and note that we've reverted to using debug here rather than the views; it really depends on what you're trying to do while developing. If you just want to check things quickly you can chain in a bunch of views, which I wouldn't really recommend because they're a bit of a pain to get rid of afterwards, but debug is a pretty good option. So here we're taking the output, which is a tuple of the value plus the path, and because the sample ID appears in both the input and the output tuple, it's carried through; effectively it has taken those two files, kept the sample ID that was there previously, and added the new path alongside it, that arbitrary sample.bam file we created in the script.
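A hedged sketch of that tuple pattern: fromFilePairs emits items shaped [sample_id, [read1, read2]], and the process carries the sample ID through to a dummy BAM output. The process name and the command are mine, not the training's exact code:

```nextflow
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')   // assumed file layout

process makeBam {
    debug true

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path('sample.bam')

    script:
    """
    echo "your_command --sample $sample_id --reads $reads" > sample.bam
    """
}

workflow {
    bam_ch = makeBam(reads_ch)
    bam_ch.view()
}
```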
Let's run that again with one change: the sample ID is still being used in the script, but now we're also taking it directly into the output file name. This is an example of how you might carry a sample name around with the sample for the entire pipeline, and we can see lung, liver and gut coming through in the output. The point is that when you have a tuple you can mix and match its different parts in your script block, input, or output; as long as the names are unique you can combine them however you like. It's probably a bad example in that you might get confused between sample and sample.bam, but you could call it something clearer and it would be absolutely fine.

Okay, we're about an hour and a half in, so we're just going to take a wee break for five minutes, just so I can catch my breath. We are running a little bit behind, but hopefully five minutes is long enough for you to go and get a drink, stretch your legs, or go to the bathroom. I'll see you back here in five minutes. Maybe another ten seconds and then we'll get straight back into it, because there's still a lot of content to cover.

Okay, just a reminder: we're working through the content and the training material at training.seqera.io. We've just finished talking about outputs, very briefly, and we're going to jump straight into when directives and how to organise outputs, again very briefly. when is basically another part of the process block where you can set criteria: when this occurs, or when a name finishes or starts with something, or when this equals that, then actually execute the script block. It's a little bit like an if statement. I'm not going to go into it today, just to keep things moving, but it's a whole other block you can add; there's a little about it here, and of course more on the Nextflow documentation pages.

Much the same goes for directives; just for time, I won't go into every possible directive you can use. The main thing to remember is that, conventionally, you'll mostly add them at the top of your process block, and you can add different things to control what's happening in that process or fine-tune it: for example CPUs, memory, or the container, where you set a specific Docker image for the process to run in. That's really common for nf-core, and it's a really popular convention in most pipelines actually. Then below the directives you add your inputs, outputs, scripts and things like that, which is all pretty straightforward. Just as a reminder, we have already used directives; that was one of those words I threw around yesterday in script7. tag is one example, which adds a label to what we see in the command line output when we're executing the pipeline, and publishDir is another good one to use when writing a process; there's a hedged sketch of a process carrying these directives just below.
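An illustrative process showing where directives sit: at the top of the block, before input/output/script. The process name, the resource values and the skip_me condition are all placeholders of mine, not the training's exact code:

```nextflow
params.outdir = 'results'

process demoTask {
    tag "$sample_id"                         // label shown per task in the run log
    publishDir params.outdir, mode: 'copy'   // copy outputs out of the work dir
    container 'ubuntu:22.04'                 // pin the image the task runs in
    cpus 2
    memory '4 GB'

    when:
    sample_id != 'skip_me'                   // the task only runs when this is true

    input:
    val sample_id

    output:
    path "${sample_id}.txt"

    script:
    """
    echo "processed $sample_id with $task.cpus cpus" > ${sample_id}.txt
    """
}

workflow {
    demoTask(Channel.of('gut', 'liver', 'skip_me'))
}
```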
Organising outputs is another good thing to do; again, just for time, I don't want to spend too long on this. You can basically use publishDir, which is something we've talked about previously and just demonstrated, but depending on your setup, for example an S3 bucket on AWS, this can get a little more complicated. As a rule of thumb, just name things sensibly: publishing a directory using copy mode is probably what most people will use as a fallback option. Some of you may be looking for more complicated subdirectory structures, and there are lots of ways to do this, where based on parameters or different patterns you can specify where you would like files to go. This is really common: for example, thinking about biological data, a process might create a VCF file along with a bunch of other files, and you might want to send those off into different folders based on what you're trying to save and keep and do with them in the long run. Just for time I won't go into all the patterns, but there's a really good example here that you can play around with, to see how using glob patterns as an option to publishDir lets you direct exactly what data you want saved where; I've sketched that idea at the end of this section as well.

Now, probably the hardest concept to grasp when learning Nextflow is operators. Operators are really used to manipulate and control what's happening with your channels and where the different parts are going, but there are also a lot of options you might not realise exist, including using maths operators to change the numbers inside those channels. So we start with a basic example: nextflow run snippet.nf prints 1, 4, 9, 16. What's happening here is we've got Channel.of(1, 2, 3, 4), a queue channel with four numbers all treated as separate items, and we create squares by taking nums, this channel, and mapping it. map is a really common operator; I think map is probably the operator you'll reach for most after you've defined your channels, and it's a really common way of restructuring the different parts of a channel, saying: okay, now I'm going to remap this, or remove that. Here we're just using it to take each item of the channel and square it, multiplying the item by itself, and then we view it. Because we don't have a process here we can't use debug, and we see 1, 4, 9, 16 as the output.

Here's another example, this time using strings, which we're just going to view, but we can also refer to the different items in the closure like this. Obviously I've added in some spacing; you don't actually need to do that, it's really just however you like to lay out your code, and I like having a few gaps. And... that isn't working, because of something silly: yes, this should be curly brackets rather than round brackets, I should have checked that. The two are used differently, and there probably are set rules about what you can and can't do; I still get them confused from time to time, but the thing to remember is that when you're passing a closure that operates on each item, it's going to be curly brackets. So what this is doing, as part of that closure, is adding a hyphen at the start of each item, so when it's printed out each item appears with a hyphen in front of it, as a nice list.
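Circling back to the publishDir patterns for a second, here's a sketch of pattern-based output organisation: one process, with each file type published to its own subdirectory. The process name and files are hypothetical:

```nextflow
process callVariants {
    publishDir 'results/vcf',  pattern: '*.vcf', mode: 'copy'
    publishDir 'results/logs', pattern: '*.log', mode: 'copy'

    input:
    val sample

    output:
    path '*.vcf'
    path '*.log'

    script:
    """
    echo "calls for $sample" > ${sample}.vcf
    echo "log for $sample"   > ${sample}.log
    """
}

workflow {
    callVariants(Channel.of('gut', 'liver'))
}
```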
This isn't much different from what we were showing before, but you could add anything you want there, just to make the output a little easier to read when you're printing it, or to add some text if you're trying to pipe it into something else.

As I've touched on already, the map operator is really, really powerful; it can be used in lots of different ways to take an element, or a part of an element, of a channel. So Channel.of with two different strings in it, hello and world, and we take each of those items and spit it out in reverse. reverse here is a string method rather than a channel operator, but it demonstrates that inside map you can use different code, or different methods, depending on what you're doing, to manipulate what you'll see as the output. We're using view, and inside the map we've taken the item and called it.reverse(), and it's flipped each string for us.

Of course there are lots of different methods we can use; another one we'll use as an example is .size(). We could use it in all sorts of situations, but we've arbitrarily set it up like this because it's a little easier to conceptualise: hello and world come in, and the map produces word and word.size() as the two parts of a little list. Then we take that with view, name the parts word and length, and print that the first part, word, contains length letters. So let's see that: hello contains 5 letters and world contains 5 letters. We've arbitrarily named these word and length and remapped them into the string "word contains length letters", but you can change the strings to anything you want, treat the parts as variables, and use some of them more than once; you can see here that a longer string comes out as "contains 15 letters" next to hello's 5. There are lots of different things you can do with map, and notice that we're effectively remapping again inside view, with the curly brackets, so you can fine-tune and control things at every step.

I've seen a few questions in the Slack channel about the difference between things like mix, collect and flatten, so hopefully this next part will help clarify that. We have three channels, c1, c2 and c3, containing 1, 2, 3, then a, b, then z. When we use mix, it takes all of these and mixes them together, so they all get printed out as part of this one view, but they have not been merged into one item: they still exist independently. They're all flowing through together now, but they are still separate items, which I think is probably the right way to describe it.
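The map and mix examples from this section as runnable snippets; these follow the training material closely:

```nextflow
workflow {
    // reverse each string
    Channel.of('hello', 'world')
        .map { it.reverse() }
        .view()

    // emit [word, length] pairs, then name the parts again inside view
    Channel.of('hello', 'world')
        .map { word -> [word, word.size()] }
        .view { word, len -> "$word contains $len letters" }

    // mix: interleave several channels into one stream of separate items
    c1 = Channel.of(1, 2, 3)
    c2 = Channel.of('a', 'b')
    c3 = Channel.of('z')
    c1.mix(c2, c3).view()
}
```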
Things are a little bit different when the items going into the channel are themselves lists, which is different from what I was just showing. Here we've got two lists, foo and bar, and a Channel.of(foo, bar), so it's a queue channel whose two items are lists. Trying to think of the best way to describe this: the lists get treated as whole items unless we do something about it, so let's do that, and hopefully it'll be a little easier to explain. Using flatten, we separate out each element of those lists, list one being foo and list two being bar. If we weren't to have the flatten, we'd see the items still contained as lists; they haven't been separated out. So flatten takes the parts of those lists and separates them out for us, which is quite nice: if you've created a list for whatever reason, or imported something as a list, or had something forced into a list in some situation, you can get at each element separately.

collect is the other big one, and I think it's quite common; we used collect at the end of script7 when we were setting up MultiQC. There we mixed two channels, the quant channel with the fastqc channel, then collected them, to combine them all into one single item. We can see the same thing here: we've got 1, 2, 3, 4, we run collect, and what comes back is one single item with everything collected together, in comparison to the version without collect, where each number retains its own line as a separate item.

We also have this thing called groupTuple. Again, tuple is that word we've been using for a channel item with multiple parts, where the first might be a value and the rest might be paths, as we saw with fromFilePairs when importing the chicken fastq data. What we have here is a channel of a series of different tuples, and we apply groupTuple and view. What happens is it groups the tuples based on their first element: in this case everything keyed on 1, which is A, B and C, gets grouped together, for 2 we get C and A, and for 3 we get D and B. There are lots of different ways to group a tuple, and lots of scenarios where you might want to: an example might be if you've run some sequencing and done multiple lanes of the same sample on different runs, and you want to merge everything based on the sample name, or something to that effect. You can see here that we get the grouped output, which is really nice, and there's a good exercise here if you want to have a go at it in your own time.

We can also use join, which works pretty similarly but uses a matching function. We've got these two queue channels, one keyed X, Y, Z, P and the other Z, Y, X, and what it's done is join them together: every time it saw a matching key, it joined the right channel onto the left. So we take left, join on right, and view it, and you can see, for example, Y, 2 and 5: it's taken Y's value from the left channel and added on Y's value from the right. Joining right onto left sounds super confusing when I say it back like that, but I hope it at least makes some sense.
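Runnable versions of those four operators, following the training examples:

```nextflow
workflow {
    // flatten: lists are unpacked into individual items
    foo = [1, 2, 3]
    bar = [4, 5, 6]
    Channel.of(foo, bar).flatten().view()        // 1, 2, 3, 4, 5, 6 as separate items

    // collect: the opposite direction, many items into one list
    Channel.of(1, 2, 3, 4).collect().view()      // [1, 2, 3, 4] as a single item

    // groupTuple: group tuples by their first element
    Channel.of([1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'])
        .groupTuple()
        .view()                                  // [1, [A, B, C]], [2, [C, A]], [3, [B, D]]

    // join: match items from two channels on their first element
    left  = Channel.of(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
    right = Channel.of(['Z', 6], ['Y', 5], ['X', 4])
    left.join(right).view()                      // e.g. [Y, 2, 5]
}
```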
Cool, so branch is another operator which is really useful. This might be a situation where you have big and small numbers, and you want to say: do this if the file is over this size, do that if it's under this size, or under three lines, or if the sample is this, send it this way, otherwise send it that way. Here we're just using a very simple branch to split values into either small or large; you could do a string match or something else in the same situation as well. One thing we haven't really talked about: when you have a branched channel like this, you can add these suffixes onto the channel name, which define, in this case, which branch we're looking at. And this double slash here is a comment; I can't remember if I've mentioned it before in this talk, but you can comment lines out just using a double slash like that. This comment refers to this branch, and this one to the other. This next bit isn't really covered in the material, but I mentioned earlier that you can use emit and mix with your outputs in a similar way, and if we've got time I'll come back to that; I'm kind of surprised it isn't in the content, maybe it comes up later and I've skipped it, but it doesn't really matter for now. Basically, you can use the branch names to specify what's going on: adding text, or doing different things, based on whether each item was decided to be small or large. And you can imagine a situation where you say: I'm going to take the small branch, treat it as my new channel and pump it into this process, and all the large stuff I'm going to pump into that process, so you can really mix and match with this. There's a runnable version of the branch example at the end of this section.

There is a huge amount of information about operators and I've really only skimmed over the top, which is a bit of a shame, because there are so many really useful operators depending on what you're trying to do with your channels. Some of them, like splitCsv and splitText, we talked about in the last section when we were looking at controlling inputs from a sample sheet, for example, but there are a huge number in the documentation and I really encourage you to go away and have a look at some of them when you get a free minute. I can't speak highly enough of the content in that documentation: the examples are pretty good, and when you're a little bit confused or not quite sure how to do something, a ctrl-F search will generally get you where you need to be.

Normally when we run a training like this we would have time to go through Groovy basic structure and idioms; today we're effectively going to skip over that completely. It's quite a common question: do I need to know Groovy to write Nextflow? From my experience I think the answer is no, but it does help. Understanding a few things helps: println is very Groovy, how definitions work, how a list works, how a map works, even little things like multi-line strings, which we used in the log.info at the start of our RNA-seq pipeline. A lot of little things like that do come up.
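Picking the branch example back up, here it is as a runnable snippet, close to the training version; the small and large branches come out as ordinary channels, so each could feed a different process:

```nextflow
workflow {
    Channel.of(1, 2, 3, 40, 50)
        .branch {
            small: it < 10
            large: it >= 10
        }
        .set { result }

    // each named branch is its own channel
    result.small.view { "$it is small" }
    result.large.view { "$it is large" }
}
```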
But I don't think the concepts go much beyond a quick skim of that documentation; it might take you sixty minutes to push your way through and understand a good chunk of it, including some really useful material on curly brackets versus round brackets. We just don't have time to cover it today, and we're already quite a wee way behind schedule. So what I'm going to focus on is modularisation, a little bit of Nextflow configuration, and then five minutes on deployment scenarios, about how different HPC systems or clouds can be used. We're just not going to get to everything today; I think I've spent a long time on the earlier concepts, which I think was really worthwhile because they are laboured points, but in saying that, we're going to push on and get through as much as we can.

Okay. Up until now we've just been building everything into one big, long script, but of course this becomes really big and laborious, and it's really hard to scroll up and down and find what you want. What we can do is move some of this into other files, at paths relative to our main Nextflow script. In the content here we've got the example we covered yesterday: we can take the processes that were described in the script and put them in a different file, called modules.nf or anything else we want, and then just include them, which is a really nice way of doing it. I think we'll focus on the hello.nf example today, just because it's a little easier to cover quickly. We're going to take these two processes, everything that's happening in the hello.nf script, which you should have in the file directory off to the side here, and create a new file called modules.nf; I'm probably one of the few people in the world that still uses nano, yes. Checking what's in that modules.nf, all we've got are the two processes, splitLetters and convertToUpper, moved out of the main script and into this modules.nf sitting right next to hello.nf.

It's quite nice to put the include statements right at the top of your script, so you know exactly what's going on; just for time I'm going to copy them in: include splitLetters from modules, and include convertToUpper from modules. When we run nextflow run hello.nf, hello.nf being the name of the script again, it's brought those in: we've defined each process in the modules file and included them by name. That's one way of doing it, a simple way. We can close those files because we don't need them any more. In our example here the two includes are on separate lines, but you don't always need to do that: if the processes are in the same file, you can list them in a single include statement and they'll both come in. That should run again without any problems... why isn't that working, what have I done? I used the wrong punctuation there, that's my fault: the names inside a single include statement are separated with a semicolon, not a comma. So again, all I've done is list both processes under that same include statement.
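Here's roughly what the two files end up looking like; this mirrors the standard hello.nf example, though the exact process bodies may differ from what's on screen:

```nextflow
// modules.nf -- the two processes moved out of the main script (sketch)
process splitLetters {
    output:
    path 'chunk_*'

    script:
    """
    printf '${params.greeting}' | split -b 6 - chunk_
    """
}

process convertToUpper {
    input:
    path x

    output:
    stdout

    script:
    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}
```

```nextflow
// hello.nf -- both processes pulled in with one include statement;
// note the separator between the names is a semicolon, not a comma
params.greeting = 'Hello world!'

include { splitLetters; convertToUpper } from './modules.nf'

workflow {
    letters_ch = splitLetters()
    results_ch = convertToUpper(letters_ch.flatten())
    results_ch.view { it.trim() }
}
```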
I did actually get a pretty good clue there: the error said line 3, column 9, I just wasn't paying attention. So now both processes have been included really nicely from the one statement. What you'll find is that people often go further and specify these in separate files, and this is something that's really common in nf-core, for example, where you would include fastqc from a modules/fastqc/main.nf, or something like that.

Here's a slightly more advanced application of this. You can only use a named module once in a workflow, and this is because of the way the channels are driven through: if the same process is used twice under the same name, Nextflow gets confused, I guess because the channel would be partially consumed or used twice already, and it just doesn't like that, so it basically throws an error here. What you can do instead is include the process as something else, giving it an alias, which is really cool and super quick and easy to set up. So instead of just including splitLetters and using it as splitLetters like we did in the previous script, you give it an alias, and then you can use the two names separately in the same workflow. nextflow run hello.nf, and if we pay attention to what comes up here, you can see splitLetters1 and splitLetters2: we've used the same process twice, via the different aliases. That's just something quite useful to know.

Output definitions: what do we have here? Okay, so quite often you have an output where you might not want to use set, or use equals, to create a new channel every time; you might think, I'm only using this once, on the fly, so I don't want to create a whole new channel, it seems like a lot of extra work. Personally, when I code, I still would, because I like to see things all separated out, but you might not. The first example here is probably similar to what I would do, where you create a separate channel for each output. It's not going to work yet because I need to grab the rest of the hello script, so we need to roll back to here; oh no, rolling back to here, we've still got the workflow block, but we also need to go back in and add the includes again. So this is the simplified version again: I've brought the processes in separately, no aliases, just to make sure this works, basically resetting the script back to something closer to the start.

What we can do is use this .out. .out works like a property of the process: instead of assigning the output to a channel like we did, and I've put the two versions directly below each other so you can do a direct comparison, instead of setting letters_ch, which we then pass to convertToUpper, we just take splitLetters.out, which is the splitLetters output, this here, and pipe it straight into the convertToUpper process. In the next line we're doing the same thing but indexing it with zero, the zeroth element, or the first, depending on how you think of it. Basically, what I'm trying to show is that you don't always have to create a new channel: the channel still gets fed in, but you can use .out to grab the output directly. These two blocks of code are equivalent, and running it shows exactly that: the same result.
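Aliases and .out in one hedged sketch; this assumes the modules.nf from above, and the comments note the variations mentioned:

```nextflow
include { splitLetters; splitLetters as splitLetters2 } from './modules.nf'
include { convertToUpper } from './modules.nf'

workflow {
    splitLetters2()                                            // second copy, via its alias

    splitLetters()                                             // first copy, original name
    convertToUpper(splitLetters.out.flatten()).view { it.trim() }
    // with several outputs you can index them, e.g. splitLetters.out[0],
    // or use a named output if the process declares one with emit:
}
```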
The hash has changed a little bit because I've changed the code, but the outputs are the same. The example here also adds .upper onto .out; that's a named output, an emit called upper on that process, which in this example holds everything in upper case, but we're just going to skim over that for now.

Using piped outputs is another way you might want to design your main workflow, where you just pipe one process into the other. This is probably easiest when you've got a really simple script with singular inputs and outputs; for me, I find it more complicated because I can't see things broken up, especially if I'm doing something more involved and moving parts around.

Here's an extension of what we've been talking about: up until now we've just been including processes from this modules file, like this, but you can actually include a whole workflow from somewhere else too. In this example you might specify your workflow in a different file; the qualifier is workflow rather than process, and in reality this could be some workflow specifying a whole series of steps and processes, designed separately or independently, that you bring in as one large block of code. Here it's just the one block already, but effectively you name a workflow and include it. What's that? I'm not sure why that's failing; we might just keep moving on, and it seems to be working now. It's quite common to specify a sub-workflow this way, a big block for quality control, or the mapping, or something else like that, and bring it in as a whole workflow, which gives you almost modular parts of a script rather than just single tools.

Yes, so the next concept we'll touch on is this idea of take. take works in a similar way to input in a process: it gives an arbitrary name to what's being received into the workflow. Here, for example, the workflow takes params.greeting via a channel; in this case I guess take really is the right word, you're taking something into the pipeline, and then you can refer to it by that name in the body, so it's a named input, essentially. If you didn't have the take, you could do something like this instead and get rid of it, but for now the main thing to realise is that you can use take to bring parameters, files, channels, or other things into a workflow. That's what's shown here: calling my_pipeline with Channel.of(params.greeting) is the equivalent of this.

Cool, so the other thing that can be done is calling named workflows, where you include a series of different pipelines, or steps of workflows, and based on different flags or criteria you can choose a different entry point. Here, for example, and this is probably something a little more advanced, but it's worthwhile knowing the functionality exists, we've got two workflows, my_pipeline_1 and my_pipeline_2, both included as part of the main workflow script, and we use this flag called -entry to, I'll probably call this the wrong thing, jump in at a slightly different point.
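A hypothetical sketch of take plus -entry: sayIt is a stand-in process of mine, and my_pipeline_1 and my_pipeline_2 approximate the named workflows shown on screen:

```nextflow
params.greeting = 'Hello world!'

process sayIt {
    input:
    val greeting

    output:
    stdout

    script:
    """
    echo '$greeting'
    """
}

workflow my_pipeline_1 {
    take: greeting                       // named input to the sub-workflow
    main:
    sayIt(greeting).view { it.trim() }
}

workflow my_pipeline_2 {
    main:
    sayIt(Channel.of('a different entry point')).view { it.trim() }
}

workflow {
    // default entry point; passing the channel in satisfies the take:
    my_pipeline_1(Channel.of(params.greeting))
}
```

The alternative entry point would then be selected with something like nextflow run main.nf -entry my_pipeline_2.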
I don't think I've set this up properly, so it's probably going to fail; I skipped a few steps there while not paying attention. Okay, so here it's jumped in at the default entry point, but you can always change that and jump in at the second one, for example. This might be a situation where you think: okay, I've already done the QC separately and I'm not worried about doing it in this pipeline, so jump straight in at the trimming step, or something like that.

We can also really fine-tune each of these processes using different parameters. We have this script for parameters, and I'm trying to think of the right example to cover it; here's a basic one, essentially a sayHello included from modules with its own parameters. We might skip this for now, just for time, but what's worthwhile knowing is that you can add parameters for particular steps. This is really common in nf-core and things like that, and it will be touched on tomorrow: you can include specific parameters for different processes and really tune them as you need to, adding different flags, filters, thresholds, or anything like that.

A little note here about DSL2 migration: there is still documentation, and there are tutorials and things floating around, written in DSL1, so just be aware of that. Sometimes you'll find yourself asking why isn't this working, and it's because it's DSL1 and something hasn't been updated properly.

Comments: we talked about this already, so you can use the double slash to comment out lines, and you can also use the multi-line option.

When you're defining things as part of your configuration, there's a choice of syntax; it's just a matter of finding what works for you, because the forms are equivalent. There's a good example here: alpha.x written with dot notation is the same as an alpha block containing x, and similarly for y with its string value; those are two equivalent ways of writing the same configuration. It's the same with what we've got here: we've used process.container, but we could also write it as a process block, curly brackets, with container inside, and likewise docker.enabled and docker.runOptions versus a docker block with those settings inside. It's just a different way of writing it; different people have different preferences, and it's not a big deal. I've put the two notations side by side below.

Here are just more examples where we've set different parameters in the config file. The point of this little bit is to say that you can override them in different ways, and there is a hierarchy, which is in the documentation; one of the helpers might be able to put the link into the chat if anyone's interested. Again, there's a lot here about configuring the environment, which is what I was talking about earlier: you can add different parameters to tune your processes, so some of it looks quite familiar from directives, but you can also set it in the process scope of the config. There are specifications for setting the amount of memory or CPUs and so on, for when you're really optimising your pipeline, some information about processes and the map and list syntax for options, a little bit about Singularity, a little bit about Conda. I've really just skimmed over this, which is a bit of a shame; I would have liked to spend more time on it.
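The two config notations side by side; the container image here is the training's nextflow/rnaseq-nf, used purely as an example:

```groovy
// nextflow.config -- dot notation and block notation are interchangeable
alpha.x = 1
alpha.y = 'string value'

// ...is the same as:
alpha {
    x = 1
    y = 'string value'
}

// and likewise for the process and docker scopes:
process.container = 'nextflow/rnaseq-nf'
docker.enabled    = true

// ...is the same as:
process {
    container = 'nextflow/rnaseq-nf'
}
docker {
    enabled = true
}
```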
With only 15 minutes left today, the main thing I wanted to say before leaving configuration is that the configuration file is a really powerful way of setting up everything that's important to your pipeline: parameters, institution-specific settings, whether you're using Singularity over Docker, or Conda, or whatever you're using, you can control a lot of this in the configuration file. I'd encourage you to go back and have a look at that section when you get a minute, although we will touch on some of it here under deployment scenarios.

So, with 15 minutes left, this is probably 45 minutes of content, so I'm going to skim over it quite quickly, but hopefully there'll be enough that you understand what's here and have the opportunity to come back later and look at it properly. Going back to what Evan talked about in his presentation yesterday, we can use different HPC or cloud cluster deployments. Nextflow is really, really fantastic at scheduling jobs and sending them out to a local executor or a batch scheduler, and quite often we see or get questions where people say, how do I do this with Slurm, or Sun Grid Engine, or Amazon, or whatever, and quite often one of the biggest problems is that people are trying to do all of it themselves, when Nextflow is actually quite capable of doing it without you having to hand-craft the job submission to the cluster yourself.

Trying to think of the best way to describe this: Nextflow supports all the major HPC executors, and you can just specify one using process.executor in a configuration or a profile or something like that. Not only that, you can specify how it will interact with the queue system: how many CPUs, what type and amount of memory you want it to use, how much time you're going to allow it, as well as the disk storage required for each task execution. You can very quickly specify this in the process scope and include it in your configuration file, which is what we just skimmed over: you can take all of this and put it into your nextflow.config, so instead of having it in the script you can strip most of that out of the script and just set, for example, the container there. Very quickly, all of this is automatically integrated into your execution and you can run it.

But beyond that, you can really fine-tune what's happening with each process. In a situation where you think, okay, this is a relatively quick process, I'm just running FastQC, I only want 20 gigabytes of memory, and I'm going to put it on the short queue because I don't need the long queue or the high-memory queue, depending on what type of HPC or other system you're submitting these jobs to, you can do exactly that. As an example, with a withName selector for foo you can specify those settings for any process by name, and then for bar you say, okay, this is actually the mapping, it's going to take a long time, I want four CPUs and 30 gigabytes of memory. So you can control this, and it's a really fantastic way of allocating resources to make sure your pipeline runs optimally, especially if you're going to be using a cloud computing resource, where cost is a factor: if you're spinning up virtual machines, you don't want to ask for more resources than you need, so this stuff is really, really important.
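A hedged sketch of that per-process tuning in a config file; foo and bar stand in for real process names, and the scheduler and values are illustrative:

```groovy
// nextflow.config -- choose the scheduler once, then tune processes by name
process {
    executor = 'slurm'            // or 'sge', 'lsf', 'pbspro', 'awsbatch', ...

    withName: foo {               // e.g. a quick QC step
        cpus   = 1
        memory = '20 GB'
        queue  = 'short'
    }

    withName: bar {               // e.g. a long-running mapping step
        cpus   = 4
        memory = '30 GB'
        queue  = 'long'
    }
}
```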
Here's an example exercise along those lines: for the quantification process, and this would be QUANTIFICATION in capitals if we compare it directly back to the process in the script, you allocate two CPUs and five gigabytes of memory. In reality we can just run this, because it's nothing too extraordinary: nextflow run script7.nf, and what it's doing is saying give me five gigabytes of memory and run two CPUs. Ah, it's because I called it the wrong thing: as I mentioned before, it will run anyway, and this is just a warning telling me there was nothing matching that process selector, I'd misnamed it. So let's run it with what I've specified for that process, which is really, really cool.

You can also create labels. Here is a process called task1, convention might have had that in capitals, and we've labelled it long. In the configuration file we can then use the process selectors withLabel: short and withLabel: long. Because we've labelled this process long, it's going to get 8 CPUs, 32 gigabytes of memory, and queue omega, being a different queue on the Slurm cluster. So again, very simply, we can create really dynamic and interesting and, I think powerful is probably the word for it, ways of tuning what we want to happen with a process, and apply it potentially at scale. If you have a pipeline with 40 different processes and you think, okay, I'm going to create short, medium and long, or single-CPU, or anything else like that, you can add these labels and apply the settings systematically across all of your processes, instead of having to go through with withName and manually change each process.

So I think that makes it even more powerful; it's another step on top. Again, you can put this in profiles, which work very similarly to configuration: you might have profiles for standard, cluster or cloud. Profiles are probably best thought of as, say, per-institution or per-environment setups: do this locally, on the cluster, or in the cloud, and then you add the specific parameters that are important for each execution. For example, if you have a genome file, it's in this location for a standard local execution, but if you're on a cluster it's in a slightly different location, so you set that per profile, and for a cluster profile you might also set the executor, say Sun Grid Engine or Slurm, plus some information about the queue, memory, process settings, and Conda environments. So I guess what I'm trying to say is that profiles are a way of setting up different, well, profiles, which you select with the -profile flag, which we haven't really talked about yet, but it works very similarly to passing a config: you create a profile, run with -profile standard, and it knows to use those parameters and processes for executing your pipeline. A sketch of a labels-plus-profiles config follows, and here's an example of running it:
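A hedged sketch combining labels and profiles; the paths, queue names and executors are illustrative, not the training's exact values:

```nextflow
// in the script: label the process once...
process task1 {
    label 'long'          // resources come from withLabel: long in the config

    script:
    """
    sleep 60
    """
}

workflow {
    task1()
}
```

```groovy
// nextflow.config -- resources per label, plus profiles for different sites
process {
    withLabel: short {
        cpus   = 1
        memory = '2 GB'
        queue  = 'alpha'
    }
    withLabel: long {
        cpus   = 8
        memory = '32 GB'
        queue  = 'omega'
    }
}

profiles {
    standard {
        params.genome    = '/local/path/ref.fasta'
        process.executor = 'local'
    }
    cluster {
        params.genome    = '/shared/data/ref.fasta'
        process.executor = 'slurm'
        process.queue    = 'long'
        process.memory   = '10 GB'
    }
    cloud {
        params.genome    = 's3://my-bucket/data/ref.fasta'
        process.executor = 'awsbatch'
    }
}
```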
nextflow run script7.nf -profile cluster, and it's sent everything off to the cluster. Yep. So cloud deployment isn't something I think we'll go into now; it might be important for you, but just for time, because I think we've only got five minutes left, I'm not going to talk through cloud computing. I can assure you that there are integrations for the major cloud platforms, AWS, Microsoft Azure, and a handful of others, and you can use Nextflow to automate a lot of the work of pushing jobs to the cloud and integrating with the storage systems, for example. It's all really powerful, and Nextflow definitely makes it a lot easier. What I did want to touch on, which I think is particularly cool, is the idea of hybrid execution: you can say, I want to execute this part locally, but this other part is really time-consuming and I don't have the capability locally, and push it up to the cloud using process selectors and queues. Most of that is covered here under process selectors and AWS Batch; there's a bit about launch templates there too if you want more information, just scrolling back through. I guess the bottom line is that cloud deployment is also very powerful, and many of the things I've just talked about for fine-tuning your deployments apply to the cloud as well, so you can set all of this up in the same way. And I do think it's really important, especially if you're using a cloud platform, because if you get it too far wrong, as anyone who's done cloud computing before will realise, it gets quite costly quite quickly, just from asking for the wrong resources for a job.

Okay, so with about four minutes left I don't really want to start anything else. Tomorrow you're going to have Phil, Harshal and Evan talk a little bit about Nextflow and nf-core and some of the newer content and ideas happening within those communities. What will be covered is making sure everyone knows how Slack works, even though I think most of you are expert Slack users already, using GitHub within nf-core, creating pipelines, some of the things happening with modules and the nf-core schema, and Evan will talk a little about Tower, a product from Seqera, and how, once you've got your pipeline, Tower can really simplify execution both locally and on the cloud.

So with that, I think we'll wind up for today. If there are any questions, please continue to drop them into the Slack channel; James and I and a few others will continue to monitor it. If there's anything that hasn't made sense, or you want some more information about anything, please just reach out. I do want to stress that we cover a lot of material really quickly as part of this workshop, so if things don't make sense, I'm really happy to explain them again, hopefully in a way that makes more sense, or to point you to the right resources to help you understand. So with that, we'll wind up for today. Thank you again for attending, and we look forward to seeing you tomorrow.