Welcome to the second chunk of the Nextflow Advanced Training, September 2023 edition. In this first chapter we're going to talk about Groovy imports: pulling in helpful classes from the Groovy and JVM ecosystem to accomplish small tasks and help with a few pieces inside your Nextflow workflow. Let's get started talking Groovy imports and JSON parsing. As with the other chapters, we cd into the chapter directory, this one called advanced/groovy. Looking around with tree, we see a main.nf and a modules directory, and under modules/local/fastp we have a main.nf, which is a fastp module. In the top-level main.nf we include that module on line one, and then we have a very simple little workflow: we define an input parameter with a default, create a channel with Channel.fromPath on params.input, split it with splitCsv, and view the output. Note that this is not a file on my local file system; it's being pulled in remotely. Nextflow speaks a lot of different protocols: HTTPS, S3, Azure blob storage, GCP, FTP, all of those. What about the main.nf inside modules? It's a really simple fastp module: it defines a container to run in, expects an input of a meta object plus reads (which might be a list or a single path), and defines two output channels, a reads channel and a json channel. Let's run the workflow as it exists now, just to make sure the CSV can be pulled in and parsed correctly. Great, we get maps — our rows are parsed into maps. I'd rather reshape those, so let's write a small closure to turn each row into our familiar form: a meta map plus a list of paths. We take a row and say meta equals row.subMap(...), which is what we learned earlier. Let's double-check that works: do we have the sample and strandedness keys? Strandedness isn't coming through correctly — it's probably just misspelled — okay, fixed. Great, we have our meta objects with two keys. The trick is that we've now precluded the possibility of adding additional columns, because we're specifying exactly the columns we want. Let's say we wanted to make the workflow a little more flexible, so that extra columns in the CSV can be passed through without modification; written like this, we're being very prescriptive and only ever pass through these two keys. What we'd like instead is to read the row, find all of its keys, subtract the keys we don't want, and retain everything else. The keys we don't want to retain are the ones that begin with fastq — fastq1 and fastq2. We can grab all of the keys with the row.keySet() method; I'll show you what that looks like — the key set for each row is, for example, sample, fastq1, fastq2 and strandedness. If we then add the split method, it divides that key set into two sets: those that pass the test — here, the keys matching a regular expression beginning with fastq — and a different set of keys that do not pass the Boolean test, that is, those that don't start with fastq.
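To make that concrete, here's a minimal sketch of the idea — the regular expression follows what's described above, and the variable names are just illustrative:

```groovy
// Sketch: split each row's keys into fastq-style keys and everything else.
// Groovy's Collection.split returns two lists: elements matching the closure, then the rest.
workflow {
    Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            def (readKeys, metaKeys) = row.keySet().split { it =~ /^fastq/ }
            [ readKeys, metaKeys ]
        }
        .view()
}
```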
That is, sample and strandedness. I'll show you what that looks like if we just remove the destructuring: for each element we have a list containing the keys that match our test and the keys that don't. We destructure those into a list we call the read keys and a list we call the meta keys. I've supplied links here to the keySet method in the JVM documentation, and to the split method I just described, which divides a collection in two based on a closure returning a Boolean. From here we generate the reads: we pass the read keys into subMap to pick out just those columns, collect over the values, and return a file object for each, so reads becomes a list of path objects ready to pass along (we'll pull the whole closure together into one sketch shortly). Now we just need — actually, let's test it first, because I think this will return an error. Unfortunately it does: "argument of file cannot be empty". If we have a closer look at the sample sheet, not all rows contain two fastq files; some contain only one, with the fastq2 column left empty, which means we're passing a null value into the file() method, and that doesn't work. We need a way of accommodating that. So after taking the subMap of the read keys and its values, we add one extra method to the chain: findAll, which acts like a filter. We find all the elements that are not the empty string and return those, so we're excluding columns where the row's value is empty, and then collect them again with the file() method — which now works. Fantastic: if there are two reads we pass a list of two, and if there's only one read we pass a list of one, which is perfect. Now we need to construct the meta map. We have our reads object, and we have the metaKeys value — the list of keys that don't match fastq, which are all the keys we want in the meta map. If we remind ourselves of what we're shooting for, we want to construct a channel whose elements match the module's input pattern: a meta object and then path reads. Looking inside the module, it expects two extra keys, an id key and a single_end key, and we want to be able to pass those through. We want a channel that can be piped straight into this FASTP process. Let's get cracking. A simple start is taking the subMap of the meta keys; if we view that — and remember that the closure returns its last expression, which in this case is the meta object — okay, great: our meta objects contain the keys sample and strandedness, and if we added an extra column it would appear inside these meta maps too. Now we want to add the two extra keys, single_end and id. Here I'm introducing another handy little piece of Groovy syntax: the ?= operator.
What ?= means is the same as writing meta.id = meta.id ?: meta.sample — the Elvis operator, a shorthand ternary. It says: if meta.id already exists, use it; otherwise use meta.sample. So if there's a column in the sample sheet with the header id, we use that as meta.id, basically leaving it unchanged; if it's null, we fall back to meta.sample, using the same value for sample and id. That leaves one more key: single_end. Have a look at the reads object — remember from earlier that if reads contains two elements we have paired-end reads, and if it contains one element we have single-end reads, so setting single_end from the size of reads should just about do us. We pipe into view before piping into fastp, just to double-check everything looks okay. Right: strandedness auto, single_end false, false, and single_end true. Perfect — for the samples where we only have one fastq file, single_end has been set to true. So now we have a channel that matches the cardinality of the input channel of our FASTP process, and we can pipe it straight in. We'll make one more modification and set docker.enabled = true, so we can use the container definition in the FASTP process. Of course the FASTP process has two output channels, so we can't simply view everything. It'll take a minute or so to run. Note that I'm always using the -resume flag to pick up cached results wherever possible — that's going to be important, particularly now: I can rerun with -resume and it picks up the cached results without redoing any calculations. Perfect, seven cached tasks. Let's have a look at these outputs: we have our meta map, fantastic, and then these fastp JSON reports. Looking inside one of them, there's definitely some really interesting information we might want to pull out and supply to downstream processes — maybe we want to pull pieces of it out and include them in our metadata objects. But we don't necessarily want to write a JSON parser ourselves; we suspect there's probably a JSON parser available inside Groovy or the JVM, so let's try to use existing code rather than writing our own.
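Before moving on, here's the row-parsing closure from the last few steps pulled together into one sketch — the key names (id, single_end) and the general shape follow the lesson, but treat it as a sketch rather than the canonical solution:

```groovy
// Sketch: turn each CSV row into the [ meta, reads ] shape the FASTP module expects.
ch_input = Channel.fromPath(params.input)
    .splitCsv(header: true)
    .map { row ->
        def (readKeys, metaKeys) = row.keySet().split { it =~ /^fastq/ }

        // Keep every non-fastq column, then add the extra keys the module needs.
        def meta = row.subMap(metaKeys)
        meta.id ?= meta.sample                  // only set id if it isn't already present

        // Drop empty fastq columns (single-end rows) before turning values into paths.
        def reads = row.subMap(readKeys)
            .values()
            .findAll { it != '' }
            .collect { file(it) }

        meta.single_end = reads.size() == 1

        [ meta, reads ]
    }
```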
Before we actually get into the JSON parsing, let's cover publishing files. To make our lives a little more convenient, let's add a publishDir directive to the FASTP process. We made a small modification to nextflow.config earlier to enable Docker; let's add one more. Inside a process block we can use the withName selector to apply a directive to just those processes matching the name FASTP — the name of our fastp process. We add the publishDir directive with a path key, saying: publish the outputs of this process to a directory called results/fastp/json. We also use the saveAs argument to publishDir, which takes a closure that is evaluated for every file output by the process: it receives the file name of the output and returns the name you'd like to publish the file under — so if you want to rename a file on publishing, saveAs is how you do it. Here we take the file name and ask whether it ends with .json: if it does, we reuse the file name, and if it doesn't we return null. We're basically using saveAs as a filter, so only the JSON files are published to results/fastp/json. Let's try that — and I'll pause for a moment for people to catch up. You can see now that in our results/fastp/json directory we have symlinks to each of the JSON files, which will let us iterate quickly on the JSON parsing without waiting on fastp caching.

So: we know there are some interesting things inside these JSON files that we'd like to capture and use downstream. There's a link here to the Groovy documentation, specifically around JSON parsing, and it notes that we need to import the JsonSlurper class. I can do that by putting import groovy.json.JsonSlurper at the top of my main.nf. I'm also going to introduce the idea of a second entry point. Entry points let you have different workflows within the same main.nf, which can be very handy for testing. This one builds a channel from the published fastp JSON files and views them, just to make sure I'm capturing them at all. Because this second workflow has a name, I run it with nextflow run . -resume -entry jsontest — after the -entry argument I supply the name of the workflow I'd like to run. Perfect: it returns a channel with each of our JSON files. Let's create a small function at the top of main.nf that will be responsible for parsing the JSON. We'll call it getFilteringResult; it takes a JSON file, constructs a new JsonSlurper object, and calls parseText on jsonFile.text — the .text method is a really handy little Groovy feature for reading a file and returning its entire contents as a string — and that returns the parsed fastp result. I'm going to pause here for a small exercise: given that the fastp result returned from this method is a large map, have a go at modifying getFilteringResult to return just the after_filtering section of the report. If you look at one of these files you'll see a small section called after_filtering; how would you modify the function to return just that section? I'll give everyone about five minutes. There's a solution here, but try it yourself before checking. See you in five minutes. Okay, great, welcome back — I hope you had a go at that small exercise. There are obviously a number of different ways you might attempt this, but here's one potential solution. Let's double-check by writing a little map step so we can inspect the results. At the moment I'm returning the whole fastp result, so each JSON report gets parsed into one big map — it'll be quite verbose, but let's see what happens. With no filtering, we get the entire JSON object back as a Groovy map.
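For reference, the two pieces just described might look roughly like this. First, the selective publishing added to nextflow.config, assuming the process is named FASTP:

```groovy
// nextflow.config — sketch of publishing only the fastp JSON reports.
docker.enabled = true

process {
    withName: 'FASTP' {
        publishDir = [
            path: 'results/fastp/json',
            // saveAs acts as a filter here: returning null skips publishing that file.
            saveAs: { filename -> filename.endsWith('.json') ? filename : null }
        ]
    }
}
```

And second, a sketch of the parsing function in its first, whole-report form — the function and entry-point names follow the lesson as I heard them, and the glob is an assumption based on the publish directory above:

```groovy
import groovy.json.JsonSlurper

// Sketch: parse a fastp JSON report into one big Groovy map.
def getFilteringResult(json_file) {
    new JsonSlurper().parseText(json_file.text)   // .text reads the whole file as a String
}

// A named entry point for quick testing: nextflow run . -resume -entry jsontest
workflow jsontest {
    Channel.fromPath('results/fastp/json/*.json')
        .map { json -> getFilteringResult(json) }
        .view()
}
```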
That's a little verbose — a little too much information — so let's have a closer look at this map. There's a summary key, and inside summary we have after_filtering. If we just want the summary we can return fastpResult.summary, which gives a smaller object — not quite as verbose, so we're getting there. Next, inside summary we want the after_filtering block, and we can use Groovy's dot notation to step inside each of these elements, leaving just the after_filtering results, which is what we want: the total number of reads, the total number of bases, the number of Q20 bases, all that sort of thing. Let's say we want to use that to enrich our meta map. This is pretty close — it's good, but slightly dangerous: we're assuming the summary object will always exist and the after_filtering object will always exist. We'll run into problems if the fastp code changes and the JSON layout is rearranged, because if summary moves or is renamed we'd be calling .after_filtering on null, which raises an error. Fortunately there's a safe accessor: instead of a plain dot we can use ?., which means that if summary is null the call is not made and we simply get a null object back early, but if it does exist we call after_filtering on the non-null object. The last thing we can do is skip assigning the parsed result to a name altogether and just chain it: new JsonSlurper().parseText(jsonFile.text)?.summary?.after_filtering. Fantastic. So we're passing each of these JSON files through our getFilteringResult method and getting back a nice map with the statistics from fastp. Now we can use this to join back to our original channel. Instead of taking the JSON from fastp as-is, we map it into meta plus the parsed JSON — and we're back to editing the main, unnamed workflow, so I can drop the -entry flag. We've taken the json output channel from FASTP and turned it into a channel holding the meta map plus a map of extra metadata pulled out of fastp. So now we have two channels: the one we just built, and FASTP.out.reads. At some point we want to join those back together, and we can use the join operator — a relatively new operator we haven't talked about yet. It's sort of like groupTuple, but for joining two different channels on a common key: whereas groupTuple groups elements within a single channel that share a key, join matches up elements from two channels that share a key. I pipe into join, and because FASTP.out.reads also carries the metadata object — if we look at the FASTP module, the reads output passes the meta object through unchanged — we'll be able to join on that key. Fantastic. So now each element of the channel holds the metadata object, then another map describing the fastp statistics, and then the fastq.gz files — the reads from fastp.
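Putting the safe-navigation version and the join together, a sketch — assuming the JsonSlurper import from before, and that the fastp module's output channels are named json and reads:

```groovy
// Sketch: hardened parser plus joining the parsed stats back onto the trimmed reads.
def getFilteringResult(json_file) {
    new JsonSlurper().parseText(json_file.text)
        ?.summary
        ?.after_filtering          // safe navigation: returns null instead of erroring
}

workflow {
    // ... ch_input and FASTP(ch_input) as before ...
    FASTP.out.json
        .map { meta, json -> [ meta, getFilteringResult(json) ] }
        .join( FASTP.out.reads )   // match elements on the shared meta key
        .view()                    // elements look like [ meta, fastp_stats, reads ]
}
```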
So these are the filtered, trimmed reads from fastp, joined back up with the metadata from the fastp report, parsed into a more convenient form. There's one last exercise before we end this chapter: can you amend the pipeline to create two channels, filtering the reads to exclude any samples whose q30 rate is less than 0.935? We want one channel of reads that pass the filter and another channel of reads that fail it. You'll probably want the branch operator that we introduced earlier in the workshop. I'll give you five or so minutes to attempt this — we'll see you soon. Welcome back. Here's one potential solution; there are a couple of different ways of doing this, but I think this one is neat. We've joined our fastp reads with our metadata, so we have a channel holding the metadata, the fastp metadata, and the reads. It's a bit unnecessary to carry two metadata objects — one from the sample sheet and one from fastp — so let's merge them: we map over meta, fastp meta, and reads, and join the two maps together with the plus operator. Great: now we have one big metadata object containing all the pieces from both maps, followed by the reads. Now we need the branching. The branch operator produces two output channels: a pass channel for samples whose q30 rate clears the threshold, and a fail channel for samples whose reads do not. As with the branch examples earlier in the workshop, I give the closure named parameters before the arrow, and then, for each output channel, a name followed by an expression returning a Boolean. Because the fastp statistics are already merged into the metadata object, we can just pull out the q30_rate key and ask whether it's greater than or equal to 0.935, with fail: true as a fall-through so that any sample that does not pass the test automatically goes into the fail output channel. We'll call the result reads. Let's see if any reads fail — no reads fail QC, which looks like great news — and all of the reads that pass QC are there. If we want to print the q30 values of the passing reads, we can view reads.pass with the metadata and its q30 value... whoops, I typed q30_bases when I meant q30_rate. Excellent — now that I've corrected my typo, I can see we actually have two samples that failed QC, with a q30 rate of less than 0.935. So now we can take these two output channels, reads.fail and reads.pass, and send them down different routes through our graph. This is one of the great strengths of Nextflow: you can dynamically determine the trajectory of any particular dataset through the graph depending on the results of previous tasks — diverting data through the graph based on results like this fastp quality-control test.
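Here's the branching solution as one sketch, starting from the joined channel above — ch_joined is just a stand-in name for it, and 0.935 is the threshold from the exercise:

```groovy
// Sketch: merge the two meta maps, then branch on the fastp q30 rate.
ch_joined                                      // [ meta, fastp_stats, reads ]
    .map { meta, fastp_meta, reads -> [ meta + fastp_meta, reads ] }
    .branch { meta, reads ->
        pass: meta.q30_rate >= 0.935
        fail: true                             // fall-through for everything else
    }
    .set { reads_checked }

reads_checked.pass.view()                      // samples that continue through the pipeline
reads_checked.fail.view()                      // samples that did not clear the threshold
```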
You might dump the failed results into a directory, perhaps warn the user that some reads did not pass QC, and then send reads.pass — the good data — through the rest of the pipeline. I hope that's helpful. Let's move on to the next chapter.

Inside a Nextflow workflow repository there are a couple of special directories that are treated differently from the others: the bin directory, the lib directory, and the templates directory. The bin directory in particular is extremely useful for a lot of Nextflow workflows and heavily used in the nf-core community. I'm going to go over some of the details of each and show you how they might be used. Let's get started. We'll begin with the bin directory, because it's the most important and the most useful for most Nextflow workflows. Let's cd into advanced/structure and have a look around. There's a little main.nf, which isn't important at the moment. But let's say I have some small accessory scripts I'd like to distribute with my workflow — a little Python script, a bash script, an R script, any interpreted language; a small file that needs to travel along with the workflow, perhaps used inside some of your processes for data cleanup or data munging. Needing these small accessory scripts is not an uncommon situation, and there are a couple of different ways you might supply them and make them available inside your Nextflow tasks. The first is to bundle them with your Docker containers, so they're versioned inside the containers. But for small scripts that may need updating semi-regularly, it's a bit of a pain to check them into version control, rebuild your Docker container, upload it to your registry, and then test your workflow. Fortunately Nextflow has a way to bundle those accessory scripts with the workflow itself inside version control, so the script and the processes and tasks that use it all move in lockstep under the same revision control. Let's do an example: say I have a small R script that does some work — the work it does is not particularly important. I make a bin directory in the root of my Nextflow workflow, and inside bin I create an R script called cars.R. It's a dumb little script that loads tidyverse, produces a plot, and writes two files, a PNG and a TSV; the specifics don't matter — it just uses the mpg dataset bundled with tidyverse. Now let's say we'd like a process that uses this script. I make a little process called plot_cars, using a particular container that has tidyverse installed. The script has no input, but it does have outputs: it emits one or more PNGs in an output channel called plots,
and the TSVs in an output channel called table. The script block just calls cars.R — it could be slightly more complicated, but a one-liner is enough. Note that I'm not worrying about where that cars.R script is located; I just call it as if it were already on the PATH. This works on my local file system, on my laptop, and it also works if I'm running at scale in the cloud — I don't have to care where the R script lives, in exactly the same way that inside my processes I don't care where the input data comes from or what its path is. Nextflow takes care of all of that for me, and in the same way it takes care of adding the bin directory to the PATH, so I can just call cars.R. One very important thing I haven't done yet is make cars.R executable. If I run ls -lh bin, I can see cars.R, but it isn't executable. Nextflow takes care of shipping the bin directory to the virtual machines or tasks and adding it to the PATH, but I still need to take responsibility for making the scripts executable. I do that with chmod +x bin/cars.R, and now, when Nextflow adds bin to the PATH, I really can just call cars.R. Let's run this and see what happens. The last thing to do is modify the configuration to use Docker: in the current directory I create a nextflow.config file — actually, let's put it in a profile. I create a new profile called docker and set docker.enabled = true inside it, and now I can run nextflow run . -profile docker. Nextflow will make sure the rocker/tidyverse:latest container is available locally and run the process; it'll take a minute or so to download. While that's happening, I want to make an important note about the shebang line. I've used #!/usr/bin/env Rscript, and this is really important for portability. On my laptop, Rscript is already installed in a particular location — in fact it's somewhere under Homebrew's bin directory. But if I want my workflow to be as portable as possible, I'd like to be able to run it in the cloud, and there I might be running this R script locally or inside Docker, and inside a Docker container the Rscript binary might live somewhere else entirely. One option would be to hard-code the path here: if I know that in this particular container Rscript is located at a given path, that would work for that container. But if someone else wanted to use my workflow and update the container to a slightly different version of tidyverse, and that container had the Rscript executable at a different location, the shebang line would fail. That's why it's particularly important to use a portable shebang, ideally via /usr/bin/env — and the same goes whether you're calling Rscript or a Python interpreter. Whatever your interpreter is, make it as portable as possible, because this script may be executed in different contexts: in a container, in different containers, locally, or inside a conda environment.
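As a sketch of what this section builds — process and file names as I've written them here; the cars.R script itself just needs to be executable and to start with #!/usr/bin/env Rscript:

```groovy
// main.nf — sketch of a process that calls the bundled bin/cars.R as if it were on the PATH.
process plot_cars {
    container 'rocker/tidyverse:latest'

    output:
    path '*.png', emit: plots
    path '*.tsv', emit: table

    script:
    """
    cars.R
    """
}

workflow {
    plot_cars()
}
```

And the profile added to nextflow.config, so the container is actually used when running with nextflow run . -profile docker:

```groovy
// nextflow.config — a docker profile enabling container execution.
profiles {
    docker {
        docker.enabled = true
    }
}
```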
I can see that my task has finished and the process is done, so let's double-check it actually ran by viewing the PNG output channel — fantastic — and the same for the table. So what I've demonstrated here is the ability to take a small accessory script, check it into version control so it moves in lockstep with my Nextflow workflow versions, and call it as if it were already on the PATH in every potential context — and Nextflow takes care of making that true. Let's have a little look behind the scenes at what Nextflow is actually doing. The task ran in work directory 46/5c..., so I cd into that directory and open it in my editor. This is that single task's working directory — remember that Nextflow executes every task in its own individual work directory, keyed by the hash of its inputs. In there I can see the .command.run file, which Nextflow writes to take care of setting everything up: staging the data, modifying the PATH, making sure all of the input data is available. It's worth occasionally reading through .command.run to see what's happening behind the scenes. Inside it there are a couple of important bash functions, one of which (nxf_container_env) exports a PATH with the workflow's bin directory added — that's how the script ends up callable. This works on the local machine; if I were operating in the cloud, Nextflow would first copy the bin directory across to make it available on the virtual machine and then add it to the PATH. .command.run also handles staging input data in and out — with ln -s on a local file system, or by copying it in via the cloud command-line tools when you're running in the cloud. And again, here is a warning to be sure to use a portable shebang line in your own scripts.

The second special directory treated in a particular way by Nextflow is the templates directory, in the root of your workflow repository. You may have situations where a script block is becoming quite long — a long bash script, or a small Python script that started small and grew over time — and you feel it has grown beyond a few dozen lines and is starting to dominate the main.nf of your workflow. You can use the templates directory to spin that script out into a separate template file. I'll show you what that looks like. Instead of our plot_cars process, we make a new process called say_hi, using the debug directive we talked about last time to echo each task's standard output to the Nextflow console. The process expects a channel supplying a name, and instead of writing a script string inline, I call the template function and supply the path to a template.
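A minimal sketch of that process shape — the template file name here (say_hi.py) is a stand-in for the one used in the lesson:

```groovy
// Sketch: the script body lives under templates/ instead of inline in main.nf.
process say_hi {
    debug true

    input:
    val(name)

    script:
    template 'say_hi.py'        // resolved relative to the workflow's templates/ directory
}
```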
You can see that in templates/ I have a dumb little Python script — but it's not quite pure Python, because I'm using dollar-brace string interpolation, which is Nextflow-native interpolation: it behaves exactly as if I had written this in a script block. So instead of writing it inline in a script block — imagine it had many lines and was becoming overwhelming — we can pull it out. First, just to show that it works, let's run it in script-block form with each of these names; after fixing a small typo on my part (a missing closing brace), great, we get a hello for each of the dog names. Now, instead of keeping this in a script block, let's use the template function: the Python file lives under templates/, and I can simply replace the script block with a call to template naming that file. Perfect — you can see we get the same results, a hello for each of these names, because the template has been pulled in and dropped in place of my script block.

The last special directory I'd like to talk about is the lib directory, and this one is slightly more complicated. You might remember that in the last chapter we added a small helper function to main.nf — the JSON parsing function — and for small functions like that, adding them to main.nf is fine and totally reasonable. But sometimes those functions grow in complexity, and it can be helpful to bundle them into your own Groovy class. You can put Groovy classes in the lib directory and make them available for use throughout your workflow, both inside main.nf and in any imported modules. The reasons you might want this are many and varied: for example, the nf-core/rnaseq workflow has at least five different Groovy classes defined in its lib directory for all sorts of utility and miscellaneous accessory tasks. I've provided the link to nf-core/rnaseq here, and it's worth having a quick look through those example classes. Most of the ones in the nf-core template are simply there to run early in the workflow and provide utility functions — parameter handling, help text, fancy formatting, that sort of thing — but the lib directory can also be used to provide functionality inside the workflow itself. Let's do an example of that and make a Metadata class. This is just an example problem: in almost all cases I'd recommend you use a simple map for passing metadata — in almost all cases that's sufficient — but I want to use it to show how you might add a class to your Nextflow workflow. Inside lib, we make a new file called Metadata.groovy, and inside it a new class. Classes in Groovy begin with the class keyword and the class name, and because this is the JVM and object-oriented, you can extend existing classes: I'm going to make a class that extends HashMap and add a single new method called hi, which just returns the string "hello workshop participants". So you can think of it just like a normal map — just like the many maps we've been using — except that it has one extra method, hi.
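As a sketch, the class might look like this:

```groovy
// lib/Metadata.groovy — behaves like an ordinary map, with one extra method.
class Metadata extends HashMap {
    def hi() {
        return "Hello, workshop participants!"
    }
}
```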
Let's use this in our workflow. I'll clean this up and get rid of the script example. I'm in Montreal, so I'll make a new channel containing a city name — I'm not going to use it just yet. I use the new keyword to construct a new instance of our Metadata class and call the hi method on it. Let's see what that looks like when we run the workflow: remember that hi returns a string saying hello to the workshop participants — great, that's exactly what we see here. So we've made a new class called Metadata, in a file called Metadata.groovy inside the lib directory, and because Nextflow ensures that the lib directory is added to the classpath, I can simply call new Metadata() anywhere inside my workflow and get a new instance of that object. At the moment we're just making a channel containing the city name without actually doing anything with it, so let's modify the Metadata class to take a name. We add a constructor that takes a string — we'll call it location — and sets an instance variable, so location becomes available inside any of our methods. Now, instead of returning a plain hello-workshop-participants string, we can customize the message by location: if the object has a location, say hello from that location; otherwise say hello to the workshop participants. The last thing to do is pass the city name into the constructor when we create the Metadata — and, ah, I forgot to save the file, apologies — perfect: hello from Montreal. Always remember to save your files before executing your workflow. We can also use this method inside a process, passing the object through Nextflow channels. So we have our custom class — I'll just call the input value meta — and the process echoes meta.hi(): Nextflow resolves that call on our meta object, which returns the hello-from-Montreal string — or hello from Montreal and hello from Boston — and pipes it into tee, so it's echoed to standard output and also written to out.txt. We add debug true so we can see it happen on the console, and we run it — oops, we forgot to actually wire it through: create the new Metadata objects and pass them into the use_meta process — and you can see we've written hello from Boston and hello from Montreal. If we want, we can also view the output of that process, which is just the out.txt we defined as an output, and double-check that it contains hello from Montreal and hello from Boston. So why would you ever want to do this? Most of the time you won't need to, but there are occasions where it can become very helpful.
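Before extending it further, here's the state of play so far as a sketch — the location-aware class plus a process that calls its method; process and variable names are as I've written them here:

```groovy
// lib/Metadata.groovy — a constructor sets a location used by hi().
class Metadata extends HashMap {
    String location

    Metadata(String location) {
        this.location = location
    }

    def hi() {
        return location ? "Hello from ${location}!" : "Hello, workshop participants!"
    }
}
```

```groovy
// main.nf — pass the custom object through a channel into a process.
process use_meta {
    debug true

    input:
    val(meta)

    output:
    path 'out.txt'

    script:
    """
    echo '${meta.hi()}' | tee out.txt
    """
}

workflow {
    Channel.of('Montreal', 'Boston')
        .map { city -> new Metadata(city) }
        | use_meta
}
```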
For example, let's say you have some metadata in a map-like object and you'd like to add methods to it — say, to pull out the prefix of an adapter. If your metadata has an adapter key and you want the start of the adapter, you could add a method like getAdapterStart, which just pulls out a substring: the first three characters of the adapter. It's a very simple example, but you can imagine adding extra methods to the metadata class to pull out specialized pieces of information or calculate derived values. So in our Metadata class, in addition to the hi method, we add getAdapterStart: if there's an adapter key in our map, return the first three letters. We create the new Metadata objects with this adapter key included, and inside the use_meta process we echo both the adapter key as normal and the result of getAdapterStart — and in out.txt we see that the prefix is AAC, which is correct. As we described earlier in the workshop, these get-methods can be shortened to property style, but in the context of a map it's probably safer to use the full method name. So far we've only demonstrated simple methods that pull apart existing data, but you might even want to reach out to external services — perhaps a LIMS, an API, or some other tooling inside your infrastructure. Let's add a getSampleName method to Metadata.groovy. To do that I need one extra piece of Groovy: I import the JsonSlurper class, just as we did earlier for the JSON parsing example, because in getSampleName I'm reaching out to an external URL — here it's just a postman-echo endpoint, but you can imagine it being any piece of infrastructure that communicates via HTTP, an API or a sequencing machine. I open a connection to the URL, get the response, check that the response code is 200 — that the response is okay — parse the JSON response into a map, and pull out args.sampleName, a piece of that response. It's a very simple, dumb example, but you can imagine doing arbitrarily complicated things inside methods like these. Obviously it's important not to go too overboard — you don't want to be doing real computation inside these classes and methods — but it's sometimes very handy to have that ability. Now if I modify the use_meta process to include meta.getSampleName() in the script, then when Nextflow constructs the task and writes .command.sh it actually calls getSampleName on the meta object, and I can see it has reached out to the web, called the URL, and pulled the sample name out of the response.
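A sketch of those two accessor methods — the URL is a stand-in (postman-echo is a public echo service) and the exact response handling is an assumption; swap in whatever LIMS or API you actually talk to:

```groovy
// lib/Metadata.groovy — accessor methods on the custom class.
import groovy.json.JsonSlurper

class Metadata extends HashMap {
    // ... constructor and hi() from before ...

    def getAdapterStart() {
        // First three characters of the adapter, if the key is present.
        def adapter = this.get('adapter')
        return adapter ? adapter.take(3) : null
    }

    def getSampleName() {
        // Stand-in endpoint: postman-echo reflects query parameters back as JSON.
        def url = new URL('https://postman-echo.com/get?sampleName=Sample1')
        def connection = url.openConnection()
        if( connection.responseCode == 200 ) {
            def response = new JsonSlurper().parseText(connection.inputStream.text)
            return response.args.sampleName
        }
        return null
    }
}
```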
Now, there is a caveat. When we start to pull custom classes through Nextflow, we need to understand a little more about the caching mechanism. When each task is run, Nextflow calculates a unique hash based on the task's inputs: for a file or path input, by default it takes the path string, the last-modified date, and pieces of metadata like that; for a value like a map, it takes the hash of that map; and it combines those pieces into a hash for the task as a whole. But adding methods to our class does not change the hash of the value being passed in, which can be dangerous: we might change the way getSampleName works, yet that won't change the hash for the task, so resume will pull up cached results when perhaps we don't mean it to. So for the next five minutes or so, have a go at this exercise: can you show that changing a method in the Metadata class does not change the hash of the task? We'll be back in five minutes or so. Okay, great, welcome back — here's one example of a solution. Say we run with -resume and pick up cached results, as expected; then we change the behavior of one of these methods, for example getAdapterStart, so that instead of grabbing the first three bases it grabs the first five. If I run nextflow run . -resume, naively you might imagine this will change the task cache — you might anticipate that it changes the script, because getAdapterStart is called inside the script — but keep in mind the inputs to the task are just the values themselves. In this case, because we're extending HashMap, that's just the keys and values inside the hash map, which haven't changed; only the accessory method has. So even though I've changed the method, Nextflow still picks up the cached tasks. The cache key will only change, and the hash be recalculated, if I change the values inside the hash map — if I do that, Nextflow recalculates the hash, the cache misses, and the new getAdapterStart method gets used. But changing the methods in and of themselves does not change the hash. In this example I've shown you extending an existing class, HashMap, but we can also go on creating entirely new classes. For example, let's create Dog.groovy with a Dog class — a sort of record object containing a String name and a Boolean isHungry. We simplify our workflow: create a new Dog whose name is Fido, log "found a new dog" with log.info, and then show the dog. Oops — we run this — fantastic: found a new dog, and then the string representation of the Dog object. We can pass objects through channels, so if we change the workflow to have a channel of dog names, map over it with a closure that creates a new Dog object for each name, and view the outputs, we see our three new dogs. But at the moment we're missing something before this can be used inside a Nextflow task with caching behaving correctly. So for the next five minutes I'm going to pause: run the exercise and show that the Dog class is not cached when resuming a workflow. We'll see you in five minutes. Okay, welcome back. To demonstrate the problem, make a process that uses one of these dogs — dogs have a name property, which I can address directly — pass each of them to a pat_dog process, and run it three times. Now if I resume without changing anything, note that no caching has been used for these processes. To fix that, we need to turn Dog into a value object. Nextflow very helpfully provides a decorator to serialize these custom classes: by adding the @ValueObject annotation, the properties here — the name and the isHungry value — are used as the elements that get hashed to provide the inputs to the task. The last thing we need to do is register the class with Kryo, which is the serialization library; I do that inside main.nf, ideally at the top, by importing KryoHelper and calling KryoHelper.register(Dog). So now, the last exercise for this chapter: show that the Dog class — now that we've done the work of adding the @ValueObject annotation and registering the class with KryoHelper — can be used inside processes with caching working correctly. See you in five minutes.
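For reference, a consolidated sketch of the pieces just described. The @ValueObject annotation and KryoHelper.register call are as described in the lesson, but the import paths and the dog names here are my assumptions — double-check them against your Nextflow version:

```groovy
// lib/Dog.groovy — a small record-like class, marked so Nextflow can hash it for caching.
import nextflow.io.ValueObject          // assumed import path for the annotation

@ValueObject
class Dog {
    String name
    Boolean isHungry = true
}
```

```groovy
// main.nf — register the class with the Kryo serializer, then use it in a process.
import nextflow.util.KryoHelper         // assumed import path for the helper

KryoHelper.register(Dog)

process pat_dog {
    debug true

    input:
    val(dog)

    script:
    "echo Patting ${dog.name}"
}

workflow {
    Channel.of('Fido', 'Rex', 'Luna')   // illustrative names
        .map { name -> new Dog(name: name) }
        | pat_dog
}
```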
Welcome back. Let's try it now with the Dog class registered with Kryo: we pass the dogs to the pat_dog process, and the first run populates the cache; if we rerun the same tasks, we get three cached tasks. Now that we've registered Dog, Nextflow knows how to calculate a hash for a Dog object — because we're using @ValueObject, it takes the class's property values, the name and the Boolean, and uses those as hash inputs, so that if one of them changes we should see two cached tasks and one fresh task. Perfect. So that's how you might introduce custom classes into your workflows, for small utilities or for reaching out to external services. From here we're going to take a short break. If you're watching this live, now is a great chance, if you haven't already, to drop by the September 2023 advanced training channel on the nf-core Slack — myself and other volunteers are hanging out there, and if you have any questions we'd love to help, or just to chat about anything Nextflow or nf-core. You're of course very welcome to pop questions in all the way through the workshop. If not, we'll see you back here for more training in half an hour. See you soon.

Next we want to cover a little bit about Nextflow configuration. It might not be the most glamorous of chapters to finish on, but it's a particularly important piece of how Nextflow works, and something we think people can get wrong quite easily. We're going to talk about where configuration is set, the order of precedence, and some intermediate configuration options like dynamic directives and label selectors. Let's get into it. So, let's talk config. It's an aspect of Nextflow that can be a little confusing to some: there are multiple places to set configuration, which is very flexible — a great advantage — but can be a little intimidating for newcomers, and it raises questions like: at which location should I be setting a given configuration value, and what sorts of values and behaviors in Nextflow can I actually change through configuration? The first thing we need to cover is precedence; but before that, let's move into a new directory, advanced/configuration. So, the order of precedence. The way I think about it is that precedence is roughly in order of distance from the command-line invocation. Parameters specified directly on the command line — workflow parameters with the double-dash notation, or configuration options with a single dash — take the highest precedence: you're typing them right there on the command line. The next highest level is parameter files, that is, parameters supplied via the -params-file option: still supplied from the command line, but referred to by an option rather than typed directly. Its twin on the configuration side is the -c option — -params-file for parameters and -c for configuration, where parameters are values used by the workflow itself and config is configuration that affects Nextflow as a whole. The fourth level of precedence is a file named nextflow.config in the current working directory. This isn't referred to on the command line at all; it's implied — when nextflow run executes, it looks for a nextflow.config in the current working directory and picks up configuration and parameters from that file.
That's the fourth level of precedence. One level further away is the nextflow.config in the workflow project directory — for example a nextflow.config on GitHub, GitLab, Bitbucket, or CodeCommit, sitting in the root of the workflow, or any configuration file imported from it with includeConfig. Second to last is the configuration file in your home directory, ~/.nextflow/config: these values apply to every run for a given user, but at very low precedence. And lastly, the final configuration values are those specified by the workflow itself, in main.nf or any of the .nf scripts inside the workflow. So you can think of it as stepping away from the command line: from values typed right on the command line, to a file referred to by the command line, to a file in the current working directory, to a file on GitHub or a remote server, and finally to parameters and configuration in main.nf on that remote server. Why might you use some of these locations? System-wide configuration, for example, is really handy for values you know you want applied to every run on a given system. On this particular machine I have Docker but I don't have the bioinformatics tools installed, so if I run the rnaseq-nf workflow without containers it will certainly fail — fastqc, for instance, isn't installed. I'm always going to want to run inside Docker, so I put docker.enabled = true into ~/.nextflow/config, and now I can run exactly the same command, nextflow run nextflow-io/rnaseq-nf, and it pulls and uses the Docker containers for each of these processes. I didn't have to change anything on the command-line invocation, because options specified in ~/.nextflow/config apply to every run unless overridden by configuration at a higher level of precedence. Another example: you're on an HPC system that always uses Slurm as the executor and Singularity for containers. Your ~/.nextflow/config might include lines like process.executor = 'slurm' and singularity.enabled = true, and those values will be inherited by every run on that system. Now let's talk about overriding process directives. Process-level directives like cpus and memory — there's a long list of them in the Nextflow documentation — can all be overridden through configuration. As an example, let's create a new nextflow.config in the current working directory and set process.cpus = 2, saying that every single task I run in this working directory should get two CPUs. I can run the same nextflow run nextflow-io/rnaseq-nf command, and every task is allocated two CPUs unless otherwise overridden. We can make these configuration values more specific using process selectors: we can select on the process name, or on any labels defined by the workflow authors.
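As a sketch of those two locations, with the values described above:

```groovy
// ~/.nextflow/config — user-wide, low-precedence settings applied to every run.
docker.enabled = true

// On a Slurm + Singularity cluster you might instead keep:
// process.executor    = 'slurm'
// singularity.enabled = true
```

```groovy
// ./nextflow.config in the working directory — give every task two CPUs unless overridden.
process {
    cpus = 2
}
```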
So again in nextflow.config, inside a process block, I use the withName selector to make sure that any tasks matching the process name 'RNASEQ:INDEX' — that is, this task here — run with two CPUs, while all other tasks are left at the default. I can also override the default with cpus = 1, which ensures that by default all tasks get one CPU unless they match the withName selector, in which case they get two. If we have a lot of tasks we might also be interested in glob-style pattern matching: instead of 'RNASEQ:INDEX' we could write '.*:INDEX' to match any task whose process name ends with INDEX. Just to make sure this is working, it can be helpful to set the tag — the tag is the value printed in parentheses in the run output. Of course I could cd into the work directories and double-check the CPU allocations, but sometimes it's handier to just set a tag, and I can see that the tag for RNASEQ:INDEX has been overridden while the others remain at the default. This is interesting, but the really powerful stuff comes with dynamic directives. You can specify dynamic directives using closures — the same sort of closures we've been talking about for the whole workshop — and those closures are evaluated as the task is submitted. Nextflow is a dataflow-oriented paradigm, and elements of a task can be calculated as the task is submitted rather than up front when the run begins, so your configuration can be very dynamic and compute directive values at task submission time. As an example, the rnaseq-nf run we've been demoing includes a fastqc process that looks like this: process FASTQC, with inputs sample_id and path reads. You can already see that the tag is evaluated dynamically — sample_id is used in the tag directive with a dollar sign. We can override that tag from configuration; note that when setting values inside a process in a .nf script we don't need the equals sign, but it's really important to note that when setting values in configuration the equals sign is needed. The other thing to note is that for a dynamic directive we need to supply a closure: we're deferring evaluation, handing Nextflow a closure to evaluate later, rather than having it evaluated when the configuration file is parsed. I'll just change the wording of the tag slightly so we can see that our override is the one being applied. In addition to sample_id, we also have access to the reads value, so we might want to dynamically scale, say, the number of CPUs for a task depending on how many read files were supplied. Again, because this is a dynamic directive, we put it inside a closure: cpus = { reads.size() }, so if only one read file is supplied the task gets one CPU, and if two are supplied the task gets two. And just to make sure I'm actually doing this, we can set a tag that prints reads.size(), evaluated at submission time as well. Perfect.
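Pulled together as one sketch — the selector names follow the rnaseq-nf demo pipeline; note the equals signs and the closures for the dynamic values, and that this assumes reads arrives as a list of files:

```groovy
// nextflow.config — selectors plus dynamic directives evaluated at task submission.
process {
    cpus = 1                                   // default for every task

    withName: 'RNASEQ:INDEX' {                 // '.*:INDEX' would glob-match instead
        cpus = 2
    }

    withName: 'FASTQC' {
        tag  = { "fastqc on ${sample_id} (${reads.size()} read file(s))" }
        cpus = { reads.size() }                // one CPU per supplied read file
    }
}
```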
An even more advanced option — one I haven't seen a lot of yet, but I suspect we'll start to see more of — is scaling resources like CPUs or memory based on the input file sizes, not just the size of the collection. The size() above measures how big the list of files is, but we can also calculate the size of the actual files themselves, in bytes. Let's make the tag directive a bit more complicated — we can do arbitrary computation here. We take the reads and use the spread operator introduced earlier, reads*.size(), to iterate over every element in the collection and get its size in bytes; that leaves a collection of integer sizes, which we add up with sum(). Then we change the tag to report the total input size, converting the byte count with MemoryUnit — a Nextflow built-in class for turning strings and integers into human-readable memory units — and we can see the total input size is 1.3 MB. Just as we're setting the tag here, you could use the same idea to calculate the total input size for a given process and use it to scale, say, the memory or CPU allocation of that task. You wouldn't necessarily want to do that for all tasks as a blanket statement, but if you know that particular programs or steps in your workflow need memory that scales with input size, this can be a helpful way to dynamically calculate the amount of memory a task requires. Lastly, the most common use of dynamic directives is retry strategies. Nextflow gives you the option, upon task failure, of resubmitting the task — perhaps because it's a flaky program that relies on network I/O, or you'd just like to give it another go — and you can also dynamically evaluate things like CPUs and memory so that on resubmission the task is allocated more. Two directives are needed: maxRetries and errorStrategy. The errorStrategy directive determines what action Nextflow should take on task failure, that is, when the .exitcode file does not contain a zero. The options: 'terminate' stops the run completely, and is the default — on any task failure Nextflow cancels in-progress tasks, winds everything back, and exits as quickly as possible. 'finish' is a little more elegant: Nextflow lets processes and tasks that have already begun finish under their own steam and then tidies everything up, but it won't submit any new tasks. The 'ignore' error strategy simply ignores any errors from that process and doesn't emit anything into the output channels — this can be a little dangerous, but there are times when it's helpful. And the 'retry' error strategy is the one that allows Nextflow to resubmit the task, up to the number specified by maxRetries. When we use a closure for a dynamic directive, in addition to the input variables defined for the task — like sample_id and reads here — we also have access to a special variable called task, which has a .attempt property that is incremented every time the task is retried. That lets us work out, for any given task, which attempt number we're on, so we can imagine setting the process configuration like so.
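Which might look something like this sketch — the withName selector is a stand-in for whichever processes you want covered:

```groovy
// nextflow.config — retry with more memory and wall time on each attempt.
process {
    withName: 'RNASEQ.*' {
        errorStrategy = 'retry'
        maxRetries    = 3
        memory        = { 2.GB * task.attempt }    // 2 GB, then 4 GB, then 6 GB
        time          = { 1.hour * task.attempt }  // 1 h, then 2 h, then 3 h
    }
}
```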
Here we're saying: for all processes matching the RNASEQ selector, retry on any failure, up to three times, with the memory given to the task defined by this closure. On the first attempt Nextflow will try to run the task with two gigabytes of RAM and one hour of wall time; on resubmission it gets four gigabytes and two hours; and finally it tries six gigabytes and three hours. As I mentioned before, when configuring via nextflow.config you will always require the equals sign, as opposed to specifying directives inside the .nf file, where the equals sign is not required. I hope this has been helpful, and I look forward to seeing some interesting configuration values popping up in the nf-core community soon. With that, I want to make sure there's plenty of time for Q&A — a lot of people registered for this and I expect there will be lots of questions — so we're going to pause here and open up the floor on Slack. Don't forget to log into the September 2023 advanced training channel on the nf-core Slack and ask your questions there. I want to thank you for coming and listening to the first community advanced Nextflow training; we'd love to hear your feedback, and we look forward to seeing you at a future event. Thanks very much — I'll see you soon.