Welcome to the Nextflow advanced training, September 2023 edition. This is the first time we've run this course for the community, so the first thing I'd like to say is that I would appreciate any feedback you have about the content or the pacing. This is, as the name suggests, an advanced Nextflow tutorial, or rather an intermediate-to-advanced Nextflow workshop. We're going to explore some of the more advanced features of Nextflow, introduce a little bit of Groovy along the way, and see how to use them to write efficient and scalable Nextflow pipelines. As the name suggests, this is not an introductory workshop: we do assume some basic familiarity with Nextflow, and we're not going to cover all the basic concepts. The other important thing to note about an advanced workshop is that Nextflow at the intermediate-to-advanced level has a lot to cover, and I'm sure some of the things I cover today and over the next couple of days won't be new to some people. To those people I apologize in advance, but I've tried, wherever possible, to dot in little Groovy concepts and advanced ideas, so even if you're already familiar with, for example, the map operator, we're going to introduce some interesting ways of dealing with closures. We hope there'll be something for everyone in this workshop, and we'd love to hear feedback about how you found it, the pacing and the content. We'll be hanging around in the Nextflow community channel, so we look forward to seeing you there.

The other thing to note is that if you're watching live, there is a channel in the nf-core Slack where everyone participating in this workshop live will be hanging out and asking questions. I strongly encourage you to take advantage of that as soon as possible. I'll be hanging around, and some other volunteers will be hanging around, in the Nextflow channel over the next couple of days. We'd love to hear feedback, we'd love to see your questions, and we'd love to see people answering each other's questions; this is a community event, after all. So without further ado, let's get into it.

The materials for this event are up at training.nextflow.io under the advanced training tab, just like the fundamentals training. There are two ways of doing this. The recommended way is probably via Gitpod: this "Open in Gitpod" link will spin up a virtual machine with all the materials you need for the workshop. You can, if you prefer, run it locally with a local installation of Java and Bash, but it's just so much easier in Gitpod, and I strongly recommend you use the Gitpod environment. If you click the link, you'll be directed to Gitpod, where you can open this particular repository. You can choose either VS Code in the browser or on the desktop; in this workshop I'm going to be demoing with the desktop editor, but they'll both be the same experience. The standard class, four cores and eight gigabytes of RAM, should be sufficient for most people. If you just click continue, it will spin up a workspace in which you can get started and conduct the workshop. This will take about five minutes, so I'm going to pause for five minutes here and give everyone a little break and an opportunity to get spun up, plus a few extra minutes so that people who are having difficulty can post on Slack. You will require a GitHub login. Once that five or ten minutes is up, we'll get started. We'll see you there.
This chapter is all about Nextflow operators. It's not a comprehensive tour of all the Nextflow operators; there are far too many to cover in a workshop of this length. What I want to do is give us a tour of some of the more interesting or underutilized operators available in the Nextflow ecosystem, and we're going to use those operators to talk a little bit about Groovy and introduce some basic Groovy concepts. All right, enjoy.

Welcome to our operator tour. The first thing I'm going to do is navigate to the directory for this chapter, which is cd advanced/operators, and have a little look around. I see here we have a main.nf and a data directory. I'm going to open the main.nf in the editor and have a look around, and I can see that this is a very simple little workflow. It takes a channel of five integers and passes that channel into the map operator, and to that map operator we've supplied a closure. This closure is a very simple one, described here as a sort of canonical example: it takes each element, which by default takes the name it, and multiplies it by itself. The map operator returns a new channel, which we pass to the view operator; view takes each item from its input channel and prints it to standard output. Let's have a look at what that looks like when we run it. As expected, we get back the integers 1, 4, 9, 16, 25, which are the first five integers squared.

As I said, by default each element being passed to the closure is given the name it, just for convenience's sake, to save us having to declare arguments for every closure. But if you want to be a little bit more descriptive, you can give it a name. With the arrow ("stabby") syntax I can write the same closure with a named argument, and it will generate the same result; I can double-check that with nextflow run. Note here that the arguments to this closure are named: each of these integers in turn will take the value num and then be multiplied by itself, and the return value is the last expression in the closure, num times num, the square of the value, which is again piped to the view operator.

If you find yourself defining a particularly useful closure that is used many times in your workflow, and you'd like to have a single place where it's defined, you can actually name closures in Groovy and in Nextflow. So here I'm going to give this closure a name, and I'm actually going to give it a type as well. Groovy is an optionally typed language; it runs on the JVM, which is strongly typed, so it's possible to give these integers a type. It's entirely optional, and in a lot of cases it's not necessary, but it can sometimes help with debugging, and I just want to make sure that everyone attending the advanced workshop is aware that it's a possibility. So I'm going to call this closure square, defined as num times num. Now, into the map operator, I'm going to pass the square closure as an argument: I'm taking the same channel of five integers and passing it to map, but instead of defining the closure inline, I've defined it on line one and passed it to map as an argument. Let's run that just to make sure that nothing has changed. That's perfect. Great. And once you have closures defined with names, you can actually compose them together.
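Before we do, here's a rough sketch of the three variants we've walked through so far (the exact contents of the training repo's main.nf may differ slightly):

```groovy
// Named, optionally typed closure defined once at the top of the script
def square = { Integer num -> num * num }

workflow {
    // Canonical inline form, using the implicit `it` argument
    Channel.of(1, 2, 3, 4, 5).map { it * it }.view()

    // The same closure with a named ("stabby arrow") argument
    Channel.of(1, 2, 3, 4, 5).map { num -> num * num }.view()

    // Passing the named closure to map as an argument
    Channel.of(1, 2, 3, 4, 5).map(square).view()
}
```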
So if I have my square closure, and let's say I'd like to do something else to these integers, maybe add a value: I'm going to define a new closure called addTwo, which just adds two to each element. And I can use this greater-than greater-than notation, which takes the output of square and passes it to addTwo, composing these closures together. So this should change our result, and as expected, instead of squared numbers we have squared numbers plus two. I've taken the channel of integers and passed it through map, which composes these two closures together, and the return value of the last closure, addTwo, is the return value of the map operation, which is then piped to the view operator. We could alternatively write this in Nextflow itself, because we already have this idea of composition in Nextflow; this will give us the same values. Here we've taken the map operation and just passed each of those named closures through to two serial map operations.

And, and this is more interesting to those inclined towards functional programming, you can actually curry these closures. Let's say we define a closure called timesN, which takes two arguments: a multiplier, and some other value; let's retain the same names as the documentation. This closure takes a multiplier and a value to be multiplied, and it multiplies these two things together. I can curry these closures, that is, create a new closure which sets the first argument to some specific value. So I'm going to make a timesTen closure, which equals timesN.curry(10). The curry method basically takes the first argument and fixes it to the value 10, and what we get back is essentially a new closure that looks like timesN with the multiplier already filled in; these two forms are equivalent. So this timesTen takes the two-argument closure and just sets the first value to be 10. Want to see that in operation? Let's do that. We pass this named closure timesTen, which is the result of the curry operation on the two-argument closure, through map. The integers 1, 2, 3, 4, 5 have been multiplied by 10, and we get back the values 10, 20, 30, 40, 50. Perfect.

Let's move on to the view operator. Again, this is a very commonly used operator, particularly helpful in debugging; it's sort of the println equivalent of asynchronous Nextflow programming. I've used view quite a bit already in this demonstration of operators, in every example in fact, just to view the outputs of the pipeline, or rather of this series of composed operators. By default it just prints a string representation of each item in the channel, but we can customize the output if we'd like by supplying, again, a closure to the operator. So here I'm going to keep this timesTen example, but this time, instead of just the default view, I'm going to pass it a closure and print out a more helpful or informative message; I'm going to say "Found ...", just as an example. And now, instead of just the default string versions of those integers, I'm printing a string message: found 10, found 20, found 30, found 40, et cetera. We can do more interesting things if we're interested in exactly what sort of class those items are.
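Putting the composition, the currying, and the customized view message together, a sketch:

```groovy
def square = { Integer num -> num * num }
def addTwo = { Integer num -> num + 2 }

// A two-argument closure, and a curried version with the multiplier fixed to 10
def timesN   = { multiplier, value -> multiplier * value }
def timesTen = timesN.curry(10)

workflow {
    // Composing closures with >>: square runs first, then addTwo -> 3, 6, 11, 18, 27
    Channel.of(1, 2, 3, 4, 5).map(square >> addTwo).view()

    // Equivalent composition at the operator level: two serial map calls
    Channel.of(1, 2, 3, 4, 5).map(square).map(addTwo).view()

    // The curried closure plus a custom view message -> "Found 10" ... "Found 50"
    Channel.of(1, 2, 3, 4, 5).map(timesTen).view { "Found $it" }
}
```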
I can use this dollar-sign and curly-braces notation to embed any sort of arbitrary Groovy in here. So here I'm calling the getClass method on each item; this getClass method is available on all objects in Groovy and returns the class of the object. So, what sort of numbers are these? I can see here that they're java.lang.Integer. This is helpful for customizing the messages that are returned and printed to standard output by the view operator.

Now, we've talked here about naming these closures, but I want to be very clear that in almost all circumstances it's better and more convenient to keep these things anonymous closures, that is, to pass them in directly. But there might be occasions, considering this is an advanced workshop, where using those named closures is helpful. In almost all Nextflow workflows, the map operations are passed closures directly, inline. Perfect. All right, on to the next operator in our tour.

The splitCsv operator is particularly useful because a lot of bioinformatics, computational biology, and batch data-processing workflows begin with, or at least at some point in the workflow will need to read, some semi-structured data like a CSV or a TSV. A very common pattern is for the input to the workflow to be a sample sheet, a CSV or a TSV which describes the samples or the inputs to the workflow: maybe a sample ID, then some paths to some files, and optionally some metadata in some extra columns. We're going to see some more complicated ways to manage that a little bit later in the workshop, but the splitCsv operator is a really great, handy little tool to have in your tool belt. So I'm just going to copy this here. Let's have a look: what is this data/samplesheet.csv? We have this little CSV here, and it has these five columns: an id; a repeat, which is a piece of metadata; a type, another piece of metadata, which is tumor or normal; and then some paths to some fastq files. Let's read that with the Channel.fromPath factory, which returns us a channel of path elements, like file elements, and pass that to the splitCsv operator. We're going to add the argument header: true to splitCsv. If you'll remember, our sample sheet CSV has these header columns: id, repeat, type, fastq1, fastq2. This will ensure that when the elements are returned from splitCsv, the elements in the channel are maps that are keyed by the header names. This means we don't have to keep track of the header, and we don't have to read the header independently; it'll be handled for us and everything will just work out. So let's run this basic Nextflow workflow, and I encourage you to follow along if you're watching this.

Right. So here we have our output. It's a little bit hard to read because of the line wrapping, but I think you can see that each line here is an element in the channel, and each element is a map. Maps in Groovy are represented in their string form by square brackets around key-value pairs separated by commas. So this map has an id key with the value sampleA, a repeat key with value 1, a type key with value normal, et cetera. So now that we have this, we're going to pause for a small exercise: from the directory advanced/operators, use the splitCsv and map operators to return a channel that is suitable for input into this process.
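As a reference point for the exercise, the starting workflow is roughly this, assuming a samplesheet.csv with the header row id,repeat,type,fastq1,fastq2:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)   // emit one map per row, keyed by the header
        .view()
    // e.g. [id:sampleA, repeat:1, type:normal, fastq1:..., fastq2:...]
}
```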
You can feel free to consult the documentation that's linked out here. This process expects to be provided a channel where the elements are tuples: the first element is an ID, and the second element is a list of fastq files. So I'm going to pause here and give you time to work that out.

Okay, welcome back. I hope you've had a chance to attempt that small exercise. As I said before, because we specified header: true, we have a really convenient way of accessing the columns by name, so I'm going to pass the rows through a map closure, and I'll give the argument a name just to make things a little bit more explicit. For each row, we're going to return a tuple, a list in square brackets. The first element of the list needs to be the ID value, and I know that each row has this id key. So if we just did that, let's have a look at what we're returned: right, we return just the row id in a list. The next thing we're going to do is make sure this tuple has two elements, because the second element needs to be the fastqs. If I just remind myself of the column headings in samplesheet.csv, we have columns fastq1 and fastq2, so I could do something simple like pulling those out, and that's going to get us almost all the way there.

Okay, so this is the right sort of shape, in that we have a sample ID and then some paths. But it's really important to note that at the moment these are just strings; Nextflow doesn't necessarily know that this is a path to a file, even though you nice humans could probably guess as much. We need to make that explicit to Nextflow: by default, the keys and values from splitCsv are strings or integers. To turn them into paths, which is going to be important for provisioning the data in the next stage, we wrap these in the file method, like this. So here we have a tuple where the first element is the row id and the second element is a list of fastq files. Let's see what difference that makes. Now the stringified version of these paths is the full path. And just so you don't have to believe me, let's double-check the class of those elements. I'm going to use the view operator, and again I'm going to supply a closure, just like I showed you earlier: the first element is the ID and the second is the fastqs; let's pull out the first fastq and call the getClass method on that element. Before wrapping, we can double-check that these are actually strings: the class of the items in the second element is java.lang.String. So we wrap them in the file method to turn them into real files, which Nextflow can then stage in as symlinks, or copy in if you're operating on the cloud. Great. Now that I've wrapped them in the file method, you can see that the class is no longer String but one of the sun.nio filesystem path classes. The exact class is not strictly important; I just wanted to show you another demonstration of how to use the view operator to get a little bit more helpful information.

Okay, so in this exercise, as I've noted here in the documentation, we've lost an important piece of metadata, that is, the tumor/normal classification. If we look back at the sample sheet, we're missing the repeat and type columns, which might be useful later on.
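Here's the full solution as a sketch, before we add that metadata back:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        // Wrap the fastq columns in file() so Nextflow can stage them;
        // without it they are plain java.lang.String values
        .map { row -> [row.id, [file(row.fastq1), file(row.fastq2)]] }
        .view { id, fastqs -> "$id -> ${fastqs[0].getClass()}" }
}
```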
So, where we can, it's a really good practice to try to hold on to as much metadata as possible and pass that metadata through the Nextflow workflow. Don't drop it: there's very little cost to keeping the metadata flowing along through the directed acyclic graph of connected computational processes. It's important to keep that metadata flowing through the graph for as long as possible, because you don't know when you might need it, and once it's become dissociated from the data, it's very hard to join it back up again, or if not hard, at least much less convenient than just letting it flow through. At the moment we've returned the sample ID and the paths, but we've dropped these two columns, repeat and type, so let's get those back in.

Instead of returning just the ID, what I'm going to do is make a new map. I'm going to call it the meta map, and give it the row's id, maybe the type, and, what was the last column, the repeat. And now, instead of passing the row id, I'm going to use this meta map. Again, these objects are called maps in Groovy; in Python they'd be called dictionaries, and associative arrays or hashes in Ruby. This meta map is going to be the first element. I'm going to print that through view just to double-check what it looks like. Fantastic. So now, instead of just the sample ID, the first item in each element being passed through this channel is a map keyed with the keys id, type, and repeat.

At the moment this is quite repetitive: this way of defining it, with id is row.id, type is row.type, repeat is row.repeat, is stuttering, a little bit ugly, and prone to mistakes. A little bit later on in the workshop we're going to show you how to neaten this up, but I want to put a mental pin in it so we can remember what this looks like; we'll find a more advanced, or rather a cleaner, way of doing this a little later in the day.

All right, now on to another extraordinarily useful operator: multiMap. The multiMap operator is a way of taking a single input channel and emitting each element into multiple output channels. So, taking a single channel, for each element we're going to pipe that element into multiple output channels; it's a way of branching the data flow. Let's assume we've been given a sample sheet that has tumor/normal pairs bundled together on the same row, so there'll be four fastq files per row: a fastq1 and fastq2 for the tumor, and a fastq1 and fastq2 for the normal. And let's say that in our workflow we'd like to treat the tumor pairs separately from the normal pairs; they need to go through some extra processing or some QC checks, for example. Using the splitCsv operator would give us one entry that contains all four fastq files, and for our particular, hypothetical workflow, we really need to separate those into individual channels: a channel for just the normal pairs and a channel for just the tumor pairs. To do that we can use the multiMap operator; to save my accidental typos, I'm just going to copy and paste it here. So we're taking this sample sheet that I provided earlier; we're reading samplesheet.ugly.csv. Let's have a look and see what that looks like. As I described earlier, it has four fastq files per row: two normal fastqs and two tumor fastq files.
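Here's what the workflow looks like with that meta map in place, as a sketch, before we turn to the paired sample sheet:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row ->
            // Explicit but stuttery; we'll tidy this up later with subMap
            def meta = [id: row.id, type: row.type, repeat: row.repeat]
            [meta, [file(row.fastq1), file(row.fastq2)]]
        }
        .view()
}
```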
In addition to the normal metadata of id and repeat, we're passing it through splitCsv with the header: true argument, and passing the output channel from splitCsv to the multiMap operator. Again, we're supplying a closure to multiMap, taking each row, and we have this interesting syntax here. The first thing we're doing is defining the names of two output channels: this tumor is going to be the name of one output channel, and this normal is the name of the other. For each row, we're going to do these operations and emit the result into the tumor channel, and we're going to do these operations and emit the result into the normal channel. Again, we're constructing this meta map, creating a map in the same stuttery, ugly style we're going to fix later, and then, in the following elements, we're passing tumor fastq 1 and tumor fastq 2 into the tumor output channel. In the normal branch, we're doing basically exactly the same operation, but instead of the tumor fastqs we're passing normal fastq 1 and normal fastq 2. So for each row in the input CSV, we're making an output channel tumor that has the meta map and the two tumor fastq files, and a normal output channel, which has the metadata map and then the normal fastq files.

And then we're using this set operator to define the outputs, giving them a name. We could also do it as an assignment, samples equals the whole expression, but I think a lot of people in the community prefer this more linear style, where the data flows from top to bottom and then we give it a name here at the end. So now we have this new object called samples, and samples has two output channels, which we can address by name with dot notation: this samples.tumor refers to the name defined here. Let's just view one at a time and see what it looks like. Oops. So we're expecting here to find the meta map and then the fastq pairs, fastq1 and fastq2, just for the tumor samples. And indeed: fastq1, fastq2, but just the tumor samples, and the first item in each element in the channel is this meta map. Perfect. If we uncomment this, we should be able to see normal as well; the terminal is going to get a little bit busy. Perfect. And again, I'm supplying a closure to the view operator to make sure I disambiguate between the tumor and normal outputs.

There is one detail here. Unfortunately, because of the way multiMap has to return multiple channels, using named closures as we talked about earlier is not going to work directly. That said, if you really want to do it, Nextflow provides the convenience method multiMapCriteria to allow you to define named multiMap closures. You shouldn't often need it, but I've linked out the documentation here if you'd like to try it. In the interest of time, we're going to skip over that, because I think in almost all cases this notation, where you define the closure inline inside the multiMap operator call, is fine; it will get you 99% of the way there. The next operator on our tour is the branch operator; I've linked out to the documentation here.
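Before we look at branch, here's the full multiMap example as a sketch. Note that I'm guessing at the exact column names in samplesheet.ugly.csv (tumor_fastq_1 and so on), so adjust these to match the real file:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.ugly.csv")
        .splitCsv(header: true)
        .multiMap { row ->
            tumor:  [[id: row.id, repeat: row.repeat, type: 'tumor'],
                     file(row.tumor_fastq_1), file(row.tumor_fastq_2)]
            normal: [[id: row.id, repeat: row.repeat, type: 'normal'],
                     file(row.normal_fastq_1), file(row.normal_fastq_2)]
        }
        .set { samples }

    samples.tumor.view  { "Tumor:  $it" }
    samples.normal.view { "Normal: $it" }
}
```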
The branch operator is a way of taking a single input channel and emitting each element into one, and only one, of a selection of output channels (of which there can be more than one). In the example we just talked about, the multiMap operator was necessary because, for each row, we wanted to emit into two channels. But let's say we have a sample sheet like samplesheet.csv, the one where each normal pair and each tumor pair is on its own separate row, and let's say we'd still like to generate an output channel that has just the tumor pairs and an output channel that has just the normal pairs. We don't need to split up each row into two different channels, like we did for the multiMap with samplesheet.ugly.csv; in this case, we just need to decide which output channel each row needs to go to. And for that we're going to use the branch operator.

So here we're taking a channel from fromPath, passing this path here through splitCsv just as we've done before, so nothing has changed there. Then we're using a map operation, which takes a row and generates this meta map and then a list of fastq files. Just to be clear, let's have a look at what's returned here. That's it: the meta map and then a list of paths. We're going to take that, pass it through this branch operator, and then give it the name samples. The branch closure takes this meta and reads, and, just as with multiMap, we're defining the names of the output channels with a name and then a colon. To the right-hand side of each name, we define an expression that returns a Boolean value. For each element being passed into the branch operator, so for each row from the splitCsv, it's going to try to match these Boolean conditions in turn. Does it match this condition? If so, the element is emitted into the tumor channel. If not, it passes down to the next test: does the row match this condition? If so, it's emitted into normal. If no conditions are met, the element is discarded and is not output into either of these tumor or normal channels. We happen to know ahead of time that in this particular example every row has meta.type equal to tumor or normal, but you can imagine situations where you might want to filter as part of this branch operation.

So let's have a look at what is output here, just to make sure. Again, I'm giving it the name samples, and this samples object is a somewhat complicated object in that it contains two channels, a tumor channel and a normal channel. We can address them, again just as before with multiMap, using dot notation: samples.tumor returns us the channel of elements meeting the first condition, and samples.normal returns all elements where the second condition is met. So if I view those: fantastic. We have the normal elements, and we can see from the file names that these are the normal fastqs, and we have the matching tumor pairs here, emitted into the tumor channel, samples.tumor, with their tumor R1 and R2 fastqs.
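As a sketch, the branch version of the workflow:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row ->
            def meta = [id: row.id, repeat: row.repeat, type: row.type]
            [meta, [file(row.fastq1), file(row.fastq2)]]
        }
        .branch { meta, reads ->
            tumor:  meta.type == 'tumor'
            normal: meta.type == 'normal'
        }
        .set { samples }

    samples.tumor.view  { "Tumor:  $it" }
    samples.normal.view { "Normal: $it" }
}
```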
An element is only emitted into the first channel whose condition it meets, and if it doesn't meet any conditions, as I said before, it's just discarded. If you would like to avoid discarding any elements, you might consider introducing a sort of catch-all channel, other, with the condition simply true. If an element passes through each of the earlier tests and this returns false and this condition returns false, then it falls through to this one, which always returns true. So any rows supplied as input into this branch operator that don't match the earlier conditions will end up in other. In this particular situation, I happen to know ahead of time that every row has either type normal or type tumor, so in this case there are no such samples and other is an empty channel. But you might want to consider adopting this sort of practice to make your pipelines and workflows a little bit more resilient.

You might want to warn the user, and we can use the Nextflow built-in method log.warn to do that, or, even more strictly, log.error, to raise an error and halt the workflow if something appears in the samples.other channel. If there are no samples that fail to match either of those conditions, the workflow proceeds as normal and no errors are generated here. But what if we did have a row that didn't match? Actually, let's see what that looks like. Let's duplicate a row, and instead of tumor or normal in the type column we're going to say "unexpected", and put it towards the top. So now we have a row that passes through this test, which returns false, then this one, which returns false, and it falls through to other, where log.error will be called. And here it is: "I was not expecting this". Obviously this is not a particularly helpful error message; I'll leave it up to you, as an exercise for the reader, to put in a more helpful one. What I wanted to show you is that the error has still occurred. Alternatively, if you are confident ahead of time that you have only two conditions, meta.type is tumor or it's something else, then you can simplify the expression like this. I'm going to get rid of our unexpected row, run nextflow run, and we get back the tumor and normal output channels.

In this example, so far, all we've done is emit the inputs verbatim, without any modification: we've just taken the inputs and channeled them into one of these two output channels, either samples.tumor or samples.normal, and we've not made any changes to those elements as they pass through. But the branch operator does give us the option to make some changes and emit a slightly different object into an output channel. We can do this with the return syntax. Instead of returning the meta object transparently, let's use the plus notation to add a new key, which combines two maps. Here the normal branch still emits unchanged, but in the tumor branch, instead of just passing meta and reads transparently through, we're returning a new element, which is the result of the meta added to this new map. This combines the two maps, basically giving the meta map a new key, and the reads pass through transparently. So if we run this and view the output of the tumor channel, we should see that the meta map now has a new key with the value myValue.
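A sketch combining the catch-all with the return syntax; the key name and message are illustrative, and in the video the log.error call sits inside the branch itself rather than in a separate subscribe:

```groovy
workflow {
    samples = Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row -> [[id: row.id, repeat: row.repeat, type: row.type],
                       [file(row.fastq1), file(row.fastq2)]] }
        .branch { meta, reads ->
            tumor: meta.type == 'tumor'
                // return lets this branch emit a modified element;
                // `meta + [...]` merges the maps into a *new* map
                return [meta + [newKey: 'myValue'], reads]
            normal: meta.type == 'normal'
            other: true   // catch-all, so nothing is silently discarded
        }

    // Raise an error for anything that fell through to the catch-all
    samples.other.subscribe { log.error "I was not expecting this: $it" }
}
```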
So this is a way of appending new keys to the map, or changing the map, as it passes through the branch operator. We have another exercise here, so we're going to pause for a couple of minutes: how would you modify the element returned in the tumor channel, samples.tumor, to have the key-value pair type: abnormal instead of type: tumor? I'll give you five minutes to work on that, and I'll be back in a second.

Welcome back. So the exercise was: how would you modify the element returned in the tumor channel to have this new key-value pair, type abnormal instead of type tumor? There are a couple of different ways of doing that; here's one potential solution. We can use this plus notation not only to add keys to a map, but also to override keys. The plus operator combines two maps, and precedence is given to the map on the right: if these two maps, the new map and the meta map, have shared keys, the value on the right will take precedence and override the one on the left, so meta + [type: 'abnormal'] does the job. The plus method also returns a new map, so we're not actually modifying the original map, which is an important qualification that we'll talk about in a little more detail later in the workshop. But meta plus this new map returns us a new map where the type is no longer tumor, it's abnormal. So let's run the workflow and double-check that the override is working correctly. Perfect. Great. You can see here that in the tumor channel we have the keys id, repeat, and type, and the values of type are all abnormal, whereas earlier, before we overwrote them, they were tumor.

Now I want to talk a little bit about this samples object. Let's have a look: asking for its class returns us a class internal to Nextflow. You won't use this detail on an everyday basis, but it's interesting to show you how you might go about finding it. We have this class nextflow.script.ChannelOut, and this ChannelOut object, this samples object, is actually an object that contains multiple output channels; you can see we've been addressing each of those with dot notation. There are a couple of different operators that return these multiple output channels; the multiMap and branch operators are two obvious examples. Let's just simplify our example a little. Here we have a workflow which takes five integers and passes them through multiMap, returning two output channels, small and large, where large is the input multiplied by 10. So we have two independent channels, numbers.small and numbers.large: in large we have 10, 20, 30, 40, 50, and in small we have 1, 2, 3, 4, and 5. Here we're using the equals sign to assign the numbers variable to hold those two output channels; we could also use the set notation, which does the same thing. A more interesting situation, as I've noted here, occurs when you have a process that accepts multiple channels as input. For example, I'm going to add this debug true directive, which ensures that everything printed to standard output by the script is echoed out to the Nextflow command line. This process, multi_input, is going to take two channels: a channel containing a small number and a channel containing a big number.
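Here's a sketch of that process together with the two-channel numbers object; the process and variable names here are mine:

```groovy
process multi_input {
    debug true   // echo the script's stdout to the Nextflow console

    input:
    val(small_num)
    val(big_num)

    script:
    """
    echo "Small is $small_num and big is $big_num"
    """
}

workflow {
    numbers = Channel.of(1, 2, 3, 4, 5)
        .multiMap { n ->
            small: n
            large: n * 10
        }
    // Equivalent to the assignment: ... .set { numbers }

    // Standard form: address each output channel by name...
    multi_input(numbers.small, numbers.large)

    // ...or, since numbers holds exactly two channels, pipe it straight in:
    // numbers | multi_input
}
```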
The script just runs echo, and Nextflow does the work of replacing the dollar-sign variables with the value supplied from the first channel and the value supplied from the second channel, and it runs this process for each pair. So we have this process multi_input, which takes two channels, and we can call it like this, multi_input(numbers.small, numbers.large), and comment out those view calls. So multi_input is supplied with the two channels numbers.small and numbers.large. And if we run this workflow... oops, I've made a small typo: big_num, I've written bug_num. Because I've added this debug true directive, which again is a really helpful little thing for debugging, whatever is echoed to standard output inside the script gets shown. Usually it would just be silently captured by Nextflow in the .command.out and .command.log files; for example, if we cat the .command.log for any one of those tasks, we can see the standard output from that task. But if we use the debug true directive, the output of each of those tasks is echoed out to standard output, so it's a quick way of double-checking and seeing what is happening inside these scripts.

So this multi_input process takes two channels, small and large, which we've addressed by name here: we've taken this multiMap operator, assigned its output to the numbers variable, and then pulled each channel from that numbers variable. But we could, even more concisely, pass it in directly. Because the output from multiMap is an object that contains two channels, we can pipe that into a process that expects two channels, and this will work exactly the same way. This can be a handy shortcut for making your workflows even more concise. But again, as with a lot of these tips, most of the time you're going to want to use the standard approach of assigning a variable and then addressing the channels directly. That's the case more often than not, because generally those output channels want to go down distinct paths through the DAG. But I do want to make sure that everyone's aware that, if you happen to have a process that accepts multiple channels, you can pipe them in directly. So the final clean solution looks like that, which is very, very tidy.

Okay, now on to a real favourite operator: groupTuple. This is, again, very commonly used, more commonly than the multiMap and branch operators. The groupTuple operator is a way of taking a single input channel and combining elements from that channel that share some common key. Let's take this workflow as an example. We're taking a channel from fromPath with samplesheet.csv and passing it through splitCsv, so again, exactly the same start as all of the examples we've seen so far. We're passing it through map, and we're turning each row into a tuple, a list, where the first element is a meta map, then we have the row's repeat, and then we have the fastq files. One of the changes we're making in this example is that the meta map only contains two keys, id and type; we're no longer including the row's repeat in this meta map. That's because we're going to pass this to groupTuple, but we only want to group on the id and type: we basically want to ensure that the repeats for each sample end up in the same element in the output channel, and groupTuple, by default, will group on the first element of the tuple, which in this case is the meta object.
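As a sketch, the setup before grouping; note that repeat deliberately stays outside the meta map:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row ->
            // Only id and type form the grouping key (the first element)
            def meta = [id: row.id, type: row.type]
            [meta, row.repeat, [file(row.fastq1), file(row.fastq2)]]
        }
        .view()
}
```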
If we were to include the row's repeat inside this meta object, then the first element of each tuple would be distinct, which would defeat the purpose of the groupTuple. So, before we pass it through groupTuple, let's have a look at what this returns. Here in the first element we have sampleA, type normal, and if we search through these, you can see there are two elements in this channel which share this common key, a meta map with id sampleA and type normal. Similarly, there are two elements in the channel that share the key sampleA, type tumor. If we pass this through the groupTuple operator, elements that share a key will now be joined together. So here we have a key, and then the row repeats have been turned into a list of repeats. Similarly, the list of fastq file paths has been turned into a list of lists of fastq paths: the first element in this list is a list of two fastq paths, and the second element is another list of fastq paths. For elements where there were no other elements sharing the same key, we just end up with one item in the list of repeats and one item in the list of fastq paths. So this is a very handy way of grouping data together.

A really important caveat, which we're going to address later, is that this is a blocking operation: Nextflow doesn't know which elements it can group until all the elements have passed through. We're going to show a way to mitigate this a little bit later on, in the next section, but by default groupTuple is blocking, in that all elements have to pass through it, and only once all elements have been passed into groupTuple will it start emitting output elements into the output channel. By default, almost all of the Nextflow operators are deeply asynchronous: elements are fired into the output channel as fast as possible, to get your data through the graph and to your outputs and results as fast as possible. But this groupTuple cannot start emitting items until all the inputs are ready. We're going to talk a little bit more about that in a second.

And the last operator I want to cover here on our operator tour is the transpose operator. The transpose operator is a way of rotating matrices, or lists of lists, which can be a little bit difficult to visualize. Rather than going through that rotation in the abstract, a handy way to think about it, and the way it's most often used, is as the inverse of the groupTuple operator. I'd encourage you to have a play around with the transpose operator in your own time, because it can be very handy, but I want to show you what it looks like when its operation is the inverse of groupTuple. We have the groupTuple elements I've just shown you: the meta map, then a list of repeats, then a list of fastq pairs. If I pass this through transpose, what we're returned is the same elements that we viewed before we passed them through groupTuple: that is, elements with a meta map, a single repeat, and then a pair of fastq files.
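Adding the grouping and its inverse, a sketch:

```groovy
workflow {
    grouped = Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row ->
            [[id: row.id, type: row.type], row.repeat,
             [file(row.fastq1), file(row.fastq2)]]
        }
        .groupTuple()

    // e.g. [[id:sampleA, type:normal], [1, 2], [[r1, r2], [r1, r2]]]
    grouped.view()

    // transpose un-groups: back to [[id:..., type:...], repeat, [r1, r2]]
    grouped.transpose().view()
}
```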
The transpose operator can be thought of as the inverse of groupTuple, and it can be helpful for undoing that operation in particular contexts. And that concludes our chapter on the operator tour. From here, we're going to take a short break. If you're watching this live, now is a great chance, if you haven't already, to drop by the September 2023 advanced training channel on the nf-core Slack; myself and other volunteers are hanging out there. If you have any questions, we'd love to help out, and if you want to chat about anything Nextflow or nf-core, we'd love to talk to you there. You're of course very welcome to pop in questions all the way through the workshop, and we'll see them there. If not, we'll see you back here for more training in half an hour or so.

This chapter is all about metadata propagation. It's a really important concept in batch processing, computational biology, and bioinformatics to attach metadata to the samples, and to the inputs and outputs of processes, as data flows through your graph, through your workflow. Nextflow has a bunch of really helpful concepts that make this easy for you, and here we're covering some dos and don'ts about metadata propagation in batch workflows. So let's talk about metadata. This is a hugely important concept in Nextflow and in batch processing in general, and Nextflow has some strong opinions about how metadata should be handled. Metadata should be explicit: be very wary of metadata encoded in file names. There will be times when you want to include metadata that is high-dimensional or more interestingly shaped, that doesn't fit into a file name, or that contains characters that would make for illegal file names. And the other important concept is that metadata should travel through the channels in Nextflow with the data, ideally as a tuple element, as something like a map: something simple but flexible.

Let's give some examples. First of all, as with all the chapters, I'm going to cd into advanced and, in this case, metadata. This chapter contains a little main.nf to help get started. As I've explained here, in an ideal situation Nextflow workflows begin with a sample sheet, some semi-structured data: a CSV, a TSV, something similar like that. But for this exercise, let's begin with the worst-case scenario: someone's just given you a bag of data. Maybe it's a directory with some fastq files in it; I'm just going to use fastq as an example, and you need to make sense of these things. We're going to use this really worst-case scenario to introduce some handy syntax and some features that will be helpful in more complicated workflows and in other Nextflow contexts. So let's have a look at this data directory. Here, inside data, we have a directory called reads, and inside that we have a directory called treatmentA with a bag of fastq files, and then a directory called treatmentB with a bag of fastq files. You can also see a couple of sample sheets in here, which we're going to ignore for the moment. So whoever handed us this data has violated one of our two important principles, in that it looks like they're encoding some metadata in file names: there's an underscore-separated list of metadata inside each file name. And what we want to do is pull that out of the file name and get it, as quickly as we can, into a metadata object, a real map, that we can pass with the files through the workflow in Nextflow.
The first pass at the data flow that you've inherited here, that I've given you as an example, is a very simple little two-line workflow using the fromFilePairs method. It's a channel creation method, and it generates a channel of the files that match this pattern. Let's have a look at what happens when we run it: it gives us a channel with lots of elements, and each element is a tuple: first a key, and then a list of matching files, like a pair of fastq files. This fromFilePairs method is doing some work for us: it's already pulled out this key, which is the first part of the file name, up to the R1/R2, and then it's given us the two files here. In this case the ID is just a simple string, and we want to augment that and make it a map, to store some more complicated data, because we might have more than one piece of metadata to track: for example, we might want to encode the tumor/normal status, or the replicate status, or the sample ID explicitly, rather than having them in this underscore-separated string.

There are lots of different ways in which we might do this. But first, let's use this tokenize method, which I've introduced here, to break up that key, the sample key, into its constituent parts. We're going to pass this through the map operator, and we're going to give the inputs of this closure names: we know that it's an ID and then a list of reads. The tokenize method, which is like the split method in Python or Ruby, splits that string on underscores and returns a list of strings. I'm just going to run it. And because the closure returns the value of its last expression, and the assignment to this tokens variable returns the tokens variable itself, I can see here that tokens is equal to this list: we have a sample, a rep, and a tumor/normal status. Now, this is a very basic sort of schema; we're just going to assume everything is underscore-separated. But if we're confident about its stability, we can destructure that list into its component pieces and give them names directly: we can say sample, replica, type equals the tokenized ID. Let's have a look: it returns us the same thing. Perfect. But now, because we've destructured this, we can use sample as a variable name, replica as a variable name, and type as a variable name. It's still a little bit stuttery and repetitive here, but what we're doing is making a new variable, meta: a map with the keys sample, replica, and type, whose values we're pulling from the destructuring context. Great. So now we have the metadata as a map, rather than an underscore-separated string, and then a list of reads. I want to note that this destructuring only works with a tuple, which is the parentheses-enclosed list of names, and not square brackets, which would give you a list.

If we wanted to get a little bit fancier, we could also use transpose and collectEntries to produce the same map. Let me give an example of how to do that. To be clear, the destructuring approach is a perfectly valid solution to this problem and a perfectly reasonable way of doing it; but, given this is an advanced workshop, let's use it as an excuse to introduce some more Groovy syntax. We're going to take this id.tokenize, and let's have a look at what this returns: we're tokenizing that ID into its constituent parts, and we have that list, a list of its pieces.
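Before finishing that thought, here's the destructuring version we just built, as a sketch; the glob pattern and the exact file-name schema (e.g. sampleA_rep1_normal_R1.fastq.gz) are my assumptions about the training data:

```groovy
workflow {
    Channel.fromFilePairs("data/reads/*/*_R{1,2}*.fastq.gz")
        .map { id, reads ->
            // "sampleA_rep1_normal" -> ["sampleA", "rep1", "normal"]
            def (sample, replica, type) = id.tokenize('_')
            def meta = [sample: sample, replica: replica, type: type]
            [meta, reads]
        }
        .view()
}
```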
Then we have a list of the keys we want to assign, which will look like this: each element in the channel now carries a list of the constituent parts of the sample name alongside a list of the keys we want. What we can do then is call the transpose method on that, which, just like the transpose operator, rotates the list. So now, instead of a list of two pieces, where each piece is a list of three elements, we have three lists of two: the key paired with its value, key-value, key-value, key-value. Then we call the collectEntries method on that list, just switching the order around so the key comes first. Perfect: now we have a map keyed by sample, replica, and type. And, just to make it a little more readable, we can put these on new lines. It's a slightly cleaner way of getting that map, but it gets us to the same place, where we have a map with keys sample, replica, and type, and then a list of fastq paths. Note that here the fastq paths are full paths, which means that Nextflow, inside the fromFilePairs method, has already done the work of turning those path strings into file objects, Path objects.

Now let's go back to the previous method, that is, the destructuring one, and say we'd like to make a change: we've decided the rep prefix should be removed. I can use regular expressions for this. You'll notice that the replicates are named rep1, rep2, rep3, rep1, and rep2; instead of rep1 and rep2, we'd just like 1 or 2 for the replicates, turning them into simple identifiers. Another really handy piece of syntactic sugar available in Groovy is subtracting regular expressions from strings. So I could say replica equals replica minus a regular expression matching strings that start with rep. Now, instead of rep1 and rep2, the replica values are just 1, 1, 1, 2, 2: we've removed that rep prefix from those values. Another little syntactic shortcut: instead of writing replica equals replica minus something, we can just write replica minus-equals something; these two lines are equivalent. If we're going to subtract something from a variable and assign the result back to itself, we can use the -= operator. Just to show you that I'm not lying, let's run it again. Perfect.

As I've noted here, there are some really helpful pieces of syntactic sugar that Groovy provides, and they're available inside Nextflow. You can trim parts of a string by subtracting another string: if I had the string "123", I could subtract the string "2" and that would give me "13". Or, instead of subtracting a plain old string, I can subtract a regular expression; here, for example, I'm subtracting a regular expression matching a t, then any character, then optionally a space. So you can subtract both strings and regular expressions from strings to make your life a little bit easier.

So here we're almost where we want to be, but we still don't have the treatment captured in our metadata. Just as a reminder, if we look in the data directory, we have these fastqs living inside a directory called treatmentA or treatmentB. At the moment we have the sample ID, the replica, and the tumor/normal status encoded in our metadata, as sample, replica, and type, but we also want to capture this treatment information.
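Here's the fancier collectEntries variant with the regex subtraction folded in, as a sketch (same assumed glob as before):

```groovy
workflow {
    Channel.fromFilePairs("data/reads/*/*_R{1,2}*.fastq.gz")
        .map { id, reads ->
            def meta = [['sample', 'replica', 'type'], id.tokenize('_')]
                .transpose()       // -> [[sample, sampleA], [replica, rep1], [type, normal]]
                .collectEntries()  // -> [sample:sampleA, replica:rep1, type:normal]
            meta.replica -= ~/^rep/   // "rep1" -> "1"
            [meta, reads]
        }
        .view()
}
```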
It's available to us in the paths: you can see here in the paths we have this treatmentB, and we want to pull it out and get it into our metadata. To do that: these reads objects are objects that implement the java.nio.file.Path interface. I've linked out here if you want to check the full documentation, but that interface gives us really useful methods, including getParent. So let me comment this out. What I'm doing here is taking reads, which is a list of paths, and calling the collect method on it, which is sort of like the map operator, but for Groovy collections: it calls this closure on each element in the reads list, in this case two elements, two paths. And we're calling the getParent method on each of those. Let's just see what that returns: save that, run the workflow, and it returns us the parent, which is the treatmentA or treatmentB directory, as a full path: it's the path of the parent directory.

Another really handy little piece of Groovy syntactic sugar is this star-dot notation, which we call spread-dot notation. If we're just calling a single simple method on each of these elements, another way of writing reads.collect with it.getParent() is to abbreviate it with the spread-dot form; these two lines, line eight and line nine, are equivalent. Just to show you... oops, I've left an extra character in; there we go. Now, this returns us more Path objects, and instead of this whole path, we really want a string which contains just that final directory name. We have a list, so we can use that same spread-dot notation to call getName on each of those elements: getName takes a path and returns just the terminal fragment, so now we're left with just treatmentA or treatmentB. Another way of doing that, if we wanted to do it the long way, would be with two collect calls; these two lines are equivalent, but the spread-dot version is a little bit shorter and a bit nicer.

And the last thing I want to note is that there is an extra piece of syntactic sugar in how Groovy deals with get and set methods. There are a lot of methods that start with this get prefix: getParent, getName, getSimpleName. For any method of the form getSomething, where the Something begins with a capital letter, you can access that method using property-style notation, where you just write reads.parent or reads.name. So it.getParent() is the same as it.parent, and it.getName() is the same as it.name. All four of these lines are equivalent, but we end up with this really lovely short version using the spread notation and the property-style syntax. Just to show you that it's still working, let's run it. Great. So that's the version we're going to go with.

Actually, we've got one more thing to do. Just as before, where we had rep1 and rep2 and wanted to shorten them to just 1 and 2 by removing the prefix, we might want to do the same thing here: we have treatmentA and treatmentB, but we might just want A and B, so we want to remove the prefix. Just as we used the subtraction before, that subtract operation is just an alias for the minus method, so we could call collect with it.minus on each element. We might do that; or we can use the even shorter version, using our spread notation again with minus.
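To recap the parent-directory extraction before we finish the treatment cleanup, here are the four equivalent forms in context:

```groovy
workflow {
    Channel.fromFilePairs("data/reads/*/*_R{1,2}*.fastq.gz")
        .map { id, reads ->
            // Four equivalent ways to get each read's parent directory name,
            // e.g. ["treatmentA", "treatmentA"]
            def dirs = reads.collect { it.getParent().getName() }
            dirs     = reads.collect { it.parent.name }   // property-style access
            dirs     = reads*.getParent()*.getName()      // spread-dot notation
            dirs     = reads*.parent*.name                // shortest form
            [id, dirs]
        }
        .view()
}
```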
We're going to subtract this regular expression from each element in the list. Right, so now we're left with just the A's and the B's from the treatment directories. Of course, in this particular example, the forward and reverse treatments are always going to be the same for each pair, but we'll just continue along with this for the sake of demonstration. So, to our final map closure. These intermediate views were all just for debugging, so we're going to delete them. The treatment forward and treatment reverse values are what's returned from this spread-dot collection; then we tokenize the ID, we remove the rep prefix from the replica, and, giving ourselves a little bit of space here, lastly we add the treatment forward and treatment reverse entries, so that the meta object now contains all of these keys. We return the meta and the reads, and we're piping that to view, so we should be able to see it on standard output. So now we have all the metadata in one convenient little meta map, with the keys sample, replica, type, treatment forward, and treatment reverse. This metadata can now flow through the graph with the reads, and we can use it to split, join, recombine, and group the data.

The resulting channel would be perfect to pass to a process that looks something like the example here, one whose input is a tuple where a val meta object is the first element and some path reads are the second. This pattern is very common in a lot of Nextflow workflows, particularly in the nf-core workflows, so if you have your data in this sort of shape, you can import modules from nf-core and start using them almost immediately. We're going to see a lot more of this shape of channel, where we have a meta object and then some paths or reads. Perfect. And that wraps up the metadata propagation chapter.

Welcome to a new chapter. This chapter is all about grouping and splitting. This is one of my favourite concepts in Nextflow, and I think it speaks to one of the strengths of Nextflow: how to group and split, and construct more complicated graph structures inside your workflow; how to take data from processes, split it apart, and then join it back together further down the DAG. Hope you enjoy it. Let's get started. Great. As before, and as with all the other chapters, we're going to cd into the chapter directory, in this case advanced/grouping. We have a little main.nf, which in this case is actually just blank. Let's see what is in this data directory. We have, as before, a data directory; this time we have a genome.fasta, a FASTA index, some interval BED files, and our reads from the last chapter, split into treatmentA and treatmentB. We also have a sample sheet, in an ugly and a clean form. So let's start with our main.nf and a really basic workflow. We're reading a channel from fromPath with samplesheet.csv; let's just remind ourselves what this sample sheet looks like. Oops. It's a really simple little sample sheet: id, repeat, type, fastq1, fastq2. We're passing it through splitCsv, which we talked about earlier today, and then through a map operation where we take each row, construct a meta map object, and pack it into the tuple that we return. So we're returning a list where the first element is the metadata map and the second element is a list of file objects.
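That opening workflow, as a sketch:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row ->
            def meta = [id: row.id, repeat: row.repeat, type: row.type]
            [meta, [file(row.fastq1, checkIfExists: true),
                    file(row.fastq2, checkIfExists: true)]]
        }
        .view()
}
```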
We're making one more modification that we haven't seen yet today: we're adding this checkIfExists: true argument. This is a handy little argument to the file method which will have Nextflow go out and check that the file actually exists. checkIfExists: true works both on local files on a local file system and on remote files, in blob storage for example. Let's just run that and see what we get, to make sure everything's working as we expect. Fantastic. We're piping to view, so we're echoing everything to standard output, and we see our meta maps and our lists of FastQ files. One of the things I flagged earlier today was that I was not really happy with the stuttering, repetitive nature of the way we were defining our meta maps here: id is row.id, repeat is row.repeat, type is row.type. It's ugly to look at, but it's also error prone. We can, if we want, make use of the subMap method, a method already defined on maps in Groovy, to quickly return a new map constructed from a subset of an existing map. Here, splitCsv already returns us a map, because we specified header: true. If we hadn't specified header: true, we'd just end up with a list; but because we know we have headers in our CSV and supplied this argument, we get a map back in this row object. Since we're given a map here, and we want a map further down, rather than building a new map by defining each of the fields in turn, what we're going to write is meta = row.subMap and then just supply as arguments the keys we want to pull out of that map. These two lines are identical; they will return the same result. So instead of id is row.id, repeat is row.repeat, type is row.type, we're just using subMap and specifying the keys that we want. I'll give you a moment to catch up, comment that out, run it, and sanity check that everything's working as we expect. Great. We've returned a new map into our meta map by using subMap, pulling out just these three keys. Something really important that I want to note here, and there's a Bytesize talk that goes into this concept in more detail, is that we want to ensure we're treating this meta map safely. The subMap method returns a new map rather than modifying the existing row map. It's really important, when you're using map operations or any of these closures inside Nextflow, to ensure they always return a new object rather than modifying objects in place. This is because Nextflow is deeply asynchronous, and if you modify a map, or any object, in place, it could be that the same object in memory is being used in a different part of the workflow. Modifying it in place in one spot will affect an operation elsewhere in the workflow, which is undesirable and can cause errors that are difficult to debug. Both subMap and the plus operator that we talked about earlier for combining maps return new maps instead of modifying the original in place, so those are safe operations and perfectly good to use.
So now we're going to pause for five minutes for an exercise: can you extend this workflow in an unsafe manner? Instead of using subMap, what happens if you modify the row object in place? Can you do something unsafe, just as an exercise? It's obviously not something that would be encouraged in real workflows, but it's an interesting exercise to see if you can break it. I'm going to give you five minutes, so pause here. ... I'm back. I hope you had a chance to attempt that exercise. Don't worry if it was a little bit challenging; I'm not sure I've given you all of the tools you need to do this completely, but I wanted to give you the opportunity to have a go just in case. So let's take this. I've got one potential solution here for how to unsafely modify the objects. We take this samples channel and map it through a closure which does nothing: it just sleeps for a small amount of time, returns the objects unmodified, and then we view the meta object. Because this closure does nothing (remember, a closure returns its last expression, which in this case is it), whatever was passed in passes through transparently without any changes. So we should see in this view channel that the meta after the map is unmodified, the same meta that we passed in. In parallel, we do another map operation on the same samples channel: we pass it through an operation that assigns the type to be 'broken', and we just view those results. So we should be viewing two channels here. In the first, we expect to see unmodified meta maps, because we're not doing anything to the samples; we're just sleeping and passing them through. In the second, we're viewing another channel that takes the same samples channel and passes it through a map operation to produce a new channel, where we modify the type value to be 'broken'. Let's see what happens when we run this. Note: the type is 'broken' in both the modified and the unmodified channels. This is happening because, while the first closure is sleeping, we've modified, in this second map operation, the value under the type key. And because this meta is the same object in memory, and we're doing the dangerous operation of modifying it in place, it breaks the 'unmodified' expectation here. To do this safely, we would use the plus operation, which, as I've already described, returns a new object and is a safe operation. So we use plus instead of modifying in place with meta.type equals. Fantastic. Now the unmodified versions have type normal, type tumor, type normal, and the channel where we expect modification returns type 'broken', which is what we'd expect. The summary: the plus operations are perfectly safe and good to use, but modifying in place, like meta.type = 'broken', is a dangerous operation and should be avoided wherever possible.
So let's talk about passing maps through processes. Let's construct a dummy read-mapping process. This is not very advanced bioinformatics; it's just going to stand in place for a real operation, so let's imagine it's BWA or a similar mapping operation. We have a process here called mapReads which takes two input channels. The first is a channel of the form meta and then reads, so a val and then a path; the second channel we expect to be given a path to a genome file. We output a result, and the script here is just a dummy operation standing in for the real alignment. In the workflow we're going to make some changes as well. First, we create a new channel called reference, which is Channel.fromPath on data/genome.fasta, and we call the .first() method on it. This is really important: by calling .first() we return a value channel rather than a regular queue channel, which means it's inexhaustible; we can pull values from the reference channel over and over again, and it won't be consumed in the same way. For the rest, we construct our samples channel and then call the mapReads process, passing it the two channels: the first is the samples channel we constructed here, the second is the reference, a value channel holding the genome FASTA file. Let's watch these operations happen. Here we have our meta object (sample, repeat, type) and then a path to a BAM file, which of course is just an empty BAM file standing in place for a real one. So now we have BAM files, but we might want to merge the repeats: we've mapped these reads individually for each repeat, but, for example, the sample with type normal is represented in two BAM files here. Let's say we'd like to conduct some sort of BAM-merging operation, so we need to bring those elements together into one. For that we're going to use groupTuple. If you remember from our operator tour, groupTuple groups elements from a channel by some kind of key, and by default that's the first element of each item emitted by the channel. Coming out of mapReads we have to make a modification, because at the moment the first element of each item, the thing we'd be grouping by, is the meta map including the repeat, which makes this element distinct from this one, when we need them to be the same so they can be grouped together. To do that, I conduct a map operation over meta and bam, and we use our friend subMap to pull out just the id and the type: we keep the id and the type, we lose the repeat, and we return the bam. Let's see what that looks like. I'm going to use -resume so I don't have to recalculate those BAMs, even though they were very quick. Now we have elements where, for example, these two share exactly the same key, and groupTuple can combine them: it finds all elements that share the same key, which is a sub-map containing the keys id and type. Excellent. Some elements in the channel have just one BAM, because there was only one to begin with and no grouping was necessary, but others, like this one with type tumor, now have two BAM files grouped together. By default, groupTuple, as I said, groups on the first item in each element, which at the moment is the meta map. We can turn that map into a special class using the groupKey method, which takes our grouping object as the first parameter and the number of expected elements in the group as the second parameter. You might remember earlier today I talked about groupTuple being a blocking operation.
A blocking operation, that is: it won't emit anything until all of its inputs have been ingested, so that it knows it's safe to start emitting elements, that it has the complete picture and isn't going to accidentally miss an element of a group. With the groupKey method, we change this meta object into a groupKey object that encodes how many elements we expect in the group. So here, after mapReads, we change this map operation: let's say key equals, and we use the groupKey function, passing it the meta map, same as before, meta.subMap, and then we need the number of items in the group. But the question is, how do we get this number? The second argument to groupKey is the number of items in the group; how do we get that value? I've left that as an exercise. How would you modify something upstream of this so that the number of items in the group is available to use here? Eventually we're going to use it like this, key and then bam; but how do we get the count? I'll leave that as an exercise for you; we'll give it four or five minutes. Okay, welcome back. I hope you had some luck working out this little problem. The key is that we need to encode the number of items in the group somewhere upstream, ideally before the expensive step; let's assume this mapReads operation takes a long time. We can do that early. I'm just going to give us a little more space here. Here is an example solution. We take the rows after we split the CSV. We used subMap just before, but here I'm going to group early, before the subMap and before the mapping operations. I do the same thing, grouping on id and type, and then groupTuple, so our repeats are already grouped. Just so you can see what that looks like: we take the map, we group on id and type, we keep the repeat metadata because we'll need to pull it out again later, say for the mapping operation, and the reads are grouped up as well. We group on this small sub-map, so now we have the map that we're going to use later, after the mapping operation; these are the repeat numbers, and these are the fastq.gz lists. From here, I know how many repeats there are, because it's the number of elements in this value, the second item in each tuple. So for these two samples I know there are two repeats, and for this sample I know there's only one. To encode that, I can use repeats.size(); the size method on a collection returns the number of elements in that collection, so in this case it will be two, two and one. And I'm using our plus operation again to combine maps: this meta map with a new map containing the one key, the repeat count. Let's see what that looks like. We have a map with a repeat count; I can see there are two repeats here, but further down, with repeat count one, we only have one. And you'll remember earlier I said the transpose operator is sort of like the inverse of the groupTuple operator. So now, to undo that work, I'm going to transpose, and instead of our repeats sitting together in the same element, I've split the repeats back out.
Note that the meta map has been retained, so I know that this row, even though it's repeat one, comes from a sample for which there were two repeats, and that these two FastQ files, also repeat one, come from a sample that only had one repeat. So I'm passing this count down through the meta maps so that it's available when I set the group key. The last thing to do is pull the repeat ID back into the meta map: I take the meta map, the repeat and the reads, and combine the first two. Now I'm back to two items in each element, the meta map and then the reads, and I'll call that samples. Let's just view it so we can see. Great. My meta map now contains all the normal things, but with this extra key, the repeat count, which I use here when setting the group key. So now I can write meta.repeatcount. Perfect. I've run the groupTuple operation, but now, because I'm using this groupKey, elements from groupTuple will be emitted much faster than they otherwise would have been, because we don't have to wait for all of the mapping operations to complete before the groupTuple operator starts to emit items. The groupTuple operator already knows that, for some of these samples, it only has to wait for two of the repeats to be present before it can emit them. As soon as the repeat count, the second argument to groupKey, is satisfied, it emits immediately rather than waiting for all the samples. This is particularly useful for large runs where you have dozens or even hundreds of samples: you can start sending items downstream as soon as possible. So now that we have our BAMs grouped together, we can do some more fake bioinformatics, again just standing in for the real thing. We have a combineBams process, which takes a meta object and then a path to some input BAM files, and produces a combined BAM file. In this case we just cat them together, but you can imagine that in a real situation you'd be using proper tools to combine those BAM files. So now, after the groupTuple, we can run combineBams. Fantastic. Now we have our combined BAMs: the metadata for the sample and then a BAM file which is the combination of all the repeats for that sample.
The previous exercise demonstrated a fan-in approach, using groupTuple and groupKey for an efficient fan-in, but we might also want to fan out our processes. This is really common in variant-calling pipelines, where you might want to, for example, call variants on each chromosome or over a set of intervals. In this example I have an intervals file, this intervals.bed; let's see what it looks like. It's a dummy little bed file which just has chromosomes. Let's say that, now that we have our combined BAM files with the repeats merged, we'd like to do some calling over each interval in parallel and then fan those results back in later. Using the previous exercise, we can turn this bed file into a channel of maps. So let's make a new channel; I'm going to do it at the top, a new channel called intervals. As before, we take a path, the bed file, and we use splitCsv. But rather than header: true, which we've used in the past because we knew our CSVs had headers, this time we specify the headers manually, since bed files traditionally don't include one. I'm specifying that I expect this intervals.bed file to contain four columns, which I'll call chromosome, start, stop and name. I'm also specifying the separator as a tab character rather than a comma, which is the default. Then I'm going to use the collectFile operator. Actually, let's just see what splitCsv returns us first. I'm using a little shortcut here, a little cheat: I'm calling return here so that everything downstream of this point is not evaluated. Great. splitCsv returns us a channel with three elements, three maps. After that, I call collectFile, and I'm going to make a new file for each entry. The collectFile operator can take a closure: it takes an input, and I return a list of two elements. The first element is the name of the file I'd like to collect into; elements coming through this closure that share the same output file name will be collected into one file together. The second element is what I'd like to write into that file. In this case we pull out the values from each of the map's key-value pairs, so it'll be something like chr1, 0, 11 and interval1, and we join them with tabs. That returns us three files: the first interval, the second interval and the third interval. We give that channel of files the name intervals. Now let's have a dummy genotyping process. Just as before, we're not actually doing genotyping; we want to demonstrate the splitting and merging rather than real bioinformatics, and we don't want to wait for real tools. We'll call it the genotypeOnIntervals process; it takes a meta object, a BAM and a bed file of intervals, and does some fake genotyping. So from here, after combineBams, what we're going to do is combine this with the intervals channel. Let's just see what that looks like. The combine operator takes each input element and emits a new element for every combination with the bed files. So for each input we now have three elements being emitted to the channel: one for interval one, one for interval two and one for interval three. That allows us to calculate the genotyping in parallel on each of the intervals. And this channel now emits items that are the right shape, the right cardinality, for the genotypeOnIntervals process: a meta map, a BAM and a bed file. So I can pipe it straight into genotypeOnIntervals.
Actually, there's one more thing I want to show. After this groupTuple operation, let me just remind us what the elements look like. Here we're viewing the channel, and the first item, if you'll remember, was our key, and then the BAM files. If we drop the BAMs and just return that key, we can check its class. Remember that we used groupKey for the groupTuple here, but now we want to do away with that group key. Before we map it through getGroupTarget we see the groupKey class, whereas after pulling out the group target we see a LinkedHashMap, our straight meta object back. Great. So now, after groupTuple, we can dispose of the group key by calling getGroupTarget, which returns the meta map, and pass that through combineBams. Then we combine with the intervals and genotype over all of those intervals in parallel. Now let's say we want our merged genotypes. We have some VCF files, or some genotyped BAMs for example, and we'd like to merge them. We can make a mergeGenotypes process which takes the meta object and some BAM files, and again it's just a fake merging operation. Look at what's returned from genotypeOnIntervals: we have lots of these BAM files, one entry per sample per interval from our bed file. Before passing that through mergeGenotypes, we need to groupTuple so that the cardinality matches what the process expects. Instead of the group target, we could also, as I've shown here, use meta.subMap; that's totally fine too, another way of getting back a plain map instead of the group key. So we combine with our intervals, we call genotypeOnIntervals, and then we groupTuple and merge those genotypes together. And now we have our merged genotypes. In this chapter we've done some complicated work. We took our samples and grouped all of the repeats together, and we did it in a way that's very efficient by using groupKey: we encoded the number of repeats early on in the pipeline and passed that metadata down through the mapReads process, so it was available for the groupTuple. We grouped our repeats together and combined them into a single BAM file per sample, then removed the group key element. We combined that channel with the intervals channel, giving three elements per BAM file, one for each interval in our bed file. We genotyped over each of those intervals in parallel, grouped back by the meta map with groupTuple, merged the genotypes and viewed the outputs. So we've done some complicated branching, splitting and merging, and I hope this gives you a taste of the strategies you might use to implement efficient grouping and splitting in your own Nextflow workflows. Perfect. We'll see you in the next one.

That concludes the content for this first chunk of the workshop. For the next hour or so we're going to leave this open for Q&A, so feel free to drop by the Sep 23 Advanced Training channel in the nf-core Slack, and we can discuss anything that's come up with the content we've presented so far: any questions or concerns, or general Nextflow and nf-core queries. We look forward to seeing you in Slack. See you soon.