So welcome, everybody, to day two of the nf-core hackathon. Today we'll have a session talking about DSL2 and modules in Nextflow, and first we have a talk by Evan Floden and Paolo Di Tommaso, who need no introduction. So I'm going to give the stage directly to Evan, and he's going to talk to us about DSL2 for nf-core.

Awesome, thanks a lot. It's super exciting to talk to everyone here today. Maybe a small introduction first: I'm Evan, CEO and co-founder of Seqera Labs. I work on Nextflow, and we've also got Paolo on the line — if you don't know him, Paolo is the main developer of Nextflow. As I said, we were asked to do a very short introduction to DSL2. DSL2 stands for Domain Specific Language 2, and it's an extension of the Nextflow language. It's a super exciting thing: we've been working on it for almost a couple of years now, and in the next week or so, as part of the July release, it will be rolled into the main release of Nextflow and start to become more widely used. I've been using it for almost a year now in the main pipelines I've been developing, and I've found it phenomenally powerful and exceptionally useful. Once you get your mind into working this way, it really is the way to go. At the same time, we still teach the basics of traditional Nextflow first, because I think it's important to get the concepts right to begin with.

What I wanted to start with is to imagine an existing pipeline. If you think about what a data pipeline consists of, you can split it into two parts. There is the part where each process, each task, runs some piece of software — runs some script, does some processing of some data. And then there's the second part, which is how you connect these things up. Traditionally this looks quite basic, like in this example, but as we've seen many times before, with very complex pipelines it becomes quite difficult. What we were thinking about with DSL2 is how you can split these two things apart, and splitting them apart actually helps you write pipelines more efficiently.

The two parts I'm talking about are, first, where you write your code — the process block, where you focus on the actual task that the pipeline should run at that moment — and second, the inputs and outputs of those processes: the channels, the dataflow programming side that we see here. The separation of these two things is one of the main aspects of DSL2, along with several other new additions that follow from that change.

We've been working on this for a long time; it was really one of the most requested features over the years: to be able to modularize the workflow definition, to create independent components, and to reuse those components in different workflow projects. The reuse of components was an exceptionally important part of bringing this to Nextflow. It also allows us to break up the script — typical Nextflow pipelines are one massive .nf file, and if we're able to break that monolithic script up, we can split it into pieces that are both reusable and more manageable.
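To see why that is hard in the traditional syntax, it helps to look at a concrete process. The following is a rough sketch of the kind of DSL1 example discussed next; the process names, channel names, and commands are illustrative, not taken from the actual slides:

```nextflow
// DSL1-style: the channel wiring is part of the process definition itself
process index_sample {
    input:
    path genome from genome_ch        // implicit dependency on genome_ch

    output:
    path 'index' into index_ch        // output hard-wired into index_ch

    script:
    """
    build-index $genome index         # hypothetical command
    """
}

process align_sample {
    input:
    path index from index_ch          // linked to index_sample via index_ch
    path reads from reads_ch

    output:
    path 'aligned.bam' into bam_ch

    script:
    """
    align $index $reads > aligned.bam # hypothetical command
    """
}
```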
The challenging thing about this is that in Nextflow, the processes implicitly defined what their inputs and outputs were, so the execution of the pipeline was intrinsically linked to those process definitions. Look at the typical example we use: notice that the process contains the channels inside it — from genome_ch, from reads_ch. Because these are declared implicitly, they are really also defining the dependencies associated with that task, with that process. The process definition implicitly determines how the process connects to the rest of the workflow, and that makes it difficult to split up, because this reads channel, for example, could be used in another step. If I then define align_sample as a downstream process, these two processes become linked together, and it becomes very difficult for me to take index_sample and use it in a different location.

So, as I said, DSL2 is a major revision focused on modularization of the workflow definition and component reuse. We've been working for quite a long time to be able to do this, and it's great to finally see that the iterations over that time — the time spent working with the community on specific aspects of the design — have paid off into something that is now ready for a stable release.

Okay, if you've ever used Nextflow before, you shouldn't find DSL2 too different; it's really a natural extension of what we've got. To keep backwards compatibility, you enable it with a declaration at the top of the script: if you enable DSL2 like this, then you're able to use the DSL2 syntax in your script. The key thing to note when you look at a DSL2 process is that we no longer define which channels the inputs come from. The process doesn't need to declare from and into. Previously we used from to say where the process was receiving data and into to say where the output data was going, but we don't need this anymore. We just have path transcriptome as the input and path index as the output — this should look fairly standard.

What comes out of this is that the process definition is no longer tied to a specific channel or dataflow, and therefore we can use it independently. So how do we take these DSL2 components and reuse them in a pipeline? This index process, for example, could be defined in another file, and we can include it into a main script. Look at this example: it's a DSL2 workflow definition. We are including the index process from some file, some module that we could have defined — and we could include one or many of these. Then in the workflow block we define things exactly as you would in standard Nextflow: params.transcriptome is just defining a parameter, and the read pairs channel is created with fromFilePairs. This is all standard Nextflow.
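Assembled, the example being described looks roughly like this; the module path and parameter defaults are illustrative:

```nextflow
nextflow.preview.dsl = 2

// pull the process definitions in from a module file (illustrative path)
include { index; fastqc; quant } from './modules/rnaseq'

params.transcriptome = "$baseDir/data/transcriptome.fa"  // illustrative default
params.reads         = "$baseDir/data/*_{1,2}.fq"        // illustrative default

read_pairs_ch = Channel.fromFilePairs(params.reads)

workflow {
    index(params.transcriptome)
    fastqc(read_pairs_ch)
    quant(index.out, read_pairs_ch)
}
```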
And the difference now is that we call the processes almost as functions. Instead of having the process block wired up with channels, we can just name the process, and the inputs are given in the order the process expects them. So the transcriptome becomes the input channel to index, read_pairs_ch becomes the input channel to fastqc, and the quant process has two input channels: the first one is index.out — the output channel of the index process — and the second is read_pairs_ch.

Another thing to note is that with DSL2 we're able to reuse channels several times. We no longer have the limitation of having to split channels: I can use read_pairs_ch here and use it again there without splitting it. The other thing to note is the use of .out: with .out I can access the output channel of index without having to define it as a variable, which keeps things simple to read.

We also have a new way of defining workflows — or sub-workflows, in this case. We can give a workflow a name, for example workflow RNASEQ, and we have new keywords for defining the inputs and outputs, not of a process, but of a workflow. We use take, the keyword for the inputs of a workflow — here the transcriptome and read_pairs_ch, as before. Then we have the main section, which is what is executed: index, fastqc and quant, exactly as before. And the output of the sub-workflow is defined with the emit keyword — here it is quant.out.

The final thing I'll show you, before I pass you on to Paolo for a full demo of converting a pipeline to DSL2, is the ability to use pipes. With pipes we can write a step of our workflow along one line, piping the output of a process or an operator into the input of the next step. Imagine this example in the workflow block: Channel.fromPath(params.in) creates a channel containing our input; I pipe that channel to splitFasta, a Nextflow operator that splits the FASTA up; the output of that operation is then piped to the align process, so align runs for each element emitted from that channel; and finally I pipe it to the view operator to view the output. It's a neat syntax that builds on the concept of Unix pipes and the expressiveness that comes from them, applied now to distributed computation with Nextflow. With that, I'll pass you over to Paolo, who will do more of a hands-on demo of how to use this and the basic concepts behind it.

Nice. Let me share the screen. Okay. The idea is to show you how to migrate a pipeline — small or big — to this new syntax. I'm going to use our small RNA-seq pipeline that we use for demo purposes: the repository nextflow-io/rnaseq-nf. I have a local copy. I think most of you have already seen this, but if you haven't: there are just four small tasks, and at the beginning there is some parameter definition — the reads, the transcriptome, the output directory, et cetera.
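For context, the top of that demo script looks something like the following; the exact paths and defaults here are illustrative, so check nextflow-io/rnaseq-nf for the real values:

```nextflow
params.reads         = "$baseDir/data/ggal/*_{1,2}.fq"       // illustrative default
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa" // illustrative default
params.multiqc       = "$baseDir/multiqc"
params.outdir        = 'results'
```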
Then a log.info block prints some runtime information about the inputs of the pipeline. Then there is the read pairs channel definition, with the usual trick of splitting the channel into copies, because of the DSL1 limitation where the same channel cannot be used as input by multiple processes. Then there is a task for the index creation using Salmon, the quantification step, again using Salmon, the FastQC step, and finally MultiQC, which collects all the outputs and creates the final report.

The first thing needed to enable the DSL2 syntax is to declare a flag at the beginning — nextflow.preview.dsl=2 — which makes it possible for Nextflow to support both syntaxes without breaking existing code. So this is an opt-in feature that must be enabled with this flag. In the final version, preview will be replaced with enable, but for now preview is enough to tell Nextflow that you want to use the new syntax.

As Evan mentioned, it's not a radically new syntax for Nextflow, but there are a few changes that allow us to do many more things. The first step is to remove all the from declarations from the processes, because this is what allows us to reuse these process definitions in multiple places, like independent modules. So we just remove all the from parts, and this makes it possible to use the processes as pure definitions of the tasks we want to execute, without linking them to each other — they become independent modules, basically. In the same way, I need to remove the into definitions — I'll go into more detail about the into part later — so let's remove all of those from the processes too. Okay, we're done.

So now we have the tasks, but if I try to run this, it does nothing. What's happening? It gives a warning, but it doesn't actually execute anything, because I have only the task definitions: there is no declaration of how we want to use them. Before, the from and into definitions created the relationships between the tasks; now those no longer exist. Instead — let's remove this, we don't need it — we have the ability to declare a new component, the workflow, which is what allows us to recombine all these tasks together.

For example, I want to start by using the index, like before. Since I declared one input in the index task, what I have to do is invoke the process passing the input that I declared in the process definition — so here I can pass params.transcriptome. Basically, the process becomes like a custom function that I can invoke inside the workflow scope. This is what DSL2 allows you to do: treat processes like custom functions that run your tasks. And this is enough to run the workflow. Let me run it — with Docker, and also -resume... oh, we got a resume. That ran the first task.

Now what I have to do is recreate the pipeline, recombining the tasks like before. After index I can use the fastqc task. The fastqc task takes the read pairs, and that is what I was doing here in the channel declaration with into. There is no more into operator in DSL2 — I can just use the channel directly. That's another advantage: DSL2 is not just about modularizing tasks, it also removes the need to split a channel when you have to reuse it in multiple places. That makes the syntax much more readable, with less boilerplate and no need for different channel names for the same channel.
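For reference, here is roughly what that first conversion step looks like on the index task — the DSL1 version with its channel bindings, and the DSL2 version with from and into removed. This follows the shape of the demo pipeline, but treat the channel names and command details as illustrative:

```nextflow
// before (DSL1): the task is bound to specific channels
process index {
    input:
    path transcriptome from transcriptome_ch

    output:
    path 'index' into index_ch

    script:
    """
    salmon index -t $transcriptome -i index
    """
}

// after (DSL2): same task, no from/into — now a reusable, standalone definition
process index {
    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index -t $transcriptome -i index
    """
}
```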
So it's much more concise: I can use read_pairs_ch here too, and now I can run it again. Okay, now it has executed that task as well.

Then I have the quantification task. Quantification takes the index generated by the index task, plus the read pairs. So how can I combine these two things? It can be expressed in different ways, but the simplest is to use the process name to reference the output it produced. Here I can say index.out: I take the output of index using this implicit variable that is available on the index task. So I'm invoking quantification passing, as the first argument, the output of the index invocation; and since it takes a second parameter, the read pairs, I can just use the same read_pairs_ch again. In this one line we are seeing two powerful features of DSL2: the way to reference the output of a process — just process name dot out — and the reuse of the same channel one more time, without having to split the channel like we had to do before. Much easier, much more readable. Let's do a resume now. Here it is.

Finally, we have the MultiQC task. MultiQC is a bit more complicated, because its first parameter receives all the outputs from the fastqc and quantification tasks. We can do the same thing we were doing before, just with the new syntax: I take the fastqc output and mix it, like before — with what? With quant.out — and then collect. This is the same snippet I had in the previous implementation, just adapted to use the process name to access the output of that process; I continue to use the same mix operator, the same collect operator, et cetera. And the second argument is just params.multiqc. Now let's try a resume. Fantastic.

So these few lines show the main features of DSL2: the ability to use processes like custom functions, to reuse the same channel multiple times, and to access the output of a process with the dot-out notation. In this way we can isolate the main logic of the pipeline in a much more concise form, which lets us read it without having to mentally reconstruct the links between all the processes. That makes it much easier to follow, especially when you have complex logic in your pipeline.

But we still haven't seen how to create separate modules, because I just adapted this pipeline to the new syntax within a single file — and the point was to create modules. How? I can just take all the processes and put them into a separate file, say rna-tasks.nf — oops, one letter too many there. And that is a Nextflow module: it's just a Nextflow script that contains processes declared using the new syntax, without into and from. Since I'm referencing some parameters here, I have to declare them the way I would in any other Nextflow script, with their default values — nothing changes there, that's the usual thing. But now I can include this rna-tasks script from the main pipeline, using include and then specifying the names of the processes that I want to include: index, fastqc, quant and multiqc. Hmm, why is it not picking them up? Ah — I need to specify the script path, which is ./rna-tasks; the file extension is not needed.
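The result looks something like this; the module file name follows the demo, and the workflow body is the one built up above:

```nextflow
nextflow.preview.dsl = 2

// the processes now live in a separate module script, rna-tasks.nf
include { index; fastqc; quant; multiqc } from './rna-tasks'

read_pairs_ch = Channel.fromFilePairs(params.reads)

workflow {
    index(params.transcriptome)
    fastqc(read_pairs_ch)
    quant(index.out, read_pairs_ch)
    multiqc(fastqc.out.mix(quant.out).collect(), params.multiqc)
}
```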
The ./ prefix, on the other hand, is mandatory: you have to specify either a relative path like this, or an absolute path starting with a slash. There were also some versions in which the process names could be omitted, meaning include everything, but we decided to remove that, because making it that easy to include everything made it very difficult to understand, when you have many inclusions, where each process is coming from. The curly bracket notation used to be optional too, but in the final version it is going to be the only way to write an inclusion, even if you include just a single name like this — to keep the syntax consistent, rather than having different ways of including things with and without curly brackets. So the inclusion is always written with curly brackets.

Okay, I think this is enough to run it again. Fantastic. Now I have the same pipeline, where the main body is just around fifty lines: the include of the tasks, then a channel declaration — which I could even move inside here — and then the workflow block. The nice part is that these tasks now live in a separate module that I can include into different scripts.

Another nice thing is that I can create modules, components, that are not just single tasks but whole sub-workflows. How? Using the same notation: maybe I want a sub-workflow that always runs the index, the FastQC and the quantification steps. What I do — in this file or a separate file, it doesn't matter — is declare a workflow to which I can give a name, say RNASEQ_PIPE. Instead of invoking the tasks one by one at the top level, I want to have this as a piece, a component, that I can reuse. The question now is how to declare its inputs and outputs. The inputs of the sub-workflow are declared with the take keyword, which is the workflow equivalent of the input declaration in a process: it lets you declare identifiers that you then use — the transcriptome, which is the input for the index, and another input, the read pairs channel. Then the body of the workflow follows, and, since you may want to return an output from the workflow execution, you declare the output with emit. So what have we done? We've created a sub-workflow that takes these two inputs, runs the three tasks, and produces, like before, the combination of the FastQC output and the quantification output.

Let's try it in action. I can now include RNASEQ_PIPE from this file — let me put it here. So now, instead of calling those tasks directly, I can call RNASEQ_PIPE, passing params.transcriptome and read_pairs_ch like before. And multiqc now takes RNASEQ_PIPE.out and then, like before, params.multiqc. That's enough — we've isolated a piece of the workflow this way. Let's run it. There it is. Notice we're executing from scratch, because the namespace of the execution changed: the processes have become RNASEQ_PIPE:index, RNASEQ_PIPE:fastqc, et cetera, and the last one is multiqc. So basically they are nested inside a namespace.
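Put together, the sub-workflow and its invocation look roughly like this (RNASEQ_PIPE stands in for whatever name was typed in the demo):

```nextflow
workflow RNASEQ_PIPE {
    take:
    transcriptome
    read_pairs_ch

    main:
    index(transcriptome)
    fastqc(read_pairs_ch)
    quant(index.out, read_pairs_ch)

    emit:
    fastqc.out.mix(quant.out).collect()
}

workflow {
    RNASEQ_PIPE(params.transcriptome, read_pairs_ch)
    multiqc(RNASEQ_PIPE.out, params.multiqc)
}
```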
And what else? The last thing that's nice to mention — Evan already touched on it a couple of minutes ago — is the pipe notation. The pipe notation allows you to concatenate operators and processes in a more readable, maybe more intuitive way, especially since we are used to working with Bash and Linux syntax: it's just a way of combining operators with pipes inside a Nextflow script. So I could say that fastqc.out is piped to the mix operator — since mix also takes an argument, we keep the parenthesis notation there — and then the output of the mix goes into collect. I no longer need the empty parentheses when there is no argument. And now it's a bit more readable, because we are saying that the workflow produces this output: the fastqc output, mixed with the quantification output, and everything collected.

The only suggestion is: try not to abuse this notation. I have seen people trying to write twenty-line pipes concatenating all the tasks together, and that makes things too complicated to read. But small one-liners are a nice way to make the code more readable.

One last thing: a workflow can produce one or more outputs, and you can give each of them a name by assigning it to a variable. If I do something like that, I can then access the RNASEQ_PIPE output using that notation — the name that I used in the emit section. I could also say that I want to return the quantification result, assigning quant.out, and maybe the FastQC result, assigning fastqc.out. So now my component, my sub-workflow RNASEQ_PIPE, produces three outputs, and I can choose which ones I want to use — all of them together, or maybe just the FastQC result.
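As a sketch, the piped emit together with named outputs might look like this; the names qc, quant_results and fastqc_results are illustrative:

```nextflow
workflow RNASEQ_PIPE {
    take:
    transcriptome
    read_pairs_ch

    main:
    index(transcriptome)
    fastqc(read_pairs_ch)
    quant(index.out, read_pairs_ch)

    emit:
    qc             = fastqc.out | mix(quant.out) | collect
    quant_results  = quant.out
    fastqc_results = fastqc.out
}

// downstream, pick the output you need by name, e.g.:
// multiqc(RNASEQ_PIPE.out.qc, params.multiqc)
```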
Okay, these are the main features of Nextflow DSL2: the ability to create separate modules — libraries of tasks and sub-workflows that can be included into each other — the pipe notation, and the ability to reuse the same channel in different places without having to use into to create copies. All of this makes the resulting script much more readable and much more concise. I think this is also why, at some point, we decided to call it DSL2: it is a small change in the syntax, but the impact is so big that it provides a completely new experience when writing pipelines with Nextflow. Okay, this concludes my demo. I don't know if we have time for questions in the chat.

Thank you. Thank you very much, Paolo, for this hands-on demonstration — this will really help us port all our workflows to DSL2 — and thank you, Evan, for the introduction. We do have some time for questions now, and we actually already have some in the chat. I can read them out loud for the YouTube followers as well.

So we have a question from Moritz: since workflows can now reside in a separate module, it seems natural to start versioning them in separate repositories. Can we import from a repository URL?

Okay, yes, this is a classic question. The short answer is no — at least not in the basic form where you specify a URL — because that is very dangerous for the stability of the pipeline. This was one of the biggest failures of first-generation workflow engines that allowed importing data or scripts from a remote repository: we know that repositories change over time, so the pipeline becomes super fragile and breaks as soon as that URL breaks. The idea is instead to include external dependencies using, for example, Git subprojects or Git submodules. Another possible extension, which nf-core is working very hard on, is a way to create nf-core modules that can be resolved using nf-core tools — maybe Phil has more details about this.

Yes, exactly — this talk will be followed by a talk by Phil, and he will explain this in more detail. Exactly. But not directly at the level of the Nextflow language. Exactly, yeah.

So we have another question: I like the pipe feature. Will it be possible to pipe like in Bash, and see the final output row by row, or does each process need to finish before the next one starts?

I'm not sure what they mean in the question — seeing the output row by row? I think the question comes down to this: it operates exactly like regular Nextflow. When there is a process which has, for example, a hundred tasks, everything runs in parallel, and a downstream process does not have to wait for the upstream process to finish before it begins. So it's a full dataflow execution, exactly like you have today.

Okay, perfect, I hope that answers the question. Does somebody else have other questions? Please write them in the chat.

So, from Steve: the new DSL2 and module system resembles some aspects of CWL workflows. Do you think we might be able to use CWL tool definitions at some point with this?

In principle, it is possible. We also made an attempt to implement CWL import — to run CWL tool definitions inside Nextflow DSL2 — but we didn't have enough time to turn it into a real implementation. So there is a kind of proof of principle showing that the approach is possible, but nothing usable at this point. Maybe at some point we will implement it.

Perfect, thank you. Any further questions? If you have some further questions later: at the end of all the talks we will have a discussion session, so you can also continue asking your questions there. And we will see you next time.