Hello everyone, thank you for joining us for yet another Bytesize talk this Tuesday. We'd like to begin by thanking our funders, the Chan Zuckerberg Initiative, for supporting all nf-core events. A few details: the talk is being recorded, and the video will be shared on our YouTube channel and on Slack as well. As usual, we'll have a talk of around 15 minutes, followed by a question-and-answer session where you are free to post your questions in the chat box or unmute yourself and ask the speaker directly. Today we are glad to have with us Maxime Garcia, a bioinformatician at the Science for Life Laboratory and the Karolinska Institutet in Sweden. He'll be talking to us about DSL2, the syntax extension that implements the definition of module libraries and a better way of writing more complex data analysis pipelines with Nextflow. So Maxime, over to you.

Thanks a lot. Hi everyone, Maxime here. Today I'm going to talk to you about the updates in the new DSL2 syntax, especially for modules. As a brief overview, I'll cover what's new, what can be done, and what should be done.

First, a disclaimer: these are my own takes on the new syntax, and other developers might have different ideas. The best thing about nf-core is that we are a community. Of course there are driving forces who push things forward and have a lot of ideas, but what I like is that everyone has a voice and everyone can improve things, even me. And because we are all doing this as a community, the syntax and the logic will definitely evolve, so this is just the current state of the new DSL2 syntax. What we want to do, I think, is forge the best practices around it. So I will just show you what I'm doing right now, and let's discuss at the end to see where this is going.

So what is new in the DSL2 syntax for modules? Basically, modules are now fully self-contained. We don't need a functions.nf any more, and we don't need to pass options whenever we include a module or a sub-workflow. All of the logic for calling a module or a sub-workflow can be handled with the new task.ext directive: we can set different tool arguments, we can set the prefix for the output file names, and we can use the when directive to decide whether a particular module should run or not. These task.ext directives are then used with withName selectors in the modules.config, so instead of a huge params map we just have withName selectors. It's a completely new way of working: we can do almost anything, and I think that is also the main issue, because we can literally do almost anything. What we really need to figure out is how to write good closures: closures to decide which arguments to use when we have several or just one, closures to build the prefix, and closures to decide when to run a module. But there is a downside. We've got this whole new syntax for modules that allows us to use the task.ext directive.
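A minimal sketch of what such a self-contained module can look like, assuming simplified placeholder names for the process, inputs and command line (this is not a verbatim nf-core module, but it follows the task.ext pattern with ext.args, ext.prefix and ext.when described above):

```nextflow
process BWAMEM1_INDEX {
    input:
    tuple val(meta), path(fasta)

    output:
    tuple val(meta), path('bwa'), emit: index

    when:
    task.ext.when == null || task.ext.when         // run unless the config says otherwise

    script:
    def args   = task.ext.args   ?: ''              // extra tool arguments, set in the config
    def prefix = task.ext.prefix ?: fasta.baseName  // output name, overridable from the config
    """
    mkdir bwa
    bwa index $args -p bwa/${prefix} $fasta
    """
}
```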
That means that the logic can go into the config. The main issue is that now we have logic in the workflow, in the sub-workflow and also in the config, and that can get completely messy, which can be really bad. So what should be done is that we should be very careful whenever we set up task.ext, especially for the arguments: if we set them up in the config, then we must be very clear about explaining what is happening, where and why. I advise everyone to write comments explaining the whole logic, because first of all it will be good for future you, since you will forget about it. I know I already have; the comments are already helping me, and I only started last week. It's also good for other developers, because, as I said earlier, we are part of a community; we are not coding only for ourselves, we are coding for everyone. Comments are always good practice.

Now I think it's time for some examples. I will begin with Sarek, because I just finished a PR that we merged with Friederike last week, and I think we are now the pipeline leading the way on this development. If I look at the prepare_genome sub-workflow, which prepares the indices and the other files that we need before launching the whole pipeline, it is now super simple: I just launch all the tools, and that's all. The whole logic behind it lives in the config file. As you can see, the sub-workflow is very simple, it looks really clear, and I'm very happy with that. Which is exactly why I think we need to be careful and add comments. Here, for example, I commented that this module will be run if the aligner is BWA-MEM, and that this other one will be run if the aligner is BWA-MEM2. I also added some general information at the beginning of the file, saying that for all modules here the when condition is defined in the modules.config to determine whether the module should be run, explaining how the condition is defined, and noting that if there is an extra condition, it is specified in a comment.

Now let's have a look at the corresponding modules.config, for example the entry for the BWA-MEM index. Here we have the publishDir directive, which determines how the files are published: whether you want to save them, where, and how. We can handle all of that within the config, so we don't have to take care of it anywhere else, and it's that simple. In this case, we will save this file only if the save_reference param is specified, using the configured publish mode, with this specific path and this specific pattern. And we will run this process only if the aligner param is bwa-mem, only if the bwa param is not set, which means that no BWA indices were provided to the pipeline, and only if we start the pipeline with the mapping step. If we start the pipeline at a later step, for example at variant calling, then we don't need the BWA indices, so we don't need to run this process.
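A rough illustration of what such a modules.config entry can look like, assuming simplified selector, path and parameter names rather than the exact ones used in Sarek; the structure, a publishDir definition plus an ext.when closure, follows what is described above:

```nextflow
process {
    withName: 'BWAMEM1_INDEX' {
        // save the index only when the user asked to keep reference files
        publishDir = [
            enabled: params.save_reference,
            mode:    params.publish_dir_mode,
            path:    { "${params.outdir}/reference" },
            pattern: 'bwa'
        ]
        // build the index only when mapping with bwa-mem, when no index was
        // provided to the pipeline, and when the run starts at the mapping step
        ext.when = { params.aligner == 'bwa-mem' && !params.bwa && params.step == 'mapping' }
    }
}
```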
Similarly for the BWA-MEM2 process: we only run it if the aligner is bwa-mem2, and so on for all of the other processes in this pipeline. So for the indices, their preparation, and all the other tools, it's fairly simple: I just use some conditions within these closures to decide whether and why a process should be run.

Maxime, sorry to interrupt: somebody asked if you can increase the font size a little bit. Oh yes, of course, sorry. Like that, I think it should be better. Yeah, looks good to me. Yes, okay.

So let's look at something a tiny bit more complicated. Here we will be looking at the mapping sub-workflow that we use in Sarek, and that I hope to publish one day in the nf-core repo so that it can be used by other pipelines as well. This sub-workflow has been refactored several times by Friederike and me, and I'm pretty sure other people are looking into it as well and will improve it again, which I'm always happy about.

In the Sarek workflow we have the whole logic that decides whether this sub-workflow runs or not. I won't show that; I will just show what happens inside the sub-workflow. Basically, with the task.ext directive we can set up the whole logic inside the config file, so the pipeline itself is much simpler. Here we just run BWA-MEM or BWA-MEM2 on the input files together with the indices, and we pass true because we want the output files to be sorted. Then we are just gathering the BAM output and doing a bit of channel remapping; I won't go into the details of that, it's just an extra step. In the end, only if this condition is true, we merge all the BAM files and index them, and we only do that if we want to skip markduplicates or if we want to save the BAM files; it's all explained here in the comments. And then, of course, we gather all the versions.

So here we have all the modules that are called, and the whole logic again happens in the config file. Similarly to what we did with the indices, we run BWA-MEM only if the aligner param is bwa-mem, and BWA-MEM2 only if the aligner param is bwa-mem2. You can also see that we set a particular argument depending on the meta map: in Sarek we have a specific case when we have a tumour sample, so if the status is 1, meaning it's a tumour, then we use this particular value, otherwise it's the regular parameters we use. Similarly, we use a particular prefix only if we split the FASTQ files at the beginning. And, as I explained, for the merging after mapping we only do that when we save the mapped BAM files or when we skip markduplicates. That's all.

So I think this whole improvement to the syntax really allows us to make the sub-workflows easier to read, but in the meantime we really have to push everything into the comments and explain it all. And, okay, I see I made a bad copy-paste here, because this is exactly the same.
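A hypothetical sketch of this kind of per-sample logic in the config; the selector, the exact argument values and the meta fields used here are invented for illustration, but the pattern of closures over params and the meta map matches the description above:

```nextflow
process {
    withName: 'BWAMEM1_MEM' {
        // only run this mapper when bwa-mem is the chosen aligner
        ext.when = { params.aligner == 'bwa-mem' }
        // tumour samples (status == 1 in the meta map) get an extra argument;
        // the exact option values here are invented for the example
        ext.args = { meta.status == 1 ? '-K 100000000 -Y -B 3' : '-K 100000000 -Y' }
        // when the input FASTQ files were split, include the lane/chunk in the name
        // so that per-chunk BAM files do not collide (field name is illustrative)
        ext.prefix = { params.split_fastq > 1 ? "${meta.id}_${meta.lane}.sorted" : "${meta.id}.sorted" }
    }
}
```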
Let's look at one more complicated sub-workflow, and I think that will be my last example for today: the markduplicates sub-workflow, which can be skipped, as I explained earlier. As input we take the bam_mapped channel, which contains the meta map plus the BAM file, or the bam_indexed channel, which contains the meta map plus the BAM and its index. We only have one of these, depending on whether we are skipping markduplicates or not; actually we can have both, because it's an optional channel, but that doesn't really matter here.

In the first case, when we are skipping markduplicates, I run a tool on the BAM file to convert it to CRAM without marking duplicates, which is why this module has such a long name; it is only run when we are skipping markduplicates. Otherwise, if we are running markduplicates with Spark, we run that instead. If we want some quality-control tools run on the markduplicates output, then the output of MarkDuplicatesSpark will be BAM, otherwise it's CRAM; and if we get a BAM, we convert it to CRAM, because we want to work with CRAM in our pipeline. That part is slightly difficult to understand, which is why I tried to comment everything, and why I put extra comments in the config as well. If we are not running Spark but are still running markduplicates, then we run the regular module for that, GATK4 MarkDuplicates, and then convert the BAM file to CRAM.

In the end, the cram_markduplicates channel will contain only one of these channels, because there is only ever one possibility: either we are skipping markduplicates, or we are running MarkDuplicatesSpark with BAM output, or MarkDuplicatesSpark with CRAM output, or the regular MarkDuplicates, and it is always just one of these. Then, if we are running MarkDuplicatesSpark and the report on the BAM file, this one runs; otherwise we run the report on the markduplicates BAM output or input; and otherwise we run samtools stats on the CRAM file. And that's all.

So, as you can see, the sub-workflow is a bit complicated, but it reads clearly, and I think that makes it easier to understand even if the logic is a bit fuzzy, which is why we have everything in the modules config. There, similarly, we have the prefix that defines what the output file name should look like, and a proper when directive that tells us whether, how and why it runs.
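A rough sketch of how such mutually exclusive branches can be combined in a sub-workflow, assuming simplified module names, include paths and output names rather than the exact Sarek ones:

```nextflow
include { GATK4_MARKDUPLICATES          } from '../modules/gatk4_markduplicates/main'
include { GATK4_MARKDUPLICATES_SPARK    } from '../modules/gatk4_markduplicates_spark/main'
include { BAM_TO_CRAM_NO_MARKDUPLICATES } from '../modules/bam_to_cram/main'

workflow MARKDUPLICATES {
    take:
    bam_mapped   // channel: [ meta, bam ]

    main:
    // every process carries an ext.when condition in the config, so for a given
    // run at most one of these three branches actually produces output
    BAM_TO_CRAM_NO_MARKDUPLICATES(bam_mapped)   // used when markduplicates is skipped
    GATK4_MARKDUPLICATES_SPARK(bam_mapped)      // used when running with Spark
    GATK4_MARKDUPLICATES(bam_mapped)            // used for the regular module

    // only one of the three channels is ever populated, so mixing them is safe
    cram_markduplicates = BAM_TO_CRAM_NO_MARKDUPLICATES.out.cram
        .mix(GATK4_MARKDUPLICATES_SPARK.out.cram)
        .mix(GATK4_MARKDUPLICATES.out.cram)

    emit:
    cram = cram_markduplicates
}
```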
So this is all explained there, and we have that for all of the processes in this sub-workflow. I think what could be improved here would be to actually sort all of the withName selectors. Grouping the selectors by sub-workflow was a good first step, but I think sorting the selectors as well would be good; I'm still thinking about whether we should sort them alphabetically or in the order in which they appear in the sub-workflow. Those are different solutions, and I'm not sure yet what to do there. Of course, with the withName selector you can also group several modules together, which can lead to extra issues, because you might not notice that you are defining the same ext value, the prefix or the when condition, twice. So you need to be really careful when you are configuring several modules at once.

I think that was all for my examples. I would just like to thank my institute and all of the institutes I am working with and for, everyone who helps us with Sarek, all of the institutes that are part of nf-core, and all of the people who contribute to nf-core. If you need help, I recommend watching the whole Bytesize series, even if some of the talks are not fully up to date; otherwise you know where to join us, on Slack, on Twitter and so on. And now I'm open for questions; I think I saw some raised hands.

Yeah, thanks. Thank you for the introduction. I didn't follow all the discussions on Slack and GitHub, so can you say why this when syntax in the config was chosen? What I saw now was that in your configs you were mostly referring to global parameters, and then in the sub-workflow you basically had a comment describing the condition. To me it would be much more obvious to put an if statement with those global parameters in the sub-workflow, because then the logic is right there to read; this approach kind of hides the logic away and makes it more difficult, in my opinion. But I haven't read the whole discussion around it.

No, I think I agree with you that it hides the logic away, definitely. But I think it's a good way to go, because for me it will be much easier to control how the software works, and it will be much easier to make sub-modules or sub-workflows that are easily shareable between different pipelines, which is something I would really like to advance at the nf-core level. For now, when we have a sub-workflow that looks good, I'm thinking for example of the nice Trim Galore plus FastQC sub-workflow, it is mainly just copied over from one pipeline to another, and I think it would be very good if we could share them properly instead. I agree that the logic is hidden, but I think if we explain it well with comments it will be fine. And you don't actually have to use this approach: whether the logic is expressed with an if statement or with the when directive in the config is something you can decide for yourself in your own pipeline or your own sub-workflow. For me, adding this when directive to the modules gives us the possibility to do more, and yes, that can be good or bad depending on what we do with it. I hope I replied to your question.
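A small sketch contrasting the two styles discussed in this answer, with an illustrative module name and include path; both variants end up running the index step only for the bwa-mem aligner:

```nextflow
include { BWAMEM1_INDEX as INDEX_IF   } from '../modules/bwamem1_index/main'
include { BWAMEM1_INDEX as INDEX_WHEN } from '../modules/bwamem1_index/main'

// Style 1: the condition lives in the sub-workflow, right next to the call.
workflow PREPARE_GENOME_IF {
    take:
    fasta

    main:
    if (params.aligner == 'bwa-mem') INDEX_IF(fasta)
}

// Style 2: the module is called unconditionally and the config decides, e.g.
//   withName: 'INDEX_WHEN' { ext.when = { params.aligner == 'bwa-mem' } }
workflow PREPARE_GENOME_WHEN {
    take:
    fasta

    main:
    INDEX_WHEN(fasta)   // ext.when in modules.config controls whether it actually runs
}
```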
Yes, thank you. Friederike also asks whether dividing the configs per sub-workflow could reduce their size and improve readability.

Yes, I think having a small config file for each sub-workflow could be easier, and we could even have the config file sitting in the same folder as the sub-workflow. That's something we can decide on or not; as I said earlier, we need to decide what the best practices are, how to enforce them, and how to go on from there.

I don't know if anyone has another question. Apparently no one else does. Okay, then, no problem. I'm pretty sure we will have more questions on Slack as soon as more people have seen this and realise what we can do with it, because this new syntax can definitely be very helpful, or very dangerous, depending on what you want to do, especially with this new usage of the when directive.

Yeah, and maybe if you develop a standard syntax for the common modules, like quality control and trimming, then it can literally be applied to every pipeline without a problem.

Okay, so thank you all for joining, and I will see you next week for another Bytesize.
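A hypothetical illustration of the per-sub-workflow config idea mentioned in the answer above, assuming invented file names and paths; each sub-workflow would ship a small config that the main configuration pulls in with includeConfig:

```nextflow
// excerpt of a main configuration file, pulling in one config per sub-workflow
includeConfig 'subworkflows/local/prepare_genome/prepare_genome.config'
includeConfig 'subworkflows/local/mapping/mapping.config'
includeConfig 'subworkflows/local/markduplicates/markduplicates.config'
```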