And now regarding the progress of the hackathon, this time we will do it a bit differently than the previous days. So instead of going through the hackathon teams, we will have a more in-depth discussion about the hackathon modules team, which was led by Harshil. And there were a lot of developments in this team during this hackathon. So that's why we want to present some slides this time and discuss in more detail what the progress has been in the nf-core modules and DSL2 team. So please, Harshil, whenever you're ready, we can share the slides. Oh, yeah. Sorry, I've just asked Paolo to join. Would someone mind sending him the link to the Zoom session? So I don't think he's on here. That's why I have the slides. Everyone see that? Yes, perfect. Right, so is Paolo on? I don't think so. I don't see him in the participant list. OK, so maybe we can start talking about some other stuff that may not need Paolo involved, I guess. Another option is we could always do the wrap-up from the other teams. Did you say we were going to do other wrap-ups at all, or just quickly go through the other groups? As there were no volunteers for the wrap-up this time, and some groups were mentioning that there were not a lot of developments, I thought we could move directly to the wrap-up for the modules and DSL2 hackathon project. OK, so I think this stuff will require Paolo's input. So maybe if we start with this. So the idea here is basically to discuss amongst us the little niggly things that have come up over the past few days with DSL2, and maybe to try and nail down some ways of dealing with them. And I guess I think we've gone some way now in getting together something that works with the simple FastQC example. But I think we may still need to refine the design before we start mass-producing modules, otherwise we're just going to have to submit loads of pull requests to update everything. And so it might make sense to just think about things in a bit more detail before we do that. And that's exactly what this is about. So this, I believe, came from Gregor. And the question here is, do we have several containers for each sub-command of a particular tool? So BedTools, as you know, has multiple commands. And do we build separate containers for that? Or do we have a clever way somehow of just referencing, say for example, a main BedTools container that we can use? Any thoughts or comments on that? Gregor, are you here as well? Maybe you can just unmute and chat. OK, can you hear me? Yep. OK, yeah, so I have a few thoughts about it. The one thing is that having different versions sounds ugly. And it could possibly happen somehow that, within the same pipeline, we use different versions of the same tool, which we absolutely don't want. On the other hand, it could just make things more complicated, because the continuous integration is coded to check a whole module and build the container for it. So it could just be more streamlined to have the environment file within each module. I can just chime in. I completely agree. I think both would work. But I think, although it's messier, it would actually be simpler to keep each of these as isolated modules with as few dependencies outside of that directory as possible. So my gut feeling is to say just everything has its own Docker image and its own conda environment. OK, so do we name these containers bedtools underscore subcommand, or do we name them just bedtools under the main tool?
Because, I mean, are there likely to be clashes that way in terms of how these containers are built? Yeah. So Gregor and I have been talking a bit about how we pin the code. So when you define process.container, you need to pin the Docker version. And we don't have versions in nf-core modules. We just have commit hashes. And you don't get that commit hash until after you've committed your code. So it's impossible to use the git hash in the code, if that makes sense. Yeah, it's a bit of a catch-22 situation. So Gregor's suggestion, which I think is really good, is we use a hash, a file hash of the Dockerfile and the environment file. And if those two are identical, then you'll end up with an identical container. Which actually maybe is not such a bad thing. But we could also add other things into that hash. For example, we could add the file path into the hash, which would then make it unique. So they shouldn't clash, but we need to be a bit careful about it, I guess, if we use the same name. Because as long as we have that tag there, it might be safer to use underscore subcommand. It would certainly be a bit clearer to use. Because I can also imagine that some subcommands might need other tools in them. So they might not be the same. So for example, if you're using BWA to index, you probably don't need samtools. But if you're doing BWA to align, then you probably do want samtools. So you don't want to call both of those containers the same thing when they're different. Yeah, good point. But that does mean, I guess, on the downside, that we'll be downloading loads of containers, right? Or more than we'd need. Yeah, I think it may make sense possibly to even throw the main.nf of the module file into that hash to, I guess, make things even more unique than just using the Dockerfile and the environment file. Because the code in that will be different across subcommands anyway, right? Yeah, that would work. Yeah. OK, cool. On the other hand, it would mean we rebuild the Docker container even if just a comment was fixed in the Nextflow script. Good point. Yes, and you can use the path if you want. Yeah, to the Dockerfile. OK, we can figure that out. I guess, so the solution here is then just to go for separate Docker images, right? OK, cool. There are a couple of things in the comments. A couple of votes for having a single container because it's tidier, which I agree is tidier, but it's just really easier to spot things. And Steve's saying they got around the hard-coded hash thing by having separate repositories: one repo for containers and one repo for modules. Which, yeah, does allow you to get around the issue. But then you have two repos to maintain separately. Which you can see here. Yeah, I think the idea here is for us just to try and make things as easy as possible, as encapsulated as possible per module, I think. It would be nice to have it in this way if we can. But I also do get the point about having one container. But maybe for starters now, we just go with separate containers. And then if we can find a clever way to maybe maintain one container per tool, then we can implement that at a later date, maybe. So. So just a couple of questions about biocontainers. Tell me if this is annoying, me butting in with questions. No, no. So the question is about why can't we use biocontainers? And was that not better, seeing as we're using conda anyway? It's a really good question.
Originally, we were planning to use biocontainers for everything. We hit some issues though. We found it was difficult to make multi-software containers, for example BWA-MEM and samtools in the same container. So that should be fixable. So we're hopefully going to have a talk from Björn tomorrow about exactly this. And you can kind of imagine quite a few times where we might need more than one tool in there, or other kinds of weird stuff. And the other thing is just not having control of the build, which again just makes it harder to fix things if we need to update the packages which are in the container, or if a certain tool needs something weird in there which biocontainers doesn't have. You don't have control over it. Whereas if we're building all of the containers ourselves, we can guarantee that they're stable and that they work, and we can fix them whenever we want. So I'm a bit split on it. I still think it'd be really nice just to use biocontainers. But we kind of said recently, at least to start off with, we just do our own thing. And if later on we decide biocontainers works and we can switch to that, we're definitely holding that open a bit. Perhaps one more comment on that right here. I guess it's a good point to have the procedure properly established for building a container. And then you can perhaps rely on biocontainers, and just have a pointer to a biocontainer in most cases, but have the automation in place to build if that is not provided. Or even the simple build could be just a FROM biocontainers, like a single-line Dockerfile. Yeah, I think that partly makes sense as well. So at the moment, I guess what we could do for now maybe is, so for example, at the moment we still need to figure out all of this versioning with FastQC and stuff. So I guess in the interim for now, we could point to a biocontainer which is working, which allows us to test things using Docker and Singularity and stuff until we've... so maybe we could take the flipped approach: use biocontainers for now until we've figured out how to do things in a more stable manner in terms of us customizing the Docker builds and stuff. What do you think about that, Phil? Yeah. Yeah, I don't know really. Basically, I can see arguments for everything. All the way through nf-core modules there are so many choices to be made at every point. My gut feeling at the moment is do everything ourselves and keep everything super simple to start off with. And then, I don't know, it's tricky. I don't have a strong feeling either way on the containers really. I guess one of the downsides of doing it that way is that you can have version mismatches between the biocontainer and the conda environment, if we're planning on using both. Unless we lint for this specifically somehow, you could end up pulling samtools version 1.8 as a biocontainer and have 1.9 in the conda environment. And then you've got version mismatches across your software, which is not ideal. You know, things will still work, but it's probably not ideal to have it that way, I guess. Yeah, any more thoughts on that from anyone? Just unmute and chime in. Francesco here. I'm not sure if anybody is ever going to run into a storage or space problem, obviously, because having one container per subcommand will obviously generate a huge number of containers, and then you might run into the problems you have mentioned in controlling that they are the same version. So potentially there might be different versions per subcommand.
So personally, I would prefer the container per tool rather than per subcommand, but I absolutely see the point of versioning control, of what happens in each module that is organized by subcommand. So I don't really have a strong feeling; I think I can see the pros and cons in both approaches. In general, I would prefer to have fewer potentially identical containers, because that's what you might end up with: following the discussion of the other day about always using the latest version, potentially the containers of all subcommands would be exactly identical. Just to clarify, having version mismatches across different subcommands is possible even if you use the same container name, because each subcommand will be accessing specific container hashes. So if you pulled in different versions of those modules, different versions of the subcommands, they'll hard-code different tags, and even if it's the same container name, they'll be pulling in different tags. So sharing an image across different subcommands, for example, doesn't avoid the issue of having different versions of the software. That's up to the pipeline developer to avoid. But yeah, I agree about the file sizes. Maybe we just do what you suggest. We have a Dockerfile and environment file in every single subcommand, but if it's just got BedTools in it, we call it bedtools, and we use the hash of the Dockerfile and the environment file. And then if all of those subcommands have the same conda file and the same Dockerfile, they'll end up with the same name and the same hash. And so they will all share the same container, even though they're all building that container. Does that make sense? Yeah, yeah, I think that's a good solution. Okay, cool. Yeah, that does make sense. Yeah, so in the end, we would only end up with one BedTools container, because each subcommand would be building the same container. And so they won't be rebuilt with those unique hashes, right? Exactly. And if you want to have BWA and samtools, we call it bwa_samtools. Yeah. The only thing is we need to store somewhere what the image will be called. Yeah, exactly. So I'm sure we can get around that. Yeah. So the way we approach it now then, I guess, is if you have a module for bedtools merge, then the name of the container is actually bedtools and not bedtools underscore merge. Likewise, if you have a module for bedtools sort, the name of the container is bedtools and not bedtools underscore sort. And ideally that will work in the context that we're talking about, and we end up with just one container, which is great. Cool. Anyone else, anything else to say? Right.
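As a rough illustration of the file-hash tagging idea discussed above, something along these lines could work; the module path and image name here are purely hypothetical, not an agreed convention.

```bash
MODULE=software/bedtools/merge          # hypothetical module directory
TOOL=bedtools                           # image named after the tool, not the subcommand

# Hash the two build files together: identical files give an identical tag,
# so all subcommands sharing the same Dockerfile/environment.yml share one image,
# and any change to either file produces a new tag.
TAG=$(cat "$MODULE/Dockerfile" "$MODULE/environment.yml" | sha256sum | cut -c1-16)

docker build -t "$TOOL:$TAG" "$MODULE"
```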
Gregor, do you want to jump in here? I think you made these slides, right? Yeah. I was wondering about the metadata documentation which we now have in the form of YAML files. And I find them a bit complicated, and I was wondering if we can simplify them, because now, if we have this input which basically has two input channels, one the tuple foo and bar and the second one with baz, we would have this nested list in YAML which looks ugly and is probably hard to maintain, which has the first channel and the two items in the tuple, and the second channel listed separately. And yeah, if you go to the next slide, I was wondering that one option could be... Argel, can you go to the next slide please? Yeah. That we just make a flat list, because the input name is unique already anyway, so we wouldn't necessarily have to split it up into the different channels. Or maybe we could come up with something more docstring-like, which is on the next slide, where we basically have the documentation and the code together. And yeah, here I mocked up some sort of Scaladoc-like approach, but it could be whatever: we declare the inputs, the type, the name and the short description in a comment close to where the channels are actually defined. I think we talked about this way back after ISMB last year, when we created the nf-core modules repo. So Sven and I were in Switzerland and we got inspired to start dealing with all of this stuff. And I think we had some conversations with Phil at the time as well. So I think issue number one on the modules repo is actually for documentation. And I think we did propose something like this at the time as well, but I can't remember why we rejected it. Phil maybe can remember, but I think it was more the ability to easily parse and use the documentation downstream that was the argument we went for. But I can't remember. Phil, do you remember why we decided that? Yeah, I remember the conversation. I remember exactly this suggestion, because it does feel very elegant to write it inline. And I think it was what you said, that just having a YAML file that you can dump into other tools is a bit easier, rather than having to parse through and pull all these out. But I don't really remember, to be honest. If I can say something here. So actually I'm already using this metadata documentation in some pipelines that I have, because I copied it when I was at the hackathon in London. And I think it's cool to have the YAML because you can parse it easily. And then, for instance, not only for documentation but, I don't know, in the future you might want to automate the connection of the modules. So for instance, you could create a tool to kind of link some modules together and create the pipeline automatically. The YAML can be more useful for that kind of thing. Yeah, I think that's basically why we decided to go with a separate document. At the time, I think I was pushing to have it in the same document like you, Gregor, because if you have it in the same document it just means that you know, just by looking at one file, what's required and what's not, as opposed to having to look at a separate documentation file to figure that out. But in terms of downstream usage, I think it made more sense to go about it in this way. So do we go for a nested input or do we not? I don't completely understand that second slide, about how you can have a flat list that's much better; I don't quite understand how you can, yeah. If you've got like two things with multiple values in each one, how does that work? I mean, the name would still be unique. So we wouldn't basically document the channels, but just the variable names. But I agree it's probably better to have it nested, and the point that José actually just made. Yeah, it's just a hassle to write those nested lists. Okay, cool. So are we settled on this? The consensus is we're sticking with what we're doing currently. I don't disagree with that. Yeah, I think so. Cool.
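To make the trade-off concrete, here is a rough, made-up sketch of the nested YAML form being discussed, with one list entry per input channel and the tuple elements nested inside it; the field names are illustrative rather than any agreed schema.

```yaml
input:
  # channel 1: a tuple of two elements
  - - name: foo
      type: string
      description: Sample identifier
    - name: bar
      type: file
      description: Input reads
  # channel 2: a single element
  - - name: baz
      type: file
      description: Reference file
```

The flat alternative would simply drop the outer level of nesting and list foo, bar and baz at the same level, relying on the names being unique.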
So this may be related to what we were just talking about in terms of versioning the containers. I think Steve put a question up there about whether the containers will be version-tagged with the software in the container. At the moment, we don't have plans to do that. And I guess that starts becoming complicated when you have multi-tool containers as well, because what version tag can you use in that instance? And so I guess biocontainers are great for that, because it's just one tool, one container, and you can version it. But when things start getting a bit more complicated and you need multi-tool containers, then it's difficult to be consistent with that, I would say. And also you would need to track the version of the software in the main script somehow, in order to allocate the container within the main script. You'd need to track it there too, as opposed to abstracting that away, I guess. Because at the moment, the Dockerfile and the environment file deal with the version of the software. So in fact, only the environment file deals with the version of the software; the Dockerfile just builds the environment. But if we wanted to track the version of the software within that environment, then we'd have to somehow include that information in the main script as well, when we're referencing the container to pull. And I think it's just going to become too much work, to be honest. Thoughts? Hello? Yeah, I'm happy with this proposal. Yeah? Okay. I have a side comment about what you said of doing compound modules using more than one tool. Wouldn't it be that, instead of a module like the ones with a single tool, we could create a workflow that is, in fact, composed of two modules? And this, for instance, would solve the problem with the Docker containers, because each of them would actually get the container from the real module. So for instance, if you have, I don't know, samtools and whatever, you will have a workflow that includes the process samtools and the process whatever. And each of them will inherit the Dockerfile from the real module. I don't know if I make myself understood. So I guess where we need multi-tool containers is, for example... I mean, I guess the prime example is where you run, say, BWA mem and, by default, BWA mem just writes out SAM files. Because we have to pipe BWA mem. Exactly. And so in order to generate a BAM file, you need to pipe that through samtools. And if you use a biocontainer, that will only come with BWA. And so I'm pretty sure they probably would have built multi-tool containers to deal with this, but again, it's finding it. And for simplicity's sake, it's probably easier for us to just build our own. But yeah, in that case, you would need both of them physically within the same container too. Yeah, I forgot about this. It's the good old Linux pipe. Okay, cool. So we're happy with container version tags. Any more questions or comments?
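As a concrete illustration of why a single module sometimes needs two tools in one image, a minimal sketch of the BWA mem plus samtools case might look like this; the module name, container tag and index prefix are all hypothetical.

```nextflow
// Minimal sketch only: bwa mem writes SAM to stdout, so to emit a sorted BAM
// the process pipes straight into samtools, and both tools must therefore be
// available in the one container this process runs in.
process BWA_MEM_SORT {
    container 'bwa_samtools:0.7.17' // hypothetical multi-tool image

    input:
    tuple val(sample), path(reads)
    path index                      // directory containing the BWA index files

    output:
    tuple val(sample), path("${sample}.bam")

    script:
    """
    bwa mem -t ${task.cpus} ${index}/genome.fa ${reads} \\
        | samtools sort -@ ${task.cpus} -o ${sample}.bam -
    """
}
```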
Well, if we have more time, just to better understand the limitations here, can you summarize the automated build process compared to, let's say, the Docker Hub automated builds? I'm asking because in the Docker Hub context it's pretty clear to me how you can have... obviously not in the sort of more complex situation like multi-tool containers, but you seem to have some flexibility that you may not have here, and that might have to do with the build process. So if you can just give me some indication of how that is done. Yeah, so what we currently have, we have two... three different systems, actually. So right now, forgetting modules, all the DSL1 pipelines we've had up to this point work with Docker Hub. They have an environment file and a Dockerfile, and the Dockerfile just has conda in it and just runs conda, whatever. And so it builds the conda environment inside of Docker, and the Dockerfile is very, very simple. And that builds on Docker Hub and it's an automated build. So whenever there's a new commit to the pipeline, it builds a new one and tags it, and so on. That's changing in this version of the tools release. So all the existing pipelines will now build on GitHub Actions. That's because Docker Hub can be very, very slow. You do a release and it can take like eight, nine hours for a Docker image to appear, which is kind of a hassle. And if you're doing a pull request or a change, that means you have to add the software in one pull request, get it merged, wait a day, come back and then write the code for it, which is kind of a hassle. So now we're building on GitHub Actions. And it only builds if the Dockerfile or the environment file has changed. And so that means that the automated tests run with the updated software. So you can do it all in one pull request, and the build works in about 10 minutes and then pushes to Docker Hub. So the Docker Hub image is exactly the same, it's just the build that's happening on GitHub Actions, and it's much, much faster. And we can all do it, we can test it all in the same place. With modules, we're basically taking exactly the same approach. So it builds the Docker image on GitHub Actions if the Dockerfile or the environment file has changed. But we've also tweaked it in that we've dropped Docker Hub completely, because GitHub now has a Docker image repository. So if you look at the nf-core modules repo, you'll see that it's got all the different Docker images hosted there; I think it's just one at the moment for the modules. It doesn't really make any difference, it just means the Docker address is different, but it means everything for the modules is all in one repo. So you don't have to go fishing around for it, and nf-core admins don't have to go and make new Docker repositories. Does that explain how it works currently? Yeah, yeah, that's very good.
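As a loose sketch of the kind of trigger just described, not the actual workflow file, a GitHub Actions job that only rebuilds and pushes when a module's build files change might look roughly like this; the paths, image name and registry address are illustrative only.

```yaml
name: build-fastqc-container
on:
  push:
    paths:
      - 'software/fastqc/Dockerfile'
      - 'software/fastqc/environment.yml'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and push image
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          docker build -t docker.pkg.github.com/nf-core/modules/fastqc:dev software/fastqc
          echo "$GITHUB_TOKEN" | docker login docker.pkg.github.com -u "$GITHUB_ACTOR" --password-stdin
          docker push docker.pkg.github.com/nf-core/modules/fastqc:dev
```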
And yeah, as I said, I mean, I've noticed the Docker builds on the GitHub Actions. You also mentioned, or maybe it was Harshil, the possible solution for adding tools, like samtools, to biocontainers. Can you elaborate, or is that...? I haven't tried this myself, and Harshil's left. And I can't remember how this is done. I think it's done by a pull request to the biocontainers repo or something like this. And when it's merged, then it starts to trigger these multi-container images or something like that. I'm not totally sure though, but maybe Harshil can fill us in. And if not, then yeah, hopefully we'll have the time to explain how it works. Did you hear the question, Harshil? Sorry, no, what was it? Briefly: how do biocontainers multi-software images work? To be honest, I haven't figured it out yet. So I needed one a while back for the nanoseq pipeline, because there was a demultiplexing tool that required pigz installed to deal with compression and stuff. And the biocontainer only comes with the tool installed and not with pigz, which it would obviously help to have both of. And so I submitted a pull request to, I think, a repository on GitHub called multi-tool containers or something. But then I never heard back. There was a lot of automation going on there. So it frightened me off a little bit; there were lots of bots doing lots of different things. But I didn't actually get a physical response from anyone. So I'm not entirely sure, to be honest, how to build them. I don't know if anyone else has done that. I guess it will be a good opportunity to ask then on Friday, because we will also have a talk by Björn from biocontainers. So we can definitely ask him then. Yeah, absolutely. I think that when Björn, or biocontainers, posted recently asking for comments on biocontainers and improvements and stuff, I think we linked him to that PR. So yeah, we can definitely ask him on Friday. Cool, thanks. Right, so we're done with container versioning. Genomes and indices. So Maxime and I have been wrestling on GitHub and on Slack and all over the place about this. And it's an issue I created a while back: we have an iGenomes config that we ship with the pipeline template, which is a default config file for reference genomes coming from AWS iGenomes. And a lot, in fact most, of the next-generation sequencing pipelines have used this information to pull genome data automatically from AWS iGenomes. So currently there are different ways in which you can actually use the BWA index... stage the index is probably the right way of saying it. You have to do it in a particular way, otherwise things start breaking on AWS and stuff. I think AWS is able to handle globs. Maybe Maxime can chime in here as well. So the way that I've traditionally done it in the pipelines that I've written is, in order to allow for flexibility, whenever I'm creating an index, I just get the prefix, which in this case would be genome.fa, and I put that in a variable, and then when I'm running BWA mem, for example, which requires the directory for the index and also a prefix, I just input that variable into the command. Maxime does it by doing sort of a glob expansion for all of the files in the index that are actually required to be staged to run BWA. But the main thing for me with that is that users would have to provide exactly that syntax on the command line to provide their own index, and I've sort of tried to, you know, maybe keep things simpler for the user, arguably, and also not have them type all of that on the command line. I think that's probably not an ideal solution. The solution I have is not ideal either, because we have to put things in a variable in the context of the main script, which is obviously now going to be a problem because we're using modules, and so we'd have to pass that same variable to the module file separately. So I'm not saying I've got the ideal solution either, but I think we need to come up with a standardized way of dealing with this, because if different pipelines will be sharing modules, then obviously this will become important in terms of how we structure this information and how we deal with it. So I was wondering if people had any issues, and it's a good time to discuss, maybe. So Harshil, in case you didn't see it, Maxime was mentioning that, yes, AWS can handle globs. Yeah, yeah, I did see that. I started using the genome.fa notation many years ago for the same reason that you said: it's easier to specify on the command line. I don't really have a strong opinion on the other thing. Maxime, go on, make your case. Yes, I'll unmute myself.
So I think, yes, it's nicer if you look at your igenomes.config file, but when you look at your code, the stuff you do to get all the other indices from the genome.fa is completely terrible. Yes, yes, I definitely don't have the ideal solution. I now completely agree with you, but I'm wondering whether there's another solution in which we can avoid having to provide a list of indices. I guess another thing also is that, I mean, it's not a big deal, but Bowtie and BWA use the same notation to provide the genome indices. In the way that I've implemented things, which again isn't ideal, but in the way that I've implemented things, by just using the prefix of the index, you can use the same code for BWA and for Bowtie. Whereas with this, you would have to explicitly provide separate extensions as well for BWA, Bowtie or whatever other genome aligner you have. Sorry, I need to get that. My gut feeling for this one is that only really Harshil and Maxime have strong opinions, and no one else does. I think I'm very strongly keen on my option, because for me, yes, you just decide the extensions that you need for your indices, and that's all; you don't have to do any trick to get the right stuff, because you get the right stuff already from the start. And I don't think it's that much of a hassle just to decide which extensions you need for your indices. I might just use the time for a flippant comment, that it might be time for someone to update BWA to index on the fly. We already do that also within Sarek, but since it takes time to do the BWA indexing, we also provide the possibility to specify the BWA indices. So basically you can do whatever you want. Yeah, that's totally what I meant: the BWA indexing is extremely slow and not parallelized, so yeah. One other thing which is maybe relevant to this is that we're hoping to add support for refgenie in the near future. Refgenie is a management tool for your reference indices. And we're basically... I'm hoping that we can kill AWS iGenomes, is my hope. It's a separate command-line tool, but we're going to write integration into nf-core. And then you'll do refgenie pull BWA and it will basically write an iGenomes config file for you, which Nextflow will be able to consume and use automatically. So my hope is that gone will be the days of having to write minus minus BWA anything. That everyone will just use refgenie to manage all the references, and everything will just be done programmatically via files, and then it doesn't really matter which one we do. In which case the glob is probably better, because it's more explicit, and if I'm not actually typing this on the command line there isn't an actual advantage to doing it Harshil's way. But none of that exists yet. And it would also require everyone basically to use refgenie to write everything in config files. Yes, fine. And also for me, one more point about the glob approach: in Sarek we already use globs for several other options that use at least two VCF files. For example, for the known indels we have two index files, so we use a glob already for that. And I know, for me it's much easier just to decide what I want beforehand and then not need to write anything fancy in the code. Okay, cool. All right, I concede. So what we will need to do in that case, I guess, is... it would be nice if we can update the iGenomes config before we do the tools release, to use this glob for all of the BWA and Bowtie indices that we have in there.
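For concreteness, a rough sketch of how the two styles differ in the iGenomes config might look like this; the genome key and paths are illustrative, not the real file contents.

```groovy
params {
    genomes {
        'GRCh38' {
            // prefix style: point at the FASTA prefix and derive the index
            // file names in the pipeline code
            // bwa = "${params.igenomes_base}/GRCh38/Sequence/BWAIndex/genome.fa"

            // glob style: stage exactly the index files BWA needs
            bwa = "${params.igenomes_base}/GRCh38/Sequence/BWAIndex/genome.fa.{amb,ann,bwt,pac,sa}"
        }
    }
}
```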
And that way, with the next sync, if people decide to use modules then all of that code will still work, I guess, because this will obviously be dependent on the iGenomes config as well once we start porting pipelines to DSL2. So we'd need to change the iGenomes config in the pipeline template to add in these globs where appropriate for BWA and Bowtie too. For STAR I don't think it matters as much, does it? STAR just uses one file or something. Is that right? Yeah, a directory. Yeah, okay, cool. The only thing I would say is, if we're using the refgenie integration as an argument for doing it one way or another, maybe we hold off until the refgenie integration is actually done, because I'm not 100% certain how that's going to work with stuff like BWA indices. Yeah, I think this is more relevant now because we're moving to DSL2, and so whatever approach we decide to use with, say, I don't know, a module that we have for BWA mem, it will be staging the indices in a particular way in one place on nf-core modules, and so all the pipelines will have to provide the input in that format. And so I think we need to get this right sooner rather than later, just to be able to deal with DSL2, to be honest. And so yeah, maybe we need to just update the iGenomes config to deal with that, and then any pipeline that is now being implemented in DSL2 will be pulling the files in the right way and providing inputs in the right way and so on. Yes, we're currently trying to do that within Sarek, so we managed to have BWA mem and index processes that work well without any issue. I haven't tried yet whether it works gathering the BWA indices that I have already done, but I'm pretty sure it will work as well. Okay, cool. So maybe, I mean, you're using different config files though, aren't you, in Sarek? Yes. So maybe we, yeah, we need to roll this out in the iGenomes config for the pipeline template, so when all the other pipelines get synced through the automated syncing then this gets updated there too, and everyone's on the same page: if they want to switch to DSL2, everything's in place for them to do that. Yeah? Good. Cool. Okay, so is Paolo on? So I think Paolo wrote yesterday that he wouldn't be able to join us for this session in the end. Ah, okay. Cool. We'll carry on without him. Alright, so maybe we can discuss it amongst us then. So this, I think, is for me the single most important thing with modules in terms of how we implement it: how do we provide parameters that are specific to the scope of the module itself? And from what I have figured out, there are two ways. So there's either where you can physically provide a parameter which is specific to the scope of that module and can be overwritten by the main script, or you provide it as an input. So you can see how I've commented out two lines. One is providing it as a parameter, which you can then overwrite using something like addParams, yeah, when you're including the module. But the problem with that is, if you start now dealing with workflows, and for example say you need to sort a BAM file in multiple places, then, you know, how would you overwrite these values in that instance? How would you publish these files to different places? Because you'd only be including them in one place for the sub-workflow, and then using sort BAM, say for example, in multiple places in your main script. So how would you customize that when you invoke it multiple times? Maybe part of this is just me not understanding how to do it.
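For reference, here is a hedged sketch of the two routes being weighed up; the module, parameter and alias names are made up for illustration.

```nextflow
// Route 1: a module-scoped param, overridden per include via addParams().
// Re-using the same process twice in one workflow means importing it twice
// under different aliases, each with its own addParams() values.
include { SAMTOOLS_SORT as SORT_RAW    } from './modules/samtools_sort' addParams( suffix: '.raw'    )
include { SAMTOOLS_SORT as SORT_MARKED } from './modules/samtools_sort' addParams( suffix: '.marked' )

// Route 2: pass the customisation in as an ordinary value input instead,
// so each invocation can be configured from the calling workflow.
process SAMTOOLS_SORT {
    input:
    tuple val(sample), path(bam)
    val sort_args                  // extra command-line arguments as a string

    output:
    tuple val(sample), path("*.sorted.bam")

    script:
    """
    samtools sort ${sort_args} -o ${sample}.sorted.bam ${bam}
    """
}
```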
I had a brief chat with Felix yesterday, who said just brute-force the approach and call things, I don't know, include something underscore one and underscore two, or underscore I'm-using-it-for-FastQC and underscore I'm-using-it-now-for-MarkDuplicates, and then have a separate addParams definition for each of those. But that just seems a bit messy to me, and I was wondering if there were other options available to us to deal with this. Well, I thought that you had to do that anyway, because I don't think you can re-use an imported process in different places. I was going to say exactly the same thing; I thought we had to import it multiple times if you want to use it multiple times. But the problem with the addParams stuff is that maybe then you have to put the includes in different parts of the workflow, I mean, because you still have the variables set... or put it at the beginning or something like this, but... Yeah, but you see, this is where things start getting complicated, because when you start dealing with sub-workflows... so you only have to include the same module more than once if you're using it in the same context, or in the same workflow context, right? With sub-workflows, that sort of throws things out the window a little bit, because you can create a sub-workflow to do something very minimal, like sorting a BAM file for example, but then import that sub-workflow into one that is doing the alignment. So BWA mem, once it's done the mapping, you sort the BAM file; that's its own workflow. Then you've got another workflow where you're, say for example, marking duplicates, and then you want to sort the BAM file after that, and so now you've got another sub-workflow to deal with that. And because you're using that sub-workflow in different contexts... it's okay, but, if you see what I mean, when you actually create that sub-workflow you're only importing it once, even though you're using it in different contexts, and then how do you overwrite some specific parameters? So say for example I want to put the MarkDuplicates BAM file in a different folder, or I want to give it a different suffix or something like that. These sorts of customizations we'll need to be able to handle, because you can't put everything in the same folder and you can't name all of the files the same either; we'll need some sort of customization to deal with that. And the alternatives seem a bit messy to me, but maybe it's just that I haven't spent enough time trying to figure it out. I'm not sure if I'm making sense. I'm just wondering if we couldn't just have both, like a params and a value, because for some parameters you might need it on a per-file basis, so for each file you want to specify a different name or whatever parameter you're carrying along with it, and some things you might want to configure globally without having to build a separate channel for it, and then just in the end concatenate both variables. Yeah, I think these things can start getting very large very quickly in terms of input definitions and stuff. Also, we have to remember that optional inputs are a bit tricky at the moment. So, you know, we could have one input just for samtools sort args like I have here, we could have another one for the output directory or publishDir here, we could have another one for a suffix in the script section, which basically allows you to customize how you name your BAM file. But it could get out of control quite quickly, which is why I was wondering whether there's a more clever way, in anyone's experience, a nicer way to do this. And there was this map that Paolo mentioned
yesterday, where you just specify one variable that's actually a dictionary of parameters. I was sort of getting there. So I may not understand the use cases for modules; I haven't tried much recently, so things have changed. But yes, whether you could have a generic metadata map or input map, where perhaps you could then stick to the same terminology used for the directives, so, you know, publishDir and so on, and so you give a single map as an input. Yes, so I did think about that, and initially I thought it would be nice to somehow collate the sample metadata. In that case we could collapse... we'd have a meta sort of map for the sample name, and another entry for whether the sample is single-end or not, and then pass that to this, so we don't have to explicitly have a single_end there. And it also lends itself to being a bit more flexible that way in terms of providing sample meta information. But for things like the argument string, the suffix, where you want to publish... it would be amazing to have a map, but how would you then provide that information? How would the user provide that information on the command line? So say they want to overwrite the information in this map: you'd have to initialize these maps all over the place in the main script to default null values somehow. So is there a clever way to do that across numerous tools, to drive this thing all the way from the command line, if you see what I mean? It can easily get ugly, with a lot of initializations, I guess, a lot of defaults and so on. Although you could have a predefined default map within the module, and combine the maps, or overwrite the map with the content of the incoming one, to only overwrite what's set to a different value. Then... because you'd have to somehow, within the script, assign whatever the user has specified. Some of these things the user won't want to change; for example, the pipeline developer will decide where to publish the output files the majority of the time, the outdir, which is quite a standard parameter, and what they want to publish. So what we've used here is params.publish_results. The idea behind that, which Gregor came up with, was that the pipeline developer, or this module in fact, would have a default set of files that it would publish, but if the user didn't want them to be written then they could set that to none, right?
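As a rough sketch of that predefined-defaults-plus-incoming-map idea, it could look something like the following; all of the key names here (args, suffix, publish_dir, publish_results) are purely illustrative.

```nextflow
process SAMTOOLS_SORT {
    input:
    tuple val(sample), path(bam)
    val options                    // Groovy map supplied by the calling (sub)workflow

    output:
    tuple val(sample), path("*.sorted.bam")

    script:
    // Groovy map addition: keys on the right-hand side win, so anything the
    // caller sets overrides the module's defaults.
    def opts = [ args: '', suffix: '', publish_dir: 'samtools', publish_results: 'default' ] + options
    """
    samtools sort ${opts.args} -o ${sample}${opts.suffix}.sorted.bam ${bam}
    """
}
```

A calling workflow would then pass only the keys it wants to change, for example SAMTOOLS_SORT(ch_bam, [ suffix: '.markdup' ]).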
the same way as you publishDir it now. So some of these we would like the user to have control over, some of them not, and how we pass these user-defined parameters into those maps is, I guess, where things get tricky at scale. For a few modules it's fine, but when you start doing things en masse is where things become tricky. But wouldn't that somehow have to be handled by the pipeline anyway? If you import the same module twice, you couldn't use the same samtools sort args on the command line anymore anyway, because the pipeline wouldn't know which module these extra arguments should go to. Well, the assumption there would be that whatever additional arguments there are would be applied to all samtools sort calls. I mean, trying to assign different arguments to different calls of that within the same pipeline would be a nightmare, I think. But I mean, samtools sort is probably a bad example, but it could be that you have used the same module at different places with different options; then it would be up to the pipeline developer to expose the parameter to the end user so that they can set the parameters for one of the two modules. Yeah, and then I'm actually not even sure if you'd want to have that for any module, that the user can override the parameters from the command line directly without the pipeline developer intending that. Yeah, so that's where things become tricky, in that in order to make things flexible we leave this up to the developers. And I think it definitely makes sense doing it this way, because, you know, even across nf-core we'll have samtools being called in ten different ways, and people have different preferences; the same with BWA mem, for example. So for us, in terms of maintenance and less work, it makes sense doing it this way. But yes, I agree, so maybe, I don't know, if in that instance the pipeline developer wants to give the user different options to use, then they have to somehow provide the parameters for that to override. But I'm still trying to figure out in my head how this whole map approach would work at scale. So I think the map approach should be fairly straightforward for some of the basics, the sort of Nextflow-level stuff here like the publishDir and so on. As for the tool parameters themselves, I think it's best to have the module only deal with what's required, exclusively, as I think was discussed as well, and then only have a single hole poked through to pass on any other arguments, whether via a dedicated params map, meta.params or something like that, as one big string of whatever else is being used in the actual workflow. Yeah, okay. So I guess with this we need to come up with implementations. It didn't seem like... so I asked Paolo yesterday, and it didn't seem like there's an obvious way to deal with this. The only sort of tip that I got from him was that map approach, where you can supply a map as an input to a process and then extract out fields that you would have already initialized in that map within the module context. So I think we need to come up with ways of dealing with all of this, because again, all of this syntax will change if we're pulling in sort of settings from a map. I mean, in the bigger picture, if you combine the idea of the schema that you have introduced and maybe the idea of the maps, you could imagine things as complex as you want that can be grabbed as an overall pipeline definition. I mean, probably we really need to hear from Paolo about what is actually possible, but in terms of stepping stones, maybe at the beginning we could keep things as simple
as possible, because, I mean, obviously here there's a struggle between having a community pipeline where you want to address a high level of customization, because you want to make it possible for everybody to use it in the most flexible way, and then on the other side, the more customization you introduce, the more you make setting up these modules difficult, of course. So maybe in the beginning we could have a minimal version where we try to make these problems as minor as possible, using less customized parameters, and then when we have clarified the mapping thing we could see that combined with the JSON schema for the whole pipeline. So I think we are already doing things quite minimally, to be honest; we're not doing much. So in this instance we're only adding, I think, three or maybe four parameters to the module file, but those I think are quite essential to allow people to reuse the same module across pipelines. Most of the complexity has actually been stripped out: the custom parameter evaluation and all of the other things that happen within individual pipelines, we stripped all of that out by just saying, deal with the inputs and outputs, and give me an argument string for the rest of the parameters. The other things, like publishDir and the suffix and stuff, I think are quite minimal already; we just need to figure out a way of setting them from within the main pipeline context. Yeah, so I probably agree entirely with what you said. I would probably say that the map would be a good place to start, to play around and give it a go. In my pipelines I tend to use maps everywhere, essentially passing them through the pipeline from the start. There are some pitfalls there, so it's easy to forget about making sure not to modify the map and not to break the cache, but yeah, there are big benefits. So I never care about the file names at all, because all the information is carried down through the pipeline by the maps. I agree, I think it's the best way forward to provide information to the module file, because you just have essentially one more input that contains all of the customization for the module that you require. And also, the great thing about that is it's flexible for future additions and maybe even future removals; you don't have to change module files everywhere. So for example, if we decide to add another parameter to this module file because we think we need it, we can just add that to the map and deal with it in whatever way; it's definitely a more flexible approach, I think. I guess, like DSL2 itself and nf-core modules, we just need to sit down and implement it, but I think, yeah, it's been useful having this chat so we can figure out what the best way to do this is. Are there any more comments or thoughts on this? Okay, cool. So if anyone else is willing to play around with it, that would be really cool as well. We've got that FastQC module up there as a simple test example, and attempting to get a map into the process that allows you to overwrite the fields within the module file, it would be really cool if someone could have a go at that as well, and then maybe we can compare notes. But I'll see if I can figure this out as well. The next one really was a question for Paolo. I think there may be some custom publishing code coming in the next release or soon, and I just wanted to know how it would look and whether it's worth us going through all of this effort with publishDir, or just waiting until next
week, until the next release, to just use what's in that. But maybe we can take that up another time. I'm not sure, does anyone else know if there's anything coming that we can use to customize inputs and outputs from module files in terms of publishing? I have a really good question, if that's okay. I just remember that you mentioned that you would tend to have a publishDir, a default one, for the modules. Is that from your experience? Because for me, it seems that the default should be not to have one; I tend to not want to publish everything until fairly late in the pipeline. So just floating the idea. I'm not saying that this way is better, but maybe just tell us a bit, to understand better where it's coming from. So Gregor came up with this syntax when he proposed this; he suggested default, all and none. So the way that I understood that... I don't know what happened to all, maybe we decided to chuck that out, and maybe Gregor can explain that in a minute. But what I understood from that is, for any given module file, you know, whoever submits that module to nf-core modules, there are some obvious things that you would need to output. Now, in this case it's not much of a problem because we're just outputting one BAM file, but you can imagine in some instances there will be tools that generate, you know, I don't know, 10 files, and you may not by default want to output all of those, but there might be some important files that you could output. And the way I understood it is that the module file will deal with that logic for you: the more obvious things that need to be, you know, put in the results directory, it will deal with that for you. But if you decide that you don't want any of it, then you set it to none, and if you decide you want all of it, you set it to all. I don't know if Gregor can expand on that, but that's the way I understood it. Yeah, I think we ditched all as a default option, but the module developer is free to add more than none and default if they think it useful. Like, for instance, if you have an alignment tool, you might only want to output stats for MultiQC, but in some cases you want to get the BAM as well, so that could allow setting multiple flags for a more fine-grained output solution. Yeah, I think we're going to have to throw away some of the custom flags we've been using in nf-core for a while now, like save aligned intermediates and all of that, because for this sort of customization in terms of outputs we'll have to go with a more generic approach. But I think we should have all, default and none though, do you not think? Probably in those cases all equals default anyway. Yeah, but I think default should be a minimal set of files that the person that provides the module to nf-core modules deems to be useful files. I think that will offer a bit more flexibility: in your case, like you say, you know, if you don't want to output the BAM file then you don't have to output the BAM file, you just output the stats file, and so the default would be to output the stats file only and not the BAM, do you see what I mean? Yeah, for the alignment I think it absolutely makes sense to have more flags than default and none, but we need to have all, default and none for every single module. Yeah, unless we just... I mean, we can have just "if params.publish_results is in default or none", for example; if you want to output the same things you can maybe amend the logic that way. Yeah, no, it's completely fine for me. I think Phil said that he
thought that all was maybe too much, and I said, yeah, I want to keep it, as a way to keep things simple. Yeah, I think we will start getting complaints very quickly about file sizes and results directory sizes if we just output everything. Can I make a minor suggestion, that we just tweak it slightly: instead of having all, default and none, we have all, logs and none, because that's much more explicit, and I imagine it comes out to a similar sort of thing. Because that's what save aligned intermediates is really, it's just saying, do you want the logs, or do you want the logs and the files. Yeah, just a suggestion. Yeah, sounds good. Yeah, yeah, that makes sense. I guess where that would break is where a process isn't generating logs but is just generating loads of files, and you want to be selective about what you output there, do you see what I mean? So you're using a tool which generates 15 CSV files. We could have a fourth option, which is glob, and then you give that a value, so that any output files that match a glob are saved. That's a cool option, I like that. So in a way that links back to the idea of using the map for customization, and maybe Paolo can comment on something more internal, on whether you could basically pass the Nextflow syntax, or at least have the map contain fields matching the directives, or the settings for the directives, so as to avoid, you know, introducing new standards here, but rather just following the existing path. Yeah, I like that. I like being able to customize which extensions you output. Yes, but I guess from the developer's perspective it will be fine, because they know Nextflow and globs and code and stuff, but then again, you know, if the user wants to output specific files, this goes back to providing options which overwrite what the developer is outputting, and do we see that as okay or not, for the user? Sorry, just generally I would advocate... I don't know, I'm not sure if we want to give unlimited access to the finest details to the user. I know it pollutes the code in the pipeline, but I'd be fairly happy for the pipeline to choose which of the options to open up to the user, and I think this is quite a good example where it's not that useful to make that available. And I really do like the... I agree, again, I think this could be abstracted by the pipeline developers. So I think the modules should really be designed to work well for the pipeline developers, and then the pipeline developer can decide to expose some options to the end user using some custom flags. Yeah, I agree, but we have to design the modules in a way where all of that is possible, I guess, which is, yeah, what all of this conversation is about. Okay, so what do we go with: default, none, all... sorry, we're chucking default out; what was it, none, all, logs and glob, or none, all and glob? I think none, all and glob is fine. Right. On the other hand, I like the idea of having an option for the module developer to pre-define some files that are useful, but maybe we can keep that for later. Yeah, I mean, I quite like that as well, but yeah, I think the module developer can do that with the glob, right? So you can still keep save aligned intermediates or whatever, and then you just make the glob something like *.bam. Yeah, actually, good point: for the pipeline developer, not for the module developer, right? So the glob would be an option where the pipeline developer that uses the module can just pass a pattern of files he wants to publish. But what if, by default, in the module file there's a glob? Yeah, that could work. Yeah, let's do it that way. Right, so if by default in the module file we provide a glob as what to output, that would be your default; alternatively, you can overwrite that, or you can output everything or nothing. Yeah, that's good.
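A hedged sketch of how that none / all / glob idea could look in a module follows; the parameter name, default pattern and process details are all illustrative, not an agreed design.

```nextflow
params.publish_results = 'default'   // 'none', 'all', or a glob such as '*.bam'

// Resolve the publish pattern once: the module ships its own default glob,
// 'all' widens it to everything, any other value is treated as a caller glob.
def publish_pattern = params.publish_results == 'all' ? '*' :
                      params.publish_results == 'default' ? '*.sorted.bam' :
                      params.publish_results

process SAMTOOLS_SORT {
    publishDir "${params.outdir}/samtools",
        mode: 'copy',
        pattern: publish_pattern,
        enabled: params.publish_results != 'none'

    input:
    tuple val(sample), path(bam)

    output:
    tuple val(sample), path("*.sorted.bam"), emit: bam
    path "*.log", emit: log

    script:
    """
    samtools sort -o ${sample}.sorted.bam ${bam} 2> ${sample}.sort.log
    """
}
```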
Okay, cool. Nice. Anything else? Yeah, labels: how are we going to deal with labels in module files? Actually, Paolo is here, by the way, I think, as far as I've seen, so we could move on to the hardcore Nextflow stuff. Paolo, what are we doing with labels? Yeah, the only way is to use them to categorize resource requirements without having to specify them for each process. Yeah, exactly. So we're basically coming up with ways, obviously, to make nf-core modules as generic and usable and, you know, as useful as possible, and these sorts of things, with DSL1, haven't been an issue, because everything's been in the main script. So, for example, the label in this case is just saying... so nf-core pipelines themselves come with a base config that defines resource usage, like low, medium and high, and so in this case I'm just providing one of those labels to this process and saying, give me medium resources, and, you know, not having to provide it in every module file. So how do you think we should deal with that in nf-core modules, on a more granular, per-module basis? If you were to find a general solution for this thing... because in principle one could say that you could define, I don't know, low memory and, for example, large memory; the problem is that you are deploying to many different systems, so it is difficult to find a concrete value that works everywhere. So I mean, I would try to do this, yes, somehow: to have low and high resource requirement labels, and then use the configuration file to adapt to any different configuration. I mean, so what we could do, rather selfishly, is say that, for anyone using nf-core configs... I mean, if this label doesn't exist, nothing will happen, right? Labels that are not used by the configuration file are just ignored. So maybe the solution is we just add in labels as we would normally with nf-core pipelines into modules, and if it's not coming from an nf-core pipeline it won't be a problem, because it's just not defined in whatever setup you are using. Exactly: if the configuration is not using this label, they are just ignored. Okay, can they be parameterized? Parameterized? I mean, to put a param so that the label matches the one that you have in your configuration, meaning that if in your configuration you put that it's a minimum requirement, you pass it as a param to the module and then you can use the thing that you have in the configuration, you understand? So you have a params label, and then you have it in your main pipeline. That's the other way around: the idea is that you declare something in the task, and then somewhere else you use that information. If you want to parameterize the label, the thing is that you will know which configuration you are using and you will know which labels you have in your configuration, so you can, but you will then have to pass it to all the processes. The user should still be able to overwrite this, I imagine, with their own config: even if we provided labels here, so process_medium for example, which is a standard nf-core label, even if we
So, process_medium, for example, is a standard nf-core label: even if we provided this within the module file, if the user wants to overwrite that, they can provide a withName selector for preseq and the resources in their own config, and that should overwrite it. Is that right? So the idea is: OK, fine, if we provide these labels by default within the module files on nf-core/modules, nf-core pipelines themselves are able to deal with this because we have a config file, right? Yes, you should try to have, how to say, a homogeneous definition within all modules for the resource requirements, and then you can adapt the configuration to those labels. OK, so labels are going to be around for a little while, so we should use them in modules. Yeah. But are you using them now or not? Sorry? Are you using labels now? In DSL1 we use them everywhere. So you can use them the same way; why are you struggling with modules, what is changing? I'm just trying to figure it out. So nf-core/modules, the whole idea behind it is that it's sort of a generic repository for hosting modules, so this is sort of a customization that would just be relevant to nf-core pipelines, and if we're trying to supply these modules to the entire Nextflow community, then I guess it's just a discussion about whether we put these in. I don't know; I think an extra abstraction is not useful in this case. Try to keep it consistent with what you want to do, at least in my view. I think when you try to make things too abstract it's not going to work; it just adds more complexity rather than stuff that is really useful. OK, cool. So I think maybe we'll just keep the labels in there and use what we're doing, because it makes sense to do it, and maybe we need to play around with this a little bit more. Yeah. I don't know, if other people want to use these outside of nf-core they're probably going to have to find a solution. Yeah, but they should be consistent within your project. Yeah, I think that's fine, as long as whoever wants to use it outside nf-core can override the resources. Exactly, I think this is always the best solution: be pragmatic and solve the concrete problem. OK, cool. Since you are already using them, I would continue the same way. OK, cool, thank you.

The when statement: I've heard you say not to use that? I would try to use it as little as possible. OK, cool, so use if statements instead. Yeah, in the beginning it was a good idea, especially in the first generation, but now, moving to modules, it is less useful. OK. And so with if statements, say, for execution of modules: for example, you have a workflow that's got a parameter to skip FastQC, right, as a simple example. Would you have to create empty channels everywhere to deal with the case where a module isn't run and you're emitting the output of that particular module, do you see what I mean, to manage conditional execution? Exactly, conditional execution. It should be possible, yeah, but you'd have to create empty channels, right, to deal with it. OK, cool.
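A minimal sketch of that pattern, assuming a hypothetical params.skip_fastqc flag, an illustrative module path, and a zip output channel; the Channel.empty() fallback keeps downstream code working when the module is skipped:

    // Hypothetical workflow snippet: conditionally run a module with an if
    // statement, falling back to an empty channel so anything consuming
    // ch_fastqc_zip still has a channel to work with when FastQC is skipped.
    params.skip_fastqc = false
    params.reads       = 'data/*_{1,2}.fastq.gz'

    include { FASTQC } from './modules/nf-core/fastqc/main'   // illustrative path

    workflow {
        ch_reads = Channel.fromFilePairs(params.reads)

        ch_fastqc_zip = Channel.empty()
        if (!params.skip_fastqc) {
            FASTQC(ch_reads)
            ch_fastqc_zip = FASTQC.out.zip
        }

        // Downstream steps can then consume ch_fastqc_zip either way, e.g.
        // MULTIQC(ch_fastqc_zip.collect().ifEmpty([]))
    }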
So this, I think we've been talking about; it came up yesterday on Slack. This is another one: optional input. Right now I would need to add something to the syntax to manage this. So this, I think, might not even be related to optional input; it's more to do with... So the way that we're going to go with nf-core modules is that the module itself only really deals with the input and output files, for simplicity, and the developer of the pipeline, and in some instances the user, can override the parameters used to run that particular tool, which just gives us a bit more flexibility. But there will be instances where, for example, that additional parameter string may require additional input and output files. For output files that's not a problem, because you can have optional output, but I guess for input files it is, because you can't have optional input, if that makes sense. So if you provide input files via a string as an input, they won't be staged as physical files. What do you mean, a physical file? Yeah, so if we have a generic parameter string: say, for example, we want to run samtools, right? So with samtools you've got BAM input and, sorry, maybe FastQ input and BAM output. The nf-core module will only deal with the input and output of those files, so the other arguments that samtools can take will come from a parameter string, and in some instances that parameter string might need to deal with additional input and output files defined in that string, do you see what I mean? And so that's where things become tricky, because you're not correctly staging those files as an input and output; you're providing them within the string. Yeah, no, a file has to be declared as an input parameter. So yeah, that's where it will get tricky, and this is why there should be a better way in the syntax. The only trick that we have so far is to use a fake file. Exactly, but it is the only option. It's called NO_FILE or something, right? Yeah, I've used that. OK, it might be fixed in Nextflow at some point, or something like this. OK, all right, so we can find a workaround.

The main reason... So we've discussed this already in quite some detail now, but it'd be good to get your opinion on this, Paolo. So we need to be able to provide arguments to module files, and we're trying to figure out how to do it, essentially. We can either provide them via a params value, like we have at the top, or as an optional input, as we have at the bottom; sorry, not an optional input, an additional input argument. Module params, you mean? Do you mean for all the processes in that module file, or for a specific task like this? So say, for example, I create a sub-workflow to run samtools sort, and as you know, sorting BAM files happens in multiple places in the main script, so you've just created one sub-workflow to sort the BAM file and you can reuse that same workflow, right? But there are certain things: wherever you call that samtools sort command in the different places, you may want to give it a different file name here, or a different output directory there, or a different file suffix in this place. So wherever you're dealing with the execution of that particular module, you may want to customize certain aspects of what's happening within the module file itself. So if it is a parameter that applies to a specific invocation of samtools, that can change each time you call it in different parts of the pipeline, it should be an input parameter of the process; if it is a general setting that is applied to all samtools runs, I would put it as a params, you know what I mean? Yeah, I think so. So the way I see that is that you're treating the module as a function. Exactly, for this reason: now a process is like a super-function around your tools. OK, cool. So what you're doing is, if that is a parameter, it's an input of that function; I would think of it this way. Yeah. And maybe, if you need many of these, not just one or two, you could have an additional parameter that is a map that you can use to provide any optional parameter. Imagine that in your input you declare, like in this case, a tuple with a val for the sample name, the single_end flag and the path of the BAM, and then maybe there is another input, a second one, that is a val options, and that second parameter could be a map that you pass, like key-value pairs, you know what I mean? Yeah, exactly. That would allow you to pass any extra optional parameter, as long as they are not files. Files, yes, exactly. Exactly, I think that is the solution we came up with in the end as well: to somehow implement a map to deal with all of this. Yep, that makes sense, maybe the simplest thing to do. Yeah.
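A rough sketch of that map-as-input idea, assuming an illustrative SAMTOOLS_SORT module with a meta map and hypothetical options keys (args, suffix); none of these names are a fixed convention:

    // Hypothetical module taking an extra 'options' map alongside the files.
    // Everything in the map is a plain value (not a file), as discussed above.
    process SAMTOOLS_SORT {
        input:
        tuple val(meta), path(bam)
        val options                    // e.g. [ args: '-n', suffix: '.name_sorted' ]

        output:
        tuple val(meta), path('*.bam'), emit: bam

        script:
        def prefix = "${meta.id}${options.suffix ?: '.sorted'}"
        """
        samtools sort ${options.args ?: ''} -o ${prefix}.bam $bam
        """
    }

    // Each invocation in the pipeline can then pass a different map, e.g.
    // SAMTOOLS_SORT(ch_bam, [ args: '-n', suffix: '.name_sorted' ])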
Oh, publishDir. Did you say you had some fancy plans for this in the next Nextflow release? PublishDir, let's see. Here the logic is that the plan is to have something much more general, meaning that my idea was to have a mechanism that allows you to capture the output of any process independently of the process, so to decouple the output publishing from the task itself. Since this also requires work in Nextflow, I decided to postpone it and continue to use publishDir for now, and this is why I removed the mechanism that I was introducing, the publish section in the workflow, because I realised it had too many limitations. The idea is to decouple these two concepts even more, so in the future we will be able to say: I want to take all the output of some process and send it somewhere, outside the process. OK, that's quite cool. Or maybe you can say: I want to capture all the output that has this annotation, be it the process name or some other annotation that we will be able to add somewhere in the process, but separate from this part. Since this is very important for the future, it will be implemented later, and for now we have this transition period in which we just continue with the same mechanism as it is now, publishDir. So this is also why I would suggest not making it too abstract; like I was saying for the labels, just use what you have now and try not to complicate things too much, because later we will have a more advanced mechanism to manage the output. OK, cool, that's encouraging.

Right, so we dealt with that, dealt with that, dealt with that, dealt with that. OK, any more questions from anyone else for Paolo while he is here, or any other questions in general, or discussions around DSL2 that we may have forgotten or need to discuss? OK, so like Phil said yesterday, either everyone has gone to sleep or everyone's brain is hurting. So thank you everyone for being on this call, thank you Paolo for joining, and yes, I think we've definitely got a clearer idea as to how to move forward with all of this. And we're all on Slack, so get involved and see you there. Thanks, guys.