So welcome everybody to another week of the bytesize talk series. Today we have Alex Peltzer from the nf-core core team, who is also a clinical bioinformatics lead at Boehringer Ingelheim, and he will talk to us about software packaging for nf-core, a matter that we've all been concerned with when writing nf-core pipelines. So thanks a lot Alex for joining us today, and we look forward to your talk.

Thank you Gisela for the introduction. As Gisela already mentioned, we're talking a bit about software packaging today for nf-core. It's not strictly speaking nf-core-specific in this case, but more or less also involves writing conda recipes, Bioconda recipes, for packaging any bioinformatics or general-purpose tools to be used in nf-core pipelines and nf-core modules, for example.

So just to give you a brief overview: we're going to talk a bit about best practices in software packaging, what we've been trying and using in the past, but more or less focused on what we try to do nowadays. Because this is a bytesize talk and not very long, I have to focus on the most important bits and cannot go into all the details, especially when we come to the next point: Bioconda and conda-forge, and what's actually the difference between the two of them. Then also how to package a tool in Bioconda and conda-forge. This is also just a brief overview of what you have to do, what you have to follow, and what the caveats are that you have to circumvent if possible. We cannot go into full detail there, of course, because there are some peculiarities, and as you all know, things can go wrong very quickly if you do things improperly. Then I'll also talk a bit about BioContainers, and Docker and Singularity at a glance, and summarize and wrap it all up.

So, the best practices. I'm only focusing on the dos here, to limit the time consumption a bit. We usually try to work with other communities here in nf-core, so we rely heavily on upstream projects to package our software and tools for pipelines. That is true for both bioinformatics and general-purpose tools. A bioinformatics tool of choice, for example GATK or samtools, is already packaged in Bioconda. But there are of course also tools that are not strictly bioinformatics-related, let's say just a Python library to color some output or something like that, which could go to conda-forge. And BioContainers is the preferred way in nf-core nowadays to get containers, both Docker and Singularity, to actually use in pipelines. Whenever you ask something around containerization or around packaging things in nf-core, you will always get pointed towards these upstream projects. We really encourage people to contribute to these, not just because it's a good idea in itself, but because you will also receive frequent updates for the packages that you push there.

Well, Bioconda and conda-forge: if people start out new with these, they are usually a bit confused about what's actually the difference. Do I have to push my packages to Bioconda, or do I have to get my packages into conda-forge? Actually, they're very similar, but not strictly speaking the exact same thing. As I already briefly mentioned, Bioconda, as the name suggests, is really strictly focused on bioinformatics tools.
It could also take computational chemistry tools, of course, but it's more biology-related. Conda-forge is more for general-purpose tools. So if you have some fancy Python or R-based package that you would like to get out there, you can push that to conda-forge. There's this easy decision tree, basically: Bioconda for life-science-related tools, conda-forge for general-purpose stuff. That's where you have to get your packages. As we're all working towards making any tool a Bioconda and/or conda-forge package, we always first make the decision whether it's a bioinformatics tool or a general-purpose tool, and push it there accordingly. The packaging relies on similar infrastructure; the setup is a bit different, but the overall workflow is very, very similar. It's not strictly more complicated if you want to get something into conda-forge. They just have a slightly different setup: if you produce a recipe for conda-forge, you have to produce a somewhat different one for Bioconda, but if you know how to write a Bioconda recipe, you can usually learn how to write a conda-forge recipe very quickly as well. It's not too complicated.

To guide you a bit through how that could look, here are a couple of steps that you usually have to follow.

The first step would always be to check if your tool is already available on Bioconda or conda-forge. The slides will be online after this talk as well, so these things will be clickable: there are links for Bioconda and conda-forge, which are basically direct links to the package indices of both repositories, and you can simply search for your tools. For example, if you would like to see whether there's already a recipe that packages samtools, you can just go to Bioconda, because it's obviously a bioinformatics tool, and search for samtools. You will then see this page: you have multiple versions of the tool, and the dependencies of the package. For example, it depends on htslib, but also on libgcc, libzlib, and a couple of other dependencies. It also lists multiple tools that rely on this recipe. Obviously, if you add a new recipe, nobody will be relying on it as of now, but in the future that might actually change. So this is quite a nice way to see whether something is already available for the tool you want to package.
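To sketch that first availability check on the command line as well: besides the web package indices, you can query both channels directly with conda. The package names here are just examples.

```bash
# Search the package indices from the command line
# (samtools and rich are just example package names):
conda search -c bioconda samtools     # bioinformatics tools live on Bioconda
conda search -c conda-forge rich      # general-purpose tools live on conda-forge
```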
The second point that you should usually follow here, if you want to package something for Bioconda or conda-forge, is to check the contributor documentation for adding to Bioconda and conda-forge. Both have very, very detailed documentation available on how to do that in the respective case. Bioconda basically has a checklist that you can tick off, which also gives you some hints on how to do it most efficiently; the same goes for conda-forge. Conda-forge has quite a different approach at the start on how to do this, but nevertheless they also have a step-by-step guide available on their page, and again this is linked here, with the individual links to the respective pages. So you don't have to search for that, you can simply click there, and it will explain how to do this efficiently.

There's also a bonus hint, since most of the tools that we need to package for nf-core are Bioconda tools; it's not as common to do conda-forge packages, because we are kind of a bioinformatics pipeline community, so the majority of things we package go to Bioconda. The bonus hint for Bioconda: please think about joining the Gitter channel and asking your detailed questions there. If you experience issues with packaging things for Bioconda, there's usually a really large crowd around, similar to what we have in nf-core, that can help you with your packaging needs for Bioconda recipes. It's also quite advisable to join the GitHub organization of Bioconda, because that makes your life easier as well: it gives you the permissions to review other recipes, and you can learn more efficiently, for example by looking at other recipes. Although it's all open source, being a member of the Bioconda organization also makes things a bit easier, because you can trigger the bots and notify people from the core team to have a look at your recipe. Similar to nf-core, it's free, you can just join. It might take a couple of days, but nevertheless, you can do that.

The third step would then of course be writing your recipe. Usually what I do there is either rely on the templates (again, there's a link here to some exemplary templates) or just recycle a similar package's recipe. For example, if I want to get a Python package onto Bioconda, I typically look for another Python package that is already on Bioconda and then just figure out what I need to change to make my recipe right. However, I have to say your mileage may vary here, because sometimes these have really different dependencies. Also, if you're a lucky person and somebody already made a PyPI package, for example, you could use the skeleton templates where this is possible. That does not always work, but in some cases, if your package is luckily already available on PyPI, you can just run conda skeleton pypi with the package name. That will automatically create a template for you, which should also pull in and fill out the dependencies of your package, so you don't have to figure them out on your own. Similar things exist for R and some others as well; if you click on the link above, you will find more information on how to do that and how this is done, for example, for Perl tools and others out there.

A cool thing, which James also mentioned before I started giving the talk, is that you can test your recipe locally. This conda-build tool has to be installed manually (if you install conda, it's not always there), but you can use conda to install conda-build, which will set up an environment where you can locally test building a recipe. That gives you a bit of an error-handling opportunity before actually pushing this to Bioconda. If you follow these steps, you should usually get out at least a half-functional recipe. In some cases, if you're lucky, and at least for me that held true in most cases when there was a PyPI package, this already builds pretty well.

So such an example recipe could look like this: usually there is just a build script, which is used in the build step of the recipe, and then you have this meta.yaml file, which describes the content of the recipe.
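To make that concrete, here is a minimal sketch of what such a recipe pair could look like, using a made-up C/C++ tool called mytool; the name, URL, checksum, and dependencies below are placeholders for illustration, not a real package.

```bash
# Minimal sketch of a Bioconda-style recipe for a hypothetical tool "mytool".
mkdir -p recipes/mytool

cat > recipes/mytool/build.sh <<'EOF'
#!/bin/bash
# Build script executed during the build step of the recipe
./configure --prefix="$PREFIX"
make
make install
EOF

cat > recipes/mytool/meta.yaml <<'EOF'
{% set version = "1.0.0" %}

package:
  name: mytool
  version: {{ version }}

source:
  url: https://example.com/mytool/mytool-{{ version }}.tar.gz   # must be a fixed URL
  sha256: 0000000000000000000000000000000000000000000000000000000000000000

build:
  number: 0   # bumped when the same version needs a rebuild

requirements:
  build:
    - {{ compiler('c') }}   # C/C++ tool, so a compiler must be present
    - make
  host:
    - zlib
  run:
    - zlib
EOF

# Test the build locally before opening a pull request:
conda install conda-build
conda build recipes/mytool -c conda-forge -c bioconda
```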
Usually people set the version of the tool package at the top and then just refer to it in the version string. The build number needs to be changed at some point: if you, for example, need a new build of the same version of a recipe, then you have to increase it. You have to list the source URL, and this has to be a fixed URL, so it cannot be a URL that is overwritten all the time. And then the requirements to build, to run, and also to host are listed in the recipe. This is just an example; there are much more complicated ones out there, but also much easier ones. This one is a C/C++ tool, which means the main C compilers have to be present, for example.

If you're done with writing that recipe, then what you do is submit a pull request to Bioconda, and then wait for the automated build checks and linting checks to hopefully tell you that your recipe is in order and everything that needs to be done is done properly. However, I have to mention here again that Bioconda and conda-forge are slightly different, with a bit of a different setup: in Bioconda you have everything in one big master repository, in conda-forge it starts a bit differently. How that difference plays out in the end is covered in the documentation that I linked on one of the first slides, so we cannot really cover that fully here. If you're lucky and everything builds fine, then once somebody from the community reviews and approves your recipe, it will be merged, and your recipe will automatically be available in the Bioconda and conda-forge package indices in a couple of minutes. Sometimes it takes a couple of hours, however; that depends on how fast the synchronization works.

Now, we've been talking about conda recipes and Bioconda recipes, but what about Docker and Singularity containers? Because as you know, most of the nf-core pipelines really strictly use Docker and Singularity containers all the time, and don't necessarily even have support for conda. So what about that? As it turns out, the Bioconda and conda-forge communities reached quite a good agreement with the BioContainers community: all the Bioconda recipes are automatically built as Docker containers and also as Singularity containers. If you look at the Bioconda package index, for example the samtools page that I just showed on one of the previous slides, you can just click on the container button. Although it says "none", it's actually not none, it's actually there, and you will see a listing on quay.io where the samtools Docker images have been uploaded automatically by the continuous integration service. These are automatically available, which means that if you create a new recipe, a Docker container for your recipe will automatically be available in a couple of hours. The same applies to the Singularity containers: these are built by the Galaxy team and shared via a Galaxy depot server, which is also linked here. So you can simply download your package of choice directly from there as a Singularity container. You don't even have to write your own Dockerfile or Singularity definition file.
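As a sketch of how you would pull one of those automatically built images, assuming the samtools container: the exact tag here is illustrative, so check the quay.io biocontainers listing for tags that really exist.

```bash
# Pull and run an automatically built BioContainers image from quay.io
# (the tag is illustrative; look up real tags on
# https://quay.io/repository/biocontainers/samtools?tab=tags):
docker pull quay.io/biocontainers/samtools:1.15--h1170115_0
docker run --rm quay.io/biocontainers/samtools:1.15--h1170115_0 samtools --version

# The matching Singularity images are served from the Galaxy depot:
singularity pull https://depot.galaxyproject.org/singularity/samtools:1.15--h1170115_0
```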
It looks like this: the only thing you basically have to do is run the image directly from quay.io/biocontainers, and there you have your samtools version. You can do the same with Singularity, where you have your Singularity URL with the samtools container. These are different versions here on the slide at the moment, but nevertheless, I think the point is clear.

However, this is always a relationship of one tool per container. If you download the samtools container from BioContainers, you always have just samtools in there, nothing else. If you want to combine, for example, BWA and samtools, to pipe the output from BWA to samtools directly, you have to create a so-called mulled container, which is a multi-tool container. This is a nice way of combining multiple tools together if, for example, in a pipeline you want to pipe outputs from one tool to another in a single process step, which in some cases definitely makes sense: for example, automatically converting SAM output directly to BAM or CRAM output so that the compression goes hand in hand. For that, it usually makes sense to combine, for example, BWA and samtools into one container. This can be done using the multi-package container service, also run by the BioContainers community. There you only have to add your set of tools to a so-called hash file, which is basically just a text file where you add which tools and versions you would like to combine. You open a pull request with that, wait for it to be merged, and after a couple of hours you will have the combination available as a separate container, which you can then use for your purposes.

Well, after all these containers and conda packages, you are probably wondering a bit how to use these containers efficiently in nf-core pipelines now. A lot of people really made a lot of effort to make that much easier, especially with the DSL2 pipelines, where you actually have modules available. In this case, as has been briefly outlined in the past, especially on the Slack channels around building modules, we really rely on BioContainers and the nf-core/tools functionality around it to make this as easy as possible for you. If you, for example, install multiple tools like FastQC, samtools, and MultiQC in your pipeline using nf-core modules install, these will automatically have pre-configured URLs with the latest versions of the respective tools in the module description. So you don't have to worry about looking up these Docker and Singularity containers in such a case. If you, for example, write a new module, you can simply do that with nf-core modules create, which will interactively ask you which tool you would like to write a module for, and then it will automatically look up in the BioContainers API whether there is already a container available and try to get that into your module already. Updates work very similarly: there is also an update function that will automatically update the container URLs if the module code has been updated. And whenever you build a new module, nf-core/tools will always query the BioContainers API for these URLs for you.
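A quick sketch of those nf-core/tools commands, assuming a recent nf-core/tools version; exact subcommands and flags may differ between versions, and the tool names are just examples.

```bash
# Install existing modules with pre-configured container URLs into a pipeline:
nf-core modules install fastqc
nf-core modules install samtools/sort

# Create a new module; this interactively asks which tool you want and
# queries the BioContainers API for an existing container:
nf-core modules create

# Update an installed module, which also refreshes the container URLs:
nf-core modules update fastqc
```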
So, to summarize a bit what we've learned about today, although not in great detail, because time is a bit limited here. What we usually do, and that's kind of the standard approach to packaging software and tools for nf-core pipelines, is that we check Bioconda and conda-forge for whether there is already an existing recipe for the tool. If it doesn't exist, we typically try to add it to either Bioconda or conda-forge, to make sure that it's available to the broader community, and then rely on BioContainers and Galaxy to build the Docker containers and keep the Singularity containers for us to use. What's also a good idea, if you don't want to maintain the container URLs on your own, is to rely heavily on nf-core modules, which have pre-configured URLs already. And what you should always do as well, if you work with modules, is use nf-core/tools, because it automatically fetches and updates the URLs in the modules for you if you need that.

A good thing, and that was also briefly mentioned to me by someone in the Slack channel today: if you have any issues with conda packages, then please try to use mamba as a drop-in replacement. The commands are not really different. The only difference is that you get much better error output, so you will know much better what went wrong, and you will also get much faster dependency resolving, which tells you much faster where your issues are. If you import a Python package that is incompatible with another Python package in your conda environment, you will see that much quicker with mamba than with regular conda.
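A minimal sketch of that drop-in usage, with samtools as an example package:

```bash
# Install mamba once (commonly into the base environment from conda-forge):
conda install -n base -c conda-forge mamba

# Then use it exactly like conda: same subcommands, faster solver,
# clearer error messages when dependencies conflict.
mamba install -c bioconda samtools
mamba env create -f environment.yml
```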
Some last words, maybe: software packaging can really get complicated sometimes. To be very honest, I've spent more hours than I would like to on making Bioconda and conda-forge packages. But nevertheless, this always pays off in the end, because once you're there, once you've done it once, it is usually really easy to update these Bioconda packages. It's nice because there are many other people out there, especially from the other communities like Bioconda and conda-forge, who will automatically pick up packages and update them for you. They even have automated update bots that check GitHub repository URLs from time to time and just send an update for your recipe, which in some cases you can simply review and accept, and then you will have a new version of your tool available. If, on the other hand, you build your own Dockerfiles all the time, you have to do all of the heavy lifting on your own, which is cumbersome and takes a lot of time. So maybe it's a good idea to invest the time to bring everything into the Bioconda/conda-forge approach, and then just rely on that.

You can always ask: as I said, there are multiple communities around who are really happy to help. And then we also have the nf-core community channels, like the #help channel, for example, where you can also ask for guidance and input on your recipes. It's not really a problem; we have a lot of people who have experience with this. So if you're a beginner and want to get somebody to look over it before you actually go to the, let's say, hardcore Bioconda and conda-forge communities with their very experienced users, then you can also ask there if you want to. And always remember, collaboration is a key factor there. If you put everything on Bioconda and conda-forge, it's also good because everybody benefits, not just nf-core users who are maybe using your packages within a pipeline. If somebody, for example, wants to use your tool for some custom analysis, they will also find it on Bioconda and conda-forge and use it, which means that you also get contributors and users for your own tools. That's always great, because you also get feedback and improvements, sometimes feature requests, sometimes even PRs that help fix things. So it has always played out nicely, for me at least. So yeah, these are basically all the help pages that we have, and if you have some questions, you can also just ask them now. Thank you.

Yeah, thank you very much Alex for this insightful talk. There is a comment in the chat already, pointing out maybe one further difference between Bioconda and conda-forge: they mention that conda-forge also targets Windows, Linux, and Mac, whereas Bioconda only targets Linux and Mac, so that could be an additional difference.

Yes, that's true. Yeah.

So my question, actually my problem, with the multi-tool containers: the hash table is very nice for finding what combinations already exist, or for adding a new one, but I always struggle to then find the long container hash that actually provides the tools. Is there an easy way to find this?

There are two ways to do it. The first one would be: if you open your pull request against the multi-package-containers repository, then when someone approves your PR and merges it, an automated continuous integration service will pick it up and build it for you, so you can go into the logs of that CI and find the URL, because at some point that CI also pushes the image to BioContainers. That's how I usually do it, because for me that always felt like the most convenient way. However, and I might not be completely right here because I've never used it before, there is also a service which can look for combinations of packages, which you can use like a search engine, and then just look for the combination that you want to have. If you're lucky, such a container might already exist. For example, BWA and samtools: I would envision this is a standard thing that a lot of people would like to have, so there should be multiple versions with multiple combinations of the two tools existing, and you don't necessarily have to build your own. So those are just the two ways I know.

Thanks a lot. I think I also just know those two ways, and we would be interested to know more if there are more. There's another question by Phil. He asked: could you reiterate when you would change the build number?

Yes, so maybe I go back to the recipe, so that you know what we're talking about here. In some cases a recipe is broken. For example, say one of the dependencies of the Bowtie recipe was broken on Bioconda, but Bowtie didn't release a new version in the meantime, because Bowtie itself was not broken, only the dependency. In such a case, it would make sense not to change anything else here, but just to increase the build number. You then tell the continuous integration service to rebuild the entire recipe, automatically pulling in the latest dependency, which is hopefully fixed by then, and rebuild the entire thing in a way that it's not broken, without actually changing the version of the actual recipe, because that obviously hasn't changed.
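As a minimal sketch of what changes in such a rebuild, on a hypothetical recipe with illustrative version and build numbers:

```bash
# Hypothetical recipe patch: only the build number in meta.yaml changes,
# the version stays the same (all values illustrative):
#
#   package:
#     name: bowtie
#     version: "1.3.1"   # unchanged, upstream made no new release
#
#   build:
#     number: 2          # was 1; bumped so CI rebuilds against the fixed dependency
```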
And then you get, for example, samtools 1.15 with build number 2 available as a conda package, and the containers will also carry that "-2" build number in their tag, and will hopefully be fixed. So usually this is just used for patching dependencies and similar things.

Yeah, thanks a lot. I actually also have a question, now that we are here. I think when there's a new version of a packaged tool, there are even automated PRs that will update the recipe to the new version, right? Can you tell us a bit more about this?

So the Bioconda community has an automated bot that queries all the URLs that are mentioned in the source section of the meta.yaml files and automatically tries to update them, by taking the existing recipe, adjusting the version and checksum, and also resetting the build number to zero again. I think it just does these three things. It runs, I think, all day or overnight or something like that, and then automatically opens pull requests against the Bioconda repository. Then people can just go there; usually the maintainers who made the recipe available in the first place are tagged on this PR, and people can simply review it: okay, this looks good. The CI also runs through in most cases, because the dependencies don't usually change that often, and then the update goes through quite quickly, so that people don't have to do it manually on their own. Yep. So if Phil, for example, updates MultiQC, usually the system picks that up within a couple of hours, and then you get the PR, if Phil was not faster than the system in opening it himself.

I think that really facilitates work on Bioconda then. So, and we have one final question, I would say, for today. Regarding the pytest runner: how do we know which version of the pytest runner is required, if you know about it? It seems like a very specific question, though.

It's a very specific question, which I cannot answer at the moment, to be very honest, because I'm not experienced too much in the details of the Bioconda and conda-forge continuous integration services. They have their own kind of customization in place there, so I'm not really familiar with how they test Python packages inside of the container and the package-building process. I'd have to look that up, actually, if that is something of concern.

That could be something to ask on the Bioconda Slack.

Yes, that could be something you could ask there.

Okay, so thank you very much everyone, and thank you especially you, Alex, for this interesting talk. Definitely, I'm sure it will have lots of views.