Okay, so let's start. Welcome everybody to this nf-core tutorial workshop, let's call it that. It's a great pleasure to have Phil Ewels here. I probably did not pronounce that name perfectly, but I will try hard. He will give this talk and introduce us to the nf-core community, and also to different implementation ideas for making Nextflow pipelines more robust. This is very interesting for us, because we now also have a few Nextflow workflows implemented, but they are most likely not very robust, and these ideas, and implementing things the nf-core way, might bring our work a big step forward. So yeah, I think that's it. I will give you the stage, and thanks again for being available and for giving us this nice talk.

Thank you very much for the invitation. It's actually really good timing to be invited and get the opportunity to talk to you, because we've been thinking more about how to branch out beyond just the genomics community with nf-core and try to get more workflows involved. We're starting to see more involvement from other parts of the life sciences especially, so I think the timing is great, and I think we hopefully have a lot in common, certainly in the things we're aiming for, so I hope we can work together. I've put together a talk which is a bit of an introduction to nf-core. I haven't focused on Nextflow very much, because it's my understanding that most people watching will be fairly familiar with Nextflow already: you already have some workflows, you know what Nextflow is and how it works. So I'll talk a little bit about what nf-core is and how we work, and give a few examples of the nice extra things that you get if you develop as part of the nf-core community. And then hopefully we can have a bit of a discussion at the end, and if you have any questions, feel free to just shout. Right, let me kick off.
So the tagline that I use for nf-core pretty much everywhere is that it's a community effort to collect a curated set of analysis pipelines built using Nextflow. And the key point, which we really try to stress, is that it's not just a website that lists Nextflow workflows: we are a community that comes together to build a single, gold-standard set of pipelines, if you like. We're focused purely on Nextflow; we talk a lot with other communities, but everything within nf-core is just Nextflow. Nextflow is a workflow manager, and it's fantastic, of course, as you all know. It works across pretty much every computing environment you can think of: we use it on our HPC system with Slurm, you can use it with Grid Engine or PBS, you can use it on more futuristic things like Kubernetes, and it works great on clouds: AWS, Google, and I think Azure is in the works. So pretty much wherever you want to run your workflow, you can do that. It also has integration for Conda, for Docker and for Singularity, which is fantastic, because it means that as long as you have Nextflow and one of those tools installed, you can run basically any Nextflow pipeline without having to think about the software it entails. It's also massively reproducible. If you use the Docker containers, and we recommend Docker or Singularity if at all possible, then you know that if you come back and rerun the pipeline you ran three years ago, as long as you use the same release with the same Docker container, you're going to be using exactly the same software, almost down to the byte. So it's extremely reproducible, which is a fantastic step forward, I think. So that's Nextflow. What does nf-core bring to the table? What's different about the nf-core community?
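That byte-for-byte reproducibility comes down to pinning a pipeline release and letting Nextflow pull the matching container. A minimal sketch, where the pipeline name and version number are just examples:

```shell
# Run a specific tagged release of an nf-core pipeline with Docker.
# Pinning -r means the workflow code, and the container versions it
# declares, are identical every time this command is run.
nextflow run nf-core/rnaseq -r 1.4.2 -profile docker --help

# Singularity works the same way, e.g. on shared HPC systems:
nextflow run nf-core/rnaseq -r 1.4.2 -profile singularity --help
```

Requires a local Nextflow install; the `-profile` names come from the configuration profiles that nf-core pipelines ship with.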
nf-core was actually started at one of the Nextflow user group meetings: a few of us came together, took the pipelines we already had, and started to put together a set of what we thought were best-practice guidelines. So these are now a set of guidelines, or rules, such that if you're part of the nf-core community, we say your pipeline must do these things. That ranges from simple stuff, it must be open source, it must have an MIT licence, through to slightly less obvious stuff. For example, we try to have only one pipeline per data type. So there's one RNA-seq pipeline, and it can do lots of different things and has lots of options, but there's just one, and this is deliberate: it means that if you're a user coming in and you want to do this kind of analysis, you don't have to think about which pipeline is best, you just find the RNA-seq pipeline. So we have a set of guidelines and best practices which we adhere to. We've also developed a set of helper tools. This mostly comes down to a single package called nf-core, which is written in Python and is available on PyPI and Conda. It has some functionality for end users: for example, you can list the nf-core pipelines and download them automatically so you can transfer them to an offline system, things like that. But it also has a lot of tools for developers who are writing Nextflow pipelines. You can use these tools even if you're not developing within nf-core, but they're primarily targeted at nf-core developers. And then finally, the most obvious thing is that we have a resource of ready-to-go pipelines, these best-practice, gold-standard pipelines. The hope is that as many people as possible come together and work on a single set of these pipelines, rather than each group developing their own flavour of what is essentially the same tool.
If you go to the nf-core website and click on Pipelines, you'll see a big long list of all the different pipelines we have. This is one slide that I have to update every single time I give a talk, because the numbers change frequently, but I think as of a couple of days ago we had 24 pipelines with at least one stable release and 16 pipelines under development, meaning they have no stable releases yet. These vary from pipelines with very little in them, still pretty much just the original template, through to some which are basically fully fledged pipelines already being used in production and just getting towards their first release. And then, we never delete a workflow, because that would be bad for reproducibility: when a pipeline is no longer actively developed, we archive it, and we have four archived. We try to talk to the communities around us as much as possible, and we really try to adhere to those FAIR buzzwords: findable, accessible, and so on. So we have ongoing initiatives with a few different groups. Probably the most developed is with Dockstore, where we're working together to build up full automation so that all nf-core releases are automatically picked up by Dockstore and listed there. Something similar with WorkflowHub has started more recently, and we're hoping soon to also have similar automation with ELIXIR bio.tools. So all of these nf-core pipelines, if not already, will soon be listed automatically on all these different sites, to make them as findable as possible and to really tie into other systems. The nf-core community has grown massively, even though it hasn't been around for that long: I think the idea came around at the end of 2017, and we actually started at the beginning of 2018.
So we're not very old, and the speed of growth within the community has really caught me by surprise; it's been fantastic to watch. These graphs are a bit outdated, from the publication, but you can find live versions which are up to date as of today: go to the URL at the top and you can see this pretty steady growth of people contributing within Slack and within the GitHub organisation, and these lovely graphs showing how fast we are to respond to issues and pull requests and so on. Basically, we're still very much in the initial growth stage of the community, which is brilliant. I think these numbers are more up to date, and you can see that we now have over 800, nearly 900 people in our Slack workspace. Many of these people come to try to run a pipeline and ask for help, and once they've had that help they might not use Slack so much, but maybe 150 people are using it on a regular basis. In terms of contributors, we have about 500 or so people who have at least opened an issue or added some code to at least one pipeline, and just under 200 people have actively signed up to the GitHub organisation. In those two years, with this many contributors, you quickly rack up a lot of commits and pull requests; I find these numbers kind of staggering every time I look. So we're a fast-growing community and we're very active, pretty much all over the world now. We'd still obviously like better representation in some of those empty parts of the map, but this is the web traffic on the nf-core website, and you can see we've got a pretty active presence in most places where there's a lot of research going on. Okay, so that was an introduction to nf-core, and hopefully the difference between nf-core and Nextflow is clear.
We are side by side with some overlap, but we are distinct entities. So how does it work if you want to come in? You've got an idea for a pipeline and you want to work with nf-core: how does that happen? Very roughly, there are three steps to this, and there's a lot more detailed description on the nf-core website if you want to dig into the details. The first step, which is really important, is that you should come and join the nf-core community as early as possible. You can go to our website and find all the different channels. The most important one is probably Slack, where we have all of our messaging and communication about this stuff. Specifically on this point, when you join Slack you'll find there's a channel called new-pipelines. If you have an idea for a pipeline, you need to come in and tell us as soon as possible. This is important because, like I say, we try to make sure there's only one pipeline per data type. So if someone else is already developing a similar pipeline, we want you to collaborate and work together, rather than turning up with a finished pipeline and having to resolve the clash then. So, first step: come and join the community, tell us what your idea is, and you never know, you might recruit some help straight away. Once you're ready to go, we have a template which kicks you off with a blank Nextflow pipeline, but with lots and lots of boilerplate code built in already, to start you off with best practice. When we started nf-core, we initially said this was just a useful thing for you, but more and more we're now saying it is pretty much a requirement if you want to have a pipeline within nf-core. The reason is that we add lots of functionality to this template and to our pipelines, and there's an automatic synchronisation mechanism that keeps your pipeline up to date as we add things to the main template.
If you're not using that template at all, you can fall behind, which is a bit tricky. So anyway, this is a fantastic tool, whether you're part of nf-core or not, to have a template that starts you off with best practice. You do that by running the command `nf-core create`; it prompts you for a name, a description and an author, and then generates the template for you. The helper tools also have a command called `nf-core lint`, which does code linting and basically checks for lots of the things that we require an nf-core pipeline to do. This is where we get quite strict with the best practices. The idea is that you run it as often as possible: when you generate a new pipeline you shouldn't have any test failures straight off the template, and then as you add code you keep running it and keep checking to make sure you haven't accidentally deleted something or whatever. The other thing is that as we change the nf-core guidelines, these tests will change too, so you might find that you have new failures which you didn't have before, and that's because we want you to change something. What you can see here is a bunch of warnings, saying certain software packages are not the latest available, so you probably want to update them. But you can also see lots of warnings about TODO strings. What this comes down to is that the template has lots of comments in the code which just say TODO: you should edit the code in some way here, add something, change this, this is just example code. The nf-core lint tests pick these up so that they're obvious to you. So when you get started you can just work through and find them all; it basically holds your hand and gets you up to a fully fledged pipeline. These tests also run automatically on GitHub, so we have GitHub Actions set up.
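The create-and-lint loop described above looks roughly like this in a terminal (the pipeline name here is a made-up example):

```shell
# Generate a new pipeline from the nf-core template; this prompts
# interactively for a name, description and author, and writes out
# all the boilerplate.
nf-core create

# Run the linting checks against the freshly generated pipeline.
# Straight off the template this should pass, with warnings only
# for the TODO comments left for you to work through.
cd nf-core-mypipeline
nf-core lint .
```

Requires the nf-core Python package (installable from PyPI or Conda); rerun the lint step after each change.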
They should be on your forks and also on the main nf-core repository, so every time you push a commit or open a pull request, all of these tests run and will tell you clearly if something's wrong. One of the things we've added more recently is that if this is on a branch, it will even add a comment to your pull request if something's wrong, and each item is a hyperlink to documentation where you can find more information about that specific test. Right, so off you go: you write your pipeline, you test it, everything works, it's brilliant, you're ready for your first release. The final step is to come back to the nf-core community for review and release. This happens via GitHub; we do as much as possible through GitHub. We have what we call a community review, where we try to get other people, not necessarily involved in the initial development, to read through your code, make suggestions and ask questions. Once that's sorted, you make a release. That kicks off a whole load of automated testing and automation, for example with the website, which starts generating identifiers and updating things all over the place. When you do a release, we also generate a stable DOI through Zenodo, both for the whole workflow and for every single release, which is great for reproducibility and for citations: you can refer to the specific pipeline version that you used for an analysis. And something new that we're bringing in now is benchmarking, using a grant from AWS. Now, every time you make a release of your pipeline, it will kick off a workflow run on AWS, and if you've set up the configuration for it, which is one of those TODOs, it will run with whatever full-scale dataset you specify. On GitHub, for every commit, we do a very minimal test with the tiniest test data you can find, which will just run in a minute or two.
But this benchmark should use a full-scale dataset, and the hope is that that stress-tests the pipeline a little. We're also going to make those results available on the website, so that people interested in the pipeline can browse through its outputs without having to run it themselves. And because we have this for every release of your pipeline, we can compare what's changed and benchmark any changes, which, especially for larger facilities that might have to do accreditation and things like that, is going to be really, really helpful. Right. So, hands up: I don't know anything about proteomics. I work in genomics and always have, and you can see when you look down the list of nf-core pipelines that most of them are genomics; that's purely a function of the networking of the people who started it in the early days. But there are some proteomics pipelines, that part of the community is growing, and this is where we'd really like to get more people involved. So I thought I'd put in a couple of slides about which pipelines we already have for proteomics. The first one, which I think is the most heavily used, is the proteomicslfq pipeline. Hopefully you'll understand more of what these descriptions mean than I do, but you can dig out more details if you follow that link. All of these pipelines are still at various stages of development; this one is pretty much in production use now. You can see in that little stats graph that it's being downloaded by a lot of different people, and it's got quite a lot of commits and forks by different people. So it's being actively worked on, and some of the main developers might be in this meeting if you have any questions. (Yeah, I'm here.) And it's, I guess, working towards its first release.
Next up we have a pipeline called diaproteomics, by Leon Bichmann, who's another major contributor on the proteomics side; it deals with data-independent-acquisition mass-spec measurements. This one is similar: still in a development state, but pretty well used and pretty stable at this point, I think. One of the older ones we have is called MHCquant, which is specifically for identifying peptides from mass-spec data. This one, and don't quote me on this, I think is slated for its first release this week. It's actually been around for a long time, with different bursts of development work, but I think it should now be heading towards its first stable release, which is really exciting. I saw some chat about setting up the AWS benchmark datasets for this pipeline. And finally, we have a pipeline called ddamsproteomics, written by Jorrit, who's also at SciLifeLab in Sweden, same as me. This one, I think, has fallen a bit behind his fork recently, so it could probably benefit from a bit more community input on the reviewing side. I think that was the problem: it was contributed quite early, when we had basically no one else doing proteomics within nf-core, and we struggled to find anyone to really help review the code. So it would be great to get Jorrit involved a bit more again. And I think the Lehtiö lab in Stockholm has quite a few Nextflow pipelines which all use the nf-core template, so it would be great to get them involved a bit more too. All right, yeah: live demo, let's see how this goes. One of the new features, added just over the summer, which I thought was worth pointing out because I think it's really cool, is the pipeline schema. This is a new way to describe the different inputs a pipeline takes, in a structured manner. So I'm going to quickly demo one of the things that we can do with this.
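For context, the schema is a JSON Schema document shipped with the pipeline. A heavily trimmed sketch of what such a file might contain, with made-up parameter names and help text:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "title": "nf-core pipeline parameters",
    "type": "object",
    "properties": {
        "input": {
            "type": "string",
            "description": "Path to the input design file.",
            "help_text": "Shown as expandable help on the website and CLI."
        },
        "database": {
            "type": "string",
            "description": "Protein database to search against."
        }
    },
    "required": ["input", "database"]
}
```

Because it is standard JSON Schema, the same document can drive the rendered docs, the web launch form, and command-line validation.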
If I go to the nf-core website and search for proteomicslfq, you can see the results of this schema popping up in a few places. Firstly, under Usage we get the usage documentation, and as I scroll down through all the parameters, you can see they are really nicely formatted, with little icons and different help texts. All of this comes from a structured JSON document which we can reuse all over the place, so it's very easy to maintain, very easy to write, and it can be used in lots of different places. It's used in the help documentation here, but we've also got this button up here that says Launch. If I click that, it takes me to a structured web form with all the different parameters for the pipeline which we know about. They're all described, and you can click on the help text. So this is a big focus on user accessibility and user friendliness. If I click Launch, you can see that it automatically validates the input: it tells me there are a couple of problems here, that the input is required to run the pipeline, and that the database is also required. Once everything is green and happy, I can click Launch, and it saves that set of parameters for me. I can either copy them to a file, just a JSON file, and manually run Nextflow with no other tools, exactly as you would normally run it; or I can copy this command and type it into my terminal on the left. So if I paste that in, this is the nf-core helper tool in my terminal, and you can see it has talked to the website and found that ID. It's pulled all the parameters that I just put into the website, and now it can run the pipeline for me directly: if I hit yes, it will just kick off and run Nextflow for me. So this is super user-friendly, and it's especially important for the bigger pipelines, which are starting to get a lot of different parameters. You can walk through all of this.
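The "copy this to a file" route can be sketched like this; the parameter names and file paths are made up for illustration:

```shell
# Save the parameters from the web form as a plain JSON file.
cat > params.json << 'EOF'
{
    "input": "samples.tsv",
    "database": "uniprot_sprot.fasta"
}
EOF

# Sanity-check that it is well-formed JSON before launching:
python -c 'import json; print(sorted(json.load(open("params.json"))))'

# Then hand it to Nextflow exactly as you would normally run it
# (requires a Nextflow install, so commented out here):
# nextflow run nf-core/proteomicslfq -params-file params.json -profile docker
```

The `-params-file` option is standard Nextflow, so no nf-core tooling is needed at run time.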
You don't have to go away and dig through the code or the usage documentation; you can just fill in this form and click your way through it, assured by the validation that you're doing things correctly as you go. One very recent addition is this new button here as well. The people who write Nextflow have a startup company now, called Seqera Labs, and they've developed a product called Nextflow Tower. Tower is a totally separate system from Nextflow itself, but it can monitor running Nextflow jobs and it can also launch Nextflow jobs on the cloud, so AWS or Google; and pretty soon, I think at least in the enterprise version, it will be able to launch workflows on a local HPC system as well. So we've put together an integration with them: if I click this button, it kicks off Nextflow Tower in a new tab. Again, it has pulled all of these parameters for my workflow; I can select the AWS compute environment that I've already set up, put in a test profile, and it will kick off this workflow running on AWS in real time. Which is super cool: you can see it spinning up AWS Batch instances in the background, and you can follow the execution log in real time. So in terms of accessibility and user friendliness, this is really, really good. Finally, of course, not everyone has access to the internet, or maybe they don't want to type their details into a web form. So you can also do basically all of this on the command line with `nf-core launch`: it uses the same schema and takes you through the different parameters on the command line. It says, okay, we need to specify the input, it has the documentation in place, I can type the value in, and it does the same thing. The last thing on this is that, of course, JSON documents are a pain to maintain, and you don't really want to be writing JSON by hand.
So the first thing we actually built was a builder. If I run `nf-core schema build`, it finds the schema in the pipeline and I can launch a web builder, where I can just drag and drop these options to reorganise the JSON file, change the icons, and write help text in Markdown and everything. When I'm done, it saves it all as a JSON document back to my pipeline. So none of this is Nextflow specifically: these are all helper tools which we've written to help developers work with Nextflow pipelines, and to help end users launch and use them, in as standardised and structured a way as possible, using community best practices where we can. This uses JSON Schema, which is a standard, and we're hoping to tie it in with the efforts on Research Object Crates (RO-Crate) too. I think it's really cool. Right, I should say: when I first ran through that live demo, I discovered a bug, so if you try to actually launch a pipeline with that tool right now, it might break, but it will probably be fixed by lunchtime. Finally, I just wanted to say a quick note on DSL2. Nextflow is itself a domain-specific language, a DSL, and if you're involved with Nextflow at all, you might have heard about DSL2, or you might even be writing pipelines in DSL2 already. It's basically a new syntax for writing Nextflow pipelines. We have been fairly heavily involved with the development and testing of DSL2. We were slightly hesitant to jump on too early, before it had a stable release, because the syntax was changing a lot and we wanted our pipelines to be reproducible and not tied to specific development versions of Nextflow. But that has now happened: just over the summer, DSL2 hit a stable Nextflow release, and it's ready to go. Off the back of this, we're developing a new initiative which we've called nf-core modules.
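To illustrate what DSL2 changes, a process can now live in its own file and be included wherever it's needed. This is a minimal hand-written sketch, not the actual nf-core module code:

```nextflow
// --- modules/fastqc.nf : one tool packaged as a reusable module ---
process FASTQC {
    input:
    path reads

    output:
    path "*_fastqc.zip"

    script:
    """
    fastqc $reads
    """
}

// --- main.nf : include the module and wire it into a workflow ---
include { FASTQC } from './modules/fastqc'

workflow {
    FASTQC( Channel.fromPath(params.input) )
}
```

Because each module is a self-contained file like this, a shared repository of them can be pulled into any pipeline.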
nf-core/modules is a GitHub repository where we're starting to collect little chunks of pipeline, what we've called modules: in this case FastQC, which corresponds to the specific tool FastQC. The idea is that when you're writing your pipeline, you now do a mixture of writing the pipeline itself and writing modules for it within nf-core/modules, and modules can be either shared or local. If a step is never going to be shared, you just keep it as part of your pipeline; but if it could be used by other people in other pipelines, you put it in this central repository. Then this helper command, part of the nf-core suite, helps you as a developer use these modules within your pipeline: you just run `nf-core modules install` and it will pull those files and handle the version checking and so on. This is still very, very new. Our first few established pipelines are being converted to DSL2, and we're still writing the new guidelines, because it's going to be a huge change for us. The template as it stands is still DSL1, but hopefully towards the end of the year, or maybe the start of next year, we'll see how quick we are, we're going to push out version 2.0 of the nf-core pipeline template, and that will be DSL2. Then we're going to try to bring all of the nf-core pipelines into DSL2. It's extremely powerful. Right. So, finding out more information about nf-core: we managed to get a publication out earlier this year, so you can read about the nf-core community in Nature Biotechnology. The paper itself is very, very short; the supplementary is basically the paper we originally wrote, which is quite a lot longer and actually goes into much more detail about how we manage and automate things, if you're interested.
But the main thing is: if you want to get involved, please do go to our website, click the little green button in the top right that says Join, and find all the details about how to get involved. The main one, like I say, is Slack; that's where we discuss everything, where you'll meet people and can ask for help. All the work is done through GitHub, and then we do outreach through Twitter and, more recently, through YouTube. Right, I never actually introduced myself. This is not very important, I'm not here to talk about me but about nf-core, but my name is Phil Ewels and I work in Sweden at SciLifeLab, at the National Genomics Infrastructure there. You can find more information at those links if you're curious. And with that, I'm happy to take any questions, but also happy to kick off a discussion beyond just questions, if possible.

Thanks a lot, very interesting stuff. There's a lot of development going on; it's not that long since I last looked at the website, and there's been a lot of change again, so it's very exciting to see. So, are there any questions? Salvador?

Yes. Thanks, Phil, for a really nice presentation. I'm curious about the benchmarking part, the AWS one, because you're saying: I have a pipeline, I can run a full dataset on it. Are you just capturing technical aspects of the run when you run the pipeline, or are you also taking care of the scientific part?

Yeah, that's a great question, and it's a fairly open question at the moment, because this is still quite new. We got these credits from AWS to actually run the compute this year, and it's taken a long time to set up the infrastructure, because we had to set up the AWS workflows, the GitHub automation and all that kind of stuff. But that's now in place, and we're really starting to push pipeline authors to actually add the full-scale datasets, and then we'll see what people use it for.
My hope is exactly what you say: that we'll start to be able to add in actual biological interpretation. RNA-seq does differential expression between groups, so between replicates, do we see the same differential genes coming up? I'm sure there will be equivalents for proteomics. The simple stuff is obviously to check the basic file metadata: does this file exist, and how big is it? To be honest, for that kind of thing we're pushing harder on the modules, because we want unit testing at a sub-workflow level as well as at a whole-workflow level, and that's something we're scaling up in a big way there: if you run this tool, does it generate this file? Does it contain this string? Is it above this file size? That will really help that side of reproducibility. And then the pipeline tests will be more, like you say, probably manual inspection and benchmarking. But if you have any ideas for how to do this and how to write it up, this is ripe for the picking; we're looking for input now.

Actually, I'm quite interested in that, because we are running OpenEBench. OpenEBench is an ELIXIR benchmarking platform where we support communities doing scientific benchmarking, and we have three levels. At level three, we get the workflow, we run the specific dataset, and then we compare the technical and scientific performance. So basically you are doing some of what we are aiming to do; we have a small prototype. It would be great if we can keep talking and see ways to collaborate and work together on this.

No, absolutely, that sounds great.

Okay, perfect, thank you.

A little comment from my side also. Since last week, I think, the buckets with the results were made public, right, Phil? This was one of the first steps that was needed, because we are running a ground-truth dataset where there are fixed concentrations of proteins.
So basically what we are doing is producing a PDF to see which proteins are deregulated, and we use this as a kind of check, currently manual: you have to look at the PDF to see if everything's right. But you could also check the deviation from the expected fold change, for example, and this is what we are doing.

Yeah, something tying into this, which is probably obvious, I hope, from my lack of knowledge about proteomics: one of the things I love about nf-core, of course, is that each pipeline is handled by its own group of domain-specific experts on that topic. They maintain it, they run it, they answer the questions, and then we have a wider team who collaborate on the technical stuff. I think this is a real strength: it's one big community spanning lots of fields, collaborating on the more common things like the technical aspects, where we can all have the same kind of setup, while the smaller sub-communities know about the biology.

So, you mentioned that ideally you will have only one pipeline per data type. We are doing exactly the opposite in our project, because we're trying to compare different pipelines on the same data type, using, for instance, this ground-truth dataset to see where the pipelines might have problems or how they compare. In our field, one of the big problems is that you get a lot of different results if you use different pipelines; the overlap is maybe 70 or 80 percent, or less. So if we wanted to add our pipelines, how do we avoid trouble with you, given that we are targeting exactly the same data type but have five or six pipelines that we want to include?

So it comes down a little bit to the definition of a pipeline, I guess.
What we want to avoid is having a totally different set of documentation, a different way of running the pipeline, a different way of packaging the pipeline and everything. We want there to be one place to go if you want to run a data type. What we're fine with, and what we encourage, is having different ways to run that pipeline — if you like, different workflows within a single pipeline. If you look at the RNA-seq pipeline that we have, you can choose to align with STAR, you can use HISAT2, you can use Salmon; you can count genes with featureCounts, you can count genes with RSEM. The methylation pipeline is basically two pipelines within one — I actually did that so that I could benchmark two different analysis types, but it's still one pipeline. We still have one default, which will be the sensible choice for most people, and it's still a central place to go to, which is easy to find. So hopefully that makes it clearer.

Of course, the downside of doing it this way is that the pipelines can get big and difficult to maintain, which has become a bit of an issue, for example, for the RNA-seq pipeline, which is massive now. But what will come to the rescue there is DSL2 and its modularity, because then we can literally have defined sub-workflows within a pipeline which are kept completely physically separate. You can run them by name: you can say something like nextflow run rnaseq -entry hisat2, and it will run that workflow, which is a subset of the larger pipeline. It makes things much more modular, it makes the code much more manageable, and then you can easily benchmark different types of analysis within the same pipeline.

Would it also make it more manageable to merge different Nextflow pipelines that are already set up, and put them together? I assume that's probably not that easy.

It depends on the specifics.
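As a rough illustration of the DSL2 entry points mentioned above — all names, paths and parameters here are invented for the sketch, not the actual nf-core/rnaseq code:

```nextflow
// Illustrative DSL2 sketch: two alternative alignment sub-workflows
// kept physically separate and exposed as named entry workflows
// within one pipeline. Module paths and names are hypothetical.
nextflow.enable.dsl = 2

include { ALIGN_STAR   } from './subworkflows/align_star'
include { ALIGN_HISAT2 } from './subworkflows/align_hisat2'

workflow STAR {
    // Build a channel of paired-end reads and run the STAR route
    reads = Channel.fromFilePairs(params.reads)
    ALIGN_STAR(reads)
}

workflow HISAT2 {
    // Same input handling, but the HISAT2 route
    reads = Channel.fromFilePairs(params.reads)
    ALIGN_HISAT2(reads)
}
```

Each sub-workflow can then be invoked by name, e.g. `nextflow run main.nf -entry HISAT2 --reads '*_R{1,2}.fastq.gz'`, while the code for the two analysis types stays physically separate — which is what makes side-by-side benchmarking within one pipeline manageable.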
Yeah, I mean, this is something we hit occasionally. It's always a bit tricky when people come and say: I've developed this fully functioning, stable pipeline, I just want to add it to nf-core. Because that's not really how we set up the community — we set it up so that people come together to write pipelines from scratch, you know what I mean? So there's always a little bit of pain involved if you bring something in and have to bend it to follow all the same guidelines and so on. But the specifics of exactly how to go about doing that — that's a longer, more technical discussion, depending on exactly what you've got and what the end result should be.

Yeah, it's a little bit harder, because I think our pipelines are not written in a way that you can just take one tool from one pipeline and put it in another one. They very easily get into incompatible states where file formats and everything no longer fit. That's a bit of a problem we have, and that's why we set everything up completely separately — they are not exchangeable in that sense.

I mean, the definition of what's the same and what's different is also flexible. If they're different data types, or different formats and things like this, then it might be fine to have a set of pipelines. And like I say, with DSL2 we'll basically be able to write multiple pipelines within one, so it's entirely possible to have one place and still keep them separate.

Yeah, that's quite encouraging. DSL2 — I have to take a look at that. Yeah, Yasset?

Thanks for the talk, actually. I think my main question is related to the previous topic. There is something where proteomics is slightly different from RNA-seq, and probably genomics, where you have really stable tools that can be used in cloud environments. In proteomics there is a massive amount of tools, and they are not coupled very well together.
The command-line tools are not really the most popular ones; probably the Windows ones and the non-distributed ones are the more popular ones. So what I was saying is: it's difficult to build a community around a pipeline that says, this is the most stable one, this is the one we want everyone to use, okay? Even if Julianus and the OpenMS people and myself have been working on one, that would not be something where people can say: oh, that's for label-free, this is the one people should use — because there is a lot of debate, and also swapping one tool for another can make the whole result really quite different.

What proteomics is doing right now — and we have a paper around this — is moving into more cloud-based, Linux environments, and as a community we are working on testing and comparing tools for this type of move, okay? When we say: okay, instead of this FeatureFinder tool, which is in OpenMS, we should use something else; or for the downstream analysis we should use this package; or for the PTM localization score we should use this or something else. And my question is: is nf-core the place to do this kind of pipeline research? Can we embrace, for example, having multiple label-free pipelines in the future — the one developed by us, but also others — so that we can basically say what the differences between them are? Because that's where the proteomics community is, yeah? And it's completely different from the RNA-seq case. That's, I guess, the wider question — whether nf-core is the place for that.
Yeah, it's a really good question. Like you say, this is the kind of thing we need to figure out, and one of the things that's maybe a bit unexpected is that, coming from one field, I come with a whole set of assumptions that are not necessarily valid for everyone. I have run into this before with proteomics, actually — like you say, point-and-click, GUI-based tools which are difficult to put into workflows and so on. So I don't have a solid yes-or-no answer for you. I think it probably warrants a more specific discussion, so if you have a list of the things you want to do, we can chat about it on Slack. For me, there isn't a big difference between having multiple separate pipelines and having one pipeline with several workflows within it — the code is basically the same, and the way you run them is basically the same, but the latter comes with the advantage of being simpler to maintain and simpler to manage. If you don't want to have a default because there is no consensus on which approach is best, that's fine — we can throw an error when you try to run it without any argument, saying you must choose one.
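A guard of the kind just described might look roughly like this in Nextflow — the parameter name and the mode values are invented for illustration:

```nextflow
// Illustrative sketch: refuse to run when no analysis mode is chosen,
// for a pipeline with no consensus default. '--mode' and the listed
// values are hypothetical.
params.mode = null

def valid_modes = ['openms', 'maxquant']

if (!params.mode || !(params.mode in valid_modes)) {
    exit 1, "No analysis mode chosen: please set --mode to one of ${valid_modes.join(', ')}."
}
```

Nextflow's `exit <code>, <message>` idiom aborts the run immediately with a clear message, so users can't accidentally fall into a default the community never agreed on.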
Let me refine the question a little more: if the proteomics community is developing pipelines to do benchmarking, is nf-core the place for that, or should nf-core be the final destination after all this benchmarking research? Because I know benchmarking is really active, but from my experience with Julianus and the pipelines that we're putting into nf-core, it's actually more of a production setup: you have basically already made your decision about what to do, and it's about how you put a workflow into a production kind of environment that can run tests for you, with a benchmark data set that gives you the exact results you expect, and so on. It doesn't look to me like the place for playing with data and tools, combining tools and getting results out, because the amount of work needed to get something working in nf-core is quite a lot. That is my impression.

Yeah, absolutely. I would say the number one priority for us with nf-core pipelines is that they should be stable and reproducible. If your pipelines are not stable, then they're probably not ready for nf-core, okay? But that doesn't mean you can't use nf-core tools and the framework, and come along when they are stable — and it doesn't mean you can't ask for help and whatnot.

Sorry, I just wanted to partially answer Yasset, because you are making really nice comments, and I can share the view that we have in OpenEBench. As a platform for benchmarking we're not just interested in the technical performance but especially in the scientific performance, because depending on the problem people have at hand, one pipeline or another might be the right choice. If one pipeline is doing well on one data type or one specific setting, you will go for that one. Imagine it's sort of a circuit where we have, for instance, workflows that you benchmark in
OpenEBench, and then the ones that are systematically performing well end up in nf-core. People can then say: okay, this one I want to use — where do I use it? Either I download it, or I go to a place like nf-core, get access, and start production somehow. And at any time you can go back and revisit: is this workflow still doing fine, or is there a new one that has eventually overtaken it? Then as a community you can say: well, there is a new one that has overtaken the initial ones and it's not yet in nf-core, so let's go for this one — not 20, 30 or 100 different workflows, but a selection, because of course there is a massive effort to maintain them, document them and keep them up to date. Maybe it's a silly or super-optimistic view, but I think it might be something to work with.

I think that's nice. There's definitely some middle ground. The best example I have is my experience with the methylation pipeline: I have these two separate, side-by-side workflows within the same pipeline; I benchmarked them against each other, chose one, and I use that. But other people use the pipeline and they use the other one, because no benchmark is perfect for every use case. So I think if it's stable, if it works for people, if it's reproducible, then it's fine to put it in there, even if it's not necessarily the top scorer on a wide benchmark. If it's under development, if it's a bit tricky to get working and it's not really ready to use in a large-scale environment, then maybe it's not ready for nf-core.

I'd also like to comment on that, because I think in our case the workflows — at least most of them — are widely used, so as widely used tools they should be robust. But they are not, because we just set them up in Nextflow and there are mistakes — we didn't really care about this detail and that. And in order to do good benchmarking, I think you need robust workflows, otherwise you won't be able to do
the benchmarking: you don't set a parameter right, something goes wrong, you leave one workflow out because you forgot about a detail. So I think you need a certain robustness in your workflows, and nf-core could be a good place to go, because you have certain checks and you're forcing people to be more careful and to produce a more stable, robust environment. That's why I also see it a little bit as nf-core helping us — maybe not to get a fully production-ready workflow, but at least to get nearer to that, using your expertise in setting things up. Then we're in a better position to decide which one we might want to put into production, or merge into a generalized one, or not. That was one of the ideas I had on this.

Can I ask something also related to this? I think another strong point of nf-core, as far as I understand it, is this: benchmarking workflows tend to be done once — you do the benchmark, you decide what is best, and then it stays there forever and people move on, because they just did that. What I understand from the nf-core idea is that you are active maintainers of the workflows. So beyond having guidelines to produce something strong, the nf-core idea is about maintaining the thing, being there with the community, helping with training on the workflow and so on. Whereas the benchmark use case is more about proving what works best and what doesn't in order to move forward, and then it stays more static in the long term; people compromise on its use because it's only for experts, let's say, while a production workflow is more for users. I think that's the second thing to take care of: whether people in proteomics doing benchmarking should move to nf-core as the future
for all of this. Because, you know, if I arrive there and I see three workflows — two that are benchmarking ones, for example for label-free, and then the label-free one that has been developed by a community — what do I do? Do I go for the benchmarked one? It's confusing for the user. What I get from your message is basically that you want to have something stable in there.

Absolutely, yeah. One of those guidelines I mentioned is that every pipeline should have a named maintainer. It doesn't have to be the person who does most of the coding, but there should be someone whose name is there, so you can go and ask that person about things, you know. And the idea is exactly that it's not add-it-and-forget-it: there will be synchronization updates. We automate everything we possibly can, so once you've released your pipeline you'll automatically start getting pull requests with updates to the boilerplate core code. But there's always ongoing maintenance, because that, unfortunately, is the curse of open source: if you want your stuff to be used, there is always ongoing maintenance. It's worth saying, though, that the involvement of the maintainers within the community varies from pipeline to pipeline, and changes over time. We have pipelines that we developed at SciLifeLab and maintained for a while, and then another group came in, used them a lot, developed them a lot, and kind of took over that mantle. So it doesn't need to be fixed, and to some extent you can choose how much involvement you have in the community. But we do want there to be someone around who still has some responsibility for the workflow, to make sure it isn't just abandoned.

That could also be something beneficial for you guys. So say we start with a pipeline which is just starting up but wraps an established tool, and somebody who is not one of the main developers of that software — like
me here — implemented it in Nextflow, and then we hope it becomes sufficiently popular that the original developers come on board and actually take care of it. That would be the best thing, because they know their tools; it's not us — we are just using them, and I know a little bit about them. But the actual developers might not always be interested in having their tools in nf-core or Nextflow, or have other priorities, or simply not know about it. So that could also be something. And I think when you say: okay, I want only one pipeline per data type and so on, that might be a little bit drastic — I don't want to criticize, but in that sense you're closing a few doors, I think.

Yeah. I mean, some of the guidelines I put in place years ago we're starting to tweak and modify, so if we decide that this one is a major blocker — that it's causing more trouble than it's solving — then of course we can always review these things.

My opinion on this: I think you could probably make it work with one pipeline, but there should be something in the pipeline, or some mechanism in nf-core, that separates the different running styles more clearly. Because if you just do modules and sub-workflows for each of them and then put a large if-else at the beginning, I think what would happen is that, for example, the JSON schema would become incredibly complex, because all the parameters of the different workflows are just mixed together. So we'd probably need something that uses, say, additional tabs on the website for the different running styles of a pipeline — I don't know, splitting the schema per workflow or something.

Yeah, that's a good point. I hadn't really thought about that, but you're completely right — that would become a mess very quickly. I'm very interested in continuing this discussion; we should hop onto Slack with it. Because it's a thorny issue — it's definitely a thorny issue, it's come up
before, and if we can find something everyone's happy with, that would save me a headache as well.

There was Juan — do you want to say something? You have to turn on the mic, you're muted.

Hello, yeah. I have a very basic question about Nextflow: is it possible to exchange workflows between Nextflow and Galaxy, or the other way around?

Yes and no — mostly no. Galaxy works with CWL, the Common Workflow Language, which is — I mean, there are a few key big players in the workflow field now, and Nextflow is one and CWL is another. Nextflow has a slightly different ethos, a slightly different way of working, to CWL. In CWL you define the entire workflow in a static file before you run, but Nextflow is more dynamic — it's like a push system, where you start off at the top and the graph is auto-generated as it goes along. To my mind that's an advantage, because it means it fails better, and it also means you can have logic branches and things like this. But it means you can't export a Nextflow pipeline to CWL. However, this has been a bit of a recurring question with the Nextflow developers, and what has been talked about is the possibility of exporting a completed run — which, looking back retrospectively, is of course static — as CWL describing that single run, which could then perhaps be imported into Galaxy. As to the latest status of that, I'm not sure.

So from Nextflow you can eventually export some workflows into Galaxy, but not the other way around?

That is my understanding of it; others might disagree. I also don't know very much about Galaxy — maybe Galaxy has other ways of running things.

I think, from the last time I looked at this: when the workflows are simple it's possible — that is maybe 1% of the workflows — and then when they start really using all the channel possibilities and things like that, it becomes impossible. So I would say it would only be possible for 1% of them. My
question is probably not for you — because I know you are involved with Nextflow, but not really as an active maintainer of Nextflow itself. What are the plans? At some point interoperability between workflow languages was really important, but now things in Nextflow are becoming more complex, and with nf-core things are becoming — I won't say proprietary, but more focused on the nf-core and Nextflow community. Do you think cross-compatibility with other systems is still a priority? Also, I have seen that platforms, rather than working on cross-compatibility, are supporting the environments instead — for example DNAnexus, which I am working with on another project: on DNAnexus you can now run Galaxy if you want, CWL, WDL and Nextflow as well. The cloud environments, the bioinformatics cloud companies, are basically saying: you can run whatever you want — not cross-compatible, but we support the infrastructure, because everything is in containers, Conda, Biocontainers and so on. What do you think is the future?

Yeah, so I think it's really interesting. I think the main problem with having interoperability between all the different languages is that, by definition, if you want that, you are reduced to the minimum feature set of all the languages — you have to say: okay, this is available in every single language, so we can support it. So if you want 100% compatibility with CWL or whatever, you can't have any of the more advanced stuff that maybe you could have in just Nextflow. So I'm not sure the payoff of chasing 100% compatibility is worth it. However, like you say — also Dockstore, which I mentioned earlier, started off with CWL, but you can register Nextflow workflows there now, and they
index them and store the metadata automatically in the same kind of system. So I think it doesn't have to be a problem — but of course other people disagree with me; I know that for a fact.

If I can comment on that: there is a consistent effort in EOSC-Life, specifically, to have the abstract specification of the workflow in CWL and then the specific implementations in Nextflow, Galaxy, KNIME — I think there are a few of them at the moment. So you might have some automation of the transfer from one to another, but as you were saying, it is really difficult to reach 100%, because people are not reinventing the wheel identically — it's more like: well, this one does this specific aspect better than the others, and so on. So you can convert most of the stuff, but eventually you will need some minor curation just to finalize it and say: okay, I have transferred or converted one into the other.

Yeah, absolutely. And actually, some people from WorkflowHub got in touch with me at BOSC this summer, and I've been working on that — I haven't had much time for it yet, but over the summer I wrote some very rough code to generate that kind of metadata, and then we'll be able to generate it automatically and have it for every nf-core pipeline, which can then be shared with WorkflowHub and listed there. And exactly as you say, you then have that middle layer: the metadata, which is standard between every language, means that all these different systems can have a consistent kind of infrastructure which can run every language, while the actual engine underneath is still bespoke.

I have another question. I think the most apparent incompatibility when just moving a workflow to a different platform — like Nextflow to Galaxy — is that a big part of the differences are the shims: small file conversions, small things here and there which are necessary to make tools compatible; you have to change some small thing here and there. Do you have in nf-core a
way, or are you planning to do something with them — to have them marked, or to have a specific way of dealing with these small changes, like keeping them as a separate process or something like that? What are you thinking about this?

Not that I'm aware of, is the short answer. But of course, if someone wants this or needs this, then no one's stopping them from going ahead with it. Everything is open source, and we have a kind of hardcore open-source ethos: we want people to take the pipelines and build on them and do other stuff with them. So if someone wants to convert nf-core pipelines to whatever, and maintain the set of shims to do this in an automated way, then that's brilliant — but I don't know of anyone doing it now, and I'm not sure it's going to be a high priority for the nf-core community in the near future.

Maybe it's not so much about conversion, but more about having these shims as a separate entity. So beyond the organizational part of setting up your workflow, you might say: okay, we're running the tool, but all the file conversions, all the small little things we have to do, we do in a separate process — so it's clearly marked as something done to the data which is necessary, and kept as a separate process. That's what I'm thinking.

I'm not sure I totally understand what you're getting at.

So, for instance, you get output from a tool — the database search in our case — in a certain file format, let's say pepXML, whatever it is, and then you need to change some parts of the file to be able to read it with the next tool. Or maybe a better example is a CSV file: you have to rearrange it to be readable by the next tool. This conversion can be quite important, in the sense that it's a part which is usually hidden between the lines.

Yeah, absolutely — okay, I'm with you now. So that is something where I think we probably have slightly
fewer difficulties with file formats. For example, a couple of the pipelines take genome annotation files, which are typically GTF, but some people use GFF — so some pipelines have the option to give either, and if you give GFF, there's a process that converts it to GTF, which then feeds into the rest of the pipeline. So absolutely, yes: that should be done as a process in the workflow, just like any other analysis step. And the benefit then, especially with DSL2, is that if you add that to nf-core modules, you can reuse that functionality across multiple pipelines if you want to. I'm not sure if that completely answers your question.

But I don't know whether nf-core is then also thinking about, or trying to, push people towards having these as separate processes. I think that would also help make things compatible between different platforms.

Yeah, definitely. It's not something that's really come up, so I'm not sure it's explicitly stated anywhere, but that would be my view, and I think it's fair to say it's probably shared by most.

Cool. More questions? Okay, thanks — it was a very exciting discussion, I loved it; I think it was fun. And I assume you're available for questions afterwards if people want. Thank you!
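As a closing aside, the GFF-to-GTF conversion discussed above — a "shim" made explicit as its own pipeline step — might be sketched roughly like this. The process name is invented; gffread is a real tool, but check its options against your installed version:

```nextflow
// Illustrative sketch of a format-conversion step as a first-class
// process, so the change made to the data is visible in the pipeline
// graph rather than hidden inside another tool's script block.
process GFF_TO_GTF {
    input:
    path gff

    output:
    path "${gff.baseName}.gtf"

    script:
    """
    gffread $gff -T -o ${gff.baseName}.gtf
    """
}
```

Structured this way, the conversion is cached and resumed like any other step, and as a shared module the same shim can be reused across pipelines instead of being reimplemented inline.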