This is really a very brief introduction to Nextflow, just to give you a sense of how we developed this tool, why we developed it, and a high-level take on how it works and what makes it a little bit different from alternative tools. Because you signed up for this course, I guess all of you know that Nextflow is a pipeline language, and a pipeline language means two things in this case, which is not necessarily the case for any language: it is about writing your pipelines, and it is about running your pipelines. These are the two functionalities Nextflow is going to support for you. That is very important, because in a way, running your pipelines is just as important as writing them. It is not like writing a computer program and then running it on your computer. When you run a pipeline, you usually face dilemmas about processing one million samples in parallel or in serial, about throwing some part of your computation onto the cloud and then reproducing it on an HPC, and all these high-performance computing concerns. There is nothing trivial about this, because it requires interactions with increasingly complicated layers of software: queue managers, queuing systems, and all these types of things.

Nextflow will take care of parallelization, or what some computer scientists call embarrassingly parallel computational problems. What is this? It is a problem where the parallelization is implicit. If I design a multiple aligner and I want to do smart parallelization, I will have to go very deep, start editing the code, and figure out whether I can cut a matrix this way or that way. That would be non-embarrassing parallelization. What is embarrassingly parallel is a situation where you have thousands of samples that are all handled identically and you want to process them all in parallel, because they do not have to talk to one another to be properly processed. That is typically the kind of parallelization Nextflow is going to take care of for you. And of course, it is somewhat down to your own design at which level you want the parallelization to happen. If you are managing samples from hospitals, you could treat every hospital as a unit of computation, or you could go very low-level and treat every patient as a unit of computation. That is your decision.

Another very important aspect is that Nextflow (only one e) supports containers, which means it contributes to reproducibility, and to this concept that many of you have probably heard of, which is FAIR. FAIR is a contribution towards computational reproducibility. It is the idea that any in silico object has to be findable: you have to have something like Google, or the equivalent, to find it. It has to be accessible: if you click on the link you get, you should not get a 404. It has to be interoperable: all of these things have to be pluggable into one another. And it has to be reusable, reproducible; and it is on the reproducibility side that Nextflow makes a contribution.

Now, what makes Nextflow special? It is my baby. It is mostly Paolo Di Tommaso's baby, as he was the person who developed it in our lab, but it is also my baby; we share Nextflow. And of course, you always think your own tools are very special, which means that my words about this are just as unbiased as I can make them. But I will try my best.
So in Unix, and all of you are familiar with Unix, some more than others, I guess all of you are familiar with this pipe symbol. It is a very, very low-level component of Unix, and something very important: you can pipe data, meaning that if you have a process A producing bits of data, and when I say bits of data it is precise: in the case of the pipe it is going to be line by line, the line is the unit of data. And on the other side, you have a process B that is able to consume this data line by line, and the pipe is simply connecting these two processes. Now, behind the scenes, unknown to you, managed by Unix, something small is happening: the pipe is cutting chunks of one line and passing one line at a time, and the process on the other end, which you wrote or somebody else wrote, knows how to consume the data line by line. We do this all the time. Computer scientists actually hate us for doing this, because piping stuff in Unix is the opposite of the very integrated, high-level things they do in computer science. Yet if you look at most of the things happening in bioinformatics, they are glued together by one-liners; the number of one-liners used over the years in bioinformatics is amazing. Of course, there is only so much you can do this way in terms of pipelines, and at some point it becomes very difficult to debug pipes that are piped into pipes, into pipes, into pipes, and so on. But what is very powerful in this approach is that process B does not need to know how much data it is going to have to process; it does not need to have an overview. And process A does not need to care about what is going to happen to its data. Data is simply flowing. And of course, if you wanted, you could be smart here: when you receive the data, you could send it to many, many different instances of B. This is exactly what Nextflow is going to do for you.

Now, the pipe is similar to something known as reactive programming by computer scientists. The idea is that you have a program waiting and waiting and waiting, and then suddenly something arrives. We have had quite a lot of big waves in Barcelona these last days, and there were people surfing; and I think, Julia, you were in California for a while. If you look at surfers, they do nothing really, they float in the water; they see a wave coming and they get excited or not, and they wait and wait. That is exactly how it works with Nextflow: the processes wait, at no cost if you do it in a smart way, until something arrives to be processed, something arrives to be consumed.
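To make the analogy concrete, here is a minimal, hypothetical sketch of the same idea in Nextflow (the file names and the workflow are invented for illustration): the channel plays the role of the pipe, and every item that arrives triggers an independent task, so many instances of the consuming process can run in parallel.

```nextflow
// Dataflow sketch: the channel is the pipe, the process is "B".
// Each file arriving on the channel triggers one independent task,
// and independent tasks run in parallel -- no global plan is computed first.

process countLines {
    input:
    path sample              // one unit of data flowing through

    output:
    stdout

    script:
    """
    wc -l < ${sample}
    """
}

workflow {
    // hypothetical input files; one task per file, results printed as they arrive
    Channel.fromPath('data/*.txt') | countLines | view
}
```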
Now, this way of defining computation is exactly the opposite of a Makefile. Most of you have probably come across Makefiles. When you install a Unix package, you have a configuration file, the Makefile, which describes the dependencies between everything. For instance, a .c file implies that, in order to generate the executable, something has to be applied to turn the .c source file into a .o object file; all these dependencies form a dependency graph. When you type make with a Makefile, the first thing that happens is the computation of all the dependencies; before the work even starts, make has to compute this graph.

When it comes to compiling programs, this is amazing, this is unbeatable, because you do have to know these dependencies. But when it comes to data processing, it can be a pain, because if you have a very large number of samples, and a very deep stack of operations to apply to them, you may have a graph that is actually very dense, a graph that could take more space, more memory and more time to compute than the actual computation. This is the reason why we decided not to go for a Makefile-like way of processing data. We decided to go the other way around, for a reactive-programming way of doing this, and that makes a huge difference. We do not need to compute the graph, but it is harder for us to do a dry run, for instance. The alternative to Nextflow was Snakemake, and Snakemake was really built, initially, on top of the Makefile idea: you do your computation as you would with a Makefile. This was a very good idea. At the time, very honestly, we did not know about Snakemake; if we had known about it, we might have been using it. But we did not know about Snakemake, and we knew that we did not want our tool to be built on the Makefile model, for the very reason I have just explained: we knew that we were going to have to do massively parallelized computation. Okay, so that is the difference: Nextflow flows the data, a Makefile prepares the data. That is really the main difference between these two approaches. On a small instance, it makes no difference at all; Snakemake is well implemented, and every once in a while they figure out small optimizations so that you can scale up. But as many people may have told you, with Snakemake every once in a while you hit a wall, because you have gone beyond what can really be pre-computed. With Nextflow you never hit this wall, because by definition the computation is going to scale with your capacity. The only snag with Nextflow is that it is a little bit harder to do a dry run. You know, in computer science there is no such thing as a free lunch: if you get a benefit on one side, you lose it on the other. But in practice there are many simple alternatives, sub-sampling your data, for instance, to figure out if you can run an analysis on a smaller instance, and all these types of things.

Okay, so this thing here is probably the first pipeline we implemented in Nextflow in the lab. It is something called Unistrap, an extension of the bootstrap that takes multiple sequence alignments into account. You can see that each of these things here is a program, and each is defined as a process. And what do you need to do? You need to wrap it; this will be your script. It is very important that your script can be written in any language you want. We needed this because all the scripts already existed in the lab; we were not going to create new scripts from scratch for everything, we needed to be able to reuse everything.
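As an illustration of that wrapping, here is a hedged sketch of a process whose script block is plain Python; the shebang picks the language, and the script body and file names are invented, not from the talk.

```nextflow
// A process is a thin wrapper around a script you already have.
// The script block is taken verbatim; with a shebang it can be any
// language -- Python, Perl, R, plain shell, whatever the lab already used.

process runLegacyScript {
    input:
    path aln                 // e.g. a multiple sequence alignment (hypothetical)

    output:
    path 'report.txt'

    script:
    """
    #!/usr/bin/env python3
    # an existing lab script could be pasted or invoked here, unchanged
    with open('${aln}') as f, open('report.txt', 'w') as out:
        out.write('sequences: %d\\n' % sum(1 for l in f if l.startswith('>')))
    """
}
```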
And, Paolo Di Tommaso hates it when I say this, but if I were to summarize Nextflow in a few words, I would say Nextflow is the equivalent of HTML for pipeline development. In HTML, you have something you want to see written on the screen, and you wrap it in tags; the tags tell the computer how this content is going to have to be displayed. That is exactly how Nextflow works, except that rather than tagging with formatting, I am tagging each piece of the pipeline with a description of its input and a description of its output: I know where the input comes from, and I know where the output goes to, or what type of output is going to be produced. And of course, it is up to every process to declare what it expects and what it produces. And when you have written this, implicitly you have a graph like this. This graph here is not produced by you; it is a consequence of the way your pipeline has been written in the Nextflow language, okay?
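Here is a minimal sketch of that tagging idea, with two invented processes and placeholder tool names: each process declares its input and its output, and wiring them together in the workflow block is what implicitly defines the graph.

```nextflow
// Each process is "tagged" with what it expects and what it produces.
// The graph is never drawn by hand: it falls out of how outputs are
// wired into inputs in the workflow block.

process alignSeqs {
    input:
    path seqs

    output:
    path 'aln.fa'

    script:
    """
    my_aligner ${seqs} > aln.fa        # hypothetical tool
    """
}

process buildTree {
    input:
    path aln

    output:
    path 'tree.nwk'

    script:
    """
    my_tree_builder ${aln} > tree.nwk  # hypothetical tool
    """
}

workflow {
    // seqs -> alignSeqs -> buildTree: a two-step graph, implied, never drawn
    buildTree(alignSeqs(Channel.fromPath(params.input)))
}
```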
This is what Nextflow does, but there is actually much more to it, and that has to do with reproducibility. At the time Nextflow came out, our intention, to be honest, was not to make a contribution to reproducibility. Our intention was to have a simple, effective system to run all the pipelines in the lab. But it turned out everybody was noticing that there was a major issue with reproducibility in modern research: the impossibility of exactly reproducing published results. And very few people knew it, but there was also a reproducibility issue in computational biology. We actually stumbled on this with Nextflow, which is one of the reasons, not the only one, that we made it into Nature Biotechnology, actually, because that aspect was even more important than the ease with which one can compute pipelines. It was actually Evan Floden, one of the co-authors of the Nature Biotechnology paper, who came up with this observation: if you take your pipeline and run it on two different computers with exactly the same software versions, exactly the same data, everything you can control for being identical, you will still, more often than not, get slight differences. For instance here, in the main illustration of the Nature Biotechnology paper, we are just finding differentially expressed genes using Kallisto. Everything being equal, between Amazon Linux and macOS there should not be any difference. You find roughly the same thing, but you are going to find 74 genes that appear to be differentially expressed on your Amazon box and 64 that only pop up on the Mac. Why is that? I have no idea, and trying to investigate it would be a waste of time, because if you think of these computers as machines and of every line of code as a moving part, you are talking about billions of moving parts. There is no way to figure out which library is slightly ahead, or slightly behind, or slightly buggy. And in fact, the 64 and the 74 genes here are negligible from a biological point of view, but from an operational point of view they are essential, because you have to be able to reproduce your computation exactly: if only because six months or a year after publishing or sending a paper for review, you have to redo or debug something exactly; or because these will be patient samples and you want to make sure you get exactly the same readouts regardless of the hospital in which they are processed, okay? So before we knew it, people started doing amazingly complicated things with Nextflow.

This is a companion pipeline done by the Sanger Institute. Just as we were writing the paper, we saw that this monster had come out of nowhere; I think it bundles more than 30 different software components, and that emerged as a major strength of Nextflow: it is a great way to bundle all your software. And one thing I forgot to say about the last slide: how did we solve this reproducibility problem? We solved it, or rather addressed it, with containerization, Docker and all these kinds of things, which are the solution to this problem. If you containerize all of your software, you still have no way to know whether this result or that result is the correct one, but what you get with dockerization is that everything is going to be the same regardless of the platform on which you run it, and that is why it is so relevant, for instance, for medical computation. And as I was saying a second ago, it is great because you can bundle so many tools together, and it becomes a no-brainer to move them from one platform to another, onto the cloud, regardless of the flavor of Linux used on your cloud, regardless of anything.

Now, the funny thing is that when we were doing this, we found that it is not only gene prediction, not only gene expression: we also found these very clear fluctuations in phylogenetic reconstruction. They seem to be in a part of the numbers you should not care so much about, because this is really in the low digits. But no: if you are dealing with a tree that features millions of leaves, as we are going to do now, these things eventually induce topological variations as well. So that is something real happening here. And again, this instability is not the consequence of random seeds or a genuinely stochastic process. No, it is a result with everything being deterministic. The culprits are typically slight differences in rounding, actually; that is usually the main cause of these things. Rounding is an art in computer science, and as it happens, it varies a little bit from one flavor to the other.

So what are the secrets? Docker, really; the fact that everything is registered where it should be; heavy dependence on GitHub; heavy dependence on Zenodo as well, integrating the datasets in Zenodo, meaning that if a pipeline has been published and properly registered in the right place, all you need to do is run a one-liner on your machine and you are going to be able to rerun exactly the same pipeline that was used to generate the results in a paper. Okay? Of course, Nextflow is very nicely integrated with all of these things, and this is something that is going to be explored much more deeply during the course, so I will not spend much time explaining it. But the idea is that your pipeline can be written in any language you want, and it can be containerized with whatever you like, Singularity or Docker. As you know, system managers usually do not like Docker too much because of some security issues; Singularity is more popular, and all these kinds of things. All the main platforms are supported, and that is very important as well.
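As a hedged sketch of what such containerization looks like in a process (the BioContainers image tag and the command details here are illustrative, not the paper's actual pipeline):

```nextflow
// Pinning the exact software environment per process: the same declaration
// runs under Docker or Singularity, so results match across platforms.

process quantify {
    // illustrative BioContainers image tag for Kallisto
    container 'quay.io/biocontainers/kallisto:0.46.2--h4f7b962_1'

    input:
    path index
    path reads

    output:
    path 'abundance.tsv'

    script:
    """
    kallisto quant -i ${index} -o . --single -l 200 -s 20 ${reads}
    """
}

// In nextflow.config you would then enable one engine or the other:
//   docker.enabled = true
//   singularity.enabled = true
```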
Okay. So for a long time, the problem with Nextflow was that it was non-modular. Each time you needed to bundle two pipelines, you would have to rip them open and do a lot of cut and paste, and that is really not what you want to do. So Paolo spent quite a lot of time shifting to DSL2, Domain-Specific Language 2, which is the current version of Nextflow, and which is modular. It means that whenever you write a pipeline, you can easily combine it with an existing pipeline without having to do any cut and paste or anything invasive. It seems to be taken for granted that everything is going to be modular, but when it comes to pipelining, it is not trivial, and it took a bit of work. But I understand this is having a lot of success.
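To show what that modularity looks like, here is a minimal DSL2 sketch with an invented module path and parameter name: a process defined once in its own file is pulled in with an include, with no cut and paste.

```nextflow
// DSL2 modularity: processes live in their own files and are imported,
// so two pipelines can share code instead of copy-pasting it.

// modules/fastqc.nf is assumed to contain a `process FASTQC { ... }` definition
include { FASTQC } from './modules/fastqc'

workflow {
    FASTQC(Channel.fromPath(params.reads))   // hypothetical parameter
}
```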
Now, comparisons. Never trust any benchmark done by the authors themselves; that is my rule in multiple sequence alignment and all these kinds of things. This is why I am always so happy when I see a nice comparative analysis done by others, and this one just came out in Nature Methods, and it shows something very nice. You get all of these criteria: ease of use, which is always a bit arbitrary; expressiveness, which I had to look up on Wikipedia because I was not too sure, and which basically means that you are able to express any computational concept you want with your language; portability; scalability; learning resources; pipeline initiatives. So here, near the top, you have Galaxy. Galaxy is a big monster, but bear in mind that Galaxy was not built around containerization, it was not built around HPC; it was built to be useful for users. And what is nice about this list of popular tools is that they all occupy their own corner. You are not going to see any of these tools with three stars in every column. This tells you that your job as a biologist is not so much to be able to use a tool; your job is also to figure out which tool best fits your needs and your users. For instance, Galaxy is great because you just draw boxes around things, and before you know it, you have something that works. But what you can do in Galaxy remains limited, and that is a natural trade-off: it is easy to use, but you can do only what you can do. On the other side of the spectrum, you have Nextflow and Snakemake, which are kind of the same here. They are a little bit less easy to use, you have to learn some stuff, but their expressiveness is infinite; you can do absolutely anything you want with these things. And then the portability: I am arguing, though I am sure the Snakemake people will have a counter-argument, that Nextflow is a little bit more portable, and these authors actually agreed. Now, this being said, to be honest, in the consortiums where I sit, I tell people: if you have gone with Snakemake, it is fine, there will not be any massive difference; if you do not have the resources, do not try to migrate from Snakemake to Nextflow. But if you start from scratch, of course, I think Nextflow is a better starting point.

It is always interesting to ponder what made a project successful, because Nextflow has been successful well beyond our expectations, and very often I see projects failing in the long run because people did not really understand the critical point that made them so successful. So I have been trying to think about this. Behind the success of Nextflow there was urgency: it was easy to use for urgent problems in a small lab, meaning we were working for ourselves, and that makes a huge difference. That makes it a grassroots project. It was really made by users, for users. It was also simpler than the complicated alternatives: we first tried to do CWL like everybody, and it was undoable. And especially, and that is one of the things that worked very well for us, Paolo was amazingly effective at engaging the members of the lab. As I told you before, Evan, who figured out the non-reproducibility issue, was not working on Nextflow; that is just something that came later. Very rapidly, we built a user community, and if you are new to Nextflow, you are going to figure out that one of the main perks of working with Nextflow is all these very active exchanges on the forums. Whatever question you have, you will very rapidly get an answer, and that is really one of the perks of working with a very active project.

Now, the big surprise, the nicest surprise we had, amid something very, very sad, I mean COVID-19, was to figure out that a very large fraction of COVID data actually went through Nextflow-powered pipelines. We never had this in mind when we started Nextflow, and people did it because Nextflow was an easy solution; that is the real reason these things happen. Implementing very large pipelines, and developing them without knowing where they are going to be run, is something very tricky; Nextflow makes it relatively simple, and that is one of the reasons behind this success.

Another aspect that I think is essential is readability. Why do we need readability? Because all of these medical pipelines will soon decide about our lives. And we have seen it with COVID: fake news is amazingly popular. Let us think of a very sad situation where you are sick, you have a cancer, maybe, and your data is passed through a pipeline. The pipeline is going to come out with a magic number, say 19.5, and your doctor is going to tell you: there is a new drug, this wonderful drug you have heard about, but for this you need a genetic score of, say, 20. At 19.5, you are not eligible for this treatment. So not only are you sick, but you know you are going to get the treatment that makes people lose their hair, that makes you sick; it feels like a double pain, a double punishment. And now, if you do not trust the pipeline because the pipeline is so complicated, then you are going to mistrust the whole system, and we are back in the fake-news world, where you are going to see headlines running: they are trying to control us with their pipelines, or whatever. Against this there is only one answer, and the answer is to make sure that the pipelines are readable, that the algorithms deciding on our lives are readable. And how do you achieve this? You are not going to make pipelines readable by everybody, by your grandmother. But if your pipelines are well written, they will be readable by a large number of people, and among these there will be a large number of people you trust. And if the people you trust can look at the pipeline and say, yes, this is a fair pipeline, a pipeline using state-of-the-art knowledge to take the right decision, then you as a citizen are going to trust the output of these pipelines. And if the pipeline outputs that, unfortunately, you are not eligible for the very expensive treatment that works for others, you will accept that this is part of life. And that is why I insist that we need readability.
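To make that concrete with a deliberately invented example: the difference between a buried magic number and a named, documented cutoff is exactly the kind of readability being argued for here (the parameter name, values and patient data are hypothetical).

```nextflow
// Invented illustration of a "readable" eligibility rule.
// Opaque one-liner:    eligible = scores.filter { it[1] >= 20 }
// Transparent version: the cutoff is a named, documented parameter
// that a reviewer you trust can find, question and trace.

params.min_genetic_score = 20   // eligibility cutoff (hypothetical guideline)

workflow {
    Channel.of(['patientA', 19.5], ['patientB', 21.0])
        .filter { id, score -> score >= params.min_genetic_score }
        .view   { id, score -> "${id} eligible (score ${score})" }
}
```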
Think of fake news: if fake news were alleles, they would be the fittest by far. And that is the problem we have with fake news: they are amazingly fit. And our only protection against fake news is full transparency. I am arguing here that the level of transparency in Nextflow, or in this kind of language, is part of this. And it is very important that I am telling you this, because you are all soon going to write your own pipelines. In molecular biology, all of this used not to matter so much, because it was research. But now, suddenly, you are in a situation where this one-liner in your pipeline says greater than five, or greater than or equal to five; well, maybe that is a hundred thousand people who stop being eligible for the treatment, or who become eligible for the treatment, on just this tiny one-liner. And it is your responsibility to make sure that all of these things are written in such a way that they are transparent and trustable, okay?

So I am sorry, I am taking too much time; I apologize, Jose, I am taking three more minutes. Jose, who is talking just after me, is going to tell you about another development around Nextflow, which is nf-core. If Nextflow were a new drug, nf-core would be the pill in which this drug is distributed. It is a very important development. nf-core is both a collection of high-quality pipelines and a standard under which Nextflow pipelines can be written. You do not have to use nf-core to write your Nextflow pipeline; in fact, if it is a toy pipeline, maybe it is not worth the effort. But if it is a pipeline that you want at some point to see published and become public, then I really recommend you take a look at nf-core. It is something that is going to have a lot of influence, and Jose is going to say a bit more about this, so I will not spend too much time here. It is a very nice, powerful resource that comes along with a lot of powerful tools to write increasingly standardized pipelines; that is very important. And it has a fast-growing community; it is quite amazing. We are encouraging all of our partners in BovReg and EuroFAANG to use these pipelines and these languages.

And I have to finish by giving credit where it is due. These two guys here: Paolo is really the mastermind behind Nextflow. He wrote this language; maybe 95% of the original code was written by his own hand, and he really had a very strong vision for it. Evan here was a former PhD student of mine, and he is the one who figured out this issue about reproducibility, something important that gave a lot of visibility to Nextflow and helped a lot in reaching the success it has. And Maria Chatzou here was also a PhD student and an early adopter of Nextflow. All these people thought it was time to go to industry. I will not say I am desperate, but it is something that people of my generation have to accept: your brightest students now do not necessarily stay in academia; some of the brightest also go into industry, which is something relatively new. It is a new phenomenon, and it is an interesting development. And so all of these things are used by a lot of people.

And what is next? I will finish here. Jose is working actively on something we call a native benchmark, which will be a way, you know, all of these pipelines bundle together tools that have been published.
And you know, they do well on some datasets, you have some benchmarks on some datasets, but will they do equally well on your dataset? If you were designing your own benchmarks, would they work? We want to make sure that when you have a bundled pipeline, you can quantify the accuracy of this pipeline automatically, at any given time, given any existing reference datasets. This would allow, for instance, automated tuning, which is something that would be very desirable. As Janna mentioned earlier, I run the NAR Genomics and Bioinformatics journal. We started this journal three years ago now; it is a relatively young title, and we have just started being indexed in PubMed. NARGAB is a sister title of Nucleic Acids Research: Nucleic Acids Research realized that they had too much in silico and bioinformatics content, so they had to do something, and they created NARGAB. And I am just about to start a section dedicated to pipelines: a section where pipelines that are not necessarily scientifically innovative, but that bundle together essential resources so as to become essential resources themselves, can be published and can be referred to as scientific resources. That is really what we are trying to do here. And of course, it has a strong connection with Code Ocean, which is a new approach, slowly being rolled out in many journals, where your paper is published not as a static paper, as it used to be, but more as a lab workbook, like a Jupyter notebook and all these kinds of things, so that the graphs, the data and the underlying computation are all bundled together. And so I want to thank everybody who took part in Nextflow, especially Evan and Paolo and Maria, and Jose of course, who is now going to take the floor and talk about some more specific aspects of Nextflow.

Okay. So yes, I am Jose Espinosa-Carrasco, and I work in the group of Cedric Notredame, developing pipelines, as I have just been nicely introduced; and Cedric has introduced Nextflow and nf-core very nicely, so I think I can jump directly to the presentation. So, what makes Nextflow strong? This is something that Cedric has already discussed a little, and I totally agree: probably one of the things that made Nextflow very strong from the beginning is that it had a very enthusiastic and active community behind it. You might think this is not so important, but if you think about it, it is not just about numbers, having users and so on; the most relevant point is that a strong community drives the innovation process. And this is one of the reasons why Nextflow is probably one of the workflow managers that supports the most environments, container engines, et cetera, and also one of the most popular. So yes, I would say this is one of the main reasons for its success. And of course, Paolo and Evan and all the people who have been working on Nextflow have worked very hard, but I think that without such a dynamic community, Nextflow probably would not have been so successful. Also, as Cedric mentioned during his presentation, there is nf-core. Not all the people who use Nextflow are involved in nf-core, but nf-core was created in early 2018, and now I will introduce this nf-core community a little more in depth. Okay, so as we are talking about community, here I just put some numbers for you to see.
So these are, for instance, the numbers of Slack users over time, and it is quite striking: when I was preparing this presentation, I was looking at a presentation from less than a year before, and it has increased something like ten-fold. So it is really impressive. Then if you look, for instance, at the number of GitHub nf-core organisation members, this is how it has evolved over time, and there are now almost 400 people involved in the nf-core organisation. Then here I am showing how many people have contributed to nf-core over time. As I said, nf-core started in early 2018, and you can see how many people have contributed, either opening issues or pushing commits to the repositories. And finally, here is something I think is also very interesting: when you open a pull request in any of the nf-core repositories, you can see how people normally respond very fast, and the same is true for the issues. Of course, some issues go stale over time, but I would say this really reflects how dynamic this community is, because there are always people answering, in this case on issues and pull requests, but also on Slack: if you have any question, you go there and there is always someone answering. So yes, just to finish this part about the nf-core community, what I am showing here are some interesting links; I guess we will circulate the presentation, so in case you are interested, for instance, in joining Slack or going to the nf-core website. I would also recommend that you take a look at the YouTube channel, because now, for instance, each Tuesday at one o'clock there is a 15-minute talk; it could be about Nextflow, but normally it is more nf-core-related stuff. For instance, I took this one, number 24, because I think it is quite interesting, and there have been already a lot of them: "How do I start writing my own DSL2 pipeline". They are 15 minutes, so you can just go there and see what is going on, and they are also on YouTube.

Okay, so Cedric already mentioned what nf-core is. nf-core is a community, as I have already introduced, and what this community wanted was to have a curated set of analysis pipelines built using Nextflow. This led to a series of guidelines for implementing these pipelines, so it is becoming, let us say, a standard for how you can build pipelines following best practices in terms of computational reproducibility, interoperability and portability. And what is also very interesting is that they have developed some helper tools that can be used both by users and by developers who want to get involved in implementing pipelines, and not only in the nf-core ecosystem: you can also use them for implementing your own pipelines. So yes, here I am showing: if you go to the nf-core website and you click on pipelines, you will get this list.
Now, as you can see, I just filtered them by popularity, and you can see here that these pipelines can perform most of the common omics analyses. You have here, for instance, the RNA-seq pipeline, which is, let us say, the flagship pipeline of nf-core, the one that normally gets the new implementations first; for instance, I think it was the first to be implemented in DSL2. Then you have others that maybe some of you are familiar with: for calling germline or somatic variants, ChIP-seq, ATAC-seq. And what is more interesting is that, while at the beginning it was mostly genomics stuff, now more and more proteomics pipelines are being developed, or imaging pipelines, and so on. In this list of pipelines there are already 33 released pipelines. A pipeline is released when it has already been validated, it follows the nf-core standards and it is in production. There are also 15 in development. This does not mean they cannot be used, but they may change from one day to the next, because they are being actively developed and there is no official release yet. And then there are five archived: maybe they are not maintained anymore, but for traceability, and in case someone needs reproducibility, they are archived and available as well.

Okay, as I have said, nf-core establishes a set of guidelines; here you can see which are requirements, in green at the top, and which are recommendations. So the first requirement is that pipelines should be built using Nextflow; that is very natural, no? Then they should have an MIT license. Then, and this is maybe more interesting, the software that these pipelines use should be bundled using Docker or Singularity, because this is what makes the results of the analyses these pipelines perform reproducible. We will discuss containers during the course, so just for those who may not be familiar with them: what containers allow you to do is sandbox your software inside the container, and this way you know that you are running a given version of a tool with all its libraries and everything, and this is immutable, meaning that each time you run a given tool you will get exactly the same results, and you do not depend on the operating system or the libraries you have installed, and so on. Then the pipelines have to have CI testing, and they should include a minimal test dataset for this end. What this means is that it is always good practice to have a small dataset along with your pipeline that allows you to run all the steps that are important in your pipeline, so that in case you modify something and you break it, you know it is happening and you can fix it. Then they also have to pass the nf-core lint tests. These are, yes, some standards on how to write the code, plus some files that are mandatory, such as the JSON schema, which allows, for instance, running the pipeline using a form, as I will show later in my presentation. They also have to have stable release tags.
So as you have seen, there are already 33 stable pipelines, and all of them have been released on GitHub; they should also get a DOI using Zenodo, so that you know which version you are using and you can run it again in case you need to reproduce your results, and so on. Then they also share a common structure and usage, so that once you learn to use one pipeline, it is easy to run another pipeline, because they have the same structure and you can use similar commands. It should be possible to run the whole pipeline in a single command. They should also have comprehensive documentation, which is something that people sometimes forget, but I think it is a very important point, because sometimes you find a very nice piece of software that is very difficult to run, because you do not know where to go to learn how it works. And there should be a responsible point of contact: a person, let us say the main maintainer of the pipeline, whom you can ask if you have questions and so on.

And then you have here some recommendations: the software, if possible, should be bundled with Bioconda, and this is now quite straightforward, because most of the tools used in bioinformatics have... well, I am now mixing BioContainers with Bioconda, but what I wanted to say is that in BioContainers you can find a Docker container, a Singularity container and a Conda environment for a given tool, and that is what the modules currently use. They should also ideally support cloud environments and be benchmarked for running on cloud environments, and use good file formats, meaning that where possible, the results should be in standard file formats like CRAM or BAM, not some strange custom formats. Okay, well, everything is in this paper, so if you want to read further, you can take a look at it; the guidelines keep evolving.

And maybe you are now wondering why we have these strict guidelines. One of the reasons is to follow the FAIR principles, which were described for data but are also true for pipelines: the pipelines should be findable, accessible, interoperable and reusable. And yes, also because nf-core is built on top of Nextflow, and Nextflow is a very powerful, flexible tool that, as Cedric said, enables you to do any cool stuff; but nf-core is a bit like: you can do whatever you want, yet if you want to adhere to nf-core, you should fulfil a set of guidelines and do things in a given way. And the reason for having these strict guidelines is, as I say, to follow the FAIR principles, and also to adhere to current best practices in terms of computational reproducibility and interoperability. So all nf-core pipelines allow you to perform reproducible analyses, and in an interoperable manner. Also, this guarantees portability between different computational infrastructures, so you can run the pipeline locally, for instance with the test dataset, which is small, and then you can go to the cloud and run it, or to your server and run it there as well. And having this common structure enables a set of common features between pipelines. This means that all the pipelines can be run in a similar way, and also that you know where you should put the documentation, or where you should find the documentation.
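As a hedged sketch of the mechanism behind this portability (values invented, and heavily simplified compared to a real nf-core configuration), the profiles used in the commands shown next are just named blocks in nextflow.config that can be combined:

```nextflow
// Simplified nextflow.config sketch behind commands such as
//   nextflow run nf-core/<pipeline> -profile test,docker
// Profiles are named configuration blocks that can be combined.

profiles {
    test {
        params.input    = 'tiny_testdata.csv'   // hypothetical small dataset
        params.max_cpus = 2
    }
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled = true
    }
    conda {
        conda.enabled = true
    }
}
```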
I will show an example here of what I mean when I say that pipelines work in a comparable manner. Any of these commands will run an nf-core pipeline, any of the nf-core pipelines. Here, what I am doing is just launching the pipeline, and, as will surely be discussed during the course, I am using a test profile and a Singularity profile: in this case, I will run it with the test dataset and using the Singularity containers. In these other cases, I can do the same with Docker and with Conda, and you know that all the pipelines accept the same kind of command. Well, this is simple, but that is the point. Another thing this common structure enables is that all nf-core pipelines have a JSON schema, and this JSON schema describes the inputs, outputs and parameters of a pipeline. So, using this JSON schema, if you go to the nf-core website, you can launch any of the pipelines using the web interface. As you can see here, I just put a simple example of part of the form: this parameter sets the results folder, this one sets which reference genome you want to use, this one the email address where you want to get the results, or a notification if the pipeline fails. All the pipelines allow this, and you can just go to the website and launch them, as you can see here, when you press launch. Then you can use nf-core tools to actually launch it, or you can go to Nextflow Tower, which I will introduce later, or simply take the command generated by the form and run it locally, or on your cluster, or wherever.

Okay, so also, as I said before, nf-core provides a set, a package, of helper tools. This is a Python package, not a Ruby package; it is something that always surprises people. You can use this package both as a user of the pipelines and as a developer. And here, in this slide, you can see how you can obtain it from different resources: from the Python Package Index, from Bioconda, or how to obtain the Docker container with the tools. Yes, as I just said, you can use it as a user, so here I listed the commands for users. list lists the available nf-core pipelines. launch is to launch a pipeline through the terminal. download is in case you want to download a pipeline and all the containers it uses; for instance, imagine that you are in an environment with no internet access: you can download everything, then put it there and run the pipeline. And licences is just to see the licenses of a given pipeline. And these are more for developers: you have create, which is for creating a pipeline, and lint; as I said, linting was a requirement, so you can lint your pipeline and see whether it adheres to the nf-core linting guidelines. Then modules, which is maybe not only for developers, or not only if you want to develop an nf-core pipeline: I think it could be quite interesting if you are developing DSL2 pipelines in general, because there are already, as I will show, a lot of tools implemented as nf-core DSL2 modules. And then, yes, these are other options for developers: for the schema I mentioned before, to bump a new version, or to synchronize with the template, which maybe I can discuss when I discuss nf-core create. So here you have nf-core list, one of these commands.
And as you can see here, what it does is list all available nf-core pipelines, and it also tells you whether a pipeline is on your system or not, when you downloaded it last, and whether you have the latest release, which is sometimes good to check, because they are normally implementing new stuff, which is interesting. Then you have nf-core launch. In this case, I am showing something I ran for the presentation; normally I do not use nf-core tools to launch pipelines, but maybe for some people it is interesting. You can see it is similar to the form I showed you on the website, but in this case it is in your terminal, and you can fill in all the parameters and all the options and run the pipeline. So it is maybe nice at the beginning, when you are starting, so that it guides you; then you get the command, and then you can start playing on your own. And then this is more for developers, because maybe you are not interested in contributing to nf-core, but you are going to implement a quite big pipeline and you think it would be interesting to use the same standards: this command allows you to create a pipeline. Here is the whole tree; in this case, I created an "nf-core toy" pipeline for the presentation, and because I was afraid you could not see it all, I put here only the parent directories and files. Of course, if you are not contributing to nf-core and you are not interested in some parts, you can just delete them and take whatever is useful for you.

Okay, and as Cedric mentioned before, Nextflow has turned modular: DSL2 was released in July 2020, so not very long ago, and what DSL2 does is enable the definition of reusable modules and subworkflows. Before DSL2 appeared, what you always had was one very big script where you had to put everything sequentially, one process after the other. Now you can have small subworkflows that perform a given task with the modules, and you can reuse a given process that is repetitive between pipelines, like quality control of FastQ files: you have your small subworkflow that you implemented once, and you can use it across different pipelines. That is what it says here. And yes, it brings some changes; I do not know if any of you has already used Nextflow DSL1, but for instance, channels can now be reused without the need to create multiple copies of the same channel, and the way you declare input and output channels is a little bit different, but it does not change the core Nextflow concepts. And yes, just be aware that DSL1 is, in principle, going to be deprecated in the future.

And I wanted to introduce here some DSL2 concepts that I have already mentioned during the presentation. Maybe not everybody will agree on how these concepts are defined, but I think it is nice to have this kind of separation. So, a module is a process that can be used within different pipelines and is as atomic as possible, meaning that it cannot be split into smaller modules; an example would be a module file containing the process definition for a single tool, such as FastQC.
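A hedged sketch of what such a module file might look like; this is simplified relative to the real nf-core FastQC module, which adds meta maps and version tracking, and the container tag here is illustrative:

```nextflow
// modules/fastqc/main.nf -- one tool, one process, as atomic as possible

process FASTQC {
    container 'quay.io/biocontainers/fastqc:0.11.9--0'   // illustrative tag

    input:
    path reads

    output:
    path '*_fastqc.zip' , emit: zip
    path '*_fastqc.html', emit: html

    script:
    """
    fastqc -o . ${reads}
    """
}
```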
Then we have subworkflows, which are a chain of multiple modules that offer a higher-level functionality within the context of a pipeline. This is like what I was saying before: in this case, it could be a subworkflow to sort, index and run some statistics on a file, which you can then use in several of your pipelines. And a workflow is an end-to-end pipeline created by a combination of individual Nextflow DSL2 modules and subworkflows: a whole pipeline, from one or more inputs to a series of final outputs or results.

Okay, so yes, Nextflow has launched DSL2, and nf-core is also becoming DSL2. Here I just wanted to show you some of the nf-core pipelines that are already implemented as DSL2 pipelines. There are others that are being moved right now, for instance the ChIP-seq or the ATAC-seq pipelines; all of these are being ported, and the idea is that soon all the pipelines are implemented in DSL2. What this also means is that there are a lot of bioinformatics tools already implemented as nf-core modules, because there was a need to have these modules implemented for the pipelines: there are already 329 modules available. And I think this could also be interesting as a template, for instance if you want to implement your own modules, or even if you want to reuse the nf-core ones; nf-core tools also allows you to list them, to install them, and also to create your own. There is a template, nf-core modules create, and you can create your own. And the idea is that this repository will, in the future, also host subworkflows; for the moment there are only four prototypes, but this is work in progress, and I think that soon there will be more.

And yes, as Cedric was saying, we are involved in EuroFAANG, and specifically in BovReg, which is a consortium to annotate the functional genome of the cow, and there we are actually using nf-core pipelines for doing some of the analyses. Other consortia, the one dedicated to the genomes of the pig and the chicken and the one dedicated to fish, are using nf-core pipelines as well. This is interesting because, when we started using them, nf-core was mainly DSL1, but the pipelines are now becoming DSL2, and DSL2 is interesting because it is how you can now reuse code. So what we are planning to do, in case we need any additional implementation, anything that is not in nf-core, is to implement that part as DSL2 components; with DSL1 this was complicated. But now, for instance, with the RNA-seq pipeline, we wanted to use StringTie to annotate the genome, and we created a subworkflow for this; it was quite easy to just create the subworkflow and plug it into the nf-core/rnaseq pipeline and make it work. So I think DSL2 is very interesting, and you will see that I did not put in any lines of code and so on, because Luca will show you how to do this during the course, so there is no need for me to do that here.
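Still, as a hedged sketch of the sort/index/stats subworkflow idea from a moment ago, with hypothetical module names and paths loosely modelled on nf-core conventions:

```nextflow
// subworkflows/bam_sort_stats/main.nf -- a reusable chain of modules
// (hypothetical module names and paths)

include { SAMTOOLS_SORT  } from '../../modules/samtools/sort'
include { SAMTOOLS_INDEX } from '../../modules/samtools/index'
include { SAMTOOLS_STATS } from '../../modules/samtools/stats'

workflow BAM_SORT_STATS {
    take:
    bam                      // channel of BAM files

    main:
    SAMTOOLS_SORT(bam)
    SAMTOOLS_INDEX(SAMTOOLS_SORT.out)
    SAMTOOLS_STATS(SAMTOOLS_SORT.out)

    emit:
    sorted_bam = SAMTOOLS_SORT.out
    stats      = SAMTOOLS_STATS.out
}
```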
And another thing that I wanted to discuss, to end with, is Nextflow Tower. Nextflow Tower has been created by Seqera Labs; Seqera Labs is the company that Paolo and Evan have created, and it is behind Nextflow right now. Nextflow Tower is a web user interface that allows you to interact with Nextflow, and it also has an API that allows you to, let us say, talk to pipelines programmatically. Another interesting thing you can do in Nextflow Tower, in this web user interface, is configure cloud environments, and it is much easier: for instance, I have been playing with Amazon, and it is much easier to do it using Tower than using the native Amazon tools. It also enables you to run pipelines in the cloud or on HPC and then monitor them. So this is just an example; this will also be shown by Luca, so there is no need to show you many details, just so you get a glance of how it works. Here you have all the runs that you have done using Tower: the one that is running right now, which is this one, and the ones that have run in the past. You can get some real-time statistics: here you see that there have been five submitted processes and eight have succeeded; here you have some of the processes that have succeeded, and there is more and more stuff. This is just a small screenshot, but there is much more information. And yes, I think I will end with this. Here I just put some links in case you want to take a look at any of the things I have discussed during the presentation. And yes, if you have any questions, just go ahead.