Hello, everyone. Welcome to this Community Foundational Nextflow Training, the September 2023 edition. As you're probably aware, we have had similar training sessions throughout the year, in March of this year and in previous years, so we are always trying to bring you the most up-to-date information on how to learn Nextflow and the nf-core tools. I'm going to share my screen here so we can discuss a bit how the training session is going to work. This is the nf-core website, nf-co.re, and it has all the information about this Community Foundational Nextflow Training. In this first session today we are going to focus on things related to how to use Nextflow. Before anything else, I will give a talk of about 30 minutes on Nextflow, its background and what the tool is, some information for people who don't know anything about Nextflow, and then I will start going through the training material, which you can access at training.nextflow.io; we're going to use the Basic Nextflow Training workshop. It has many sections, and we're not going to cover all of them, but on this first day we're going to check the introduction to Nextflow, how to set up your training environment, getting started, configuration, managing dependencies, and deployment scenarios. All of these are related to how to use Nextflow pipelines: you found a Nextflow pipeline on the internet, you want to use it, you want to change something, you want to understand how it works, you want to change the configuration. These sections focus on giving you the skills and the knowledge to be able to use those pipelines. The second session, tomorrow, gets deeper into the theory: what Nextflow channels are, what Nextflow processes are, channel operators, channel factories, and so on. We'll also get into more detail about caching and resume, which will be briefly mentioned today. And then we're going to write from scratch a proof-of-concept RNA-seq pipeline in Nextflow with everything that we learned. On the third day, Chris Hakkaart is going to join us and we will talk about nf-core and the nf-core tools, both for users and for developers. That is the third and last day of this foundational training; by the time it's over, you will already have some knowledge about Nextflow, nf-core, and pipelines, both using and writing them. And then we're going to have another training: on September 20th we're going to have the hands-on Nextflow training. As the description says, it's a fast way to get up and running with Nextflow. If you already know Nextflow but it's been a while since you wrote Nextflow pipelines and you want to practice and refresh a few things, or you just finished the foundational training and you want a challenge, that is the perfect training for you. And if you're really eager to learn more advanced features, then at the end of the month we have the Community Advanced Training. That one is a very, very interesting training; it's the first time we are offering it, and I think it's very nice. If you don't think you are ready for it, that's fine, but I think it's worth a look.
There are many interesting things that you will see in that training at the end of the month. Now that I've briefly talked about the trainings, before we even start using the training material I would like to present a talk that goes through many concepts I think are very interesting and that will help you with the other sections. If you don't know anything about Nextflow and you think that maybe you don't want to do the training because you really have no idea why Nextflow exists, this talk is a nice way to start.

So, here we are. A very important keyword behind Nextflow and many other related technologies is open science. You can call it a movement: a lot of people today are trying to do open science, or to make science more open, and I think Nextflow is very important for achieving this goal. There are several other keywords that come to mind when you think about open science. One is open source: if you really want open science, it's important to know what the software you're using is doing. In order to really be aware of your methodology and to know exactly what is being done, the parameters, the options, the versions, what a function really does when you call it, it's very important to have access to the source code. At some point in my career, for example, I was using software whose documentation said it did one thing, and we all believed it; but when I went to check the source code, it was not quite what was described. The methodology was different, and this impacted the interpretation of the results in my study. So having access to the source code of the technology you're using is very important, and Nextflow is open source, just to make that clear.

But it doesn't stop there. Open data is also very important. If you are using this amazing methodology that you have mastered, but you don't know exactly what data you're handling, that's very dangerous: you don't know how it was measured, what each variable means, what the units are, and all these things. So it's important to have open data too. And if you have all of this, open source software and open data, everything of very nice quality and well documented, and you use Nextflow to orchestrate it, you also need an open community: a community of researchers, analysts, and data scientists where people interact and learn from each other. Maybe I like the pipeline you wrote and I want to apply it to my data, and it would be nice to give you feedback on what happened when I applied it to different data, to a different species, to a different methodology. This open community is very interesting. The Nextflow community and the nf-core community in general are very, very open and very, very nice. If you join our Slack workspaces and see the discussions that happen there, you'll see that we are learning from each other all the time, sharing pipelines, running pipelines, publishing papers mentioning these pipelines. It's very, very nice.
So I just wanted to highlight the importance of open science: having open source software, an open community, and open data. Now let's start talking about workflows. What are workflows? The idea of a workflow, or pipeline, is basically that you have some data and you do a series of things with it: you apply one step to the data, then another step to the result, then another step to that result, and so on. You have many tasks that you want to perform on some data. Genomics is not special in having workflows; you have workflows in many, many different fields, and you can use Nextflow and other pipeline or workflow orchestrators in all of them. The thing is that genomics has some particular characteristics that make it very challenging to write workflows.

One example: in some fields of data science and machine learning in general, you have tabular data, for example a TSV, a tab-separated values file, with a lot of samples and some columns, and that's what you have, one file. Compared to genomics, that file is usually very small: millions and millions of rows might be a few hundred megabytes, maybe a few gigabytes. If you go to image processing, machine learning with images, you're going to have lots of images, maybe thousands or millions of them, but each image is small, a few megabytes. In genomics, on the other hand, it's not rare to have a lot of samples that are individually very, very large: thousands of samples where each sample is a few gigabytes, and maybe different databases you have to check as well. The input to the workflows is very heterogeneous: data in different formats, binary formats, databases, reference files. Considering all these things, among many others I haven't mentioned, it's very challenging to orchestrate genomics workflows, and Nextflow was born to tackle this.

Another characteristic of these genomics workflows is that you usually have a very heterogeneous set of software. In the same pipeline you could have a Python script doing something to your sample, then an R script doing some analysis with an R package you found, then a MATLAB script doing something else, maybe some compiled programs, maybe some custom scripts. It's not like you use Python for everything, as is relatively common in other fields; you actually have a very diverse and heterogeneous set of tools that you want to use. And if a pipeline orchestrator only supports compiled programs, or only Python, or only R, you're in trouble, because that's usually not what you have. Nextflow supports all of this. There are also a lot of decisions you may have to take on the fly: maybe at some point, depending on the state of the data, you want to use a different piece of software for a task, or you want to skip a step.
And if you have to stay by the computer for weeks or months, waiting for that moment to say yes or no, going to bed and coming back just to say yes or no, that would be awful. Nextflow has multiple features that let you not worry about that: when certain circumstances appear, Nextflow will do what you told it to; it has the autonomy to take decisions based on what you instructed it to do.

Here we have one example of a pipeline, a bioinformatics pipeline; you can check the publication, I have the reference at the bottom. We call this drawing a DAG, a directed acyclic graph. We have nodes, which are the circles, and oriented arrows showing the flow of the information. If we zoom in: in this pipeline specifically we have 70 steps and over 50 custom scripts, among many software tools and libraries. Zooming in, you see something like this: this step, for example, has two inputs, and it produces an output that goes to two other steps, this one and this one here; and this one produces three outputs, which go to three different steps. As you can see, this can get very complicated.

At some point, people realized that even with software to help orchestrate this, even a rudimentary one, when they tried to make sure the same steps were repeated, they actually got different results. In this case, they ran the same pipeline on Amazon Linux, Ubuntu Linux, and macOS, and as you can see the results of the pipeline were different. Just by changing the operating system you get different results. This is the type of situation Nextflow was trying to tackle: scientific reproducibility. I want not only to be able to repeat the same steps everywhere, but to do so in a way that I get the same results. Because if I have some samples and I run them on my machine and get one answer, a different answer on your machine, and yet another answer on the reviewer's machine, that doesn't work. We need not only repeatability, but reproducibility, and this is the scenario where tools like Nextflow arise.

But even repeatability, even just being able to repeat the same steps, is very hard, much harder than most people believe. In this paper by Dr. Mangul, for example, you can see that among the research papers they analyzed, 28% of all the omics software resources were not accessible: the authors mentioned they used something and you could not even find it, because the resources were not available anymore. Among the ones that were available, some were very difficult to install, or could not be installed at all. So we could not even repeat what the authors did; even in industry, we often cannot repeat what a peer told us they did. Another paper tackling this said: first we tried to re-run the analysis with the code and data provided by the authors, but then we had to reimplement the whole method in a Python package.
So historically it has been shown to be very hard to reproduce the steps of an analysis, and even when you manage to do that, it's very difficult to reproduce the same results. And I love this figure here; it's very common for people who work with data science, and it's my case, I see myself in it. You read a new blog post on Medium or somewhere, or you see a new research paper doing this amazing thing with this amazing method, and one of the first things that comes to mind is: I want to take this method and apply it to my data. They did so many nice things with their data; I'm going to do amazing things with my data. But that's the tip of the iceberg. There are multiple things beneath it that you don't even imagine. Maybe that method is very good for that particular case, that specific dataset they have, with the specific parameters they chose, and on my data it's actually worse than the state-of-the-art method I was using before. So there are many, many things that are difficult to control, that we are usually not aware of, but that are very important for achieving the result we were told to exist. And it would be very nice if we could have a tool to manage all of these things: the parameters, the versions, isolated environments to make sure it's reproducible, installing things for us, managing updates, documentation, everything isolated in a way that you can give one command and everything is downloaded and run for you. That would be amazing, right? Well, that's what Nextflow does.

So Nextflow is clearly a piece of software, as you saw. It's a program, a pipeline orchestrator: you download the program and you use it to run Nextflow pipelines. But these pipelines are written in a language, the Nextflow language. It's a DSL, a domain-specific language, and we call it the Nextflow DSL. It's built on top of Groovy, so if you're familiar with Groovy, a few things are going to feel familiar, but it's not required to know Groovy to write the Nextflow DSL.

There are three main primitives in the Nextflow DSL. The first one is processes. Processes are represented here with these blocks; they are basically functions, and they represent each step in your pipeline. If your pipeline starts with some data, applies a function, then applies a function to the result, and then applies a function to that result, we have three processes, three steps. And as you saw, we keep applying things to the previous output, so we need a way to share these outputs and inputs between the processes. That's what we call channels, Nextflow channels. They are variables, but not regular variables; the same way processes are special functions, channels are special variables. They have a queue structure, first in, first out; you can think of it like a queue of people. And you always use Nextflow channels to communicate between your Nextflow processes. Whenever you have a set of processes communicating through Nextflow channels, you have a workflow.

So here we have one example of a Nextflow pipeline, a very simple one with a single step. We have a process that we call FASTQC, and it has an input that we call input. It's a placeholder variable.
Whatever gets inside this process, which we don't know because we are writing the script ahead of time, we'll call input. You use an input qualifier to tell Nextflow what this input is; here we are saying it's a path. It could be a path to a folder or to a file, it doesn't matter, it's a path. We also say that this process has an output, which is also a path, and the file name will be something followed by _fastqc, ending with either .zip or .html. Then we get one of the most important blocks, the script block, which tells what the step is going to do. Here we are running a program called fastqc; it has a -q option and we give it the input. But what's the input file name? I don't know; it's whatever gets inside this process, and that's why we use this variable, input. You refer to a variable with the dollar sign and its name.

But you know, these processes are just like recipes for food: you can have a book of recipes, but if you don't cook them, nothing happens. So we need the workflow block at the bottom to say what is going to happen, which processes are called, and when. We use a special function called a channel factory to fabricate channels; this particular one, Channel.fromPath, creates channels from paths, and I say: every file ending with .fastq.gz in my current folder. After I create this channel, I forward it with the pipe; people familiar with the command line will recognize the pipe here. I forward this channel to the FASTQC process, which is defined above. If my workflow block were empty, even though the process is defined, nothing would be called. You have to explicitly say: I want to use the FASTQC process with this input. There are many different ways to provide channels to a process; we're going to see them in time, but for now this example, sketched in full right after this paragraph, gives you a grasp of what an actual pipeline looks like.

One very nice thing about Nextflow is that there are many, many features that make your life easier. One of them is implicit parallelism. The thing is, if you have a program that does something to a sample, and you have three samples and a very powerful machine, it's not very smart to run the program on one sample, wait for it to end, then run it on the second sample, wait until it ends, then on the third. If you have a very powerful machine, you can run them in parallel, at the same time. There are ways to write that with programming languages, but you shouldn't really be worrying about it: you are a data scientist, a scientist, a researcher; your expertise is the analysis you are doing, the science. You shouldn't have to be fighting with the computer and installing programs and all these things. Nextflow should do that for you, and it does. When you write your pipeline with the Nextflow DSL, Nextflow automatically identifies the situations where it can parallelize your analysis, and this makes your pipelines run much faster.
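Putting together the pieces just described, the single-step example looks roughly like this. This is a sketch reconstructed from the description above, not a copy of the slide, so treat the details as illustrative:

```groovy
// single-step example: run FastQC on every fastq.gz file in the current folder
process FASTQC {
    input:
    path input                       // placeholder for whatever file arrives in the channel

    output:
    path "*_fastqc.{zip,html}"       // FastQC writes a .zip and an .html report per input

    script:
    """
    fastqc -q $input
    """
}

workflow {
    // channel factory: one element per matching file, piped into the process
    Channel.fromPath("*.fastq.gz") | FASTQC
}
```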
On multiple occasions I have met people who said they had their pipelines written in a different programming language, for a different workflow orchestrator, and they converted them to Nextflow without changing anything, without even optimizing their code for Nextflow, and just by converting, it was much faster than before. It feels like magic, but the magic is that Nextflow is helping you in a lot of different ways, and implicit parallelism, which you don't have to make explicit, is one of the reasons it's so much faster. So you see here: the files you have are in a channel, the queue; after the process, the tasks go to a queue again, and in the end you have your files. Of course, in real life you have a much more complex scenario, maybe thousands of tasks running at the same time. And just to make the concepts clear: you have files, you have channels, and you have a process, which is the description of a task; every time you instantiate a process, you get a task. So I can have one process and, if ten samples go through it, I will have ten different tasks. This parallelization is represented here by multiple arrowheads: the same two processes, with multiple tasks occurring at the same time.

Nextflow has other interesting features. One of them is what we call re-entrancy, or resuming. You see the arrowheads here have a different color. What I'm trying to show is this: maybe you run your pipeline, a very complex pipeline that takes a month to run on a supercomputer, and near the end there was an outage, or some issue with the operating system, or an error that you wrote into your pipeline; something bad happened and the pipeline was interrupted. So you go there, you check the source code or the supercomputer, and you fix the issue. In principle, if you run your pipeline again, it starts from scratch, which means another month of waiting. Nextflow has a re-entrancy feature that you can activate with -resume on the pipeline command line, and by doing that it starts exactly from where it stopped, which is the situation shown here.

Not only that: you can name your workflows and create sub-workflows and modules, to the point that most of the time you don't even have to write your whole pipeline. You just say: I want this piece from this pipeline, this module here, this sub-workflow there. There are very high quality components already written by the community that you can easily plug into your pipeline. On the third day of this training you will see, with Chris Hakkaart, the nf-core modules, nf-core sub-workflows, and the nf-core tools that help you handle these things. It will be an amazing session; I'm sure you will love it.

So Nextflow is a software, and Nextflow is a language to describe your pipeline. But we still need more things. One of them is software, and also a compute environment to run in. What I mean by software here is that every step of our pipeline uses some computer software to do something to our data; we need that software not only described but actually installed, and we need a place to run this pipeline. The compute environment could be my laptop, my local computer; it could be a computer in the cloud; it could be a cluster, a supercomputer.
It could be many different things, and Nextflow has a lot of built-in technology to help you with all of them. For sharing your pipeline code, for example, it supports many of the most famous Git providers, like GitHub, Bitbucket, and GitLab, and also AWS CodeCommit and Azure Repos, the cloud equivalents from Amazon and Microsoft. What this means is that you can write your pipeline, host it on GitHub, and just by sharing the GitHub URL with someone, they can type nextflow run followed by your GitHub URL, and Nextflow will clone or pull the repository, organize the files in the right places on their computer, and run the pipeline for them. So there's a very nice integration with all these technologies.

When it comes to software, Nextflow natively supports most of the container technologies we have today: Docker, Singularity, Podman, Charliecloud, Shifter, and many others. And not only that, it also supports some very famous package managers, like Conda and Spack, which means you don't even have to know what Docker or Conda is, what the parameters and options are, or how to install anything. You just say: Nextflow, I want to use Docker, and Nextflow does all the work for you; or: Nextflow, I want to use Conda, and it does all the work for you. This means you don't even have to manually install the software required for the steps of your pipeline on your machine, or wherever you're running; you just use a Docker container or a Conda environment and everything is handled for you, by Nextflow, for you.

For the compute environment it's the same thing. It supports the main cloud providers, like AWS, Azure, and Google Cloud Platform, and also Kubernetes and OpenStack. If you're running on a cluster or a supercomputer, it supports the main job schedulers, like Slurm, LSF, and PBS/Torque, all these big ones, which means it's very easy to write your pipeline once and have it automatically work in multiple different compute environments.

Here we have one example; let me move my video to the top so it doesn't cover the script. We have a simple Nextflow pipeline like the one we saw before. The difference is that we now have a line at the top saying conda, with a string. What this says is: if I tell Nextflow to use Conda for this step, I want it to install the FastQC software from the Bioconda channel, version 0.12.1; that's the version of the package in that channel that I want installed. Then you have a configuration file, something like mylocal.config, where you're just going to say conda.enabled = true. So every time you run this pipeline, it automatically activates Conda and installs this package for you, not globally on your machine, but just for this task specifically. And when you run with nextflow run main.nf, which is the name of the file, you provide -c with the location of your configuration file to be loaded. But maybe you don't want to use Conda, you want to use containers. Then you just add a container directive to this step, with a namespace and a container image name, here biocontainers/fastqc; this is a ready-made container image for FastQC that the BioContainers project provides to everyone for free. And in the configuration file you say docker.enabled = true; a sketch of this setup follows right after this paragraph. You can also define parameters that you can provide to your pipeline.
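Here is a minimal sketch of the Conda and container setup just described. The BioContainers image tag is my assumption for illustration (check the registry for the tag you actually want), and the config file name mylocal.config follows the example above:

```groovy
// main.nf (sketch): directives at the top of the process pick the software source
process FASTQC {
    conda     "bioconda::fastqc=0.12.1"             // used when conda is enabled
    container "biocontainers/fastqc:v0.11.9_cv8"    // used when docker/singularity is enabled; tag assumed

    input:
    path input

    output:
    path "*_fastqc.{zip,html}"

    script:
    """
    fastqc -q $input
    """
}
```

```groovy
// mylocal.config: pick one engine
conda.enabled = true
// or: docker.enabled = true
```

```bash
nextflow run main.nf -c mylocal.config
```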
So here, for example, I'm using --input and passing some input file, and in the source code of my script I would just use params.input; this params.input is always replaced by whatever I pass with --input. But what if I want a default value, so that I don't always have to type this option? Well, you can go to your configuration file and set params.input to a default value there: whenever you provide --input it is overridden, but if you don't provide anything, the default is used. Here we are also using Slurm, which is a job scheduler, a piece of software installed on supercomputers that are shared by multiple users at the same time. And if you know Slurm, you know that you have to write a submission script with a lot of information for every job you want to run. By setting the executor to slurm here, what happens is that Nextflow writes that file for you, for every task. This is extremely useful; it's a great feature of Nextflow. And here I'm using this executor together with Singularity for containers, with the same container instruction as before.

Of course, these configuration files can get extremely complex. Here, let me move my video down. Here we are using Slurm with Singularity, we have many default parameters, and we have more complex expressions, let's say: I want to use a specific queue on my supercomputer; if the amount of time I requested for the task is below three hours, use the short queue, that is, I want the value of this setting to be 'short'; otherwise, 'long'. You can do many different things like that. There's a section in the training where we do some dynamic requests of resources, and you will see how powerful this configuration is.

So as you can see, Nextflow is reproducible across systems: with all the package management automated, the use of containers, and the support for so many different compute environments, you write the pipeline on your laptop, for example, and it runs essentially everywhere, depending on how you configure it. And it's very scalable: it works for 10 samples, a thousand samples, a million samples; the implicit parallelism is a key feature here.

So that's Nextflow: you have your code and your custom scripts, you have the software being managed, you have the environment being managed. But sometimes you also want to manage compute infrastructure, you want to manage configuration, you want to share your pipeline and the work you are doing with other people. In the end, in some environments you need more features, and with that in mind we created Nextflow Tower, a technology that helps you with these extra tasks that are required when you have a large team, or very complex pipelines, or regulations. Nextflow Tower is very, very useful for that, because it allows organizations and large teams to work together: you have reproducibility not only for you but for everyone; people can reproduce what you did and you can reproduce what other people did.
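To go back to the configuration example described above, here is a rough sketch of what such a file can look like. The queue names, the three-hour threshold, and the default input path are just the example values mentioned above, not something your cluster necessarily has:

```groovy
// example configuration (sketch)
params.input = 'data/*.fastq.gz'          // default, overridden by --input on the command line

singularity.enabled = true                 // run every task inside a Singularity container

process {
    executor = 'slurm'                     // Nextflow writes the Slurm submission script per task
    // dynamic expression: short queue for tasks requesting less than 3 hours, long queue otherwise
    queue = { task.time < 3.h ? 'short' : 'long' }
}
```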
So Tower is a very nice technology. It's a web system; you can access it at tower.nf, and once you log in, with GitHub or something like that, you get a launchpad with some pipelines. In today's session we're going to have a look at Tower, but what you should know now is that there are three different versions of Nextflow Tower. You have the community version, which is open source; it's a bit outdated, but it's available. You have Tower Cloud, which you access at tower.nf; there are free and paid tiers, the free tier is enough for most people, and for the first paid tier, the professional one, there's a waiver for academics, so you don't have to pay for the professional tier. And then we have the enterprise version, which is commercial, on-premises, and paid.

Here's one example of managing your compute environments: you could have multiple clusters here and cloud computing resources, and when you add your pipeline you just say, this pipeline is going to run on this machine, and everything is managed by Tower. People in your organization can access the compute environments, and you can manage datasets, secrets, compute environments, pipelines, teams, and organizations. Nextflow Tower is an amazing tool if you think about that level of management. Here's one example of the launchpad with a few pipelines; this is my private launchpad, we're going to see a real one today. And it doesn't stop at pipelines: Tower aims to provide other features, like helping you develop pipelines, a Data Explorer for working with your data and making it easier to integrate data with your pipelines, and Data Studios for post-processing analysis and data analytics. So every day we have new features in Nextflow Tower, which is built by Seqera, the company behind Nextflow.

I would also like to highlight that not only this training but also the previous ones, like last March and last year, can all be watched on the nf-core YouTube channel; just go to youtube.com/nf-core and you'll see all these training sessions recorded, including sessions in other languages: we have trainings in Portuguese, Spanish, French, and Hindi, so you can find the content there. Not only that, but apart from this foundational training, as I mentioned at the beginning, we're going to have the hands-on training, which is for people who already know some Nextflow, including people who are finishing this foundational training; and later this month we're going to have the advanced training, which covers very nice features for writing Nextflow pipelines. It will be advanced for most people, but everyone is welcome to join, of course. And not only that: we're going to have the Nextflow Summit in two places this year. Last year we had it in Barcelona alone; this year it will be in Barcelona and Boston, and in both places we're going to have hackathons. I hope to see you there, in person or maybe online; feel free to join, there's no restriction in terms of level of Nextflow knowledge, everyone is welcome.

So with this we end the first part of this training session, the slides. Coming back to the nf-core website, if we go to the events page you'll see at the bottom some instructions on how to ask your questions. If you click the button here, it says
there's a channel called september-2023-foundational in the nf-core Slack. By clicking this link you can go to the Slack workspace, and there you will find this channel. Not only during the streaming of this training but also afterwards, you are very welcome to go there and ask your questions, and we will be more than happy to answer questions related to Nextflow, nf-core, and related topics.

So let's go now to the training material. You will find it at training.nextflow.io, and you can access it in different languages; it's partially translated in some languages and fully translated in others, but we will do the training in English today. So let's click on the Basic Nextflow Training workshop and get started.

By the end of this training material we expect you to be proficient in writing Nextflow workflows. Of course we don't expect you to be a master of it, writing very complex Nextflow pipelines without thinking twice; that's not the point, but you will be able to start writing your own Nextflow pipelines for reasonably simple or complex questions, depending on how much prior knowledge you have of bioinformatics tools, Nextflow, the command line, Linux, and so on. You will get to know channels, processes, operators, factories, and so on; you will understand containerized workflows, which are a key concept for scientific reproducibility; you will understand the differences between executing Nextflow on the cloud, on clusters, and so on; and you will be introduced to the Nextflow community and ecosystem, mostly on the third day. You can also find recordings of previous sessions. But that's what we want to do today: start running through the training material.

So here we are. You can definitely install everything on your own machine, Docker and Java and Nextflow and Git and Bash and so on, but depending on your operating system things can go wrong, so what we have prepared for you is a Gitpod workspace: a virtual machine available on the internet that you can access using your browser. When you enter this workspace you will have a code editor, a terminal, and everything installed for you, so you don't have to worry about it. That's what we are going to do today. Just to make it clear, to use Nextflow itself you just need Nextflow and Java, which is usually available on any operating system, but in order to follow this training material with all the examples there are other things you'd have to install; it's all documented here in case you want to do it. Installing Nextflow is a single command, it's very easy; the command is shown below for reference. And as I said, this is the website for the training material, and here is the URL.
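For reference, the single install command mentioned here is the standard one from the Nextflow documentation (it requires Java to be available):

```bash
# download the nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash
```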
OK, so we come back to the beginning of this environment setup section, click 'Open in Gitpod', and it takes us to Gitpod. I'm already authenticated, so I see this window; you will probably have to sign up, using your GitHub account for example, and answer some questions, and in the end you will have access to a Gitpod workspace. We are going to link below this video a Bytesize talk where I explain the idea behind Gitpod, how you can set up your Gitpod workspace, how you can run Gitpod, and so on, so I won't go into too much detail here. The important thing to say is that you are going to have a browser version of VS Code, which is a code editor, you will have access to a terminal, and you can choose between a standard machine and a large machine. If you provide your LinkedIn account to synchronize with Gitpod, you will be given 50 hours on the standard machine for free; if you choose the large one, the credits burn quicker, which means you will have less than 50 hours to use. Same thing if you don't provide a way to synchronize your LinkedIn with Gitpod: I think you will be given 10 hours, and again they burn faster if you use the large machine. Most of the time I just use the standard one, and it's more than enough for what I do; that's what I'm going to do here today with you.

So I will continue. This is a new workspace, so it's going to take a while to be ready, because it has to pull the container image and do some configuration, but once it's done I can come back later and open the same workspace and it will be much faster, because the workspace is already there. If you are idle for, I think, 30 minutes, it will interrupt your workspace, your session, but as soon as you're back it will show you everything the way it was before you left: if you had an open file, if you changed a few files, when you're back they will be the way you left them. So it's a very nice environment: you work a bit, go out for lunch or go to sleep, and the next day you come back and it will be there, ready for you, the way you left it.

For people who are already experienced with container technologies, what's happening here is that it's building an image from a Dockerfile. In the meantime we can go to the nextflow-io/training repository, and you will see there's a .gitpod.yml file with the instructions for our Gitpod workspace: we say which container image we want to use, there are some configurations here, and we use GitHub Actions to automatically build the container image and push it to the container registry. Here's the Dockerfile for the container image; as you see, we install Nextflow among other tools. But you don't have to worry about this; in the end you just type gitpod.io/# followed by the GitHub URL you want, or click 'Open in Gitpod' the way I did on the training material webpage; in many different places you have this 'Open in Gitpod' button.

So we finally have our workspace open. It still takes a few seconds after it's ready, because it's loading a preview of the training material that we were looking at before, opening the terminal, and organizing the files. Now it's ready, so I'm going to click here to close this debug console, because I won't use it, and we'll just use the terminal here, and in the preview navigate back to where we were before, the Basic Training. Here are the objectives that we saw; let's go to the environment setup. In this section you have not only information about setting up a local environment but also some information about Gitpod. So as soon as we got here,
for example, it asked us to run nextflow info, so we have some information about the Nextflow installation in this virtual machine: the Nextflow version, 23.04.1, the build number, when it was created, the operating system we are on, the Groovy runtime, and so on. There's a very nice explanation of everything in Gitpod; I will go through it quickly. We have the sidebar here with the file explorer; we don't need the hands-on directory, because it's for the hands-on training, we just need this nf-training tree here, these are the files we will use. If you want to download something to your machine, you can right-click and choose download. If you want to hide the sidebar to get more space for your terminal and the preview, you can click the files icon here to toggle the file explorer. Here we have the training content that we saw before, and here we have the terminal that we are going to use. If you know VS Code, this will feel familiar; you don't have to have VS Code, or anything else, installed on your machine to use Gitpod, you just need a browser, like the one I'm using to see this website. The thing is, if you already know VS Code, it's easier for you. For example, if you want to create a new file example.nf, you can just type code example.nf and it will appear here, and then you can do whatever you want with it, the same things you would do with VS Code on your local machine.

I have this explanation here, which I gave you a few minutes ago, about the credits burning quicker depending on the machine type you choose. You can reopen a Gitpod session at any time just by entering from the same GitHub repository; you can also create new workspaces, but you can use the same one again and it will preserve all the modifications you made. This training material specifically is built for one Nextflow version, 23.04.1, which is the one we have right now. If it were different, if by the time you're watching this training the version has changed, I would advise you to copy this command, paste it in the terminal, and it will pin that version for you. Nothing is going to change here, because this is already the version we are using, but in the future this may be useful. Then, after doing that, run nextflow -version and you will see the version you pinned; here nothing changed, it was the same, but just be aware that you can use this environment variable to set the version you want Nextflow to run.
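The environment variable in question is NXF_VER; pinning the version the material was built for looks like this:

```bash
# pin the Nextflow version for this terminal session, then confirm it
export NXF_VER=23.04.1
nextflow -version
```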
Now let's really get to the introduction of this training material. I will shrink the terminal a bit, it's not that useful for now, and let's focus on the content. Getting started: the basic concepts. In the first part of this session today I showed you a few concepts: Nextflow having its own DSL, the features Nextflow provides that contribute to portability and scalability, the integration with existing technologies like package managers, and all those things. I also talked about Nextflow processes and Nextflow channels: every step in the pipeline is written in the Nextflow language as a process, and every step communicates through Nextflow channels, which are asynchronous first-in, first-out data structures. You can have a channel with three elements and a process that you want applied to each sample; what you actually get is a parallelization of these three tasks, happening at the same time, producing three outputs, which are queued again in the next channel, the output channel, and provided to the next step. We have the execution abstraction that I briefly mentioned, and we have some images here showing supported platforms, which allow you to write your pipeline once and change the compute environment without having to change the pipeline. You could imagine that your company has a deal with Amazon and two years later gets a better deal with Google Cloud; the pipeline will be mostly the same, you just have to change the configuration, which is very simple, but the pipeline itself won't change. So if you found a very famous pipeline written by someone else and it was set up for AWS, you don't have to worry, because you just have to make some very small adjustments to make it work in the compute environment of your choice, like a cluster or Azure, and so on. And as I said, the Nextflow DSL is built on top of Groovy.

But let's start to do some coding, let's see Nextflow in practice. As I said, you can click here to show the files and click on this hello.nf, or, I'm going to put my face to this side, you could just type code hello.nf. Here we are going to describe what this script does. In the training material, whenever you see these plus symbols, you can click them and you will see a description of the line or the part of the code the plus is closest to. At the top we have a shebang line; if you know scripting languages like Python, shell script, or R, many scripts have this. It basically means that if we try to run the script without saying which software we want to use to interpret it, the operating system will pick this one; here, if you don't say Nextflow, it will use Nextflow to run it. Usually this is not required, because we type nextflow run with options and everything, pointing to my Nextflow script, but you will see it sometimes, and I think it's important to explain what it is: it tells the operating system which binary, which program, to use to interpret this file if it's executable. Then we create a variable that we call params.greeting, and its content is the string 'Hello world!'. Variables starting with params. are special:
they can be set, modified, and overridden from command-line arguments; you will see this in a bit, for now just keep that in mind. Then I'm using a function here called Channel.of, which is a channel factory: I'm creating a channel that contains one element, this params.greeting variable. I have a process here, the first process in my pipeline, which I call splitLetters, and basically what it does is split letters. It has an input block; I'm going to call whatever gets inside this process x, and I'm saying that the qualifier is val, so it's a value, a string. The output of this process is a path that starts with chunk_ followed by something. In my script block, the commands are not really Nextflow; these are programs that already come with your operating system. Here I'm using printf, which is a command-line program, and I'm piping it to split, which is also a command-line program; you can just do man split to see the man page, or split --help. So nothing I'm using here is specific to Nextflow, it's just command-line programs that are already there, and I want to play with them to show you a simple Nextflow pipeline. What this line does is print the content of the x variable, which I know nothing about, it's just whatever enters the process, and pipe it to split, which cuts the string into pieces of six characters and stores the content in files named chunk_ followed by something; it will automatically fill in that something for you.

We can do an example here. I list what's in my current folder, then I use printf to print 'Marcel is teaching Nextflow today', and I pipe this to split, choosing a chunk size of ten and the prefix pieces_. After doing that, if I list the files in my current directory, you will see a bunch of new ones: pieces_aa, pieces_ab, pieces_ac, pieces_ad. If I cat pieces_aa, for example, I will find the first ten characters of that string, 'Marcel is '; cat pieces_ab gives the next ten characters, and the last one holds whatever characters were left. That's what printf and split do; it's nothing specific to Nextflow, just something I'm using here to teach you Nextflow pipelines.

The second process is called convertToUpper, and that's what it does, it converts to uppercase. Again I'm using cat and tr, which are command-line programs that come with Linux, nothing specific to Nextflow. I can do it here: I echo 'marcel' and use tr to convert every letter from lowercase to uppercase, and I get 'MARCEL'; I could do the opposite and translate all the uppercase to lowercase, and here only the M is uppercase, but everything would be lowercased. So again, no Nextflow, just some programs that I chose; it could be any program you have and it would be fine. Here it's shell script, but it could be Python or MATLAB, anything. But as I said earlier today, these are just descriptions of processes; nothing is being done yet. You need the workflow block to say: I want to provide this channel, the one containing the greeting 'Hello world!' or whatever else I provided, to the splitLetters process, and then I save the result in the letters_ch variable, so this variable now contains the output channel of this process.
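Pulling together what we have walked through so far, hello.nf looks roughly like this. This is reconstructed from the description, so the exact spelling of the process names may differ in your copy of the material, and the flatten() call in the workflow block is the operator explained right after this:

```groovy
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())   // flatten is explained next
    results_ch.view { it }
}
```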
After that I want to use the convertToUpper process and provide it this channel, but applying a function to it first: flatten, which is a channel operator. Every function that applies to a channel is a channel operator. But what does flatten do? Let's comment out the bottom of our workflow and just do letters_ch.view(). view is a channel operator that consumes all the elements of a channel and shows them to you on the screen. We see here 'Hello world!' cut into pieces of six characters, and we have one element in the channel, and that element contains two files, chunk_aa and chunk_ab, which are 'Hello ' and 'world!'. If I use flatten, the flatten channel operator, then instead of one channel with one element that is itself a multi-item element with two items, we have one channel with two elements, each a single item: before, one element containing two things; now, two elements containing one thing each. That's what flatten does. The collect channel operator does the opposite of flatten: applying flatten and then collect, we're back to what we saw first, one channel element containing two items. So when we provide letters_ch with the flatten operator applied to convertToUpper, the process is called multiple times, once for every piece, instead of receiving a list with all these files. And then I call view to see the content in the end.

So now I'm going to run this pipeline without anything commented out, everything being run. It's 'Hello world!', so I get HELLO WORLD! in uppercase. I can run it again, and maybe you're going to see something different; let's see... here it's still HELLO WORLD!, OK, but you may see something different, and I'll come back to the explanation for that in a moment.

One interesting thing when you run a Nextflow pipeline: first you see the Nextflow version that was used for this run specifically, you have the name of the script file that was launched with Nextflow, and you get an automatic mnemonic run name, which is a random adjective plus the surname of a famous scientist. You have the version of the language, which is DSL2 here, and you have a revision hash, which is like an ID of your pipeline: if you change the pipeline code, the revision changes. You have the executor, which here is local but could be a cluster, and Nextflow tries to guess the number of tasks that will occur in your pipeline; sometimes things are decided on the fly, so it's not so straightforward to guess. Then you have the first step here, splitLetters, which happens once: we get a string once and convert it into several files once. And we have convertToUpper done twice, because we used flatten to flatten our channel, which had one element with two items and now has two elements. OK, and we have HELLO WORLD!.

You also probably noticed that we have this hash here. Every task actually has a hash like this, which identifies a task directory: a place on the computer where everything related to that task is stored, and we can go there. The default work directory is called work. Using tree, which is a program to see the file structure from the command line, we can look at work/44/75... (pressing tab to autocomplete), and if we do that we see chunk_aa and chunk_ab, which are
the 'Hello world!' greeting split into two pieces. So this is the hash for the first task. For the second process we have two tasks but only one hash shown, because the default for Nextflow is one line per process; if we want to see the task directory of every task, we can set the option -ansi-log false, and then we get one line per task and have access to all the task directories, one for each convertToUpper task.

One interesting thing: I'll try to run it again... OK, now this time you see it printed WORLD! HELLO instead of HELLO WORLD! like last time. So why does that happen? If you remember the diagrams showing the implicit parallelization: if you have three samples in your channel and you call the process three times, once for each of them, and they are very similar, you expect the first one that was run to finish first. But sometimes, let's say, the third element of your channel is extremely light and the first one is very heavy, so it takes one hour for the first one to finish and a few seconds for the third; because of that the third finishes first, and then you get something like this, WORLD! before HELLO. Often it doesn't matter, because when you start writing your real pipeline you'll see that you usually carry some metadata along with every sample, so that regardless of the order you know which is which; and there are process directives you can use to keep the order. There are many things you can do, but for now just be aware that this happens; most of the time it doesn't matter, but it's important to know about it.

The next part is to start learning a bit about resume and the cache. As I said at the beginning, tomorrow we're going to get into more detail about the cache, but we will already start playing with it. The goal in this part is to change the pipeline script: in the convertToUpper process, we replace all of this with just rev. Again, nothing to do with Nextflow; rev is a program used to reverse a string, so if I pass it 'Marcel is Brazilian', it prints that string reversed. We're going to keep the name convertToUpper, but be aware that it's no longer converting to uppercase, it's reversing the string.

So we are going to run this again, but now with -resume, because I want to use the cache for the computation that doesn't need to be computed again. The first thing I want to show is that the revision ID changes: it changed because we changed the script code. So when you're trying to see whether something changed or broke, you can always check the revision ID to verify that you are running the same pipeline you ran at a specific point in time. The first step, splitLetters, was cached, because the greeting is still 'Hello world!', we didn't change that; it still splits into files the same way. But the second part is doing the reverse now, so there is no cache and it had to be computed. If we type the same command again, now everything is cached, because we already ran it with rev. And if we provide --greeting, which overrides the params.greeting I told you about, and instead of 'Hello world!' I say 'Hola mundo', then everything has to be recomputed, because splitLetters only has a cache entry for the old greeting, 'Hello world!'.
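The modification and the runs just described look like this as a sketch; only the script block of the convertToUpper process changes:

```groovy
// inside the convertToUpper process: replace the tr command with rev
script:
"""
cat $y | rev
"""
```

```bash
# run again reusing the cache; only the modified step is recomputed
nextflow run hello.nf -resume

# override params.greeting; nothing is cached for the new value, so everything runs from scratch
nextflow run hello.nf -resume --greeting 'Hola mundo'
```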
Since I'm changing the greeting now, there is no cache and everything is computed from scratch — that's what happens here, and in the end we get "Hola mundo" backwards, which is what rev does. One interesting way to see what we just played with is this figure here: we have a variable, "Hello world", that is added to a channel — the queue — and the channel element is provided to the splitLetters process, which expects a value; once everything is replaced, the task runs. There's one more interesting thing to show. Let's take this hash and do code work/e7...; inside there's a file called .command.sh. Every task directory of a task that actually ran has a .command.sh file, and it contains what was really run in the end. In our Nextflow script the command line is written with variables — we expect to know what x will be, but we can't be sure — so when something is going wrong you can open .command.sh to see exactly what was executed. And that's what it shows: "Hola mundo" being split, and so on. The output will be two elements, because the string has more than six characters. Then we take the output channel, use flatten to separate it into two elements, and provide that to convertToUpper. Its input is declared as a path, and indeed it receives paths to the files containing those characters; in the end it prints to standard output, which we can see with the view operator. With this we end the first section, the introduction. Again, if you have any question you can go to the channel on Slack and we'll be more than happy to help. Feel free to stop the video when you think I'm about to answer a question and you want to think about it first or try it yourself; take a break whenever you want to try something or listen again to an explanation. It's really under your control to go at the rhythm you want. Sometimes it's a bit slow for some people, sometimes a bit fast for others; we do our best to find a good balance, but it's impossible to please everyone, so feel free to increase or decrease the playback speed on YouTube, pause as many times as you want, and again we'll be more than happy to answer your questions in the Slack channel. The next section of the training is on configuration. As I said, you can find some amazing Nextflow pipelines on the internet, download them, and at some point you'll want to run them in a different environment or change a few things. This configuration section is meant to showcase the power of Nextflow configuration and show you how to change settings, among other things. The first place Nextflow looks, in terms of configuration, is a nextflow.config file, and it takes into account whatever is there. If we go to the official Nextflow documentation, docs.nextflow.io, and open the configuration chapter, you see a list of the different ways you can provide configuration settings to Nextflow, in priority order. The highest priority is parameters given with a double dash on the command line.
The second is the -params-file option to provide a JSON file with parameters; then you have -c, and so on. There are many different ways to provide configuration to your Nextflow pipelines, and this is the priority order, meaning that whenever you have a conflicting setting the conflict is resolved this way. Let's continue. The configuration syntax is basically a keyword, the equals sign, and a value. It's worth emphasizing, because some settings can be used both in the pipeline script and in the configuration file, and the difference is that in the pipeline script there is no equals sign. If we open, say, script7.nf, you see for example publishDir and a few other process directives, with only a space between the keyword and the value; you could also have cpus, container and many other settings there. But in configuration files you need the equals sign. Here's an example: I could open nextflow.config — there's already one here with some content — and we have a scope, a key and a value; I could just add these three lines. There are also variables you can use in the configuration file: if I define a property, say propertyOne, with the value "world" and reference it as "Hello $propertyOne", I get "Hello world". Note that with single quotes variables are not expanded, so it would literally stay as the dollar sign and the name, while with double quotes it is expanded to the content of the variable. The same goes for $PATH, which is an environment variable. If you want to add comments — and this works both in config files and in Nextflow scripts — single-line comments use //, and a multi-line comment is wrapped in /* and */. Scopes are also interesting: as we saw in the config file, the process scope can hold many different values, and the same goes for docker. You can either repeat the scope name with a dot for each key, or open the scope with curly braces and put all the values inside — two different syntaxes for assigning multiple values to different keys in the same scope. We also saw the params variables; we played with them before. Here we have an example with two parameters, foo and bar, set to "Bonjour" and "le monde" in the config snippet and to different values in the workflow script. As an exercise, let's save the first snippet as myconfig.config — copy, click here, paste, save — and the other one as params.nf, which is the name suggested by the training material; save that too. Then let's run it, adding -c with my config file. What do you think will happen? In the workflow script we have "Hello world", but because I'm providing the configuration file, which has higher precedence, it's replaced by "Bonjour le monde", which is what's in the config file. If I don't provide the configuration file, the only values are the ones in the workflow script, so we get "Hello world", and that's what we see. We could also provide the configuration file and additionally pass --foo Hola on the command line; then it won't be "Bonjour le monde", it will be "Hola le monde", because, as we saw, the double dash on the command line has the highest priority. That's what this exercise shows.
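To make the syntax and the precedence concrete, here is a rough sketch of the two files and the run commands — the names follow the exercise, but the exact values in the training material may differ slightly:

    // myconfig.config
    params.foo = 'Bonjour'
    params.bar = 'le monde'

    // params.nf
    params.foo = 'Hello'
    params.bar = 'world'

    workflow {
        println "$params.foo $params.bar"
    }

    // nextflow run params.nf                              -> Hello world
    // nextflow run params.nf -c myconfig.config           -> Bonjour le monde
    // nextflow run params.nf -c myconfig.config --foo Hola -> Hola le monde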
One other scope we have is env. Actually, if you go to the official Nextflow documentation and open the config scopes section, you'll see there are a lot of different scopes; if you look at docker, for example, there are all these settings you can give a value to. The env scope is very interesting. You probably know you can type env in the terminal to see all the environment variables, and I can grep for ALPHA — as you can see, there's nothing. But I can write a small foo.nf Nextflow script that simply runs env in a task and filters for ALPHA and BETA. We know there's no environment variable called ALPHA or BETA, so if we run this pipeline, nothing is returned. But then I can add an env scope to my nextflow.config — either the dotted syntax or the curly-braces one, they do the same thing — and now, when the process runs, the task will see these ALPHA and BETA environment variables. If I run env in the terminal again, I still see nothing, and that's the same idea as not having to install software on your machine: what matters is not your machine, it's the task environment specifically. The machine doesn't have these environment variables, but the task environment does. We can also look at the process scope, where we have what we call process directives: keys that describe how you want your processes to work — the containers, the tasks. Here I'm saying 10 CPUs, 8 gigabytes of memory, and the container image I want to run, and because I'm just writing process, it means that every process in my pipeline will request 10 CPUs, 8 gigabytes of RAM and use that container image. Depending on the executor, these settings behave differently. If you're not using containers, just your local machine, then cpus 10, memory 8 GB and so on won't really matter, unless the program you run in the task has a parameter to set the number of CPUs or threads, the memory, the time and so on. But if you're using Docker, cloud computing or clusters, these settings are taken into consideration, and for job schedulers on clusters you must provide them, so it's very important to know. Then again, you usually don't want to request the same resources for every step of your pipeline, and that's what process selectors are for. Let's open the official documentation on that, because it's very nice: you can say, for the process named hello I want 4 CPUs, 8 gigabytes of RAM and the short queue; or you could say I want 10 CPUs for every process except the one called foo, which gets something else. You can also use the label process directive, which attaches a label to a process, and then say: I don't care about the name of the process, but if it has the label big_mem, apply this directive, otherwise nothing. You can also use expressions in selectors, like: if the label is foo or bar, I want 2 CPUs and 4 gigabytes of RAM; or more elaborate ones, like every process except the one with the foo label gets 4 CPUs and the one with the foo label gets 2. You can do a lot with this. There is a selector priority as well — it's a long read.
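A minimal sketch of what such a nextflow.config could look like — the variable names, labels and numbers here are just illustrative:

    // env scope: these variables exist only inside the task environment
    env {
        ALPHA = 'some value'
        BETA  = 'another value'
    }

    // defaults for every process, plus overrides via process selectors
    process {
        cpus      = 10
        memory    = 8.GB
        container = 'nextflow/rnaseq-nf'

        withName: hello {
            cpus   = 4
            memory = 8.GB
            queue  = 'short'
        }
        withLabel: big_mem {
            memory = 64.GB
        }
    }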
There's a lot to learn there as you practice with it, but the idea is to show how powerful these process selectors are, along with the configuration in general, to make your pipeline very portable to the specific compute environment where you want it to run. Here you have the syntax for memory, time and so on. There's one example where, instead of a configuration file, the directives are written directly in the script file: for this process foo I want 4 CPUs, 2 gigabytes of RAM and one hour of time, which means that if the task takes more than one hour it will be killed, plus a maximum number of retries of three. In this example we're not setting the errorStrategy, but the idea is that if we say retry, Nextflow retries when something goes wrong, and with maxRetries 3 it stops retrying and terminates the pipeline at the third attempt. There's also an example of a program where you pass the number of CPUs and the memory to the program itself — even though we're running locally here, where these requests wouldn't otherwise be taken into consideration. You can also use lazy expressions, closures: for example, 4 gigabytes of RAM times the number of CPUs, so if this process foo has two CPUs it gets eight gigabytes of RAM; there are many other things you can do. Then there's the docker scope: you write docker.enabled = true — we saw that at the beginning of today's session, when the slides showed Conda, Docker and Singularity, we had docker.enabled = true, conda.enabled = true, singularity.enabled = true and so on — and the process container directive, which provides the container image either for all processes or for specific processes when you use process selectors. We won't really get into the details of Singularity in today's session; there's a whole section on it, so if you like Singularity or want more detail, feel free to go through it, but Singularity is not installed in our Gitpod environment, so you won't be able to play with it here. If you're a fan of Singularity, the content is there for you. Let's go to the next part, the Conda execution. As we saw in the slides, you can provide instructions to install the FastQC package in a specific version from the Bioconda channel with Conda, or, if you already have a Conda environment on your machine built from a specific recipe, you can just give Nextflow the path to that environment and it will be activated for that task specifically. With that we finish this configuration section — let me see if there's anything else I wanted to mention. There are many more things you can do, actually: back in the configuration documentation we have the scopes we saw and many others. There's conda; there's executor, which we already saw for Slurm — there are many different ones, it could be SGE, and local is the default, running on the local machine — there's Kubernetes; you can have Nextflow send an email at the end of the execution to say everything was fine; and you can use the manifest scope to record information about the author of the pipeline and to specify the minimal required Nextflow version to run your pipeline.
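A sketch of those per-process directives written directly in the script, following the example described above, with the dynamic memory closure and the retry behaviour — the tool invocation is just a placeholder, not a real command from the training:

    process foo {
        cpus 4
        memory { 4.GB * task.cpus }   // lazy closure: memory scales with the CPUs granted
        time '1h'                     // the task is killed if it runs longer than one hour
        errorStrategy 'retry'
        maxRetries 3                  // give up after the third attempt

        script:
        """
        my_tool --threads $task.cpus --mem ${task.memory.toGiga()}G input.data
        """
    }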
As you can see, the list is very long; there are many, many things you can do. The process scope is probably the most important one, so if you want to pick one to really read and understand, I'd suggest the process scope. That's where the process directives live, and they're very powerful and useful if you want to modify a Nextflow pipeline or write your own. To get more detail about the process directives, go to the process section of the documentation and then to directives — as you can see, it's very, very long. There's the executor one, and you can see we have support for AWS Batch, Azure Batch, HTCondor, Google Life Sciences, Ignite... the number is very large, and there are even more by now. The fair directive is very interesting: at the beginning we saw that sometimes we get "hello world" and sometimes "world hello", and one way to ask Nextflow to force the ordering is the fair directive. The name comes from fair scheduling: it won't try to be maximally efficient, but it guarantees that, even though tasks run in parallel, the results come out in order, so if you provided a channel with A, B, C, D, the output channel will have the transformed versions of A, B, C and D in the same order. Then there's the label directive we saw: you can add a label like big_mem to a process. This is very useful, because sometimes you have 30 processes and you don't want a process selector listing 30 names; you just add the big_mem label to all of them and use the selector with the label. There's machineType for Google Life Sciences... it's very, very long, but this is what makes Nextflow so powerful in terms of portability, so it's worth paying attention to. With that we're done with the configuration settings; again, if you have any questions, feel free to ask in the Slack channel. The next part of our training session is the deployment scenarios section. Most of the time you'll write your pipeline on your laptop and do some tests there, which is completely fine, but at some point, when you want to run your pipeline with real data, you'll need some cloud infrastructure, an HPC cluster, or maybe a small server in your laboratory or company. At that point you have to do some configuration to make Nextflow run accordingly. One example: you can just use the default local executor — with no extra configuration, Nextflow uses the local executor and the local operating system — or you can use a grid executor such as Slurm, PBS or Torque, which communicates with a batch scheduler, typically together with a network file system or shared drive in a cluster. We've already seen most of what's in this section in the configuration part: in the process scope we have the executor directive and we give it a value. So, we were looking at hello.nf with the rev command; let's run it again with -resume so I don't have to recompute things that already ran. In the end we get the reversed phrase and the cached results were reused — cool.
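A tiny sketch of the fair directive in use — an illustrative process, not one from the training script:

    process reverseIt {
        fair true    // emit outputs in the same order as the inputs arrived

        input:
        val x

        output:
        stdout

        script:
        """
        echo '$x' | rev
        """
    }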
Now let's open the nextflow.config and change it: I'm going to erase everything that's in there and, just like in the material, set process.executor = 'slurm'. The issue is that I don't have Slurm installed — one of the programs Slurm relies on is sbatch — so this won't work, but there's a reason I want to do it anyway. Let's run the pipeline; it fails, because sbatch is a command that is not found. I'm going to take the hash of the task, which is ac3d, so code work/ac3d..., and open .command.sh, which we know is just the echo of hello world with the split and so on. But now I'll look at a different file, .command.run, and you'll see at the beginning a few instructions: the header that sbatch and Slurm expect in order to know how to prepare the job submission and which resources are required. It's very simple here because we didn't specify much, but we could have said cpus 10, memory 1 GB, time 1 hour — let's do something like that. When we run it again it fails again, because we still don't have sbatch, but that's not the point; I just want to show you what Nextflow does for you in the background. So, bd18 is the hash now; code work/bd18.../.command.run, and now you see the CPUs, the one hour of time, the 1024 MB, which is 1 GB of memory. All this syntax, with the -c and --mem options, which is different from what we wrote, is specific to Slurm, and Nextflow prepares this file for you. If we open nextflow.config and, instead of slurm, write sge, which is another batch scheduler, and run again, it fails because we don't have that software installed either — this time it's qsub that's missing. But let's look at the task directory, work/.../.command.run: it's the same idea, we're still passing information to the job scheduler, but the syntax is completely different. There's no --mem anymore; we have mem_free and h_rss passed with -l; it's completely different. You would have to write a file like this for every single task if you didn't have something like Nextflow, so even though we can't show it actually working here, you can see that Nextflow does a lot of work in the background for you. I'm going to refresh this, because I accidentally closed the training preview; I think the command that opens the training page is gp preview, but I don't want to lose time looking for it, so let's just open a new one and look at the other material in the meantime. So, deployment scenarios: we're looking at Slurm and the rest. As I said, there are process directives you can use — the maximum amount of time you want a task to run, the amount of disk, storage, memory, CPUs and so on — and you can set them either in the workflow script, as we saw, or in the configuration file. There are many blog posts, some referenced here, on how to use Nextflow on HPC, and one of the most common tips is that, instead of logging into the login or head node of your cluster and just typing nextflow run, you should ideally write something like this example, an sbatch script that launches Nextflow itself as a job, so that Nextflow, being a job, submits the other jobs.
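A sketch of the configuration being used here, and roughly how it's launched — whether the submission actually succeeds of course depends on Slurm being installed:

    // nextflow.config
    process {
        executor = 'slurm'
        cpus     = 10
        memory   = 1.GB
        time     = '1h'
    }

    // launch as usual; Nextflow now writes the Slurm-specific header
    // (e.g. the -c and --mem options) into each task's .command.run
    // and submits it with sbatch:
    //   nextflow run hello.nf -resume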
That's because in some clusters there are very strict limitations on the login node. It depends on the cluster — you have to talk to your cluster administrators to understand the limitations, for example how many jobs a job is allowed to submit — but it's usually advised to run Nextflow as a job and not straight from the command line. We also saw the process selectors; here's another example: I want to run on Slurm with the short queue and 4 CPUs and so on, but if the process is named foo I want something different, and if it's named bar, something different again. Then we have an exercise to change script7.nf — we haven't looked at it yet, but basically you go to this file (it's easier to see from inside here) and specify in the configuration file, nextflow.config, that the quantification process should get 2 CPUs and 5 GB of memory. If you want, pause the video now and try to do it; to check it worked, look at the task.cpus variable, or at the .command.run or .command.sh files inside the task directory. I'm going to show the solution now, so pause before you look — and that's basically it, it's very simple. You can actually use a label, as we saw before, and you can also have a container image per process, which is the preferred way; in a moment we'll see more about managing dependencies and containers and you'll understand how powerful that is. One very interesting thing when it comes to configuration and deployment scenarios is profiles. In the configuration file you can have a scope called profiles; there's always one called standard, which is the default, and then you can create others, with any name and any configuration inside. Here's an example with one called cluster, choosing the SGE executor and the long queue and so on, and one called cloud, using AWS Batch with a particular container image. What this means is that you can ship your pipeline with some profiles, so that if someone wants to test it on their local machine they use no profile at all, which is standard, but if they want to run it on a cluster they pass -profile cluster, or -profile cloud for the cloud. You can also combine multiple profiles at the same time: the nf-core pipelines, which are curated pipelines we maintain, are often run with the profiles test,docker, meaning it does a small run for test purposes and uses Docker because you don't have the software installed on your machine — it pulls the container images and the tasks run inside containers with all the dependencies the pipeline requires. Here we have an example setting -profile cluster, and here standard,cloud, so yes, you can use multiple profiles at once. For the cloud, here's an example: you write the pipeline on your machine, but then you want to deploy it to AWS Batch, so you set the process executor to awsbatch, choose a queue that you configured, a container image to be used, a work directory in an S3 bucket, the AWS region, and the path to the AWS CLI. You can call this profile, say, amazon, and run with it. You can also mount volumes, for example with Elastic Block Storage, which is an Amazon service: in the aws.batch scope you set volumes and provide the paths.
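A rough sketch of what such a profiles block can look like — the queue names, image and bucket are placeholders, not the exact values from the material:

    profiles {
        standard {
            process.executor = 'local'
        }
        cluster {
            process.executor = 'sge'
            process.queue    = 'long'
        }
        cloud {
            process.executor  = 'awsbatch'
            process.queue     = 'my-batch-queue'
            process.container = 'nextflow/rnaseq-nf'
            workDir           = 's3://my-bucket/work'
            aws.region        = 'eu-west-1'
        }
    }

    // nextflow run main.nf -profile cluster
    // nextflow run main.nf -profile test,docker   (as done for nf-core pipelines)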
You can have multiple volumes for AWS Batch, you can have custom job definitions — I don't want to spend much time on these details, they're very specific to AWS, but the idea is to show you there's a lot you can do; you can have a launch template, and you can have hybrid deployments: run everything on Slurm except when the process has the label bigTask, in which case use AWS Batch. So you get a hybrid deployment where some tasks run on your local computer or cluster and some specific processes run in the cloud. If you go to the Nextflow website or the Seqera website, both have blogs — go to the company blog, and in both cases there's lots of material on AWS, for example: clicking on AWS you find tutorials using AWS with Tower, announcements of interesting features, and so on; there's a lot you can do with AWS Batch and Nextflow. There's also a very nice blog post on Google Cloud, one I wrote, "Get started with Nextflow on Google Cloud Batch". These blog posts go into more detail on how to run your Nextflow pipelines in the cloud, in a more informal language; you can always go to the official documentation, where all the information is, but sometimes it's quite technical, and these posts are very nice for beginners who just want to get a Nextflow pipeline running as soon as possible and see it working. That one is a step-by-step guide on how to run the RNA-seq pipeline on Google Cloud; it's short, with some screenshots, so if you want to play with Google Cloud it's a nice way to start. That's it for the deployment scenarios. The next section is about Nextflow Tower. As I briefly mentioned before, Nextflow Tower allows you to do many, many interesting things. I'm going to sign in with my GitHub account — I have another account, but I'll use the GitHub one, which is the one I use to play around. As soon as you get in, you see this organization called community and this workspace called showcase; everyone who creates an account on Tower gets the same thing. You can access it, and you get 100 free CPU hours to run all these pipelines and play with them. This is the Launchpad: the list of pipelines added by the organization — Seqera added a few for you to try. There are some tags; let's look at nf-core/rnaseq, for example. It already has an AWS compute environment assigned for it to run and the GitHub URL of the nf-core/rnaseq pipeline, and you can do a few things here: you can optimize it, or view it and see what's inside — there's a label, the compute environment is this one, some resource labels, a config profile assigned by default, which is test, and no pipeline secrets. Pipeline secrets, which you can manage here, are things like passwords or tokens that you don't want written down anywhere; Tower replaces the variable placeholders on the fly, which is a very interesting feature, and there are a few secrets here already. You can see the participants in this workspace and organization, you can have credentials — we do have some here — compute environments, two datasets, and actions, which are triggers that can fire.
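A minimal sketch of the hybrid-deployment idea described above, assuming a bigTask label and an AWS Batch queue you have already configured — names and image are placeholders:

    process {
        executor = 'slurm'           // default: run tasks on the cluster

        withLabel: bigTask {
            executor  = 'awsbatch'   // only labelled processes go to the cloud
            queue     = 'my-batch-queue'
            container = 'nextflow/rnaseq-nf'
        }
    }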
We're not going to play with actions today. Then there are the runs: we see not only my runs, the executions of Nextflow pipelines that I launched, but also the runs of everyone in this community workspace. I know Adam, for example, but I don't know this other person; some runs have succeeded, some have failed. Let me go back to my Launchpad; I'm going to choose the rnaseq one and click on launch. A few things are already filled in for me, like the run name — an adjective plus the last name of a scientist, as I said before — some default input files, and a default output directory. I'll just scroll down, accept everything, and click launch. By clicking launch it launches my Nextflow pipeline; for now the status is submitted, and I'll click on it to see more. That's the command line that was used: run the GitHub URL, give it a name, give it a params file with the input and so on, run with Tower, this is the revision of the GitHub repository — you can have master and other branches — and I chose the test profile. So far nothing has happened; it's still submitting, I'd say. This takes a while, because it's creating virtual machines on Amazon and so on — it's going through AWS Batch, managing resources, checking for available infrastructure. We can see the wall time, a few seconds have gone by already, and for now everything is empty. If we go to runs, let's look at this one, for example — also the rnaseq pipeline, but Adam ran it a bit earlier today. Let's click on it; it's finished, so we have more information. There were almost 200 tasks executed and they all succeeded, and we can see information about every task. I can click on one, for example, and I get the name of the task, the command that was actually run, the exit status, which was 0 — a complete success — the work directory specific to that task, some environment variables, when it was submitted, started and completed, the duration and real time, which container image was used, the queue, the number of CPUs requested, the memory, time, disk — lots of different things, as you can see. Note that there's nothing specific to the pipeline data here: I don't see the data, I don't see the results. When you use Tower it doesn't have access to your data; everything runs in your own compute environment, and Tower doesn't have access to that. Then we have information about how long it took — 16 minutes — the total CPU time, an estimated cost, because it ran in the cloud so we can estimate it, and some statistics. One very interesting thing is this chart: in this tab we see the amount of resources actually used as a percentage of what was allocated. You said you'd need this amount of resources, but you actually used very little compared to what you requested, so based on that we can fine-tune the amount of resources we request, which means cheaper machines, faster computation, and less money spent. The same goes for memory, job duration, I/O and so on. Sometimes in the Launchpad you'll see icons meaning you can optimize a pipeline — this one is already optimized — which means Tower can help you spend less money. Our pipeline is running now, so let's see how many tasks have succeeded so far: we have four submitted tasks and no tasks succeeded or failed yet.
It's still at the very beginning, so there isn't much yet. We have these four in orange being executed; we can see the command, but it hasn't really started. OK, let's do something slightly different and follow the tutorial. If you don't want to use the Tower Launchpad and the web interface, you don't have to; you can use the command line. Let's run hello.nf — it reverses "hello world" for us, so "world hello" — and let's pass --greeting 'hello my beautiful world'; cool. But I don't have much information here, and I'd like to run this with Tower so I get more information about the pipeline, have the run registered in a database, and can look at it and show my friends — other people can watch what's happening along with me, even though it's running on my machine; I want it to be monitored in a way that others can follow. So I'm going to use -with-tower. However, that alone won't work, because Nextflow doesn't know which Tower account is mine; it's missing the Tower access token. So I go back to Tower — I'm logged in with my account — go to Your tokens and create a new one; I'll call it nf-core-training. It gives me the token, I copy it, come back to the terminal, and export TOWER_ACCESS_TOKEN with that value. With that done, when I run with -with-tower, Nextflow identifies who I am by the token and submits the run to my runs page on Tower. It even prints the URL if you want to go straight to it, but instead I'll just go to my runs page. It's a different organization — my personal user workspace — and the run is there, hello. If we click on it: it was very quick, because it's not running in the cloud, there's no virtual machine being created or resources being analyzed; it runs at the same speed as before, so it was very fast, as you saw. We see six tasks, five convertToUpper and one splitLetters, and indeed when we go to Nextflow Tower we see the six, which are cached. Let's do something different: I'll run it again with a different string — "I hope you guys are having a lot of fun" — so everything is new and nothing will be cached. Let's open the URL that's printed here and go there. Now we have eight tasks, which is correct: seven convertToUpper and one splitLetters. They all succeeded, there was no cache, this is the command line we ran, and we can click to get details about each task. There wasn't much to it, but notice that the container, by default, was this one. Why?
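The sequence of shell commands here is roughly the following — the token value is of course a placeholder that you copy from the Tower web page:

    export TOWER_ACCESS_TOKEN=eyJ...your-token-here...
    nextflow run hello.nf -with-tower --greeting 'I hope you guys are having a lot of fun'
    # Nextflow prints the URL of the run page being monitored on Tower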
Because, by default, when you open the nextflow.config of this training material, there is a process container set in the configuration file, so it's used for every process. The CPUs default to one, the executor is local, and then there are some statistics about how many resources the task used. So Nextflow Tower is very, very useful to help you monitor your pipelines, and you can share all the runs with colleagues in the same organization or workspace; you can follow things together, discuss, and have a shared Launchpad. Here in compute environments I don't have any, but you can add them: if you have access to Amazon, you'd say you want to add AWS Batch, share the credentials you got from them, and you'd have a compute environment. Then you'd basically go to the Launchpad, add a new pipeline, set the compute environment you want to use and the GitHub URL, and your pipeline would appear in the Launchpad. Going back to the training material: you've seen me using Nextflow Tower through the web interface, the Launchpad and everything, but you don't necessarily have to use the graphical interface — there's also a command-line interface. We don't have the Tower CLI installed in Gitpod, but I have it on my machine, so I can show you what it can do. I exported the Tower access token the same way we did in Gitpod, and then I can run tw runs list, for example, which gives me the list of runs, the project name, whether each run failed or not, the submission date and so on. I can also look at the workspaces commands — tw workspaces list would list the workspaces for my account, and here it's just the community showcase, since I didn't create another one. It's a very powerful command line; it's probably a bit advanced to go into more detail in the foundational training, but I wanted to show you that Nextflow Tower adds a new layer of features for monitoring your pipelines and working together as a team, an organization, a lab — whether on the command line with the tw CLI or in the web interface, you have plenty of power to monitor your pipelines and do a lot of cool stuff. You can also use the API to interact with Tower, for example from an application of yours; the documentation is very rich, so you can read it in more detail if you're interested. You can organize everything in Tower with organizations, permissions and workspaces, which is very nice if you want to manage access to compute environments, pipelines, credentials, datasets and so on. The actions are also very interesting: they let you trigger workflow executions, for example with GitHub webhooks — it's up to your creativity what triggers the launch of a pipeline. Let's go to my user page: you can add a new organization, so let's create one named nextflow-training. There it is; I can go inside, and there's no workspace yet, so let's add one, workspace one, the first workspace of my amazing organization, and make it private. Now we have a new organization with a new workspace, I can add people to participate, and we have access to the settings, the participants — there's lots of stuff you can do.
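As a sketch, the Tower CLI calls shown here look like this — tw runs list and tw workspaces list are the subcommands being used, though the exact output columns may differ between versions:

    export TOWER_ACCESS_TOKEN=eyJ...your-token-here...
    tw runs list          # list your runs with their status and submission date
    tw workspaces list    # list the workspaces your account can see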
There are demos on YouTube, on the Nextflow channel, and demos that Seqera does; Nextflow Tower is also a product, so if you want more details there are better channels to get that information — it's a bit out of the scope of this training — but that's what we had prepared to show you in this Tower section. So now let's go to the next, and last, part of today's session, where we go through managing dependencies and containers. This is a very important section, and some of the concepts may look complicated if it's the first time you're hearing about containers, so again, feel free to pause, read again, and practice with the examples we're going to see. It's worth it, because once you master handling containers and Conda environments with Nextflow, it's much easier to make sure your pipeline is reproducible. Docker is the most famous container technology — not the only one, but the most famous — and there's a command-line program called docker that's already installed here and that you can use. A very common, very basic container image is the hello-world one, so you can just type docker run hello-world — let me increase the font a bit. Since it's the first time I'm running it, it says it cannot find the image locally, so it pulls it from the internet. But from where? The same way we have GitHub for source code, there are container registries for container images; the default one for Docker is Docker Hub, so it's pulling hello-world from Docker Hub. Once that's done, it runs as requested, and you get "Hello from Docker! This message shows that your installation appears to be working correctly." — it's just a container image used to check that Docker is working. Now a different command, docker pull, to pull the debian image, optionally with a tag, which is a label you use to refer to a specific version of a container image. Let's do docker pull debian; it pulls it, and there it is. Now we can run docker images, which lists all the container images we have pulled on this machine, in this Gitpod workspace: the one we played with before, the hello-world we just pulled, and the debian we also just pulled. Another interesting thing is using docker run to actually run a container. You can think of the container image as a recipe — or better, as a cake sitting in your fridge: when you want to eat it, when you want to interact with it, you have to open the fridge and put it on a table, and that's the container. You have container images stored on your machine, and when you want to run a container you use docker run and the image name; you can add -ti to get an interactive terminal and bash to run bash inside the container, and you can use that to play with it. Doing so, you get a shell inside the container. I can type ls, for example, and it shows me lots of folders; I can type whoami, and I'm root. But if I exit the container and type whoami, it's gitpod, and ls shows the files we already know are here. So you can really see that a moment ago we were inside the container. Let me go inside again and try to find the tree program.
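In shell commands, this first Docker tour looks roughly like this:

    docker run hello-world        # pulls the image from Docker Hub on first use, then runs it
    docker pull debian            # fetch the debian image (debian:<tag> for a specific version)
    docker images                 # list the images available locally
    docker run -ti debian bash    # start a container with an interactive shell
    whoami                        # -> root inside the container; 'exit' drops back to the gitpod user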
tree is not installed in the container, but if I leave the container we know tree is installed out here. This is the kind of isolation you get with containers: a place in your computer that is very specific and controlled, with exactly what you need installed, even when your machine in general doesn't have it. That's what containers do. We pulled a few existing containers, but what you usually want is to create your own container image. I'll create a folder called playing-with-docker, cd into it, and create a file called Dockerfile — that's the default file name Docker looks for when building an image. I put my name, Marcel, and my email here, and then basically it runs the apt-get command to update and install curl and cowsay, and by adding this entry to the PATH I'll be able to find the cowsay binary. I save it and follow the instructions, which are to build this image: I give my image a name, my-image — you could use any name — and a dot to say that the Dockerfile is in the current directory. It builds the image, and if we run docker images again at the end, there it is: my-image, created three seconds ago, very small. Then I use this command line to run my container and execute the cowsay program with the argument "Hello Docker" — if you know cowsay, you can probably guess what's going to happen. cowsay is not installed in Gitpod, but inside the container it is, so we can use it. I could also get inside, like I did before, asking for an interactive terminal and running bash, and call cowsay from within the container. Now let's open our Dockerfile and make a few modifications. I'll add the following line to put more stuff into my container: it uses curl to download this compressed file, the Salmon release, and moves the files to specific directories, so that I can just call salmon and the operating system will know where to find it. It's still just a Dockerfile, so I have to build my new container image again — let's do it. There's some cache, because I didn't change the earlier steps, so they're reused, and only the third instruction, the new RUN, is executed. Then I can run this command line, which says: run the my-image container and execute salmon --version, and we get the version of Salmon. Again, if I type salmon directly in Gitpod, I get command not found, because Salmon isn't installed here, but it is installed inside my container. We can also do what I've done a few times already and get an interactive shell inside the container, then run salmon --version from inside. But then you might say: OK, Marcel, I have this container image with Salmon installed, but when I'm inside the container I can't see the files in Gitpod. I want to use Salmon, a bioinformatics tool, to do something to my samples, and if I can't see my samples, how can I do that? That's what mounting volumes is for: you say that this part of my operating system, of my disk, should be accessible from within the container. For now, though, let's try to run this command, which is going to fail: if you know Salmon, what you'd do is use the salmon index command to create a reference index based on a reference transcriptome, which is stored here; but it tells me it cannot find the file — the file does not appear to exist, and indeed, from inside the container, it doesn't.
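As a rough sketch of that Dockerfile — the maintainer details and the exact Salmon version/URL are placeholders, not necessarily the ones used in the training:

    FROM debian:bullseye-slim

    LABEL image.author.name="Marcel" \
          image.author.email="marcel@example.com"

    # basic tools: curl to download things, cowsay just for fun
    RUN apt-get update && apt-get install -y curl cowsay
    ENV PATH=$PATH:/usr/games/

    # add Salmon: download a release tarball and put the binaries and libraries on the PATH
    RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz \
        | tar xz \
        && mv /salmon-*/bin/* /usr/bin/ \
        && mv /salmon-*/lib/* /usr/lib/

    # build and try it out:
    #   docker build -t my-image .
    #   docker run my-image cowsay "Hello Docker"
    #   docker run my-image salmon --version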
But if I set the volume, like this, I'm saying: this is what I have on my computer, and this is where I want to see it inside the container. To show what happens, I'll change the command a bit: instead of running Salmon I'll get inside with bash and the -ti option we've seen a few times. Getting inside the container just like before, but now with the volume mounted, you can see that transcriptome.fa is there, because I said I want this file, which lives at this path on my disk, to appear inside the container at this path. Leaving the container and running the long command line again, the one that runs Salmon, it can now see the file, so let's run salmon index with everything and see what happens. Ah — "not a directory" — I was just in the wrong folder; it's not supposed to be run from playing-with-docker. Going back to the right directory and running from there, it works, and if I go into the transcript-index folder we see the result. The command as it was before still wouldn't be quite right, though, because we hadn't mounted things in a way that lets the container write somewhere we can see from outside; the final command fixes that by saying: this whole working directory I'm in right now, I want the same path available inside the container, so when Salmon writes its output inside the container we can see it from outside — that's what we just checked with tree. But you can see it's starting to get complicated: choosing the image, running it, choosing the work directory and the volume... it's a pain, and you don't have to do it. What I'm showing you is the raw, manual way; Nextflow does all of this for you. You would otherwise have to do it for every task, and you may have thousands of tasks — you don't have to, because Nextflow does it. There's a bonus section here showing how to push the container image you just created to the Docker Hub registry; you can follow those instructions by yourself, but we'll skip it here because it takes a while. OK, so we have these files here, and in the script you see a very simple pipeline with a single process, doing exactly what we were doing just now: creating an index with Salmon. What this part of the material asks is to provide your own Docker image instead of the regular one we have, which is nextflow/rnaseq-nf. How do you do that? Like this — let me copy everything — by passing -with-docker with your image name, you are effectively overriding process.container. And if you look at the work directory of this task, open .command.run and search for docker run, you'll see that the image name is my-image, so we can confirm that the container image used for this task was the one we just created. Now, the reason any of this works is that our nextflow.config has Docker enabled — it needs to have it, and that's why earlier we got salmon command not found; we have to set — oops, wrong folder, wrong nextflow.config — docker.enabled = true in there. With that set, if we run again without saying anything about -with-docker, Docker is still used, with the container image defined in the config, which is nextflow/rnaseq-nf. Let's open that task directory, look at .command.run, search for docker run, and we see that this is the container image being used — it's not my-image anymore. So, as you can see, by inspecting the .command.run and the .command.sh, as we've done a few times, you can investigate how your pipeline is working.
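Stripped down, the manual volume-mounting command looks something like this — the paths are illustrative, with $PWD being the directory holding the data:

    # make the current directory visible (and writable) at the same path inside the container,
    # and use it as the working directory, so outputs land where we can see them from outside
    docker run --volume $PWD:$PWD --workdir $PWD my-image \
        salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

    # Nextflow does exactly this kind of mounting for every task when docker.enabled = true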
Now that we've played a bit: you saw that earlier we did all the mounting of volumes and everything by hand, and this time we didn't do anything — the only thing we did was either pass -with-docker or set the container image here, that's all, and Nextflow was in charge of running everything and mounting the volumes. We just ran the pipeline as before, providing that container image, and Nextflow pulled the image, ran it, mounted the volumes, moved files into place and back, and did everything. Maybe it was a bit confusing at the beginning, because you do have to mount volumes, make the container see what's outside and let it write outside so that when you leave the container you can read the files — it's definitely fuzzy and complicated the first time you hear about containers — but now we can see that Nextflow does all of that for you, without you having to worry about it or even fully understand it: you just say, Nextflow, I want this process to use this container image, take care of everything, and Nextflow takes care of everything for you. Singularity is another container technology; it's very popular in cluster settings, in HPC environments, and people use it very commonly, but we're not going to get into more detail about it here. There's a whole section about it; it's not installed in Gitpod, but the content is there for you. And you can run Docker containers with Singularity — most of the time even the competing technologies support Docker containers — so, since Docker is the most famous one, I think it's worth learning Docker even if you want to use another one afterwards. So we've covered the basics of containers, which is very important if you want to work with Nextflow. Now let's do something different and work with Conda. Conda is a popular package manager, especially in bioinformatics: you have a package repository, with channels, containing many, many different software packages — thousands of them — and you use the conda command-line program to install them on your machine. You can create an environment, in the sense that when you activate the environment the programs work, and when you deactivate it they don't anymore, because the environment variables used to find the binaries are changed; a program is only usable when its environment is activated. Maybe you want a few environments with the same software in different versions, so managing these environment variables properly matters. The first thing we do here is conda init, and then we refresh the terminal by just typing bash — like a reset, opening a new one — and then the base Conda environment is loaded. Next we write a file called env.yml with this content — I think it's actually already created here — and it has the name of a Conda environment and the channels you want Conda to search for the packages you want: I want Salmon from the bioconda channel, in this specific version, and so on. With this file you can use conda env create, pointing at the file, to create a Conda environment with the name and the packages it describes. This takes a while, because Conda has to work out which packages it actually has to install so that Salmon will work.
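A sketch of what that env.yml and the surrounding commands look like — the package list and versions are indicative, not necessarily the exact ones in the training file:

    # env.yml
    name: nf-tutorial
    channels:
      - conda-forge
      - bioconda
      - defaults
    dependencies:
      - salmon=1.5.2
      - fastqc=0.11.9

    # create, inspect and activate the environment:
    #   conda env create -f env.yml
    #   conda env list
    #   conda activate nf-tutorial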
Even though we asked it to install four packages, it may have to install many more, because each of these packages has dependencies, and those dependencies have dependencies, and so on, so it can take a while to calculate what needs to be installed. I'll skip ahead a bit. When it's done, we can type conda env list and it lists all the environments: we have base loaded, with the star showing that, but we also have nf-tutorial, together with the path to this environment. That means there are programs inside it that aren't automatically found when you type a command, but that become accessible, just by calling them by name, once you activate the environment. We activate a Conda environment with conda activate, and when we do, the "base" in the prompt changes to "nf-tutorial". With that, one thing you can do is run nextflow run just like before, but instead of -with-docker use -with-conda, passing a Conda environment path. The environment doesn't have to be loaded or activated in my shell — if I type salmon here I still get command not found — but by providing -with-conda, the environment is activated specifically for every task while it runs, so inside the tasks Salmon works. Instead of the environment path I could also just provide, as we saw in the slide deck at the very beginning of today, bioconda:: with the name of the package and the version, and Nextflow will build an environment for us and activate it whenever it's needed. Another way to play with Conda is to use Micromamba. Micromamba is a very fast and robust package manager that uses the Conda repositories; it has some extra features, let's say, and some people use Micromamba to interact with the Conda repositories instead of using Conda itself. Some people even combine that with the idea of containers, because the environment solving that's happening right now is not guaranteed to give you the same dependency graph forever: people have had the experience of trying to reproduce a Conda environment a while after they first tested it and getting a different result. You don't really have a guarantee that you'll get the same environment in the future or that things will work the same way. So what people do is use Conda, Micromamba, whatever, to do the package management, but inside a container image, so that once you have the container image it is fixed forever — you can run it today or in ten years and it's supposed to work the same way. That's the best practice, the gold standard, let's say. Here we have again the env.yml we had before, and what we're going to do is write a Dockerfile, like we did before, but instead of installing things by hand we take the env.yml, which is the recipe for the Conda environment, copy it into the container image, create an environment, and install the packages we want based on this env.yml. And that gives us a container image.
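A rough sketch of that approach, assuming the micromamba base image from mambaorg — treat the image name and flags as illustrative rather than the exact recipe from the training:

    FROM mambaorg/micromamba:latest

    # copy the Conda environment recipe into the image
    COPY env.yml /tmp/env.yml

    # let micromamba resolve and install the packages into the base environment
    RUN micromamba install -y -n base -f /tmp/env.yml \
        && micromamba clean --all --yes

    # build it once, and the environment inside the image never changes:
    #   docker build -t my-conda-image .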
A container image is the best thing for reproducibility, while everything that has to do with installing software and resolving dependencies is done with Micromamba, which is very fast and uses the Conda repository, a great repository for bioinformatics tools. That is the gold standard. As you can see, it's still installing, but here there are instructions on how to do exactly what I was describing, even up to the point of pushing your image to a container registry, Docker Hub. But again, this is kind of the hard way — it's a real pain to write every container image you need — and there's a project called BioContainers that is very interesting, because they have a container ready for you for every package in Bioconda. Take FastQC, for example, which we'll use when writing our simple proof-of-concept RNA-seq workflow: so many people use FastQC that if everyone had to write their own Dockerfile, build their own container image and pay to host it somewhere, it would be awful. BioContainers already has container images for a lot of different tools. In fact, when you go to nf-core and look at the modules — on the third day you'll see this part, but just to show you — regardless of which module you choose, you'll see a BioContainer being used; if you want to use the module, the container is already there. I could pick any other module at random and you'd see a BioContainer there too. These BioContainers are a great feature, and you can also find mulled, multi-package images — images that contain several tools — because sometimes you want to run more than one tool in the same process and sometimes just one. So you have container images with a single tool, like FastQC here, and mulled containers with two, three, four, five different tools inside, and there's a search tool from the Galaxy project that helps you find the mulled container you're looking for; it's very nice. At the bottom here there are some exercises that let you play with the other scripts already written for us — script2.nf with Docker, for example, using images created by BioContainers — a very nice exercise to do. As for the other bonus exercises: I really recommend that after the training you read everything again at your own pace and try to do every exercise, even the obvious-looking ones, because when you actually do them you often run into something you thought was trivial that in fact requires some thought. So please go through the exercises and solve them. With that, we finish our session today — I hope you enjoyed it. Again, feel free to watch it again, pause, and go to the channel on the nf-core Slack to ask your questions, see what questions others asked, and discuss whatever you find interesting. If you go to the nf-core website, nf-co.re, you can click "Join nf-core" at the top, and there's a link for Slack; clicking it takes you to Slack, and once you're there you just join the channel for this September 23 foundational training. We will be there to answer all your questions. The overview of today's session was basically how to use Nextflow and how to modify a Nextflow pipeline you want to use: we went through setting up the Gitpod environment.
We understood the basic concepts, like processes and channels. We saw some images that helped us understand that a channel holds a queue of elements, and these elements are provided to tasks, which are process instances; the results are put into a channel again at the end and delivered to another process, which is the next step in your pipeline. We saw the idea of executor abstraction, which means that it's very easy to use different compute environments, cloud, cluster, your local laptop: you just tell Nextflow which executor you want to use and Nextflow will do all the work for you. The same goes for Conda, Docker, Singularity or Podman: Nextflow will take care of mounting the volumes, doing the configuration, pulling the container image, running the container, setting some parameters for you, setting the right user permissions and all these things. In the configuration part we saw lots of things we can do, like requesting resources for a specific process or attaching labels to processes, many different things. We saw where we can deploy, and we saw something I think is very, very nice, which is this idea of hybrid deployments: I can run a pipeline on my machine, and most of the time things are really light and I don't want to spend money on the cloud, but there is this one particular process that takes forever and really needs a lot of resources, so I want this specific one, to which I attached a label called DeepTask, to run on AWS Batch, on a cloud node instead of as a local job (I will show a small configuration sketch of this idea in a moment). It's very, very powerful. Unfortunately, we don't have a lot of time in this foundational training, so I can't show many examples, and most people are still at the very beginning of their Nextflow learning path, so I can't show things that are very challenging, because that would not be very comfortable at the beginning; you might get lost. The hands-on training, for example, which is the next one, has something closer to a real-life pipeline, with a lot of different container images, mulled containers and so on, so after this one I really recommend you take the hands-on training, because it's much more interesting; it looks more real. Tomorrow, in the second session, we are going to get into more detail about channels, processes, channel factories, channel operators, and cache and resume, and at the end we will write a simple RNA-seq workflow from scratch. It will still be a simple one, so not really close to a real-life RNA-seq pipeline, but it will be real enough to be useful for you: it will index a transcriptome file, perform some quality control, perform gene expression quantification, and create a MultiQC report at the end. So it's a real pipeline, it will work and it's very useful, but in some cases we need things that are much more complex; that's the difference I'm trying to point out here. I don't want to risk showing something very complex, even though it's interesting to me, and lose you, because it would be a big jump. So I apologize if for some people we are going too slow or too fast; I'm trying to find the balance here, but being able to answer your questions in a channel is an amazing opportunity to tune this balance even more. So if someone thought something was very easy or obvious here, you can ask difficult questions there and we will be more than happy to answer them.
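Here is the small configuration sketch I promised for the hybrid deployment idea. The label name matches the one I mentioned; the queue, region, bucket and resource numbers are made-up placeholders:

    // nextflow.config: run everything locally except processes labelled 'DeepTask'
    process {
        executor = 'local'

        withLabel: 'DeepTask' {
            executor = 'awsbatch'
            queue    = 'my-batch-queue'
            cpus     = 16
            memory   = '64 GB'
        }
    }

    aws.region = 'eu-west-1'

    // hybrid runs need a work directory both sides can reach, typically an S3 bucket
    workDir = 's3://my-bucket/work'

With something like this in place, the nextflow run command itself stays unchanged; only the configuration decides which tasks stay on your laptop and which ones go to the cloud.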
And if you thought it was too fast here, I would like to emphasize that there are no stupid questions. Every question is welcome, from the very basic to the very complex, and we will be more than happy to answer all of them there. That brings us to the end of the first day of the training session, and I look forward to meeting you tomorrow. Bye bye!