Hello everyone, my name is Marcel Ribeiro-Dantas. I'm a developer advocate at Seqera for Latin America and the Caribbean, serving the Nextflow and nf-core communities. Welcome to the hands-on training! This is the second training we are providing this month of September 2023. You probably saw the foundational training that we had at the beginning of the month. In case you are new to Nextflow and haven't watched it, I really recommend you go to the nf-core YouTube channel and check that three-day training. The foundational training is focused on helping people through their first experience with Nextflow. It's very basic: we move at a slow pace and go into a lot of detail about concepts such as channel operators, channel factories, Nextflow processes, Nextflow workflows, and so on. It's a very important first step, and if you haven't watched it yet, I recommend you do so. If you have, or if you've done something similar in the past and already feel you're at an intermediate level with Nextflow, this hands-on training is perfect for you. It's much shorter than the other trainings we have done and it's very hands-on oriented: we have a pipeline that we're going to play with a bit, and you'll try to fill some gaps that I've left, so that you can refresh what you knew about Nextflow or simply get more experience. The first thing I want to show you is the page of the hands-on training. It's going to be a single day, just today, about two hours, maybe a bit less. If you have any question during or after the training, you can go to the September 2023 hands-on training channel on Slack; it's a shared channel between the two workspaces (Nextflow and nf-core), so there will be people to answer your questions during this training, but also afterwards. It's a perfect place to ask.
There are many different channels in these Slack workspaces, both the Nextflow one and the nf-core one, and some channels are very specific to certain types of questions. But don't worry: for this first day, for your first contact with Nextflow, ask any of your questions in this channel. Even if your question might fit better in a different channel, this is a channel for all your questions. Okay, so it's important to mention that, just like we had the foundational training before, we're also going to have the advanced training in a few days. After doing the foundational training and now the hands-on training, if you feel confident enough, go ahead and take the advanced training. This hands-on training is the first time we are running it in its current version, so maybe we're going to run into some issues or bugs, but let's hope not. The advanced training is the same: it's the first time it's being shared in this current format. So both are new trainings and you'll be the first cohort taking them. The next step is to go to the official community training portal, which is training.nextflow.io. When you visit that website, this is what you're going to see, and if you've done the foundational training you'll remember that this is the link to it; but now you're going to go to the second one here and launch the hands-on training. A few warnings. The first is that this hands-on training is a Nextflow course. We are not interested in teaching you bioinformatics, or the best way of doing variant calling or any other type of bioinformatic analysis; and we're not trying to teach you the basic concepts of Nextflow either.
This is a different course. If you are still unsure about some Nextflow concepts, like channel factories, channel operators, Nextflow processes, Nextflow modules, and so on, maybe you should go back to the foundational training: click here on "basic training" and review the sections you don't feel confident about. Here the pace is going to be a bit faster. I will still try to explain as much as I can, but I won't be able to go into the level of detail you would need if it were your first time working with Nextflow. If you're looking for a real variant calling Nextflow pipeline, there are plenty out there; a few examples here at the end are the sarek pipeline, viralrecon, and rnavar. These three are production-ready, amazing, and up to date: the best you can get for variant calling. You're going to find the best practices there, the best tools, the best versions, test data, all those things. Here, instead, we're going to play with a variant calling pipeline that is far from ideal. Some versions of the software are outdated, and some of the practices we're using may not be current best practice, so be aware of that. The goal here is not to write an amazing pipeline from the scientific point of view, and this pipeline is not here to inspire your actual variant calling work. As I've said a few times, that's not the goal here.
So we're going to implement (well, not really implement, because it's almost already implemented, but kind of implement) a variant calling pipeline for RNA-seq data in Nextflow. The first section I'd like to go through in detail with you is the data description section, where I'm going to talk about every input of this pipeline. Then I'll talk about the steps, then how to set up your environment, and then we'll get to the pipeline implementation. Why in that order? There are a lot of input files and a lot of steps with a lot of different tools; if I tried to explain everything at the same time, it could get very confusing. So first I'll talk about the data, the steps, and the environment setup, and when all of that is clear to you, we'll go to the Nextflow code. The first input file is the genome assembly: the reference genome for Homo sapiens. It was obtained from GenBank, but it's specific to chromosome 22, so you don't have the whole genome here; it's not the full reference, just a small piece of the genome. You may ask: why would I be interested only in chromosome 22? There could be a few reasons, but here the reason is simply that we want something small, so you can play with it and see the results quickly. A real variant calling pipeline with many samples could take hours, a lot of time, even more than that. That's not the goal here: we want something that lets you run the full pipeline in a few minutes.
So we're going to have a few samples, and everything is going to be a piece of a real sample. Starting with the reference genome: here it's only chromosome 22. The RNA-seq reads, same thing: they come from a human cell line, but even though the original RNA-seq is complete, we only keep the reads that were mapped to a specific locus on chromosome 22. Again, we don't use more of the genome because we want things to be quick. Here you have a table with some information about the samples; I won't go into detail, but they are paired-end RNA-seq data. The third input is going to be the known variants: a VCF file, already compressed. These known variants come from high-confidence variant calls for this specific cell line, from the Illumina Platinum Genomes project. The idea is that many different tools were used to do that variant calling, pedigree information was used, and so on; they did their best to produce a gold-standard set of known variants. Sometimes you want to find something novel, and by comparing against a set of known variants you know what you're expected to find and what's new. That's the point of having the known variants here: variants that are expected to be found. The other input file is the blacklisted regions. These are regions of the genome with anomalous coverage: we know it's usually difficult to map reads to these regions, or reads map there in a very weird way; they have a high ratio of multi-mapping to uniquely mapping reads, and high variance in terms of mappability. The name is clear: it's a blacklist, so you usually want to be careful with these regions. As you can see, we have only four inputs: the reference genome, the reads, the known variants, and the blacklisted regions. Of course, we're going to have a lot of intermediate versions of these files and outputs of other programs, so things get quite complex in the end. Actually, I have it open here... let me see if I can find it. Yeah, here.
Here is a DAG, a directed acyclic graph, that shows the pipeline we're going to build in the end. You have the inputs in this box, every node is a step in the pipeline, and the directed edges show you what flows where. The "prepare genome with Picard" process, for example, has one input, which is the reference genome, and the same goes for the "prepare genome with Samtools" process. The "prepare VCF file" process gets the variants file and the blacklisted regions, and it provides the input for two other processes: the RNA-seq GATK recalibration step and also the VCF post-processing step. As you can see, it's a bit complicated, but not even close to a real pipeline; real pipelines are much more complex than this. Okay, let's go back. So these are the four inputs, even though there are a lot of intermediate files we're going to encounter along the way. The next section is the workflow description. This one can get confusing, and as I said, the goal is not to teach bioinformatics, so I can't be extremely detailed here, but I'll do my best. Again, this material is available at training.nextflow.io; you can read it more calmly afterwards, at your own pace, and of course you can go to the Slack channel and ask all your questions, even the bioinformatics-related ones. Here I won't go into too much detail, simply to make sure we can cover the whole material in the time we have. In the workflow description section, we're going to see what happens in every step as if we didn't have Nextflow: just which tools are used, the parameters, the options, and so on, so that when we get to the pipeline implementation it's clear in your mind what the input files are and what the steps of our pipeline are. As I said, the goal of this pipeline is variant calling.
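Before going step by step, it may help to see the shape the DAG above will eventually take as Nextflow code. This is my own illustrative paraphrase of the diagram: the process names and their exact input lists are a sketch, not copied from the course material, so expect the real pipeline to differ in detail.

```nextflow
// Illustrative wiring of the DAG; names and inputs are a paraphrase.
workflow {
    reads_ch = Channel.fromFilePairs(params.reads)

    PREPARE_GENOME_SAMTOOLS(params.genome)              // genome -> .fai index
    PREPARE_GENOME_PICARD(params.genome)                // genome -> .dict
    PREPARE_STAR_GENOME_INDEX(params.genome)            // genome -> STAR index
    PREPARE_VCF_FILE(params.variants, params.blacklist) // filtered known variants

    // One mapped BAM per sample, then CIGAR splitting and recalibration
    RNASEQ_MAPPING_STAR(params.genome, PREPARE_STAR_GENOME_INDEX.out, reads_ch)
    RNASEQ_GATK_SPLITNCIGAR(params.genome, PREPARE_GENOME_SAMTOOLS.out,
                            PREPARE_GENOME_PICARD.out, RNASEQ_MAPPING_STAR.out)
    RNASEQ_GATK_RECALIBRATE(params.genome, PREPARE_GENOME_SAMTOOLS.out,
                            PREPARE_GENOME_PICARD.out,
                            RNASEQ_GATK_SPLITNCIGAR.out, PREPARE_VCF_FILE.out)

    // Variant calling and post-processing close the graph
    RNASEQ_CALL_VARIANTS(params.genome, PREPARE_GENOME_SAMTOOLS.out,
                         PREPARE_GENOME_PICARD.out, RNASEQ_GATK_RECALIBRATE.out)
    POST_PROCESS_VCF(RNASEQ_CALL_VARIANTS.out, PREPARE_VCF_FILE.out)
}
```

Notice how the genome-preparation outputs fan out into several downstream steps, exactly as the edges in the DAG suggest.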
We're going to process some raw RNA-seq data in FASTQ format, which is the format it's expected to come in, and obtain a list of small variants: single nucleotide polymorphisms (SNPs) and indels, that is, small changes in the reads compared to some reference, or small insertions and deletions of bases. The pipeline is loosely based on the GATK best practices for variant calling, but again, the goal here is not to be up to date and use the best practices; it's to learn Nextflow by getting hands-on experience with a realistic pipeline, like we had with the RNA-seq pipeline in the foundational training. The steps are going to run independently for each replicate, the way we usually do in Nextflow. We're going to map the reads, and we're going to split them at the CIGAR. The CIGAR string is how an alignment describes the way each part of a read maps; at some point you can have an N operation, which represents a skipped region such as a splice junction, so we're going to split reads at those points. Then lots of things happen: there's some recalibration, and the variant calling is done. Let's go slower through every part. First, the software manuals: every program we're going to use is in this list here. We're going to use Samtools, Picard, STAR, VCFtools, and a lot of commands from GATK. You can click here and see the manuals and information about all these commands and programs. If you're feeling a bit lost about what these programs do, I recommend you pause the video now, go to the websites, and read about each of these commands or programs, just to make sure you have some understanding of what each of them does. Each of them actually does multiple things, so take your time. Now let's go to the pipeline steps. The first step is to prepare the data.
We have the reference genome, which is a FASTA file, genome.fa. We're going to use Samtools to create an index for that genome, and we're going to use Picard as well. You may ask why, since you're also going to use STAR at some point to do something similar. As I said, even though we have four input files, we're going to have a lot of intermediate files that are outputs of some processes and inputs to the next ones. These three indexes that we are building, with STAR, with Picard, and with Samtools, are required by later steps in the pipeline. Maybe it doesn't make a lot of sense to start from the bottom, but that's how we usually do it: we prepare what we need for the next step, then for the one after that, and so on. Then we're going to use VCFtools to handle the known variants file together with the blacklist file; the idea is to filter the blacklisted regions out of the known variants to reduce false positives. This is optional, but we are doing it here. The second step, after preparing these input files, is to map the RNA-seq reads from our samples (from the human cell line) to the reference genome. We're going to use a two-pass approach: the idea of the first pass is to create a table of splice junctions, and in the second pass we do the realignment. For the first pass, this is the command; I won't go into detail about all the parameters, as that's not the goal here, but we're going to use this command with all these parameters. We have the name of the sample and the first and second reads of the pair. Then we create a new genome index using the splice junction table created in the previous step, and finally we do the final alignment. This gives us a BAM file, which is an alignment file containing the reads in terms of the reference genome: how well and where they map.
We're going to use Samtools to create an index of this alignment file. I've said "index" a few times already, and maybe you're asking what an index is. These files are massive and not always easy to handle: you have this huge file, you want to search for some string or region, and that can be very computationally intensive. By creating these indexes, you make it easier for other tools to access information inside the file. These indexes are created for the next step, which is to split the reads wherever the CIGAR string has an N. We're going to use a GATK command to do that, SplitNCigarReads, and the idea is to produce more reads, because you're splitting reads wherever there's an N. After this split, we again have lots of parameters here: the GATK command we're going to use, the reference genome, the final alignment, the name of the output. We fix a few things here, and we allow N CIGAR reads; the goal is not to explain every option, since there are plenty of them, like in a lot of different programs. This is what's going to be done. As I said a while ago, this is as if we were not using Nextflow, so everything is hard-coded here; you can see I'm using the literal file names everywhere. It would break if I changed a single letter in a file name. With Nextflow, a lot of this will be handled for us, but before we get there, we need to understand every step of this relatively simple pipeline. The next step is to do a base recalibration. The thing is, this workflow does not include an indel realignment step.
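To make the CIGAR splitting idea more concrete, here is a tiny illustration of my own (it's not part of the training material). In a SAM record, the sixth column is the CIGAR string; a run like `30M200N46M` means 30 aligned bases, a 200-base skip on the reference (typically an intron), then 46 more aligned bases, and SplitNCigarReads cuts reads at exactly those N operations. A quick awk one-liner shows which toy records are spliced:

```shell
# Two toy SAM-like records; only the first six columns matter here.
# 76M        = 76 bases aligned contiguously (no splice)
# 30M200N46M = 30 bases, a 200-base reference skip, then 46 bases
printf 'r1\t0\tchr22\t100\t60\t76M\n'         > toy.sam
printf 'r2\t0\tchr22\t200\t60\t30M200N46M\n' >> toy.sam

# Print the names of records whose CIGAR contains an N operation
awk -F'\t' '$6 ~ /[0-9]+N/ { print $1 }' toy.sam
# prints: r2
```

So only `r2` would be split by this step, into one piece per M run around the skipped region.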
That indel realignment step was excluded because it's very time-intensive, and the goal here is not to spend the whole day waiting for something to finish; just like we only use chromosome 22, we leave this step out because of how long it takes. Instead, we do a base recalibration step, using GATK again with these two commands, and you have lots of parameters here: the known-variants filter and everything else. When we get to this point, we can finally do the variant calling. The variant calling is done with this command here, the GATK HaplotypeCaller, but before that we create an index and do some filtering. All these parameters basically follow some best practices: for example, flag clusters of at least three SNPs within a window of 35 bases, and so on with the other options here. We could spend hours going through every command and option in detail, but this is not a bioinformatics course. At some point we use grep and awk and Perl; these are interpreters and command-line tools, on Linux mostly, that you can use to filter these files and extract what's interesting to you. Basically, here we are keeping only sites that pass all filters and are covered by at least eight reads; in the end, that's what this weird-looking command does. After that, in the post-processing, we basically separate known single nucleotide variants, which will be used for the analysis of allele-specific expression, from novel variants, which will be used to detect RNA editing events. Then we compare these two variant files to detect common and distinct sites with bcftools; we use awk again, VCFtools, and at some point we use an R script to draw a plot.
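Going back to that "weird command" for a moment: this is my own minimal sketch of the same idea, not the exact filter from the material. Keeping only VCF records whose FILTER column is `PASS` and whose INFO field reports a depth of at least eight reads could look like this (the toy records below have an INFO field containing only `DP=`, which keeps the awk simple):

```shell
# Toy VCF body lines: CHROM POS ID REF ALT QUAL FILTER INFO
printf 'chr22\t100\t.\tA\tG\t50\tPASS\tDP=12\n'       > toy.vcf
printf 'chr22\t200\t.\tC\tT\t50\tPASS\tDP=5\n'       >> toy.vcf
printf 'chr22\t300\t.\tG\tA\t50\tSnpCluster\tDP=20\n' >> toy.vcf

# Keep sites that pass all filters AND are covered by >= 8 reads
awk -F'\t' '$7 == "PASS" { split($8, a, "DP="); if (a[2] + 0 >= 8) print }' toy.vcf
# prints only the chr22 100 record
```

The second record fails on depth and the third on the filter column, which is exactly the two conditions the transcript describes.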
We're also going to calculate read counts with a GATK command. So, quickly, because this part is not really the goal: what is going to be done in this pipeline? Prepare the data, map the RNA-seq data to the reference genome, split the reads, do a base recalibration, perform the variant calling (which is the main thing we want to do here), do some variant filtering, and then some variant post-processing. It sounds simple when we look at it in this summary and very complex when we look at the commands, but that's mostly because all these commands have plenty of options and parameters. Maybe you feel like we should go into detail here, but I'm going to show you one thing. If you go to the nf-core website and check the modules: a module is a component, like a wrapper for a tool, in a Nextflow pipeline, and we have over a thousand modules. So if you search for GATK, say, pick this one here, open its GitHub repository, and open the main.nf, you're going to see lots of different options. You see, it's not hard-coded: there are variables here to be replaced by some configuration, some settings, some process directives, some input variables. But you're not really going to fight with that or spend your time setting these things up. You're going to import this module into your pipeline, provide the inputs and the settings that matter to you, and the rest is already done. So in practice, when you're building your pipeline, if there is an nf-core module for a tool, you won't be fighting with all those options. Now that you've understood this part, let's go to the environment setup. What I want to show you here is that we're going to use Gitpod for the hands-on experience in this training.
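Back to those nf-core modules for a second: here is roughly what importing one into your own pipeline looks like. This is a hedged sketch, where the module path follows the usual nf-core layout and the channel names and argument list are illustrative, since the exact input signature varies between module versions, so check the module's own main.nf before copying anything.

```nextflow
// Assumed layout: the module was fetched with the nf-core tools,
// e.g. `nf-core modules install gatk4/haplotypecaller`.
include { GATK4_HAPLOTYPECALLER } from './modules/nf-core/gatk4/haplotypecaller/main'

workflow {
    // You wire up only the input channels that matter to you
    // (channel names here are hypothetical); the command-line
    // plumbing lives inside the module and in your config.
    GATK4_HAPLOTYPECALLER(bam_ch, fasta_ch, fai_ch, dict_ch, dbsnp_ch, dbsnp_tbi_ch)
}
```

That is the payoff of modules: the hundreds of options you saw in the module's main.nf never appear in your workflow code.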
Gitpod is a service that provides virtual environments in your browser. If you use VS Code to write code, it provides a version of VS Code that runs in your browser, backed by a virtual machine on their side. So if there's a power outage in your house and your computer turns off, when it turns back on and you return to your Gitpod environment, everything will be there the way you left it, even your text selection; it's not on your computer, it lives in the cloud. This is very useful because, instead of having to install Docker or do some configuration (maybe you use Windows, maybe macOS, maybe Linux, maybe a different version of something, or you don't have permission to install software), you avoid all the things that can be a hindrance when you're trying to learn. The idea is that you use a Gitpod environment so that you get an environment already built for you, with everything you need installed, including the files we will need for this analysis, like the input files. When you click here to open it on Gitpod, you'll have to create an account. There's a free account, and it's fine.
You are fine using the free tier. I would probably recommend a large machine, up to 8 cores and 16 gigabytes of RAM; with this you won't have as many free hours as with the standard one, but the large one will make things run quicker, so you'll spend your time learning rather than waiting for the pipeline to finish. After you've created an account, you can choose this option and click continue. This will open VS Code in your browser, as I said. I already have an instance open here, with some cache, so that we don't waste too much time. What you're going to see is the simple browser here with the training material, the one I was showing you before, and then you have lots of files here and so on. So, the first thing to do: let's go to the hands-on training environment setup. There's a bytesize talk on the nf-core YouTube channel where I explain lots of things related to Gitpod, so I won't go into much detail here, but basically know that you're going to have a terminal at the bottom; on the left you have a file explorer, and you can open a file from the terminal by typing code followed by the file name, and it will open here as a tab so you can have a look at it. Let me move this here. Okay, when you open this, you're going to be in the nf-training folder. This is not what you want; that folder is for the foundational training, as it says here. The first thing to do is type this command. Whenever you see these boxes in the training material, you can click on this icon to copy the content; then you come here, right-click, paste, and you'll be in the right folder.
You can type tree . here and you'll see all the files. I've already played around in this Gitpod environment, so there are a lot of extra files, but this is what you'll see at the beginning: there's a README file; there's a bin folder, which has a script file inside; there's a data folder, which has the input files we mentioned, plus the folder with the reads, which are the other input files we saw. Finally, there is a final_main.nf pipeline, which is the pipeline we're going to have built by the end. I was playing with it here; it took about 20 minutes to run. Let's see if we can find it here in the output... okay, this pipeline takes about 20 minutes in the end. I lost it here, but I can run it again with nextflow run and -resume. I already ran this pipeline, so all the steps are going to be cached: by using -resume, I'm asking Nextflow to rerun only the steps that were never run and reuse everything else from the cache. As you can see, there are many steps here, six of six, because we have six samples, and everything is cached, so no worries here. Okay, so we're going to use Docker. It's very important to use Docker, for reproducibility reasons, and in this training we're going to follow the best practice when it comes to container technologies and Nextflow, which is to use a single container image per process. Basically, this means that in the first step I'll have one container image, in the second step a second container image, and so on; containers are only shared when a step has the same tool requirements as a previous one. Here we're going to pick just one of these containers, one that has Samtools, and we're going to use Docker to pull it, enter it, and play with it a bit. So let's copy this command here.
I'm going to bring the terminal up a bit. It says I already have this image, because indeed I was playing with it before; if you had never pulled it, you would see the layers downloading, "download complete", and so on. Now I'm going to use this other command, a docker run: I'm going to run this container image; I want it to be interactive; I want it to be removed after it stops; and I want a terminal to play with. The entrypoint is going to be bash; bash is a shell where you can type commands, so I want this container to run bash and I want to be able to interact with it. Press enter, and at some point I'm given a shell. I'm not on Gitpod anymore; I'm root, because inside this container I'm root. If I run samtools --version, I get samtools 1.3.1. If I do this outside the container (type exit to leave), samtools is not installed. So you see, you have an isolated environment with the tool you need to use. One important thing to do is to give execution permission to the gghist.R script file in the bin folder. With that, our setup is done: we have this Gitpod environment with everything we need installed, we have Docker, and we have Nextflow (we're using an edge version, I think a 23.x edge release). Okay, now let's go to the next part, which is the pipeline implementation. I'm going to hide the file explorer so
we have more screen space. Okay, and here starts the real part of this training. Now that it's somewhat clear to you what data we're going to use, what software we're going to use, and what we're trying to achieve, let's go to the Nextflow part, which is the real goal here. The material is separated into steps: the data preparation part has items A, B, C, and D, then comes the STAR mapping, splitting the reads, the recalibration, the variant calling, the post-processing, and then an overview of the results. Let's try to get the best out of this pipeline example. Regarding data preparation, we saw that we have this data folder here with all the reads, the blacklist.bed file, the genome.fa file, and the known variants VCF file, which is compressed. You already know that this reference genome is only chromosome 22. We have the reads here; as I said, they are 76-base-pair, paired-end reads. We have the variants file, which is the known variants, and then we have the blacklist file; everything is already there. The first thing we're going to do, now that we are in this folder (/workspace/gitpod/hands-on), is type code main.nf, and we're going to paste this code there: you can click on this icon here to copy, come here, paste, and press Ctrl+S (or Cmd+S) to save. Then we type nextflow run main.nf. Nothing is going to be printed apart from the usual header, because we didn't print anything here; we just set some variables. We are saying that the genome is this file, the variants are this file, and so on. And as we know, because these have the params. prefix, we can overwrite a variable definition by just passing --genome with something else.
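To make that concrete, the kind of parameter block we're talking about looks roughly like this; the file names echo the inputs described earlier, but treat the exact paths and glob pattern as illustrative rather than copied from the course:

```nextflow
// Default parameter values; each can be overridden on the command line.
params.genome    = "$baseDir/data/genome.fa"
params.reads     = "$baseDir/data/reads/rep1_{1,2}.fq.gz"
params.variants  = "$baseDir/data/known_variants.vcf.gz"
params.blacklist = "$baseDir/data/blacklist.bed"
params.results   = "results"
```

Running `nextflow run main.nf --genome /some/other.fa -resume` would then replace only params.genome and keep the other defaults. Note the convention: a double dash for pipeline parameters you defined (`--genome`), a single dash for Nextflow's own options (`-resume`).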
And then it won't be the default under $baseDir/data anymore. One thing we'll start doing now: every time we run this main.nf pipeline, I want you to also add -resume. This will save us from having to recompute everything every time we make a change. Another new thing you didn't see before (because so far we only saw the four inputs) is the results parameter: at some point we want to store our results in a folder, and I'm saying here that this folder will be called results. Here there are some instructions about copying, pasting, and so on; you run this, and then we get to the first problem we have to solve. The way I'm going to do this is: I'll show you the problem, give you some hints about the solution, and ask you to pause the video and try to solve it on your Gitpod instance. If after, say, five minutes you haven't made it work, come back here, watch the solution, and later, more calmly, go back and try to do it again on your own. Depending on your knowledge of Nextflow, you might get stuck, and it wouldn't be very productive to spend hours on one exercise. So my recommendation: if you can't solve it in five minutes, stop, come back to the video, keep watching, and return to it later. Our first problem is that we have all these parameter definitions here, but we don't have any input. So the first thing is to create a channel to get the reads, the information about the reads.
There's a tip here saying to use the fromFilePairs channel factory, which we saw in the foundational training. Channel factories are special Nextflow functions that create Nextflow channels out of regular values. You may have a string that points to a path where your files are, and you use fromFilePairs to read the files in that folder, telling Nextflow they come in pairs, and collect them as pairs of files. Again, don't forget the -resume. Pause the video now, because I'm about to show the solution. In the solution you see the blank; that's what we're going to have every time: wherever there's a blank, you have to replace it with the solution. Here it's a single line, and the solution is to use Channel.fromFilePairs and pass params.reads, which, as we saw, is a path to this folder matching every file ending in _1 or _2; because only two files per sample match that pattern, we get one pair of read files. The first time we build this pipeline, we'll run it with a single pair, because that gives us a very quick response; at the end, we can run with all the samples, which takes a bit longer but is more realistic, let's say. So now you can copy this, go to your main.nf, and add it at the bottom. If you run this again, you won't see anything, because there's no process and no printing, but it will create the channel, believe me. Now let's go to the next step, which is to create a FASTA genome index.
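To recap the solution we just saw, here is what that reads channel looks like on its own; the glob value is illustrative of the `_{1,2}` idea rather than the exact course path:

```nextflow
// fromFilePairs groups files matching the {1,2} glob into pairs,
// emitting one tuple of (pair_id, [read1, read2]) per sample.
reads_ch = Channel.fromFilePairs(params.reads)

// e.g. with params.reads = "data/reads/rep1_{1,2}.fq.gz" this emits
// something like: [rep1, [data/reads/rep1_1.fq.gz, data/reads/rep1_2.fq.gz]]
reads_ch.view()
```

The pair ID (the part of the file name before the glob) travels with the files, which is what lets later processes tag their outputs per sample.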
We have the FASTA genome reference, and we want to create an index for it using Samtools. Again, the material tells you the name of the process; what it does, in the script block, is create a genome index for the genome with Samtools; the input is the genome FASTA file (the reference genome), and the output is the index that Samtools creates. I even give you the whole process here. We're going to use this container, which comes from BioContainers. BioContainers is a project that creates a Docker container for every recipe in Bioconda; so if a piece of software is on Bioconda, there's a BioContainer with it. Sometimes you'll need containers with more than one tool, and then you can create a mulled container; in the containers section of the foundational training I show how to create a mulled container, or how to request one. We have the input, a path we call genome, and the output is going to be the name of this genome plus .fai. What this step does is run the samtools software with the faidx command and the name of the genome, which is a path here. The workflow block is what we had before, but now there's a blank. What's missing? A call to this process. So what you need to do in this task is write a line that calls the process with the input it requires. I'm going to show the solution, so if you want to try it, pause the video now: three, two, one. The solution is everything we already had in the workflow block, plus the new line, which calls PREPARE_GENOME_SAMTOOLS. You can even click this plus sign here: every time you see a line in a code block with a plus, you can click on it and get a description of that specific line. Here, we have to provide params.genome as input to this process. You may ask: why don't I just hard-code the path of the reference genome here? I know where it is.
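Putting those pieces together, the process looks roughly like this. Treat it as an illustrative sketch: the container directive is omitted here because the material pins a specific BioContainers image whose exact tag you should copy from the course page.

```nextflow
/*
 * Create a FASTA genome index with samtools faidx.
 * (container directive omitted; the course pins a biocontainers image)
 */
process PREPARE_GENOME_SAMTOOLS {
    input:
    path genome

    output:
    path "${genome}.fai"    // only the .fai file goes to the output channel

    script:
    """
    samtools faidx ${genome}
    """
}

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)
    PREPARE_GENOME_SAMTOOLS(params.genome)
}
```

The one-line call in the workflow block is the whole exercise: Nextflow stages the genome file into the task, runs the script, and collects the .fai into the process output channel.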
It's in data, it's like genome.fa, and indeed it's there, but maybe someday you would like to use another reference genome, and writing it this way won't break your pipeline: every time you use the reference genome you're actually using `params.genome`, and this is overwritten when you provide `--genome` here, or `--variants`, or `--blacklist`. So this is the right way to write a flexible pipeline that is very resistant to changes; it won't break easily.

One interesting thing here, which I mentioned in the foundational training: you are calling a process, but you're not providing a channel, you're providing a string, a value. And we know that processes only receive channels. The interesting thing is that when you pass a regular variable that has a single value, which is the case here (one string), Nextflow will implicitly create a value channel and put the string or value you're providing as its single element. So when we pass `params.something` to a process, that process is actually receiving a value channel. If you don't remember the difference between a value channel and a queue channel, I recommend you go back to the foundational training, but in short: a queue channel is consumed, every element is consumed and then it's gone, and you cannot consume the same element in the same process twice. A value channel, on the other hand, can be consumed by the same process an infinite number of times, as long as the other inputs provide another element.

The next step is to create a sequence dictionary with Picard. Again, we have some definitions here: the name of the process is going to be prepare_genome_picard, and the command will create a genome dictionary for the genome FASTA, the reference genome that we have, with the Picard tools. The input, again, is going to be the reference genome, the genome.fa FASTA file, and the output is going to be a genome dictionary file created by Picard.

So this is the third problem. Here again we're going to have a blank somewhere, and we have to replace this blank with the solution. We have, again, the definition; we're using a different container image, because now it has Picard. The input we're going to call genome, just like before. The output now has a different format: we want it to be `.dict`, and we're going to use the base name of this genome file. It's going to be a path, and I want the base name as the name of this output file. I'm going to run the Picard CreateSequenceDictionary command on this file, and the output — I say here what it is.

If you are a bit new to Nextflow and it's not clear to you what these outputs are: the script block in your pipeline can create many different files. It can create log files, and depending on the program you're using it can create hundreds of files. In the end you don't really care; you just want the output of the program, the result of your analysis. So when you declare the output, lots of other things may also have been created by this command, but what you mean is: of everything created here, the only output I want to put in the output channel for the next processes is this one. What we're saying here is that only the `.dict` file created by this command is going to be added to the output channel, which is going to be the input channel of the next process.

So we have here everything we've seen so far, but now there's a blank. It explains what `baseName` does: it returns the file name without the file suffix, so if the file is human.fa, it's going to be just human — no path, no file format, just the base name. And then you can append whatever you want, which here is `.dict`. I'm going to show the solution again: just like before, we have to call this prepare_genome_picard, but we have to pass something, and this answer is kind of obvious, because we are creating an index again, with the reference genome again. I will ask you to pause if you want to try, but I'm going to open the solution in three, two, one.

Very simple, right? We're going to create an index from a reference file, same thing: just pass the genome reference file to this process. And you know why we are doing it this way. You could simply come here and keep adding these lines. So I hadn't added process 1A yet, but I could come here and get process 1A, and if I try to run this — always with `-resume`, so that we use the cache — it works.

So we prepared the genome with samtools here, and with Picard there. What we see here is one of one, because we have a single genome reference file, so it's one task. We have here the hash of the task, so we could do `tree work/ad/c94` (tab-complete it) under the default work directory, and we have here the index file that was created. We have the input, and that's the output. I can do the same thing for the other one, `tree work/fb/62`, and we have the genome `.dict` file. And this way we keep going, right?

Let's go now to 1C, where we're going to create our third index. This time we're going to use the STAR mapping software. The name of the process is going to be prepare_star_genome_index. What the command does is create a genome index with STAR for this genome reference file. The input, as always, is going to be the genome FASTA file, and the output will be a directory containing the STAR genome index. That's the way STAR works.
It creates a folder with a lot of different files. We have a container process directive again. If it's not clear to you what this does: process directives are these directives that take place at the beginning of a process block, and this one says: if I'm using Docker, this is the container image you have to use for this task. And if you're asking whether we are using Docker: we can look at the nextflow.config file here, and it has `docker.enabled = true`, which means that by default we always want to use Docker for all these processes, whenever there's a container directive.

As you can see here, it's again very simple, very similar to before. We have an input; the output is a folder now, so no file extension, we just say `genome_dir`. Here we create the folder, and we run STAR, and we say where the output is going to be. And in the workflow, we have to add a new line. Can you guess what the new line is going to be? prepare_star_genome_index, which is the name of this process, like before. And what are we going to pass as input? The same thing as before. So let's copy this. What we can do is copy everything and just keep replacing the workflow block; that way we always add the next process, and the workflow block stays up to date. We can run this with `nextflow run`, and in the meantime, while it's running, we can go to the next step to start understanding what's going on.

So now we want to filter the variant file. We have the known variants, which, again, are variants in the human genome that have been obtained, in this case, with multiple different software, so that we have high confidence in what was obtained, using pedigree information and all these things. Now that we have that, we want to filter with the blacklist file: these regions that are usually prone to artifacts, right?
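Backing up to 1C for a second, the directory-output process just described can be sketched like this. The container tag and the exact STAR flags are assumptions; with `docker.enabled = true` in nextflow.config, the declared image is used for the task.

```nextflow
// Sketch of the STAR genome-index process: the output is a directory,
// not a single file. Container tag and STAR options are assumptions.
process prepare_star_genome_index {
    container 'quay.io/biocontainers/star:2.7.10b--h6b7c446_1'  // assumed tag

    input:
    path genome

    output:
    path 'genome_dir'

    script:
    """
    mkdir genome_dir
    STAR --runMode genomeGenerate \\
         --genomeDir genome_dir \\
         --genomeFastaFiles ${genome} \\
         --runThreadN ${task.cpus}
    """
}
```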
We want to decrease the false-positive rate in the final analysis. So we're going to have two input files now; this is the first process in this training where we have more than one input file. Can you remember how that's written in Nextflow? That's how you do it: we just put each input on a different line. So I'm going to have one input that I'm calling variantsFile — whatever it is, inside the process I will be able to refer to it with the variantsFile variable — and another one, which is blacklisted. Here I have a mulled container, because I'm going to need VCFtools and tabix, so I need a container with these two softwares, and I had to request this mulled container to be created in the BioContainers project.

In the end we create a file containing the filtered and recoded set of variants. These are the commands we're going to use, and the output is going to be a channel element, which is one tuple with two items: the first one is the filtered variants, the VCF file, and the second one is the TBI, the tabix index. And then, again, we have to resolve this gap, this blank line. It's obvious that we have to call this prepare_vcf_file process, but what do we provide as input? We know it's variantsFile and blacklisted. What would we do? Maybe provide these two guys? If you want to try, pause this video, because I'm going to show the solution in three, two, and one.

And that's exactly what we have to do: provide the variants and the blacklist. In Nextflow, at least for now, we don't have named arguments, which means that whatever you put as the first argument is going to be bound to the first input, and whatever is second is going to be bound to the second one. So whenever you are unsure about which argument, which channel, which value should be first or second, you look at the process block and you see what's declared first and what's declared second. Those are positional arguments; that's the way Nextflow works by default.

So we take as input this guy here. As you see in the script block, we have these placeholders: replace this with the base name of variantsFile, exclude the bad regions with the blacklist file, and so on; in the end we use tabix. Let's copy this, replacing our workflow block here at the bottom. Why am I doing that? Because the workflow block is always redone for us with the new line in the solution, and the new process is added. So let's run this again, now with `-resume`, and we're going to have our fourth step here. Good, let me clean the terminal a bit.

This ends the first part, processes 1A, B, C and D. Now that we have prepared these input files, we're going to start mapping the reads of the samples to the reference genome, using the indexes that we created, right? That's process 2: we're going to create a new process, which is called rnaseq_mapping_star. The input now is three inputs: we have the genome FASTA file, which is the genome reference file, the .fa; we're going to have the STAR genome index, which is the index we created with STAR; and we're going to have a tuple containing the ID as the first item, and as the second item a list of the reads.

As you see here, we have this guy, for example, COQ1, and we have read one and read two. The first part, before the underscore, is the ID — you can see it this way — and these two files are the reads. When you do `Channel.fromFilePairs` and you provide a path, the way we did at the very beginning, at the beginning of the workflow block, what is it going to do? I can add a view here, so you see the contents.
Let's do this. The view is a channel operator in Nextflow that consumes every element of a channel and prints it to the screen. So we have here the ID first — everything here is one channel element, a single channel element — the first item is the ID, and the second item is a list, see the brackets: a list with the other items, which here are read one and read two, the path to each of them.

So in this mapping, we have to map every read pair to the reference genome using the index, and in the end we're going to have something similar to that: a replicate ID, but now with the aligned BAM file and the aligned BAM index file. This is the file showing how every read mapped to the reference genome. So we have the whole process here. It's mulled again, because we use STAR and we're also going to use samtools, so we need two softwares in one container image and we had to create a mulled container.

Three inputs: the genome, which is a path — here the names kind of help you find the answer, but remember, you can put any name here; whatever the first one is, when you want to refer to it within the process, you're going to use the name you chose here. We chose genome for the first one, genomeDir for the second one, and a tuple for the third one. The first part I want to refer to as replicateId, and the second one as reads; as we know, the second part is a list, so reads is going to be this list with all these items, which are the reads. For the output, we already saw that we want to have the replicate ID, the BAM file and the BAI file, the index of the aligned file, right?

We have here this STAR command with all these options. Using reads here is going to expand to the two reads, read one and read two, for the alignment. And you see we are passing `task.cpus`: cpus is a process directive in Nextflow, by default one, but if we had here something like `cpus 4`, it would try to use four CPUs, right?
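The `cpus` directive and the `task.cpus` placeholder just mentioned can be illustrated with this tiny sketch (the process name and script are made up for illustration; they're not part of the pipeline):

```nextflow
// Illustration of the cpus directive: task.cpus reflects whatever the
// directive requested, and defaults to 1 when the directive is omitted.
process cpu_demo {
    cpus 4

    output:
    stdout

    script:
    """
    echo "running with ${task.cpus} cpus"
    """
}
```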
And we have all the options here again. If you have questions about them, you can go to the beginning of the pipeline-implementation section and check the manual of every tool. This is the first pass we saw; then we have the second pass, the final read alignments that we saw in the workflow description; and in the end we're going to create the index of the BAM file, to help us work with these aligned reads in the next steps.

And now we have the blank here again. The blank is going to be a call to this process, but we have to provide these three inputs. If you want to try that, pause the video again. I think five minutes is enough to try to solve these questions; if it's taking longer than five minutes, I would recommend you stop, come back to the training, and try the next ones — later you can come back and try again, or redo it, or something like that. And any question you have, either during the training or afterwards, you can go to the Slack workspaces and ask in the sept23-hands-on-training channel, okay?

So I'm going to open the solution: three, two, one. The solution is basically to pass `params.genome`, which we used before — it's the reference genome — and then the output of prepare_star_genome_index. This guy here, prepare_star_genome_index, is the one that creates the index with STAR for the reference genome, and because we're using STAR here to align, we also need this index. By using `.out` you are getting the output channel from this process; so when it's done, when it puts the results in the channel, I want this channel to be the input of the next one, which is this one here. And then we are going to pass the reads that we just saw here; it's a channel created with the `fromFilePairs` channel factory.
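The wiring just described can be sketched as follows. Process and channel names follow the transcript; the exact casing and argument order in the training material may differ, so treat this as a sketch.

```nextflow
// Hedged sketch of the workflow wiring for the mapping step.
workflow {
    reads_ch = Channel.fromFilePairs(params.reads)

    prepare_star_genome_index(params.genome)

    rnaseq_mapping_star(
        params.genome,                  // reference FASTA (implicit value channel)
        prepare_star_genome_index.out,  // .out = the output channel of that process
        reads_ch                        // tuples of [replicateId, [read1, read2]]
    )
}
```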
So let's copy all of this, replace our workflow block again — and as you can see, the pipeline is getting kind of long already, right? We have 1A, 1B, 1C, 1D, and now we get to the alignment; everything is here. So let's run this again with `-resume`: the first four steps are cached already, so only the fifth will be computed. That's what we're seeing here. We only have two reads of the same sample, so it's going to be one of one: we're going to align this read pair to the reference genome using the index that we created before. While it's running, let's go to the next step.

The next step is a filtering step using GATK. So, for example, we're going to have these reads with the CIGAR information about the quality and how the mapping occurred. We're going to look for the Ns, and then we're going to split these reads at the Ns, which is going to create more reads for us. Here it shows you that the Ns correspond to the segments of the original read between the splicing events; they're represented by these Ns.

So here we have four input files. The process is going to be named rnaseq_gatk_splitNcigar. What we're going to do is split these reads on the Ns using GATK, and the input is going to be the genome FASTA file again, the index made with samtools, the genome dictionary made with Picard, and a tuple containing the ID, the aligned BAM and the index of this alignment, which is the output of the previous step. And the output is going to be a tuple containing the replicate ID, the split BAM file and the split BAM index file.

So this is the process block, using the SplitNCigarReads command from GATK. Here we have the indexes, here the inputs — sorry, here we have the outputs. We have this container image from the Broad Institute with GATK, and we have here a new thing, which is a process directive called tag. This tag is very interesting, because when you have multiple samples and you see one of ten, two of ten, sometimes you want to know which sample is being processed at that specific moment, and when it fails, you want to know which one failed and which ones didn't. By using this tag and providing the replicate ID here, we're going to make it appear beside the name of the process: which sample is being processed by that task at that moment. It's just something we add; it doesn't really change anything in the analysis, but it makes it easier for you to debug and monitor your pipeline execution.

So again, we have here the blank, and I'm going to give you a few minutes to try to do it. Pause if you want to try; I'm going to open the solution in one, two, three, go. And that's what we have: you see, we have the output of the process that created an index with samtools, the output of the process that created a sequence dictionary with Picard, and the output of the process that created — not an index, but the aligned files with STAR. Let's copy this, override our workflow block, and run again.

This tag is also useful in the previous step: whenever you have multiple samples going through a process, it's useful to use this tag so that you can see which sample is being processed at the time. And maybe, while it's running, it would be interesting to remind you that a process is a definition. What we write here is a process, right?
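Before going on, a quick sketch of the tag directive just described (the process name and script here are made up for illustration; tag only changes what the execution log shows, not the analysis):

```nextflow
// Illustration of the tag directive: the tagged value appears beside the
// process name in the run log, which helps debugging and monitoring.
process tag_demo {
    tag "$replicateId"

    input:
    tuple val(replicateId), path(bam)

    output:
    stdout

    script:
    """
    echo "processing ${replicateId}"
    """
}
```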
This is a process, but every time you instantiate this process you have a task. So if I have three samples, I'm going to have one process, which is one definition, but I'm going to have three tasks, three instances of this process, one for each sample; the inputs are going to be different for each of these instances. So here you can see the name of the sample, the replicate ID: it appears here between parentheses. Good.

One thing we could do, just because we are using all these `.out` calls: let's see what they look like. Let's just add `.out` here to get the output channel, and view to print it, and I'm going to rerun this pipeline. Everything is going to be cached, because we just ran it, but it will print to the screen the output of the latest process that we ran. And here we have the replicate ID; we no longer have a list of files — the second item of this channel element is the split BAM, and the third one is the split BAI, which is the index of the aligned file, right? Cool, let me remove that.

And we are going to go to the next step, which is to do the base quality score recalibration using GATK. Here, as the input, we're going to again have the genome FASTA file, the index made with samtools, the sequence dictionary made with Picard; now we need a tuple containing the replicate ID, the aligned BAM and the aligned BAM index from process 3 — it's the split output that we just saw; that's what it needs as input. And we also need the tuple containing the filtered and recoded VCF file and its tabix index. Remember which one we used? We did this — we can come back here, where we have the tabix. Okay, it's process 1D. So we need the prepare_vcf_file output as an input of this one here.

And as output we're going to have a tuple. Whenever you see tuple, you can think of it as one thing with many things inside, right? So here we have a tuple containing the sample ID, the replicate ID, the unique BAM file and the unique BAM index file, after this recalibration, right?
So here we have the process, another mulled container, because we're going to need GATK and samtools, right? As input, we already saw what the input is: replicate ID, BAM and BAI; here we have the prepared variants file and the prepared variants file index. As output, a tuple which has the sample ID and the replicate ID — it's the same thing, we just use different names so that we can refer to both IDs, but this one is from this tuple and that one from the other tuple, so we use different names. And here in this script block we're going to derive the sample ID: I'm going to take the replicate ID and apply this regular expression to it. Then we're going to run BaseRecalibrator and PrintReads for the final BAM file, and then we're going to use samtools to create an index for the final, unique BAM files from the alignments.

And now you have to fix this blank, right? Here we go: if you want to try, pause this video; I'm going to show the solution in three, two, one. Here we have `params.genome`, the reference genome; the output of the samtools index; the output of the Picard sequence dictionary; the output of the reads split on the Ns in the CIGAR; and the output of prepare_vcf_file, the filtered and recoded set of variants. So let's copy this and replace our workflow block. In case it's not clear: we are always replacing the workflow block because new things have been added, and then we also add the new process we just created, like in the previous steps, step four, step three and so on. Let's run this again with the cache, and because we used `-resume` and most things are cached, only the final process that we just wrote will be computed. So: one sample — we align one sample, we split one sample, we recalibrate one sample. It's very quick, because we are only using one sample with one read pair, two reads, right?
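The recalibrate call just described can be sketched like this, wired from the earlier outputs. Names follow the transcript; the exact casing and argument order are assumptions, so check them against the process's input block.

```nextflow
// Hedged sketch of the recalibration wiring described above.
workflow {
    // ... earlier calls as in the previous steps ...
    rnaseq_gatk_recalibrate(
        params.genome,                // reference FASTA
        prepare_genome_samtools.out,  // .fai index from samtools
        prepare_genome_picard.out,    // sequence dictionary from Picard
        rnaseq_gatk_splitNcigar.out,  // [replicateId, split.bam, split.bai]
        prepare_vcf_file.out          // [filtered.vcf.gz, filtered.vcf.gz.tbi]
    )
}
```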
Now we're finally going to do the variant calling. We're going to use the HaplotypeCaller command of GATK. You can find information about the parameters here; just like at the beginning, you have the GATK manual with all these commands that you can investigate further. And again, feel more than welcome to pause the training, go read a bit, and come back so that you have a better understanding — it's completely fine. This training video will be made available on YouTube afterwards, so you can check it whenever you want; it's always going to be there, just like the foundational training and the previous training sessions provided by both Nextflow and nf-core.

So in this next one, rnaseq_call_variants, we're going to have as input the genome FASTA file, the index made with samtools, the sequence dictionary made with Picard, and a tuple containing the replicate ID, the aligned BAM file and the aligned BAM index from the previous process — it's already recalibrated, right? And the output is going to be a tuple containing the sample ID and the resulting variant calling file, which is what we are after, even though we still need to do some post-processing.

So here, note that in process 4 we use the sample ID, not the replicate ID, as the first element of the output tuple, and that's what we're going to use now: we combine the replicates by grouping them on the sample ID. It follows from this that process 4 is run one time per replicate, and process 5 is run one time per sample. So here we have the calls to GATK to do the HaplotypeCaller and the VariantFiltration, and you have to fill this blank. I'm going to ask you to pause if you want to try to solve it, and I'm going to show the solution in three, two, one. And that's what we have: `params.genome` like before, but now we get the `.out` from samtools and Picard, and now the `.out` from recalibrate — we're not going to get the VCF here, or the split CIGAR output, because we recalibrated these guys, right?
So, the output of recalibrate. Sometimes you can have multiple output channels from the same process; everywhere here, even though we have multiple input channels, we always have one output channel — that's why we're using `.out` and we get it directly. If you have multiple output channels, things can be more complicated: then you have to specify the position of the output, the first or the second, or you can use the `emit` keyword, which gives you a named output — you can think of it like that. But here it's simple: we have one output channel, so we don't have to worry about that for now.

Again, we're going to use the tag, here with the sample ID, so we can easily see which sample is being processed. Let's copy this like before, override the workflow block, paste it, and run it. Oops — no container here, so let's get the container line, and now it will be able to find the program to run this. And now we are calling our variants. Good.

In the next step we're going to start our post-processing. We're going to create two processes: one for allele-specific expression and the other one for RNA editing analysis. We're going to process this VCF result — the variant calling we just did — to prepare these variants for the allele-specific expression, right? We'll look at both processes together; I think by now you're already used to all these things we are doing, so let's make it a bit more difficult than one process at a time.

The first process is called post_process_vcf. It does post-processing of the VCF file for each sample. The input is the tuple containing the sample ID and the VCF file, which is the output we just produced here, right? And the output is simply the sample ID and the final VCF file. You may ask: Marcel, you are saying that the output file is named final.vcf, but if I have many samples, won't I have a conflict, with one file trying to overwrite the other?
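Before answering that question about file-name conflicts, a quick aside: the `emit` keyword mentioned a moment ago can be sketched like this (the process name and script are made up for illustration):

```nextflow
// With emit you address one of several output channels by name
// instead of by position (.out[0], .out[1], ...).
process align_demo {
    input:
    val sample

    output:
    path "${sample}.bam", emit: bam
    path "${sample}.bam.bai", emit: bai

    script:
    """
    touch ${sample}.bam ${sample}.bam.bai
    """
}

workflow {
    align_demo(Channel.of('s1'))
    align_demo.out.bam.view()   // only the BAM channel, not the index
}
```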
So, you should be aware that in Nextflow, when you look at this work directory, the reason we have so many different folders is that Nextflow has an isolated folder for every task. So if the same process has ten tasks, you're going to have ten folders, and identical file names won't be an issue here. Of course, if the channel had a single element, which is just the file, it could be some issue — even though the paths are different, so actually it wouldn't be — but here we have the sample ID in the tuple, so we don't have to worry about the file name.

The output is going to be a tuple containing the sample ID, the VCF file and a file containing common SNPs. Then the second process — but maybe we should go back to the first process and see it in more detail. We have a mulled container, because we need VCFtools in a specific version; grep was there by default, but we needed VCFtools in a specific version that we couldn't find, so we made a mulled container. It has a tag, and now we have a publishDir process directive. The publishDir directive says that you want to store some output files of this process in a very specific folder: you don't want them only in the work directory, with all those crazy hashed paths that you saw; you want a specific one. We showed at the very beginning that params.results is a folder called results, so I want the output of this process to be in a folder called results, slash, and then something — slash sample ID. Where does the sample ID come from? From the input, which we saw already.
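The publishDir directive just described can be sketched like this. The process name follows the transcript, but the script body is a placeholder, not the real post-processing command.

```nextflow
// Sketch of the publishDir directive: outputs are copied to
// results/<sampleId>/ in addition to the hashed work/ directory.
process post_process_vcf {
    tag "$sampleId"
    publishDir "${params.results}/${sampleId}"

    input:
    tuple val(sampleId), path(vcf)

    output:
    tuple val(sampleId), path('final.vcf')

    script:
    """
    cp ${vcf} final.vcf   # placeholder for the real filtering commands
    """
}
```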
It's the sample ID, and the second item of the tuple is the final VCF file; but we're also going to get the filtered and recoded VCF file and its TBI, which is the output of the 1D process. As output, we're going to have the sample ID and the final VCF file, but we're also going to have a new file, which contains the common single-nucleotide polymorphisms. We use grep to do some filtering here, and then we run the VCFtools command.

For the second process, we're going to have prepare_vcf_for_ase — ASE being allele-specific expression. The command is going to prepare the VCF for allele-specific expression and generate a figure in R about it. The input is a tuple containing the sample ID, the VCF file and the common SNPs, so it's obvious here that the input for the second process is the output of the first one, the one we discussed previously. And the output is going to be a tuple containing the sample ID, the known SNPs in the sample for ASE, and the path to a figure of the SNPs generated in R, as a PDF file.

So here we have the two processes, and now, in the workflow block, we have this blank, which obviously is going to be more than one line, because we're going to be calling two processes. For this one take maybe even more than five minutes, maybe eight or ten, and you could also go back and retry any earlier one where you got stuck. I'm going to open the solution in three, two and one, and here you are: we're going to call post_process_vcf with the output we saw, and the next one with the output of the previous one. So let's copy the solution and overwrite our workflow block, because it has the last two processes in the new workflow block, and let's run this with `-resume`, so we don't lose the cache. Be very careful, because if you run this without `-resume`, it's going to override the tasks that you already ran, and it's going to mess up your whole analysis in the sense that you'd have to run everything again. Nothing bad is going to happen,
it's just that you're going to waste time. The first one was very quick, the second one too. Good.

Now we can go to the final step, process seven. Here it's a bit more complicated. We haven't really seen data being processed outside of processes in Nextflow yet, but one of the features of Nextflow is the operations that can be performed on channels outside of processes. Sometimes you have a channel and you want to run a process, but the input of the process is in a different shape from the current format of your channel, so you can use channel operators to operate on this channel and transform it for you, right? You can go to docs.nextflow.io and you'll find a list of channel operators there.

So here, as you can see, we're going to use channel operators. We're going to get the output of the recalibrate step — that's why the `.out` — and do something with it. In the first blank we're going to use an operator that groups tuples that share a common first element, which is the replicate ID, right? In the second one we're going to use an operator that joins two channels, taking a key into consideration; you can click here for more details, and it will take you to an operator called join in the Nextflow documentation. Then we're going to use map, which is a channel operator that applies a function to every element of a channel — again, here it explains what I just said — and set is just a channel operator to save this channel, after all these transformations, under a new name. And here it's telling you to use grouped_vcf_bam_bai as the name of the output channel.

For this one, take maybe even more than ten minutes. I think you should really try, so if by now you haven't really stopped, now is the time.
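For reference once you have tried it yourself, the shape of the operator chain just described looks roughly like this. Channel names and the element layout inside `map {}` are assumptions; adapt them to what your channels actually contain.

```nextflow
// Hedged sketch of the channel-operator chain for process seven.
rnaseq_gatk_recalibrate.out
    .groupTuple()                  // group tuples sharing the same first element (the ID)
    .join(post_process_vcf.out)    // join with another channel on that matching key
    .map { id, bams, bais, vcf ->  // reshape each element for the next process
        tuple(id, vcf, bams, bais)
    }
    .set { grouped_vcf_bam_bai }   // save the transformed channel under a new name
```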
I think this exercise is difficult enough, and important enough for understanding the basics of channel operators in Nextflow, that I would really recommend you stop for at least ten minutes, try to do it, run the pipeline, and see if it works — because this one is very important for you to solidify, let's say, your Nextflow knowledge. You can read a lot about Nextflow: you have these training materials, you have so much content on YouTube, you have so many pipelines to look at, you have the whole Nextflow documentation; but if you don't get your hands dirty, if you don't try to write your pipelines or to modify pipelines, it will be very hard for you to master Nextflow. So I really insist that you should pause the video and try to write some code based on these instructions. And of course, if you open this training material, you can click on Solution and see the solution, but I recommend you not do that: try for at least ten minutes before you look. So pause now if you want to try, because I'm going to open the solution in three, two, one.

So that's the solution, and what I'm going to do with you is run this one step at a time, so you see what's really happening. Let's override the workflow block here, and I'm going to comment everything out. The first thing I want to show you is what this channel looks like — with `-resume`, of course, so we don't have to recompute everything; everything is already computed. I'm just going to print to the screen, and I have one channel element, which is an ID, the BAM and the BAI. Cool. Now I want to groupTuple. Here nothing is really going to change, because I only have one key, right? So actually it would be nice to use more samples. So what we can do is run this again, but I'm going to do reads — oops — `--reads`, pointing at the data folder, this guy here.
I'm going to use the COQ prefix and a star, so it's going to get both the COQ1 and COQ2 fastq.gz files. So this will run the whole pipeline again; even though for the first sample we already have cache, we don't have it for the second one. All of this part is cached, because creating the indexes and everything, and the VCF file, doesn't take the reads into consideration, right? The reference genome is the same, it's the only one; the known variants and the blacklist, there's only one of each, so nothing changes. But when it gets to the mapping, now it's a different thing, because we have new samples, right? And it's going to run everything again, because we passed a different path. I think the cache won't work here, so it will run all of this for these reads, which are two samples. This is going to take a while, so in the meantime I think we could have a look at nf-core. I want to show you some pipelines.

Let's have a look at, for example, viralrecon. Viralrecon does assembly and intra-host low-frequency variant calling for viral samples. Maybe it's not what you want to do for variant calling in your work, but let's have a look at what it does. Every time you are looking for an nf-core pipeline, you can come to this page, Pipelines, and click the one you want. You're going to have this first Introduction tab with a lot of information about the pipeline. You have the subway map, explaining all the steps, all the different paths; you can skip some steps, you can do different types of analysis. So when you see these different colors here, they're actually different analyses, right?
So we're going to use iVar and the consensus for the blue line; the pink one uses BCFtools; and so on. You have these different paths in your subway map. What it shows here is what's done: you have the merging of re-sequenced FastQ files, you do FastQC, you use fastp to do adapter trimming, you use Kraken 2 to remove host reads. Then the variant calling: you're going to align the reads, sort and index these alignments (like the BAM we just did), do primer sequence removal, duplicate read marking, and alignment-level QC, the quality control. We do lots of things for this variant calling part, with consensus and everything else; at the end there's the de novo assembly, and by the end of it you should have enough files to present the QC and visualization for raw reads and alignments using MultiQC. You're going to have a report about everything it did for you. This first one is for Illumina, and here you have it for Nanopore, so this pipeline works for both types of data.

You have a quick start here saying the minimum version you need to use this pipeline. You have to install Docker or Singularity or Conda or some container technology, right? That way you don't have to install all these pieces of software: you just have to pull the pipeline, have your data, and run it. Because you have Conda or Docker or something like this installed, it creates the containers, pulls the container images with all the software installed (libraries, configuration) and runs the pipeline for you. You can provide `-profile test` to use input test data from the pipeline; so if you don't have data, but you want to see how it works and what it looks like in the end, you can do that. And you have to specify the output directory with `--outdir` here. There are some other instructions here for different formats and so on, different data, some documentation, and some credits, right?
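The quick start being described boils down to a single command; a sketch of it, assuming Docker is the container engine you installed (swap the profile for `singularity` or `conda` otherwise):

```bash
# Pull and run nf-core/viralrecon on its bundled test data
nextflow run nf-core/viralrecon -profile test,docker --outdir results
```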
People who have contributed to this pipeline. Cool. If you go to Usage, you're going to have some extra information about the usage: what the parameters and options are, the extra information required to use this pipeline. Here you have a very long and detailed description of every parameter, every option: the type of the data, the help text, what's expected, whether it's mandatory or not. As you can see, you even have hidden parameters; it's a very extensive, detailed list of all the parameters of this pipeline specifically. Here you have information about the output, which is very nice, and you even have some examples of output: for Illumina, for example, you're going to have an example of the output of every step that gets published. And here you have some information about the releases, right? The latest one was 2.6.0. For many pipelines, the most famous ones, you already have a bytesize talk, in which one of the authors usually presents the pipeline in detail for you. The bytesize talk is going to be here, on the right of every pipeline. You can go to the Sarek one, for example, another very famous one that also detects variants, and you have the video there, right?

Let's see if it's over... it's over. Okay, so now, not quite what I expected. So let's try to fix this; let's go to the workflow block. It's like live debugging: I made a mistake, let's try to find out what happened. I'm going to do a `view` here; let's see what's inside the reads channel. I wanted to get those four files, but I think, yeah, it's not working; it's only getting the first. I'm going to use a `return` here; it's a nice trick when you want the pipeline to stop at this point and not do anything else, so we don't waste much time. I have to use double quotes, I think, so the star here gets resolved as everything that has this beginning. Let's see if that was the mistake... yeah, that was the mistake. Okay.
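The `view` plus early `return` trick used here for live debugging can be sketched like this (the channel construction is illustrative, not necessarily the exact line from the training pipeline):

```nextflow
workflow {
    // Build the input channel from the --reads glob;
    // quote the glob on the command line so the shell does not expand it
    reads_ch = Channel.fromFilePairs(params.reads)

    // Print every channel element, then stop the workflow right here,
    // so nothing downstream runs while we inspect the contents
    reads_ch.view()
    return

    // ...the rest of the workflow is skipped while debugging
}
```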
Now you see we have the two samples I want, so I remove the `return` and I'm going to run this again. Now the cache works: the cache works for the first sample, with its two reads, and now it's doing the second one. So the mapping of the first one, COQ1 here, has already occurred; now we are working on the second one, COQ2. It's going to take a while to do this, so let's go back to the nf-core pipelines page while this is running.

There are many different pipelines here that do variant calling; if you search for "variants", you're going to see a few of them. Here, for example, you can call and score variants from whole-genome sequencing. Some of them are under development, some of them are archived, some of them are being kept up to date and have a recent release and everything else. For all of them, you can go there. Let's take, for example, viralintegration: we're going to have here the Introduction, Usage, Parameters, everything. You can see it's a very early version, but there's a lot of diversity here in what you can find. You can see them here in cards, and you can sort by different things. So, as I said at the very beginning, if you really want to have a look at a variant calling pipeline, a real one, this is the place to go. Let's go to rnavar, for example. Here you can click on the GitHub repository, right?

So, one thing I do a lot, and because this is an intermediate-level Nextflow training, let's say I'm going to give you this tip here: a lot of times I want to do something and I'm not sure how it's done. Because the nf-core project has so many pipelines, so many modules, so many projects with best practices, you can just go to the nf-core GitHub organization and type in the search box. For example, let's say I want to use `publishDir`: I type this and I'm going to find lots of different... uh, okay, not a lot, it's not that common.
I'm going to find here a thousand occurrences in code, you know: in rnafusion, Sarek, rnaseq, all these different nf-core pipelines, using `publishDir` in many different ways. If something is a bit more complicated, say people use `groupKey` and it's not very easy to understand what it does, you can just type `groupKey` here and you see all these examples of nf-core pipelines using it. Or, I don't know, say someone uses `flatMap`; it's a channel operator and it's not clear to me what it does. You can come here and see how it's done, like this example here on nf-core. Let's see how it's working here...

Don't tell me the browser killed my tab. I'm going to refresh this one; and you know, it's not on my machine, it's in the cloud, so it should still be running the way I left it. Where's my terminal? Give it back to me... Yep, it concluded. So you see, it started from the mapping part, we have now two of two, and then we have two of two for SplitNCigar, recalibrate, call variants and everything else.

So now let's do what I was trying to show you. Everything's commented out. I want to `groupTuple`, right? Here the `view` is showing what's in the recalibrate output channel, but I don't want that anymore: I want to `groupTuple` and then `view`. Okay, let's do it this way and see what's inside. So for the recalibrate we have two of two. Let's see the output channel. Okay, in the first channel element we have this as the key, and here the BAM with the BAI; here for the first one, here for the second one. As you can see, they have the same replicate ID, because they are the same replicate, right, COQ1 and COQ2. So I'm going to use `groupTuple`: I'm going to uncomment this, remove this `view` here, and `view` afterwards, because I want to see the new channel after the `groupTuple`. And you probably already know what's going to happen: it's going to be one channel element now, because they get grouped based on this key. So we have this key, and we have now a list of all these files.
So: BAM and BAI, BAM and BAI. The first item is the replicate ID; the second item is a list of the BAMs of both; and the third item is a list of the BAIs. Now I want to join this with the output of the PREPARE_VCF_FOR_ASE process. But as you saw in this process, this time we have multiple outputs, and we used the `emit:` that I told you about to give a name to each one. So we have two outputs, and what I'm saying is that I want this output of the process: `.out.vcf_for_ASE`. That's what I'm saying here. So I'm going to take this output channel and join it with the previous one, and because they have the same key, the same first element, this is going to become one channel element with everything. So let's put the `.view()` here. Oops. Let's run this again and see what the output will be like. We have one channel element with the replicate ID, the first item, the second item, the third item, and now we also have the known SNPs. Good.

The fourth line here uses `map`, which, as I already told you, applies a function to every channel element. Remember what I told you before about the arrow (`->`): on its left you're declaring how you want to represent what's inside the channel element. It's saying there's a meta, which is the replicate ID; there's a list with the BAMs, a list with the BAIs, and the VCF. And I want to change the order: I want the VCF to be the second one, the BAMs to be the third one, and the BAIs to be the fourth one. So the only thing we are doing here is reorganizing the order of the items inside our channel elements. And then we save that, with `set`, into a channel called `grouped_vcf_bam_bai_ch`. In the end we can see it with `view`, and you see that basically what changes is the order: the VCF, which was the last item, is now going to be the second one.

Good, let's go back to... or rather, let's close this file explorer first.
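Put together, the chain of operators we just stepped through looks roughly like this (process and channel names follow the training material; treat the exact tuple shapes as a sketch):

```nextflow
RNASEQ_GATK_RECALIBRATE.out
    // group the (replicateId, bam, bai) tuples by their first element,
    // yielding one element per replicate ID with lists of BAMs and BAIs
    .groupTuple()
    // join with the named VCF output of PREPARE_VCF_FOR_ASE,
    // matching on the shared first element (the replicate ID)
    .join(PREPARE_VCF_FOR_ASE.out.vcf_for_ASE)
    // reorder the items: VCF second, then the BAM and BAI lists
    .map { meta, bams, bais, vcf -> [ meta, vcf, bams, bais ] }
    // save the transformed channel under a new name
    .set { grouped_vcf_bam_bai_ch }
```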
Let's go back to our hands-on training pipeline implementation. We were here; yeah, and the solution we saw already is this one. Nice, let's remove this `view` here. Just one more time, to make sure I removed all the views and we have a cleaner screen. Everything's cached, good, and now we go to the final process, which is going to do the allele-specific expression analysis with the GATK tool ASEReadCounter. It's going to calculate allele counts at a set of positions with GATK tools. The input is the genome fasta file, the reference file we've used so many times; the index file from samtools; the sequence dictionary from Picard; and this channel that we just created. As I said before, sometimes you have a channel, and you have a process that wants to receive something like that, but not exactly that. Then you use channel operators to transform the channel in a way that it can be an input to the next process you want to run. And you could say: yeah, but why don't I change the process to accept the channel the way it is? The thing is, sometimes you really need to transform the data, because the programs you want to use expect files in a certain format, so you have to do this reorganization. Here the output is going to be a TSV file called `ASE.tsv`. That's what we have.

So now it's a different kind of exercise. I always gave you the process ready and asked you to write the call; now I'm giving you what the pipeline needs, what the step does, and I want you to create the process. Of course you can go to your main.nf and look at the previous processes, the structure, how they were done, to write your own here. I would probably say 10 to 15 minutes to solve this one. Make sure you don't rush; do it calmly. You see here what the name of the process has to be, what it does (which is given here), what the input is, and what the output is.
So I really recommend you again to pause the video and try to solve this one, and I'm going to show you the solution in three, two, one.

The name of the process was provided. You use the GATK container that we already used in the past. You use `tag`, which is optional but interesting, let's say, to organize your tasks. For the input you gave names: genome, index, dict; as I said, input is positional, so the names don't matter; and the fourth one is a tuple with a sample ID, a VCF, a BAM and a BAI. The output, as was said at the top, has to be `ASE.tsv`. That's what you do in the script block: just copy-paste what was given above. For the workflow block you also have to write the process call, which is pretty simple compared to writing the process, which is more complicated.

By doing that, we can save this, overwrite our workflow block, and run it again, and we're going to finally finish the pipeline, because now it has done everything we need for this variant calling pipeline. Congratulations! Let's see if I can do here the... yeah, balloons, and some confetti maybe, I know. Anyway, congratulations on getting to the end. It's one of one; we have a single output file, also.

One thing we can do here is go to the results folder that we chose, and everything is going to be there. Let me download this PDF file... Yeah, we have the `ASE.tsv`: you have the contig, the position, the variant (it was a T, the reference was an A) and so on. Here you have them, as you saw, only for chromosome 22, right? I'm not sure if you'll be able to see the PDF the way I'm recording my screen, but if you can, it's just a plot showing this analysis for the allelic expression. And we're closing the PDF now.
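A sketch of the solution just described (the container image is a placeholder; reuse whichever GATK container the earlier processes in main.nf already use, and note that the BAM and BAI entries arrive as lists after the `groupTuple`):

```nextflow
/*
 * Allele-specific expression analysis:
 * run GATK ASEReadCounter on each sample's VCF and BAMs
 */
process ASE_KNOWNSNPS {
    tag "$sampleId"                     // optional, but helps organize the tasks
    container 'PLACEHOLDER/gatk:tag'    // placeholder: same GATK container as before

    input:
    path genome                                           // reference FASTA
    path index                                            // samtools .fai index
    path dict                                             // Picard sequence dictionary
    tuple val(sampleId), path(vcf), path(bam), path(bai)  // per-sample files

    output:
    path "ASE.tsv"

    script:
    """
    gatk ASEReadCounter \\
        -R ${genome} \\
        -I ${bam.join(' -I ')} \\
        -V ${vcf} \\
        -O ASE.tsv
    """
}
```

In the workflow block, the call then just passes the genome inputs plus the grouped channel, something like `ASE_KNOWNSNPS(params.genome, PREPARE_GENOME_SAMTOOLS.out, PREPARE_GENOME_PICARD.out, grouped_vcf_bam_bai_ch)` (upstream channel names assumed from the earlier steps).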
So we have the output here, the final VCF. Here, for example, in some positions the reference was a C, and so on, the known SNPs and all these things. Here in the results overview, just if you're curious (I won't go into much detail), you can read about how to interpret the output more calmly. You have the final VCF file, which contains all the variants called from the RNAseq data. You see variants that passed all filters, the ones with the PASS keyword in the seventh field of the VCF file, and also those that did not pass one or more filters. You have the comments in this file, and the explanation here for all the columns. For the `ASE.tsv` you also have a description of the columns, because it's a TSV, right, tab-separated values. And you have this PDF, which is the histogram plot of the allele frequency for SNVs common to the RNA and the known variants from DNA.

As a bonus step, you can run everything, right? If somehow you think you missed something when you copy-pasted, you broke something, and you are lost and don't know what to do to make the main.nf that we built step by step work, but you still want to see the full pipeline working, there's a file called `final_main.nf` here in the folder, and it has the complete version of this pipeline. So you can just use this if you want to see the pipeline working. So here I run this command, the bonus step, with everything that we have, which is six samples, right? We have the replicates here, but there are six, and that's why we now have six mappings. Two are cached; that leaves four. It's going to take a while, because this mapping indeed takes a while, we saw it already. But in the end it's going to add more samples to every process output, and some processes have to be recomputed, with no cache, like this ASE_KNOWNSNPS. With that, we end this hands-on training.
There's nothing else after that. I would like to really remind you about the next courses that we're going to have. We had the foundational training; if you didn't do it, I really recommend you to do it. The hands-on training today, and soon we're going to have the advanced Nextflow training. In a way, you could think of these trainings as introductory, intermediate and advanced. I would also like to remind you about the summits, the Nextflow Summits that we're going to have in Barcelona in October and in Boston in November. This is the annual event of Nextflow, where the whole community gathers. The talks are amazing; you can also join virtually, of course, for the Barcelona one. There's the hackathon before, the whole community meets and discusses, and the announcements are amazing every year: new technologies, new features, new amazing Nextflow pipelines. So I really recommend you to attend, at least virtually; it's a lot of learning that way. With that, I would like to thank you for your attention, remind you that you can still ask questions even after this training on the earlier-mentioned Slack channel, and that's it. I hope you enjoyed it, and have fun in the exercises and on your Nextflow path. Bye-bye!