Hey folks, welcome to this nf-core training. We'll give it another 30 seconds or so for people to join and for it to go live, and then we can kick things off.

So I'm just going to share my screen onto the training event here. All of the information is available over on the nf-core website. There you'll see that this is the first of the sessions we're kicking off today. We're running three training sessions across three time zones around the world, and this is the first of those three. It will also run across three different days, so you'll see there are nine sessions in total. Here is the schedule, which can be found on the nf-core website right at the very top, along with some information on the different events. We're kicking off here with a two-and-a-half-hour session, and we'll have a break halfway through. Hopefully you've managed to find the live stream, which is available on YouTube.

The key information is that if you've got any questions, anything you need to ask during or after the event, we're doing all of that over Slack, in the nf-core Slack workspace. If you scroll down here you'll see some information on that, and there is the channel for this session, which is the training October 22 APAC one. If you select that Slack channel it will open up for you, and you can sign in to Slack and get going there. I know there are a couple of events around the world, so welcome to everyone joining.

We'll get things started with the session on the introduction to Nextflow. This will be a 25-minute session or so, and then we'll jump straight into developing our first Nextflow pipelines. To point out: all of the material is available online. The only things you need are essentially a browser and a GitHub account, because we'll be using Gitpod for our interactive environment. With that, there's a lot of material to cover, so I'll kick things off and jump straight into discussing Nextflow, and then we'll get into some practical sessions as well.

So, starting with a little bit of background on Nextflow: how it came about, and the kinds of problems it's trying to solve. Most of the time when people think about Nextflow, they think about the concept of pipelines or workflows. You can think about these as situations where you have multiple pieces of software, each trying to solve some problem, and putting all of that together ends up being quite a difficult task, particularly if you want to do it in a reproducible way. Here's an example of an nf-core pipeline, eager, which is for ancient DNA analysis. You can see that on the left-hand side you've got some input files, in this case some FASTQs, a BAM, a FASTA file, and what you want to do is send all that information through each of the individual steps and get some kind of results out. So when we talk about Nextflow and the kinds of workflows we're thinking about, there are typically many different tools, and each of those tools may have very different requirements. And we're not just running this through once.
We might run many of these steps for every sample that we have, so we end up with parallelization across the whole workflow, and coordinating and orchestrating that can become quite complex. That's really the key problem Nextflow is trying to solve: writing those pipelines and then running them as well.

Stepping back a little and thinking about the broader problem: we know that, particularly in the life sciences, we all now have much larger datasets. Whether that's sequencing data or imaging data, this large influx of data has required different approaches. You typically don't run this analysis on your laptop, or if you do, you may need to scale up to a cluster or to the cloud. You've got the concept of embarrassing parallelization: the idea that you spawn thousands or tens of thousands of jobs, where the pipeline is basically run in parallel for every sample. And in bioinformatics we rely on a whole collection of different tools. It could be scripts from your own lab, maybe code you take from GitHub, maybe some Bioconda packages or containers, maybe even some proprietary software. You need to bring all of that together to work in a single environment, and then be able to run it in a scalable way. Obviously, all the dependencies of that software are very difficult to manage.

So that's the base problem we started with, and that's when Nextflow development began, coming up to eight or nine years ago now. Since then a lot of things have changed, but a lot has stayed the same, and many of these problems are still very relevant today. We think about it through the concept of FAIR data, which you may have heard a lot about. You usually see this with regard to the data itself, but I think it's just as applicable to the data analysis, or even to the pipelines themselves: the idea that we can make the data analysis findable, accessible, interoperable, and reusable. Nextflow is key to many of those points, but the analysis should also be equitable and scalable — the idea that you can take it, run it on your laptop, and scale it up into the cloud as needed — and testable, so that we can use modern software engineering practices: break a pipeline into components and test from there. That becomes super important as well.

Here's an example of a pipeline that was developed in Nextflow pretty early on, and you can see there's really a lot going on in a modern pipeline. This one has about 70 different processes and about 55 different external scripts, et cetera. When you think about how you would manage and install that, the old way was to go and find each piece of software, maybe go onto your cluster and install it, maybe use environment modules for each one. But then you end up with dependency conflicts — maybe two versions of the same software don't work together — and it becomes a really difficult situation to manage and install things that way. Let me expand a little on that.
Very often the starting point is a paper — something you want to try to reproduce or replicate. If you look through the methods section you can go through and see, okay, they used this version of this software, and you get a rough idea of what happened. But without any kind of reproducibility, even in silico, it still ends up wasting a lot of time. There's a figure here suggesting that for a typical paper you're looking at around two months just to reproduce exactly what the authors did. That seems like an excessive waste of time given that the work is computational: you'd think the code and scripts could simply be published, made available, and replicated.

The problem is a little deeper than that, though. What we were able to show in one of the early Nextflow papers is that the same application, deployed in different environments, can give different results: you take a workflow, install it in different environments, and you end up with different outputs. That's what's shown here in a couple of examples. One of them used the same gene annotation pipeline I showed you before, and you can see that the gene calls — the start and stop locations — differ depending on the environment the pipeline was installed in. The same thing happens with gene quantification: here, transcript quantification using Kallisto and Sleuth, with the same versions of the software and everything essentially identical, run on different operating systems, ended up with the set of significantly changed genes being called differently. This is probably due to underlying differences, perhaps rounding differences, in the libraries underneath, but it leaves you in a very difficult situation for reproducibility.

So let's look at Nextflow and how it attempts to solve this problem, along with some adjacent problems. Nextflow itself can be thought of as a language — a way to write pipelines. As you see at the top here, you can write code in any language: you can take your Python, your R scripts, your command line tools, et cetera, and wrap them in process blocks which are linked together with dataflow programming. This is a very scalable way of taking components of your code and linking them together so the data flows through them. You typically then define the dependencies you need with containers, which package up all the individual pieces of software required to run the underlying processes. And then you wrap all of that in a Git repository, which gives us version control, reproducibility, and modern software practices.

Nextflow is also a runtime. As well as being the language you write the code in, it's the runtime that executes that code. When you run a Nextflow pipeline, you've got multiple choices for what we call the executor, or the execution platform. You can run on AWS, Google Cloud, and Azure.
You've also got options for all the major schedulers — I believe there are 10 or 12 of those supported — as well as Kubernetes and local execution. Even within AWS, for example, you've got four or five different ways you can run Nextflow, so there's a lot of flexibility there. I think one of the key things that has led to much of Nextflow's success is this separation between the definition of the pipeline and where Nextflow runs the pipeline. It means people can come together and really work on the pipeline itself — on the workflows, on GitHub, for example — and then, independent of where they run it, maybe a university cluster, maybe a company working in the cloud, they can all work on the same code base. That really drove the development and growth of the Nextflow community as well.

Considering some differences in how Nextflow works versus other workflow systems: you've got this concept of it being a DSL, a domain-specific language. A DSL is written on top of another programming language. It means that for the most part you can use exactly what you see in the docs as part of the DSL, and then in situations where something isn't covered — say you want to do something outside of that — you've got an underlying programming language you can access. In Nextflow's case, that's Groovy. I should point out that you don't need to know any Groovy to write Nextflow, but having an appreciation of what's underneath can help in those corner cases. This is different from a specification, something like CWL, where you have a definition of exactly what can be done, and anything outside of that becomes more difficult.

Nextflow also has this concept of reactive programming — this is where the "flow" part of Nextflow comes in. The idea is that the processes are sitting there, alive, waiting for data to arrive, and the data is pushed through the pipeline. We'll see a lot more of this as we go; it's one of the core concepts we'll be learning over the next couple of sessions.

Tasks also have the idea of being self-contained. Each Nextflow task runs in its own working directory, essentially isolated by itself. This was one of the key decisions made early on, and it turned out to be quite fortuitous given the rise of cloud services like the batch computing services, where the concept is that each task is a containerized job and each of those jobs runs in some environment. This idea of isolating things ends up being very useful for resumability — the idea that you can stop the pipeline and resume it at any time — and also for the ability to run tasks distributed in many ways. And finally, the thing I mentioned before: the separation between the definition of the workflow and where you run it is a key point we'll stress a few times.

Here's an example, written in DSL1.
DSL1 was the first iteration of the Nextflow language, and it's useful here just to show how tasks are linked together. I'm going to show you this in DSL1, but the concepts behind it are exactly the same even though we'll be learning a slightly different syntax.

So here's the definition of a task. You would typically start with something you've got from the command line: here we're running BWA-MEM with a reference against a sample, piping that to samtools, and the output in this case is sample.bam. In Nextflow, the way you write this is to wrap it in a process block, where you define the inputs and the outputs, and then you'll notice there's a script section — and that script section is exactly what you would have typed on the command line previously. The key difference is this concept of channels. You can see that the process takes two inputs, one from a genome channel and one from a reads channel, and it's the channels that are driving the data — they link back to here.

Now, what happens if I want to use this downstream? What if I want to use sample.bam in the next process? This is where the channels come in again. We define another process here — you can see we've got this process, index_sample — and you'll notice that the output of the align_sample process becomes the input of the index_sample process. That's what inherently creates the link between the two. Channels are one of the key defining features of Nextflow and how you drive this parallelization forward.

We'll look at it a little in practice. This is the DSL2 syntax, which is what we'll be using from now on, and you can see it's very similar. We have the definition of a task on the left-hand side — in this case a quantification task which takes two inputs, an index and some reads — and in the script section we're running Salmon. Again, it looks very similar to what you'd type on the command line. Then on the right-hand side we have a workflow definition, which allows us to define the full workflow: the quantification task takes two inputs, the output of the index process and the read pairs channel. This is just a very high-level view — don't worry, we're going to go through this in a lot more detail in the practical exercises.

In terms of the core concepts: what is a channel? At a technical level it's an asynchronous first-in, first-out queue, and that's what defines these channels. They're typically used to link processes together, so a channel sits between two processes. There's also the concept of operators, which we'll see — a way to manipulate channels in between the processes.
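Just to make the DSL2 side concrete, here's a minimal sketch of the kind of process-and-workflow pairing on the slide. The parameter defaults, the single-end read handling, and the exact Salmon flags are illustrative rather than the exact slide code:

params.index = 'index'      // placeholder path for the sketch
params.reads = 'reads.fq'   // placeholder path for the sketch

process QUANTIFICATION {
    input:
    path index
    path reads

    output:
    path 'quant_dir'

    script:
    """
    salmon quant -i $index -l A -r $reads -o quant_dir
    """
}

workflow {
    // index_ch would normally be the output of an INDEX process;
    // here it's a plain channel just for the sketch
    index_ch = Channel.fromPath(params.index)
    reads_ch = Channel.fromPath(params.reads)
    QUANTIFICATION(index_ch, reads_ch)
}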
If you consider it more visually, imagine we've got a channel containing some data — three datasets, three files, or three values in this case. A process is essentially the definition of what will be run, and a task is one invocation of that process. Because we had three elements in the channel, we end up with three tasks being run, and the parallelization continues from there as the output channel carries on downstream.

In practical terms, here we've got a FASTQC channel with a single file in it: data/sample.fastq. Because there's a single sample in there, when we run this process we'll end up with one task — FASTQC will run once. However, if we change the input and say, actually, I want all my FASTQ files, then the FASTQC process will run for as many files as there are, and those tasks will run purely in parallel. It's just a different view on the same thing: every FASTQ file in that channel creates a FASTQC task.

Okay, a few differences between CWL and Nextflow. In contrast to CWL, Nextflow comprises both the language — the way you write pipelines — and the runtime, whereas CWL is more of a specification. Nextflow is a DSL written specifically for pipelines, versus CWL's more declarative approach. Nextflow is quite concise in the way it's written; it's usually relatively clear what's going on, and it's easy to see and visualize exactly how a pipeline works. It also has a single implementation, which keeps things consistent in how it's developed.

Compared to Snakemake, Nextflow is a lot more similar. One of the main differences is pull versus push: Snakemake defines the outputs and pulls the data through, whereas Nextflow uses a push model. Nextflow also supports many different container runtimes, as well as the cloud, and that's been driving a lot of Nextflow's success recently.

Okay, let's now jump into deployment scenarios. So far it's been mostly about syntax — we've seen a flavor of what Nextflow pipelines look like and some differences. What about actually running the pipelines? We've got a few different ways of doing that. For the workshop we'll mostly use local execution, although towards the end, on the third day, we'll see options for running on clusters and in the cloud. Let me just highlight what those look like.
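(Before we get to deployment — here's a minimal sketch of that FASTQC fan-out from a moment ago. The glob pattern is made up, and the process body assumes fastqc is on the PATH:)

params.reads = 'data/*.fastq'   // made-up glob for the sketch

process FASTQC {
    input:
    path reads

    output:
    path "fastqc_${reads.simpleName}_logs"

    script:
    """
    mkdir fastqc_${reads.simpleName}_logs
    fastqc -o fastqc_${reads.simpleName}_logs $reads
    """
}

workflow {
    reads_ch = Channel.fromPath(params.reads)   // one channel element per matching file
    FASTQC(reads_ch)                            // one task per element, run in parallel
}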
When you're running Nextflow locally, you've usually got a single machine — maybe your laptop or a workstation. Each task can either be containerized or not, but it runs on that machine's operating system, and the files — the input files, the intermediate files, et cetera — are put on local storage. So that's basic execution: nextflow run, running locally.

When you want to scale things up and use more distributed compute, Nextflow can connect to clusters. This is what we typically call centralized cluster orchestration, where you're submitting to, for example, a Slurm cluster. Nextflow wraps each task into a Slurm job and submits each of those to the cluster, which spins up the job and takes care of the scheduling. Here you're typically using a shared file system — something both the cluster nodes and the Nextflow head job can access — and this ends up being a popular setup for teams with larger on-premises infrastructure.

What's become really popular in the last couple of years are the managed services in the cloud. Here I'm showing AWS Batch, but there are very similar setups with Google Batch and Azure Batch. In all these cases Nextflow submits the tasks to the Batch API, and the Batch service spins up resources in response to the tasks being submitted. What's great about this is that you can spin up resources on demand for what you need, and you only pay for them while they're running. So if you have data coming off the sequencer, or you need specific resources for a particular analysis at one moment in time — GPUs, say, or high-memory machines — this lets you use those resources just in that moment, and when you stop using them there are no ongoing costs. The data here is typically stored in object storage — in this case an S3 bucket, but there are plenty of other choices.

Another option, which goes beyond the managed cloud services, is Kubernetes execution. Here Nextflow submits each task as a pod, and those pods in turn run on the virtual machines in your setup. There's a bit more work involved, as you have to manage the Kubernetes cluster yourself, but we see this becoming more prominent over the next couple of years as Kubernetes becomes more mainstream and more managed services appear for it.

Okay, switching gear a little now to how this works in terms of configuration. Here you've got a description of how it would work with Nextflow: you've got your script, you run nextflow run, and it runs on the local machine. You can think of this as having no configuration — that's the default way of running.
If we add some configuration — and notice that the script stays exactly the same — we can add settings for running on Slurm: we can define our executor, a queue, some memory, CPUs, and maybe a container image. This goes in our nextflow.config file. When we run nextflow run with that, Nextflow submits each of the tasks to the Slurm executor. If we then wanted to take the same thing and run it in the cloud, we could simply switch out the executor — say I want to run on AWS Batch — and Nextflow would submit each task to the Batch executor instead. The key point is that we've got these abstractions — queue, memory, CPUs in this case — which take care of many of the details. You'll see this in the last section, on the third day, where we'll go over how easy it is to define those things and keep pipelines portable.

Okay, how is that portability possible? How can we simply switch out an executor like that? The key thing is obviously to write the pipelines in a portable way, and there are a few things we can do for that, but a really key driver has been containerization, which was picked up quite early. Nextflow was actually created just prior to the release of Docker, and as Docker came out the timing was just right: it embodied a lot of the ideas we needed for Nextflow's portability.

Containers were typically used — and for the most part still are — for long-running web services, or for developing web applications and running them in environments like the cloud. At that moment containers weren't really being thought of for data analysis or data pipeline applications, and we took the idea and found ways to use them for that. Previously people were using virtual machines for these kinds of applications — in some cases they still are — but there are a lot of downsides to running VMs. They're very large, often many gigabytes, which means moving them around takes a long time, and they also have a very long startup time. We want to spin up tasks very quickly — hundreds, or sometimes tens of thousands of tasks — and we can't be passing around these very large virtual machines to do that. The tooling around containers has also been very useful: the Docker tooling for building images, for storing them in registries, et cetera, can all be reused for pipelines and the kind of pipeline software we want. And you've got the concept of building on top of existing containers as well. All of that has helped containers gain popularity.

So Nextflow had support for Docker very early on, and there's also support for three or four other container runtimes now.
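Coming back to the configuration idea for a second: here's roughly what such a nextflow.config looks like. The queue name and image are placeholders, not the slide's exact values:

// nextflow.config — a minimal sketch
process {
    executor  = 'slurm'
    queue     = 'short'                   // placeholder queue name
    cpus      = 4
    memory    = '8 GB'
    container = 'my-registry/my-tool:1.0' // placeholder image
}
singularity.enabled = true                // or docker.enabled = true, depending on the site

// Running the same pipeline on AWS Batch is conceptually just a different executor:
// process.executor = 'awsbatch'
// process.queue    = 'my-batch-queue'    // placeholder Batch job queue name
// (plus an S3 work directory, e.g. passed with -w)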
If you run on a cluster you'll often be using Singularity or something similar, which avoids the problems of running with elevated privileges, et cetera. The concept is all the same: each Nextflow task becomes a containerized job, and from there the containerized job runs on your HPC system or in your cloud, in parallel.

Okay, when to use containers? I would say: always. Maybe that's something you can take away from the containerization section of this workshop, which we'll be doing later today. There's a whole bunch of really positive reasons to use containers, and once you get into it, they're very, very useful for your work.

Okay, a little bit on the Nextflow community. I know there will be more sessions on nf-core later, on the other days, but just to touch on the community aspect: Nextflow has been growing pretty rapidly over the last couple of years. A lot of that growth isn't so much contributors to the runtime — to Nextflow itself — but to nf-core, where people can really come together and build up those pipelines. If you're looking to do Nextflow development, I should point out that there are editors with Nextflow syntax highlighting available — we're actually going to use some of those in a moment when we develop our pipelines.

And just going back to the point on nf-core: this has been a real success, developed from the community. We have support from the Chan Zuckerberg Initiative for both Nextflow and nf-core, so I should point that out now. I don't want to go too deep into it, but you can think of nf-core as a collection of not only fantastic pipelines, which you can pull and use straight away, but also a collection of tooling that gives you best practices for how to structure and think about building your pipelines, as well as a fantastic community of people who can help you with developing and running those pipelines.

Okay, I'm going to stop sharing this screen now, and we're going to move over to the practical training. Maybe we can put the link in the chat — you'll find the material at training.nextflow.io — and I'll share my screen there. Can everyone see? Okay. This material, I should point out, is always available, so you can find it anytime — don't worry if you miss something or want to come back to it, it will always be here. Here you'll find all of the information for what we'll be following for the majority of the rest of these workshops. If you go down, there's a little bit of background; hopefully we've covered some of the overview so far. On the left-hand side you can see there's an environment setup section, which describes how to do a local installation if you want to run this on your own laptop rather than in the environment we're providing. We're going to be doing section 1.2 — the Gitpod environment — but I'll run you through the local setup just in case you want to go and run it yourself.
Outside of Gitpod, the main requirement for running Nextflow is a POSIX-compatible system — typically Linux, macOS, or Windows Subsystem for Linux — plus Bash (so basically, if you've got a command line you're good to go) and Java 11, which is pretty much available everywhere. Then there are some optional ones for the work we'll do: Docker, Git, et cetera. There are also a couple of optional tools for this workshop that we don't really need to cover: Graphviz, which is just for visualizing the figures, and the AWS CLI if you want to run on AWS for some of the workshop.

To download Nextflow you've got a couple of commands. You can use wget, which will download a single file that you can then make executable and move onto your PATH, and there are a couple of other options as well.

Let's jump straight to Gitpod, though. Gitpod is essentially a containerized virtual environment that we can run here. If you select where it says 1.2.1 Gitpod quick start, it will bring up a Visual Studio Code environment running in a container, with everything we need for the training: all of the data, Nextflow installed, several other things. It might take a couple of minutes to come up, and you may be asked to log in to GitHub. So I'll just give you a few seconds here.

One of the nice things about this is that on the left-hand side you've got a file browser, so if you're used to using VS Code you can access your files there. You've also got a terminal down here, and a text editor up top, which is very useful for editing the scripts. I'm going to zoom in a little and shrink this, so hopefully you can see my screen and the text nice and clearly. Okay. I'll give you 30 seconds or so to bring that up — hopefully everyone's got access.

These environments should stay available if you come back to them, and if you want to, you can actually log in to Gitpod and see the different workspaces you have. You'll notice, if you're interested, that all this is doing is pulling from the public training repository itself. We have a Git repository — the Nextflow public training repo — and I'll just paste that to show you what it looks like. Inside that repository there's a Gitpod YAML file, and that file defines exactly what the environment should look like: how to install Nextflow, which VS Code extensions to add (so we get Nextflow syntax highlighting, et cetera), and a couple of other things. That way, when you simply put the Gitpod prefix at the start of the repository address, it brings all of this up for us.

Okay, I'm going to close that and get started. We're going to be following the material in the public training, so hopefully everyone's good to go.
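By the way, for reference, the local install described earlier is a one-liner — either command works, then you make the file executable and put it on your PATH:

# download the nextflow launcher script
wget -qO- https://get.nextflow.io | bash
# (or: curl -s https://get.nextflow.io | bash)

# make it executable and move it onto your PATH
chmod +x nextflow
mv nextflow ~/bin/    # or any other directory on your $PATH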
I'm just going to paste this one here, and we're going to start with a very simple pipeline, just to get started with Nextflow. We're going to jump to section 2.2, I believe — your first script. This first script is a very basic pipeline with two processes, so essentially two steps, and it's written to show a little bit of everything that happens in basic Nextflow syntax. We've got definitions of processes, we've got some channels that we'll be defining and using, and we've got some parameters. We're also only using software everyone already has — just basic command line tools. And we're doing something that often happens in bioinformatics or data pipelines: we're going to create some files, we want to process those files in parallel, running each of them through the next step, and then we'll process the results of that and get to the end.

Okay, I'm going to walk you through this and show you first what's happening before we even get to the Nextflow part. You'll notice there's a definition of the parameters, which in our case is a single parameter whose default value is "Hello world!". We take that string and paste it into this section here, which simply splits it into two files, and then we take those two files and convert them to uppercase. Before we get into the Nextflow syntax, I want to make sure it's clear what the non-Nextflow pieces are doing, so I'm just going to copy and paste the actual commands without Nextflow itself.

The first script section simply does what you would do from the command line. It says printf, followed by the string itself — "Hello world!", our parameter — and we pipe that to a command called split. split is just a command line tool, so you'll have it in your terminal. We're splitting on six, which means we split on every six characters, and we give the output a prefix, chunk. So this is going to create two files for us, both prefixed with chunk, because "Hello world!" splits into two six-character pieces. If I run this and do an ls on chunk, you'll notice two chunk files have been created: chunk_aa and chunk_ab. And if I cat them, you'll see that chunk_aa contains "Hello" and chunk_ab contains "world". So the pipeline is simply taking a string and splitting it into two files. Then the next process, the convert-to-upper process, all it does for each file is essentially a cat on the file — cat chunk_aa, for example.
Then it pipes that to another command line tool — tr, in this case — which just converts the lowercase letters to uppercase and prints the result to the screen. I'll run that in the terminal, and you can see it's taking "Hello" and "world" and making them uppercase: HELLO and WORLD. That's all our pipeline is doing, so don't get too worried about the specifics beyond that — it's really just those two things.

Now let's go to the Nextflow part. The first thing you can see is a shebang definition. This is something you'll often see at the top of your pipelines; it's not strictly necessary, but it's good practice for when you want to invoke the script directly.

Next we have the definition of a parameter, which you can spot because of the params. prefix — in this case params.greeting, which means greeting is the name of our parameter. Parameters are a bit special in Nextflow: they can be defined in many places. You can define them in the script like this, in a config file, on the command line, or in a parameters file. They're often used exactly this way: you have a sensible default that's always in the script, and then you can override the parameter at run time.

The greetings channel on the line below is our first encounter with a channel. Here we're using what's called a channel factory — the most basic one, Channel.of. What this does is take params.greeting, that string, and place it inside this channel. You can see this is an assignment: we're giving a name to the channel's contents.

Next we've got the definitions of our two processes. The first one is called split letters — that's the name of the process — and everything between these curly brackets is the definition of that process. This process is fairly simple. It takes a single input, which you can read as a value, x. And it has a single output, which is path 'chunk_*'. What that means is: capture all of the files matching this pattern — you can tell these are files from the path qualifier — and place them into the output channel. We've gone over the script section already.

Jumping to convert to upper: it's very similar. We've got the name of the process, and we've got path y as the input — so instead of taking a value, it's taking a path. Most of the time in Nextflow you're dealing with either values or paths, and paths are special because of what has to happen with staging of files, et cetera. In this case the output of convert to upper is simply capturing the standard output. You don't see stdout too often, but you'll see it when you just want to capture a command's output and place it into a channel. I should point out that in all these cases, the inputs and outputs are channels in themselves.
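Before we move on — for reference, the raw shell pieces behind those two script blocks look roughly like this (typed from memory, so treat the exact flags as a sketch):

printf 'Hello world!' | split -b 6 - chunk_   # split the string into 6-byte files
ls chunk_*                                    # chunk_aa  chunk_ab
cat chunk_aa                                  # Hello
cat chunk_ab                                  # world!
cat chunk_aa | tr '[a-z]' '[A-Z]'             # HELLO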
Coming back to the processes: we're essentially taking input channels and creating new channels as output. So how do we actually call these things? How do we take split letters and convert to upper and run them? That's what the workflow definition is for — it's what actually calls, or references, the processes. We define our letters channel as the result of split letters, and we pass it a single input: our greetings channel. So the greetings channel is the input that goes in, and the letters channel is the output channel — again, it's like an assignment of a channel name; we're saying the letters channel equals the output of this call. Likewise, convert to upper takes the letters channel, flattened — and flatten is our first example of an operator; we'll see how it works in a moment. Finally, we take the output of that and view it, which is another example of an operator, with this syntax here. Don't worry if you don't get all of this the first time — we're going to go through all of these things in a lot more detail, and there's a full line-by-line description of what each line does in the material.

But let's start running this pipeline and see what we get. First, just to confirm everything is working, you can run nextflow info, which makes sure Nextflow is installed — you can see the version of Nextflow I have, et cetera, all that information. Then nextflow run: typically you point it at the file you want, which in our case is hello.nf. So I say nextflow run hello.nf, and you can see the pipeline launches — give it a second to run through. It's run through now: split letters ran once, which you can see from the "1 of 1" — the process split letters had one task. The second process, convert to upper, had two tasks, which you can see from the "2 of 2". And at the bottom, the output that was printed — in our case the view of that channel — is HELLO WORLD.

Let me run this again and see if we can spot any difference. I'm launching again, everything exactly the same, and I get HELLO WORLD. When you run this, though, you might see differences — in some cases, like here, you'll get WORLD HELLO. The reason we may get HELLO WORLD or WORLD HELLO is that the tasks of the second process, convert to upper, run purely in parallel — really parallel, at the CPU level. When things run in parallel they're essentially running at exactly the same time, so we don't know which one will finish first.
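To put all of that together, here's roughly what a hello.nf like this looks like — paraphrased from the training material, so treat it as a sketch rather than the exact file:

#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)   // channel factory: one string element

process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'   // capture every file matching the pattern

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout           // capture the command's standard output

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())   // flatten drives the fan-out
    results_ch.view { it }
}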
Coming back to the output: in the first run WORLD happened to come first, and in the second run HELLO came first. That parallelization is only over two files here, but imagine running this over hundreds of files across each of your samples, maybe not on the local CPU but distributed out — it's pretty powerful, and it flows through the rest of the pipeline as well.

What other interesting things can we see here? One thing to point out: you can see that convert to upper has run "2 of 2". If you want to see each of those tasks individually, I'm going to run the same thing but add this option, ansi-log false, which splits out each of the tasks. Now, instead of showing only the processes, it shows every individual task: split letters (1), convert to upper (1), convert to upper (2). This is mostly useful for teaching — when you run in practice you may have hundreds or thousands of tasks and you won't want to list them all — but here it's much more useful for seeing exactly what's taking place.

The other thing you might notice is this string of letters and numbers here. This is what we call the task directory hash: a unique hash generated from all of the inputs of that task. It will come up again when we look at the resume function in a moment, but essentially it serves as both the working directory where that task runs — inside the work directory — and a unique identifier for that task, and it becomes very useful.

I'm going to open up that file now, hello.nf — you can just double-click it on the left-hand side to bring it up here. Okay. Now let's imagine we're happy with the pipeline — it's working okay — but we want to make a change, or maybe we made a mistake and want to fix something. In the second process, convert to upper, instead of converting to uppercase I'm going to change the command to reverse the string — imagine I made a mistake in my pipeline and I want to fix it. I'm going to save that (I'm just pressing Cmd-S here; you can also always use File > Save). Now I'm going to do everything exactly the same, except adding resume — and note that it's a single hyphen there. So I run nextflow run with -resume, and I want you to follow what happens. You'll notice when it launches that the first task, split letters, is cached. That means the first task didn't actually have to run — Nextflow used a cached version of it. The second process, convert to upper, actually ran — you can see "2 of 2" — and now, instead of converting to uppercase, it has reversed the strings for us. That can be really useful: imagine you're running through the pipeline and there's an error, or the pipeline stops for some other reason.
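To be explicit, the two variations I just ran were (note the single hyphen — these are Nextflow options, not pipeline parameters):

nextflow run hello.nf -ansi-log false   # print every task on its own line
nextflow run hello.nf -resume           # reuse cached tasks whose inputs haven't changed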
With Nextflow you've got the ability to modify and resume, but also just to resume by itself. If I run with -resume again, you'll notice it runs through but all of the tasks are cached. This example is a bit trivial — it only takes a few seconds to run the whole pipeline — but if your pipeline takes minutes, hours, or even days to run, this feature becomes really invaluable. We'll see more on debugging with this later.

The way it works — and I want to point this out — notice how in this first case the split letters task has the same hash as before. Because we used resume, Nextflow looked up all of the information — the script, the inputs, et cetera — determined that the hash was the same and that the task had previously run successfully, and therefore you end up with the exact same hash ID. The task directory stays identical based on that information, and that's what makes this work.

What if we want to change things? I'm going to go back and remove my changes, keeping the script the same as we had previously. Often when we run these pipelines we don't just want to run the default data through; we want to run something a little more interesting. Here we've got the "Hello world!" greeting — what if we want to change that? Keeping everything else the same, if I want to change the parameter from the command line, I define it using not a single dash but two dashes followed by the parameter name: --greeting. Then I can put in whatever I want, essentially overriding the default. I'm just going to copy the string from the material and run it, and we'll see what happens.

You can see the pipeline has launched, and notice a couple of things. The first is that convert to upper, instead of running twice, has now run three times. The reason is that split letters has split our new string, and "Bonjour le monde!" doesn't split into two chunks but into three, because it's obviously longer — you can see the three pieces it gets split into, which then go to convert to upper. It's an example of the parallelization: if I put in some long string — whatever was inside my file, however long — it will be processed in however many pieces there are, and we get this parallelization over whatever we have. I'm showing you strings here, but this would be exactly the same with files; the concept is identical. So we've used this structure to define the processes, and we've seen how we can manipulate the channels.

Another thing I want to point out, something we haven't really looked at yet: this operator here — flatten — is actually doing a lot of the work. So I'm just going to cut this for a second. Let's take away the operator first — I'll remove the flatten — and then do a view on the result, just to look at it. Okay, back to the basic version: remove that and run with the defaults.
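For reference, the parameter override from a moment ago uses a double hyphen, because greeting is a pipeline parameter rather than a Nextflow option (the exact greeting string is from the training material):

nextflow run hello.nf --greeting 'Bonjour le monde!'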
So, the flatten. Something important here: what the letters channel is generating in this case is actually a list of files. If I don't use flatten, you'll notice there's no parallelization. To see why, I'm going to take the letters channel and do a .view() on it — view is very useful; you can see the syntax is very similar to what we had (this is essentially the equivalent), and it's a little like a print, but it lets us look at the contents of the channel. I'll remove the flatten for a moment, save, and run, just to see: when we get to this stage, what does the channel actually contain? Here's the letters channel we're viewing. You'll notice it contains two files — chunk_aa and chunk_ab, as we expected — but it contains them with their full paths. That's because these are file objects — path objects, in this case — so it's showing the actual full path, although as far as we're concerned we can simply use chunk_aa and chunk_ab, because Nextflow takes care of that behind the scenes. The other thing I want you to notice is the square brackets. The fact that both files are on the same line, within one set of square brackets, means that the letters channel holds a single element — one element made up of two files.

Now let's change it back and see what flatten is actually doing. Save this. By using flatten, we take that single element, which was made up of two files, and flatten it out so that it becomes two individual elements inside the channel. And it's the fact that we've split those into individual elements that drives the parallelization — that's what really splits it up. Now the letters channel has those two elements — or four, or twenty, depending on how we set things up — and that's how it works.

I'll show you one more thing before we take a short break and jump into the RNA-seq pipeline. Another interesting thing is to look at the location of these files, and how they sit inside the work directory. I'm going to remove the view and go back to the default. First I'm going to delete my working directory — rm -rf work — just so this is nice and clear, and run through it again. Every Nextflow task, remember, is self-contained, and it's self-contained within its working directory. If I cd into work and do an ls, you'll notice there are three working directories created inside work, one for each of the three tasks that were generated. (If you don't delete your work directory, these simply accumulate.) Let's go into the first one — our first task, you can see, was in the directory starting with 73, so I cd into 73 and do an ls, and you'll notice that inside there's another directory with a longer hex name — 8a f6 and so on, all the way through. I'll go into that one. If I look in here — first, just doing a plain ls — you'll notice there's chunk_aa and
chunk_ab. So the split letters task ran inside this working directory — it was executed from that location, and the files it generated, chunk_aa and chunk_ab, were placed there.

Now let's consider the second process and the tasks related to it — the directories starting with 60 and 70. I'll clear this and go back — this was 73, go back one more — and let's go into 60 and have a look. Again, there's another task directory inside. In this one, if we do an ls, you'll notice there's chunk_aa — but notice how it's bold and a slightly different color. The reason for that, if I look at the full listing, is that chunk_aa is actually a symbolic link pointing back to its original location: the previous process's working directory. When you're working on a shared file system, or on your local machine, Nextflow simply symlinks the files into the task directory and uses them from there. This approach is really useful, but if you're running in, say, the cloud — clusters typically have a shared file system, but in the cloud — Nextflow has to take care of those files and stage them, treating them differently. And that's the whole reason why we treat basic objects — strings, integers, maps, et cetera, which we usually define as val — differently from what we do with path.

Okay, that's the basics of running our first pipeline. I'm just checking whether we're on time — maybe Christy can let me know if we're all good to go there.

[Christy] Yeah, I think that's fantastic. I think now's a good time for a break, and after that we can come back and start talking about this proof-of-concept RNA-seq pipeline, as well as a little bit of Docker.

[Evan] Awesome, awesome. Thanks a lot, folks — I'll pass you over to Christy. I think we're going to have a short break, and as we come back we'll jump onto that. See you all shortly.

[Christy] Cool, wonderful, perfect, thanks Evan. So let's take about a 10-minute break, and we'll come back and, like I said, start talking about the RNA-seq proof-of-concept pipeline. So go away, stretch your legs, get a drink, and I'll see you in 10 minutes.

Okay folks, just another 10 seconds and we'll start the next part of the training. Right — screen — one second. Okay, welcome back everyone. Before we go any further, I just want to remind everyone that if you do have questions, the best place to ask them is in our Slack. If you're not already part of our Slack community, you can join using the link here, and you can put any questions in this training October 22 APAC channel. We have James, who's part of the nf-core team, monitoring that channel, and if there are any questions he can help with, he'll jump in pretty quickly and give you a hand.

So we'll continue with our workshop using Gitpod. What I'm doing here is just creating a new instance using Gitpod — pulling in that container again. If you haven't done this already, it's just going to take a few seconds to spin up; it takes a while to pull in that container. What we're going to be doing is taking it a step further, beyond what Evan has just been showing us, by actually developing a proof-of-concept RNA-seq pipeline. Everyone should see this — and again, remember that we have the
Remember that we have the training material up here if you want to follow along using that, but otherwise I'll mostly be following along in the training website as well. OK, so over to the right here we're going to be talking about this RNA-seq pipeline, like I said, and we have scripts one through seven all sitting there; you can execute these from your command line. First things first, let's just start with script one and talk about what's in it. We can run it with a normal Nextflow execution, nextflow run script1.nf, and as you can see, the same as when Evan ran his hello-world example, it takes a second to spin up, gives us our Nextflow version number, a run name and a revision, and here we just see a printout of the reads path, something under /workspace in the training data, plus what's in there. Going back up to the script and looking at what's actually in it, we can see three lines defining params: the reads, the transcriptome file and a multiqc folder. Parameters are quite special, but here we're really just bringing in strings. We've got the project directory, which is basically where we're sitting at the moment; this is actually quite neat, in that using projectDir gives you a relative path, which is a lot better than hard-coding where you're trying to store the data, and it makes the pipeline much more portable. So here we have some data, FastQ files from chicken gut tissue, plus the transcriptome file and the multiqc folder, and all we're doing is println of params.reads. We can just change this; it's only a string, so it doesn't really matter what we put in there. I'll change it very quickly, Ctrl-S to save so you can see what I'm doing, and run it again. Cool, we get a new run name and revision and we see the new output, but in reality we could put almost anything in there; we don't need anything in particular, because Nextflow is just treating it as a string. Run it again and you see this big long string. So at the moment we're not doing anything special; we're just bringing in these strings, the project directory and this text. We can also override this on the command line: if we want to change it, we can do something like --reads data/ggal/gut_{1,2}.fq and close that off. What you can see happening is that it still runs through the same little script up here, but because we've overwritten it on the command line, it now says gut. We can also, and we'll come back to this a little later, add a glob; because this is just a string, it's not actually doing anything yet, and it just prints through and shows us that string of text, nothing more at this point.
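As a minimal sketch, the top of script1.nf looks something like this (paths follow the training repo layout; treat exact names as illustrative):

    /*
     * Pipeline parameters are declared as params.<name>;
     * anything set here can be overridden with --<name> on the command line.
     */
    params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
    params.transcriptome = "$projectDir/data/ggal/transcriptome.fa"

    println "reads: $params.reads"

and overriding it from the command line looks like:

    nextflow run script1.nf --reads 'data/ggal/lung_{1,2}.fq'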
Now let's jump down to the exercises; this is where the rubber hits the road and we're going to ask you to actually do something and participate in the training. You should see this in the training material: modify script1.nf, the same as what I've been doing over here, and add another parameter called outdir; this is the output directory, or rather what we are going to use as the output directory. If you get through that really quickly, you can also try adding log.info, which is a really good way of printing material to your terminal; here we're just asking for a multi-line string. If you're familiar with some basic coding concepts this shouldn't be too hard, but if not, that's OK as well, because the answers are just down below. I'll give you a couple of minutes to quickly have a go at that, and we'll come back in a couple of minutes. OK, let's jump straight back in. As you can see, I've already copied and pasted some of the answers, just to keep things moving nice and quickly. Up here we keep defining these parameters at the top of the pipeline, or the workflow: params.outdir; again, we could call it anything we want, but today we're just going to call it results. We'll remove the println of the reads, because we don't really need it any more, and use log.info instead. We could put in any text we want; here we just call it the RNA-seq pipeline, and we print out the parameters: the transcriptome, the reads and the output directory we've just defined up here. The three quotes contain all this text as a multi-line string, and at the bottom we have stripIndent, which removes some of the whitespace to make things nice and pretty. So again, Ctrl-S to save, and run it again; I might as well make the terminal a little bigger for everyone to see. OK, so now we see nextflow run launching script1, giving us another run name and revision number, and this is the log text we've just added: nothing we've really changed, just printing out the reads, the transcriptome file and the output directory we've just defined. Great.
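For reference, the answer looks roughly like this (a sketch along the lines of the training material; the exact banner text doesn't matter):

    params.outdir = 'results'

    // log.info prints a (multi-line) string to the terminal and the run log
    log.info """\
        R N A S E Q - N F   P I P E L I N E
        ===================================
        transcriptome : ${params.transcriptome}
        reads         : ${params.reads}
        outdir        : ${params.outdir}
        """
        .stripIndent()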
With that, we're going to keep building on this pipeline and jump to script two; I'll close script one for the sake of readability. All we've got in here is the exact same text that we've just added, so you can either choose to continue with script one, keep adding these things, or just jump through the different script numbers as we progress building this pipeline. Here we're going to add another two parts to the pipeline. The first is a process. Again, as was touched on before, Nextflow has these two ideas of channels and processes, and the process is really where the actual work happens. You'll typically find three main parts to a process: the first being the input, the second being the output, and the third being the script, where we actually run the code you would normally expect to see in a pipeline. For the inputs and outputs there might be paths, there might be values; you can add different qualifiers and tune these as you need. Here we just declare an input, path transcriptome. The name is somewhat arbitrary, but because we've called the input transcriptome up here, we can reference it again down in the actual script, and this is where Nextflow really starts to integrate different scripting languages and inputs from different places: because transcriptome is declared as an input up here, we can just reference it with a dollar sign down there. For the output, we specify salmon_index, which is a specific hard-coded name in a way; we see it here in the output block and again in the script. And then of course, as I said before, we've got the script itself, which runs salmon index with --threads $task.cpus; task.cpus is another part of Nextflow, and we can specify how many CPUs we want. We'll talk a little later about how you can use different directives on different processes to tune this depending on what you're actually trying to run. Down here we have the actual workflow block, which is what executes the process: INDEX takes params.transcriptome, so this is where we pass the actual file to the input we've defined up here, the path to the transcriptome file. So let's run this; it's probably the easiest way to see what's going on. And what we'll see when we actually run it is that it's going to fail. The reason it fails is that we don't actually have salmon installed on this computer, on this instance: salmon isn't found (on line 7 of the generated command script). To get around this, an image with salmon in it has been included via a Docker file, and we'll talk about Docker files more at the end of the session, but for now all we need to do is quickly add -with-docker: the script will then know to go away, look for that Docker image, and use it as the source of this tool. And we see it runs nicely: the pipeline has been executed, you can see the log information, and here we see the INDEX process, one of one, being executed. Now, you don't really want to have to write -with-docker on every execution of this file, so what you can do is include this information in a config file. Over here in our working directory we have this file called nextflow.config, and it already has some information about the container and where it is; this is just the Nextflow rnaseq-nf container, and we have some run options which I'll come back to later, so don't worry about those too much for now. All we need to add is docker.enabled = true; save it, remove -with-docker from our command line, and click run again. Cool, and now the index runs successfully: Nextflow knew to use this config file and enabled Docker for us automatically, which is really fantastic, so now we don't have to type that every time we want to execute this piece of code.
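Pulling that together, the process looks roughly like this (a sketch close to the training material; treat exact names as illustrative):

    /*
     * Build a salmon index from the transcriptome FASTA.
     */
    process INDEX {
        cpus 2                // a directive; one of the exercises below changes this

        input:
        path transcriptome

        output:
        path 'salmon_index'

        script:
        """
        salmon index --threads $task.cpus -t $transcriptome -i salmon_index
        """
    }

    workflow {
        index_ch = INDEX(params.transcriptome)
        index_ch.view()       // the view exercise below: print the channel contents
    }

and in nextflow.config, docker.enabled = true alongside the existing container setting is what turns Docker on for every run.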
Back in the training material there are a couple of exercises we could have a wee go at. The first we've already done, by adding docker.enabled = true to the config. But we could also try to print out this index channel using the view operator; if you remember back to Evan talking earlier, he actually used this already. And as an extension of that, we could also try to modify the number of CPUs taken up by that process using a directive. Now, we haven't talked about directives yet, but there's a little information about them in the material, as well as the answer if you get stuck. So I'll just leave about 30 seconds for you to have a wee go at that, and I'll check if there are any questions coming through that I can address on the fly. OK, so for the first exercise, if we want to actually view what's happening in that channel, we can use index_ch.view(), the same as what Evan showed before. We can save that very quickly and run the script again... and I've typed the wrong thing, my apologies. You can see here that quite often, when you do make a silly mistake like I did, Nextflow actually tells you where the mistake is, which is really nice: at line 37, which is in the script here, it couldn't find index; I've now corrected it to index_ch, and it comes back and runs successfully. And here we can see it has printed out the contents of that channel, which is the path, in the work directory, to what salmon produced. So salmon has actually indexed the transcriptome file, and to see what happened you can go into that specific work directory, the one with this hash, 15/5e and so on; complete the path and you can see the salmon_index it created, and also that the transcriptome file has been staged there. What Nextflow has done is gone away, taken that transcriptome file and staged it in this work directory, so everything is localized in this one place where it can be accessed by the process, and then it executed this command and produced the salmon_index. We could also look at the tree of this, which is the breakdown of everything inside the folder salmon created, and as you can see, the staged transcriptome file is actually a symbolic link back to the file sitting in the workspace, over in the data/ggal folder, somewhere just up here; there it is. OK, great. So we're going to jump straight to script three, to try and keep things moving. Again, what we're going to do is quickly run the script, see what happens, and then talk about what's probably happening behind the scenes; I'll close these files for clarity. As you can see, this is actually a bit of a step back from the script we just created, and we'll come back to what we achieved in script two, but what this script introduces is the fromFilePairs operator. We have these things called channel factories, which you use when you bring information, data, strings into Nextflow as channels. Up until now we've been using Channel.of and friends when we needed to bring in a piece of data, but sometimes it's much more complicated than that: you might have combinations of file pairs, or different patterns of data you're trying to bring in. So what we're doing here is using fromFilePairs, and I think it's probably one of the most useful channel factories, especially if you're dealing with paired-end Illumina data, for example. So let's just run it and look at the contents of that channel.
What you can see here is that we've got this little bit of information at the front, gut, and then a paired piece of data, the two FastQ files for gut_1 and gut_2. Breaking this down, fromFilePairs has gone and looked at this read string, projectDir/data/ggal/gut_{1,2}.fq; the {1,2} acts like it could be either a one or a two, with a prefix of gut, and what it has done is emit that prefix, gut, as a value, and then give us the paths to these two files. This is a really nice way of doing it, and what we can actually do is add in some glob patterns to identify or include additional data in that folder. Again, I'm just going to do this relative to the path, but you could use the whole projectDir too; I won't, just to make my typing a little quicker: data/ggal, then a glob pattern to include every possible combination at the start, *_{1,2}.fq. Let's try that... oh, it doesn't like that; what's it doing? That's probably because I've still got the old value in here as well, so let's just change it up here in the script for now, to keep things moving, and run script three again. So what's happened here? Let's close that for a second; we've got gut, liver and lung, the three different types of data we actually have in this data folder. Again, it's gone through and looked for the prefix, all that common information at the start of these reads, gut, liver and lung; it's worked out that that's the value, and then we have these paths here, which is really, really nice. One more thing we can do here, and some people like it one way, some people like it the other, is change this so that we don't have to specify it using the equals sign; we can use something else called set. So instead of read_pairs_ch = Channel.fromFilePairs(...), we can write Channel.fromFilePairs(...).set { read_pairs_ch }; apologies, we're jumping around a little here, but set just defines the channel this way, and of course when we do add operators, flatten and other things like that, we chain them in the middle. This is probably how I do most of my coding, just because I find it a bit more intuitive: you start at the top and work your way down, so you can follow what's happening to the data. So again, we're just going to run this; I'll save it, and we should get the same output... read peers channel? You mean pairs; thanks James, that's a silly typo of mine... no, sorry, line 18 has got a typo, but line 19 also has a typo, and the letters on line 18 should be flipped. There we go: read_pairs, read_pairs, read_pairs; thanks James. Success! So yes, with the debugging there you could see on line 18 that I was making a mistake, and James quickly picked that up for me while I got all flustered, but here you can see Channel.fromFilePairs, we'll just use set, and again we're just viewing that channel.
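As a sketch, the channel factory plus set looks something like this (following the training repo layout):

    // Group paired-end reads: emits [ sample_id, [ read1, read2 ] ] per pair.
    Channel
        .fromFilePairs("$projectDir/data/ggal/*_{1,2}.fq")
        .set { read_pairs_ch }

    read_pairs_ch.view()
    // e.g. [gut,   [/workspace/.../gut_1.fq,   /workspace/.../gut_2.fq]]
    //      [liver, ...] and [lung, ...] likewise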
Sometimes, though, you don't want to just proceed with your entire pipeline, especially if, like me, you've made a mistake. You can add in checks, and this is especially important when you are dealing with big data, when you're sending things up to the cloud and you don't want to potentially spin things up and leave them running for a long time. So you can use extra little options to check that your data exists and is where you think it is. Whenever we use a channel factory like this, there are quite often extra options we can use to check, and the Nextflow documentation is very, very good, as I said earlier: there's all this information on channel factories, a lot of different ways you can bring in data, and sometimes you need to tune that to make sure it's working properly, specifying the type, how deep you want to go when looking for a file, whether you should follow links, what size you're expecting, whether the output should be flat or not. Here I'm just using checkIfExists, which, when true, throws an exception if the file path doesn't exist; I'm adding it into the channel factory, checkIfExists: true, and we can just run that again. Cool, and that works. But say, for example, we change the path to something that doesn't exist: now it spits back this error message saying the data doesn't exist, and that's because we've used checkIfExists: true. If we didn't have that, we wouldn't get anything as elegant: it doesn't spit out a nice error, it just runs, because we weren't checking, meaning we don't get that clear message telling us what was wrong. So checkIfExists is a really nice addition, especially when you're bringing in data like this.
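In code, that's just one extra argument to the factory (a sketch):

    // Fail fast, with a clear error, if the glob matches nothing.
    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }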
Just because we are moving quite quickly and do need to keep powering through this content, we're going to jump straight to script four, which is sitting just down here. This is much closer to what we've already seen in script two, but we're adding this quantification step. So we've got this QUANTIFICATION process, another process, with its input here: we take the same index, plus this tuple. A tuple is just a combination of different pieces of data stored together in the one channel element; here we have this value, which we've specified as sample_id, as well as the paths. What you might remember from earlier is that we created this channel as an output from fromFilePairs; if we quickly jump back to script three, the output was gut and then these two file paths. The gut is the value and those are the two paths we're expecting, and that's exactly this tuple: sample_id becomes, in this example, gut, and we're storing the two file paths as reads. For the output, we specify the sample_id, and this is an example of using dynamic naming: we use the sample_id that came in as an input to name the output from this process, which is really nice. This is a really good way, especially when you have lots of samples, to carry this information around as values; you can even map it and use metadata to name the data and keep information about it as you carry it through the workflow. Down here in the script block, there's nothing especially fancy: again we're using salmon, this time salmon quant, with --threads $task.cpus, similar to what we did up in the index, and this libType option; the salmon_index is specified as the input path, which is the output from the INDEX process. We could rename that input, by the way; it doesn't have to be anything in particular. We'll keep it the same for now, but we could quickly rename it to something else and it would still run; you're just giving the input a name, and you carry that through into the script when you're actually running it. For the reads we're using ${reads[0]} and ${reads[1]}; because there are two file paths in the tuple, we just index them as zero and one, and the sample_id is the value. So what we'll do is quickly run this and see what we get as the output. Here we've kept it to the one sample, the singular gut sample specified up top, and as you can see it has taken the one sample: it's run INDEX, indexed the transcriptome file, and then quantified it using this quantification process. We're actually going to make this a little bit more complicated, and a little bit cooler, by adding in some extra data; we can also add some other things called tags, and consider publishDir, which is also very useful when you're dealing with lots of data rather than a simple example like this. So for the reads we'll go data/ggal; we're going to have lots of different options because of all the different files in this folder, so we add a glob at the start to include all of those, *_{1,2}.fq, and close that off. So now we're actually addressing all of this data, and we're also going to use the -resume option, which we'll talk about a little afterwards as well. Again, we go through the same process, nextflow run script4.nf; in this example we've changed the reads, overriding what we actually specified up in the main workflow script, and we've added -resume because it adds some cool functionality that we'll be able to demonstrate. This all happened quite quickly, but what has actually happened here is that we've got this word 'cached' at the end of the index line. What that means is that the result was pulled from the cache: Nextflow knows you've already run this, and because it's hashed and we haven't changed anything about that process, it doesn't need to run it again and uses the cached data. The quantification also used the cached information for the first sample, which in this case was gut, and then ran twice more for the other two samples. Now, I've just removed the work directory, so we've thrown away that cache; if we run it again, the resume function won't find anything to reuse, but it doesn't actually hurt having it there; it's not going to throw an error or anything like that.
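For reference, the quantification process and its wiring look roughly like this (a sketch close to the training material; INDEX is as sketched earlier):

    process QUANTIFICATION {
        input:
        path salmon_index                     // output of INDEX
        tuple val(sample_id), path(reads)     // e.g. [gut, [gut_1.fq, gut_2.fq]]

        output:
        path "$sample_id"                     // dynamic: named after the sample

        script:
        """
        salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
        """
    }

    workflow {
        read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)
        index_ch = INDEX(params.transcriptome)
        quant_ch = QUANTIFICATION(index_ch, read_pairs_ch)
    }

and a run that reuses the cache: nextflow run script4.nf --reads 'data/ggal/*_{1,2}.fq' -resume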
It's quite a nice thing to have regardless. Cool, so this run is a little bit slower, and as you can see there are these numbers; there's a number here, and I'll talk about that in just a second, but we can see three of three, meaning it has run three times, once for each different set of data: the lung, the liver and the gut. So this is nice; it gives us a number telling us which set of data is running through, but we can also use these things called directives, which I mentioned earlier, to tag this information and give us something a little more informative to look at. Tags are a really nice way of labelling that data, and this is one of the exercises, so you're welcome to follow along and do this one as well: we can write tag "salmon on $sample_id". One of the cool things about Nextflow is that even though we've specified sample_id beneath this, in the process inputs, Nextflow still knows it can use it in a directive, which here is tag. As another example, you can also change the number of CPUs (nice autocomplete there showing us what's going on); if you're using Gitpod, I think there is actually a reasonable amount of resource available, and we can specify here, for example, cpus 2, and up in the other process a completely different directive value, say four. So again, let's quickly save that and run it again, and see if we can spot any differences. It all ran quite quickly, so I'm just going to clear this and show it again with -resume. If you watch closely, the first process is nice and quick, and then one, two, three, you can see that we've actually tagged each task, salmon on lung and so on, which is exactly what we specified up here. You probably don't need the 'salmon on' part; you could just have the sample_id, or you could make it more descriptive of the data you're processing. You can just run it again, it'll be nice and quick, and you see each sample tagged as it's processed.
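In code, the directives are single lines at the top of the process body, before the input block; a sketch:

    tag "salmon on $sample_id"   // label each task with its sample in the run log
    cpus 2                       // matched by --threads $task.cpus in the script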
The other exercise we have for this module, for this process in script four, is adding in something called publishDir, and this is a really nice way of organizing your data, especially if you have multiple steps and quite big files: you might say, I want to keep this, I don't want to keep that, I want to store this here, I want to make this a symlink, I want to copy that; there are lots of really cool ways you can organize your data. So let me just try to bring this up quickly and share it with you: publishDir is one of the directives, and there are a lot of different directives; we don't have time to talk about all of them today, and we'll talk about some at the beginning of tomorrow, but when you use publishDir you can specify exactly which outputs you want picked up and where you want to store them. So today this is part of the exercise, and just to keep things moving I'll give you about 30 seconds to try it; it's in the training material, section 3.4 I think, if you get stuck. Try to add the publishDir directive yourself, and then we'll quickly go through the answer. Sorry, I was just on mute. So what happens here is that we just add another directive; you can keep piling these directives on depending on what you're trying to do. Here we're using publishDir, which is the name of the directive, and we're specifying params.outdir, the same as what we specified up top, but we don't actually need to do that: you could call it something completely different, results2, or really anything you want, or even something more descriptive, like quant or something like that. Today we're just going to use params.outdir, and we're going to copy in this case; depending on the data type and the size of the files, you might say, I do want to keep this, I don't want to keep that, I want to copy it, I want to symlink it; again, lots and lots of options. Just to show my directory at the moment, we don't actually have a results folder yet. I'm just going to clear that, and OK, let's run this again; this is just the basic Nextflow run, with the log information we specified right at the start, then index and quantification, our processes tagged with gut and so on, and now when we look, we can actually see this results folder, and when we look inside it, we can see all the quantification information; probably not a particularly exciting example to look at, but easy to see, and you can dig down into it and see all the information there as well. So that is really a taste of what you can achieve with Nextflow: really tuning each process, potentially allocating a different number of CPUs, adding some nice tags so you can see what's going on, and storing that data afterwards. As I showed, there are a lot of different directives you can use, and don't be afraid to have a play around with them in your own time.
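The answer is one more directive line alongside tag and cpus in the QUANTIFICATION process (a sketch):

    publishDir params.outdir, mode: 'copy'   // also try mode: 'symlink' for big files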
So, script six is... excuse me, we're going to script five first, apologies. OK, so this one jumps ahead a wee bit again, just adding on to script four; everything we've done is still in here, and we've just added in this extra process for FastQC. One part of script five in particular is a relic of DSL1, where you actually had to split a channel so that you could use it twice; with DSL2 this happens automatically, so you don't really need to worry about it any more, but just because it's included we'll talk about it very briefly. Again, we've got a tag, 'FASTQC on $sample_id', reusing the sample_id from the input, which is that value we derived from fromFilePairs; we've still got the reads sitting there; and we've got this output, which in this case is fastqc_${sample_id}_logs, again using the sample_id value from the input. This is standard string interpolation, putting together string values with variables, in a script or in an output. And then we're running a basic FastQC execution, which some of you are probably familiar with: we create a directory, which is this logs folder here, and then we run fastqc, which just takes in the reads and the sample_id and executes the FastQC analysis on them. The workflow and channel side of things is again pretty much the same: like I said earlier, we've got fromFilePairs taking in params.reads, right from the top, with checkIfExists: true, and setting that as read_pairs_ch. And we've already talked about how you could have done this the other way, read_pairs_ch = Channel.fromFilePairs(...), completely removing the set bit; it's fine either way, this is just how I like to do it personally. You could also, for example, use set for index_ch and go through and change all of these if you wanted to; it doesn't really matter too much for these purposes, but it's just to demonstrate that there are different ways to do the same thing. So, nextflow run script5.nf; again this takes a wee second to run. At the top of the script we've just got gut, but in the past we have overwritten this in the command line using --reads, so for ease we'll just use the glob there and run it again, just to demonstrate that even with extra data, more files in this input parameter, Nextflow is going to split those off and run them in parallel, which is a really, really nice way of doing it. FastQC is going to take a little bit longer, so give it a moment.
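The FASTQC process itself looks roughly like this (a sketch close to the training material):

    process FASTQC {
        tag "FASTQC on $sample_id"

        input:
        tuple val(sample_id), path(reads)

        output:
        path "fastqc_${sample_id}_logs"

        script:
        """
        mkdir fastqc_${sample_id}_logs
        fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
        """
    }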
While that runs, I don't want to dwell on script five too much, because we're going to jump straight to script six, especially since we still have a lot to get through, talking about containers and Docker. In script six, again, not a lot has changed; we're just going to keep adding on another process. Everything up until FASTQC is the same, but now we're also going to be using MultiQC, and this brings in another really fantastic feature of Nextflow: we can use operators to split channels apart and bring them back together when we need to. Sometimes you don't want to run a process on one piece of data at a time; you might want to run it on the data collectively, and that's the case for MultiQC, where we don't want a different report for every sample or every file combination; we want one report covering everything together. So we're going to play around with this workflow a little bit and talk about mix and collect, just to show you how these things work and to really demonstrate how they might benefit your own coding and your own pipelines in the future. So again, nextflow run script6.nf; we still have just gut up top, so it's only going to be one sample, and what we can see is index, quantification, salmon on gut, fastqc on gut; it takes a wee second, and once all of these processes are finished, Nextflow knows that everything is now available and it can start pumping it through into the next step, MultiQC. This is something we haven't really talked about, but the way Nextflow works is that when a process's input channels have content ready, that drives the process forward. Nextflow can't start MULTIQC yet, because the output channel of FASTQC hasn't been created; and you can't start FASTQC until it has used the read pairs, which you don't get until you've created them. MULTIQC also needs the quant channel, and that hasn't been produced yet either, because it needs those other files first. When all the requirements of a process's input channels have been met, the process is driven forward. So here MULTIQC takes quant_ch.mix(...), and we've got it for one sample, because only the one sample is available; but when we actually change back to the glob, we can see we have the multiple samples included; I'm just going to add -resume to keep things nice and quick. So again, it slowly works through: the per-sample processes are parallelized, but this MULTIQC process has to wait until everything has been created. Just to break this down, let's have a look at some of these channels: quant_ch.view(). OK, so here what we can see is that there's a separate channel element for each of the different tissue types, and it's purely just the path to the output folder. We can do the same for the FastQC channel; again, we'll quickly look at what it produces so we have an idea of what we're working with; I've left off a character there, that's my mistake. You can see here this is actually the output of FASTQC, and if we check what's in there, we've actually got these FastQC files, the HTMLs as well as the zip files; this is just the output of FastQC in its work directory; we haven't stored it anywhere else at the moment. So, when we want to combine these channels, we can use operators, something I've mentioned a couple of times, so it probably hasn't made a lot of sense yet; there is really, really good documentation on the different operators and how they work on the Nextflow website, and I'll just bring that up quickly so you can have a quick look. Operators are the last big section here: there are different ways to filter, transform, split, combine and fork channels, different maths operators, as well as all these others. There are probably a handful that you use quite frequently, and some of the others are more like edge cases, for when you're trying to do something a little cooler, a little outside the norm, but the documentation is really good, so don't fret about not knowing them all; some of them are quite similar to what you'd expect from Groovy, and it's not a big deal to become aware of what they do. So, jumping back to this: we're going to remove the MULTIQC process for now and just look at the channel as an output. What's happening here is that we're taking this quant channel and mixing it with the fastqc channel, and we're not going to collect; actually, let's leave that off for now, so we just see what happens when you mix the quant channel with the fastqc channel; what does that output look like? Let me move it up and make it a bit more visible for everyone. What's happened is that it has simply mixed these two channels: up above, when we looked at them in isolation, when we viewed the quant channel and the fastqc channel separately, we saw each channel's elements printed out as separate lines; by mixing them, we've just added those together, so now all of this is in one mixed stream.
But they're still separate elements. What we want to do is collect all of these into one single element, and that is what we're going to do just here; I think that should be OK. What you can see I've done here as well is chain these operators on top of each other; you can mix the layout up a little too, and you might say, I want to do it like this because I find it easier to look at. So it's a mix, then a collect, and then a view, view being that bread-and-butter way of printing this stuff out to the terminal so we can see it nice and easily. It just takes a second... cool. So what you can see now is that we've actually got all of this included as one single element, and we can tell because of the square brackets around it: we've got this first one here, comma, second, comma, third, comma, and so on, all the way down. So those separate elements are now all included in one. We can double down and give this a name with set; we could call it anything we want, say single_channel, and we're just going to set that as the name; then, hypothetically, we can do our MultiQC on the single channel. This is a little bit different from what you'll see in the training material, but it's just illustrating that you can do this stuff lots of different ways. We've got the -resume there, and again it's going to go back and say a lot of this stuff is cached, all those hashes haven't changed, so we don't need to run it again, and we've run MultiQC. What I might do is quickly clear the work directory and run this all the way through; and we haven't actually set the output as anything, we could set it as another channel if we wanted, but it doesn't really matter because we're not going to do anything else with it after this. These steps just take a wee second; we've still got this view in here, so we're getting that printout to the terminal, and we've got this MultiQC which has run. We just haven't stored the outputs anywhere, I don't think... oh, we do have it: we can actually check the results folder, and we've got that nice MultiQC report in there, which we could open up and view if we wanted to, but we probably don't need to right now.
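Pulling those operator steps together, the channel wiring looks roughly like this (a sketch; single_channel is just the illustrative name used above):

    quant_ch
        .mix(fastqc_ch)       // interleave the elements of both channels
        .collect()            // gather everything into one single list element
        .set { single_channel }

    MULTIQC(single_channel)   // one MultiQC run over all samples' outputs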
OK, so jumping on to script seven; let me just check my notes so I make sure I know what to talk about. OK, cool. Up until this point we've really just strapped these four different processes together: the index, the quantification, FastQC and MultiQC, and we've talked about how we can control and fine-tune these with different directives. From here, things can get a little more complicated, so I'm going to quickly demonstrate some of the more dynamic things you can do with Nextflow; I'm going to have to move through some of this quite quickly, just so that we keep to time. The first thing I want to talk about very quickly: everything we've talked about so far is very channel-specific, with everything isolated inside those processes, but you might want to execute something at the start or at the end of your workflow, and you can definitely do this in Nextflow. Here, for example, is workflow.onComplete, where we basically have an if/else statement; this is a little more Groovy, so don't worry about it too much if you're not familiar with it, but we just use log.info again, the same thing we used right at the top of the script, to print a message depending on whether the run completed successfully: if it has, 'open the following report in your browser', pointing at params.outdir, the results folder we specified at the top, and the MultiQC report; and if something fails and it hasn't completed successfully, 'oops, something went wrong'. It's quite a nice way of finishing off the script. Cool, done: open the following report in your browser, which we could definitely do if we had more time. The other thing I want to talk about here is notifications; I haven't got this set up on this computer, but if you have your own SMTP email configuration, Nextflow can send you an email once the workflow has finished, and that's quite a nice way of doing it if you're running a lot of sequencing off a core-facility-type platform. There's lots of documentation on this.
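As a sketch, the completion handler looks something like this (the message wording follows the training material, give or take):

    workflow.onComplete {
        log.info ( workflow.success
            ? "\nDone! Open the following report in your browser --> $params.outdir/multiqc_report.html\n"
            : "Oops .. something went wrong" )
    }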
Sometimes our processes are defined around existing tools, like the ones we've been using here, but say you have your own script, a big, long Python script or an R script or something else like that, that you want to use in your workflow; you can include these as part of your workflow too, because Nextflow knows where to look for them. We don't actually have it set up in here, so: mkdir bin, cd bin, and quickly use nano. What I'm putting in here is just a shell script, as you might expect; it's essentially everything from the script block of that FASTQC process, and today I'm just going to replace the inline fastqc command with this script, purely to demonstrate how this could be done. Ctrl-X, yes I want to save that, and I'm going to call it fastqc.sh. Now that we have fastqc.sh in there, what we do need to do is make sure that it's executable... oh, I've already put a bin inside a bin, oops, too easy; let's just go back and do this again, and it'll be a little quicker if I copy the code from the tutorial rather than typing it all out and checking that it's working: chmod +x fastqc.sh; I got a little bit excited there. So now we have this bin directory, which has the script in it, and I'm just going to remove that stray nested bin in case it causes any weird conflicts. What Nextflow knows to do is look for the bin directory within this working project directory, so in the process we can just specify fastqc.sh; and in the script we just created, I actually set up two arguments, which I can show very quickly: cat bin/fastqc.sh, and you can see it takes the sample ID and the reads as its two arguments. So what we can do is replace the script block in script seven with a call to fastqc.sh; let's type this out, here we go, and this executes like any other script. This is a really good way, like I said with the R script or Python script example, to install something as part of your pipeline, rather than having to go away and pull it in from elsewhere. And it's running really nicely.
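For reference, a sketch of what goes into bin/fastqc.sh and how the process then calls it (the argument handling here is illustrative; the training version may differ slightly):

    #!/bin/bash
    # bin/fastqc.sh -- run FastQC for one sample
    # usage: fastqc.sh <sample_id> <reads...>
    sample_id=$1
    shift
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q "$@"

and in the process, the script block simply becomes:

    script:
    """
    fastqc.sh $sample_id $reads
    """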
One thing I haven't talked about, and I don't think we're going to have time for today, I'm sorry: if you jump into one of the work folders, c2/48..., you can see there's a bit of stuff in there, but there are actually also these hidden files. You can go in and check out exactly what was run: .command.run is essentially the whole wrapper script Nextflow generated, but probably more interesting is .command.sh, where you can actually see what fastqc took in and how this script was run. And you can do this in any of the work directories, for any of the processes, to go and look at exactly what's happening under the hood as part of Nextflow. One more thing, and I apologise I can't dwell on this for too long: you can actually run Nextflow not just from local scripts; you can run a pipeline straight from Git, from GitHub essentially, and I think this will come up again a little later, so we won't dwell on it now; let's skip it for the moment. Also, if you run script7.nf with the reporting options, you can get all these really nice reports and traces and timelines and DAGs. Some of this has been somewhat superseded by tools like Tower, and you can probably mine a lot of the same information from a FastQC or MultiQC report anyway, but it does produce these really nice reports, and a lot of them turn up in our folder here: for example trace.txt, where you can see all this information about where each task was run, how it was run, and how much resource it took; and dag.png, which is this nice visual representation, lines being channels, circles being processes, and the dots being operators; a really nice way of visualizing what you're doing. So we're going to jump to something else now, which is a bit of an aside. We could talk about Nextflow a lot more, and we will talk about these ideas and concepts a little more tomorrow, when we cover channels, operators and processes in more detail, but today I want to talk about Docker very quickly. Docker can be quite scary when you first start with it, but it's really worthwhile once you understand it a little more. So let's clear the terminal and start with this fresh idea. Docker is a little bit similar to Nextflow in that you have docker run, much like nextflow run, with a container name. Docker is container management software; I really encourage you to go away and check out some information on it if you haven't heard of it or used it already, but the main idea is that it's a container platform that builds and stores tools together with all their underlying dependencies, completely contained in an organized way, and it enables you to use either your local version or a version from Docker Hub, which is an online image repository, pull those into your pipeline, and run them 100% reproducibly, which is really fantastic, especially for making everything portable as well as reproducible. So just as an example today, we're going to use hello-world, and what's happening here is that Docker effectively prints out what's going on. Working through it really quickly: it couldn't find hello-world locally, in my local Docker repository, so it pulled it in from Docker Hub; it's telling you what's going on, and there's this daemon working in the background, basically monitoring things. Essentially it says: I don't have it locally, go to the hub, pull it in, stream it out here; the request escalates up and then comes back down to actually print what's happening here, which is really, really nice. This is really just an example to show you that idea of going up and down. We can also use docker pull, which actually pulls in a Docker image from, in this case, Docker Hub. What I've done here is pull debian:stretch-slim, where stretch-slim is basically the version tag, a very, very light container, and I've just quickly typed docker images, which is just a way of asking what images I have. What's happened is that this image has been brought in and is now stored locally for me. You can also see Debian, which is already relatively small; hello-world, which I've just run, didn't have locally, and pulled from Docker Hub, so it's sitting there locally now as well; and this nextflow rnaseq-nf image, which is substantially bigger, because there's much more in there, salmon and other things, for example. So this is all fine: the run went up, came back down, said let's run this, spun the container up, said cool, here it is, and spun it back down; I don't actually have it running any more. But you can also run a container in an interactive way, which is what I've done here by typing docker run -it with this image and bash, and what you can see is a completely different file structure from what we're expecting. If I exit out, you can see I'm sitting here with all of my nice nf-core files, but when you run it again, it's different. You can also type whoami, and you'll see that inside the container you are root, meaning you have root privileges and your user ID is root; but exit out, type whoami again, and you get gitpod. So it's effectively a completely different profile in there.
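As a quick reference, the commands just demonstrated (all standard Docker CLI):

    docker run hello-world            # pulls from Docker Hub if not found locally, then runs it
    docker pull debian:stretch-slim   # explicitly pull a lightweight base image
    docker images                     # list the images stored locally
    docker run -it debian:stretch-slim bash   # open an interactive shell inside the container
    whoami                            # inside the container: root; back outside on Gitpod: gitpod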
So what I'm going to demonstrate very quickly, just because we are very quickly running out of time, is basically how you can build a Docker file. I'm just quickly making a docker folder and moving into it, because if you run a build in a folder with other stuff in it, it'll bring all of that into the image, which you don't really need; it just makes it a lot bigger and heavier than you potentially want it to be. Here I've just used nano again very quickly and copied this from the training material, so you'll be able to go back and have a look at it in more detail, but you can see I've got this Dockerfile. Basically, what's in that Dockerfile is just instructions for building this image: it says start from debian:stretch-slim; here I could put my name as the maintainer if I wanted; and I'm using apt-get to install curl, which pulls things in from the web, and cowsay, a toy program with a cow that talks through your command line. And then what I'm going to do is quickly build this image: docker build works through those instructions I had in the Dockerfile, saying go away, bring this in, bring that in, and it builds it, and we call it my-image; as you can see, you can give it different tags and things, but we won't go into that right now. If we go back and look at docker images, you'll now see that we've built this image, my-image, and then we can run Docker again; I've cleared that, so I'll show it again: docker run my-image cowsay hello docker, and we run that and get a nice cowsay back; hello, hey back, it runs nicely. The thing here is that cowsay isn't found locally; it only exists when we run it inside the Docker image. What we can do here as well is hop inside that image: docker run -it my-image bash, then cowsay hello, there we are. So again, this is just happening inside the container; it's a different environment from what's happening locally for you in the terminal. We are probably going to run out of time to do this in great detail, but we'll start anyway. So again, all I've got in here is my Dockerfile, nano Dockerfile, and we can add in instructions to bring in basically anything we want, and the build will just iteratively work through them to build this Dockerfile, or rather this Docker image. Here we're adding in salmon, just to demonstrate that you can use this with real-world examples, not just these sort of artificial toy examples. You can see the first steps happened really, really quickly, because we've already done them; we're just building on top of that image. Here I'm adding in salmon, just curling it from GitHub and then installing it using some basic bash commands. So, just as a proof of concept: docker run my-image salmon; that's my-image again, just a little bit bigger now, with salmon inside it. Now, Docker is wonderful, but it's also a little bit more complicated, in that it's not always as easy to run some of these scripts; I'll just clear all that. If I run salmon with my-image, sorry, excuse me, talking a bit quickly, we can run it in the container and it's going to work, but the container isn't actually mounted on your system, so there's going to be some disconnect between the files you can see inside Docker and what's happening locally. Purely for time's sake, I do encourage you to go back and actually look at some of this material, but we get up to here: we've added salmon into our Dockerfile and built it again, we can see that salmon runs, but we're going to hit errors, because basically what's happening inside Docker isn't necessarily happening on your system, and you need to mount a volume to make sure that everything is connected. So here, this is probably the example we want to get to; this may or may not work, because we've jumped a couple of steps.
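Before untangling that, here's a sketch of the pieces just demonstrated (the Dockerfile follows the training material; the salmon install step is summarized in a comment rather than reproduced exactly):

    # Dockerfile -- a tiny image with curl and cowsay
    FROM debian:stretch-slim

    RUN apt-get update && apt-get install -y curl cowsay

    # cowsay installs under /usr/games on Debian
    ENV PATH=$PATH:/usr/games/

    # (the training material then adds salmon here, curl-ing a release
    #  tarball from GitHub and copying the binaries onto the PATH)

built and run with:

    docker build -t my-image .
    docker run my-image cowsay Hello Docker!

and the volume mount that connects the container to your local files looks something like:

    docker run --volume "$PWD":"$PWD" --workdir "$PWD" my-image \
        salmon index -t data/ggal/transcriptome.fa -i salmon_index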
OK, so we hit an error... hmm, I don't know why that's not working... ah, I'm in the wrong folder; yeah, there we go. So basically, what happened here: I was in the wrong folder, which didn't help, but the real point is that we need to mount the Docker volume onto the local file system, and that's what's happening here; we're also specifying the work directory, the local directory where this command is being run, then using this image as well as the actual command. You can also do cool things where you push this straight up to the cloud, to Docker Hub; again, this is shown in the training material. You can use docker login (we won't have time to show this today), you can re-tag this image, because what we've done is just implicitly call it latest, which isn't always the best idea, and you can push your image up, using your username plus my-image, and then it's stored externally on Docker Hub, and you, or anyone else, can pull it straight from Docker Hub to use. Probably the main thing I'll finish on here is this: so far the pipeline has been pulling the Nextflow rnaseq-nf image from Docker Hub, but the cool thing is that you can point it at your own image instead. I don't think this is going to work... or it might; let me clear that and try. Here we just run Nextflow with Docker; everything is already set up from our config earlier, but I'm going to use -with-docker my-image, the image we just built with salmon installed... and it works: it has pulled in this Docker image we just created, and it works really nicely. Just because we've basically hit time, we'll touch on this a little tomorrow, but very quickly: Singularity works much the same; it's probably better for local HPC environments, where you don't need to have this daemon working in the background, whereas Docker has its benefits for cloud computing. And you can basically use Singularity to pull a Docker image, and it will create the Singularity image, which is then stored locally on that HPC system, which is really nice. I'd encourage you to go away and have a quick look at this material as well: you can actually use Conda to create an environment, where you basically define the dependencies in a standard Conda environment YAML, using conda env create and so on, and then there's a tool, micromamba I think it's called, yes, micromamba, I got that wrong earlier, that takes this environment YAML and turns it into a Docker image for you. So you kind of get the best of Conda, which is really good but isn't always as reproducible: you can use it to define this environment, push it up to Docker Hub, and other people can use it, with your dependencies nicely interwoven. It's a really nice way of doing it, and it has a lot of benefits, especially with BioContainers, which is a place where these images can be stored. So that's probably where we're going to end it today. I'll try to pick up on this a little tomorrow and refresh some of the stuff that I've skimmed over here, but apart from that, thank you for attending. If you have any questions, please keep putting them into the APAC channel; James and I will continue to monitor it for the rest of the day and tomorrow, and we will pick up again tomorrow.
Thanks so much, and we'll see you then. Goodbye!