from Children's Hospital of Philadelphia and the Kids First project, and I will go into a little more detail about the usage of Rcwl and RcwlPipelines on some of these platforms later. With all these advantages of using workflow languages, there are some intrinsic challenges. First, a workflow language is a command line tool itself, so there is a steep learning curve for researchers, even for skilled data analysts. Also, workflow languages mainly deal with the command line tools involved in the preprocessing steps, so they have poor integration with downstream statistical analysis tools such as R/Bioconductor, which can cause reproducibility issues from the modeling step onward. These two packages are designed to connect the preprocessing steps and the modeling steps and to provide reproducible and standardized data analysis tools and pipelines in R/Bioconductor. First, we chose one of the most widely used workflow languages, CWL, and wrote Rcwl, a robust and scalable R/Bioconductor interface for it. Based on that, we developed RcwlPipelines as a catalog that collects commonly used and emerging bioinformatics tools into reproducible CWL pipelines available in R. Currently we have 214 tools and pipelines in total.

This figure shows the basic usage of the two packages; the colored blocks are function names. Basically, a user starts from RcwlPipelines: first they search for a specific tool or pipeline of interest and then load it into the R programming environment to be used. One can also write their own tools using the Rcwl package, build a tool or a pipeline, and contribute it to RcwlPipelines to benefit other researchers. The existing pipelines and tools are customizable with some utility functions, which I will show in the lab demo later; they are easy to customize, for example by changing the Docker image version of a specific software or by changing the arguments of a specific tool. After that, we are ready to run the tool or pipeline within the R environment, which combines both the preprocessing tools and the data analysis tools using R functions.

Here I will show some comparisons of different ways of doing the data analysis. First, the traditional way: writing batch scripts around command line tools in a command line interface. In the first step, you need to search for and download the software and compile it for your computing environment, so there can be issues with software dependencies and conflicts, and you need to check frequently for software updates. In the second step, you write on-premise batch scripts in which you call the specific tools and define the inputs and outputs between the different steps. There can be reproducibility issues here: for example, you need to track the software tool versions carefully, because different versions can lead to different results, and the inputs and outputs between tools are hard-coded into your scripts, which can also cause problems. It is also relatively time consuming, because you need to wait for one step to finish and then manually start the next step. If you work on high performance computing, you also need to write scripts for the workload management configuration, such as for different job scheduling systems, and these have to be written repeatedly, which is quite inefficient.
So the Common Workflow Language, as one of the workflow languages, addresses these challenges. First, as I said, it uses Docker to track the specific version of each software tool, and it connects all the tools together so that your data analysis pipeline is more automated and streamlined. But if you use CWL directly, it is, as I said, a command line language itself. First you need to write a script; here I am using the STAR index step as an example. You write the CWL script following all the conventions: the base command is the software tool, STAR, and in the requirements section you specify the Docker image with the version that you are using in your pipeline. You also need to specify all the inputs involved in this tool or pipeline; here we have four arguments that need to be wrapped into this tool, and genomeDir specifies the output directory prefix. As in the command line, you need to spell out the grammar, such as how arguments are separated from their values and the default values for some of the arguments; some do not have default values, so you need to define each of them. You then specify the outputs in a similar way. In the second step, you write an input file, here STARindex.yaml; it can be YAML or JSON, and in it you assign values, such as a file path or a string for a directory name, to the input parameters defined in the CWL script. In the third step, you need a CWL runner to submit the job, which involves downloading and installing more software; one CWL runner you may be familiar with is cwltool, and there are other runners as well. Then you run the tool in the command line by passing the two files, the CWL file and the YAML file, which produces the results for the whole pipeline. So this is a very good approach, but as we can see it requires a lot of work, effort, and skill, and there is a steep learning curve.

Here I will show you how to use Rcwl to do the same data analysis. The example is to use the STAR software to index the genome for single-cell RNA-seq data. In the first step, you just install the Bioconductor packages as you would any other package, using BiocManager, and load them into the R session. If it is the first time you use them, or it has been some time since your last use, you need to run the cwlUpdate function to sync all the existing tools, since new tools are constantly being added to the RcwlPipelines repository. A user then typically starts from cwlSearch, using one or more keywords to check whether a specific tool is included; if you want to know whether we have this tool or that tool, you can use this function to search with any keywords. It returns a cwlHub listing the titles of the tools and pipelines included in our repository, where the pl_ prefix represents a pipeline and the tl_ prefix represents a tool. Here we use the cwlLoad function to load the STARindex tool, and once it is loaded into R it is ready to be used. Then we check its input arguments, that is, which values we need to assign to run the data analysis. We can see there are four arguments and two have default values: runThreadN has a default value of 4, and the genome directory name defaults to "STARindex".
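(For reference, a minimal sketch of those installation and loading steps in R; the functions shown are the Rcwl/RcwlPipelines core functions, but the record name passed to cwlLoad is illustrative and should be taken from the cwlSearch results.)

```r
## Install once, as with any other Bioconductor package
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Rcwl", "RcwlPipelines"))

library(Rcwl)
library(RcwlPipelines)

cwlUpdate()                           # sync tools added since your last use
cwlSearch(c("STAR", "index"))         # keyword search; pl_* = pipelines, tl_* = tools
STARindex <- cwlLoad("tl_STARindex")  # record name is illustrative
inputs(STARindex)                     # which arguments need values; runThreadN and genomeDir have defaults
```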
So we only need to assign values for the middle two parameters of this tool. Here I show some demo code: once you assign the values, the STARindex object is effectively a combination of the .cwl file and the .yaml file, so you can use the runCWL function to submit the job and specify the output directory where all the outputs will be saved. Internally it is submitted as a CWL script, as shown on the previous page. It also supports parallel computing through the runCWLBatch function, and it supports different container systems on your high performance computing cluster.

Next I will show you how to use the packages as a developer. If you do not find a tool included in RcwlPipelines, you can write your own; here I will show you how to write your own tools as a developer. It is quite simple, and I am using STARsolo as the example tool to wrap. First you specify the inputs: there are two inputs, which you define as input parameters, and there are two outputs you want this tool to produce, which you define as output parameters. Then you define the command line tool; here it is a command line tool, but it can also wrap an R package or an R function. With that, you have a CWL tool in R.

You can also define a pipeline. Here I am connecting STARsolo, a command line tool for aligning single-cell RNA sequencing data, and DropletUtils, which can be used to read the STARsolo output into R as a SingleCellExperiment and also has functionality for further filtering. Once these tools are defined, the next step is to connect them into a workflow. First you define the inputs for the whole workflow, not for a specific tool, and also the outputs for the whole workflow, which can come from different steps. Then you define each tool as a workflow step, and you have a STARsolo workflow in R. So we have basic functions for defining the inputs, outputs, tools, workflows, and workflow steps, plus some advanced functions for customizing the tools the next time you use them with different settings.

Here I show some demo code, more like pseudocode, for how to wrap the tools; I will show more details in the lab demo at the end. First you use InputParam to define the two inputs and OutputParam to define the outputs, specify the Docker image, and then use cwlProcess to wrap the STAR tool by passing in all of those elements. To build a workflow, you specify the input parameters for the workflow and the output parameters for the workflow, which can come from different steps, then use cwlWorkflow to wrap it, specify each tool as a cwlStep, and use the plus sign to connect them into a workflow. Then you can execute the pipeline within R by assigning values to the input parameters; you can use runCWL on your local computer, or runCWLBatch to run it in parallel.
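(A rough structural sketch of those developer steps. The ids, the Docker image tag, and the counts-to-SCE wrapper script are placeholders rather than the real STARsolo/DropletUtils parameters, so treat this as an outline of the cwlProcess/cwlWorkflow/cwlStep pattern, not a working pipeline.)

```r
library(Rcwl)

## 1. Wrap each step as a cwlProcess (ids and Docker tag are placeholders)
alignTool <- cwlProcess(
    baseCommand  = "STAR",
    requirements = list(requireDocker("quay.io/biocontainers/star:2.7.10a")),
    inputs  = InputParamList(InputParam(id = "fastq", type = "File")),
    outputs = OutputParamList(OutputParam(id = "counts", type = "Directory",
                                          glob = "Solo.out")))

sceTool <- cwlProcess(
    baseCommand = c("Rscript", "counts_to_sce.R"),   # hypothetical wrapper script
    inputs  = InputParamList(InputParam(id = "counts", type = "Directory")),
    outputs = OutputParamList(OutputParam(id = "sce", type = "File",
                                          glob = "sce.rds")))

## 2. Define workflow-level inputs/outputs, then connect the tools as steps
wf <- cwlWorkflow(
    inputs  = InputParamList(InputParam(id = "fastq", type = "File")),
    outputs = OutputParamList(OutputParam(id = "sce", type = "File",
                                          outputSource = "toSCE/sce")))
wf <- wf +
    cwlStep(id = "align", run = alignTool, In = list(fastq = "fastq")) +
    cwlStep(id = "toSCE", run = sceTool,   In = list(counts = "align/counts"))

## 3. Assign values to the workflow inputs and execute
wf$fastq <- "sample_R1.fastq.gz"
runCWL(wf, outdir = "results")        # or runCWLBatch() for parallel runs
```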
So the take-home message: Rcwl and RcwlPipelines provide CWL workflows within a user-friendly R programming environment. They use Docker, so the software is trackable and reusable, and they are compatible with common job schedulers such as SGE and Slurm. We have more than 200 pre-built tools and pipelines, which are open source and open development, and we hope you can contribute to and benefit the research community by wrapping your own tools and contributing them to the RcwlPipelines repository.

We have also successfully used and applied Rcwl on the cloud. We have a package, Rcwl cloud, that builds a bridge between Rcwl/RcwlPipelines and cloud computing platforms such as the CGC and Cavatica. We have done some high-throughput RNA-seq data analysis on the NCI Cancer Genomics Cloud platform, and we have also made this CWL workflow for RMAX Turbo on the Cavatica platform, in collaboration with Dr. Xin's team. AnVIL does not have native support for CWL, only for WDL, but we can still use Rcwl to take advantage of this platform through a Jupyter Notebook, submitting the CWL scripts in the form of Rcwl tools.

So I will go to the lab demo, and here is my contact information; I will share my slides later. The lab demo can be found on GitHub under rworkflow, 2022_Rcwl_demo; it is under the vignettes folder, in demo_Rcwl.Rmd. I will show you my terminal. Is the font clear, or should I zoom in? I hope it is okay to see. I will start showing you how to use the packages; I think I have about 15 minutes.

In this workshop I will first demonstrate how to use the pre-built Rcwl tools and pipelines in R with a case study of single-cell RNA sequencing data preprocessing, and then I will use some example code to demonstrate how to build Rcwl tools and pipelines. I will have to go a bit faster. In this case study, we use 10x Genomics data, and we will use STARsolo for the alignment, quantification, and filtering, which produces a high-quality count matrix from the FASTQ files. Before the alignment, we need a one-time indexing step using the STAR index tool, which is also included here. First we install the packages using BiocManager, and we also use several other packages: git2r to download the example data, DropletUtils for converting the data to a SingleCellExperiment object in R, and BiocParallel, if needed, for running steps in parallel. So first I will load these three packages, and then I will show you what data we are using.

Here we use real single-cell sequencing data, the 1k PBMC dataset from 10x Genomics. The dataset is downsampled from the source files to contain only 15 cells instead of 1000, so that each step can be done within one or two minutes, and it is further restricted to include only reads on chromosome 21. You can just evaluate this code, which uses git2r to clone the dataset, and I will show you what data we have: here are the four FASTQ files, one GTF file for the alignment, the chromosome 21 FASTA file, and the barcode file. We also need to create an output directory, so I define one and create the folder to contain all the outputs. Then we can proceed with the data preprocessing. There are three core functions in RcwlPipelines: cwlUpdate, cwlSearch, and cwlLoad.
They will be needed for updating, searching, and loading the needed tools or pipelines into R. First, we use cwlUpdate to sync all the tools; I don't think we need to run it here. We can see a list of the tools: currently we have about 214 of them, with names starting with pl_ for pipelines and tl_ for tools, and we can see there are 45 pipelines and 170 tools; I think I just added one of each before this talk. Then we use cwlSearch for a specific tool or pipeline; here we search for "STAR" and "index", we can see the tl_ record, and we can then use the cwlLoad function to load it into our R session for the data analysis.

Now we have the STARindex object, which looks like this. It looks quite complicated and is very similar to the .cwl file, and it contains a lot of information, but we can use some basic utility functions to return specific pieces. We can use the inputs function to see which arguments we need to assign values to, and we can check the base command and see that we are using the command line tool called STAR, along with the Docker image being pulled; the version of STAR is 2.7.10. This is also how we customize the existing pre-built tools: you can change the version of STAR and still take advantage of all the other pre-built pieces.

For the data preprocessing, let me run this step first, because it will take about one minute, and then come back for the explanation. Before the read alignment, we index the genome, equivalently to the command line, using our Rcwl STARindex tool within R, which is internally translated into the CWL scripts; we only assign values to the input parameters and execute. Here I have assigned values for two of the parameters, as I showed in the slides: the file paths for the genome FASTA file and the sjdbGTFfile for the indexing. Then we use the runCWL function to submit the job and specify the output directory; here I am using "STARindex_output", and docker = TRUE. The docker argument of runCWL takes four values: by default it is TRUE, which automatically pulls the Docker image for the required command line tool as specified in the requirements; it can be FALSE if you already have, say, the STAR command line tool pre-installed on your computer, so that you do not need to pull the Docker image; and it can also take "singularity" or "udocker", depending on your system configuration.

Okay, now the job is done and we can see the final process status. Because I set showLog to TRUE, we see all the logs here, which can make debugging easier, and we can take a look at the output files. Here are all the output files in STARindex_output; these files are ready to be passed to the next Rcwl tool, STARsolo. We then use STARsolo to map, demultiplex, and quantify the single-cell RNA-seq data against the index. Again, I will load the tool, assign values for the inputs, and run it first, because it takes about two minutes. The steps are very similar: you search for the tool using multiple keywords, here "STAR" and "solo", load it into the R environment, and then specify the FASTQ file paths and pass them to the parameters of the STARsolo tool.
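(The indexing call just described would look roughly like this. The file paths are placeholders for the demo data, and the input ids mirror STAR's own argument names but should be checked against the output of inputs(STARindex).)

```r
library(Rcwl)
library(RcwlPipelines)

STARindex <- cwlLoad("tl_STARindex")           # record name is illustrative
STARindex$genomeFastaFiles <- "data/chr21.fa"  # placeholder paths for the demo data
STARindex$sjdbGTFfile      <- "data/chr21.gtf"

runCWL(STARindex,
       outdir  = "STARindex_output",
       docker  = TRUE,    # TRUE (pull image) / FALSE / "singularity" / "udocker"
       showLog = TRUE)    # print the tool logs, handy for debugging
```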
Once we have assigned values for STARsolo, it is now a combination of the .cwl file and the .yaml file, so it is ready to be submitted as a job, just as you would submit it in the command line, and it is submitted internally as a CWL script. The difference is that this is a single tool; if you are using a pipeline, the tools will be run one by one automatically.

While it is running, let me take a look at the chat; there seems to be a lot of activity there. One of the threads in the chat is about different CWL and workflow language resources, for example Dockstore, nf-core, and Snakemake, with various different repositories, and the question is whether there is a common bioinformatics CWL search so that you can search multiple repositories. Yes, we do have a function for containers, with which we can search all the available Docker images from sources such as quay (BioContainers) and Docker Hub, and you can then choose any image from the list. I actually have some code for this later in the demo, where you can see how to search the available Docker images and choose the one you like most; it will come a little later.

Now I will show you the results of STARsolo. Listing the output directory, we have the folder called Solo.out, and we can read the summary.csv file from the output, which is very similar to the Cell Ranger summary and is useful for quality control. The next step is to import the count data into R: having finished the data preprocessing, we may want to do some further data exploration interactively in the R environment. We can use the counts-to-SCE tool to convert the output into a SingleCellExperiment object. We search for it, load it, and take a look at its inputs: we only need the directory name of the output files from STARsolo. We specify it and submit this tool, so that we get an RDS file containing the SingleCellExperiment. We can load it from the results output: the results output looks like this, it is a file path, and we can just read it into R to get the SingleCellExperiment. This tool works by wrapping the Bioconductor package DropletUtils; it uses the function called read10xCounts. Please note that this integration of R packages is only supported through Rcwl, not in the original CWL format; this is a unique feature of Rcwl.

Instead of running the tools separately, step by step, as we just did with STARsolo and counts-to-SCE, we can alternatively use the pipeline pl_STARsolo_to_SCE to run STARsolo and counts-to-SCE all together. Because it is a pipeline, you do not need to run the steps one by one. We do the search and load and look at the inputs for the whole pipeline; the input argument names are a little different from those of the individual tools. Then we assign values for each of the arguments and run the whole pipeline, which combines STARsolo and the counts-to-SCE step built on the Bioconductor package DropletUtils. This would run the previous steps again and generate all the results as defined, so I will not run it; I will just stop it here and show you the code.
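(A sketch of that import step: the STARsolo output directory can be read into a SingleCellExperiment either through the RDS produced by the counts-to-SCE tool or directly with DropletUtils; the paths here are placeholders.)

```r
library(DropletUtils)

## Read the filtered count matrix produced by STARsolo (path is illustrative)
sce <- read10xCounts("STARsolo_output/Solo.out/Gene/filtered")
sce

## Or, if the counts-to-SCE tool/pipeline was run, load its serialized result
# sce <- readRDS("SCE_output/sce.rds")
```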
So, as was just asked, we can inspect the tools and pipelines. Here we have the STARsolo tool, with STAR and its version, and we can check the base command and the requirements, which show the Docker image from quay. We can also see the requirements of a specific step of a pipeline; this is the pipeline in which one step is STARsolo, and we can see its requirements, and also the arguments, which are hidden arguments with default values that are not exposed as inputs of the tool; you can make changes to these arguments as well. And we can see the inputs, as I just showed you.

So, for example, we can change the Docker containers. RcwlPipelines has a function called searchContainer, and with it we can see a whole list of different versions of STAR from different sources. By default there are two sources, quay and Docker Hub, and each returns a full list of all the available Docker images for STAR, so we can see the different versions of STAR that we can use. By default, all the pre-built tools use the most up-to-date version; here we are using 2.7.10, but we can make changes here. I hope I have answered your question.

I think I have a little more time, so I will show you how to use Rcwl as a developer to build Rcwl tools, with very simple examples. We load the library and wrap the echo command as an Rcwl tool. First we define the input for echo: after echo you put a value so that it prints something. We define the input with id "sth" and type string, and we wrap the echo tool with cwlProcess, specifying the base command as echo and the inputs as input1. Now echo is an R tool that can be used within R. We assign a value to echo's input parameter, submit the job, and then we can look at r1 and r1's output and read it out; since it is standard output, we can use readLines to read it.

We can also wrap a bash script. Here we define a very simple bash script, also doing echo hello something. For the script we define the input parameter in the same way as in the previous example, and we add a shell script requirement for the script we have just defined. Then we define another tool, echo_b, as a cwlProcess, setting the base command to the shell script along with the requirements we defined and the input parameter. The echo_b tool looks like this: the base command is the bash script .sh, so it is a command line tool that runs the script, and here is the content of the script, which echoes the input. Again we define the input parameter, wrap it with cwlProcess, assign a value to the input parameter, run the job, and look at r3 and its output. In the same way, we can also wrap an R function into a tool and incorporate it into your whole pipeline as one step.

There are also some extension materials on submitting parallel jobs based on BiocParallel, where you define some batch parameters and use runCWLBatch; this is more commonly used than runCWL if you are working with big data, and you can check the output afterwards.
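(The echo example written out as code, essentially as in the Rcwl documentation; the $output accessor on the runCWL result is assumed here, so check the output directory directly if your version differs. For parallel or cluster execution, runCWLBatch takes a BiocParallel back end, for example BatchtoolsParam configured for Slurm or SGE, as mentioned above.)

```r
library(Rcwl)

## Define one string input, then wrap `echo` as a cwlProcess
input1 <- InputParam(id = "sth", type = "string")
echo   <- cwlProcess(baseCommand = "echo",
                     inputs = InputParamList(input1))

echo$sth <- "Hello World!"             # assign a value to the input parameter
r1 <- runCWL(echo, outdir = tempdir()) # submit the job
readLines(r1$output)                   # read the captured standard output
```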
Yeah, I think that is all the information. I am sorry if I went too fast; I will go back to my slides to show my contact information. Please feel free to contact me by email or other channels, and you are welcome to contribute your own tools and pipelines. Thank you so much.