Hello everyone and welcome to the next edition of the BioExcel webinar series. My name is Rossen Apostolov and I will be today's host. Today we will have a presentation of a new open source tool, CWLExec, which allows you to run Common Workflow Language workflows on LSF, and it will be presented by Qingda Wang. We also have as a co-host Michael Crusoe, whom many of you know as the co-founder and leader of the CWL project. First, let me mention that this webinar is being recorded; we will upload the recording to the BioExcel YouTube channel, so you can share it later with colleagues or friends who were not able to attend today's event.

Just a few words about BioExcel, for those of you who are not familiar with our center. BioExcel is a center of excellence for computational biomolecular research. We are over two years old now and we work in three main directions. One is to improve the performance, efficiency and scalability of several important applications: GROMACS for molecular dynamics simulations, HADDOCK for docking and integrative modeling, and CPMD for QM/MM simulations. Another main aspect of our work, and also the topic of today's webinar, is our work on efficient workflow environments with associated data integration. There we work with a number of platforms, and CWL is one of the projects that we have been working with and supporting. We also do a lot of training and provide consultancy to academia and to industry, for example pharma companies. We have a number of activities centered around several interest groups; one that might be of interest to you is the Workflows IG. You can find more information on our website, and you can get in touch with us through various platforms: we have forums, a chat channel and a video channel, and you can find recordings of our previous webinars and events.
At the end of today's webinar we will have a question and answer session where each of you can ask Qingda whatever questions you have in mind. While the presentation is going on, you can use the Questions tab on the GoToWebinar control panel, and after the presentation I will let each of you speak directly to Qingda and ask your question. If you have problems with the audio, or you don't have a working microphone, I will read the question on your behalf. If you have any other follow-up questions or points that you would like to discuss, you can always go to our forums at ask.bioexcel.eu, where we can follow up further. And with that small introduction, I will now let Michael Crusoe take over with hosting and introducing Qingda to the audience. Hi Michael, can you hear us?

Yeah, I can hear you. Hi, everybody. I'm really excited to be introducing a new vendor into the CWL ecosystem today. Qingda Wang, as we see here, has a lot of experience in software development and in grid, cluster and cloud computing. We saw the blog post from the LSF group about their experiences running CWL using Toil last summer, and I was quite intrigued to read at the end that they were developing their own implementation directly on LSF. This sort of deeper integration between a workflow language and a specific scheduler, hopefully yielding special optimizations or just a better experience, has always been something we've been anticipating and hoping for as a project. Looking at the number of attendees, it looks like a lot of people are quite interested, so I'll stop talking and let Qingda continue with his presentation.

Okay, thank you, Rossen and Michael, for your introduction. I'm very happy to have this opportunity to present to BioExcel users today. My topic is, as Michael mentioned, CWLExec, a new open source tool to run CWL workflows on LSF. Can everyone hear me properly? Yeah, you sound great. Okay, thanks.
So before I start, maybe I can say a few more words about myself. I'm the principal architect at IBM working on the IBM Spectrum LSF family of products. I have worked for many years on a product called LSF Process Manager, an enterprise workflow software running on top of LSF, and we have a lot of customers in the life science vertical. So we know a lot of use cases and requirements from life science, and hopefully this helps us develop CWLExec better for life science users. Because I'm going to make many forward-looking statements about our project and roadmap, I'm required to show this disclaimer.

With that out of the way, here's the agenda for today's presentation. First, I'm going to introduce IBM Spectrum LSF and LSF Suite. Next, I will briefly go over the Common Workflow Language and its implementations. Then I will cover CWLExec in detail and talk a little bit about what is ahead on our roadmap.

I know there are LSF users in today's audience, but many of you may not know LSF, or may not be that familiar with it. IBM Spectrum LSF is a batch scheduler for the HPC environment: workload management software that helps you optimize your computing resources, whether you have a small or large cluster, and it provides efficient scheduling and sharing policies for maximum job throughput. LSF also has a few add-on products that help improve the user experience and provide additional functionality. For example, IBM Spectrum LSF Application Center is a web-based portal for users to submit and monitor their workload, and Process Manager automates workflows that run on top of LSF. We also have other products to monitor your cluster usage and generate reports. LSF together with these add-on products is packaged into IBM Spectrum LSF Suite. I should mention that LSF Suite for HPC is available for free under the IBM Academic Initiative. So next I will briefly go over the Common Workflow Language.
As Rossen said, Michael Crusoe is the founder and leader of the CWL project, and he has done a BioExcel webinar before covering CWL; hopefully you attended it or have watched the video. Briefly, as you know, there are many open source workflow tools available today, and many of them are very widely used in life science. But one issue with these tools is that each has its own specification or format for the workflow definition, so it becomes difficult for scientists and users to share or collaborate on designing workflows. CWL is an open standard developed by an informal, multi-vendor working group. It is led by the community, and there are many participating organizations in academic research and industry. The CWL specification is designed in such a way that workflow definitions are portable and scalable across various software and hardware environments. As long as a piece of software implements support for CWL and passes the conformance tests, CWL workflows can run in these different systems, and given the same definition, input parameters and data, the result is repeatable. CWL workflows can also run in different computing environments, such as locally, in a cluster, or on the cloud. CWL is also designed with provenance in mind, so you can always get the same result, and the workflow definition becomes documentation of your workflow. There are already many implementations of CWL; on the right of this slide you can see a list of the software, copied from the CWL website, and if you check periodically you will see more and more software implementing CWL. On the platform side, you can see software supporting local execution, HPC environments such as Slurm, Grid Engine and LSF, and the cloud, including AWS, Azure, Google Cloud Platform and so on.
As CWL grows in popularity, we have had LSF users asking us to support running CWL workflows on LSF. Even though there are already many open source implementations that run CWL workflows on HPC batch schedulers, these integrations are generally relatively basic, and this is not just for LSF but for most batch schedulers. They are usually limited to the simplest or most common functionality, so you cannot take advantage of some of the rich features of these batch schedulers, and the implementations can be inefficient; I will talk about this later in more detail. There is also often limited or no ongoing testing or enhancement, even though these batch schedulers continue to release new versions, and there is pretty much just best-effort community support. In order to overcome these limitations, we started working on a new project we call CWLExec. It will be an open source tool to run CWL workflows, for now on LSF. It will have tight integration with LSF and will be fully supported by IBM as long as you have LSF support, and it will leverage many LSF features, such as native container support. The versions we are going to support are CWL draft-3 and v1.0, with a few exceptions: some features we may not support in the first release, including SoftwareRequirement, ExpressionTool, the $include directive, and remote locations in File and Directory specifications. This is primarily due to the effort required; we can certainly add them to CWLExec in a later phase. We actually looked at the CWL workflow definitions in public repositories and found that these features are not that widely used, so hopefully this will not be a big limitation. CWLExec will require LSF version 10.1.0.3 or above; if you use LSF Suite 10.2 Community Edition, the LSF version inside will be sufficient.
I need to point out that the Community Edition of LSF Suite is free and downloadable from the IBM website. I should also point out that CWLExec will be a standalone package: aside from the fact that you need LSF, it will not rely on any other IBM product, which makes it easier for people to use. We plan to release it in the second quarter as part of LSF Suite, and we will put the source code on GitHub under the Apache license. The source code is written in Java.

Here's a look at the CWLExec command line. It is similar to cwl-runner, though we don't have as many options. Basically, you provide the CWL workflow definition file and the input settings file, and this command executes the workflow. The command line options are consistent with the CWL conformance tests, so we can use this command to run them. The parameters include the output directory and the working directory, and we have a new parameter called exec-config that I will talk about in more detail. When you run CWLExec, you get a unique workflow ID, which you can use later to query the workflow status or to rerun the workflow.

Now I'm going to talk about the features of CWLExec in more detail: how we check job completion efficiently with maximum parallelism, how we support LSF submission options, self-healing of workflows, Docker integration, cloud bursting, and how you rerun and interrupt a CWL workflow.

First, efficient checking of job completion with maximum parallelism. As I mentioned, in the open source implementations, to check whether a job has completed, the existing tools often just poll the status of the job periodically. For example, LSF has a command called bjobs that you can use to query a job's details and status, and it is often run, say, every 10 seconds to poll for the job status until the job finishes.
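The command line just described might look roughly like this; since the tool had not yet been released at the time of this talk, the exact option names shown here are assumptions based on the description above:

```shell
# Hypothetical cwlexec invocation: workflow definition, input settings,
# plus output directory, work directory, and the exec-config file.
cwlexec --outdir /shared/results \
        --workdir /shared/work \
        --exec-config lsf-options.json \
        workflow.cwl inputs.json
# cwlexec prints a unique workflow ID, which can later be used to query
# the workflow's status or to rerun it.
```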
Polling like this with bjobs is not efficient. First of all, you can have up to a 10-second delay, so you don't get the job status in real time. Secondly, if you have many workflows and many jobs in the system, running bjobs on all of them over and over can generate a lot of network traffic. Even though LSF can handle a high volume of queries without problems, these commands can still generate network traffic that has an impact on your environment. So in CWLExec we use a command called bwait instead. This command was introduced in LSF 10.1.0.2. Basically, it waits for job completion through a callback, or notification, mechanism, so you get the job status in real time without delay, and at the same time you don't generate that kind of network traffic. We run bwait in a separate thread so that we can ensure maximum parallelism.

On the slide there's an example. In the beginning we have four jobs, J1, J2, J3, J4, which don't have any dependencies; potentially the first job could be a scattered job, so you could have thousands of jobs that can run immediately. We submit these four jobs right away to LSF so that they can run concurrently in the cluster, depending on its capacity. Then there's a job J5 that depends on the output from J1 and J2, so we run bwait in a separate thread to wait for J1 and J2 to finish. As soon as J1 and J2 finish, the bwait command returns, the main thread gets the notification and then starts job J5. The same goes for job J6. In this way we can check job completion efficiently and ensure maximum parallelism.

Next, I will talk about how we support LSF submission options. In LSF, you use the command bsub to submit a batch job to the cluster, and bsub has dozens of options.
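The bwait pattern described above can be sketched with plain LSF commands; the job scripts here are illustrative, and this only shows the shape of what cwlexec is described as doing internally:

```shell
# Submit the independent jobs right away; bsub prints a line like
# "Job <12345> is submitted to queue <normal>", so extract the job ID.
J1=$(bsub step1.sh | awk '{print $2}' | tr -d '<>')
J2=$(bsub step2.sh | awk '{print $2}' | tr -d '<>')

# Instead of polling "bjobs $J1 $J2" every few seconds, block until both
# jobs are done; bwait (LSF >= 10.1.0.2) returns via a notification
# mechanism, so there is no polling traffic and no polling delay.
bwait -w "done($J1) && done($J2)"

# Only now submit the dependent step. cwlexec runs each such wait in its
# own thread, so independent branches of the workflow stay parallel.
bsub step5.sh
```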
Our users pretty much consider it mandatory to be able to specify these submission options when they run CWL workflows on LSF. One example is the resource requirement. In the CWL specification, the resource requirement is limited to CPU, memory and disk only. This is understandable, because the CWL specification is supposed to keep workflow definitions portable; they don't necessarily run on LSF or any workload manager, they can run locally or on the cloud, so workload-manager-specific options may not apply in other environments. But in LSF, the resource requirement can be much more than just CPU, memory and disk. For example, a user can specify a preference for certain types of hosts: do you want hosts with the least load, or with more free slots? And for parallel jobs, you may want to specify how the job should span multiple hosts. So a user can specify very complex resource requirements to match their jobs to the optimal compute hosts. Another fundamental thing users want to specify is which queue to submit their jobs to; the queue can reflect policies like priorities and how groups of users share the cluster. Other options can be the project, the application profile, or whether a job should be rerunnable. As I mentioned, the CWL workflow definition is supposed to be portable, so in order to support these submission options while keeping the workflow definition portable, we introduce a separate configuration file in which you can specify the submission options when you run the CWL workflow. As shown here, we use the exec-config parameter to specify this file, which is in JSON format. These LSF options can be specified at the step level or the workflow level; options at the workflow level apply to every step in the workflow.
If the same option is specified at the step level, then the step-level value overrides what is specified at the workflow level. On the right there's an example. The user has specified a queue called "high", which applies to every job in this workflow, and has also specified that all jobs should be rerunnable. Then in the steps, for example main/step1, the user sets rerunnable to false, which overrides the workflow-level setting, and also specifies a resource requirement for this step: what type of host he wants, that he prefers hosts with the least CPU utilization, that the step requires 500 MB of memory, plus requirements for swap memory, temp space and so on. He can then just use this file when he executes the CWL workflow. Currently we support the queue, project, rerunnable, application profile and resource requirement options when you execute a CWL workflow, and we can easily add more later on.

Next I will talk about self-healing of workflows. What do I mean by self-healing? When a step fails in a workflow, we try to let the job recover by itself without user intervention. We do this in two parts. One is rerunnable jobs, a feature from LSF: as long as you enable rerunnable, then when the execution host for a job goes down, LSF reruns the job automatically on a different host, so the user doesn't have to worry about that part. The second is that we provide a custom post-failure script. When a job fails in a workflow, it is often recoverable after you take some action. Ideally you want a script that checks and fixes the problem automatically whenever possible; the script can then requeue the job and let the workflow continue.
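A minimal version of the exec-config described above might look like the following. The key names mirror the example shown in the talk but should be treated as assumptions until the released documentation confirms them; the res_req string uses standard LSF resource requirement syntax (order[ut] prefers hosts with the lowest CPU utilization, and rusage reserves memory, swap and temp space):

```shell
# Write a minimal exec-config with a workflow-level queue and a
# step-level override, then verify that it parses as JSON.
cat > lsf-options.json <<'EOF'
{
  "queue": "high",
  "rerunnable": true,
  "steps": {
    "main/step1": {
      "rerunnable": false,
      "res_req": "order[ut] rusage[mem=500:swp=1000:tmp=2000]"
    }
  }
}
EOF
python3 -m json.tool lsf-options.json
```

A user would then pass this file to the workflow run through the exec-config parameter; the CWL definition itself stays untouched and portable.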
To continue with the post-failure script: for example, a job may fail due to insufficient memory, and you can have a script that detects the failure reason, modifies the job submission parameters to increase the memory requirement, which you can do through the bmod command in LSF, and then requeues the job with the brequeue command, so the job runs again with a higher memory requirement. As you know, it is always very difficult to estimate a job's memory requirement accurately, and if you specify too high a requirement you may over-reserve your resources. So it is best to make your best estimate, and then if the job occasionally fails, this kind of script can requeue it and the job can still succeed. The script may also potentially just check a job log; if it can fix certain problems, it fixes them and then requeues the job. We support this kind of custom post-failure script through the exec-config file we talked about before. On the right there's an example: you can specify the path to your script, a timeout value in case the script is not written properly and runs for a long time or even hangs, and how many times you want to retry. If a user has configured a post-failure script, then when a job fails, CWLExec runs the script, passing in the job ID, the submission command, the job command and all the other necessary information, so that the script can do its checks and take action properly. If the script runs successfully, CWLExec considers the job to have been requeued and goes back to waiting for the job to finish. If the script fails, CWLExec considers the job to have really failed and stops the workflow. This makes it possible for the workflow to be self-healing as much as possible, and hopefully this is a useful feature. Next, I will talk about the Docker integration.
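Before moving on, here is a concrete sketch of the post-failure configuration just described, together with the outline of a memory-bumping recovery script. The JSON key names and the script's argument interface are assumptions based on the description above; only bmod and brequeue are real LSF commands:

```shell
# Hypothetical post-failure section of the exec-config: script path,
# timeout, and retry count, as described in the talk.
cat > recover-config.json <<'EOF'
{
  "post-failure-script": {
    "script": "/shared/scripts/bump-memory.sh",
    "timeout": 300,
    "retry": 2
  }
}
EOF
python3 -m json.tool recover-config.json

# A bump-memory.sh along the lines described above would do roughly:
#   JOBID="$1"            # passed in by cwlexec (interface assumed)
#   bmod -M 8000 "$JOBID" # raise the memory limit (MB, illustrative value)
#   brequeue "$JOBID"     # requeue the job
#   exit 0                # zero exit => cwlexec treats the job as requeued;
#                         # non-zero => the step has really failed
```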
The CWL specification supports dockerized jobs, and there are some security issues when you use dockerized applications. This is not specific to CWL; it's a generic issue. To run Docker jobs while avoiding sudo, users often need to be placed in the docker user group, and because the Docker jobs run as root, these docker group members effectively get root-equivalent privileges. This is not really an issue when Docker is used to run services started by root, but it can be a potentially serious issue when users are allowed to run arbitrary jobs, as in an HPC environment. A second issue is that many businesses have concerns about users being able to arbitrarily use containers from external registries such as the public Docker Hub; there can be concerns about the security, provenance, or auditability of these Docker images. LSF has integration with containers including Docker (it also integrates with other container technologies like Shifter, Singularity and so on), and LSF addresses these security issues for dockerized applications. First, we let the administrator use application profiles to configure dockerized applications. Application profiles are configurations in LSF used to define common requirements and parameters for the same types of jobs; for example, for a certain application you might specify common requirements such as memory and CPU limits, pre- and post-execution scripts and so on. For dockerized applications, the LSF administrator has the ability to approve and configure which registries and which images can be used, and also controls, through the application profiles, which docker options can be used. LSF starts the Docker containers as the LSF administrator, removing the need for all the HPC users to be in the docker user group. That means end users never gain elevated privileges; the only user ID you need to put in the docker user group is the LSF administrator.
Of course, there is also a configuration setting to specify a different user ID to start the Docker jobs. This diagram basically illustrates the points I've just mentioned. The administrator manages which registries and images can be used through the application profile, which is defined in the lsb.applications file, and the user just needs to specify -app, that is, which application profile to use. LSF then pulls the image and runs the Docker job. Regarding the CWLExec integration, we expect the administrator to configure an application profile for the Docker jobs; for now, typically one profile per registry. In the configuration there's a CONTAINER line that specifies which images can be used. In this example, the administrator has specified quay.io as the registry, and the image name is passed in through an environment variable called LSB_CONTAINER_IMAGE, which is set when CWLExec actually runs the bsub command. This pretty much requires the Docker images to come from the quay.io registry. For the docker options, the administrator has specified a script; you can hardcode docker options, but LSF provides the flexibility to use a script to generate them. We have a sample script here: basically, you pass multiple docker options in through a single variable, LSB_CONTAINER_OPTIONS. The last parameter is the starter, the user ID that will start the Docker job. After the administrator configures these application profiles, the user just needs to specify the application profile in the exec-config for the Docker jobs in his CWL workflow definition, making sure the registry matches the application profile.
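An application profile along these lines might look like the following sketch of an lsb.applications entry. The CONTAINER parameter is real LSF syntax, but the profile name, the script path, and the exact variable usage here are illustrative assumptions:

```shell
# Write a sketch of a Docker application profile restricted to quay.io,
# with docker options generated by an administrator-controlled script.
cat > lsb.applications.fragment <<'EOF'
Begin Application
NAME        = quaydocker
DESCRIPTION = Docker jobs restricted to images from quay.io
CONTAINER   = docker[image(quay.io/$LSB_CONTAINER_IMAGE) options(@/shared/scripts/docker-options.sh)]
End Application
EOF
grep -q 'Begin Application' lsb.applications.fragment && echo "profile written"
```

On the user side, the exec-config for a dockerized step would then just reference this profile (something like an "app": "quaydocker" entry), and CWLExec fills in the image name and the necessary volume mounts automatically.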
CWLExec then submits the Docker job and automatically passes in the image name and docker options such as volumes, because CWLExec knows which work directory, output directory and temporary directories potentially need to be mounted into the container. This way it should be pretty simple for the administrator to configure, the end user doesn't need to know all these details, he just needs to specify the correct application profile, and this makes it more secure to use Docker jobs.

The next feature I'm going to talk about is the cloud bursting capability of LSF. LSF has a component called the resource connector that adds cloud bursting capability to LSF. It enables LSF to automatically borrow and launch hosts from a cloud provider to join the cluster when workload demand in the LSF cluster is high. The cloud providers we support include AWS, IBM Cloud, Microsoft Azure and Google Cloud, pretty much all the main cloud providers, and the resource connector is able to automatically flex resources up and down based on the workload in your cluster. When the workload is high, it borrows resources from the cloud provider to join the cluster, and when the workload goes down and the borrowed hosts have been idle for a certain amount of time, the resource connector returns them to the cloud provider. So you can leverage the on-demand capability of the cloud infrastructure to borrow as many resources as you need, and you pay only for what you use. The resource connector also has policies to configure, for example, when you want to start bursting, the maximum number of hosts you can borrow, and so on.
This is mainly a configuration effort by the administrator; the end user just needs to specify the queue or resource requirement to be able to borrow resources from the cloud, and in this way CWL workflows can run in the cloud when your on-prem cluster does not have enough resources.

Finally, I will talk about rerun and interruption. If a workflow fails, it exits with a non-zero code because some jobs have failed. You can rerun the flow through the CWLExec command with the rerun option, and the rerun starts from the failed steps, not from the beginning. A user's workflow can be long-running with hundreds of steps, and users don't want to rerun the flow from the beginning; usually you only want to rerun from the failed steps. You can also interrupt a running CWL workflow by pressing Ctrl-C: CWLExec stops, it will not submit new jobs, but it lets the already submitted jobs continue to run.

That is all the features I wanted to cover, and next I want to talk about our roadmap looking ahead. The previous slides basically talked about how you use CWLExec to execute CWL workflows, which is pretty much a CLI mode. We will add a server mode to CWLExec, where the server runs as a web server providing RESTful API endpoints.
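Staying with the CLI mode for a moment, the rerun flow just described might look like this on the command line; the flag name is an assumption, since it predates the release:

```shell
# First run: cwlexec prints a unique workflow ID.
cwlexec --exec-config lsf-options.json workflow.cwl inputs.json

# If some steps fail, rerun from the failed steps only, not from the
# beginning (hypothetical flag name):
cwlexec --rerun <workflow-id>

# Ctrl-C during a run stops new submissions; jobs already submitted to
# LSF keep running.
```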
In server mode it acts as a central server to execute and manage flows from many users; it is a central place to manage all the flows, and you can achieve scalability and so on. The RESTful API will include calls to execute a CWL workflow and to get a list of the finished and running workflows, optionally with a filter. You can also use the API to query the details of a specific workflow, such as the details of each step, and we will provide APIs to control CWL workflows, including kill, suspend, resume and rerun. We intend to include the server mode in the open source project.

Using the CWLExec server mode as a back end, we will also implement a GUI for CWL workflows. The GUI will be done in IBM Spectrum LSF Application Center, as an integrated environment for users to visualize, manage and monitor all their workflows. The capabilities include management of CWL workflow definitions and visualization of the definitions and of the running instances; the visualization will be similar to the CWL Viewer maintained by the University of Manchester, and you can also monitor the progress of a workflow instance visually. Additionally, you will be able to select input and output data and view its content, as well as the work directory of the workflow instances and jobs, all in this single integrated environment. You can also control the workflow instances: you can select a workflow instance to kill, suspend, resume or rerun. Many of these capabilities are already implemented in LSF Process Manager; we are just going to port or migrate them to support CWL workflows.

These slides illustrate these capabilities. This is LSF Application Center: you will be able to see a list of all the CWL workflow definitions in your system, pick one of them and view its graph, and we will show a chart similar to CWL Viewer where you see all the steps and their relationships. And if there are
subflows, you can select one and expand or drill into its details. You can also select a workflow definition and execute it by providing the input settings file. Then you have a list of all the running workflows, and you can select one and view its chart, which shows the progress of this workflow instance; the status of the steps is shown in the graph through color-coded states: gray indicates that a step has finished, green means the step is running, and yellow means the step has not started yet. You can also click on an input or output file and tail or view its content; when you tail, you get the content in real time.

So this is pretty much our roadmap, and I'm coming to the end of the presentation. I want to thank everyone for joining the webinar. You are encouraged to download and try CWLExec when it is available in the second quarter, and we always welcome collaboration and contributions from you. If you want to be informed of further developments, feel free to just drop me an email, and you can also just give me feedback. I think that's the end of my presentation, and I'm open for questions. Thank you.

Thank you; it's great to see what you've accomplished so far and your vision for the future. I just want to invite everybody to send in any questions you may have; we've got some time set aside for this. We have one question already, a kind of straightforward one, asking if there's going to be a recording of this talk available. Yes, the recording is being made, and hopefully in a couple of days it will be public. In fact, there will be another opportunity to ask questions in a sort of virtual face-to-face format next week at the CWL community call, which will be on Tuesday at, let me get the time zones right, because we're also going to start a half hour earlier than we normally do, at 1500 UTC, and the email about that will go out to
the CWL community mailing list. So, great time for questions; let's see here if there are any others yet. I have a question for you: have you tested CWLExec with any of the publicly available CWL workflows? That's our plan, actually; we looked at the CWL definitions available on the public websites and we have picked a few that we want to make sure can run with our tool. Great.

All right, Maxine Sherman writes: will the slides be shared? Yes, the slides will be shared; I think they will be downloadable from the BioExcel website. Great.

And we've got a question here from someone new to this interface. Looks like somebody unmuted me, am I speaking now? I guess my question is: if we have not yet upgraded to 10.1.0.3, how can we use CWLExec? Okay, yeah, so as I mentioned, we need bwait, and bwait actually arrived in 10.1.0.2, so if you have 10.1.0.2 then the majority of the functionality should work. We need 10.1.0.3 mainly for the Docker options, and of course for LSF the upgrade is always available. Okay, so if we are not looking at using the containerization right off the bat, and if we are able to implement our own bwait command, will that work? If you implement bwait it should work, but make sure: we basically do bwait -w with the list of job IDs and the relationships, for now, so if you satisfy that then it can work. This isn't something we're hoping to do long term, but just until we get that upgrade, if we want to try things out: theoretically, if we write our own bwait command, it should be able to find it in the path and then work, correct? Yeah. Okay, cool, thank you.

So Michael, actually I can read the questions. Sorry, my machine rebooted, so thanks for doing that. No problem. So I will just go on to the next question, from Igor Kozin. Yeah, so the question is: is PAC needed for the Docker integration, or is it integrated directly into LSF? The answer is that it is integrated directly into LSF, so you don't need to have PAC. Igor, you'll need to speak up a
bit, we can't hear you. Igor doesn't have a mic; Igor, if you'd like to respond in the chat or ask another question, reply there. Okay, he's good. Thank you, Igor, for your question.

So we've got the next question, from Manabu Ishii: features like the post-failure script, would they be seen as valid by the schema-salad tool? I think I can answer that one. The post-failure script, as shown in the example, Manabu, was in that external configuration file that's passed separately, so it's not part of the CWL description. If it was implemented as an extension to the spec, which is totally allowable, and that was supported by CWLExec, then the schema-salad tool would be fine with it, as long as it had a namespace and a schema specified, just like other extensions. Manabu, does that answer your question? Thank you, I understand.

Any other questions, if anybody else has anything? You said, Qingda, that your life science customers had asked for CWL support; did you have interest from customers in other segments? For now, I'm only aware of customers from life science. I know CWL has probably already expanded beyond life science, but our requests came from life science customers.

Great. So in your Docker integration with the -app support, does that mean Docker containers can only be specified by this sort of parallel configuration, or will you be able to translate CWL's Docker requirements to see if they're supported? Actually, we have a fallback: if you don't specify the application profile, we just assume the user is in the docker user group and we run docker run directly. Okay. And then there's a technology called udocker that kind of gives a fake Docker experience, and we've had success using it at the European Bioinformatics Institute; has your team played with udocker yet?
We haven't looked at it yet. It's in the reference runner if you want to take a look at that, and also in Toil. Yep, we will have a look.

We've got a few more minutes, one more minute for questions, if anybody else has any. I just want to take another opportunity to thank you again, Qingda. I am really excited to play with this myself, and I know a lot of customers and users out there are as well, so Q2 can't come soon enough. Thank you, Michael, as well.

Well, that's all for now, so look out on the BioExcel mailing list and website for the recording of this webinar, and again we'll be emailing shortly to the community group about another Q&A session as part of the CWL Community Call next Tuesday. Take care, all. Cheers. Thank you. Bye.