Connecting Galaxy to a compute cluster. Let's find out how we connect Galaxy to a compute cluster and how we can configure job-dependent resources like cores, memory, and so on for our distributed resource manager. That's an acronym we'll use a lot today: distributed resource manager (DRM) just means any kind of clustering or job scheduling software, like Slurm, SGE, HTCondor, all those sorts of things. By the end of this you should be able to understand how the Galaxy job stack works, understand a little bit about the Galaxy job configuration file, and know how we map tools to job destinations and the various ways in which they can be mapped.

Let's start by talking a little about the Galaxy job configuration. We set up a template in the Ansible Galaxy tutorial to handle the job configuration; it was very simple. Then we came back in the Singularity tutorial and added to it the parameters necessary to tell jobs to run in Singularity containers. Now we're going to do a bit more with that, and as a result you need to know a little bit more about the Galaxy job configuration. This was also covered in the Pulsar tutorial, where you added a different plugin, a different handler, and different destinations as well. We're going to do more with it now.

So: plugins. These are the different distributed resource managers that we have access to, and how we talk to them. Then we have handlers; these are responsible for the entire lifecycle of a job, from creating the shell script that gets sent to the DRM, to submitting that script to the DRM, to monitoring the process, to cleaning it up. Then we have destinations; these are groups of places where jobs can go that share common parameters. Additionally we have tools; these are the individual Galaxy tools and how we want to map them to individual destinations. Resources are a section that lets us add some options to the tool form that the user can select; we can provide, for instance, a select box saying "I know this job is going to be very quick, I have a very small file, so give it fewer resources", something like that. And finally limits: these help preserve the quality of service of your cluster by limiting the runtime of jobs, limiting the maximum number of concurrent jobs, that sort of thing.

So why use a cluster? A cluster is a fantastic idea simply to preserve the performance of the server. Even just one other host already helps a little bit: by moving some of the Galaxy jobs off onto another server, when that server fills up with jobs it doesn't affect the main Galaxy server. Additionally, by using a proper distributed resource manager, we can ensure that Galaxy can be restarted and the jobs won't be killed. With the local runner, any job that is running while Galaxy is restarted gets automatically terminated, because Galaxy is managing that job and the entire process tree of Galaxy is killed.

So let's talk a little bit about plugins. These are all the different DRMs Galaxy can talk to. We have plugins for things like the local runner, which as we've said should not be used in production, Slurm, and DRMAA. DRMAA is a library interface that a lot of the different clustering engines speak: SGE, PBS, LSF, Torque, I think even HTCondor, and Slurm all speak the DRMAA protocol, and you can use the DRMAA interface to talk to them in a very generic way.
Torque is available as well, and Pulsar, which we cover in the Pulsar tutorial on sending jobs to heterogeneous compute destinations. There is additionally a command-line interface (CLI) runner: if you have some reason to send jobs to a specific location over SSH, but you don't have a scheduler running across those locations, you can do that with the CLI runner. Galaxy can additionally schedule jobs in Kubernetes, GoDocker, or Chronos.

Here is a little bit about how DRMAA works; you don't necessarily need to know this. Just know that DRMAA is a generic interface, a lot of the different DRMs speak this interface, and Galaxy can use the DRMAA Python library to talk to the C DRMAA library, which in turn talks to your scheduler.

So, handlers. Handlers are the actual processes which process jobs; each of them should have an ID attribute matching the server name you set up. You can configure different handlers for different destinations. For instance, if you have people who submit a lot of high-memory jobs, and for some reason these take a lot of time to process and manage, you can put all of the high-memory jobs onto a single handler, and then those can be processed separately from the many small high-throughput jobs. Or if you have a lot of individual small jobs that you want to process, you can also sequester these on their own handler, to make sure that everything gets processed efficiently and quickly.

Destinations. You can define different destinations; these are generally collections of memory and CPU requirements, something like a high-memory destination or a low-memory destination, which are common cases. They define how a job should be run: which plugin should run it, whether it should run in a container and which one, what resources the DRM needs to be aware of, as well as what the environment should be, if you have some special environment variables that need to be set.

So let's look at a very basic job configuration file. Here we've got a single default plugin, which just says run locally; we've got a handlers section, which is no longer included by default; and then we've got a destinations section which just says everything should go to the local destination.

In the job configuration file you can also set tags. These let you select from different destinations at random, or from different handlers at random. So if you have a couple of different clusters and they should all be given equal preference, you can just say: please submit to any of these destinations, or any of these handlers. Tags can help enforce concurrency limits and they can help you distribute the load across your different resources.

Next up is the job environment. We set some of these already in the Singularity tutorial: we set the LC_ALL locale environment variable and the language settings environment variable, as well as a couple for Singularity. But you are not limited to setting environment variables directly: you can also load environment settings from a file, or execute a command which generates environment settings. All of these are options, so depending on your needs you can do different things.
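As a rough sketch of the kind of minimal job configuration and environment settings just described, assuming the classic XML job_conf format (the handler name, file path, and values here are purely illustrative):

    <?xml version="1.0"?>
    <job_conf>
        <plugins>
            <!-- the local plugin: runs jobs directly on the Galaxy server -->
            <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
        </plugins>
        <handlers>
            <!-- one job handler; the id should match that handler's server name -->
            <handler id="handler0"/>
        </handlers>
        <destinations default="local">
            <!-- the default destination: everything goes to the local runner -->
            <destination id="local" runner="local">
                <!-- set an environment variable for jobs sent to this destination -->
                <env id="LC_ALL">C</env>
                <!-- or source environment settings from a file on the compute node -->
                <env file="/opt/cluster/environment.sh"/>
            </destination>
        </destinations>
    </job_conf>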
Next up: limits. This is a very, very important section. You can set a walltime on your Galaxy jobs to make sure they don't just run forever if someone has specified some operation which is misbehaving. For instance, when I was the administrator of usegalaxy.eu, we had a couple of times where someone would launch a join of a file against itself, and it would take an absolutely crazy amount of time and produce really, really massive output files which would overwhelm the entire university's file server. Limits help you prevent this. They let you put limits on output sizes, so that if a tool produces outputs which are too large, the job is killed. They let you put limits on walltime, so if a job runs for a suspiciously long time it is killed. And they also let you put limits on concurrency, which we'll talk about in a little more detail.

Concurrency limits let you ensure quality of service for all of your users. Personally, I think they're an incredibly important thing to set, and to set low. Otherwise, if you have a server that's supposed to be used by a large number of people, a single user can put a hundred jobs in the queue and then no one else's jobs can be processed until those are done, and this is really unfair to all of your users. By setting concurrency limits you prevent too many jobs from one person running at once, so when one user starts up a million jobs, other users can still run their uploads and their tiny text computations. If you don't set concurrency limits you're at risk of individual users monopolising the server.

Next up, the shared file system. Most of these distributed resource managers require a shared file system between Galaxy and the compute nodes. The exception is Pulsar; this will be discussed in more detail the next day, on using heterogeneous compute resources. For the shared file system, there are files, services, and datasets which need to be available to both the Galaxy head node and the compute nodes. There is, for instance, the Galaxy application itself: some of the tools you will want to run require access to the Galaxy code base. Things like the upload tool need to know about Galaxy's internal file type detection, so the tools on the compute cluster need access to Galaxy. You'll need to have these things shared between the compute nodes and Galaxy. Additionally, there are some things that need to be actually the same, not just at the same location: things like the job working directory and the input and output datasets, because both Galaxy and the compute cluster need to operate on the very same datasets. For things like the tool dependencies and the Galaxy server, we've just said they need to be at the same path, so they can be symlinked in and shared, or they can be completely separate duplicate copies of the files, whatever works for your environment. That's something you'll need to work out with your cluster administrator.

So, multiprocessing. Galaxy automatically sets the variable GALAXY_SLOTS to the CPU or core count you specify when submitting, and a number of tools within the Galaxy ecosystem know how to take advantage of this. There are a couple of different ways to set the number of threads or cores on different schedulers; you'll need to look up the one that's appropriate for your scheduler. Additionally, Galaxy can set memory requirements for different jobs; you'll need to be aware of how to do this for your own scheduler.
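As a rough sketch, still assuming the classic XML job configuration: the limits described above, and a destination that requests cores and memory from a Slurm/DRMAA-style scheduler, might look something like this (the limit values and the --ntasks/--mem numbers are examples, not recommendations):

    <limits>
        <!-- kill jobs that run longer than a day -->
        <limit type="walltime">24:00:00</limit>
        <!-- kill jobs whose outputs grow too large -->
        <limit type="output_size">50GB</limit>
        <!-- how many jobs a single registered user may run at once -->
        <limit type="registered_user_concurrent_jobs">4</limit>
    </limits>

    <destinations>
        <destination id="slurm_2core" runner="slurm">
            <!-- GALAXY_SLOTS will reflect the core count allocated here -->
            <param id="nativeSpecification">--nodes=1 --ntasks=2 --mem=8192</param>
        </destination>
    </destinations>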
Some tools will additionally need extra environment variables to actually function properly, because they don't natively know how much memory they're allowed to take. These are things you'll need to think about when you're setting up your cluster. In my experience we mostly adjusted memory requirements rather than CPU core requirements; there are very few tools that need, or can take advantage of, multiple cores, and you'll need to look through the tool code itself to figure out whether a given tool supports it. For most of the bioinformatics tools we ran when I was an admin at usegalaxy.eu, it was mostly a matter of setting memory requirements and increasing them.

Next: running jobs as the real user. If you have an existing cluster, perhaps one which your users have been SSHing into and running jobs on at the command line, those users probably have a real user identity on the system. If you can match up those user identities on the cluster with the user identities in Galaxy, you can also run Galaxy jobs as those users. This is especially nice if you have, for instance, an HPC system where accounting of who is running jobs is very important: by running jobs as the real user you ensure that everyone is using the cluster fairly and that Galaxy can use as much of the cluster as possible. It does require some additional rights; you can read about that in our cluster documentation if it's relevant to you.

So, the job config: mapping tools to destinations. If we have a tool or tools that behave differently, we might want different destinations that carry those parameters. For instance, here we have two destinations: a "single" destination with a single core, and a "multi" destination which has multiple cores assigned to it. You can see we've done that in the Slurm runner by saying --ntasks=4, the number of tasks equals four. And then we can map individual tools, so we can say the hisat2 tool should go to the multi destination, because it can take advantage of that (a rough sketch of this appears below). This is, however, a very static mapping: you can see we've just said hisat2 goes to multi; we haven't done anything intelligent with it so far.

When you have needs beyond that, you can use dynamic job destinations. These let you dynamically choose which tools go to which destinations, and they give you a lot more freedom and flexibility. There are a couple of different ways to do this. There is the dynamic tool destination, a built-in system where, given a YAML configuration file, you can map jobs to destinations based on the tool ID plus the input dataset sizes, the input number of records, and the username. This is a very convenient way to get started with mapping tools to destinations: you can say, for this tool, if it's using this many input files, or the input files are at least this size, then maybe I want to send it to the high-memory destination or the long-runtime destination, something like that. And if you want to give priority to certain users, you can do that here as well.

If that is not enough flexibility, you can use arbitrary Python functions as dynamic destinations, and these let you decide where tools go given the tool ID, the username, the inputs to the tool, the tool parameters, and literally any other resource you want to access.
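To illustrate the static mapping described a moment ago, before we get into the dynamic options: the two destinations and the per-tool mapping might look roughly like this in the XML job configuration (the tool ID and core counts are illustrative):

    <destinations default="single">
        <destination id="single" runner="slurm">
            <!-- one core -->
            <param id="nativeSpecification">--nodes=1 --ntasks=1</param>
        </destination>
        <destination id="multi" runner="slurm">
            <!-- several cores, for tools that can use them -->
            <param id="nativeSpecification">--nodes=1 --ntasks=4</param>
        </destination>
    </destinations>

    <tools>
        <!-- hisat2 can take advantage of multiple cores, so send it to the multi destination -->
        <tool id="hisat2" destination="multi"/>
    </tools>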
Maybe you want to do something fancy, like sending all of your tools to a dynamic destination and then, if you have multiple clusters, submitting each job to whichever cluster has the shortest queue. You can do really fancy things like that. You can also use this to implement authorisation for running tools, to make sure that specific users or specific groups have permission to run them. You really have complete freedom, but you have to maintain it yourself.

Once you've figured out where the tool is going, you'll want to think about tool dependency resolution. We talked about this a little in the Singularity tutorial. There are other systems, like Conda, Docker, and modules, that you can use if you want; in this training week we will only be using Singularity though. You can read more about this in the tool management with Ephemeris material that you should have covered yesterday.

So, key points from this part: Galaxy supports a large variety of DRMs, and a large number of them are accessible through the DRMAA interface. If your cluster type is not supported, it's possible to add it: you can read the existing runners and write your own, or you can use the CLI runner, the command-line interface runner, to run arbitrary commands to launch and manage jobs. Dynamic tool destinations are incredibly useful for mapping jobs to different locations, and lastly, job resource parameters allow users some control over the resources available to their jobs. So thank you for listening, and let's get started with connecting your Galaxy to a cluster.

Now that you've heard a little about the background of distributed resource managers and how Galaxy interacts with them, let's get started connecting Galaxy to a compute cluster. Your results may be a little different just because of random processes, so some of the numbers used in this tutorial, like job number four, may be different for you, and you'll need to check your actual system and your system logs to see what the number should be. Just know in advance that it may not be completely copy-paste from the tutorial.

So, do you need a DRM? Yes, unequivocally yes, you need a distributed resource manager. We'll start by editing our requirements.yml to add a couple of new requirements for our DRM: one is the galaxyproject.repos role, which provides the package repositories needed for the slurm-drmaa library, and then we'll need the galaxyproject.slurm role, which is used for setting up the Slurm distributed resource manager.

So why Slurm? Slurm is pretty easy to get started with, it's very easy to cover in a training, and it's also very popular, one of the more popular DRMs.

So let's add our two new roles to our galaxy.yml playbook. We've noted that they should be added to the beginning of the roles section: they don't depend on anything already being there, so we can add them at the start, and anything that needs Slurm to be up and running later will have it available when it's needed. We'll also need to add some group variables to control how the Slurm server will work. We head into our group variables for the Galaxy servers, and down at the bottom I'm going to paste in all of the Slurm content. You'll see that we have slurm_roles, controller and exec, which say which roles this machine should play. If you were using Ansible to set up a larger cluster, you would probably have one controller node and multiple execution nodes where the actual jobs are run.
In this case we have a single virtual machine, and the controller and execution node are identical. Then we list the Slurm nodes that are part of the cluster: we have the name of our host, and here we've defined two CPUs. This does not necessarily match the number of CPUs our virtual machines have, but we're going to use that for this training, just to be a little consistent across all of the trainings. Then we have the Slurm config, where there are a couple of parameters we've set. First off, we've set the slurmd parameters to config_overrides; this just tells Slurm: please don't throw an error because we've lied about how many CPU cores there are. And then we also allocate individual cores and memory instead of the entire node, because that's just how we want our Slurm configured. When that's done, we will run the playbook, and this will set up all of Slurm.

I think Slurm is used by usegalaxy.org; usegalaxy.eu uses primarily HTCondor, and usegalaxy.org.au, the Australian Galaxy server, uses primarily Pulsar, though the European and American servers also use Pulsar in a lesser role.

Okay, that's looking good. When this is set up, Slurm gets installed along with MUNGE. MUNGE ("MUNGE Uid 'N' Gid Emporium") is a recursive acronym; it is a service which authenticates users between different cluster hosts. Normally, if we had multiple hosts, we would start to worry about whether the user IDs are the same everywhere, and about distributing the MUNGE key across all of the hosts, which is a very suitable task for Ansible and is managed for you if you use the galaxyproject.slurm role. We don't need to worry about this here, since we're only installing Slurm on a single host.

Okay, the playbook is going to run in the background, but slurmd should be set up. If we run systemctl status munge slurmd slurmctld, we can see that all of our services are running and active, which is exactly what we want to see. Let's check out how the cluster is configured with the sinfo command, or sinfo with node-oriented output (sinfo -N) for a little more information about the nodes. We can see it's configured with two CPUs, the node list is localhost, and everything is up; it looks good.

I'm letting the playbook run in the background. This again comes back to whether you want to set individual tags on your tasks in Ansible, and potentially miss some changes that affect other things, or whether you just want to run the entire playbook every time and be absolutely certain it has run completely. As you can see, I'm just letting it run in the background; it should be done shortly.

So we are going to play around a little with Slurm to figure out how it works. There are two main commands for running things in Slurm: interactively we can use the srun command, or we can use the sbatch command. Here we have srun uname -a; if we copy and paste that, we should see something like this: Slurm creates a job for the command, runs uname -a, and then reports the result back to us. However, usually what you want to do is more complicated than running a single command: you want to run an entire script that does some tasks and reports back at the end. This is how Galaxy itself works: it writes out a script. Here we have a batch script which runs uname -a, runs the uptime command, and then sleeps for 30 seconds; our batch script should look approximately like this.
So we'll go ahead and make that executable, and then we can use the other command, the companion to srun, sbatch, to run this script, which does some stuff in the background and sleeps for 30 seconds. Okay, you can see that it submitted batch job 3, and we can use squeue to check on it. You'll see that our job is running in the background. Now, you'll notice that we've only got a single node with two CPUs, so there are only so many slots; if we tried to run uname -a again right now, we'd see that it's queued, waiting on those resources. This is the nice part about distributed resource managers: Galaxy can queue up all of your jobs, and they run once they've been allocated resources, which works a lot more efficiently than the local runner. So job number 3 finished and job number 4 ran. You can see a little bit of output here: our job was running in the background, it had been running for seven and then twelve seconds, state RUNNING, and it ran under the ubuntu user. If you've seen all of this and it has all worked, then your Slurm installation is working and you're good to go.

Next up we actually need to get the slurm-drmaa library installed. We'll set this up as a post task at the end of the playbook: we install slurm-drmaa, and then we run our playbook again, and that installs the slurm-drmaa library at the very end. I believe it needs to be at the end because Slurm needs to be available first, but I'm not 100% sure; you can submit questions in Slack and hopefully Nate will answer them when he wakes up.

Okay, so when that's done, we're ready to set up Galaxy and Slurm. On top of the slurm-drmaa library we just installed, the libdrmaa.so library, there is the Galaxy side: Galaxy needs a Python interface to that library, and it also needs to be able to find the library. The DRMAA Python bindings are one example of the conditional dependencies Galaxy has: we've made some special configuration change to Galaxy that alters its behaviour, and that also requires new dependencies to be installed. So when we run the Galaxy playbook this time, after we make these changes, we'll see that Galaxy also changes its installed packages: it installs an extra Python package for DRMAA.

We'll start off by changing the environment. We have to do this because Galaxy is started by systemd, so the environment for the Galaxy process is controlled by systemd, and we need to specify the path to the DRMAA library there. So after the systemd portion of our group variables, we'll add a Galaxy systemd environment setting, pointing DRMAA_LIBRARY_PATH at the location of our library file. If you like, you can check that the file exists, and you'll see that it does; it is provided by the slurm-drmaa package that we installed.

Okay. Next up we need to edit our job configuration file to actually take advantage of this new Slurm cluster. We'll start off by adding a new runner: up at the top we add a plugin with id "slurm" and type "runner", just like the others, loading the slurm job runner module. Then, further down, we're going to take our local destination and turn it into a Slurm destination: we add a new destination called slurm, with id "slurm", and we make it the default destination. It has all of the same environment variables that the Singularity-enabled local destination had, except that it goes to the slurm runner instead.
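Putting those pieces together, the new runner plugin and the new default destination might look roughly like this in the XML job configuration; the Singularity parameter and environment variable shown here simply stand in for whatever your old default destination already had from the Singularity tutorial:

    <plugins workers="4">
        <plugin id="local_plugin" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <!-- the new Slurm runner -->
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    </plugins>

    <destinations default="slurm">
        <destination id="slurm" runner="slurm">
            <!-- keep the container and environment settings the old default destination had -->
            <param id="singularity_enabled">true</param>
            <env id="LC_ALL">C</env>
        </destination>
    </destinations>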
So, when Galaxy uses the slurm destination, it does all the same things it did with the local runner, but this time it writes out a job script that is sent to sbatch. Just like we ran an sbatch command earlier in the tutorial, it does exactly the same thing for Galaxy jobs: it runs sbatch with the path to the shell script it wrote out.

Okay, and with that we're ready to run our playbook. All of the tasks will run, and when it's done we'll check that everything loads correctly by following the Galaxy logs to see when it restarts. Oh, and look at that, it's finished. Fantastic. Over in our Galaxy logs we should be able to see that it started up and is using Slurm now, and when we run a job we'll actually see it show up in the Slurm logs, which will be very exciting for us.

Okay, so let's run a job. If you're not following the log files already, we recommend you do that: journalctl with -f for follow and -u for the unit named galaxy. Then we're going to click the upload button at the top and paste in some test data, and we'll see that job get started. If we look in our journalctl logs, we'll see lines from the DRMAA job runner for job (3/5): the job is running, then a state change, then the job finished normally. If we go back to the tutorial, it explains a little bit about what all the different log messages mean; if you're interested, you can read more about them there. The interesting thing is that (3/5) I read off: it means Galaxy job ID number 3 and Slurm job ID number 5. And we can actually find out about Slurm job number 5 with scontrol: scontrol show job 5. Okay, and here it is, upload1; that's the tool ID, and the username is attached there so you can see who ran this, and it ran under the galaxy account. That's the upload tool. You can see the start time and the end time, it only took three seconds to run, and you can find out the working directory, where any error messages might have gone if the job had failed, things like that.

So that's fantastic: our jobs work, and they're running in Slurm now. If we look at the job information in Galaxy, you can see that it was run under the slurm runner, and again with Singularity. Okay, that is how you connect Galaxy and Slurm together. All of our jobs now run by default under both Slurm and Singularity, and this starts to be a really good production server: we've got our jobs running in a very reproducible environment, and they're also running under a distributed resource manager, which allows us as administrators to restart Galaxy whenever we want without having to worry about accidentally killing some jobs.

There's a lot of further reading if you're interested in all of the different options and features of Slurm or other clusters; you can read the cluster documentation for information on how that works. There's also a lot to learn about accounting, which is a very big topic for clusters. Oftentimes at your university or research institute you'll have an existing cluster that you can use and work with, where you won't have to learn all of this and you can just submit jobs to it, which is really nice.
The job configuration XML file is a very long file; we recommend you read through it to find out about any additional settings for these job runners that you might want to use. There are a lot of different features in the job configuration file, and there's a lot you can configure to make your jobs more performant or easier to account for, to set different parameters, or to work with multiple clusters.

So, the key points of this: Galaxy supports a variety of different distributed resource managers, and you should definitely set one up. And as always, please fill out the feedback form so we know whether this was okay, too complicated, or too easy. If you have comments on this individual video, please leave those in our Slack and let us know. Next, we'll get started with mapping jobs to destinations.