Hi, everyone. My name is Simon Gladman and I'm a bioinformatician at the University of Melbourne in Australia, and I'm also one of the administrators of Galaxy Australia, a large public Galaxy server. Today I'm going to be talking to you about running jobs on remote resources with Pulsar, a way of getting Galaxy to use remote compute resources that aren't local to Galaxy.

Some requirements before we start: we recommend that you understand a little bit about Galaxy installation and how to do it using Ansible, and if you've never used Ansible before, there's a good tutorial and slide deck on that as well. If you are going to do this tutorial later on, you'll also need a server or a virtual machine on which to deploy Pulsar, and that server or virtual machine should be separate from your Galaxy server.

Right, some of the questions we hope you'll be able to answer by the end of this slide deck and hands-on tutorial are: how does Pulsar work, and how can I deploy it? We'd like you to have an understanding of what Pulsar is and how it works, know how to install and configure a Pulsar server on a remote Linux machine, and be able to get Galaxy to send jobs to a remote Pulsar server.

All right, so what are heterogeneous compute resources, and what do I mean by heterogeneous? Basically, I mean that a large Galaxy system can have a number of different compute resources available to it, and they can have different operating systems or versions. They can have different sets of users and groups, and different users and groups could have access to different compute resources. For example, if you're running a Galaxy server at a university, some of your users may have access to HPC, and how do you make that available for them to use, considering it's probably running a different operating system to the one your Galaxy server runs on?

Data accessibility is another issue. If you have a compute resource in a different data hall in your university somewhere and it doesn't have access to the Galaxy file system or the temporary job working directories Galaxy would normally use, then how do you get to use that compute resource? Do you have full administrative control? If you're using your university's HPC system, then the answer is probably no, you don't. How do you get Galaxy to use those resources if you don't have administrative control? And sometimes they may be in a completely separate physical location. For example, in Galaxy Australia, most of our infrastructure is located in Perth, but if I want to run a job using some compute resources in Melbourne, which is about 3,000 kilometers away, how would I go about doing that?

So these are the kinds of things we're talking about when we say heterogeneous compute resources: different groups of compute resources that may have different access arrangements, different administrators and different locations. But to be able to use compute, Galaxy expects that everything is running on a single operating system, so the versions of all the dependencies can be the same, and that there is a shared file system with fixed paths, with access to the Galaxy files as well as the temporary job working directories. That is not always the case. For example, in Australia our head node used to be located in Brisbane, and now it's located in Perth, but we want to be able to utilize compute resources spread out all over the country.
And so how do we go about doing this? Well, there's a partial solution in Galaxy, which is the command line interface (CLI) job runner. If you've worked through the Galaxy tutorials up till now, you'll know that in the job conf you can add different job runners, and one of those is the command line interface runner, where you can SSH into a remote machine and then submit jobs on the command line using sbatch or qsub, if you're using Slurm or PBS/Torque, etc. However, this still depends on all the file systems being shared between the remote compute resource and the Galaxy server, and that may not be the case.

So this is where Pulsar comes in. Pulsar is Galaxy's remote job management system. It was written by John Chilton as part of the Galaxy Project itself, and it is a system whereby Galaxy can send job metadata and job input data to a remote compute system somewhere; Pulsar picks them up, runs the jobs, collects the results and resulting metadata, and sends it all back to Galaxy again. It doesn't require shared file systems; it just requires that the Galaxy server be able to talk to the Pulsar server over the network. Pulsar can run on any OS, including Windows, which is kind of handy if you've got a Windows-only command line tool that you want to be able to run from your Galaxy server. There are multiple modes of operation to suit every environment, in terms of different firewalls, etc., and Pulsar can be configured to run in multiple different ways to suit your needs.

So the architecture of Pulsar is that the Pulsar server runs on a remote resource, i.e. the head node of a cluster or just a remote machine somewhere, and the Galaxy Pulsar job runner in the job conf file is the Pulsar client. Communication between the Pulsar client (i.e. the Galaxy server) and the Pulsar server is either via HTTP through a REST API or via the Advanced Message Queuing Protocol (AMQP). File transport from one machine to the other depends on the communication method; it can use things like curl for HTTP file transfers, or scp or rsync, etc.

The architecture of all this looks like the diagram here. This little green box is the Galaxy server. We have Galaxy running, and it's talking to a Slurm queue to send jobs out to a bunch of compute nodes. There's a shared file system between the compute nodes and the Galaxy server, so this is the local Galaxy cluster, I guess you'd call it. Inside the Galaxy server we have a job conf file, and in the job conf file we've defined various things like job runners: the Slurm job runner, the local job runner, and this thing called the Pulsar job runner, which is actually the Pulsar client. And in the destinations we've set up a bunch of destinations, like the local destination, the Pulsar destination and the Slurm destination. By default we send everything to Slurm, except that if a user wants to run Trinity or SPAdes, we've decided that those tools can be sent to the Pulsar destination. So if a user comes along and decides to run SPAdes, for example, then according to the job conf file SPAdes will be sent to the Pulsar destination, which uses the Pulsar job runner, and that then contacts the Pulsar server. Well, if it's in REST mode, the Galaxy server will just contact the Pulsar server and say, hey, I've got a job for you. Here's the job metadata, and here is your input data. Please take care of this for me.
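As a rough illustration of the routing just described (not the exact file from any particular server), a job_conf.xml along these lines would send SPAdes and Trinity to a Pulsar destination while everything else defaults to Slurm. The URL, token and the short tool IDs are placeholders for this sketch:

```xml
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
        <!-- The "Pulsar job runner" is really the Pulsar client, speaking REST to the remote server -->
        <plugin id="pulsar_runner" type="runner" load="galaxy.jobs.runners.pulsar:PulsarRESTJobRunner"/>
    </plugins>
    <destinations default="slurm">
        <destination id="local" runner="local"/>
        <destination id="slurm" runner="slurm"/>
        <destination id="pulsar" runner="pulsar_runner">
            <!-- Placeholder URL and token for the remote Pulsar server's REST endpoint -->
            <param id="url">https://pulsar.example.org:8913/</param>
            <param id="private_token">change_me</param>
        </destination>
    </destinations>
    <tools>
        <!-- Only these tools go to Pulsar; everything else uses the default Slurm destination -->
        <tool id="spades" destination="pulsar"/>
        <tool id="trinity" destination="pulsar"/>
    </tools>
</job_conf>
```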
And the Pulsar server can do numerous things. It can run the job locally on the same machine the Pulsar server is running on, or it can submit it to a queue in the remote location, be that HPC or even your own cluster in the cloud. And then once those jobs are finished, the Pulsar server collects all the results and sends them back to Galaxy.

So Pulsar can operate in a few different modes, the first one being REST mode. Basically, that means the Pulsar server on the remote computer runs a REST API on a web server; the Pulsar server listens on that API, and the Galaxy server can initiate a connection to the Pulsar server just by sending it a REST API request, i.e. an HTTP request to a URL. This is really good where firewalls or open ports are not a concern, so you have complete control over the Pulsar server and you don't mind running a web server with some open ports, port 443 for example. Running in this mode doesn't require any external dependencies, so you don't need any intermediary servers between them; Galaxy can talk directly to Pulsar over the REST API.

The other main way of running Pulsar is via the Advanced Message Queuing Protocol. In this case, we have an intermediate server program, an AMQP server such as RabbitMQ, that sits in between the Galaxy server and the Pulsar server, and both Pulsar and Galaxy talk to the AMQP server. Galaxy would say, hey, AMQP server, I have a job, and this job is for that Pulsar server; can you please hold all the details of this job? That job is then put into the queue for that particular server. The Pulsar server periodically checks with the AMQP server to say, hey, is there a job for me? And the AMQP server in this case will say, yep, Galaxy has just given me a job for you, here are all the details. The Pulsar server then grabs the job metadata, grabs the locations of the files, and pulls them in. This is really good where the remote compute has a strong firewall and there's no way of getting around it, because Pulsar initiates connections out to the RabbitMQ server, so we don't need to poke any holes in the firewall. It's also useful for networks with bad connectivity, as we can set retries and things like that, and we can use curl for file transfers, which is probably a bit more resilient than plain HTTP file transfers.

Occasionally you may want to run Pulsar in embedded mode, which means that you're running Pulsar on the same node that you're running Galaxy on. This is sometimes good for copying input datasets from non-shared file systems to shared file systems, or for manipulating paths, etc. There's a lot of documentation on this if you wish to look into it more.

Pulsar needs to stage the files that it will work on. When Pulsar gets a job, it needs the job metadata: which tool we're going to use, which version of the tool, all the parameters set for that particular tool, and which input datasets we're going to run on. It then needs some way of getting those input datasets off the file system on the Galaxy server and onto its own local file system, and it can do that in a few different ways. When Pulsar is running in REST API mode, Galaxy can push the data out to the Pulsar server, or the Pulsar server can pull it from the Galaxy server.
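To make the AMQP setup and the staging options a bit more concrete, here is a rough sketch of the relevant pieces of a job_conf.xml. The AMQP URL, Galaxy URL, directory paths and destination name are placeholders, and the exact parameter set will vary between deployments:

```xml
<!-- Excerpt of job_conf.xml for AMQP (message queue) mode -->
<plugins>
    <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
        <!-- Both Galaxy and Pulsar connect out to this RabbitMQ server; placeholder credentials -->
        <param id="amqp_url">pyamqp://galaxy:change_me@rabbitmq.example.org:5671/pulsar?ssl=1</param>
        <!-- Pulsar pulls inputs from, and pushes results back to, this Galaxy URL -->
        <param id="galaxy_url">https://galaxy.example.org</param>
        <param id="manager">_default_</param>
    </plugin>
</plugins>
<destinations>
    <destination id="pulsar_mq_destination" runner="pulsar_mq">
        <!-- Staging directory on the remote Pulsar machine -->
        <param id="jobs_directory">/mnt/pulsar/files/staging</param>
        <!-- Transfer files rather than relying on a shared file system, using curl -->
        <param id="default_file_action">remote_transfer</param>
        <param id="transport">curl</param>
        <!-- Let Pulsar resolve tool dependencies on the remote end -->
        <param id="dependency_resolution">remote</param>
        <param id="remote_metadata">false</param>
        <param id="rewrite_parameters">true</param>
    </destination>
</destinations>
```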
Pulsar can use libcurl for doing more robust transfers with resume capability. But when running in AMQP mode, the only way of staging job files is by pull, which means Pulsar needs to be able to curl the data out of the Galaxy server.

Okay, so dependency management for Pulsar. This is where Pulsar gets a job, say BWA-MEM version 0.7.18, and Pulsar says, oh, I don't have that installed; how do I go about getting it? Pulsar can automatically go and fetch dependencies, and it has a similar dependency resolver configuration to Galaxy: we can set up Pulsar to auto-install Conda dependencies, or to pull Singularity or Docker containers, to run that particular tool or that particular version of that particular tool. In the tutorial, we'll be setting up Pulsar to use Conda dependencies; it's fairly simple to get it to use Singularity instead.

And job management. Pulsar can use various different job managers to facilitate the actual running of the jobs. If we're going to run jobs locally on the Pulsar server, we can use the queued Python job manager, which allows us to send the Pulsar server numerous jobs at once, although it will only run them one at a time. Or we could use the queued DRMAA manager, which will forward on, or submit, jobs to a queue local to the Pulsar server, say a local Slurm queue, a PBS/Torque queue, or some other kind of HPC DRM system. Or we can set it to the queued CLI manager, so we can get Pulsar to submit jobs using commands such as sbatch. Or we can even send jobs to run on a Condor cluster, which is pretty cool.

And so, taking Galaxy Australia as an example. This is the old picture now, as everything is located in Perth, but up until recently it was all located in Brisbane: our main queue, our main storage and our main compute nodes were all in Brisbane. By using Pulsar, we've been able to utilize compute resources located all around the country. You can see here we've got our head node in Brisbane; we had a Pulsar node in Perth, a couple in Melbourne, and we were in the process of setting up some in Adelaide, Canberra and Sydney, and we could also, if we wanted, set them up in Darwin and Tasmania. In actual fact, this can be run internationally as well, which is what happens in Europe. Europe has a rather extensive Pulsar network where the head node is based in Freiburg in Germany, but they have Pulsar nodes dotted all around the EU. They also have a Pulsar node located in Melbourne in Australia, and that was pretty much just to prove that we could run it all the way across the globe.

There are a lot of resources for Pulsar: the docs site, and the actual source code for Pulsar, which is part of the Galaxy Project. There is also an Ansible role for installing Pulsar, and in the tutorial we'll be working through that to install Pulsar on a remote server and then get Galaxy to talk to it.

Some of the key points we'd like you to think about are that Pulsar allows you to easily add geographically distributed compute resources into your Galaxy instance, and that it also works well in situations where the compute resources cannot share storage pools. Thank you.
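As a closing reference for the dependency-management and job-manager options mentioned above, here is a minimal, illustrative sketch of a Pulsar app.yml of the kind the Ansible role can template out. All paths, credentials and the choice of manager are assumptions for illustration, not a prescribed configuration:

```yaml
# app.yml - Pulsar server configuration (illustrative values only)

# Where Pulsar stages job inputs, working directories and outputs
staging_directory: /mnt/pulsar/files/staging

# AMQP mode: Pulsar polls this queue for jobs instead of listening on an open web port
message_queue_url: pyamqp://galaxy:change_me@rabbitmq.example.org:5671/pulsar?ssl=1

# Dependency management: let Pulsar resolve tool requirements itself with Conda
tool_dependency_dir: /mnt/pulsar/deps
conda_auto_init: true
conda_auto_install: true

# Job manager: queued_python runs jobs on the Pulsar host itself;
# queued_drmaa, queued_cli or queued_condor would hand them to a local DRM instead
managers:
  _default_:
    type: queued_python
```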