Hello everybody and welcome back to the Galaxy Administrators course. In this session we're going to be talking about reference data in Galaxy using CVMFS. My name is Simon Gladman, and I'm a bioinformatician at the University of Melbourne in Australia. I'm also one of the administrators of Galaxy Australia.
Okay, to get started we need to be on the Galaxy Training website at training.galaxyproject.org, and then we need to go to Galaxy Server Administration. If we scroll down, you can see we have a tutorial called Reference Data with CVMFS. There's a slideshow available, a video of the slides, and also a hands-on component. Hopefully by now you've all seen the video of the slides. If not, I suggest you go and look at those first, as they explain how reference data works in Galaxy and also what CVMFS does to help us with that reference data. In the hands-on tutorial we will be installing it onto the Galaxy servers that we've already created. So if you're ready to go, click on this link here and you'll open up the tutorial, Reference Data with CVMFS.
The objectives of this tutorial are, by the end of it, to have an understanding of what CVMFS is and how it works, and hopefully you got some of that out of the slideshow that you've already watched. You will be able to install and configure the CVMFS client on a Linux machine and mount the Galaxy reference data repositories. You'll be able to configure your Galaxy to use these reference genomes and indices, and we're going to be using an Ansible playbook for all of the above. The requirements that we hope you've already completed are Galaxy server installation with Ansible, and hopefully you understand what Ansible is and how it's used.
Just a quick recap. CVMFS, or CernVM-FS, is a distributed file system designed for sharing read-only data across the globe, and we use it extensively in the Galaxy project for sharing things that a lot of Galaxy servers need.
Namely all the reference data. So the genome sequences for all the different genomes that we need to think about in Galaxy and bioinformatics, things like the human genome, the mouse genome, et cetera. And we need a lot of indices for the genome sequences, and lots of tool indices for all those genomes. We also have a repository that contains tool containers, so Singularity containers for all of the bioinformatics tools that we might want to use. There's a tutorial, either already done this week or coming up soon this week, that explains how to use those Singularity containers within Galaxy as well.
So we're just going to get on with things. The agenda we're going to follow today is: we're going to install and configure Galaxy CVMFS reference data using Ansible, we're going to explore the CVMFS installation, and then we're going to configure Galaxy to use it.
There are a few different repositories that the Galaxy project has created and shared with everybody. The first one is the reference data and indices, and this is in data.galaxyproject.org. And then we have another one called singularity.galaxyproject.org. If you want to have a look at what's in there, you can click on the datacache, and this link here will show you all the things that are inside data.galaxyproject.org.
Okay, so now we're going to move on to installing and configuring Galaxy CVMFS reference data with Ansible. We are going to do some Ansible here, and we're going to install CVMFS onto our Galaxy server. Hopefully you all have access to a Galaxy server. Here's mine. Mine is gat15.os.training.galaxyproject.eu, and on it I am an admin user and I have access to the admin page. This was done as part of the installation of Galaxy. And hopefully you've installed a tool. I have BWA and BWA-MEM installed under Mapping. So if I click on BWA-MEM, you can see here that the tool is there, but there are no options available for reference genomes.
So we want to fix that. We want to connect to all of Galaxy's pre-built references, and so we're going to use Galaxy's CVMFS system to let our own Galaxies connect and get access to all of the pre-built caches and everything we already have.
Okay, so let's get started. If we go back to our tutorial here, it says that we need to add a CVMFS role to our requirements.yml and then add it to our Ansible. What I need to do is log into my Galaxy machine in the terminal. So I'll open a terminal and type ssh ubuntu@, and the name of my machine is GAT15. Copy that, and copy the password. Right, so now I'm logged in. I'll have a look at the contents of this directory and go into galaxy. Here we have all of the Ansible scripts that hopefully everybody already has.
Okay, so the first thing I'm going to do is add the CVMFS role to requirements.yml. We need to add this to the bottom of that file. Copy, paste, and save it. And now we install the role into our local Ansible scripts using the ansible-galaxy command. As you can see, it's downloading the CVMFS role. And if we go into roles now, you can see that we have galaxyproject.cvmfs on the screen.
Okay, now we need to run this role, and that will install the CVMFS client onto our Galaxy server. So the first thing we need to do is edit our group_vars galaxyservers file. There are a bunch of different variables that we can set. We can set the CVMFS role to be client, stratum 0 or stratum 1. We can set the server URLs, so where we're getting all this data from, and if you remember from the slideshow, the data can come from Europe or America or Australia. We can set which repositories we want to have installed. And this is an important one: the quota limit. CVMFS will cache some data on your local machine, and the quota limit is the maximum size of that cache.
Okay. But we're not going to just modify galaxyservers.yml and add some of these variables there.
Instead, what we're going to do here is create a new group file called all.yml. One of the things that we may want to do in the future is create other machines. We may want to have worker nodes for our Galaxy cluster, or we may want to have other machines, created using these Ansible scripts, that also have the CVMFS role installed. Instead of reproducing these variables in each of the group_vars files for those particular machines, we can create a special group_vars file called all.yml, and whatever we put in there will be automatically available to all machines that we create with Ansible from this directory. Hopefully that makes a bit of sense. We're going to use that a bit later on if you come along to the Pulsar tutorial, where we will also be installing CVMFS on a remote machine to run Pulsar.
So what we're going to do is create a new file called group_vars/all.yml and put some of these CVMFS variables inside it. So I'll just copy that, go into group_vars/all.yml, and paste this in. The CVMFS role we want this machine to have is client, which means that we just want it to be able to access all of our CVMFS data. Then we're going to set this one here to say that yes, we want to set this up for Galaxy, and the config repo is the one that tells CVMFS how to set everything else up. And then this is the other important one, the CVMFS quota limit, which we're setting to 500 megabytes. That's just so we don't fill the root disk of these machines. So I'll save that.
Now we need to add the role to our Galaxy playbook. So we edit galaxy.yml, which is our playbook, and we just need to add the galaxyproject.cvmfs role to the bottom of this: galaxyproject.cvmfs. And that's pretty much it.
Okay, now we just run the playbook. It's ansible-playbook, and we want to run the galaxy one.
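Before the run, here is a rough sketch of what group_vars/all.yml might contain after the edit described above. The three settings (client role, config repo for Galaxy, 500 MB quota) come from the walkthrough. The exact variable names are not read out in the recording, so they are assumed here to follow the galaxyproject.cvmfs role's naming and should be checked against the tutorial's own snippet.

```yaml
# group_vars/all.yml: applies to every host managed from this Ansible directory.
# NOTE: variable names are assumptions based on the galaxyproject.cvmfs role;
# the recording only describes their meaning, not their spelling.
---
# This machine only reads from CVMFS (as opposed to acting as a stratum 0 or stratum 1).
cvmfs_role: client

# Set CVMFS up for Galaxy via the config repo, which tells the client
# how to configure the other Galaxy repositories.
galaxy_cvmfs_repos_enabled: config-repo

# Upper bound on the local CVMFS cache, in megabytes, kept small so the
# root disk of a training machine does not fill up.
cvmfs_quota_limit: 500
```

Keeping these in all.yml rather than galaxyservers.yml means any other host group added later, such as worker nodes or a Pulsar machine, picks up the same client settings without the variables being duplicated.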
And then we also need to set the user name to be ubuntu. Then we run.
All right, we're about to get to the CVMFS role. We are now installing the apt package of CVMFS. It's going to get that out of the special CERN apt repository. Okay, it's installing it. Hopefully it won't take too long. Okay, now it's setting up the repositories, and it's done.
Okay, that was completed. We've now installed the CVMFS client onto our machine, and we've told it to go looking for certain repositories. Now that we have access to them, we'll see what's in them. They'll be located at /cvmfs, so under the cvmfs directory in your root directory. So we can go to that: cd /. You can see there's a directory here called cvmfs. So we'll go in there and see what's in there. And that's this section of the tutorial.
So let's have a look. There's nothing in there. Well, actually, what's going to happen is that as soon as we go looking for something in this directory, autofs will automatically mount the particular thing we're looking for. So what I'm going to do here is cd data.galaxyproject.org, because I know that's one of the repositories that should have been installed. When I do that, autofs automatically mounts it for me on the fly, like that. And now I've cd'd into it. If I do an ls -l, you can see I've got some things in here now. I've got byhand and managed. If I go into byhand, you can see that I have quite a lot of different genomes and their tool indices. If I go into, say, sacCer2, you can see in here there are the Bowtie index, the BWA index, the Picard index, the SAM index, and a whole bunch of other things, including the original sequences. So yeah, quite a lot of data, and we just have access to that on the fly. As soon as we try to look at any of these files, CVMFS will automatically cache it to the local disk within that 500 megabyte cache that we set up earlier.
This is really cool. And we can tell Galaxy to look at all of this data and use it as its reference data. So that's what we're going to do now.
We're going to configure Galaxy to use this CVMFS data so that we can run things like BWA and BWA-MEM against the human genome or the bee genome or the mouse genome, and take advantage of the fact that a lot of other people in the Galaxy community have already done a lot of reference data work for us. The way to do this is to edit the group_vars/galaxyservers.yml file and add a variable called tool_data_table_config_path, and then point it to two files: there's one in byhand and one in managed. If we go into byhand, and then into location, you can see all the .loc files in here, but you'll also see there's an XML file called tool_data_table_conf.xml, and we're going to point Galaxy at this file. There's another one in the same position in managed. So managed, location, and you can see there's another one there. We're going to add both of these files to our Galaxy configuration, and then Galaxy will be able to use all of the data contained within this repository.
Okay, so we'll go back to our Ansible directory. This time we're going to edit the group_vars/galaxyservers.yml file. In our galaxy section, which is here, just at the bottom of that, I'm going to add a variable called tool_data_table_config_path and point it to the locations that I showed you before. I can't remember what they are off the top of my head, but luckily they're inside this solution box, so I will just copy them and paste. As you can see, it's pointing to /cvmfs/data.galaxyproject.org/byhand/location and then that tool_data_table_conf.xml file. It's a list, so we separate the entries with a comma and then we point it to the second one.
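As a sketch of the edit just described, the addition to group_vars/galaxyservers.yml might look like the following. The two paths are the byhand and managed locations explored above. The surrounding galaxy_config / galaxy nesting is an assumption about how the playbook's Galaxy section is laid out, so adapt it to whatever your file already contains.

```yaml
# group_vars/galaxyservers.yml (excerpt)
# NOTE: the galaxy_config -> galaxy nesting is assumed; only the last key is new.
galaxy_config:
  galaxy:
    # ...existing Galaxy settings stay as they are...

    # Comma-separated list of data table configs, one per CVMFS subtree.
    # Both files live inside the data.galaxyproject.org repository.
    tool_data_table_config_path: /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml
```

Both entries sit on one line with no space after the comma, matching the comma-separated list described in the walkthrough.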
Right, so we save this file and then we run the playbook again. So I'll just clear the screen: ansible-playbook galaxy.yml -u ubuntu, and run. This time all we're doing is making a minor change to the galaxy.yml file in the Galaxy config to add that one line, and then we're going to restart Galaxy. And Galaxy will suddenly, automatically, have access to all of that data. So you can see here we've changed the Galaxy configuration file, and then Galaxy is now restarting, and it's done.
Okay, let's go and have a look at our Galaxy server and see if BWA can suddenly see all of that stuff. Right, so we're back on our Galaxy server. We'll click on Analyze Data just to reload the page. We'll go back to Mapping and load BWA-MEM. And suddenly, instead of having no options available, you can see we've got the bee genome. Now click on that. There are lots and lots of available genomes now, including lots of human, mouse, rat, yeast, all sorts of things.
In fact, if you want to see the list of all the different genomes that we now have available to us, go to Admin, and we've got Data Tables over here. You can see that we have a couple of data tables, for managed and for all_fasta. If we click on that one, you can see that we have a lot of genomes available now in the all_fasta data table that Galaxy can get access to. If we go back to the data tables again and go down to the BWA indexes, or the BWA-MEM indexes here, you can see we have access to a lot of pre-built indexes for BWA for all of these different genomes.
That is pretty powerful. I reckon that took us maybe 30 minutes, and suddenly our Galaxy server has access to all of the reference data and the tool indices that the community has built over a number of years. And it's super simple. All right, we'll go back to our tutorial.
Just finally, before we finish up: if you're developing a new tool and you want to add a reference genome or a different index, just drop us a line on Gitter and we'll be able to add it into the reference data for the community. We're looking at automating the process of building all of this material using data managers and Ephemeris. And we're working with a group of people called the IDC, which is the Intergalactic Data Commission, which is a funny name for everyone in the Galaxy community who likes reference data. We're looking at making a community-controlled resource that will be semi-automatic.
One of the other things that you can do is have automatic fallback. So if, say, you're in Australia and you're hooked up to the Australian mirror of the CVMFS repository and the Australian mirror dies, the CVMFS client is smart enough to automatically go to the next closest one, so you won't lose anything. If you're interested in looking at plant data, there's a link here for that.
And finally, if you could please click on this link here and give us some feedback on how you think the tutorial went, whether it was useful, whether you enjoyed it, or if you have any criticisms, could you please put them in here as well. And if you end up using this to build a Galaxy server and you publish that Galaxy server, could you cite the tutorial for us, please? That would make a big difference to us.
All right. Thank you very much, and I hope you enjoyed it. Hopefully I'll get to meet some of you in person one day soon at a Galaxy conference. Thank you and goodbye.