Reference data with CVMFS

Have an understanding of what CVMFS is and how it works. Install and configure the CVMFS client on a Linux machine and mount the Galaxy reference data repository. Configure your Galaxy to use these reference genomes and indices. Use an Ansible playbook for all of the above.

Many tools need reference data. For example, this BWA-MEM tool requires a reference genome. But where do you get this data, and how do tools know where to find it? Much of this reference data has to be computed, for example by building indexes for a dataset. It is better to build them beforehand, so Galaxy has a concept of reference data.

Here is a more concrete example. A reference dataset is required as input. A tool, for example BWA, is used to build the index. The outputs are registered with Galaxy's data registry in a loc file. The tool data table XML file has a listing of all of the loc files. When a user runs the BWA tool, Galaxy knows where to find this reference data.

Data managers are special tools in Galaxy. They create the reference data and update the loc files. Here is an example loc file. Each loc file lists all of the indexes of a specific type. Some of these indexes will be tool specific, like this one, and some will be more general, like a list of genomes. This file is updated by the data manager.

The tool data table conf file lists table names and their associated loc files. Additionally, it defines the meaning of each column in the loc file. When a tool wants to use the data from those tables, it needs to declare which tables it wants to access. See previous slide. A select type parameter is created for the user's selections. The tool knows that the options should come from the BWA-MEM indexes loc file. These reference datasets are sorted by column 2, the name, and some validators ensure a helpful message is shown if no data is available.

This was a lot of information, and very genomics specific in some places.
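To make the data file idea more concrete, here is a minimal Python sketch of what was just described: reading a tab-separated data file (Galaxy's .loc file), turning each row into named columns, sorting by the name column, and failing with a helpful message when no data is available. This is not Galaxy's own code. The column names, the example rows, the paths, and the function names `read_loc` and `select_options` are all assumptions made up for illustration.

```python
# Illustrative only: column names, rows, and paths are hypothetical.
COLUMNS = ["value", "dbkey", "name", "path"]  # "name" is column 2, counting from 0

EXAMPLE_LOC = (
    "# value\tdbkey\tname\tpath\n"
    "sacCer3\tsacCer3\tYeast (sacCer3)\t/data/sacCer3/bwa_mem_index/sacCer3.fa\n"
    "hg38\thg38\tHuman (hg38)\t/data/hg38/bwa_mem_index/hg38.fa\n"
)


def read_loc(text, columns=COLUMNS):
    """Parse tab-separated rows into dicts, skipping comments and blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # ignore malformed rows in this sketch
        rows.append(dict(zip(columns, fields)))
    return rows


def select_options(rows):
    """Return (label, value) pairs sorted by the name column."""
    if not rows:
        # Plays the role of the validator that shows a helpful message.
        raise ValueError("No indexes are available for this tool")
    ordered = sorted(rows, key=lambda row: row["name"])
    return [(row["name"], row["value"]) for row in ordered]


entries = read_loc(EXAMPLE_LOC)
options = select_options(entries)
print(options)
```

The point of the sketch is only that the data file is a plain table, and the column definitions are what give each field its meaning for the tool's select parameter.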
It is a lot of work to create and update the reference data, but there is a better way. Imagine going through this process every time a user request comes in. It would be unpleasant. CVMFS and the IDC solve these issues.

CVMFS was originally built by CERN for sharing software, and it provides a global repository of reference data. We use it for data. It uses an HTTP-based protocol and is very firewall friendly. All of the usegalaxy.* servers host a CVMFS repository. The IDC is the other half of the puzzle. CVMFS provides the storage and the IDC provides the data. Join us on GitHub if you are interested.

CVMFS has a hierarchical structure. At the top is the Stratum 0 server, which holds the original copy of the datasets. This is replicated to the read-only Stratum 1 servers. Anyone can connect to these Stratum 1 servers, and whenever a connection fails, CVMFS will fail over to another mirror. The mirror selection process is based on connection round-trip times. As a result, the nearest mirror is usually selected.

There are CVMFS servers across the entire world. The primary mirrors are run by Galaxy Project, Galaxy Europe and Galaxy Australia. If one of these mirrors fails, you will still be able to use the reference data.

Thank you for watching.
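The mirror selection and failover behaviour described above can be sketched in a few lines of Python. This is a simplified model, not the actual CVMFS client logic. The mirror hostnames, the round-trip times, and the helper names `order_by_rtt`, `fetch_with_failover` and `fake_fetch` are hypothetical and exist only for illustration.

```python
# Hypothetical mirrors with made-up round-trip times in milliseconds.
MIRROR_RTT_MS = {
    "mirror-us.example.org": 140.0,
    "mirror-eu.example.org": 25.0,
    "mirror-au.example.org": 310.0,
}


def order_by_rtt(rtts):
    """Return mirror names ordered from lowest to highest round-trip time."""
    return sorted(rtts, key=rtts.get)


def fetch_with_failover(mirrors, fetch):
    """Try each mirror in order; move on to the next one when a connection fails."""
    last_error = None
    for mirror in mirrors:
        try:
            return mirror, fetch(mirror)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("all mirrors failed") from last_error


def fake_fetch(mirror):
    """Stand-in for an HTTP request; simulates an outage of the nearest mirror."""
    if mirror == "mirror-eu.example.org":
        raise ConnectionError("simulated outage")
    return f"data from {mirror}"


ordered = order_by_rtt(MIRROR_RTT_MS)
used, payload = fetch_with_failover(ordered, fake_fetch)
print("preferred:", ordered[0])
print("used:", used)
```

In this run the mirror with the lowest round-trip time is preferred, and because it is simulated as down, the request is served by the next nearest mirror instead.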