So today I'm going to be talking about using Galaxy file source plugins to work with remote data in Galaxy. The slides are available here. The talk is roughly broken into infrastructure and applications. To motivate this work, let's talk a little bit about getting data into Galaxy. In recent years the user experience around uploading data into Galaxy has gotten a lot better. Web browser technology has lifted a lot of the limitations that plagued us in the past, and we've re-architected Galaxy's backend to handle large uploads with less configuration pain and the frontend to deal with large numbers of uploads better, with things like collection uploads, rule-based uploads, and UI improvements. But upload is still limited: the files need to be on the researcher's machine or available via public HTTP sites, and the UI and robustness achievable in a web browser will never be able to compete with FTP, file managers, etc. So FTP upload and library directory imports are still more usable and robust at scale for getting data into Galaxy.
Unfortunately, these two mechanisms are not very general. Configuring Galaxy to talk to an FTP server has a variety of options, using the API is very different from other uploads, and the user experience is very different from library uploads. These limitations and differences are entirely historical artifacts of how the systems were developed and the narrow use cases they were trying to solve; none of the differences or complexities reflect useful differentiation that admins, users, or developers would want. One case where this is not true is the remote files API: there is an API for listing files, and it can list files in library import directories or FTP directories. So while upload itself was still bespoke at every level and different, this was a nice piece of technology that made it clear these are essentially the same thing and could be easily generalized. The key realization for this work was noting that if Galaxy can seamlessly upload http:// or file:// prefixes, why can't it also upload specialized gxftp:// (Galaxy FTP) or gximport:// (Galaxy import directory) URIs and navigate them the same way? Much like with the remote files API, the piece of code for streaming data from URIs wasn't synchronized across upload mechanisms, and so it was ripe for generalization. So what we did is add a pluggable system so that admins can configure different URI schemes for browsing and resolving files; this was pull request 988. After this was merged, both the remote files API and tools, including the upload tool, could treat these URIs uniformly. So gxftp://, gximport://, or even custom plugins could all be treated as simple URIs and uploaded and browsed the same way in the API. The awesome upshot of merging these code paths is that we could implement real plugins. So we added a plugin configuration file, file_sources_conf.yml. Here's an example: we have a Dropbox demonstration plugin for the lab's Dropbox files, with documentation that appears in the UI for the user and an access token that the whole lab will share. Once this file is configured, everything under gxfiles://lab_dropbox (that's the plugin's ID) can be browsed and uploaded as URIs, and we also added an upload dialog that can browse the Dropbox files. This configuration file is templatized, so it can access user preferences: in this case you could set up a user preference so the access token is read on a per-user basis. So if I have this as my Dropbox and I configure my API access token in my user preferences, then in the upload dialog, instead of 'Choose FTP files' I now have 'Choose remote files'.
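To make that concrete, the Dropbox plugin described above might look roughly like the following in file_sources_conf.yml. This is a hedged sketch: the type, id, doc, and accessToken keys and the ${user.preferences[...]} template syntax are recalled from Galaxy's sample configuration and should be verified against your Galaxy version's sample file.

```yaml
# file_sources_conf.yml - illustrative sketch; verify key names
# against your Galaxy version's sample configuration.
- type: dropbox
  id: lab_dropbox                # browsable as gxfiles://lab_dropbox
  doc: Our lab's shared Dropbox  # documentation shown to the user in the UI
  # Either a token the whole lab shares:
  # accessToken: <shared-lab-token>
  # ...or, because the file is templatized, a per-user token read
  # from that user's preferences:
  accessToken: ${user.preferences['dropbox|access_token']}
```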
It's a generalization. And along with my FTP directory, there's now also the Dropbox directory, and I can dig in and see those Dropbox files. These are the FTP files on the same instance, and you can see the user interfaces are identical for navigating and dealing with these. That's great. A question that might pop up is: what's the difference between these file source plugins and object stores? It really comes down to object stores providing datasets, not files. With an object store, the files are organized logically by Galaxy in a very specific way around the concept of a dataset, whereas a file source provides a description of files and directories as we typically think of them, meant to be browsed in a hierarchical fashion. There's also no concept of things like the extra files that would be in a Galaxy object store. Also, object stores are assumed to be persistent, whereas file sources don't need to make that same assumption. At this point we have a bunch of plugins for this plugin infrastructure. They all descend from the base file source class in Galaxy. We've got the POSIX ones, which include the Galaxy FTP and import directories; we've got a specialized S3FS plugin; and then we've got a bunch of plugins based on PyFilesystem2, including the Dropbox plugin that we've seen, the AnVIL one that will be demoed later on, and many others. PyFilesystem2 is an exciting project that provides a common filesystem abstraction layer for Python, and we've added infrastructure for rapidly adapting PyFilesystem2 plugins into Galaxy file source plugins. PyFilesystem2 provides numerous backends, and if you'd like to integrate Galaxy with some custom data source or layer, implementing it as a PyFilesystem2 plugin means you need just a small wrapper around that plugin to write something that's very general but still useful inside Galaxy. This is what we did, for instance, for our AnVIL plugin that we'll see later on. After merging the initial URI PRs, we immediately started working on additional functionality. For years we've talked about being able to write things like history exports to the FTP directory, because that's so much more robust than downloading for Galaxy users. So what we did with our next big iteration on this functionality was formalize the interaction with tools, allow plugin sources to be marked as writable, and implement the ability to write to FTP directories.
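To give a flavor of what that small wrapper looks like, here is a minimal, hypothetical sketch. The class and method names are not Galaxy's real API (the real base classes live in Galaxy's files framework); the point is that the wrapper only needs an object with a PyFilesystem2-style interface (listdir / readbytes / writebytes), so the demo below uses a tiny in-memory stand-in rather than a real backend like the Dropbox filesystem.

```python
class PyFilesystem2Source:
    """Hypothetical sketch of a Galaxy-style file source wrapping any
    object with a PyFilesystem2-like interface (listdir / readbytes /
    writebytes). A real plugin would wrap e.g. a Dropbox filesystem."""

    def __init__(self, fs, writable=False):
        self._fs = fs
        self.writable = writable

    def list(self, path="/"):
        # Browse a directory, as the upload dialog does.
        return sorted(self._fs.listdir(path))

    def realize(self, path):
        # "Download": materialize a remote file's contents.
        return self._fs.readbytes(path)

    def write(self, path, data):
        # "Upload": only allowed when the plugin is marked writable,
        # mirroring the writable flag described in the talk.
        if not self.writable:
            raise PermissionError("file source is read-only")
        self._fs.writebytes(path, data)


class DictFS:
    """Tiny in-memory stand-in for a PyFilesystem2 filesystem, so this
    sketch runs without the fs package installed."""

    def __init__(self, files):
        self.files = dict(files)  # {"/name": b"bytes"}

    def listdir(self, path):
        prefix = path.rstrip("/") + "/"
        return [p[len(prefix):] for p in self.files if p.startswith(prefix)]

    def readbytes(self, path):
        return self.files[path]

    def writebytes(self, path, data):
        self.files[path] = data


source = PyFilesystem2Source(DictFS({"/data.txt": b"hello"}), writable=True)
print(source.list("/"))                    # ['data.txt']
source.write("/history.tar.gz", b"archive-bytes")
print(source.realize("/history.tar.gz"))   # b'archive-bytes'
```

The writable flag mirrors the "marked as writable" behavior described above: read-only sources can still be browsed and realized, but refuse writes.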
We've included some example tools that show how to load up the file source configuration file and access it from inside of a tool, how to get a directory as a tool parameter, and how to pass the file source configuration as a config file to the tool. With that infrastructure in place, we were ready to revise the user experience around history import and export, and we significantly overhauled this at the end of 2020. Luke's demo at the end of the talk will show this off, but some key points: histories can now be written to Galaxy's FTP directory, a long-discussed and desired feature; before, only the link option was available. Likewise, you can import from the FTP directory or from any remote file source, any sort of file source plugin. This rewrite had a bunch of other goodies involved: we migrated everything to Vue.js, we provide actual feedback on what's going on and real error messages, and the whole user interface, even for the old link-based approaches to history import and export, is much more robust and much clearer now. Additionally, the EU team and David implemented support for securing file source plugins, which is very appreciated. Next, skipping to the applications: the EU of course quickly turned this fun infrastructure into a really nice, practical way to access large amounts of public data. They added a public FTP plugin and a vastly more stable S3 plugin. This is just one page of the data sources that are browsable from usegalaxy.eu as of today. A bunch of these data sources are Amazon open data sources; they are hosted for free on Amazon S3, and they include important resources such as a COVID-19 data lake. Here we can see navigating into the data lake. Additional bio-related resources include gnomAD, ENCODE, the 1000 Genomes Project, and GenomeArk. Amazon has also made a bunch of climate models and observational data available. All of that is linked on usegalaxy.eu. Then a bunch of public FTP servers, including EBI, NCBI, and Ensembl, are available for browsing and importing, as well as a few COVID-19 FTP resources.

Next up, we're going to see a demo on AnVIL, done by Luke, demonstrating the AnVIL plugin. Here we have an AnVIL workspace. If we select the data tab, we can see data that we've selected previously for inclusion. In the tables section we have a BigQuery table, which is a BigQuery query that we've previously performed. Here we have a DRS, or Data Repository Service, URI, which points to a file in a Google bucket. Here's a cohort, which represents a BigQuery query to be performed, but not performed yet. And we can have other tables containing data as we see fit; here's one such table that contains gs:// URIs, and all of these files live in Google buckets. Here is a much bigger, similarly structured table. We can also have reference data, which is again similarly structured, but this is a bit less relevant for Galaxy, since Galaxy often has reference data bundled. Here we can see our personal data: these are arbitrary key-value pairs that we've specified, which can hold DRS URIs or gs:// URIs or any such thing we'd like. And here are some of our personal files, which include any Jupyter notebooks we might have made and a text file that I added. Normally, to launch Galaxy you would select notebooks and create a cloud environment for Galaxy, but you can also run the AnVIL file system plugin locally, as we are here, in Galaxy 21.05. So we have a fresh history ready to go, and we want to get some data into it, so we load our own data, choose remote files, and note that AnVIL is an option. The top-level directories are the same as in the data tab. Under tables we see the same tables represented as folders. The contents of each folder will be a TSV that represents the table (let's grab this one) plus whatever files might be referred to by entries in the TSV. Note here that the DRS URI has been resolved to the actual file it points at, not just a UUID. If we look at that very large participant set, we can see that it has several kinds of files. Lots of files. Fortunately we can narrow it down, so let's look for some CRAM indices, as they're relatively small, and grab one of those. We can also look at some of our personal data; note that for the arbitrary key-value pairs, the referred-to files are an option in addition to just the TSV. And here's the text file. Let's grab all of that and, in the normal Galaxy flow, hit start, which will create an upload job. Ta-da! The BigQuery query was performed, the Google bucket files were grabbed, and all the contents are as we expect. But this is a workflow application, so let's run some tools on this data. We can select a simple tool, Cut, to excise one column from our workspace TSV. Is it delimited by commas? No, tabs. We submit this job, it is executed, and lo and behold, success: we have the single column we requested. Now that we've done that, we have the option to save our history, to be restored later or shared with colleagues. We do that by selecting 'Export history to file' and choosing a remote file. The only place on AnVIL that is actually a writable directory, and not just a table of some sort, is Files, so we select that, which represents our workspace's Google bucket. We enter the name, 'My cool history', and we export it. Under the hood, Galaxy will compress all of these files, include the pertinent metadata, and then push the archive to our selected location, our personal Google bucket. Success! We can see it in the data tab, in our files, as we looked at previously. Great. So we can use the same plugin to pull this history back in: we import from file, choose a remote file, the one we just created, and let Galaxy work its magic. Great success! But did it really work?
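As an aside, the Cut step in the demo is just column extraction from a tab-delimited table. This little function is not Galaxy's tool code, only an illustration of what that job amounts to:

```python
def cut_column(tsv_text, column, delimiter="\t"):
    """Extract a single column from delimited text, like the demo's
    Cut step over the workspace TSV (tab-delimited, not commas)."""
    rows = (line.split(delimiter) for line in tsv_text.splitlines())
    return "\n".join(row[column] for row in rows)

# Hypothetical two-column table like the workspace TSV in the demo.
table = "sample\tcram\ns1\tgs://bucket/s1.cram\ns2\tgs://bucket/s2.cram"
print(cut_column(table, 0))  # -> "sample\ns1\ns2"
```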
Let's have a look. Here we have the same files as we had before, at the same size, with the same content. The AnVIL file system plugin can be found on PyPI, the Python Package Index, and the source code can be found in the AnVIL project organization. And with that, I'd like to thank the whole Galaxy community for building awesome data innovations. Thanks so much.