Hello, my name is Enis Afgan. I'm a Research Scientist at Johns Hopkins University, working with a group of great people on the AnVIL project. Today, we're going to take a look at how we architected Galaxy for working on protected data sets.
So what's it like working with protected data? Unlike the open data counterpart that's easy to download off of a variety of URLs, with a protected data set, as a researcher, you first have to make an application to the data access committee (DAC) to grant you access to the given data set. Assuming your application is adequate and you're granted approval for accessing the data, you are allowed to download and analyze the data. However, there are some gotchas. Here, your institution, the company that employs you, is ultimately responsible for maintaining the confidentiality, integrity, and availability of the data that the DAC granted you access to. This comes with some ties as to what kind of environment the data is allowed to reside in. Treating security as an afterthought is not a good place to start here. You have to design the security into the chosen environment before you start downloading the data, because the integrity of the data has to be maintained throughout its life cycle.
So what does this look like? How do we design the necessary security into an environment? Well, depending on where you live in the world, there are different organizations that publish best practices on what it means for data to be considered in a secured environment. In the United States, this is NIST, the National Institute of Standards and Technology, which publishes a number of documents, one of which is NIST 800-53. That's a book of rules, guidelines, and checks that ensure that, after you've implemented them, your environment adheres to the set of best practices.
Once you have this set of controls implemented in an institution, you can go and have its compliance verified, which carries other terms such as FISMA and FedRAMP, or HIPAA for dealing with human data. But overall, the basic principle is that these are complex, time-consuming efforts that require a lot of money as well, and in turn are inaccessible to individual researchers in most labs.
So what are you, as an individual researcher, to do? Well, this is where AnVIL comes in. Unlike traditional environments, where people downloaded data onto their local systems and analyzed them locally, AnVIL bundles the compute infrastructure, the data, and the applications into one place. Hence, as a user, you're able to get the data, the necessary tools, and the infrastructure to operate on the data. Better yet, all this is surrounded by a compliant environment, in this case FedRAMP. So anybody in the world can come and use AnVIL and get access to these tools, Galaxy being one of them.
But what did it take to actually get Galaxy into this environment and make it adequately compliant? If we dissect what Galaxy is, for years now accessibility, reproducibility, and sharing have stood as the three pillars defining the majority of Galaxy. So how do we replicate these in a compliance-based environment that imposes some stringent rules as to how an application and how data can be shared and run?
For replicating accessibility, we maintain the browser-based access. You go to the Terra portal for AnVIL, anvil.terra.bio, you select Galaxy, and in a few minutes you will be given access to a Galaxy instance. What has happened in the background is that a dedicated instance of Galaxy was launched for you and you only. You're not able to share this instance with any other users. So in this case, we have to manage potentially hundreds of Galaxy instances and deliver them in a robust fashion to the users. So how do we actually implement this?
Well, we adopted Kubernetes and Helm as the technologies of choice that give us this robust deployment mechanism and subsequent management of the application.
For replicating reproducibility, you would think history is enough. But what if the disks that Galaxy runs on are transient? So here, we had to architect Galaxy to first allow it to ingest data from AnVIL itself. AnVIL houses over three petabytes of interesting data sets. In Galaxy, you can now browse these data sets from within the Galaxy interface, select the ones you would like to work on, and have them saved on persistent disks within your own environment. So you can come back at a later time, recreate the same Galaxy environment, and hence gain the benefits of reproducibility. There's a full-length talk on this topic, listed down at the bottom of the slide, so please check it out for more details on this feature.
And lastly, how do we replicate sharing? Because again, these are individual instances that cannot have multiple users simultaneously using them. So we've enabled exporting of data to AnVIL workspaces. A workspace can then be shared with different users, who can then import the data into their own Galaxy instances. Similarly, we've allowed exporting of methods, or workflows, into Dockstore, a repository of methods that also implements sharing capabilities, and users can then import those workflows into their own environments.
So if you want to try this, come by the poster and/or go to anvilproject.org.
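
The deployment model described in this talk, one dedicated Galaxy instance per user managed through Kubernetes and Helm, with a persistent disk that can be reattached when the environment is recreated, could be sketched roughly as below. This is only an illustration and not the AnVIL implementation: the function names, the one-namespace-per-user layout, the chart name `example-repo/galaxy`, and the `persistence.existingClaim` values key are all assumptions made for the example. The code only builds the `helm upgrade --install` command line; it does not run it.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class InstanceSpec:
    """What is needed to launch one single-user Galaxy instance (hypothetical model)."""
    release: str               # Helm release name, unique per user
    namespace: str             # Kubernetes namespace isolating the instance
    disk_claim: Optional[str]  # existing persistent disk claim to reattach, if any


def release_name_for(user_email: str) -> str:
    """Derive a stable, DNS-safe release name from a user identity."""
    digest = hashlib.sha256(user_email.lower().encode("utf-8")).hexdigest()[:10]
    return f"galaxy-{digest}"


def plan_instance(user_email: str, known_disks: Dict[str, str]) -> InstanceSpec:
    """Plan a dedicated instance, reusing the user's persistent disk if one is recorded."""
    release = release_name_for(user_email)
    # Assumption: one namespace per user, named after the release.
    return InstanceSpec(
        release=release,
        namespace=release,
        disk_claim=known_disks.get(release),
    )


def helm_command(spec: InstanceSpec, chart: str = "example-repo/galaxy") -> List[str]:
    """Build (but do not execute) the Helm command that creates or updates the instance."""
    cmd = [
        "helm", "upgrade", "--install", spec.release, chart,
        "--namespace", spec.namespace, "--create-namespace",
    ]
    if spec.disk_claim is not None:
        # Hypothetical values key: reattach the disk that survived a previous session.
        cmd += ["--set", f"persistence.existingClaim={spec.disk_claim}"]
    return cmd


if __name__ == "__main__":
    first = plan_instance("alice@example.org", known_disks={})
    print(" ".join(helm_command(first)))

    # Later session: the same user gets the same release name and the old disk back.
    again = plan_instance("alice@example.org", known_disks={first.release: "alice-disk"})
    print(" ".join(helm_command(again)))
```

Deriving the release name deterministically from the user identity means a later session can locate that user's persistent disk again, which is the property the reproducibility discussion relies on when the compute disks themselves are transient.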