Hello, everyone, and welcome to our talk, Galaxy Australia: the in-flux state of a national eResearch service. I'm Gareth Price, service manager of Galaxy Australia, and I'm very excited to be able to present to the GCC once again, this time with my co-host, Simon Gladman.

Hi everyone, I'm Simon Gladman, lead engineer of Galaxy Australia, and Gareth and I will be presenting this talk together.

Okay, I'd like to set the scene for you. Galaxy Australia has been in operation since 2018. It grew from pockets of regional funding to be funded, supported, and really championed as the national service for mature life science analytics. This was achieved through a number of mechanisms I want to talk about: a close working relationship with the Galaxy Project; membership of the usegalaxy.* ecosystem and, quite frankly, all our close global friends in the Galaxy community; and, locally, a lot of hard work by the team to make the service something researchers could trust in and, in particular, could recommend to their colleagues and students. Now, thanks to planning, funding, and successes, we have secured a nationally funded and ongoing host for Galaxy Australia at the Australian Academic and Research Network, otherwise known as AARNet. This talk is going to go through some of the reasons why, and the planning that has taken us to this decision.

So, what do we know about our users? Well, the first and most obvious thing is that the number of users is growing and the number of jobs is growing. But graphs like this don't really tell the whole story. What do we know is important to our users? Well, service uptime. That seems obvious, but when the tickets come in asking why you're down, any time you go down, you know just how important the service is to your users. We know our users increasingly want to be able to run larger and more complex jobs, and to have the capacity to run more jobs concurrently. We know our users are asking for a greater range of tools and workflows tailored to their research, and the talk after this one, by Johan, is going to show you how we're engaging with our local research communities in Australia, so please do stay on for that. We know our users want more reference data. And, thanks to Galaxy Australia's policy of hosting data, our users want easy mechanisms to get data in and out. Nuwan, part of our Galaxy Australia team, has a lightning talk coming up on Galaxy CloudStor integration, linking national resources to secure data transfers between our services. To find it in the conference schedule, just search for "Galaxy CloudStor".

What was important to us, the operators of the national Galaxy service? Well, much of the time it was the same things that were important to our users: uptime, and in our case with minimal constant intervention; failover landing pages and failover responses if and when the service does go down; clear and reproducible actions, so the team knows how to respond on a daily basis; and, again aligning with our users, because they're the ones we work for, the capacity to give them a greater job size and to allow them to run more jobs concurrently. And really, quite frankly, what we want is a lot less bus factor and a lot fewer sleepless nights.

So we did ask ourselves: when was enough enough? That was when we acknowledged our historical design (slash baggage) in our operation of Galaxy Australia.
Our infrastructure at the time was excellent, but that infrastructure and our goals just no longer aligned in enabling a national service. Our staff growth meant that no one really understood implicitly how Galaxy was built and operated anymore, and we needed to address that. And our original setup had evolved over time with gaffer tape and rubber bands. That was enough enough. What we needed, we thought at first, was maybe an intelligent design. We thought better of that: what we needed to do was evolve Galaxy Australia. So Simon's going to take over from this point and describe some of that journey, and I'll be back at the end with a couple of slides. For the handover to Simon, I just want to point out that, unlike the graphic on the bottom right there, our team does tend to wear more clothes than that as they operate Galaxy Australia. Thank you.

Thanks, Gareth. In September and October of 2020, Galaxy Australia had a series of significant outages. Most of these were caused by the massive extra demand we had seen over the previous six months, but some were also caused by the fact that we were running on over-subscribed cloud nodes, and we had some very noisy neighbours on those nodes. We also had a few back-end storage issues that were concerning us. Then, through some negotiation between the Australian BioCommons and the Pawsey Supercomputing Centre, we were given an opportunity to move our service to a non-oversubscribed cloud, and we decided to take that opportunity up. We also thought this would be a really good opportunity to re-architect our system: to move away from our gaffer-taped-together, evolved Galaxy service into one that we think would better handle a larger demand and also stand the test of time.

The way we did this was to systematically look at the architecture of what a Galaxy service would look like, redesign ours with all of that information in mind, and make the entire system completely reproducible. We wanted to use best practices like Terraform, Ansible, and GitHub, and we really wanted to use the Galaxy community Ansible roles for as much of our system as possible. The reason we wanted to do this was so that if we had staff changes, new staff could come into our project, do the Galaxy administrator training, which uses all the Galaxy community roles, and then transition onto our production system very, very easily. We also took this opportunity to completely automate all of our tool installation, tool updating, and tool testing, and Catherine Bromhead has a really nice poster on this in Thursday's poster session. We did all of this work and completed the entire move of Galaxy Australia from Brisbane to Perth in six weeks.

But how did we actually go about doing this? Well, the first thing we did was speak to all the other Galaxy admins we could find. We talked to the usegalaxy.org admins, Nate and Martin and their teams, about how they operate their Galaxy server. Then we talked to Galaxy Europe: we spoke to Björn, Gianmauro, and Helena, asked them how they built their system and about some of the issues they were having with it, and thought about how other people's experiences could relate to ours.
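To give a flavour of the tool automation mentioned above: here's a minimal sketch of installing Tool Shed repositories through the Galaxy API using the bioblend client. The server URL, API key, and tool entries are placeholders, and Galaxy Australia's production pipeline (the subject of Catherine's poster) is considerably more sophisticated, but the core operation looks something like this:

```python
# Minimal sketch of API-driven tool installation via bioblend.
# The URL, key, and tool list below are placeholders, not Galaxy
# Australia's real configuration.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="ADMIN_API_KEY")

# One entry per Tool Shed repository revision to install.
tools = [
    {
        "name": "fastqc",
        "owner": "devteam",
        "changeset_revision": "000000000000",  # pin a real revision here
        "section": "quality_control",
    },
]

for t in tools:
    gi.toolshed.install_repository_revision(
        tool_shed_url="https://toolshed.g2.bx.psu.edu",
        name=t["name"],
        owner=t["owner"],
        changeset_revision=t["changeset_revision"],
        install_tool_dependencies=True,
        install_repository_dependencies=True,
        install_resolver_dependencies=True,  # resolve e.g. conda packages
        tool_panel_section_id=t["section"],
    )
    print(f"Requested install of {t['owner']}/{t['name']}")
```

Driving the server from a version-controlled list like this, rather than clicking through the admin interface, is what makes the tool panel reproducible and testable on every update.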
Then what we did was build a diagram of what a Galaxy service actually looks like from a functional point of view. We wanted to separate out the virtual machines we had all these things running on and show what a large Galaxy service actually comprises: our Galaxy application, our queue controllers, some Pulsar nodes, our CVMFS references, our email service, the database and its backups, the storage controllers, and how it all links together and fits. Then we sat down and looked at what happens to all of these different components when they're booted and how they affect one another. Through this, we managed to find out what our bottlenecks were, and most of them concerned our storage controllers: everything was trying to contact the storage controller at the same time, so the loads were spiking, and especially on oversubscribed cloud nodes, we had a problem.

So we took all of this information and redesigned our architecture. The first thing we did was split our storage up across different machines, so that we had one for tools and references, one for user data, and one for job working directories, and then we basically built everything up from there. Every virtual machine in our new service is put together using Terraform and Ansible, and everything is automated using Jenkins. All of the Ansible scripts are stored in a GitHub repository, at usegalaxy-au/infrastructure, if anyone is interested in looking at it, and all of the changes we make are made via pull requests to the Ansible playbooks in that repository. We don't let anybody change the live machines anymore, because that's just the path to not remembering what happened.

Then we had to think about what to do with all of our user data. We had about 120 to 150 terabytes of user data sitting in our old location, and we wanted to know: did we have to move it all? It would take a long time to move it all the way across the country. After a bit of investigation, we concluded that we probably only needed to move the last two weeks of user data, plus our data libraries. Why is that? Well, we did some analysis, and it turns out that, at about the 99 percent level, our histories are worked on for about two weeks and rarely accessed after that. But we also found that data libraries are used quite regularly, so we should move all the data associated with those. We also kept an ongoing NFS connection back to our old user data for everything else, so that users could still log on and see all their old data, while all the new and most recent data would be local to our new Galaxy service in Perth.

And so we moved our Galaxy service from Brisbane to Perth last year, in December, got it all up and running, and it worked really, really well. Then, as Gareth alluded to earlier, our system in Melbourne hosted by AARNet is nearly ready to go, and as soon as it is, we will be moving our service from Perth over to Melbourne using exactly the same process we used to move from Brisbane to Perth. We hope things will go just as smoothly, and we plan on making this move immediately after this conference, so very soon.
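As an aside on that two-week cut-off mentioned a moment ago, a migration like this can be driven by something as simple as a modification-time walk over the object store. This is only a sketch under assumed paths; the real migration also carried across every data-library dataset and left the NFS mount in place for the long tail:

```python
# Sketch: list object-store files touched in the last two weeks so they can be
# rsynced to the new site. FILE_STORE is a placeholder path.
import os
import time

FILE_STORE = "/mnt/galaxy/files"
CUTOFF = time.time() - 14 * 24 * 3600  # two weeks ago

recent, total_bytes = [], 0
for root, _dirs, files in os.walk(FILE_STORE):
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        if st.st_mtime >= CUTOFF:
            recent.append(os.path.relpath(path, FILE_STORE))
            total_bytes += st.st_size

with open("recent_files.txt", "w") as fh:
    fh.write("\n".join(recent) + "\n")

print(f"{len(recent)} files, {total_bytes / 1e12:.2f} TB selected")
# Then, for example:
#   rsync -av --files-from=recent_files.txt /mnt/galaxy/files/ newsite:/mnt/galaxy/files/
```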
Over the last six months, and in the coming six months, we have also been making some large improvements to our Pulsar network. We've been given access to some significant remote compute resources, and we've turned them into Pulsar nodes and added them to our network. We've got a new one in Canberra, and we also have large high-memory machines located in Melbourne and Brisbane that we're adding to our Pulsar network. This will help us support bigger jobs and more jobs, and hopefully help future-proof our service a bit.

One of the really important lessons we've learned over the last 12 months is what to do when systems become unavailable for long periods of time. We really need to meet our users' expectations of availability, but what do we do when we have significant hardware failures, when building services fail, like the power or the cooling system going out for a long period, or even when there's a long planned outage somewhere? We've thought about this, and we've decided to build a disaster recovery site. We're going to make one of our Pulsar sites dual-purpose: normally it will be a Pulsar node, and then, when we press the big red panic button, it will turn itself into a backup Galaxy server which we can switch to. We're going to accomplish this through replication of the database and the recent user data, et cetera, and we'll have a semi-automatic failover. If you're interested in this subject, please go and have a look at Nick's poster, which will be in Thursday's session.

So hopefully, by the end of this year, our Galaxy servers will look something like this: our head node located in Melbourne, hosted by AARNet, with our main worker nodes and our main storage; some Pulsar nodes in Melbourne; another Pulsar node over in Perth; our backup site in Queensland; as well as some high-memory Pulsars. And now I'd like to hand back to Gareth, who will talk a little bit about connections to some national storage, and then wrap up. Thank you.

Thank you, Simon. It's over to me to wrap up our presentation with one final topic, and that is the connections Galaxy Australia is making to other Australian research infrastructure. We know that data movement has become increasingly complex: as our users become more ambitious in their analytical demands, they're bringing more data to the service, they're generating more results, and they want to import and export it all easily. To that end, Australia has a nationally funded research network in AARNet, which provides one terabyte of data storage per researcher in its cloud offering, CloudStor. We in Galaxy Australia wanted to tap into this, and so we did. We built, thanks to the hard work of Nuwan, two tools for Galaxy: an import-from-CloudStor option and a send-data-to-CloudStor option. These move data securely, with authentication handled through password management. There's a lightning talk here at GCC this year that Nuwan has put together, Galaxy CloudStor integration: linking to national resources, so I do encourage you to chase that down if you want to hear just a little bit more about our service.
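For a sense of what those CloudStor tools do under the hood: CloudStor is built on ownCloud, which exposes a WebDAV interface, so a transfer can be as simple as an authenticated GET or PUT. The endpoint and credentials below are assumptions for illustration, not the actual implementation of the Galaxy tools:

```python
# Sketch of a CloudStor-style transfer over WebDAV using requests.
# The endpoint, paths, and credentials are illustrative placeholders.
import requests

WEBDAV = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav"
auth = ("user@institution.edu.au", "app-password")

# Import: stream a file from CloudStor down to local disk.
with requests.get(f"{WEBDAV}/data/reads.fastq.gz", auth=auth, stream=True) as r:
    r.raise_for_status()
    with open("reads.fastq.gz", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)

# Export: push a result back up with a WebDAV PUT.
with open("assembly.fasta", "rb") as fh:
    resp = requests.put(f"{WEBDAV}/results/assembly.fasta", auth=auth, data=fh)
    resp.raise_for_status()
```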
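And to close the technical loop on the Pulsar expansion Simon described: Galaxy routes jobs between local runners and Pulsar nodes, and operators can steer that routing with small Python "dynamic destination" rules. The destination IDs and tool list here are invented for illustration, but a rule of roughly this shape is how big-memory assemblies can be sent to a high-memory Pulsar node:

```python
# Sketch of a Galaxy dynamic job destination rule. Destination IDs and the
# tool list are hypothetical; they would need to match entries in job_conf.
from galaxy.jobs.mapper import JobMappingException

HIGH_MEM_TOOLS = {"spades", "trinity", "unicycler"}  # illustrative only

def route_job(tool, user_email):
    """Return the ID of a job destination defined in job_conf."""
    if user_email is None:
        raise JobMappingException("Please log in to run jobs on this server.")
    # Tool Shed IDs look like .../repos/owner/repo/tool/version; keep the tool part.
    tool_id = tool.id.split("/")[-2] if "/" in tool.id else tool.id
    if tool_id in HIGH_MEM_TOOLS:
        return "pulsar_highmem_melbourne"  # a big-memory Pulsar node
    return "slurm_default"                 # otherwise, the local cluster
```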
And with that, it remains for me to thank the Galaxy Australia team that has made our service so successful. Outside of Simon and myself, that is Igor, Nick, Catherine, Nuwan, Michael, and Anna. Beyond the core Galaxy team, we have so many others to thank, so we've got a lot of acknowledgements, with specific mentions on this screen: the great staff at QCIF and Melbourne Bioinformatics; funding and resources provided by the ARDC, Pawsey, AARNet, the Galaxy Project, and Bioplatforms Australia; and funding from the Queensland Government. So thanks to all those partners for making Galaxy Australia the success story it is. And with that, I'll close, and we welcome your questions.