Okay, everybody. By my clock it's 10 after four, and we should get going with the late afternoon session. We've got a few important things. We are going to hear from our Platinum sponsor, Amazon. We are also going to try to take a group picture after that, and control the chaos enough to come back here and then hear from the Galaxy main project PIs on their annual Galaxy update. So, some things to fit in. Before we do that, I just wanted to very quickly make a couple of announcements. One is that if you are speaking tomorrow as a presenter and you haven't done it yet, please get your presentation to Jen so she can get it uploaded and in place, so that you can make it happen up here. There is also, if you look in the program, a scavenger hunt, which is basically taking a selfie at a number of notable locations within the Twin Cities area. If you do that and submit those pictures, and have more than anybody else, we're going to give the top three people a bag full of Minnesota-related prizes. So keep that in mind. It may not be that difficult to do, so go snap a few pictures and you might win a few prizes. And last but not least, we do have some travel award fellowship winners, and I wanted to recognize them. I'm just going to read off the names of those who got fellowships to defray some of the costs of the conference this year. So let's see, there are three fellowships that went to people from the Cleveland Clinic, and that's Brian, Robin, old, Fabio, and Josie, and then one that's local, at the National Marrow Donor Program, and that's Ray Salkuga. So maybe we can just give a round of applause to all the fellowship winners. Okay, with that I'm going to get off the stage. Our first talk is from our Platinum sponsor, Amazon, and they are going to talk to us about some options for genomics on the cloud.
Yeah, thank you. I thought I was being clever with this title, but I think Luke really stole the show on that one: at Amazon, we're not free.
This is the set of things that we're going to talk about here. Genomics, money, and that's why I'm here, because we're not free. And then I'll go into some specific services, just as knowledge sharing. Honestly, I was quite surprised by the mention of some of the architectures. I mean, I knew folks were deploying Galaxy; that's obviously the platform that has been used for teaching, with long-time support of this community, so that you can have a stable infrastructure to teach the community how to leverage Galaxy. But whether there was production use or not, I didn't know, and I was happy to see that there was production use as well. But I'm going to begin with why the cloud at all for this space. Way back in 2013, I came from academia, at the University of Pennsylvania, running the infrastructure supporting genomics course, and at one ASHG, I believe it was ASHG, or it might have been one of the cancer conferences, I forget, it was a while ago. The TCGA project at that time was being hosted at the San Diego Supercomputer Center. It was pretty much the largest cohort of genome sequencing out there. It was about a petabyte at that time. And a presentation was given on the data coordinating and data distribution service, and it was roughly eight months into the year, and they had already egressed a full petabyte of data out from people requesting this 500 terabyte data set. And it was growing, it was accelerating, and it was expected to be well beyond a petabyte by the end of the year. And to me that was a giant red flag. It was actually one of the reasons I started looking at cloud architectures for doing science, because how many 500 terabyte copies of TCGA were out there that folks were repeatedly paying for? And it was only going to get worse. And how much time is wasted with somebody babysitting a download? It sucks, right? So, completely unsustainable.
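As a rough illustration of the egress numbers quoted above (this is not something shown in the talk), the arithmetic can be sketched in a few lines of Python. The 500 TB data set size, the one petabyte of egress, and the eight months are taken from the talk; the decimal unit conversion and the straight-line projection to year end are assumptions made here, and since the speaker describes the growth as accelerating, the linear figure is best read as a lower bound.

```python
# Back-of-the-envelope egress arithmetic using the figures quoted in the talk.
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

dataset_tb = 500            # size of the data set, as quoted
egress_tb = 1 * TB_PER_PB   # data egressed so far, as quoted
months_elapsed = 8          # "roughly eight months into the year"


def full_copy_equivalents(egress_tb, dataset_tb):
    """How many complete copies of the data set the egress volume amounts to."""
    return egress_tb / dataset_tb


def linear_year_end_projection(egress_tb, months_elapsed, months_in_year=12):
    """Straight-line extrapolation to year end (an assumption, not a quoted figure)."""
    return egress_tb / months_elapsed * months_in_year


copies_so_far = full_copy_equivalents(egress_tb, dataset_tb)
projected_tb = linear_year_end_projection(egress_tb, months_elapsed)
projected_copies = full_copy_equivalents(projected_tb, dataset_tb)

print(f"Full-copy equivalents so far: {copies_so_far}")
print(f"Linear year-end projection: {projected_tb} TB ({projected_copies} copies)")
```

Under those assumptions, the quoted egress already corresponds to two full copies of the data set leaving the host, and a constant rate would make it three by December.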
And I started this journey about how do we support genomics researchers, not only by changing the model of compute with the cloud, flipping that paradigm of downloading data to compute on it, but really just bringing compute out to where the data lives. For the most part, the first few years that I was at AWS, I spent a lot of time working with the NIH on their policies to allow controlled-access data on the cloud; that actually wasn't a thing. The other thing that we did is we created this Open Data program that allowed hosting of different data sets across different domains: satellite imagery, machine learning, a lot of technical purposes, and there's actually quite a bit of genomic information and data sets within this registry of open data. And like I just said, the references right now are in Sydney; we might copy them if it's a little bit slow for you guys, but there's a whole list of other data sets in there. And that was the first step of really thinking about democratizing access to data, for real, right? Because not every institution actually has an HPC cluster that you can rely on. Sometimes when it does have that HPC cluster, you might be waiting in the queue for days to weeks; we were fairly well funded, and sometimes our queue was quite long. So the reason I joined AWS is because, while running an HPC infrastructure, half of my research jobs were running on AWS simply to avoid the queue itself. But I was fortunate enough to have, like, a Daddy Warbucks, very well funded, because we're not free. So, speaking of things that aren't free: what services are good for running Galaxy? I'll just describe a few that are there. So I'm a developer advocate within the HPC services group, and our group actually develops software for doing tightly coupled workflows, and also high-throughput workflows, for a lot of different domains.
And one of those projects that we have is called AWS ParallelCluster. This is a command line tool; it's an open source cluster management tool that integrates with a bunch of AWS services, and what it does is stand up a scalable Slurm cluster for you, pretty much creating an environment that's very similar to what you find in your institutional clusters. So this is just a subset of AWS services here on your left, my right. There's a lot; it's a big ecosystem. I'm not going to get into it. We come out with things all the time, and I hear about it sometimes because a customer tells me, oh, you guys just released this, and I'm like, oh really, we did? That's really awesome, I was waiting for that. So it's just so big. But in terms of what you need for research computing: you have some storage services on the top; you have visualization, and other domains do a lot more visualization through big applications like Ansys Fluent or CFD applications, where there are some fancy things outside of notebooks and web technologies; we of course have compute; and then we also have some high-speed networking that is for the most part not relevant to most of genomics, but it is relevant to a lot of other sciences. And then AWS ParallelCluster orchestrates all of these services to give you what you see over here on the right, which is a Slurm queue. It can stand up remote desktop visualization tools with this protocol, DCV, that we developed, and on the far left there are the resilient storage services and then scalable compute. So practically speaking, what scalable compute buys you is that you have a queue; that queue can be zero, and therefore your compute is zero outside of the head node. As soon as you start putting things into the queue, our systems look at it and start saying, oh, you need instances with at least five CPUs and four gigs of memory per CPU; we'll stand up this sort of instance
to do your job, because that's what's configured to launch as far as ParallelCluster is concerned, and launch those until you drain your queue. Once your queue is drained, it'll shut them down for you, so you're not paying for those computers when you're not using them. So that's nice, that's really, really nice, right? To not even have to worry about provisioning resources; it just scales just in time for you. One other thing that I'll point out: ParallelCluster has a lot to it. So the way that I would say, if you're going to use this as the core infrastructure for your Galaxy install: you can install a slightly beefier head node than you would normally, so that you can run the Galaxy processes on the head node, expose that as a web port, and then attach on the back end either a DRMAA connector or just the normal Slurm plugin for Galaxy to schedule with. The other thing that I'll talk about is file systems, because this is actually really cool: ParallelCluster will stand up not just the cluster itself; it can also stand up a full Lustre parallel FS for you. We have this other product called FSx for Lustre. By the way, they just added other file systems, specifically NetApp ONTAP, so you can mirror data from local, if you have a NetApp system, to AWS using FSx for NetApp, and they just announced more as well. But I'll show you Lustre today, because that's what ParallelCluster supports today. Basically, the FSx for Lustre system will stand up a parallel file server for you that's SSD based, if you want it to be, and it'll do the metadata caching. But the really cool thing about this is that you can set it up to mirror the metadata from an S3 bucket and say, present to me as a parallel file system this S3 bucket of the 1000 Genomes data or the Galaxy
reference data, and on your first ask for that file, it will copy that file into the FSx system so it's available across the cluster. So it sort of rewards you for repeated access of files. But not only that: if you start creating files in the file system, it'll do the reverse, right? As soon as you create a file within the parallel FS system, it'll cache it back into S3 for you as an object, which is really cool, all automatically. And there's a lot more to FSx that you can do: across regions, you can cache on-prem storage, in fact, and use it as a cache for on-prem, back and forth. So it's a really powerful system; I don't have enough time to talk about it. And finally, we just released a few months ago a GUI, because ParallelCluster is a YAML file, and while it's fairly simple to do, oftentimes you might have users who just really want a GUI. You guys know, I mean, Galaxy, right? Everything is a command line environment, so why does Galaxy exist? Because people like GUIs; it's easier to discover the functionality, it's easier to implement stuff. So if you're looking to deploy ParallelCluster or play around with it, we have a bunch of workshops that use this as the basis for deploying ParallelCluster. So, okay, back to Kary's talk about graduating from Galaxy, right? There are things when you get to operations, when you get to a production system that's doing the same thing over and over; oftentimes you're not going to use a GUI for that. You may, but oftentimes you go to something like Snakemake, Nextflow, or CWL, where the workflow itself is something you're not managing by hand. It's based off of events: some files will come into the file system, and you have a daemon looking for the run-complete file so it can kick off in a production system. So it's all sort of automated, and
for that we have a different batch scheduling service called AWS Batch. It's a fully managed service; you schedule and run asynchronous jobs. It is our scheduler, right, so it has its own syntax. In addition to being that queue and that job scheduler, it'll also manage the fleet behind the scenes for you, to allocate the compute that it runs on. The reason you might want to use Batch is because it has different allocation strategies for you, including best fit, like, give me the best machine that you can get for the things that I specify in my queue, or, if you can't find that, just give me the next best one, right? That's called best fit progressive. It also works off of the Spot market, where it says, give me the biggest capacity pool of Spot so I can be more confident that you're not going to take that machine away from me, and Batch knows how to do that. The Spot market, I'm not sure if you all know, but we've got a lot of compute; you have no idea how much compute we have. So we have something called the Spot market, which is essentially unallocated compute that we offer at a very steep discount, but the caveat there is that we'll give you a two-minute warning when we take that thing back, right? So you can checkpoint jobs, or you can restart jobs; if you've got a workflow that knows how to start from where it left off, you just restart it, and things like that. Batch takes care of job retries for you as well. It is container based, so there are a lot of native plugins that support Batch as a back end: Snakemake, Nextflow obviously, which I mentioned, and other industry workflow languages that are outside of genomics, like Airflow or Metaflow or Luigi, or even our own Step Functions. And the reason we tell folks, you know, if you've got more than two steps in any sort of analysis, you should really think
about a workflow framework is because it does a lot for you, including being able to do local development, run it on your own HPC infrastructure, or the AWS cloud, or even other clouds. It allows that workflow portability, especially if you're looking at containers for the application and good separation of inputs versus what's in the workflow, so that you can change the input source easily without changing the workflow itself. Speaking of genomics, we actually have a whole division that's outside of the HPC services team looking at genomics as a problem, and the first thing they released was this Genomics CLI. Basically this is the ParallelCluster equivalent for genomics workflows, right? So it supports standing up infrastructure so that you can get going fast at running Nextflow or Cromwell; it supports CWL via Toil, and Snakemake. And the back end of all of those languages is AWS Batch, right? So Cromwell itself, we know who develops it; they're not AWS currently, but we've worked with them to add some functionality to Cromwell. It's not been folded upstream yet, but it's there to support you guys better. Same thing for Snakemake; we are pushing it back upstream at some point. So that's all I had. I know I went super quick there, but if anybody's got any questions, I'm happy to answer them. There are other things that I forgot to put on here. You heard about the Open Data program, which is free, so we have some free stuff. We also have a research grants program, so if you're an academic researcher and want to do an experiment or try something out, look up AWS credits for research. We have a regular call for funding that you can apply to in return for cloud credits, and of course there's the NIH STRIDES program from the NIH as well. And there was one more thing that I wanted to discuss, right: speaking of containers, we've been working with Bjorn to push BioContainers
into AWS's registry for containers, called ECR Public Gallery. We're about 20 percent complete. Once it does complete, we'll put out a blog post about how you use it and how you can point Galaxy and other systems at it. All right, now I'll take questions; that's all I had.
Yeah, so I see you focused on like, yeah, a little bit more just like to see what goes, a lot of people are also, it's fair for me to see you on the job, yeah, can you refer from this that doesn't scale to that, you know, miss, or
No, no. I mean, Kubernetes, again, we're not pigeonholing folks. If Kubernetes is working for you to run clusters and workflows, go ahead and use it, right? We do have a managed service, as we saw from an earlier talk, Elastic Kubernetes Service, which stands up that control plane. And the reason you might want to do that is, so I saw this really nice quote on Twitter the other day, that the portability of Kubernetes isn't actually the portability of your infrastructure, it's the portability of the operator experience, right? So it's always the people, and being able to hire and train and port your skills to other places that have Kubernetes as a native infrastructure, just lowering the bar to become useful. So there's always going to be some customization depending on where you're deploying Kubernetes to, whether that's Google, or Microsoft Azure Kubernetes, or EKS here at Amazon. There's always going to be weird, or not weird, but specific-to-the-platform things about authorization, the control plane, what versions are supported, how you integrate with other services, right? Long-winded answer: it's fine, if that's where your skills and capabilities are, to use it, and we're more than happy to help folks implement on top of the system that they've chosen.
You mentioned that the Genomics CLI basically runs on AWS
Batch, but you also mentioned that with AWS Batch, since you're using Spot instances, your compute might go away with a two-minute warning. So does the Genomics CLI handle the snapshotting for you in case Batch goes away?
A lot of that has to be done... that's so application specific. It's always up to the application to define how to start workflows again, right? Some workflows are good about knowing at the output level, and doing the checksumming on your behalf, to say, I've already created these files, I don't need to create them again, I can start from here. So that's at the workflow level; that will be Nextflow or whatever. Batch itself has a simple individual job retry, right, and then you rely on the application for whether you start from the beginning or, you know, whether it can skip a couple of steps. So that's always going to be application dependent. AGC is really about standing up the infrastructure endpoint to be able to submit workflows, so it actually supports the GA4GH Workflow Execution Service endpoint standard. I think there was, well, there was somebody behind here, and you can go.
Yeah, thanks. I have more of a type of question. So in our workshop about BioContainers, there was a question raised of sustainability of our infrastructure model: how dependent are we on free services? So we are using Quay.io, we are using emerald on the free services; now for reference data we are using YouTube, which is also a service that we all use, that we are dependent on, which is free. And that's cool. We have these dependencies, right? And the question is, how do you see that from your side as a commercial entity? Would you like us as an open community to more or less use these services the smallest contract, and should we be prepared as a community for those services? Is it right, do you find, to use open free services?
So that's the, so for the folks online, the
question about sustainability was raised, about, you know, there is charity here, right? There's charity by commercial organizations to support the Open Data program, right? Typically what you'll find, and you'll find this true across all of the cloud providers, is that their programs are bounded by time, right? So the Open Data program for the reference data has a two-year contract for us paying for hosting, because we want to encourage use of the data, and then there's a renewal, just like a grant has specific aims and a renewal. So I think, to your point about how commercial organizations view some of these communities and what we do: we do it off of specific aims, and whether you know it or not, there's always going to be something there that is tracking, is this useful for the community, and is it something that we can support for the long term? So at least for us, we take the long view of these things and say that these programs aren't going away; they've been there for years, we've been doing this for years. But it is incumbent on who we work with to engage with us to make sure that the thing is a success. It's not drop it in and then we take care of it. We specifically structure these things so that there's still an owner; we're just an enabler to the community and the project itself. And as long as the owner is still a good steward of the community, we're more than happy to keep supporting. Can I guarantee it? As a commercial entity, we can't do things like that, but neither does the government. Now, on my soapbox: STRIDES took forever to get, you know, an initiative for cloud funding that's even somewhere near the footing of what capital expense things are there for, right? So I think as a community we have large problems coming, not even related to cloud, just the people doing the work. So there's a sustainability problem there. And I was
having a conversation, I'm sorry if I'm going off, but I find this interesting, I was having a conversation about how informatics projects are funded today, and they're usually tied to either data distribution or data coordination centers, or very large projects that are disease-centric or about data generation, right? And is that okay? Should we be funding informatics for informatics' sake? I think Galaxy has been supported very well over the years, and I foresee it continuing, but Galaxy's one project. Should there be more resilient funding? Should there be more training programs? Should there be allocations? And that is only going to come from you; the funding agencies need to hear a proposal there, and then maybe sacrifice some other initiatives.
Okay, I think we should move on. Thank you. AWS may not be free, but you are kicking back some of that money to support this community and this conference, so we're very thankful for that. So the next phase here is we're going to try to get a group picture, and I don't know how that's going to work, so I'm going to get out of the way. Somebody else knows something about that, I think. So you want to give some instruction?
All right, so I don't know what I'm doing, and I was strong-armed into this, so if there is a professional photographer, or just someone who knows at least a little bit, please interrupt me and tell me how to do this correctly.
All right, welcome back. It is my pleasure to introduce the next talk here, and that's the annual Galaxy community update from the leaders of the Galaxy main project, so I will give it over to them and they'll take it from here.
Make sure we do everything correctly, so speak into the right one; that doesn't work. So the list of people in this talk, as you may have seen in the program, is bigger, but we thought that with five people it would be a little much, so we came to the consensus that it will be me and Mike
giving this talk. So before beginning anything, I just want to thank Mike, because he took the team under his wing, and it's the same old team; it works very well, and for the past few years he managed to, I think, improve drastically the spirit of this project. So thank you so much for doing that; without him it would be impossible. Please be very kind to him, the dealer at that department, so we're the one from all the other departments like this. And I also want to thank the entire team. I want to thank Bjorn; also without Bjorn that would be impossible. I want to thank the Australians that are here, and also Andrew, because it's been a tough two years. This is the first time we're together, so it's amazing. So that's the end, and now we're going to start the talk. So Galaxy is at 15; it's a teenager, a well-behaving teenager. Stars and ghosts. And I think in this talk we first wanted to go back to the philosophical foundation of what the project was about, check if it's still valid, and then go into some of the goals, the way we see what the project should be doing, and then besides that some things that happened in the past year. But before doing any of this, I want to thank the organizers, because as you well know, organizing conferences does not add to your likes, okay, so those that are rare. And we're in Minneapolis, we're at the University of Minnesota, and this is very special because Galaxy-P, well, speaking in galactic language, this was the supercritical event, because Galaxy-P is also a teenager. So for us this was incredible, because the project that we were building, a different team took it and used it for something we have absolutely no idea about, the proteins that complicate the genesis, so we don't know, we're going to say five years like that, so that's sort of as far as we go. And this was an incredible motivation, and I think the fact that we're here today is because of this. So that sort of kept
us going, and so thank you, Galaxy-P team. So thank you again, thank you organizers, and obviously thank you sponsors as well for making this possible. This is, as you see in this t-shirt screenshot, the first physical conference after this black hole, so it's important. Okay, so let's revisit why we're doing this; why the hell are we doing this? All right, it's also doesn't matter your lifespan. So this is one of James's original slides. This is from 2019, but you know we used these slides for many years, and we had a cheesy animation just to get to this, you know, the worst thing is to use crazy, so that's what he was, just he was leaving. So, three goals. This is kind of a, you know, communist manifesto of Galaxy, so it's an absolute thing, completely unachievable, but a good philosophical foundation nevertheless. So three goals: accessibility, this is actually being able to run the analysis; transparency, which means we can explain how the sausage is made, easily; and reproducibility, in the sense that other people should be able to make the sausage, so they should get the ingredients, you know, and get the same product. And it's still a very valid concern in our field, and I think what became clear in the past 15 years is that it's also really a concern in all fields that do computation. So I think the relative weight of these concepts may be changing a little bit; I will specifically emphasize accessibility. So reproducibility obviously is being able to repeat, and in general, I added this screenshot from Princeton about machine learning research being irreproducible, so it's not just biology. Collaboration, this is essentially the ability to share, so it's also the ability to be transparent about how I did this; it's also the ability to introduce things. And in Galaxy, these things, such as reproducibility and sharing, are there by design, so it's, I guess, a software feature. But ultimately you also need
to be, if you want to do analysis, you need to do it somewhere. You know, the software doesn't run on air, so you need some resources to do that, and that leads us to accessibility: you should be able to run it somewhere, so it has to be accessible. And I think especially COVID showed that accessibility of computation, accessibility in general, the ability to do analysis, is really not that good. So accessibility is power. Let me explain how this works. This is Google Maps. There's a town in France for mobility; it's on the sea, but it's not really close to it. So there's a beach there. The town itself is a liberal town, it's like sentences, but the people who live by the beach are, let's say, bourgeois, so they really don't want the North Africans from the city going to their beach. So if you have a car, it's accessible, but if you don't have a car, it's impossible to get there. There are no laws in France which allow this discrimination, this segregation. So you don't need any laws; you just need to remove accessibility. So the tram stops right here, and so you have to walk for, I don't know, talking orders; you can still access it, but who's doing that. So this is the difference between these two things. This is how the materials and methods of many papers are: if you read them, there is some GitHub thing there, there might even be a little workflow or something else, but can you really go to that beach? No. So I think one of our primary goals, and this is a refinement of these ideas, is fighting analytical inequality, because the number of people who do science is high, not only in the United States, not only in Europe; there is a very high level of science in Africa, a very high level of science in Eastern Europe and Asia and so on, but these people have a much harder time getting access to resources. And even in places that are sort of free, not free, for example commercial clouds: in France it's difficult to use
Amazon because (a) it's American and (b) you need to have a credit card to access it, and it's hard to reconcile, and this sort of goes back to that beach. So I think one of the fundamental take-home messages here is that Galaxy makes analysis accessible now. You can do analysis now: you can go create an account, go to the .org, .eu, .au, .fr, .be, .es, and you can do it now. And this is, I think, a fundamental value that we bring in addition to everything else.
Well, first of all, Anton, thank you for that very generous introduction, and really, thank you so much to all of you. I feel like a massive ambassador being here, but over the last couple of years, I don't deserve leader, first of all, but I feel like this has been an opportunity, and I'm glad to be here. I wish it was under different circumstances, but I'm glad to be here. I've never met a community that was so passionate and so intellectual and so driven. I remember when I was getting ramped up, I asked Anton, you know, there's hundreds of developers worldwide, there's I don't know how many thousands of servers, petabytes of storage; like, where's the chart? Where's the chart that explains what everyone's doing? Anton laughs at me. He laughs at me: we don't need a chart. Everyone is so independently driven and motivated and excited, and we are here for this accessibility. We're here to change the world, we're here to make the research better, we're here to do so much, but we don't need the chart. We don't need a top-down administration; it's bottom up, it's the community that drives us, and it's been remarkable. So I really sincerely thank you all. It's really a privilege to be up here today, and I acknowledge and I understand the great weight of it, and it's really a great privilege. So we're going to continue on and comment on some of the things that we've been observing. You know, many of these themes we've already been talking about throughout the whole conference, so it'll just
in some ways just be the summary of the work that all of you have already put forward. So first of all, some of the easy metrics. If you look at jobs that are running around the world, we've hit some important milestones: we're, in the US and EU, and soon to be in Australia, on the order of a million jobs per month. Thanks to the magic of log scale and exponential growth, we've had about as much growth in the last year as we've had in the previous 10 years. That's incredible. That's incredible that after all this time we're still continuing with excitement. We're in no way leveling off; we're just getting started. We're just getting started with all the great work that we can do here today. In terms of commits to the repository, and this is just one of many, you know, hundreds of developers working worldwide, 24 hours a day, 365 days a year, dozens and dozens and dozens of commits every single day. There's a very, very vibrant, active network of people just trying to help each other, support each other, again add that capability, make that accessibility, bring forth all that science to everyone, to the whole world. And we're also here not in isolation. We're trying to give it all away; we're trying to train people; we're trying to get people excited about the work that we're trying to do. You know, this is growing a community, and the GTN, by topics, tutorials, or contributors, is just this massive amount of material. Pretty much any topic that I'm interested in, when I do a Google search, the GTN is always on the first page. There's just such a comprehensive repository of validated, reproducible workflows. This is so exciting. This is what we need. This is what's going to extend our Galaxy into more and more directions.
So in terms of what was happening, so you heard enough, so by the way, if you do want to get to the beach as an error. So in terms of interaction with Galaxy, we need to be able to interact with the
software, and so this graph, on purpose, gives UI and API the same weight, because one of the misconceptions about Galaxy is that it's for people who like to click: you click, you drag, you develop carpal tunnel syndrome, you don't use Galaxy again, this kind of stuff. But it's very important to understand that you can use it via a graphical user interface and you can also use it via the API, and in our view, in the view of the people who develop it, these are equal things. So I will talk about the UI, because I don't understand how the API works. But essentially there are four groups of things which you can think of when you're interacting: histories (your analyses), tools, workflows, and data.
So, in terms of histories, one of the huge things that happened in the past year is the new history. At this point it looks and functions very much like the previous one in terms of the set of things that you can do. However, it's completely new, it's a blank sheet, it's a new foundation which will allow us to develop really good features, such as a map view, or being able to actually view a history as a hierarchy rather than just a linear representation of datasets. And we already have very powerful search functionality, and it's finally fast.
In terms of tools, we have, again, a new framework for the tool panel. There will be a lot of changes here in the coming months, so you'll be able to find tools, because one of the problems that we have is that we have too many tools, so it's impossible to find one. I can find one because I've been going through this panel for 15 years, but in general it's hard to find Cut, and in general you actually need a dedicated tool panel, perhaps. Especially now, as some of you may have seen, with the single-page app, there are lots of things that are possible.
In terms of workflows, workflows obviously are a way
to communicate with Galaxy; this is the way to run analyses, and the birth of the IWC allows us to have a curated set of high-quality workflows. Everybody can make a workflow, everybody can deposit workflows somewhere, but these workflows are tested, they are versioned, and they are developed by people who actually know this analysis. So, for example, our VGP workflows, which I'll be talking about, will be distributed by the IWC, so it's kind of a stamp of quality.
And in terms of data, there is a lot happening. One of the things is, for example, accessing remote repositories, such as all the data produced by 1000 Genomes. Then there is deferred data: you select the datasets you want, and Galaxy actually doesn't download them until they are needed for a run, and they don't count towards your disk quota. So it's a huge achievement of the past year. And there is the cloud, so these datasets are kind of there, but okay.
On the other half of the chart here we have all the API work that's taken place. At the top of the list has been the implementation of FastAPI, the sort of standardization, so basically all the major features of Galaxy are now accessible via an API. This opens up all kinds of new opportunities. We've heard about a few at this conference, where you can script things together that have just never been scripted before. It also serves as a form of documentation: we can actually review what all the available API calls are and what parameters are associated with them. Again, it's all about exposing this technology in new and interesting ways. But in addition to our internal APIs and internal developments, there's a lot of activity going on worldwide. If I'd point to one example, it would be the Global Alliance for Genomics and Health, the GA4GH, where they've developed quite a few new APIs covering how to exchange data, how to
exchange tools, how to do authentication, and already we have made tremendous progress in implementing many of these key standards. I think we're pretty close to being the first reference implementation for all the GA4GH standards, and that's going to open up all kinds of new opportunities to be able to do work and research at a global scale. I'm incredibly excited about that. And again, through all these APIs, it's all about automation, it's all about deployment, and it's all about creativity, where we combine these components together in ways that have never worked together before, so we can make something new rather than having to reinvent it from scratch every time. So it sometimes is a little bit hidden, with all these APIs under the hood, but I want you to know that they're incredibly important, and we incredibly appreciate all the hard work that is done.
So this is about being able to do analysis ad hoc, without using specific tools. We talked about this at previous GCCs some time ago. Back then it just meant Jupyter; now it means many things. But in general, in any analysis you come to the point where there are no more tools: you want to do this or you want to do that. And so in Galaxy there are about three ways of doing this. There's post-analytics using interactive environments such as Jupyter, RStudio, or, for example, Observable, which you heard about yesterday. There's being able to create tools on the fly; that's instant tools. And there are visualizations. I showed that slide, I think, in Indiana, but the basic point is that Galaxy itself is good for analysis of a large number of datasets, but eventually you get to the point where you need to compare this with that, and there are no tools. And so I think one of our goals (we've been working on this for some time, but I think we're ready in the next year) is to finally make this robust, so that the Jupyter integration and RStudio integration are robust on our main
instances. This is going to be extremely important going forward. That's, for example, part of the VGP analysis workflows, where we sequence a genome, you do this, you do that, and then we want to look at the result. And here, for example, is a little Jupyter notebook which shows gene models, which are impossible to create as a browser track, for example; it's custom.
So instant tools is the functionality to prevent this. This is a part of the coordinate workflow; it has lots of boxes. One of the criticisms, for example, from state-based communities, is: well, if I want to cut, paste, and groom, and then something else, with a whole bunch of these tools, it becomes difficult, it becomes unwieldy, it becomes hard to debug. So instead of doing this, it is probably more efficient to eliminate all of this. And so we do have this new interface for interactive environments which actually allows you to do that: create a custom tool, define inputs, and eliminate that whole collection of boxes. Again, this will be made robust.
On visualizations, we heard a talk from Sam on this. This is more of a constrained way of visualizing the data, but as you know, Galaxy has a large library of visualizations, and again, polishing that is one of the goals going forward.
So, as I think we all know, compute is only half the story, especially in biomedical research. We also have to be very mindful of the data, which often dictates the type of analysis we do, and also some of the limitations and considerations that have to be taken into account. At the top of this is the scale of data, which is growing. I'm sure all of you have seen some version of this plot before: we're up to 67 petabytes in the SRA, and that's probably, I don't know, a tenth of the genomics data that has been sequenced in the world. So the scale of this is quite substantial and growing every day, and
Galaxy has to grow and has to be positioned to take advantage of these very large datasets that are now being generated in genomics and far beyond. As we heard, I guess it was yesterday, from John Chilton (I'm sure he's some place here), there's been a lot of work under the hood to support these sorts of very large datasets. Rather than having step one be ingesting these huge datasets onto some head node, we can have deferred and remote execution on the data. We can also imagine different tiered storage classes, where maybe you would have very fast but transitory scratch space, but then the final outputs can be saved away to the high-performance storage that will be a lot more robust, a lot more stable, over an entire project. In addition to the work inside of a single Galaxy, through the Pulsar network we're actually standing up servers around the world that can all communicate with each other and enable this sort of remote execution. So instead of having to move data from one continent to another, it's just that much easier to move compute to where the data actually reside. And then finally, I'm really, really excited about being able to bring your own storage into Galaxy, so that we can expand the scale as the project succeeds.
So on training and outreach, there's been quite a lot of activity as well. We've made huge progress towards a global hub where all the different Galaxies around the world can share infrastructure, so that we can have a harmonized presentation. It's also going to save us from having to reproduce that infrastructure over and over again. It'll just really simplify the platform and make it that much easier to share.
In terms of the training network, there's tons of activity going on all the time, but of course, because of COVID, this is one of the few in-person events that has happened in the last few years. Nevertheless, the network has persisted. There have been new tutorials, new trainings, new webinars, all kinds of events, all about trying to share this information and empower users. If I had to point to one example of how incredibly successful this effort has been, it's the Smorgasbord, which had about 2,500 registrants around the world participating, all learning how to use Galaxy, many of them using it for the very first time. Incredibly exciting.
So finally, applications. This is what I really got up to speak about. For the last year we have been working on a genome project; the idea here is to basically assemble everything, because all the genomes are very interesting. And recently we had a kind of idea about that, because if you look carefully at these screenshots, that's 2007, so this is over the last 30 years in Galaxy. We had this idea that every month we would publish, because we needed to make it beautiful, a picture of an unsequenced animal. In 2007 that was safe, because nobody was sequencing those in 2007, certainly not flying foxes. And at the end of this we actually had this calendar, and some of you might actually have it; we sent it to some of the super users. I know we printed maybe 50 calendars and sent them out, so this is probably a rare item. But it was cool, and it was safe, because we didn't expect these things to be sequenced.
So when VGP was announced, it was obvious: we need to use Galaxy for this. And the one class that Dolphin was talking about, this is a partial list of animals that were assembled in the past year. You can see it covers everything, including the mammals here, and I think one of the genomes is highlighted; it's about, you know, four gigs. So, you know, human is boring, and these genomes are really fascinating things once you start comparing them: comparing their sex chromosomes, comparing the mitochondria, comparing different gene arrangements, and things like that.
So I know Anton just said the human genome was boring, but I think it's boring but necessary, and there's also been a lot of activity in Galaxy in the last few years to take us in entirely new directions, especially for human genetics. Probably the top of the list, in my view, is, through the NHGRI, the Analysis, Visualization, and Informatics Lab-space. I didn't name it, but it's called the AnVIL, and this is a federated platform designed to support biomedical research for all of NHGRI. It's a federated system, so it's composed of many systems talking to each other, and front and center right there is Galaxy. That's one of our main portals for tapping into this tremendous resource.
So currently inside of the AnVIL we have something like 600,000 human genomes that are indexed and available, ready to be looked at, and it's been a really awesome experience to be able to have this view into human genetics for the first time. This is far beyond any of the previous projects I've ever worked on. Now, you might ask, why another deployment of Galaxy? Well, Galaxy brings a lot to AnVIL users. It's a functionally identical instance of Galaxy, so it has all the features for accessible, reproducible, and integrative science, with thousands and thousands of tools, and we're going to be able to tap into this very large community through the training network. So there are lots of obvious benefits to AnVIL in having Galaxy be a part of it. I also wanted to highlight that being in AnVIL brings a lot of advantages for Galaxy users as well. At the top of the list, we get access to hundreds of thousands of datasets. It's in a FedRAMP-certified system, so it's set up and secured so that you can actually do this work with protected datasets. We can avoid data downloads, so that we can just work with the data right away. There are no fixed quotas, so you're not limited to just a few hundred gigabytes or a few terabytes; if your analysis demands it, you can rapidly scale up to many terabytes or petabytes as needed. Occasionally you may need to go in and change the tooling or change some of the parameters, so in AnVIL you get to be your own administrator, and you're actually empowered to do that. And ultimately it's all about connecting data in novel ways and being able to make new discoveries. So this is getting set up, it's starting to really work, and I'm really excited about this. In the future, hopefully maybe next year, we can really see some exciting scientific results come out of that. In parallel to the work focused generally on human genetics,
there's been a real concentrated effort as well in cancer genomics; we heard a little bit about that earlier today. It's all about connecting cancer tools and datasets at the national level. Through the Informatics Technology for Cancer Research program, the ITCR, as well as the Human Tumor Atlas Network, the HTAN, Galaxy is really well poised now to be able to access huge amounts of genomics data, imaging data, and other cancer-related data types. What's the reason for that? Well, it's obvious, right? We have all the tools, we have this great training network, we have the infrastructure set up to be able to support this analysis, and Galaxy is really central to all this work to make really big discoveries about the underlying causes, and hopefully some of the treatments and improvements that can be made for individual patients. As an example of what can be done, here's a listing of some of the key workflows that are now available for clinical and research use to really empower precision cancer medicine. We can look at pre-biopsy, pre-treatment, and post-treatment, see the changes that are there, look for recurrences, especially of metastatic cancers, and then hopefully be able to guide the treatments that are possible. I find this work to be incredibly meaningful, where we can actually help real patients who are afflicted by some of the most horrific diseases, and we can actually provide really strong support for them. In addition to the primary analysis, looking at the sequences and looking at the images in a raw way, as you heard earlier today from Germany and others, there's also very sophisticated technology to do machine learning inside of Galaxy, so you can tease out those really subtle patterns. We can look very broadly: even when we don't know which features we should be paying attention to, with
the machine learning we can automatically discover them, and again hopefully be able to have new insights and better treatments, and really support medical care.
So here is a slide on COVID. Everybody is sick of it, I understand that, but you might know that we've done a lot of work with COVID analysis in general. This is essentially a variant analysis, but COVID was very illuminating in showing that variant analysis can actually be very challenging, and we learned a lot of things from that. This was a collaborative, worldwide effort, also with help from the South African team and also from NAD. There were a lot of things learned, for example how different amplification schemes affect what you actually see as variants, and so on. And we have, again, a set of high-quality workflows. These workflows are available from the IWC, and the plan for this effort is to make them more generic, meaning making them applicable to any kind of pathogen. Of course, that would require a lot of tweaks. These workflows, for example, work at multiple ports now; we have efforts related to AVM and obviously HAD and many others. So this is a continued effort to make a high-quality set of variant-calling workflows for non-diploid, for microbial pathogens.
And on the same note, there's this microGalaxy effort. It's a big group of different labs. In the previous talk we heard about the unsustainability of bioinformatics training and how a lot of bioinformatics comes from large projects. This is absolutely true, and it's felt most acutely in the microbial world, because in the microbial world projects are so small, in terms of how much money each recipient gets, that it's simply unjustifiable to actually hire people who know how to
do analysis. And so it's like the Wild West, or prehistoric Eurasia: you have lots of little demes, they don't talk to each other, they all do things, these things are very often completely crazy, and there is no progress as a whole. So I don't know if we can get demes to talk to each other, but at least we want to be able to provide some free platform where you can actually go and try these workflows. And these workflows, again, will be IWC workflows; they will be vetted, they will be high quality, and they will embrace lots of different analyses, from metagenomics to strain comparisons. One tool that I want to mention is our collaboration with NCBI Datasets, because NCBI now has a very good way to get annotated sequences, in particular for viral fields, and this is now a data source in Galaxy. This is an excellent resource, and so we're hoping to make this a fundamental part of microGalaxy.
And so, finally, another misconception that we frequently hear is that Galaxy is just for bio. Well, first of all, I don't think that's bad, but second, there's growing evidence that it's not true. I mean, if you look at the altitude, there's nothing biological there, certainly apart from some historical datasets and genome builds, but other than that it's not really biological. There's the climate effort, for example, and this week there's this neutron scattering poster from Oak Ridge. This is amazing. And so, again, we really want to try to reach other fields that struggle with the same problems. Actually, one of the interesting revelations of the last year was that Björn co-initiated a collaboration with CERN, which is of course physics. Once we had our first meeting, the CERN people, who are physicists, got to know these
biologists, with butterflies and flowers, and suddenly (well, the Oak Ridge people always realized it right away) they realized: oh, first, it looks like they have data, and it looks like they have the same problems, and it looks like they develop pretty cool stuff, so maybe we should do more. And I think you're just continuing that. This is all along the same lines: Galaxy is super for any data science. It's not just genetics, not just life sciences; it's everything which has some data and tools and so on. And again, it's all about analytical equality, being able to give people a platform to analyze their data.
Just to say that we're just getting started. I know we're in our 15th year, but there's, I don't know, 15, 50, 500 years ahead of us. There's just so much activity going on. By far the biggest challenge, as we just said, is being aware of it. So that's why I love this meeting. We're all here. I've just been thrilled by all the conversation. Again, thank you so much for welcoming me, and thank you so much for being here in person. It is just an incredibly uplifting experience to be here, so thank you all so much.
You have the rare opportunity of leaders standing in front of you, so remember, the PIs don't know anything. I'll give you a good answer, though. Any question, I'll give you a reach. Oh, we did actually make a bicycle. Okay, well, with that, let's, thank you.