Great, thanks for having us. All right, let me share my screen. I want to have a little discussion about this AnVIL project, to introduce it and talk about where it is now and where it's moving in the future. It's mostly about AnVIL, but some of the themes are much broader than AnVIL: there are some emerging technologies I think everyone should be aware of, and I want to open a discussion about where they're moving. I also want to credit Enis; he's not on the call right now, but he created more than half the slides here, so he helped me out enormously. And behind the scenes there's been a huge team of people focused on this, and I'm really grateful to everyone who's contributed. So, to open with a very broad introduction: AnVIL is an analysis platform sponsored by NHGRI. For those of you who aren't familiar with the politics: I'm sure everyone's heard of the NIH, the National Institutes of Health. It's organized into 27 institutes and centers, with a lot of independence between the different centers. There's one for cancer; there's one for heart, lung, and blood; there's one for kidney diseases; one for psychiatric diseases. They're organized into themes centered on a disease, an organ system, or some other major component like that. A relatively new institute is NHGRI, the National Human Genome Research Institute. It was initially formed with a singular vision: to organize and launch the Human Genome Project. Over time it's morphed considerably. It started with just establishing the reference, and now it's much broader. Here's a picture of its research areas.
In my mind, it's the molecular biology branch of NIH. They've been really innovative in new sequencing technologies for genomes, but also in everything omics: gene expression, protein binding, anything at the molecular level. It's increasingly thinking about single-cell analysis, and about what other data types and modalities should augment omics to make sense of what's happening in development and disease at this very fundamental molecular level. In terms of scope, their annual budget is something like $600 million a year. So it's not nearly as big as, say, NCI, the National Cancer Institute, but it's still a very sizable institute. Historically, NHGRI has funded Galaxy, starting with the initial R01s, and more recently the center grants have been funded through NHGRI. So I consider them a great friend and benefactor to the Galaxy projects; there's a lot of goodwill there. NHGRI initially started with basically one genome, the initial reference genome, but since then its mission has grown enormously in scope. You may have heard of very large consortia like GTEx, the 1000 Genomes Project, the Centers for Mendelian Genomics, or the Centers for Common Disease Genomics. I don't have a precise number, but something on the order of hundreds of thousands of genomes a year are sequenced through a variety of NHGRI projects. Today it's a real challenge: huge amounts of data are being generated, many petabytes per year, but those data get locked away into various silos. The number one silo is the SRA and dbGaP, but that's mostly for published data.
Pre-publication data gets siloed away at different institutional computing centers: there's one at Hopkins, one at Broad, one at Wash U, one at Yale. Major research institutes have their own data centers where these data get locked away. So about five years ago, NHGRI had what I thought was a good idea: figure out solutions where data doesn't get locked away in these silos. We want maximal exposure and maximal scientific impact from these data. Ultimately, these are taxpayer dollars funding these projects; we want as much bang for the buck, as much science as possible, from these data. The idea is, rather than having these huge silos where it's expensive and time-consuming to copy data around, what if we flipped it around? Have a centralized resource where these data are all co-located, and then users remotely connect to it through the cloud to access the data in a standardized, harmonized way. Some of these data are open access, like the 1000 Genomes Project; you can freely post them on the internet and download them. But the vast majority are controlled-access data sets. Individual patients being treated for different diseases have consented to release their genome data and other measurements for research purposes, but they don't want them just blasted over the internet; they want a lot of safeguards put into place so that only those approved to access those data can do so. That's the vast majority of those hundreds of thousands of genomes per year. So there's a centralized resource, but a security perimeter has been established to add all kinds of protection.
So there's encryption at rest, encryption between services, auditing of who's accessing what, firewalls, and intrusion detection systems constantly monitoring the perimeter to make sure that only authorized people can get at these data. I know that in Galaxy there's a great tradition of open access and open sharing, and of course when we can, that's what we move toward. But there are these really important data sets that, for very good reasons, require these levels of protection. So that's the initial motivation. To put some numbers behind this: AnVIL was launched through two major awards about four years ago, one to Hopkins and one to Broad, with a bunch of subcontracts. It's taken a few years, but it's really working now, in the sense that there's a platform, and it all starts with the data. We're trying to aggregate these huge numbers of data sets together. Currently we're approaching five petabytes of data; about 600,000 genomes have been ingested into AnVIL. Again, a lot of this starts with these mega consortia looking at common diseases and rare diseases, and there's a set of consortia that have been prioritized for the future. Right now, I don't know the exact number, but I'd guess we capture certainly more than 10% of the data from NHGRI, maybe as much as 50%. I think the vision is eventually, in the next few years, to have essentially all of the data generated through NHGRI ingested into AnVIL. We'll probably never get 100%, because there's a long tail of projects: some are very big, some are small, some look at different types of data.
We'll probably never get every single project, but the goal is to get a huge fraction organized into a central platform where researchers can look at it. Mike, can I ask some questions, or do you want to do it at the end? Yeah, please jump in. So is this data directly deposited as it's produced, or is it more that your lab does something, and then when you want to publish, you scramble and put it on there? Yeah, I think this is something that's really special about AnVIL: a lot of these data are being loaded as they're being generated, pre-publication. A lot of these mega consortia are using AnVIL, where you get 100,000 genomes sequenced, you need to do a mega analysis of them, so let's get it cloud-first. I'm with you, though, Marius: there is this great tradition of just scrambling at the very end to upload data, and we're trying to bridge that. Maybe that tradition won't entirely end, but at least let's get the public data sets in early. Yeah. But I think it really is changing; we're breaking out of that old tradition of only uploading at the very, very end of the process. So that's what's going on inside AnVIL and NHGRI. More broadly, this trend is emerging, not yet in all 27 institutes and centers of NIH, but in several of the major ones. NCI, the Cancer Institute, has established the Cancer Research Data Commons; Heart, Lung, and Blood has established BioData Catalyst; other institutes are joining this effort to have cloud platforms to organize and analyze these data. From NIH's point of view, it's a way to get the most bang for their buck.
If I'm a little bit cynical, I think what they're looking forward to in the future is actually cost savings. Right now a lot of institutions, including Hopkins, apply for so-called instrumentation awards, where we can buy, say, a million dollars of computing. Doing that at every single university in the United States starts to get expensive. So instead of distributing and building all these separate data centers, let's effectively have one mega data center that all NIH researchers can use. These efforts are being organized through the NCPI, the NIH Cloud Platform Interoperability effort. Today we have buy-in from a few key platforms, but my full expectation is that this is going to grow and grow over time. In aggregate, and this is a little out of date, there are more than 11 petabytes and approaching a million genomes available through these different platforms, and AnVIL will be leading the way here with the most data available. It's a testament to the types of data being organized now and into the future. So we have huge amounts of data, all organized in the cloud. A lot of it is on GCP, a lot is on AWS, and in the near future a lot will be on Azure: commercial cloud platforms. We're talking about tens of petabytes of data; this is a substantial data footprint. They wanted best in class: the security, the distribution network, all those capabilities. But data at rest is kind of boring, in my opinion. It's necessary; it's essential to have databases.
But to me, what gets exciting is when you have an active platform where you can take these ginormous data sets and actually do interesting analysis on top of them. So AnVIL is, yes, a data repository of sorts, but the exciting part is that in addition to these huge data sets, there's also a variety of tools available to make sense of them and look at them in new and interesting ways. Here's a really high-level overview of some of the capabilities that are present: programmatic access in different forms, visualization in different forms, user interfaces into the data specifically, and then downstream analyses. In the interest of time, I want to spend a minute reviewing some of the key analysis platforms. This will be a very superficial highlights tour of some things that are possible and some things on the horizon. On the left here we have Jupyter, RStudio, and Bioconductor. These are very familiar technologies to everyone on this call, so I'm not going to spend much time on them. They serve an important purpose: if you want to do really customized analysis or really customized visualization, you need to write some code, and these are great platforms. They have some trade-offs in what you can do; they're kind of necessary but not sufficient, but they're fully featured and working in a very robust way. Another key component is the Workflow Description Language, or WDL. This is the primary technology that's been adopted by some of these mega consortia. If you have a project with thousands of genomes, if I'm honest with myself, this would be the preferred technology to use.
I'll say a bit more about it in a minute, but it's extremely scalable and extremely flexible. It's very technically demanding, in the sense that a lot of pretty low-level scripting needs to take place to use it, and there are a lot of cost considerations that come with it. But at the same time, if you have a mega project, it's a very effective technology. In addition to that, Galaxy is a first-class citizen in this universe; I'll have a lot more to say about that in just a minute. To give you a sense of what's possible with WDL, I want to spend five minutes talking about a recent project I was involved in through the Telomere-to-Telomere (T2T) Consortium. This is the project that's been getting a huge amount of buzz in the last few weeks. In this project, as I'm sure everyone is aware, there is this reference human genome. The first version was established some 21 years ago, first published in 2001, and that reference has been iteratively improved over the years. But the big challenge is that it is still missing a huge amount of sequence. If you open up the reference human genome today and look at chromosome one, instead of beginning ACGT, ACGT, it actually begins NNNNNNNNNNNN. There are millions of nucleotides just at the beginning of chromosome one that are unresolved. We can look under a microscope and use other technologies to see how big the chromosomes ought to be, and we know how much has been assembled, so we know there are huge chunks missing. The ends of chromosomes are the so-called telomeric regions; the middles are the so-called centromeric regions. Just because of the way they're organized, historically they've been too complicated to make sense of.
The sequencing technology just did not exist to get in there, but the amazing thing is that the sequencing technology now does exist, and we can look inside those regions and actually resolve them for the very first time. So we recently published this assembly. It's a special sample called CHM13, where we've been able to assemble every single chromosome from telomere to telomere, from one tip of the chromosome to the other. That brought in about 200 megabases of additional sequence, and it also fixed megabases of errors that were detected in the old reference genome, GRCh38. So now we had this new assembly, and one of the key things we wanted to do was look at how it would impact our analysis and our understanding of human genetic variation. The idea is: we have this new assembly, that's great, but we have an old assembly and a new assembly, and we want to do an apples-to-apples comparison. When you look at other data sets, other human genomes, what changes about our interpretation using one versus the other? So we put together a team of analysts; there were something like 50 people on just this one team. We looked at this question using long-read sequencing and short-read sequencing, looking genome-wide and in clinically relevant regions, asking what changes by using this new reference genome. The data set we ended up looking at is a very famous one from population genetics: the 1000 Genomes Project. The name is a bit of a misnomer, in the sense that today it's actually 3,202 samples; it's called the 1000 Genomes Project for legacy purposes. At the time, a thousand genomes was a very aspirational goal, and now it's been far exceeded. So it's more than 3,000 samples, with representation from five major continental populations.
Within those continental groups there are 26 individual populations; that's what these different stars are highlighting. So it's a nice collection: we get representation from European, African, Asian, and admixed American populations. We get a nice initial view of human diversity through this collection. Another nice attribute is that all of these 3,000 people have consented to open release of their data. You can just go to the website and download their data; it doesn't need to be done in a protected way. That also means any results we have can be freely shared with the world, without setting up barriers that limit access to these data. So it's a great collection, and it was very recently resequenced by the New York Genome Center, so it's fresh, high-quality data. The one technical challenge is the size: even compressed, it's almost 100 terabytes of input data. We knew we wanted to look at this, but it's a pretty sizable collection. We actually started the analysis at Hopkins and realized it was going to take more than a year of compute, so we needed to do this in a more efficient, more effective way. What we ended up doing was developing a workflow that could process these data in a pretty rapid fashion. In the grand scheme of things, it's actually not that complicated: a few major phases to organize and do some preprocessing, and a few phases to align the reads and get them organized. Not shown here is that once we have the alignments organized, we can do downstream variant calling using GATK.
The idea was to follow the protocol established by the New York Genome Center so we could get an apples-to-apples comparison: if you change only the reference genome, what changes in our interpretation of all these sequencing data? I really want to credit my student, Samantha Zarate, who wrote all the WDL workflows for doing this. If you haven't seen it before, WDL is a pretty simple concept, at least from the user's perspective. It's a workflow language, organized into a series of tasks. These tasks can be linear; you can also do branching, aggregation, and some conditionals. But by and large, it's a series of tasks chained together, where the output of one task becomes the input to the next. So it's super simple. The way I think of it is as a turbocharged bash script. This is one of the tasks for aligning the reads. There's a bunch of input data that describes what the read files are and what reference genome we're going to align to. There's a series of shell commands that get executed; this is the fancy bash component, where anything you can express in bash gets executed. There's a runtime section that says which Docker container to use, with the Unix tools preloaded into it, and how much RAM and how many cores we want. And then there's an explicit listing of which output files we actually want to keep. In a lot of these workflows, you're going to create a gazillion intermediate files that are not that interesting, so we decorate the key steps and the key outputs that we actually want to save away, and those get recorded. So: a fancy bash script.
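To make the chaining idea concrete, here is a toy sketch in Python (not WDL syntax; the task names and commands are invented for illustration) of tasks whose declared outputs feed the next task, while undeclared intermediates are dropped:

```python
# Toy sketch of the WDL task-chaining idea: each task declares a command
# and which outputs to keep; the kept outputs of one task become the
# inputs of the next. Task names and commands here are made up.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    command: callable                               # stands in for the bash block
    outputs: list = field(default_factory=list)     # names of files to keep

def run_pipeline(tasks, data):
    """Chain tasks: only declared outputs survive each step."""
    for task in tasks:
        data = task.command(data)
        data = {k: v for k, v in data.items() if k in task.outputs}
    return data

# Hypothetical stand-ins for the preprocess -> align steps.
preprocess = Task("preprocess",
                  lambda d: {**d, "trimmed": d["reads"].lower()},
                  outputs=["trimmed"])
align = Task("align",
             lambda d: {"bam": f"aligned({d['trimmed']})", "tmp": "scratch"},
             outputs=["bam"])     # "tmp" is an intermediate and gets dropped

result = run_pipeline([preprocess, align], {"reads": "READS.FASTQ"})
print(result)  # {'bam': 'aligned(reads.fastq)'}
```

The point of the `outputs` list mirrors WDL's output section: everything not declared there is treated as scratch and never leaves the task.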
The key to it is the execution engine, called Cromwell, which behind the scenes orchestrates a whole cluster: it boots up virtual machines of the specified type, loads the necessary Docker containers, stages data out of (in this case) Google buckets onto local disk, runs the bash processing, and then writes the decorated output files back to the buckets. From a user perspective it's super simple; behind the scenes there's some pretty nice engineering to orchestrate this at scale and connect these different parts together. The computing we did was quite substantial. This is a snapshot of the Google Cloud console, where we were routinely using 10 or 12 thousand cores for a few weeks at a time. So we were able to tap into a few million core-hours of computing over a few weeks. It did cost money: the computing part of this was about $50,000. So it was a pretty substantial compute, but the scientific outcomes have been enormous, and I sleep very easy at night; I think this was a very good use of resources to support these activities. I won't go through all the details, but very quickly, some of the scientific outcomes. Using this new genome, which resolved about 200 megabases of additional sequence, we find about a million additional variants in the population. So there are all kinds of new opportunities for discovery about how those variants impact our health, how they impact disease susceptibility, and how they explain differences in traits among populations. In addition to finding a substantial amount of novel variation, we were able to demonstrate that the variants we detect are actually of higher quality: there's a reduction in false negatives and a reduction in false positives, including within clinically relevant genes.
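As a back-of-the-envelope check, the quoted figures hang together; the per-core-hour rate below is an assumed ballpark for preemptible VMs, not a number from the talk:

```python
# Sanity check of the cluster numbers quoted above.
# The $/core-hour rate is an assumed preemptible-VM ballpark,
# not a figure from the talk.
cores = 12_000            # peak cores seen on the Cloud console
weeks = 3                 # "a few weeks"
core_hours = cores * 24 * 7 * weeks
print(f"{core_hours / 1e6:.1f} million core-hours")   # 6.0 million core-hours

assumed_rate = 0.01       # assumed $/core-hour
print(f"~${core_hours * assumed_rate:,.0f}")          # ~$60,480, same order as the ~$50k spent
```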
So when we started this project, we thought the whole paper was going to be about all the new stuff. But a big part of the paper has been about improving parts of the genome that we thought we had correctly assembled before, realizing there were some issues, and then being able to achieve, in some cases, a tenfold reduction in false positives in certain key clinically relevant genes. So basically, in every way we can possibly measure, we see improvements using this new reference genome, and we're very strongly advocating for its adoption, just because it makes everything better. That was all orchestrated through WDL because of the scale, with some downstream processing in R and Python to make a bunch of custom plots and execute custom analyses that had just never been done before. So I'm excited about it. In the grand scheme of things, though, this is a relatively modestly sized project. This was 3,000 genomes; a lot of the mega projects are at the scale of 30,000 or 300,000 genomes. It was the biggest single project I've ever been a part of, but it is by far not one of the largest ongoing in the field right now. So it's exciting, and also terrifying, to think about what the future is going to hold. That takes us back to our main story about where things stand for Galaxy. It's working well, quite robustly. Any of us can log into AnVIL today and, in a matter of a few clicks, bring up an instance of Galaxy that is fully featured and can do a lot of sophisticated analysis. Now, I mentioned early on that one of the key requirements for AnVIL is great mindfulness about security, and one of the original design decisions was how we were going to handle that. We knew we wanted to bring Galaxy into AnVIL; that was a certainty.
One of the big choices was the right mechanism to do so. On the continuum, one extreme would have been to mimic something like Galaxy Main, where there is a more or less permanent instance of Galaxy that is always running and that any user can immediately connect to. At a technical level, that works well; obviously Main works well, with something like 10,000 users a month. But we were concerned about the security policies for a so-called multi-tenant system, multi-tenant meaning many users can simultaneously access the same software stack. Since that time there have been a variety of security audits, including an external review we helped orchestrate, and I think we're in pretty good shape. But through those audits, and through our own reflections, I think there is some work that needs to happen if we're going to confidently have a multi-tenant system that is accessible for protected data sets. I think it could be done. If we want to reach so-called FedRAMP certification, there's a very long checklist of items to address, some of which are procedural: there's a set of requirements about how code changes are committed, how they're reviewed, and how testing is executed. I think we're moving in the right direction, but if I'm honest, it would be a pretty substantial cultural shift if we wanted to pursue that. There are pros and cons to that concept, but in the here and now, it wasn't practical to execute in that scenario, given the timeframe and the level of funding we had available.
As an alternative model, and this is what's available today, instead of a multi-tenant environment where many users connect to one centralized Galaxy, we've moved to a single-tenant model where individual users boot up and manage their own instance of Galaxy. This way there's isolation: if I'm using Galaxy with my data, none of you can see it, but you're each allowed to have your own independent instance of Galaxy. If my Galaxy needs to interact with your Galaxy, you'd have to do an explicit export, which can be done through workspaces, or, in the case of tooling, through, say, Dockstore. There are ways to share information, but it takes a few extra steps, so security really stays central. Here's one layer behind the scenes, and I really credit Enis for putting together a lot of these diagrams. The user-facing component is Terra. There's an API layer called Leonardo (Leo), and through a Leo command a single Helm chart launches a Kubernetes cluster, where you can dial in how many nodes, how many cores, and how much disk you want. That boots up GalaxyKubeMan, which brings up the whole Galaxy stack. And then, like I said, because it's a fresh installation, if you want to work with it you can bring in data from workspaces and bring in tooling from, say, Dockstore on demand, as needed. Just one more diagram: because we're using Kubernetes to orchestrate all of this, we have a lot of flexibility. If you need 100 nodes, no problem; that can be booted up, all executed through one command.
We can get effectively any number of job handlers and any number of resources behind the scenes to do the analysis. A couple more screenshots. This is a screenshot I took last night showing how you launch Galaxy inside AnVIL. You click this button, create a cloud environment, and it brings up a configuration panel where you pick how many nodes, how many CPUs, how much memory, and how much disk. The basic deployment is 53 cents per hour. So it's not free, but for the cost of a cup of coffee you can boot up a fully featured instance of Galaxy with full access to all the components that are there, just as if you were connected to Main or EU, except it's tightly coupled to these data sets, so you get access to do all this amazing analysis. One thing that's really cool to me is that you can just dial in how many nodes you want. If you want your own dedicated cluster with 800 cores or 8,000 cores, it's very accessible and easy to do. It does cost money, but again, for scientific projects where we're going to be making great discoveries, I think spending some money is probably a good use of resources. Obviously we don't want to be wasteful; if we can minimize the spend, that's something we want to do, and we want to be aware of those costs. But sometimes it's definitely an appropriate use of resources. Once it's booted, you'll get a very familiar version of Galaxy. It's branded for AnVIL, with the color scheme and some of the logos, but it's exactly the same code base executing there. So everything you could do inside any other instance of Galaxy is available.
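To put the quoted rates in perspective ($0.53 per hour for the basic running deployment, and roughly a penny per hour for the paused residual state), a quick cost sketch:

```python
# Rough cost arithmetic for the basic AnVIL Galaxy deployment,
# using the rates quoted in the talk: $0.53/hour while running
# and about $0.01/hour while paused in the residual state.
RUNNING = 0.53   # $/hour, basic deployment
PAUSED = 0.01    # $/hour, residual state

def cost(hours_running, hours_paused=0):
    return hours_running * RUNNING + hours_paused * PAUSED

print(f"8-hour workday:          ${cost(8):.2f}")           # $4.24
print(f"Two-week analysis:       ${cost(14 * 24):.2f}")     # $178.08
print(f"Workday + month paused:  ${cost(8, 30 * 24):.2f}")  # $11.44
```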
One really special part, and I really credit Luke and others for building this: this is running in a Kubernetes cluster inside Google, and one of the main things you want to do is look at the data you have available in your workspace. This could be all of the 1000 Genomes data, or all of the Centers for Mendelian Genomics genomes; effectively an unlimited amount of data you might be interested in. You can go in and just browse that, right through the Galaxy UI. Once you've identified the data sets you're interested in, in a few clicks you can ingest those data out of the Google buckets into the Kubernetes cluster where Galaxy is really running. And then, like I said, all the familiar features are available today: tools and workflows, import and export, histories, data. It's working really well, in a really robust way, and again, I credit a ton of people who have contributed to making this stack as awesome as it is today. There are also nice features like auto-pause, and once you're done with your analysis you can turn the cluster off entirely: put it back into a residual state that costs about one penny per hour, or delete it entirely if you're completely done. So I think AnVIL, and specifically Galaxy on AnVIL, is a really exciting platform. We get secure computing with direct access to huge numbers of data sets, growing every day. We can use Galaxy without any fixed quotas, and there's no competition for queues, because it's your own private instance of Galaxy. We can connect data in novel ways, and we're going to make scientific discoveries from that. Just last week, NHGRI Council approved the renewal of AnVIL for at least five more years.
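For a sense of what that bucket-to-cluster ingest amounts to mechanically, here is a minimal sketch; the bucket name and paths are hypothetical, and it assumes the standard `gsutil cp` CLI rather than AnVIL's actual ingest code:

```python
# Sketch of "ingesting data out of a Google bucket": copying objects
# addressed by gs:// URIs onto the cluster's local storage.
# The bucket and file names below are hypothetical examples.
import shlex

def stage_command(gs_uri: str, dest_dir: str) -> str:
    """Build a gsutil command to copy one object to local disk."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {gs_uri}")
    return f"gsutil cp {shlex.quote(gs_uri)} {shlex.quote(dest_dir)}/"

cmd = stage_command("gs://example-workspace-bucket/sample1.cram", "/galaxy/data")
print(cmd)  # gsutil cp gs://example-workspace-bucket/sample1.cram /galaxy/data/
```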
I kind of think this is going to be the sort of project that, in a hundred years or a thousand years, may not be called Anvil anymore, but in my mind this is clearly the right thing to do from a government point of view. So I think there's going to be continued interest in building out these resources and making them even more powerful, even more capable, in the years to come. Okay, so that's the exciting part. A couple of reflections. One mega-reflection is that historically Galaxy has really led the way in making tools accessible and making science reproducible, but there is effectively fresh competition. Through Terra, through Dockstore, there are workflows that can be executed in a very reproducible way. They are getting hardened, and at a high level they have some of the same virtues that we've been highlighting in Galaxy for more than a decade. So how do we maintain our competitive edge? In the here and now, in terms of accessibility, in terms of ease of use, Galaxy wins by a landslide. On those platforms it takes a thousand clicks; it's very complicated to write a WDL, and it's very cumbersome to modify one or do anything interesting with it. But nevertheless, it can be done, and for users that are comfortable with batch scripting, we have to be really mindful of that competition and how we maintain our competitive edge. The other mega-reflection is how we optimize the Galaxy on Anvil experience. I think this cloud platform is here to stay for many years to come. WDLs scale ridiculously well because they leverage all the cloud technologies around buckets and cloud-native APIs, and behind the scenes there's a lot of cool stuff that gets executed.
How can we tap into that amazing capability to make Galaxy all the more successful, all the more powerful? So, a couple of wishlist items, and I think basically all of these are in progress right now, but this is my top-of-the-line wishlist. Number one: for that TDT project, step one of the analysis is moving around 100 terabytes of data. I'm excited to hear that remote data capabilities are on the near horizon; that will immediately be transformative to the experience. Coupled with this, now that we're talking about projects at the hundreds-of-terabytes to petabyte scale, I think we've got to be really mindful about which data are kept. Can we leverage buckets instead? Clearly you're not going to get a five-petabyte SSD, so leveraging these cloud object stores is, in the long term, going to be really essential. Then there are things like optimizing the launch. In the here and now it takes a few minutes, which, for an analysis that's going to run for two weeks, doesn't really matter. But if we're trying to really harmonize the user experience, is there some way we can bring something up in a matter of seconds, even if it's a very minimal UI, just some way to give people visibility very quickly so they can see what's going on? And then in terms of tooling and workflows: hardening the existing suite and adding more tools for human genetics. I'm very excited to hear that things like user-installable tools are on the horizon. Sometimes when you're in the middle of a project at 2am, I don't necessarily want to write a whole wrapper for a tool. I just want to get my analysis done at 2am.
So that sort of capability, I think, will also make the platform a lot more attractive to a broader range of analysts. I love this. It's amazing. Thanks for stating this, because that was not at all a unanimous view. I mean, I get it: the danger is that suddenly everyone's going to write one-offs that are never shared. But I think we have to be flexible and open-minded about this. Maybe not every single tool needs to end up with a wrapper. Maybe sometimes we can acknowledge that a tool is only going to be run once. If we want to make this a platform for everyone, including pretty sophisticated analysts, I think having this capability is going to be really transformative. Yes, the moment I started working on Galaxy and talked to other people about it, they were like, "but I have to be admin, right?" And there are workarounds for that, which is not a problem on Anvil, but just in general, it should be possible to plug in your own script. I agree. It's one thing once the workflow is developed and you just want to run it a thousand times. But the way these projects often go, there's an initial phase that's very exploratory, where you're really not sure what exactly you want to do or what it's going to look like. There we just want to be super nimble to support that sort of analysis. I'm wondering if others have any thoughts or comments? This is just my personal impression. Like I said, Galaxy is working now; it's amazing; it's super solid. My goal here is just to plant the seed that we should consider this in a really deep way and ask what opportunities are there to make it even better for the future. Yeah.
If I can say something: when you talked about the reanalysis of the re-sequenced 1000 Genomes Project, you've sort of answered this question already, but I was wondering, why not take the extra step and do it in Galaxy? Of course you mentioned some of the limitations, but I also think we need these big projects to be in Galaxy. I think a lot of improvements are coming out of the VGP project, but those aren't the same sort of considerations, right? For this project, a lot of the focus is on efficiency, efficiency of data transfer, and it's relatively removed from the core team to do an analysis like this. So if we said, okay, we want to be able to redo this and maybe be comparable in terms of execution time, I don't think that's impossible, but right now it needs more knowledge than we could easily get started with. So I guess my question is: what do you think we need to do so that this becomes an option for people that right now would prefer writing WDL? I totally love the spirit of what you're saying. And I should comment: the genomes just published will be the first of many new reference genomes that are put out. Through the Human Pangenome Reference Consortium, there are something like 350 additional reference genomes being created, and that's just NHGRI. I anticipate in the near future there are going to be, I don't know, thousands of human reference genomes and beyond. And you're right: I would love, in say a year's time, to be able to do this whole analysis entirely in Galaxy. I'd be open to whether that was Galaxy on Anvil or Galaxy on Main; it's open-access data, so some of the usual privacy considerations aren't even a factor here anymore.
The reason we picked WDLs was kind of this number right here: intermediate data was something like five petabytes. It's a very interesting thing you mention, and we didn't even put it on the roadmap. It's obviously something we've ignored for a long time because nobody really said, "but we really, really need this," right? And I think this is actually, well, I shouldn't say this, but this is as valuable as keeping the initial import step remote, so that the 100 terabytes of input data doesn't need to appear in Galaxy's object store. I think it's just as important that intermediate files you don't need don't need to be transferred back. Yeah, clearly some of them need to be transferred back, but we probably don't need to collect every single intermediate. At least not forever. Genomics has been an evolving field; there are new technologies all the time, new questions, new types of questions. So I think this is just the progression of the field, and I think it can motivate us in Galaxy to adapt to the changing landscape. I'm wondering if anyone else has any impressions? Like I said, I know many of these things are already in flight, so that's awesome. I think we're already pointing in the right direction, and anything we can do to go even further, because this is just the very beginning. At this point, I think it's also very good to just have concrete use cases that drive development, because sometimes you're working in a bit of a vacuum and you don't really know: if I'm going to spend the next six months of my life doing this, will it be worth it? Or is it something that's just nice to have? That's a great idea.
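The intermediate-data point above can be made concrete with a toy example: in a workflow DAG, anything that is not a declared output never needs to leave the cluster's storage once downstream steps have consumed it. The step names and sizes below are invented for illustration and are not the real 1000 Genomes pipeline figures.

```python
# Toy illustration of why pruning intermediates matters. Step names
# and per-step sizes are invented, loosely mimicking an alignment /
# variant-calling pipeline where only the joint VCF is a real output.
workflow = {
    "align": {"inputs": ["reads"], "size_tb": 90},   # intermediate BAMs
    "call":  {"inputs": ["align"], "size_tb": 12},   # per-sample gVCFs
    "joint": {"inputs": ["call"],  "size_tb": 1},    # final joint VCF
}
declared_outputs = {"joint"}


def prunable(workflow, declared_outputs):
    """Step outputs safe to discard once downstream steps have consumed them."""
    return {name for name in workflow if name not in declared_outputs}


saved = sum(workflow[n]["size_tb"] for n in prunable(workflow, declared_outputs))
print(f"{saved} TB never transferred back")  # prints "102 TB never transferred back"
```

This is exactly the behavior WDL engines get almost for free from bucket-backed execution, and it is the gap being discussed here: only the declared outputs should ever need to land in Galaxy's object store.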
One natural thing we could do is formalize this as a use case, and it's highly scalable. We did 3,202 genomes, but there would have been value in doing 30 genomes or 300 genomes. So I think we could set up some intermediate goals that would be scientifically valuable, and from those we could look to the future to tie all these pieces together. Has there been any work to reproduce the workflows you used here in Galaxy yet? Would that be interesting at all? That's an interesting question. I think many of the tools are available. Maybe not all of the components of GATK that we used, but I think many of them would have been, yeah. There's also the ongoing effort to support CWL, where I understand some conversion is possible between the two. So that would be another option for workflows that already exist. Again, there was a big discussion about whether it's even worth doing this, because then you end up with parallel pipelines, pipelines defined by somebody else. But this is a production pipeline, and for reproducibility's sake you may want to execute it the way it was written. I agree. We've had subtle bugs introduced when we've converted from, say, Snakemake to other workflow systems. So being able to just run the native, hardened, tested workflow brings a lot of value. In general, there are always going to be workflows that people publish that aren't going to be polished by the IWC. That's kind of the point of the IWC: having those specifically hardened workflows. But that's not to say there's no benefit to having things outside of that. A random thought is that I really like this wishlist.
I'm 100% with Marius that user-installable tools are so important, and they never get a seat at the table in the way I think they should. The same goes for purging intermediates in workflows. Again, I just want to echo what Marius said: we've made a bunch of progress towards keeping the initial data remote, and it's good that we did, but we could run a whole workflow on a node on Anvil and just collect the outputs of the workflow. It feels like there needs to be a pipeline to get these Anvil wishlist items, which are things that would really help Galaxy in general, to the core team or onto the roadmaps. I just wanted to highlight those two particular things, which are obviously big, important, exciting developments that should happen, and presumably could happen; we could have both of them done in the next year if they became priorities. So I just wanted to echo Marius: I think those should be priorities, and it's really good to see an awesome use case for them. Yeah. I don't know about ongoing projects, and maybe everybody's drained, or maybe I'm just extrapolating from myself; we're pretty busy. But if another thing like this comes up and you say, okay, now we want to scale an analysis of that size or more, and we just want to see how far we can get with Galaxy, I think that's something where an engineer's time is well spent. I love it. Let's do exactly that. Like I said, there are going to be additional genomes coming; through VGP there's equally exciting work going on. So we're perfectly pleased to do this.
And if we can tie analysts, or engineers and analysts, to some of these use cases, I think that could be enormously impactful. Awesome. And I also appreciate that none of these are five-minute jobs. It's the kind of thing where months or years of work are going to be necessary to fully achieve it, but I think it's useful to have a long-term vision of where things are heading. All right. Thank you, Mike. We are over time, but if there are any additional comments or questions, we can take them. Yeah, we should probably wrap. I would say just poke me on Slack or elsewhere if you want to chat more. I'm looking forward to the workgroup meeting in a month's time or so, and then I'm especially looking forward to seeing many of you in person at GCC. Thanks, everybody. All right. Thanks, Mike. Thanks, everybody.