All right, so this is kind of last minute. I've been wanting to bring this to everybody's attention for a while, since we've been working on it for a while, but it never quite got to a good milestone. Then yesterday nobody had signed up for today's slot, so today's the day. The slides are at the link below if you want to follow along; there are a couple of links in there that save you from typing. Again, this is work that's been going on for a while with the group of people on the slide.

We started with the last item on this list, which was getting a better understanding of cloud costs, specifically for the AnVIL project. But as the project developed, we realized there are additional benefits that can make use of this work. If you're not familiar with TPV (Total Perspective Vortex), it's basically a meta-scheduler for Galaxy, a plugin module developed mostly by the Galaxy Australia team, and within TPV there's this notion of a shared database. If you've ever set up your own Galaxy, especially a production one, one of the early steps is to decide how many resources to assign to each tool. That's tedious, especially if you have a lot of tools, but it's also very open ended: as a system administrator, and even as a domain scientist, you may not know what a good value is for how much memory or how many CPUs a tool should be assigned before it goes to the scheduler. So TPV is looking to develop this notion of a shared database that any Galaxy using TPV can simply point to and make use of, and of course override with local settings if it chooses to do so. The goal here was to seed those values with some data-driven decisions.

An extension of that is that this is still a static list. For example, for a popular tool like Bowtie2 we can always assign 16 CPUs and 64 GB of memory, but if you submit a tiny job versus a large job, those values may not be appropriate. So ultimately the goal is to allow Galaxy to set these resource requirements on a per-job basis, given the inputs. Kaivan has been working on a service, called Ask Galaxy, that uses machine learning models and historic usage data to estimate how much memory and how many CPUs a given job ought to be assigned, given its inputs.

And then, back to the original aim of all this work: the AnVIL project is looking to grow adoption over the next few years, and one of the biggest concerns people have about it is how much an analysis is going to cost. That ranges from the exploratory stage, where I'm just running a few jobs and don't know whether they'll cost a dollar or a hundred dollars, all the way to running thousands or tens of thousands of samples and costing tens or hundreds of thousands of dollars. It would be nice to be able to put some of those estimates into proposals when people write them, and of course when they plan budgets. So those are the high-level goals we undertook for this project. The other thing that surfaced as we looked at some of the plots we're going to see together is that all this data is now readily available, and that helps with making data-driven decisions.
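For a sense of what the layering described above looks like in practice, here is a minimal Python sketch, not TPV's actual implementation: shared defaults that a site-local override wins over, plus a toy rule that shrinks the allocation for small inputs. The tool entry, field names, and numbers are all hypothetical (TPV's real configuration is YAML keyed by full Tool Shed tool IDs).

```python
# Minimal sketch of the shared-database idea: community defaults, local
# overrides, and a per-job rule based on input size. Illustrative only.
import yaml

SHARED_DEFAULTS = yaml.safe_load("""
tools:
  bowtie2:          # hypothetical key; real TPV entries use full tool IDs
    cores: 16
    mem_gb: 64
""")

LOCAL_OVERRIDES = yaml.safe_load("""
tools:
  bowtie2:
    cores: 8        # this site caps the tool at 8 cores
""")

def resolve(tool_id: str) -> dict:
    """Start from the shared defaults, then let local settings win per key."""
    spec = dict(SHARED_DEFAULTS["tools"].get(tool_id, {}))
    spec.update(LOCAL_OVERRIDES["tools"].get(tool_id, {}))
    return spec

def size_for_input(spec: dict, input_gb: float) -> dict:
    """Toy per-job rule: tiny inputs get a quarter of the static allocation."""
    if input_gb < 1.0:
        return {"cores": max(1, spec["cores"] // 4),
                "mem_gb": max(1, spec["mem_gb"] // 4)}
    return spec

print(resolve("bowtie2"))                      # {'cores': 8, 'mem_gb': 64}
print(size_for_input(resolve("bowtie2"), 0.2)) # {'cores': 2, 'mem_gb': 16}
```

The toy per-job rule is exactly the small-versus-large-job problem mentioned above; the ML service takes it further by estimating the numbers from historic data instead of a fixed formula.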
Today we only use Galaxy Main, usegalaxy.org, for collecting the historic usage data, and it gives you some insight into the server's usage. We've often reported that we have 250,000 users or that we run 500,000 jobs per month, but this gives a little more insight into the annual variability and into how those 500,000 jobs are divided across tools, so we can gauge tool popularity. We can look at analysis patterns: is RNA-seq being phased out in favor of single-cell tools? Again, these are things we can hopefully make use of, especially when the time comes to do reporting for grants. Secondly, anybody can get access to this dashboard, and I'm going to show it to you all; researchers can poke through it too and see what tools other people are using and what's popular. And if we see a new tool coming up, maybe the community or the training team can look at it: hey, this is emerging, maybe we need training in this area. There are of course a lot more insights that can be pulled out of the data, such as how many jobs are failing, to further help with this, but it's a first step in that direction.

The project's been going on for a number of months. So far, we've extracted and filtered usage data from usegalaxy.org as our primary data source, and we've visualized much of this data in an interactive dashboard. We call it a dashboard somewhat loosely: while it is a dashboard, it's also a platform for continuing to develop these visualizations, so it leaves opportunities for those interested. It also sits somewhat on the sidelines of the galaxyproject.org site, so at some point we may have a more integrated dashboard, but for the time being it's a powerful, not yet super integrated website. Kaivan has used some of the data we're going to look at to develop the Ask Galaxy API and train the models. A number of people have been working on this, and we meet every other Thursday at 11am Eastern; the next scheduled call is a week from today, and anybody is welcome to join.

As examples of the insights we can get from this dashboard: we can look at which tools are popular. In my mind, at least, popularity is a function of how many users are using a tool and how many jobs are being submitted to it, so a combination of those two is the simplest way I'd identify a popular tool. That's one of the plots available; you can look at, in this case, the number of users that have run a given tool. We can also look at usage anomalies: late last year there was a massive spike in the featureCounts tool for a month. It could have been a COVID analysis that was submitted, or it could have been a training; I honestly don't know, I didn't dig into it, but it's something that raises a flag so people can take the next step and dive deeper. And usegalaxy.org specifically is a public resource that is shared.
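As an illustration of the "users plus jobs" notion of popularity described above, here is a small pandas sketch; the column names and the rank-sum scoring are my assumptions, not the dashboard's actual formula.

```python
# Rank tools by combining job volume and distinct-user reach; a tool scores
# high only if it is both heavily run and widely used. Illustrative only.
import pandas as pd

jobs = pd.DataFrame({
    "tool_id": ["fastqc", "fastqc", "bowtie2", "fastqc", "bowtie2", "multiqc"],
    "user_id": [1, 2, 2, 3, 2, 1],
})

stats = jobs.groupby("tool_id").agg(
    n_jobs=("user_id", "size"),      # total jobs submitted for the tool
    n_users=("user_id", "nunique"),  # distinct users who ran it
)
stats["popularity"] = stats["n_jobs"].rank() + stats["n_users"].rank()
print(stats.sort_values("popularity", ascending=False))
```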
So we can look at which tools are using the bulk of the resources, to balance usage of the server or maybe adjust allocations. I didn't put the tool names on this plot, but you can tell that the bottom three tools are consuming as much as, probably, the top 100 or 500; I don't know the exact count on these lines. So looking at those specific tools and making sure we've optimized them is maybe a worthwhile effort.

I wanted to go through the plots that are here and open it up for discussion; if people have questions or are interested in looking at something, we can do that. The dashboard is at the URL on the slide, the cloud-cost usegalaxy-usage notebook on Observable, and Michelle has been the star of this dashboard to date; she developed most of the plots. The data visualized here runs from August 2021 through August 2022, and we've filtered out a couple of tools because they were dominating the data: the data fetch and upload tools were heavily used late last year or early this year, I forget now, and they just dominated everything else.

We have basically two classes of plots. The one at the bottom here is the one we took the most time to develop. It tries to combine multiple axes, or multiple questions, into one, to give you more of an answer rather than just a look at the data. It uses consumed memory as a proxy for resource utilization, because CPU didn't really give us much diversity in this plot, plotted against the number of jobs. Presumably, tools toward the top right are both popular and resource intensive; tools on the left-hand side are resource intensive but not necessarily popular; and tools on the bottom right are popular but maybe not resource intensive. The two red lines indicate averages across all the tools that have been run. We have a number of these resource-intensive, popular tools, and then the vast majority sit in the low-resource-usage region. You can zoom into this plot for a number of clicks; it's not infinite, due to some restrictions on how Observable handles the data and some technical limitations of the plotting library, but I think it's a really cool visualization. You can hover over any ball in the plot, and it gives you the stats, the details such as the number of jobs. So again, the previous slide showed some examples of what we can take out of this data, but if anybody has ideas or questions, we can talk about them now, or you can look at it after the call and run with it.

The other type of plots are these stacked line plots that show two variables plotted together; this one is the number of jobs per month, grouped by tool. Again, they're nicely labeled with tooltips that contain the details for a particular tool in a given month. In this case, not really surprisingly, the largest share belongs to the import tools; again, we filtered out the data fetch tool because it was dominating by an order of magnitude. Then we get FastQC and similar QC tools, and one tool, "advocate," that I'm personally not familiar with; this was the first time it came onto my radar.
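To make the quadrant plot concrete, here's a rough matplotlib sketch of the same idea: total memory consumed versus jobs run per tool on log scales, with average reference lines. The synthetic data and names are stand-ins, not the dashboard's actual data or schema.

```python
# Rebuild the "popularity vs. resource use" quadrant view: each point is a
# tool; red lines mark the across-tool averages. Synthetic data for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_jobs = rng.lognormal(mean=5, sigma=2, size=200)        # jobs run per tool
total_mem_gb = rng.lognormal(mean=4, sigma=2, size=200)  # memory consumed per tool

fig, ax = plt.subplots()
ax.scatter(n_jobs, total_mem_gb, alpha=0.5)
ax.axvline(n_jobs.mean(), color="red")        # average job count across tools
ax.axhline(total_mem_gb.mean(), color="red")  # average memory across tools
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Jobs run")
ax.set_ylabel("Total memory consumed (GB)")
ax.set_title("Top right: popular and resource intensive")
plt.show()
```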
Is there a question? Anyway, that's one way you can use it; the cool thing, again, is that you can filter by dates and by tool popularity, since we visualize everything. By the way, there was some data massaging here: this plot combines all the tool versions, because otherwise there were a lot of broken-up lines and it didn't make a lot of sense, so we bundled all the versions into one tool. Even then it was not really discernible, so Michelle added a slider that lets you look at the 20 tools with the most jobs and then step through ranges; we've limited it to 100 at the moment, so this view is, say, the 80th through 100th most popular tools, a range of 20. So you can dig in a little and get some details. And if you're interested in how a specific tool behaved over the year, you can click it and it does this nice filtering, or dimming, of the rest, so you can see whether its usage is steady and continuous or fluctuating month to month.

The other plot is users per month: how many users have used a given tool in a given month, contributing to that notion of tool popularity. In my opinion it's really surprisingly consistent for a given tool across the year, though we have some dips: following the holiday season people pick things up, and over the summer general usage drops. It's mostly small contributions from each tool, but QC seems to dominate. The same principles apply for this plot, so you can filter to the most popular or the somewhat less popular tools. And then we get to resource usage: total CPU time for a given tool per month, and average CPU time, which is a lot less consistent. Then we go to memory. So we did users, memory, and CPU as the three levels of aggregation. Anyway, that's what's here for now; there are a couple of experimental plots down below that aren't quite ready for prime time. Any questions or comments about this before I go back to the presentation? I've got a couple more slides.

Someone asked: does this update, or how is it updated over time? It's not updated automatically; that's covered in the remaining slides. And to what extent can this be adapted to work with Grafana and Telegraf, which would be live? Grafana and Telegraf... I don't know. There is the question of where this integrates with some of the existing tools; one related effort is the Galactic Radio Telescope, which collects some of this data, or used to, I'm not sure of its current status. But it's just a couple of queries that feed this data, and we can talk about how it might be fed into Grafana or something like that.

So, this is what's happening behind the scenes. Querying all the data on usegalaxy.org was painfully slow, and it was running against the Galaxy Main database, so it wasn't for the faint of heart, and we wanted to minimize that load. So we basically took the data from usegalaxy.org, extracted and filtered only some tool and job info, and created a few local tables, against which we developed these queries.
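As an aside on the version bundling mentioned above: Galaxy Tool Shed tool IDs end in a version segment, so one plausible way to collapse versions before grouping is to drop that final path component. This is an illustrative sketch, not the actual massaging code behind the dashboard.

```python
# Collapse per-version tool IDs into one series per tool before plotting.
import pandas as pd

jobs = pd.DataFrame({"tool_id": [
    "toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.4.2",
    "toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.5.0",
    "upload1",  # built-in tool IDs have no version path to strip
]})

def strip_version(tool_id: str) -> str:
    # Tool Shed IDs look like server/repos/owner/repo/tool/version; drop the
    # trailing version. IDs without slashes are left untouched.
    return tool_id.rsplit("/", 1)[0] if "/" in tool_id else tool_id

jobs["tool"] = jobs["tool_id"].map(strip_version)
print(jobs.groupby("tool").size())  # both Bowtie2 versions count as one tool
```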
We extracted these metrics, such as the users, jobs, and CPU time I mentioned, and created files that feed this dashboard. Observable was chosen because a lot of the work visualized on there was really cool, and it seemed simple to put up good visualizations that are open and allow others to edit them. That last part hasn't really materialized yet, but it allowed a lot of flexibility in how the data was presented. Again, most of this originated in trying to get some benchmarks for cloud costs, but then it took on a life of its own.

All the queries are in this repo, usage metering. Basically, to begin with, it takes about half a dozen columns from the job and metric tables. That minimizes the amount of data we have to work with locally, so we don't have to shift all the job parameters and other fields that are massive and not really considered here. We then created a dozen or so queries that focus on massaging the CPU and memory usage data and some of the user counts, and those produce the files that are used on Observable. And, to your earlier point: if the files are edited, Observable automatically reflects those changes, but the queries need to be run manually as it stands right now. I think these would be great additions to gxadmin, which already has part of this; once that's in gxadmin, you could stream it straight out to Grafana, right? Yeah, that's cool, that sounds good.

That gets me to where we are, and to ideas and discussion for what can be done with this. The dashboard is being actively developed to answer pertinent questions; Michelle made changes even this morning to allow filtering down to the less popular tools. A lot of time and effort is also going into complementing this usage data with targeted benchmarking: we use the dashboard to identify the popular tools, namely the ones that consume a lot of resources and are used by a lot of users, and Keith has been working on his ABM library and using it to benchmark a broader range of job configurations. One thing Kaivan learned on usegalaxy.org is that our scope of variability is fairly small, meaning that for a given tool all the jobs run with pretty much the same allocation, say 4 or 12 CPUs, so the machine learning can't predict a good number of CPUs because all it ever sees is the same value. So Keith is working on expanding that, exploring the boundaries, so that the models hopefully yield a more representative picture. And, Marius, to your point, and Sam's comment that this doesn't update automatically: it would be cool to see this live. We had data from 2021, then we updated it by a six-month margin, and it'd be nice to look at it every month, at the end of every month, or have this run automatically every day, whatever. I hadn't thought about making it part of gxadmin and feeding it into Grafana, but that would be cool. And that's it; there's a summary with the links if you missed them along the way, and a reminder that anybody is welcome at the next meeting, a week from today.
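For a concrete picture of the extraction step described above, here is a hedged sketch. The job and job_metric_numeric tables exist in Galaxy's database schema, and metric names like galaxy_slots come from Galaxy's job metrics plugins, but treat the exact column list, metric names, and connection string as assumptions to verify against your instance; the repo's actual queries may differ.

```python
# Pull a handful of job columns plus numeric runtime metrics into a local CSV
# that downstream queries (and ultimately the dashboard) can work from.
import csv
import psycopg2  # assumes a PostgreSQL Galaxy database

QUERY = """
SELECT j.id, j.create_time, j.tool_id, j.user_id, j.state,
       m.metric_name, m.metric_value
FROM job j
JOIN job_metric_numeric m ON m.job_id = j.id
WHERE j.create_time BETWEEN '2021-08-01' AND '2022-08-01'
  AND m.metric_name IN ('galaxy_slots',               -- allocated CPU cores
                        'runtime_seconds',            -- wall-clock runtime
                        'memory.max_usage_in_bytes')  -- cgroup peak memory
"""

with psycopg2.connect("dbname=galaxy") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    with open("job_metrics.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col.name for col in cur.description])
        writer.writerows(cur)  # stream rows straight to disk
```

Keeping only these half-dozen columns, rather than the full job parameters, is what makes it practical to work with the data locally instead of against the production database.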
I'll just add one thing: besides the work Keith is doing to diversify the number of CPUs used for tools, we also plan to get data from usegalaxy.org.au, because they seem to be using different numbers of CPUs, and that's going to make the data more varied. And Simon was waiting for some of this to become a little less custom, I guess; he wanted to run these queries on their server, and I understand he's got the GRT running at AU, so that's going to help with that.

That's awesome work, thanks everybody. I guess there are a couple of things going on. I think we've been saying for years now that there's an awesome paper to be written around this. In the meantime, is there a link to the Observable dashboard on the hub? No, not that I know of. I forget, I'm trying to remember the context, maybe it was writing up the AnVIL renewal grant, but I was looking for an up-to-date version of this, and I think I found an old version and a conflicting version, so it would be nice to consolidate on a single official version that is somewhat updated. Yeah, around October, I think, Observable introduced the notion of free teams. Before that, the original account was at anvilproject or something like that, but I couldn't figure out how to make that into a team, so we had to use a new handle, and that's how we ended up with this one. It used to be that you had to work through basically a PR process, which was cumbersome; it's not as straightforward as people are familiar with on GitHub, so each individual had to have their own copy, and it was just complicated to work collaboratively on the same notebook at once with that model. But moving forward, I guess that problem will go away: now that there's a team, everyone can edit the same document and just move forward. Okay, I'm very comfortable with that.

And then, you may have said this but I may have missed it: is there an effort to expand the set of tools that are considered, or is that list more or less fixed at this point? For the benchmarking part, yeah. That's on the to-do list, and feedback and input from somebody in the know would be great. The thinking behind the current list was: we've got a few of the mapping tools; the next set was RNA-focused tools; and then an RNA-seq workflow that combines the two. Those would be the three prongs for the paper, and that would be it for the time being. But I hear a lot of people are diving into single-cell work, so it'd be cool to benchmark those too; feedback and input would be super valuable. Yeah, we're never going to be able to do a comprehensive analysis of everything, but if there are a few major topics where we want to go deep, I think that would be compelling. I have the next call on my calendar, so I'll be sure to join. I just had my 27th and last class yesterday, so my life is free again and I'll be able to join starting next week. But awesome work; thank you so much, you and the team. One question in the chat: is it okay to broadly share the Observable dashboard and tweet it out? Yeah, sure. And if anybody wants to work on the Observable side, Michelle will be thrilled, I'm sure, to have some help.
Awesome. Any other questions for Enis or the group? If not, thanks again for joining. This is our last community call of the year, so we'll see you all in 2023. I'm going to start putting together the lineup, so if you're interested in presenting, please let me know and we'll get you scheduled. Thanks, everybody. See you at the hackathon, and if not, see you in 2023. Bye.