Alright, I'll take it away. So, hi, I'm Bridget. I'm a rising sophomore CS major and I'm interning here. And I'm also a CS major, a rising senior. Unfortunately, Pavan won't be with us today, but I just want to credit him for all the charts he made in Observable, especially finding the popular tools; a lot of the charts you'll see were made by him there.

First I'll go over some of the goals for the internship. Our main goal was to determine the approximate cost of cloud-based resource jobs: see how many resources are used, and then compare that to how popular the tools are. That gives a fuller picture of which Galaxy tools are most popular. Like I said, some jobs such as upload are extremely popular but consume very little resources, while others, such as BWA-MEM and other high-resource jobs that use a lot of memory or CPU, consume a lot of resources but are not as common. Once we have benchmarks, we'll be able to visualize and predict roughly how much each tool consumes based on the results we get, and that should give a good idea of the approximate cost per job depending on the CPU and memory allocation.

The way we went about collecting this benchmarking data is we used three different cloud instances: Jetstream, Google Cloud (GCP), and Amazon (AWS). We collected several DNA and RNA datasets, both paired-end and single reads, to test a variety of data on a suite of tools. We used workflows to run these benchmark tests, running the DNA and RNA datasets through tools like BWA, BWA-MEM, Bowtie2, HISAT2, StringTie, Salmon, and Kallisto, because those seem to be relatively common tools that we'd get some benefit from figuring out the runtimes of. Lately we've been using BioBlend to speed up this process on the different instances, because that allows us to run multiple workflow benchmarking tests at the same time, which is really nice.

Victor, you're muted. You're muted.

This is the DNA dataset benchmarking workflow. What we'll do is we'll have an input, which is a FASTQ dataset, and we'll connect it to each of the tools we want to run. In this case we have five tools: Bowtie2, BWA-MEM, BWA, HISAT2, and StringTie. We decided to link StringTie to HISAT2 because that was the most consistent job; it never failed. What we do with those workflows is grab their ID and then invoke them: we can use that ID for the workflow and then generate the benchmarks automatically for a given allocation. For the RNA datasets we had to have two inputs, because Salmon and Kallisto require a reference transcriptome. Otherwise we're also testing Bowtie2 and BWA-MEM on that one, and we linked StringTie to Bowtie2 because, again, that was the most consistently running tool in that workflow, so we'd always get the StringTie run.

Here is a quick look at some of the data we have in this Google spreadsheet. What we did at first is we just put all the data into the spreadsheet: we'll have the link to the job, and you can also look up the job ID and the data sample, as well as the CPU allocation, like this. So we'll put them with the link here and then the test ID. In the future, though, we plan to move all of this to Observable, because it'll be much easier to look up all the data there.
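To make the BioBlend-driven benchmarking mentioned above concrete, here is a minimal sketch of invoking a benchmarking workflow several times through the API. The URL, API key, and the workflow and dataset IDs are placeholders, and the input-step mapping is an assumption about how the workflow is wired; this is one plausible way to script it, not necessarily the exact scripts the interns used.

```python
# Minimal sketch: drive repeated benchmark runs with BioBlend.
# URL, key, and IDs below are placeholders, not real values.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://benchmarking.example.org", key="YOUR_API_KEY")

WORKFLOW_ID = "deadbeef01"        # hypothetical DNA benchmarking workflow
FASTQ_DATASET_ID = "cafebabe02"   # hypothetical FASTQ input already in Galaxy

# One invocation per repetition; Galaxy schedules the invocations
# concurrently, which is what lets several benchmark tests run at once.
invocation_ids = []
for rep in range(3):
    history = gi.histories.create_history(name=f"dna-benchmark-rep{rep}")
    invocation = gi.workflows.invoke_workflow(
        WORKFLOW_ID,
        inputs={"0": {"src": "hda", "id": FASTQ_DATASET_ID}},
        history_id=history["id"],
    )
    invocation_ids.append(invocation["id"])
print(f"Started {len(invocation_ids)} workflow invocations")
```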
Once you start having multiple runs, say three or four runs per instance, it's going to get a little cluttered on the spreadsheet, so it'll be much nicer and more compact on Observable, and all the other data will be there as well.

Now we'll go over some of the Observable data we have, including some of the plots. I'll share my screen. Alright, this is our Observable dashboard; Dan has the main version on his page, and it's updated. These are some stacked area charts of historical data, which Pavan collected the data for and made the graphs, so shout out to Pavan, thank you. We can see which tool is represented by which area thanks to tooltips that pop up as you mouse over. As you can see, upload is quite popular: it's at the bottom and it has the most area. Probably more interesting in terms of resource consumption is total CPU time per month. Note that TopHat2 is a large resource consumer, which is strange because it's deprecated, and you can also see BWA-MEM and Bowtie2 and other common tools like the ones we're testing. These two are the green areas on that chart; I think that's RNA STAR. Yeah, it is. The average memory consumption per month by tool and average CPU time per month by tool are not super readable for some reason, but yeah. And these are tables that go with those stacked area charts, so you can look at the specific numbers and sort however you want. Here are some tables for comparisons: number of jobs versus number of users, number of jobs versus total CPU, and number of jobs versus total memory.

Now we get to some new stuff that wasn't here on Tuesday; I just made these yesterday. There are tables and plots for all three instances' benchmarking data so far. This is a filterable plot with a table below. If you put in "bowtie", you can see all the Bowtie2 runs, and you can mouse over with the tooltip and see more detailed information, like the CPU count and the memory, and the specific system runtime and real runtime. You can also look up memory usage; for this one, I think 50, yeah. So for the 16 CPU and 50 GB memory setting, you can see all of the runs of all the tools with that resource allocation. Nicely enough, because we're graphing system runtime versus real runtime, the runs seem to form lines, so even without a filter on you can kind of tell which resource tier a run belongs to. And when you type in this bar it also filters the table, so you can do two things at the same time. We have the same for the other instances.

I want to unpack what you just said there: it makes these lines, and you said that corresponds to the configuration. The configuration of what? What resources we were running on the cluster when we ran these benchmarking tests. Like the number of cores or the amount of RAM? Yeah. Is there an easy way to say what those lines are, just very quickly? Yeah, it's just kind of an interesting feature that happened when I graphed it. That one specifically is the four-core, eight-gig line; this one right here. And what's the one above that? That should be nine and 28. Yeah, nine CPUs, 28 GB of memory. I see. And it's nine because I think it was a ten-core machine, but you reserve one core for the GUI and other scheduling tasks. Yeah, that's right.
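To illustrate why each resource tier traces its own line, here is a small sketch that groups exported benchmark rows by allocation and plots them. The dashboard itself lives in Observable, but a Python analogue makes the grouping explicit; the column names (tool, cpu, mem_gb, sys_runtime, real_runtime) are hypothetical stand-ins for whatever the spreadsheet export actually uses.

```python
# Sketch: group benchmark runs by resource allocation so each tier
# (e.g. 4 CPU / 8 GB) shows up as its own series of points.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("benchmarks.csv")  # one row per benchmark run

fig, ax = plt.subplots()
for (cpu, mem), runs in df.groupby(["cpu", "mem_gb"]):
    # Runs with the same allocation tend to fall along one line when
    # system runtime is plotted against real (wall-clock) runtime.
    ax.scatter(runs["sys_runtime"], runs["real_runtime"],
               label=f"{cpu} CPU / {mem} GB")
ax.set_xlabel("system runtime (s)")
ax.set_ylabel("real runtime (s)")
ax.legend()
plt.show()
```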
What's interesting is to compare the same dot across lines. It's still a work in progress, but the top-right-most dot is likely the same tool in each line. That one is the four-and-eight run at, whatever, 9,000 seconds, and if you move over to the next line, the top dot is probably the same tool but with more CPUs. If that holds, the tool's speedup is linear across the CPUs. We still need to figure out more visualizations, but that's the way to compare what impact CPU configuration changes have on the runtime of a tool at this point. I see. But you can do the filtering: type "bowtie" and see all the Bowtie2 runs. Yeah, but then you want to compare the lines. That's so cool. Thank you.

Also, you can zoom in and out a little bit, because Pavan and Dan worked on that; I used their plots down at the bottom here, which I'll get to in a sec, as a template, so that helped a lot. And we also have AWS and GCP, same thing; you can filter them all. The hope is to overlay these at some point, or at least make it so we can have them all in the same space and switch between them rather than scrolling, so hopefully I can implement that soon.

Do you know if the processor architecture and clock speeds are comparable across GCP and Jetstream and AWS? I don't know how to answer that; I would probably have to overlay them to see if the system runtimes are comparable across the platforms. That will be in the documentation of the respective providers to some degree, but we used the same machine class (they have CPU-heavy ones and memory-heavy ones and so on), so the same class across the providers; I have not looked at the CPU specs for each, though.

Just to wrap up: we draw all of our data from GitHub, so it links right in and we don't have to upload any files. Here is the testing of zooming down below, and here's a scatterplot of number of jobs versus total memory and number of jobs versus CPU time, matching the tables up above. That's what we have on here so far. We're planning on adding a table of contents for better navigation and, like I said, the overlaying of, or at least quick switching between, the instances' benchmarking data charts, so that should be coming soon.

It's great that this is available, but I think the plan is to embed some of these figures on the Galaxy Hub. Yeah, eventually we'll be incorporating all this into an API. Well, there's the API, but there's also just making it easier to find, rather than having to know to go to Dan's private repo here. This is so cool. Thank you. Yeah, of course. Alright, I'm going to stop sharing my screen now. There you go.

So like we mentioned earlier, we're moving everything over to Observable along with the graphs. Hopefully in the future we'll keep it updated, and once we have more benchmarks on the same instance and the same dataset, multiple reruns, we'll hopefully get an average system runtime and then compare the plots across each instance and each job, and see whether it's linear acceleration or not, as we said.
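As a worked example of the linearity check just described, here is a small sketch comparing the same tool at two CPU allocations. The runtimes are made-up illustrations, not measured values.

```python
# Sketch: take the same tool and dataset at two CPU allocations and
# see how close the observed speedup is to linear scaling.
def speedup_efficiency(cpus_a, runtime_a, cpus_b, runtime_b):
    """Return (speedup, parallel efficiency) going from allocation A to B."""
    speedup = runtime_a / runtime_b   # >1 means B was faster
    ideal = cpus_b / cpus_a           # perfectly linear scaling
    return speedup, speedup / ideal   # efficiency of 1.0 == linear

# e.g. 4 CPUs taking ~9000 s vs 9 CPUs on the same tool and input
s, eff = speedup_efficiency(4, 9000, 9, 4200)
print(f"speedup {s:.2f}x, {eff:.0%} of linear")  # ~2.14x, ~95% of linear
```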
We did have a few challenges. One of the main ones was trying to get Polyester to run. Polyester would let us create synthetic RNA datasets to use, and we'd then be able to run those datasets through the tools to see if the results are any different from the real-life RNA data. Also, one issue was just resolved earlier when a new node was added for GCP: we now have a 32 GB node, so we should be able to do the 29 GB allocation, which gives us two more resource tiers on GCP. The main issue right now is that we're having some problems creating new histories and starting jobs, so hopefully that gets fixed soon. We're also having some issues with RNA STAR and RSEM, mainly due to memory allocation problems, as well as needing extra references to run. So hopefully we'll get Polyester going and also run those other tools, as well as HISAT2 for RNA, in a short while.

Like I mentioned earlier, with the help of Keith's BioBlend scripts, benchmarking on different instances, if we decide to add more of them, is going to be a lot faster going forward. We have configuration files written for RNA and DNA dataset benchmarking, so we just modify those and run them, and it's a lot quicker. One issue with this going forward, though, is that dataset and workflow IDs change across cloud platforms. Keith is currently making progress on identification by name instead of ID for BioBlend, so hopefully we'll get that working before we add another instance to our benchmarking; that's exciting. We're also aiming to extract names and dataset type info from Galaxy for the Observable plots and tables, because as it stands we can't tell which dots are DNA or RNA, paired or unpaired, or how many gigabytes of storage they take up, so we can't really drill down into which trials are which. We're trying to get that information into the charts.

And finally, the end goal of all this is, using the benchmarking data, to implement an API and hopefully incorporate it into the Galaxy interface to provide access to cloud cost information. So, yep. Thank you. Are there any questions?

Yes: this API, what would it return? How would it work? You would be able to select a job, and then also maybe select the approximate dataset size. Depending on the allocation of CPU and memory, it would return an approximate runtime based on the benchmarks we're doing. We'll be able to get a general idea of, say, average system runtime once we get those benchmarks down, and then we'll use that as a reference to give the user an approximate runtime for the job they're running. I mean, that's kind of a holy grail.
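A minimal sketch of what such an estimate could do under the hood, assuming a nearest-benchmark lookup with linear scaling by input size. The record structure and all numbers are hypothetical, and the real service would presumably do something more careful once averages over reruns exist.

```python
# Sketch: estimate runtime by picking the closest recorded benchmark
# for a tool and scaling linearly by input size. All values are made up.
BENCHMARKS = [
    # (tool, cpus, mem_gb, input_gb, runtime_seconds)
    ("bwa_mem", 4, 8, 2.0, 9000),
    ("bwa_mem", 9, 28, 2.0, 4200),
]

def estimate_runtime(tool, cpus, mem_gb, input_gb):
    candidates = [b for b in BENCHMARKS if b[0] == tool]
    # Nearest benchmark by allocation, then linear scaling by input size:
    # a crude first approximation until reruns give better averages.
    best = min(candidates, key=lambda b: abs(b[1] - cpus) + abs(b[2] - mem_gb))
    _, _, _, bench_gb, bench_runtime = best
    return bench_runtime * (input_gb / bench_gb)

print(estimate_runtime("bwa_mem", cpus=8, mem_gb=24, input_gb=4.0))  # ~8400 s
```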
I can share something as well that we've been working on a bit. John, can you add me to the share list? I think everyone should be able to share. Oh, sorry; Victor, you've got to stop sharing first. So this is what we sort of started talking about as a schema for what the API will return. We have a list of tools that have been registered, and then the metrics that have been captured for each tool: most significantly, the number of CPUs, the memory, and the runtimes, along with the inputs that were used to derive that benchmark result. And then we have a database of inputs that we tested with, available as an endpoint. And then, or, sorry, those are for POSTing; these are for GETting, so you can get a list of tools, the metrics for a tool, and then the inputs. One idea was that before you hit the execute button in Galaxy, you could have an "estimate" or "give me some examples" button next to it that, when clicked, would call this API and return an extrapolation of the available benchmarking data combined with the data the user has provided on the input form. But that's way down the line. I can post this in chat, and if there are comments as to what this API might look like, it'd be great to have people consuming and using it as much as building it out, to begin with.

So currently, is all this treating each individual version of a tool as a separate entity, not necessarily aggregating versions? And is that the same for the metrics? So if there are, say, ten different versions of BWA, we'd list ten different kinds? Yeah, as it stands we didn't aggregate across the versions. I don't know whether we should have or not, but we didn't.

Have you considered incorporating Galactic Radio Telescope data, so that in addition to benchmarks of stock datasets, you can get a picture from all over: random data and different configurations of servers and that kind of stuff? It's certainly an option, but we had six weeks, and the state of the Radio Telescope was kind of in flux; plus, as you said, it's random data as opposed to controlled inputs. So it's on the table, certainly; just with the timeline we had, the Radio Telescope seemed to be out of scope for the project. Yeah, makes sense. It's good, because the past projects that focused on this were using that kind of data, the existing runs. I like the approach here: if you're changing these variables intentionally, there are more things you can pull out of the data. I know these summer projects are short-term, but one cool outcome might be a list of issues on the galaxyproject/galaxy issue board: here are the things we could do to improve the APIs and make it easier to do this next time. That documentation is really nice. I guess, in that vein, are there other things that the Galaxy framework, or the Galaxy UI, or the API could do to make this process easier? You might have some comments on that. I've talked to Keith.

Yeah, there's the issue of trying to determine the IDs of datasets when we move them across instances, but John and I have already been talking about that on Gitter, and I had some ideas this morning. I'm still trying to determine whether the problems I'm encountering are a problem in Galaxy or a problem with BioBlend, so I'm trying to see if I can pull the information straight out of the database to see what's in there. I'll have some questions about that for John on Gitter, probably later today. But otherwise, it's only for the benchmarking, but being able to keep a consistent set of IDs across instances... I know Nate has a trick where you set the ID secret to a known value or something; I don't know how far that goes. I think the first dataset ID you can predict; I don't know if you can go further.
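On the name-versus-ID problem just mentioned, one plausible shape for the workaround is to resolve everything by name on each instance. This sketch uses BioBlend calls that exist today; the names themselves and the idea of one shared config are assumptions, not Keith's actual implementation.

```python
# Sketch: resolve a workflow and an input dataset by name on a given
# instance, instead of hard-coding IDs that differ across clouds.
from bioblend.galaxy import GalaxyInstance

def resolve_ids(url, api_key, workflow_name, history_name, dataset_name):
    gi = GalaxyInstance(url=url, key=api_key)
    workflow = gi.workflows.get_workflows(name=workflow_name)[0]
    history = gi.histories.get_histories(name=history_name)[0]
    dataset = gi.histories.show_matching_datasets(
        history["id"], name_filter=dataset_name)[0]
    return workflow["id"], dataset["id"]

# The same hypothetical config then works on Jetstream, GCP, and AWS:
wf_id, ds_id = resolve_ids(
    "https://jetstream.example.org", "KEY",
    "DNA benchmarking", "benchmark inputs", "sample1.fastq")
```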
But if there were a way to force the same IDs for resources as you upload them... Yeah, Kyle did some work around UUIDs; datasets have UUIDs. I don't know if it's useful in this context, though, and the APIs that can consume them are kind of sporadic. But if we could have a consistent approach to using UUIDs, I think that was the idea there. So would those be the same across instances? I mean, if you uploaded them with a specific UUID, they would be, right? So if you scripted this out, and I believe the upload... let me think. Maybe. I don't know. I'm thinking of hashes: you can tell the upload to hash the dataset, but we can't do much with that in terms of searching after the fact. Yeah, I don't know. I think the upload tool lets you specify the UUID during upload; I don't think it's in the GUI, but we should look at that, that would be useful.

And what is the backend for the API that you just showed? Will that be part of every Galaxy instance, or would this be an independent app? There's no backend to this; this is just a Swagger spec, an OpenAPI definition. Pavan is starting to look at implementing it in Django as an app, so we'd run this somewhere, right? I don't believe adding more features into the Galaxy codebase that are not related to Galaxy is the right architectural path forward, so I think decoupling things is better; they can update independently and so on. I was just interested in what your vision was there. But I think we have the scripts in place now such that we could do this profiling of usage on Main quite regularly, right? You could almost turn it into a cron job that runs once a week or once a month or whatever we want. It would be nice to monitor for abnormalities, like if infrastructure is not behaving as expected, or some kind of deviation from the average, and see what things are running. So it could aid in service stability testing, for example, not just benchmarking. Yeah.

And I noticed that expansion in RNA STAR. If you kind of squint at those charts, you can see where COVID was very popular, and protein folding was very popular, and RNA-seq was very popular; it's kind of fun to watch. I mean, historical data. Yeah, that'd be great. I think overall, if we could drive more of our development decisions based on actual usage, it would really point us toward what people are doing as opposed to what we think. I think we can already do that, because the fact that TopHat is still there is disturbing, and it clearly uses resources that we need elsewhere, so get rid of TopHat. Very clear message. Nate, can we do that, please? Yes: you tell me to remove it and it's gone. Or just put more warning text at the top saying, hey, you probably don't want to use this. I don't know. I can also send it to things like Stampede, where it won't run for two days, and then maybe people stop using it.

But yeah, how much database massaging was necessary to generate these graphs for the historical data? We pulled certain fields, predominantly from the jobs table and the job metrics tables, and selected fields that did not include the inputs, which shrunk the size of those, so it works reasonably fast. The queries are in that usage-metering repo on GitHub, in the README even. The queries run for a while, but other than that it was not complicated.
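For flavor, a hedged sketch of the kind of periodic usage query being described, run against a Galaxy PostgreSQL database. The table, column, and metric names follow Galaxy's job and job_metric_numeric tables as I understand them, but they should be checked against the actual schema and the queries in the usage-metering repo.

```python
# Sketch: monthly CPU time per tool, the sort of query a cron job
# could run regularly. Verify table/column/metric names for your Galaxy.
import psycopg2

QUERY = """
SELECT j.tool_id,
       date_trunc('month', j.create_time) AS month,
       count(*)                           AS n_jobs,
       sum(m.metric_value)                AS total_runtime_seconds
FROM job j
JOIN job_metric_numeric m ON m.job_id = j.id
WHERE m.metric_name = 'runtime_seconds'   -- metric name may vary by plugin
GROUP BY j.tool_id, month
ORDER BY month, total_runtime_seconds DESC;
"""

with psycopg2.connect("dbname=galaxy") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for tool_id, month, n_jobs, runtime_s in cur.fetchall():
        print(tool_id, month, n_jobs, runtime_s)
```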
What I'm thinking is that if we can run some of these queries regularly to produce de-identified data and put it somewhere, then anyone can spin up an Observable notebook, pull it in, and visualize it, and that would sort of solve the reporting problem. For example, there was interest among public server admins in resurrecting the Radio Telescope to do exactly that. So I kind of wonder if we should combine efforts on this, because I think there is a lot of interest in doing that. I think that would be wonderful, because the benchmarks that are done now are amazing and very interesting, but it's a dozen tools out of, what, 10,000 or so; the only way we're going to get a broad analysis is by mining the runs that are going on anyway. There are sensitivities around that, so we've got to do it in a very transparent and iterative way so everyone's comfortable with it. At the very least, we should be talking to make sure we're using the same fields, or decide what fields will be useful from the database.

And ultimately, here we're measuring runtime and so forth, but in a cloud environment we're measuring dollars too. I think we're going to be the heroes of genomics if we can put a dollar amount on these popular pipelines, because when you multiply by 10,000 samples, it starts to be real money on the table. It'd be cool if, at the end of running a workflow, it told you how much you would have spent. Yeah, that could be fun: especially at TACC, where it's all free, it'd be kind of fun to say, had you done this at GCP, it would have cost, whatever, $15,000. TACC would love that. That'd be amazing; I'd love that as well. Especially if you could have a running tab showing people how much we're giving them. Because if you see one job, it's like 12 cents or whatever, but if you see over the course of a month that you've consumed $500 or $1,000 or $5,000 worth of compute, that's an impactful resource.
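A back-of-the-envelope sketch of that dollar-amount-per-job idea. The hourly prices are illustrative placeholders, not real cloud quotes.

```python
# Sketch: convert a job's allocation and runtime into an approximate cost.
PRICE_PER_CPU_HOUR = 0.032   # hypothetical on-demand $/vCPU-hour
PRICE_PER_GB_HOUR = 0.004    # hypothetical $/GB-hour of memory

def job_cost(cpus, mem_gb, runtime_seconds):
    hours = runtime_seconds / 3600
    return (cpus * PRICE_PER_CPU_HOUR + mem_gb * PRICE_PER_GB_HOUR) * hours

# One mapping-sized job: 9 CPUs / 28 GB for 4200 s is roughly $0.47 ...
single = job_cost(cpus=9, mem_gb=28, runtime_seconds=4200)
# ... which, times 10,000 samples, is real money:
print(f"${single:.2f} per job, ${single * 10_000:,.0f} for 10,000 samples")
```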
So once the internship is over, how do we continue? Well, the great news is that they've expressed interest in carrying on, albeit at lower engagement levels. But yeah, this is just the beginning.

I'm wondering if the top four tools in terms of CPU usage have GPU versions available that we could utilize to free up the CPUs. I'm not aware of a port of TopHat2 to a GPU, but as we mentioned, there are newer tools like HISAT2 that are basically drop-in replacements and an order of magnitude faster. And also, the fact that almost 20% of all jobs are upload jobs: they don't represent 20% of the resources consumed, but from the user's perspective, if we could streamline the upload process, if we could allow remote data, as Jeremy keeps saying, to be just linked, it would have a massive impact. That's one out of every five jobs running on Main being an upload job, so focusing on making that seamless, faster, and less resource-intensive is something I think the users would really appreciate. I think we have something similar already, like the personal histories where you can drag datasets into a new history. Is that what you mean by streamlining, a little bit, so you could reuse some of the previously uploaded data you need? More like sharing it between users, or sharing it across Galaxy instances. You know, there are multiple of these: we're using usegalaxy.org, but there's .eu, .au, and many others. If those datasets were uploaded once, and users swapping between instances were able to reuse them without having to re-upload the data... That's one. Secondly, oftentimes, and increasingly so, public datasets already live somewhere, not on the user's laptop. So instead of having to go through the task of explicitly uploading, waiting for that upload, and running the metadata detection, if you could just say the file is in this S3 bucket, then when the job runs it downloads the file for the purposes of that particular job. The upload step can be skipped altogether; from the user's perspective, you don't have to wait that extra whatever, 15 minutes or an hour, depending on the size and your bandwidth. And in terms of user experience, just to throw out an idea: if you could drag and drop a file from your desktop onto the upload button and it just magically brought it in, it would probably save a gazillion clicks over time. There are some relatively simple things that can be done there; you can already drag a file into the upload form box, but we could add that action to the button itself. You open the box, you just drag stuff. Okay, cool.

It's kind of an interesting comment on the nature of genomics: one in five jobs are uploads, and, with a lot of caveats, in round numbers most workflows are only four or five steps long from a given dataset. It's kind of interesting that it's relatively shallow. But I guess that's compatible with my own experience; often it's a few steps per dataset. Obviously there are exceptions and some go very deep, but those are relatively rare. I think this is also an indication of why we need to focus on the UI more, because it's probably also UI limitations that make people just give up at some point. Maybe that's part of it, but I think it's just the state of the field: upload, do some QC, do a mapping, do variant calling, do some comparison, done. We could also look at the rate of downloads: do people just stop, or do they pull the data out of Galaxy and take it elsewhere? That's another question that can be asked at that five-step point. It's a good question.

I just got an idea earlier about sharing data across instances. Maybe if you could somehow get two separate tabs on screen, one with, say, the regular instance and one with the EU instance, you could drag a dataset from the EU instance across into the regular instance's tab, and somehow share the data that way. I'm not sure if it's possible. Yeah, the backend is the challenge there, in how Galaxy stores references to the datasets, or the metadata; that's the idea, the implementation is just the sticky point. Alright, any other questions or comments? Just following up on that one: I think it might be possible, because each dataset has a link attached to it in the history. You can actually copy that link and then upload it that way on another instance. So it might be possible, rather than directly copying the data over, to just have the link copied over when you drag it, and then upload automatically. Yeah, there are nuances there, but yeah.
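A minimal sketch of that copy-by-link idea, assuming BioBlend on both ends. The URLs, keys, and dataset ID are placeholders; put_url and the /datasets/&lt;id&gt;/display download route exist, but whether this carries over enough metadata is exactly the open question raised next.

```python
# Sketch: re-upload a dataset to a second Galaxy instance from its
# download URL instead of moving the bytes by hand. Placeholders throughout.
from bioblend.galaxy import GalaxyInstance

src = GalaxyInstance(url="https://usegalaxy.eu", key="EU_KEY")
dst = GalaxyInstance(url="https://usegalaxy.org", key="ORG_KEY")

dataset_id = "abc123"                      # hypothetical dataset on .eu
info = src.datasets.show_dataset(dataset_id)
download_url = f"{src.base_url}/datasets/{dataset_id}/display"

history = dst.histories.create_history(name="imported from EU")
# put_url makes the target instance fetch the file itself. Note this
# loses user-set metadata like a custom datatype, as discussed below.
dst.tools.put_url(download_url, history_id=history["id"],
                  file_name=info.get("name", "imported dataset"))
```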
You'd want the metadata to be the same, right? So if the user had selected a special extension for that file or something, you wouldn't want to just take the link and re-import it; you'd want to capture some of that metadata. I'm not sure how the client-side APIs work for dragging and dropping between browsers, but if we could capture more of that metadata, yeah, that'd be very cool. Alright, I guess we'll end the meeting a little bit early. Bye everyone, thanks; thanks to the interns. I'm glad to hear you're sticking around, that's great. Have a good school year. Thank you so much, Victor and Bridget. Thank you. Thank you.