And now I'm going to turn it over to Kevan to talk about the machine learning toolkit in Galaxy. Actually, I should have communicated this better: this is going to be one example of a machine learning application in Galaxy, not specifically a machine learning toolkit, but I thought I'd mention that since it's not reflected in the title.

So, let me share my slides. Do you see my slides? Yeah, looks perfect.

The topic of this presentation is resource prediction for Galaxy jobs. The motivation is that we have a lot of data: Galaxy's Main server has been collecting job run data since 2013, and I believe Europe and Australia have been doing the same since a little after that. This large collection of job run data can be used to train machine learning models to predict job resources.

Why do we care — why is predicting job resources important? For multiple reasons. It allows Galaxy to schedule a job on a node with just enough resources. If we are over-allocating, we reduce system throughput and utilization, and if we are under-allocating, we may see job failures because not enough resources are available. So allocating just enough resources for a job matters in that respect. Another application is that resource prediction enables cost prediction. When we want to run our jobs on commercial clouds like AWS — and I think this is something I spoke about with Enis — they are doing some analysis of how much it costs to run a task on the cloud given different hardware combinations. This would reduce the amount of analysis they have to do, because you would already have an idea of how much memory or CPU a job needs.

This is a continuation of previous work done a few years ago, based on historical data. That research mainly focused on job runtime prediction, with a little bit of work on memory prediction. However, job runtime — wall clock time — is very much tied to a node's hardware: if a node has more CPUs and more RAM, the job will complete sooner; otherwise it will complete later. The point is that this information doesn't really transfer across different hardware: if you learned a runtime on one node, it may not be useful for another node with a different number of CPUs, a different amount of RAM, and so on. That research also proposed some future work, one item being to predict memory and CPU utilization. These are largely independent of the hardware: if we have an idea of how much RAM a tool uses, or what its CPU utilization is, it's much easier to transfer that to a different node.

My idea was to create a service with endpoints that can be queried for the memory and CPU utilization of a specific tool. This service could be called by some other client that's trying to predict the cost of running a job on a commercial cloud, or it could be called by the Galaxy server when it has to schedule a job. For example, when you want to run a job, you get the CPU and memory utilization predicted, and based on that you assign it to a specific node.

Okay, so for the current work, I created a new GitHub repository called galaxy-job-analysis, and I refactored some of the Python code available from the previous work.
I cleaned up the script, documented it, and made it parameterized. We're going to be training models on job run data, and which job run data we train on is now part of an input file: the tools you want to train a model on are listed in that file. We're also going to be training different models on the data, and the list of models is specified in a separate file. So there are two input files, specifying which models we're going to train and what data we're going to train them on.

These models have hyperparameters, and the ideal values for those hyperparameters aren't known, so we need to train a model over a range of parameter values and see which combination performs best — that's hyperparameter tuning. The hyperparameters can also be specified in the input file, and the script does a grid search over all combinations of all parameters and finds the best model. That's as far as training the models is concerned.

After the models are created, we want to use them, so I created a FastAPI service with endpoints for resource prediction. On startup, the FastAPI service loads the model binaries that were generated by the training step, and then you can query it, for different tools, for memory and CPU utilization. Finally, I added a lot of documentation on how to train the models, how to start the service, and so on. Everything can be found in galaxy-job-analysis.

Specifying the job run data — I'm going to go through all of these items one by one in more detail. It's a CSV file; in my repository it lives in the config folder, for example input_files.csv. Each line in the file specifies the path to job run data for a specific tool. For example, say we collected bowtie2 job run data from Main; we add the path to that file — say, under an examples folder — on line one of that input CSV. We also say which column in the bowtie2 CSV we're trying to predict; that's been parameterized as well. For example, if you want to predict the maximum memory usage, the column in the bowtie2 CSV that corresponds to that is named memory.max_usage_in_bytes. Then on the next line you can specify a different tool, and possibly a different column to predict — usually we just want to predict memory or CPU usage.

This makes it easy to change the job run data, or the resource to predict, without touching the Python script: you only change the input file. Say I want to train a model on ten tools, and then on ten different tools — I create two input files, and the script is unchanged. Or I want to predict memory usage on these ten tools and CPU utilization on another ten tools — again, only the input file changes; the script is unchanged.

Here is one example. It's a CSV file with two columns, input file and label name. Under input file we have the job run data for bowtie2 and hisat2 — the paths to the files that were created from the job run data on Main — and under label name we specify which column we want to predict.
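To make that concrete, here is a minimal sketch of what such an input CSV might contain and how a script could read it. The file name, the example paths, and the exact column header are illustrative assumptions based on the description above, not necessarily what is in the repository.

```python
import csv
import io

# Hypothetical contents of the input CSV described above: one row per tool,
# pointing at its job run data file and the column (resource) to predict.
example_config = """input_file,label_name
data/examples/bowtie2_main.csv,memory.max_usage_in_bytes
data/examples/hisat2_main.csv,memory.max_usage_in_bytes
"""

# A training script can iterate over the rows; swapping tools or switching from
# memory to CPU prediction only changes this file, not the script.
for row in csv.DictReader(io.StringIO(example_config)):
    print(f"train on {row['input_file']}, predicting column {row['label_name']}")
```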
We can also specify which models we want to train, and this is specified in a JSON file, again in the config folder; one example would be models.json. It's a JSON object — a list of name/value pairs. The name is the name of the model, for example RandomForestRegressor, and the value is another JSON object that specifies the model's hyperparameters, class name, and module name. This makes it easy to change the models to train and their hyperparameters. Say you want to train ten models on the data we have and then ten other models: you just create two different JSON files, one with the first ten models and one with the second ten. If you want to update the hyperparameters, it's the same thing — you specify the hyperparameter values you want the training to try in the file, and the Python script is again unchanged.

Here is an example models.json. We're training two models, RandomForestRegressor and ExtraTreesRegressor. The hyperparameters are specified under parameters; each model has three hyperparameters here — it could be ten, or one, it doesn't matter, I've just added three for each as an example. And for each hyperparameter you can see there are two possible values; again, this is just for the example, you could have one value or ten. I'm varying the number of estimators for the RandomForestRegressor — I try 50 and then 120; ideally I should also have a value in between, maybe 75. Then I specify the max depth of the tree, which could be 10, 15, 20, and so on, and how the max features are selected — the method for that could be automatic, it could be the square root of the number of features, it could be something else. As you can see, it's very simple to change the hyperparameters.

The two other things are the module name and the class name. When my Python script reads this file, it automatically imports the module whose name is specified there and automatically instantiates an instance of the class, so you don't need to do anything — this is handled automatically as well.
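As a rough sketch, the kind of models JSON being described might look like the following; the key names (parameters, module, class) are my guesses at the structure, and the hyperparameter values are just stand-ins for the examples mentioned above.

```python
import json

# Hypothetical models.json: one entry per model, with the module and class to
# import dynamically and the hyperparameter grid to search over.
models_json = """
{
  "RandomForestRegressor": {
    "module": "sklearn.ensemble",
    "class": "RandomForestRegressor",
    "parameters": {
      "n_estimators": [50, 120],
      "max_depth": [10, 15, 20],
      "max_features": ["sqrt", "log2"]
    }
  },
  "ExtraTreesRegressor": {
    "module": "sklearn.ensemble",
    "class": "ExtraTreesRegressor",
    "parameters": {
      "n_estimators": [50, 120],
      "max_depth": [10, 15, 20],
      "max_features": ["sqrt", "log2"]
    }
  }
}
"""
models = json.loads(models_json)
print(list(models))  # the models the script would train
```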
Okay, so we want to train ML models on the job run data. The script is called regression — I'll probably rename it, that's an auto-generated name — and it's in the scripts folder. It takes four parameters. One is the input file we just saw, the one that lists the job run data files and the column to predict. Another is the models parameter, the JSON file we just saw. So we say what data we're going to train on, what we're going to predict, which models we're going to use, and what their hyperparameters are. Then we specify two other parameters. The last one is the models directory: after we try every combination of hyperparameters, we pick the best one and save the binaries to that folder — that's where the model binaries are saved. And there's the output file: say we have two sets of job run data we want to train on and two models, so that's four combinations; each line in that file says what the best hyperparameters are for that model on that tool's data, and what the prediction accuracy is on the training and test data.

That allows you to compare different models — not just get the best hyperparameters for a specific model, but compare across models, so for example maybe ExtraTrees does better than random forest, or the other way around; you can look at the prediction accuracies side by side. What that regression.py script does is a grid search with cross-validation — GridSearchCV, which is part of the scikit-learn library — and it saves the binaries for the best models.

Here is an example of the output CSV. The first column is the input file, that is, the job run data we trained on; the second column is the name of the regressor, which will be RandomForestRegressor or ExtraTreesRegressor. We get the best score, which is an R squared score — 1 is perfect, 0 is the worst. Then we have the list of the best hyperparameters, semicolon-separated: max depth is 15, max features is auto, number of estimators is 120 — this is for the first row; the first row is a bit too long, so it wraps onto the second line. And then we have the test data score: we set aside part of the data that is not used for training, evaluate the model on it, and report that score here as well.

Okay, so when we do this, we end up with model binaries based on the hyperparameters, the input data, and the models that were specified. The next step is to actually use those models. There's a FastAPI service in the app folder — it's very simple right now, it has just one endpoint for illustrative purposes, and it runs on localhost on port 8000. On startup it loads all the model binaries from the models folder and defines endpoints for resource prediction for each tool. For example, for bowtie2 memory prediction there's an endpoint called bowtie2 memory — I'm assuming I'll add one for bowtie2 CPU later — and other tools would each have their own endpoints. Then you can go to the docs page to view the input parameters and call the endpoint, which I'll do shortly.

So let me actually show you that first — let me share my full screen. Okay, do you see my terminal? Yes. Okay, cool.

This is the regression script. To train the models, I specify the input files here — this is for memory prediction, which is why it has _mem in the name — I specify the models here, models.json, and the output file, which goes in the output folder and also has _mem in its name. And I say save the models to the models directory. The virtual environment is activated, so if I hit enter, it's going to try different hyperparameter combinations and train the models. The first model is being trained. This should be quick because I only have two models and two files to train on, but I wanted to illustrate it. Ideally, later on I'll be running this on the full Main data, but that's going to take a while, so I'm doing this short test just to make sure everything works — that way, when I start with the Main data and it takes half a day, I'm not wasting time redoing things because of a bug. That's the third model, and this should be the last one. It's trying different hyperparameter combinations — right now, as you can see, the number of estimators is 120; it used to be 50. Let's wait a few more seconds; I just want to show you the outputs as well. It always takes longer when you demo.
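As an aside, here is a minimal sketch of what this grid search step might look like internally, using scikit-learn's GridSearchCV and joblib on made-up data, and reusing the hypothetical module/class layout from the JSON sketch above; it is not the actual regression.py.

```python
import importlib
import os

import joblib
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split

# Made-up "job run data": three numeric features (e.g. input sizes) and a
# target such as max memory usage in bytes.
rng = np.random.default_rng(0)
X = rng.uniform(1e6, 1e9, size=(500, 3))
y = 2.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1e6, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Dynamically import and instantiate the model named in the config entry.
spec = {"module": "sklearn.ensemble", "class": "RandomForestRegressor",
        "parameters": {"n_estimators": [50, 120], "max_depth": [10, 15]}}
model_cls = getattr(importlib.import_module(spec["module"]), spec["class"])

# Grid search over all hyperparameter combinations with cross-validation.
search = GridSearchCV(model_cls(), spec["parameters"], cv=10, scoring="r2")
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("training (CV) R^2:", search.best_score_)
print("test R^2:", search.score(X_test, y_test))

# Save the best model's binary, named after the tool and the model class.
os.makedirs("models", exist_ok=True)
joblib.dump(search.best_estimator_, "models/bowtie2_RandomForestRegressor.pkl")
```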
Okay, so if you look in the models folder, you should see the binaries for these models. The tool name and the model name are both appended to the name of the binary, so it's clear which is which. And if you look at the outputs folder, this is our output file. This row is for bowtie2 with the random forest regressor predicting memory usage; this is the best R squared we got, these are the best hyperparameter values, and the final column is the best R squared on the test data.

Now, in a separate window — I'm already in the app folder — I'm just going to run main.py. This should start the FastAPI service locally on port 8000. Here it is, and I'm going to go to the docs page. This particular service has one endpoint — I loaded one model, so one endpoint. You can see you can specify the parameters; these names need to be more meaningful, I'll change that, but these are the sizes of the input files, and I think this one is maybe the size of the reference file. You can try it out and specify values — I could just put in some garbage values and it would still work — but what you can do is look at one of the rows in the data and find some meaningful values. Let me see, this is bowtie2 memory. So we're going to try to predict the max memory usage, and the actual value is here. Then we have input one and input two. Okay, so I'll copy input one.

Sorry — what are these values? This is a file that was generated from Main that says how much memory bowtie2 uses based on different input combinations, one job per line. Yeah, okay, but the value that you just copied there — what does that value represent? I think it's the primary key of the object. So you're literally checking — I mean, do you understand my concern? What you're inputting here is the identity of the input dataset you want to predict resources for, and you have an exact match for it in your data; you could just look at your file and see the actual value. I guess what you would want to do is do this without actually providing input one or input two, because those are going to be different if you want to use this to predict something new. I don't know about bowtie2 exactly, but I believe this is the size of input one and input two, and those determine what the CPU and memory requirements are. If it's a size, that's a bit less of a concern, but it's still highly specific, right? There's probably just one file on Main that has that exact file size, so is the model going to overfit and just say, hey, this is that one particular file and I know how much it's going to take, or is it really learning something there? Well, we evaluate the models based on R squared, so if the R squared is close to one, we're doing a good job. And also, I should have mentioned this before: I don't use this data as-is for training — a lot of these columns are removed. There's a method, something like remove bad columns, and anything that's a unique identifier or useless information as far as prediction is concerned gets removed. So maybe the Main usage data has thirty columns, but we only use eight of those to train the model.

So yeah, you can execute this, and it will tell you that in this case bowtie2 will use this much memory. Let me go back.
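For reference, here is a minimal sketch of the kind of FastAPI app being demoed: it loads a saved model binary at startup and exposes a single prediction endpoint. The endpoint path, parameter names, and model file name are illustrative assumptions, not the actual app.

```python
# main.py -- hypothetical, stripped-down version of the prediction service
import joblib
from fastapi import FastAPI

app = FastAPI()
models = {}

@app.on_event("startup")
def load_models():
    # Load whatever binaries the training step produced; here just one.
    models["bowtie2_memory"] = joblib.load("models/bowtie2_RandomForestRegressor.pkl")

@app.get("/bowtie2/memory")
def predict_bowtie2_memory(input1_size: float, input2_size: float, reference_size: float):
    # Feature order must match the order the model was trained with.
    pred = models["bowtie2_memory"].predict([[input1_size, input2_size, reference_size]])
    return {"predicted_max_memory_bytes": float(pred[0])}

# Run with: uvicorn main:app --port 8000, then open http://localhost:8000/docs
```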
So input one and input two — those must be the sizes of the reads files that are being input. And the genome here is mouse, mm10, so this is one of the reference genomes that's available as reference data. It's not arbitrary — it's not just the genome length — because genomes of similar size can have very different repeat characteristics and could potentially require different amounts of RAM or time. So it's sort of specialized for the genome that you're mapping to. Got it. Okay.

Do you see my slides? Yes. Okay. I'm not a bioinformatician, obviously, so I don't know all of these bioinformatics tools. It would be a great idea to have a code review, and at some point I'd want to show this to people whose job is bioinformatics and make sure the inputs, the outputs, and the predictions make sense — this was for illustrative purposes, without having done that yet. But that would be the idea: there are some fields that are chosen as the ones that determine the memory and CPU usage; you can enter those, get an estimate, and then compare it to the actual value to get an idea of how good your model is.

There are some current and future tasks related to this. I need to generate fresh job run data, because that research was done four or five years ago and since then we have a lot more data. The repository from the previous research doesn't have instructions or scripts for generating that data; I've messaged the owner, but at the same time I'm writing a script myself, which may actually be done by tomorrow. That will allow me to run the script on Main and collect job run data — it will be a larger dataset, because obviously several years have passed since last time.

Second, we need to run this analysis on a subset of the tools. We have thousands of tools; we don't want to analyze all of them, and it's not necessary to analyze all of them. We care about tools that run often and tools that use a lot of resources — some combination of those. Based on that, we can pick maybe ten or twenty tools, collect the data, train the models, and provide the endpoints. The problem with the endpoints is that each of these tools has different parameters, so if you have hundreds of endpoints you have hundreds of different parameter sets, and service maintenance becomes difficult. So it's important that we only select a few tools and provide endpoints for those when it comes to resource prediction.

Another current and future task is the prediction of CPU requirements. We do collect some metrics on CPU usage: we know the number of CPUs used by each tool, we know the total time spent on all CPUs, and we know the duration of a job from start to end. So we can compute a CPU utilization with this formula: the total time across all CPUs, divided by the number of CPUs, divided by the duration. That gives us some idea of the CPU utilization. Then, if we know the number of CPUs and the CPU utilization, we could come up with heuristics or rules of thumb for how many CPUs a job needs.
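The CPU utilization calculation just described, written out as a small sketch; the argument names are descriptive placeholders, not Galaxy's actual metric identifiers.

```python
def cpu_utilization(total_cpu_seconds: float, num_cpus: int, wall_seconds: float) -> float:
    """Average fraction of the allocated CPUs kept busy over the job's duration."""
    return total_cpu_seconds / num_cpus / wall_seconds

# Example: 6,000 CPU-seconds spread over 8 allocated CPUs for a 1,000-second job
# gives 0.75, i.e. on average 75% of the allocated cores were busy.
print(cpu_utilization(total_cpu_seconds=6000, num_cpus=8, wall_seconds=1000))
```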
The end goal is obviously to select that subset of tools, train the models, and verify that we can predict memory and CPU usage, and then write it up for submission to a conference. I think that's about it for now — any questions?

Hi Kevan, this is Michelle. Thank you so much for your work — I look forward to digging into it more. When I start looking at other tools, or at similar or different input files that I could use with the same tools you're using here, can I expect to see about the same R squared values? I know your R squareds are pretty high, and from my limited understanding of R squared, anything above 65% is pretty good, so getting into the 90s is exceptional. I just wonder, in your findings, was that pretty much the norm — 80%, 90%?

It was, but I have to add a caveat: this dataset isn't very big. Sorry, once again? I think for a larger dataset we would get a more realistic R squared, and I also have to look into how a p-value could be calculated, so we can say that this isn't due to chance — that's another thing. Got it. But right now my main goal at this stage is debugging the script and making sure it works correctly, because if I want to run this on a large amount of data, which may take hours or days, I want to be sure everything works — so my goal was to focus on two small files and two models and make sure everything works. I'm guessing that if we run this on a larger dataset, we'll get somewhat lower R squared values. We'll see. Got it, thank you.

Somewhat related to that: I noticed you're using random forest regression, which is super powerful and can capture arbitrarily complicated functions. I would hope, though, that there's a relatively smooth curve underneath. I'm wondering if you've explored simpler polynomials, or neural nets.

Yeah, I'll probably try every regressor that scikit-learn has, so I'm assuming it'll be twenty models or more. And this is going to be iterative: maybe at first I'll pick the hyperparameters over a wider range, see which part of the range does better, then fine-tune the hyperparameters, focus on that, and run it again — so I'm guessing there will be multiple iterations over the models and the hyperparameters. The rule of thumb I've heard is that if the fields you have are of different types and scales, tree-based models usually work better, and if they're homogeneous, neural networks usually work better. So we'll see — I'll obviously try neural networks as well, but my expectation is that the tree-based models will perform better. I think on Kaggle most of the winners are tree-based models, whereas on ImageNet almost all of the winning models are neural networks. We'll see whether that intuition holds. Okay, I very much look forward to the results. Thank you.

Another thing: when you make this sort of recommendation, there isn't really a way — and it's not even smart — to use the predicted optimum value directly, right? There's a cost to just handing the job the predicted value: suppose in 50% of the cases the actual usage is below or above it, and if it's above, the tool may crash after running for a while. Do you have any concept of how to turn the predicted values back into values you can actually use as the job resources? What we usually do is something like a 95th percentile — the upper 5% are going to fail and we run them again. What's your feeling on that?
There are a couple of ways I can think of to address that. One is that if the service just returns a single value, whoever is using it can pad it a bit based on their requirements. The other is that we provide a range, or a probability distribution — you might say that 95% of the time the requirements are below this value — and that way it gives whoever is using it the confidence to decide the best course of action.

I mean, luckily you have all this usage data, so whatever rule you decide on, you can go back and simulate how it would behave and what percentage of jobs would fail if you applied it. I didn't understand the R squared topic — is that what you did there? You hold out half the data and check with the other half?

Yeah, it basically compares the predicted versus the actual values, and it's a measure of how well the prediction matches the actual value. If it's one, we have a 100% match; if it's zero, it's completely off. Yeah. So what's the ballpark number — if you just train on half the dataset and use that to predict the other half, how far off is it? I get a very good R squared, but as I mentioned, this is not a full-size dataset, so I need to generate a larger dataset to see what the R squared value would be there.

But is this the famous holdout training, or is it something else? I don't know anything about ML. Say I have a dataset of 10,000 examples of hisat2 memory usage or CPU usage. I set aside 2,000 for testing and take 8,000 for training. For training there's something called cross-validation: I break the 8,000 into ten parts of 800 each (8,000 divided by 10 is 800). You train on nine of those ten parts and test on the tenth, then you alternate — train on a different nine parts and test on the new tenth part — and you repeat this ten times. So you get ten evaluations on the same data; you average them, and that's your training R squared. Then you pick the best hyperparameters for your model, and you end up with a model with those specific hyperparameters. Finally, you run the test data, which was not used for training at all, and compute the R squared for that as well. That's the process.

So do you think we can eventually — I mean, you mentioned that your goal is to write a paper, but do you think we can actually use this, for instance, to inform the selection of destinations with Total Perspective Vortex? I don't see why not. If this service is available, I imagine Galaxy, when it's about to run one of these twenty tools, makes a call to the service, gets a prediction of the CPU and memory usage, and makes a scheduling decision based on that. And if the service is down, it just goes and does whatever it was doing before, so adding the service to the process shouldn't affect anything or bring the whole Galaxy down. Call the service if it's available and use it; if it's not available, do whatever you did before. So it's safe to use, and it could potentially increase throughput.
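To make the holdout-plus-cross-validation procedure described a moment ago concrete, here is a small sketch with made-up numbers (10,000 examples, 2,000 held out, ten folds on the remaining 8,000) using scikit-learn; it illustrates the procedure, not the project's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for 10,000 job run records with four numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10_000, 4))
y = X @ np.array([3.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.05, size=10_000)

# Hold out 2,000 examples for the final test; tune on the other 8,000.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2_000, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# 10-fold cross-validation on the 8,000 training examples: ten fits, each on
# 7,200 examples and scored on the remaining 800; the average is the training R^2.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")
print("training R^2 (10-fold CV average):", cv_scores.mean())

# Final evaluation on the untouched 2,000 test examples.
model.fit(X_train, y_train)
print("test R^2:", r2_score(y_test, model.predict(X_test)))
```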
The more immediate application would be what Enis was talking about. I think Enis and Keith are trying to understand how much it costs to run different things on the cloud, and they did something like a grid search: they tried running a tool on different hardware combinations and looked at, you know, if they increase the number of CPUs and so on, how the runtime comes down, and all of that. So this could allow them to avoid doing such an extensive analysis to find out what they need. That's another thing we discussed at GCC.

I guess there aren't any more questions. One other thing we wanted to discuss was setting up a once-a-month ML meeting. Jeremy suggested it at GCC, and I think it's a great idea, because there are different people working on ML applications and we don't know what every other person is doing. If we have a monthly, or maybe bi-monthly, meeting, people can share what they're doing, so everyone knows what everyone else is doing — exchange ideas, possibly collaborate. And there are a lot of people not directly in Galaxy who are interested as well; I think Stephen Shank from Temple University said he was interested and wants to attend. And there are all the folks at Penn State, Hopkins, Oregon, Cleveland, Freiburg — they showed interest — and also some people from outside who wanted to attend. I could send a Doodle to pick a date. I didn't know that the community call was this week; maybe we can have the monthly ML meeting on a different week so it doesn't overlap — if anybody has any thoughts on this. Sounds like a great idea. Okay, yeah, thanks. I'll send the Doodle to everyone; I have the names of the folks I know are interested, but if you know someone else who is interested, we can share it with them as well. Maybe, if you add this to the primary working group calendar, people might discover it as well. That's a good idea, yeah — I will do that once the date and time are decided. Thanks.

Well, thanks so much, Kevan, for the awesome presentation, and everyone for their questions and discussion. The next community call is going to be in two weeks, on August 18, and the topic for that call is going to be the outcomes of the Outreachy projects, so that will be the Outreachy mentors and interns presenting. We'll also be sure to give a bit more notice ahead of time so that it's top of mind for folks. Thanks to everybody for attending, and we'll see you in two weeks. Bye.