Hello, everyone. My name is Kayvon Kamali. I'm a member of the Galaxy team at Penn State University. I'm going to be presenting Galaxy ML. Let me share my screen.

Okay: Galaxy ML, an accessible, reproducible, and scalable machine learning toolkit for biomedicine. This is work done by members of the Galaxy team at Oregon Health & Science University and the University of Freiburg. I'm a new member of the Galaxy team, and I'm working on some extensions to this work regarding market basket analysis and deep learning.

So this is the overview of the talk. I'm going to be talking about machine learning in biomedicine, the Galaxy ML toolkit, some use cases and results, and a summary and future work.

ML is essential to make sense of high-dimensional data sets, and we have many of those in biomedicine: genomics, proteomics, and imaging. There are, roughly, two types of machine learning algorithms, supervised and unsupervised. Supervised ML is when the data is labeled, and unsupervised is when the data sets are not labeled. In supervised machine learning, we use the labeled data set to build a model for prediction. If the labels have a continuous value, it's a regression problem. If they have a discrete value, it's a classification problem. In unsupervised ML, the data set doesn't have labels and we're trying to find patterns in the data set. There are various techniques, like clustering, association rules, and dimensionality reduction.

There are many, many applications of machine learning in biomedicine. Here are a few: developing models for drug metabolism rates, using brain images, genotype-phenotype association, prediction of protein structures, and so on.

Using machine learning in biomedicine is challenging, and there are multiple challenges. One is tool integration, because a successful ML application spans biological analysis tools and ML tools, and these are all used for feature engineering, model building, and evaluation.
So we have to have both kinds of tools accessible. Also, ML tools must be scalable and reproducible, and workflow engines, software package managers, and job schedulers are needed for scalability and reproducibility. An integrated software solution is needed to address these challenges: one that makes machine learning accessible to bioinformaticians with limited programming knowledge and connects machine learning tools with biomedical analysis tools and a scalable computational workbench.

Our solution is the Galaxy ML toolkit, a toolkit of machine learning tools for the Galaxy platform. The Galaxy platform is a user-friendly, web-based computational workbench used by tens of thousands of scientists for biomedical and bioinformatics data analysis, and it offers nearly 8,000 tools. Galaxy ML allows the Galaxy community to incorporate machine learning into their analyses. It's installed on over 80 servers worldwide, and the tools have been run roughly 10,000 to 12,000 times on the US and EU servers.

So how do we address accessibility? Galaxy's web-based user interface allows anyone to use complex analysis tools and workflows without detailed knowledge of workflows, software dependencies, or job schedulers. Those are all abstracted away, so people with limited programming experience can use these tools and functionalities. Galaxy also enables iterative development of machine learning, because feature engineering, feature selection, and hyperparameter tuning are iterative in nature.

This is an overview of Galaxy ML. In figure A, box number one is how we define a learner. Box number two represents the input; we train and evaluate the learner and visualize the performance. Figure B shows an ensemble method tool in Galaxy. That's the user interface: you just select a bunch of values from drop-downs. Very simple. And figure C is a workflow, which is the sequence of actions you do in your analysis.
This is obviously a very simple one, but everything, the tools, versions, parameters, and so on, is stored, and this makes reproducing results very easy.

As for scalability, large machine learning analyses require building thousands of models, and we can use Galaxy's workflows to execute large-scale analyses. The workflow system distributes jobs across compute clusters to be run in parallel. As for reproducibility, all the parameters, tool versions, and workflow versions are saved in a workflow, and this allows the results to be reproduced easily. There's some effort to allow reproducibility outside of Galaxy by making Galaxy workflows compatible with the Common Workflow Language. This is a work in progress.

On the Galaxy ML implementation: there are libraries for preprocessing, modeling, ensembling, and evaluation. Scikit-learn is a very popular Python library for machine learning. Scikit-rebate is for feature selection. Imbalanced-learn is for when our data sets are imbalanced and we need sampling techniques to come up with new data sets. XGBoost and LightGBM are extreme gradient boosting and gradient boosting libraries. Mlxtend is for meta-ensembles, or stacking as it's called, and also for association rules. And Keras is a very powerful deep learning library on top of TensorFlow, which is a Google library. Keras has a very nice API that is easy to use.

There are multiple use cases. The first one is the Penn Machine Learning Benchmark. It has 276 data sets: 164 classification, either binary or multiclass, and 112 regression data sets. There was a previous analysis that compared various classification and regression models. What we did was create 15 models for classification and 14 for regression, and we used hyperparameter optimization, for a total of 4,028 models. Basically, we specify a range for a parameter, try different values in that range, and find which value results in the best model.
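As a rough illustration of the range search just described, here is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data set, the choice of estimator, the candidate values, and the 5-fold setting are all assumptions made for this example; they are not the settings used in the benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the talk used benchmark data sets.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search ranges: every combination of these values is tried.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 3]}

# Each candidate is scored by cross-validation; the best one is kept.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

A grid search tries every combination, so its cost grows quickly with the number of parameters and candidate values, which is one reason this kind of analysis benefits from running jobs in parallel.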
Evaluation was done using 10-fold cross-validation. For classification problems we use the F1 score, which is the harmonic mean of precision and recall, and for regression we use R-squared, which is a standard way of assessing regressors.

So this is the result. On the left-hand side, you can compare various techniques. The way to interpret this is to take a method on the wins row and compare it with a method in the losses column. For example, XGBoost is better 38% of the time compared to gradient tree boosting, and gradient tree boosting is better 11% of the time compared to XGBoost. If you add these two numbers up, that's 49%, so about 50% of the time they both perform the same on the data set. On the right-hand side is the improvement that we see from hyperparameter tuning. This is the same thing for regression, basically, in the image on top. The bottom right has the runtime: certain methods, for example extra trees, yield the same R-squared but take much longer, so they should be avoided. The results we got agree with the results in the original study, which shows that Galaxy ML can be used to solve real-world machine learning problems.

The second use case is predicting drug response. We used meta-ensembles to predict the drug response of cancer cell lines. This is a technique called stacking. The cancer cell line gene expression and drug response data set came from the Cancer Dependency Map project: 50,000 gene expression values over 1,000 cancer cell lines. So a very high number of dimensions and a very limited number of samples. There are 256 drugs. This is a regression problem, but we also binarized the data, so we could also use classification techniques. We used eight regression and 11 classification techniques. Some of them use principal component analysis to reduce the dimensionality of the data. And we obviously did hyperparameter tuning.
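The stacking setup described for the drug response use case can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `StackingRegressor` as a stand-in rather than the mlxtend implementation mentioned in the talk, the data is synthetic, and the base learners, the number of principal components, and the fold counts are all hypothetical choices.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in (assumption): few samples, many features,
# loosely mimicking gene expression data.
X, y = make_regression(
    n_samples=100, n_features=200, n_informative=20, noise=5.0, random_state=0
)

# Hypothetical base learners; one reduces dimensionality with PCA first.
base_learners = [
    ("pca_ridge", make_pipeline(PCA(n_components=10), Ridge())),
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
]

# The final estimator is trained on out-of-fold predictions of the base learners.
stack = StackingRegressor(estimators=base_learners, final_estimator=Ridge(), cv=5)

# R-squared per fold, estimated by cross-validation.
scores = cross_val_score(stack, X, y, cv=5, scoring="r2")
print(scores.mean())
```

The meta-learner only sees predictions made on held-out folds, which keeps it from simply rewarding whichever base learner overfits the training data most.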
And this is the result. This figure can be interpreted the same way. The stacking regressor with search CV is better 50% of the time compared to the linear-based one, or to gradient boosting alone. So that's the one for classification. The other one is for regression models. Again, this is very comparable to, I think this was based on, a previous work. So these are satisfactory results.

The third use case is DNA sequence analysis. We basically reproduced the deep learning models that were implemented in Selene, which is a deep learning library for biological sequence data. First, we trained an existing deep learning architecture called DeepSEA, a convolutional network that has three convolutional layers, two max-pooling layers, one fully connected layer, and one sigmoid output layer. Second, we compared the performance of DeepSEA with an extended architecture that includes three additional convolutional layers. This is the comparison of the first and second analyses between Galaxy ML and Selene. As you can see from the accuracy and the area under the curve, they're very similar.

There's a fourth use case that's currently under development, if you will. We're using market basket analysis. Market basket analysis is a technique for finding correlations between different entities according to their co-occurrence in a data set, and this can uncover useful trends in the data set. It's a two-step process. First, we find frequent itemsets in a data set, where we define what frequent means based on the support parameter. Then we generate association rules that satisfy some criteria, like confidence, lift, or conviction. Confidence is basically a conditional probability, lift is the joint probability divided by the product of the individual probabilities, and conviction is a metric that has a direction; it's not symmetric.
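To make the rule metrics above concrete, here is a minimal sketch in plain Python. The transactions and the helper names `support` and `rule_metrics` are hypothetical and exist only for this example; the formulas are the standard definitions of support, confidence, lift, and conviction, not code from any particular library.

```python
import math

# Hypothetical toy transactions; each set is one "basket" of items.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Confidence, lift, and conviction for the rule antecedent -> consequent."""
    s_a = support(antecedent)
    s_c = support(consequent)
    s_both = support(set(antecedent) | set(consequent))
    confidence = s_both / s_a          # conditional probability P(C | A)
    lift = s_both / (s_a * s_c)        # joint over product of marginals
    # Conviction is undefined (taken as infinite) when confidence is exactly 1.
    conviction = math.inf if confidence == 1 else (1 - s_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift, "conviction": conviction}


forward = rule_metrics({"A"}, {"C"})   # rule A -> C
backward = rule_metrics({"C"}, {"A"})  # rule C -> A
print(forward)
print(backward)
```

Comparing the two rules shows the asymmetry: lift comes out the same in both directions, while conviction differs between A -> C and C -> A. The frequent itemset step is left out here; in practice only itemsets whose support clears a chosen minimum would be turned into rules.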
We're using the market basket analysis offered by the mlxtend library. The algorithms are Apriori and FP-growth. We're applying market basket analysis to variations in the HIV to loop region, and the goal is to generate rules that capture associations between positions among the samples. So we will be able to present these results.

There are other unsupervised machine learning techniques. Galaxy ML supports various clustering and dimensionality reduction methods, including density-based clustering and, as we discussed, principal component analysis for dimensionality reduction.

Also, Galaxy ML supports various deep learning architectures via the Keras library, which is a very easy library to use. We're producing tutorials on feedforward neural networks, recurrent neural networks, and convolutional networks, to allow any user of Galaxy ML to learn how to use deep learning in Galaxy. Convolutional neural networks are usually used for image and video processing, recurrent networks are usually used for sequential data, whether it's time-based or order-based, and feedforward neural networks are used for more traditional classification.

So, results in summary. The use cases showed that Galaxy ML tools are general and powerful enough to support realistic use cases. Analyses are reproducible: workflows, which store all the parameters, tool versions, and all the steps, allow us to reproduce results. Galaxy ML is accessible: Galaxy provides a web-based UI, and people without programming knowledge can do complex analyses. And Galaxy ML is scalable: you have a job scheduler that distributes jobs across one or more compute clusters, allowing them to run in parallel, and some of them can even run on GPUs. So that's great. And Galaxy ML is extensible.
You can add extra libraries to Galaxy ML, and then they become available for the whole community to use, and all the features of Galaxy, like workflows, scalability, accessibility, and reproducibility, are there as well.

So, thank you very much. And please let me know if you have any questions.