Okay, hi everyone. My name is Wendy Wang, and today I want to share with you my little experiment on applying AutoML to biological data sets to predict clinical outcomes. Automated machine learning, also known as AutoML, has a lot of definitions. According to Wikipedia, AutoML is the process of automating the application of machine learning to real-world problems. This basically means that AutoML should be able to take a raw data set and come up with an optimal, deployable machine learning model.

As a computational biologist, I have seen the explosion of biological data in the past 10 years, and biology has really become an information science. According to a paper published in 2015, "Big Data: Astronomical or Genomical?", it is predicted that in 2025, even though the acquisition of genomic data will still be less than that of astronomy, the storage of genomic data will be two to 40 times higher than astronomy data. And we are not lacking statistical methods for analyzing all sorts of biological data: there are almost 700 statistical packages hosted on Bioconductor, which is an open-source software platform.

So I decided to carry out a little experiment to learn more about these AutoML algorithms: to see whether we can use AutoML to quickly gain insights without knowing much about the data set itself, whether feature selection can help AutoML optimize toward better algorithms, and finally, whether very generic explainable AI tools can help us gain insight.

Before I start, I want to talk a little bit about what is different about biological data sets compared with the data sets AutoML, which is supposed to be general, is mostly used to optimize, which come from large companies. One huge difference is that biological data sets usually have far more features than data points.
It is very difficult to collect samples, especially human samples, and it is even more difficult to collect cases compared to controls, which often leads to very unbalanced positive and negative classes. Here I am showing the central dogma: genomic DNA gets transcribed into RNA, and RNA gets translated into protein. For a typical RNA expression data set, we are measuring the abundance of about 30,000 non-redundant human mRNA sequences, of which about 20,000 are protein-coding.

So I decided to look at a gene expression data set studying preterm birth. This is data from Heng et al., who looked at gene expression in pregnant women at two time points. Preterm birth is when the baby is born before 37 weeks, and it accounts for one in 10 births in the U.S. The cause of preterm birth is largely unknown, but the health outcomes are quite severe: the babies sometimes need to stay in the hospital for over six months. The authors used a simple multiple logistic regression with variable selection. They did not have a test set, so they reported a 5-fold cross-validation AUC of 0.703 for the gene expression data alone at time point 1.

I set up my study as follows. I am going to look at three AutoML libraries: H2O AutoML, AutoGluon, which is based on MXNet, and TPOT, which is based on scikit-learn. I set the time limit to only 20 minutes for these algorithms to optimize, and I am using MLflow, developed at Databricks, to keep track of the parameters and results. Then I am also going to look at whether feature selection can help. I am using fast correlation-based filter (FCBF) feature selection, which looks for features that have high correlation with the target but little correlation with each other.
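The FCBF idea can be sketched in a few lines. This is my own minimal illustration, not the library implementation: real FCBF ranks features by symmetrical uncertainty, while this sketch substitutes plain Pearson correlation to stay self-contained, and the gene names, toy data, and threshold are all made up.

```python
# Sketch of the FCBF idea: keep features that correlate strongly with the
# target, then drop any feature that is more correlated with an
# already-selected feature than with the target (redundancy removal).
# Real FCBF uses symmetrical uncertainty; plain Pearson correlation is
# used here only to keep the example self-contained.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def fcbf_like(features, target, threshold=0.7):
    # Rank features by |correlation with target|; keep those above threshold.
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    candidates = sorted((name for name, r in relevance.items() if r >= threshold),
                        key=lambda name: -relevance[name])
    selected = []
    for name in candidates:
        # Keep the feature only if no already-selected feature predicts it
        # better than it predicts the target.
        if all(abs(pearson(features[name], features[s])) < relevance[name]
               for s in selected):
            selected.append(name)
    return selected

# Toy example: gene_a tracks the target, gene_b is a redundant copy of
# gene_a, gene_c is unrelated noise.
target = [0, 0, 1, 1, 0, 1, 1, 0]
features = {
    "gene_a": [0.1, 0.2, 0.9, 0.8, 0.15, 0.85, 0.95, 0.05],
    "gene_b": [0.1, 0.2, 0.9, 0.8, 0.15, 0.85, 0.95, 0.05],
    "gene_c": [0.5, 0.1, 0.4, 0.2, 0.6, 0.3, 0.1, 0.7],
}
print(fcbf_like(features, target))  # → ['gene_a']
```

Only gene_a survives: gene_c fails the relevance threshold, and gene_b, though highly relevant, is perfectly correlated with the already-selected gene_a and is dropped as redundant.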
I was also going to look at the top differentially expressed genes between preterm and full-term, but I ran out of time, so I did not do that part. So briefly, the design: I take the raw data, and then I feed either all features or the FCBF-selected features to the four algorithms, including H2O GLM as my baseline model.

I tried to do as little preprocessing as possible. I downloaded the normalized data from the Gene Expression Omnibus at NIH, and I am only looking at time point 1 for this experiment. That gives close to 30,000 genes with only 165 samples, of which about one third are preterm. I then split the data into training and test sets at a 70:30 ratio. I quickly did a principal component analysis to see whether there are outliers that I need to filter out, or obvious batch effects, and there do not seem to be any; the colors represent the preterm and full-term mothers.

This is example code for running my little experiment using H2O AutoML in R; TPOT and AutoGluon are run through their Python APIs, but the setup is pretty similar. I set the MLflow experiment name to "H2O AutoML" because I started with H2O for my experiment, and I am actually using the same name for TPOT and AutoGluon, so they are tracked under the same experiment. For each run, I run the algorithm and keep track of the parameters. For H2O AutoML it is basically one line of code: I give it the features as x, the target as y, and the data frame name. Since my data is unbalanced, I set balance_classes to true, so when H2O does cross-validation, it will automatically oversample my cases so that cases and controls are balanced. I also set the number of folds for cross-validation, in this case five. I run this, and then I log all the parameters I used: the algorithm, H2O AutoML, and the number of cross-validation folds. And then, oops, one second.
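The two data-handling steps just described, a 70:30 stratified split and oversampling the minority class the way balance_classes does, can be sketched in plain Python. This is my own illustration of the idea, not H2O's actual implementation, and the 165-sample label list is a toy stand-in for the real data.

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    # Split indices class by class so the roughly one-third preterm
    # fraction is preserved in both the training and test sets.
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test += idx[:cut]
        train += idx[cut:]
    return sorted(train), sorted(test)

def oversample_minority(indices, labels, seed=42):
    # Duplicate minority-class samples until both classes are the same
    # size -- roughly what balance_classes=True does during training.
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target_size = max(len(v) for v in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced += idx
        balanced += [rng.choice(idx) for _ in range(target_size - len(idx))]
    return balanced

# Toy stand-in: 165 samples, one third "preterm", like the real data set.
labels = ["preterm"] * 55 + ["term"] * 110
train, test = stratified_split(labels)
balanced = oversample_minority(train, labels)
print(len(train), len(test))  # → 116 49
balanced_labels = [labels[i] for i in balanced]
print(balanced_labels.count("preterm") == balanced_labels.count("term"))  # → True
```

The duplicated rows exist only in the training folds; the test set is left untouched, otherwise the test metrics would be biased.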
Then I also keep track of the top-model metrics. Here I am tracking the training AUC and the training precision-recall AUC, and I log which model came out on top. I also keep track of the performance on the test set. Lastly, I save the model: I save the top model from H2O AutoML and log it as an artifact. I did similar things with TPOT and AutoGluon.

Then I can open up the MLflow UI to look at all the runs I have performed. As you can see, some of my runs have failed; this is a nice way to look at everything I have done. I can see that for H2O AutoML, my top model is XGBoost for the run without any feature selection, and the top model is a stacked ensemble for the run with feature selection. With that, I can select these models and click compare, and I get this nice table with all the parameters and metrics that I have tracked.

The first thing I notice is that there is very severe overfitting going on, because the training metrics are so much better than the test metrics. Another thing I noticed is that I probably made a mistake with H2O, because the running time in seconds is much greater than the 20 minutes I set, and the same for TPOT. With feature selection, though, H2O AutoML finished within five minutes. TPOT did not finish within 20 minutes even with feature selection, but luckily, it does give me the best model found within those 20 minutes. So this is your five-minute warning. Okay, thank you. So we can see that with feature selection, TPOT does improve, I think mostly because it was able to go through all the models more efficiently. I do not want to draw very strong conclusions based on an incomplete experiment, but I have the feeling that AutoML probably does pretty well when there are a few factors with large effects on the outcome.
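The gap between training and test AUC is what reveals the overfitting. AUC itself is easy to compute by hand: it is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. Here is a small self-contained sketch; the labels and scores are made up to mimic a model that memorized its training data.

```python
def auc(labels, scores):
    # Rank-based AUC: fraction of (positive, negative) pairs where the
    # positive sample gets the higher score; ties count as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that separates the training data perfectly but barely beats
# chance on held-out samples shows exactly the kind of gap seen in the
# MLflow comparison table.
train_labels = [1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # perfectly separated
test_labels  = [1, 1, 0, 0, 1, 0]
test_scores  = [0.6, 0.3, 0.5, 0.2, 0.4, 0.7]

print(auc(train_labels, train_scores))  # → 1.0
print(auc(test_labels, test_scores))    # → 0.4444444444444444, near chance
```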
But in a biological system, many factors contribute to the clinical outcome, and each one has a small effect. I am not sure how AutoML deals with this, and whether it can optimize efficiently toward the best model. I think feature selection does help, at least in reducing the time needed to find the optimal model, and it should also help in reducing overfitting, although we still see overfitting in my little experiment.

I also think that AutoML is very useful and convenient, but maybe we need some domain-specific features added to it. For example, feature selection based on prior biological knowledge, like pathways; TPOT has that feature built in, so I am going to test that later. Also custom metrics: for example, if we are more interested in the sensitivity of picking up preterm cases, we need to optimize for that. As for explainable AI, I have not spent too much time on it, but from what I can see, it is mostly focused on giving out individual feature importances. In biology we are more interested in interactions between features, or whether a pathway changes with a certain phenotype, so we probably need to go to more specialized software for that. And lastly, the stability of the AutoML algorithms: because we have this small-sample problem, I am not sure whether, if I change my input a little bit or change my random seed, I get a completely different model. That needs to be investigated as well.

So let me quickly show you the packages I used. When I listen to other data science talks, this is my favorite part, seeing what tools people have used so that I can try them out. These are the ones I used. I just want to point out that Marp is what I used for generating these slides, and I used Mermaid to produce the flow chart you saw earlier.
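On the custom-metrics point above: sensitivity for picking up preterm cases is simple to define. A minimal sketch, with made-up predictions, of the metric one would plug in as an optimization target:

```python
def sensitivity(labels, preds, positive="preterm"):
    # Sensitivity (recall): of all true preterm cases, how many did the
    # model catch? This is the number to optimize when missing a preterm
    # birth is costlier than a false alarm.
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    actual_pos = sum(1 for y in labels if y == positive)
    return tp / actual_pos

labels = ["preterm", "term", "preterm", "term", "preterm", "term"]
preds  = ["preterm", "term", "term",    "term", "preterm", "preterm"]
print(sensitivity(labels, preds))  # → 0.6666666666666666 (2 of 3 caught)
```

Note that the false alarm on the last sample does not lower sensitivity at all; that is exactly why a model optimized for this metric differs from one optimized for plain accuracy or AUC.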
And lastly, I just want to point out that I keep a list of AutoML libraries on my GitHub, so please feel free to check it out or contribute to it. I would like to thank all the organizers for organizing such an amazing event, and thank you for your attention. Wendy, thank you very much. Thank you. That was a fantastic talk, and you were exactly on time. I just wanted to point out to everybody that there is an "ask a question" button on the right-hand side, or it is fine if you just want to put the question in the chat. Do we have any questions? We certainly have time for one question, maybe even two. If we do not have time for any questions, then please, everybody, have a think; you can add your questions to this session even after it has finished. And Wendy, you will be available to answer any follow-up questions people have in Slack, won't you? Yeah, I will be checking Slack. Thank you. Okay, thank you very much. That was fantastic.