 Awesome, awesome. Let's just jump into it then. We're going to be talking about building some machine learning models with EHR data using the forester package. So a little reintroduction to forester, which some of you might have heard of before. This package came out about two years ago, but unfortunately, because of other responsibilities and in some cases new opportunities, the authors weren't able to maintain the package and it kind of fell away. But fear not, because it's back, new and improved. The original organizer, Anna, got some new people, Hubert, Adriana, Patrick; they came back, they took over the project, they rebuilt the whole thing from scratch. They learned from their predecessors, they learned from mistakes, they learned from what else was out in the ecosystem. And I think it's a really cool thing that you should reinvestigate if you haven't already. So what is the forester package? forester is an AutoML tool for R, for tabular data and regression and binary classification tasks. It wraps all of that up into one nice little function, the train function. This function gives you a data check, it goes through the preprocessing for you, and it checks the initial data set to see whether models can be trained on it at all. It trains five tree-based models: you've got your decision trees, your random forests, your XGBoost, your CatBoost, your LightGBM. What else could you want? You've got random search involved, you've got Bayesian optimization, and at the very end of this bad boy, it spits out a nice evaluation of them all and gives you a ranked list of how they performed against each other and which one you should move forward with. So that's the what; how about the why? Well, it's really great for beginners who want to just train their first models, who want to get in there and just start playing around. It's great for researchers who are already doing lots of analysis and want to add some machine learning in. 
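To make that concrete, here is a minimal sketch of the workflow, assuming the `train(data, y)` interface shown in the package's README; the `patients` data frame, the `readmitted` target, and the output field name are my placeholders, so check the docs for your version:

```r
library(forester)

# One call runs the data check, preprocessing, and training of the
# five tree-based model families (decision tree, random forest,
# XGBoost, CatBoost, LightGBM).
# `patients` is a hypothetical tabular data frame with a binary
# `readmitted` outcome column; any tabular data works the same way.
output <- train(data = patients, y = "readmitted")

# Inspect the ranked comparison of the trained models
# (the exact name of this field may differ by package version).
print(output$score_test)
```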
And if you're already a machine learning expert, it's pretty cool too: the idea is that you can quickly create new models and explore random data sets without all the code required for pre-analysis. There are a lot of tree models here, and you might be thinking, hey, what's with all the tree models? Well, good question, I say to you, imaginary person. Tree models can be very versatile. In some ways, they can address the overly smooth solutions that come out of deep learning models. And while they do have their own issues, like overfitting, most of that can be addressed with bagging and boosting. Also, tree models are really easy to understand. You don't have to have a machine learning background to understand how they work. That's why they're really common in medicine: doctors are already using them, they already get them. So when you're pitching it, you're not trying to explain to some physician on a clinical trial team how your deep learning TensorFlow model works; you can just move on to the results of the model. These have been benchmarked against other approaches; Leo has a great paper. I don't have time to get into it here, but I'll definitely leave that for you to read if you're interested in how these things benchmark. And where can you get it? You can get it right here on GitHub; you can go to GitHub, install this bad boy, and get started right away. So let's do that. Let's get started with some EHR data, R style. There are lots of packages to talk to your EHRs and manipulate EHR data. I've never actually been able to get the ehr package to work; if anyone else has, congrats to you. Let's just use dxpr because it's alphabetical, it's at the top of the list. 
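Since forester installs straight from GitHub rather than CRAN, the install is a one-liner; this sketch assumes the repository lives under the ModelOriented organization, so check the README for the current location:

```r
# forester is not on CRAN; install the development version from GitHub.
# install.packages("devtools")   # uncomment if devtools is missing
devtools::install_github("ModelOriented/forester")
library(forester)
```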
So we pull in dxpr, and we're going to take a look at some comorbidity data from an EHR data set. I'm not going to go into too much detail on how to do that, because this is not a dxpr tutorial. But when you're done getting the data you want, it just has to be tabular. It doesn't matter what data set it is: it could be EHR data, as in the example here, it could be transcriptome data, it could be any kind of data that fits in a table, and this package is going to work. So after these functions, I'm left with this nice little table. Great, fantastic, time to go to forester. Let's go. So the first thing you want to do is a little bit of a data check, with the great check_data function that's built right into forester. If you've done any machine learning before, you know that typically you start with an exhausting exploratory data analysis: you're writing scripts, looking for correlations, removing things that overfit; it takes a long time. forester's got this built in, which is actually pretty fantastic. So this function, whoops, that's all right, jumped ahead a little bit trying to use my mouse wheel. Don't use your mouse wheel on PowerPoint, guys. So anyway, the check feature here is going to look at static columns, it's going to look at duplicate columns, it's going to work out how to handle missing fields, it's going to do a dimensionality check, it's going to look at correlated features using Cramér's V and Spearman's rank correlation coefficient, and it's going to look at outliers and target balance, to make sure each class you're trying to predict has a similar enough number of observations in your data set. Again, things that normally take so much time to do, already done. All right, so we've got that part done. So let's jump to the preprocessing, the train function. 
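To give a feel for what that correlated-features check is computing for pairs of categorical columns, here is a small base-R sketch of Cramér's V; the `cramers_v` helper is mine, not part of forester:

```r
# Cramér's V: association between two categorical variables, scaled to
# [0, 1]. forester's data check uses this for categorical pairs and
# Spearman's rank correlation for numeric pairs.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic, without continuity correction.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

# Two perfectly associated factors give V = 1.
x <- rep(c("a", "b"), each = 10)
y <- rep(c("yes", "no"), each = 10)
cramers_v(x, y)  # -> 1
```

A column pair with V close to 1 is redundant, which is exactly what the check flags for removal.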
So we have this train function that we talked about earlier; everything's wrapped up in it. It's going to run the data quality checks again for you, though prechecking yourself first is always a good idea. It's going to remove static columns, it handles binarization of the target into classes, and missing values are handled through the mice algorithm; many of you might have used mice as a standalone R package. It's going to remove highly correlated features in the data set, it's going to remove ID columns, and it's going to do feature selection with the Boruta package algorithm; that's another package you might have used already in your other pipelines. And then it saves all the stuff you need and deletes the things you don't. So it's really good. And in this case, if you want something really quick, we're just taking the cancer column of this data set as the target. You can add Bayesian optimization and random search evaluations, of course, but you may not want to do that at first, because Bayesian optimization takes time; that's the most expensive component of most of these kinds of models. So let's take a quick look; we'll run this, we'll see what we get, we get some models. Now in this case, ignore this output: the F1 scores and accuracy are ridiculous, because it turns out that when I created my data set, it was highly correlated. But you get the idea; with your data set, it won't be highly correlated. And you can look at this really fast and get an idea of what's there. And then you can play with tuning all those little hyperparameters, right? You can add Bayesian optimization, you can add random search evaluations, get different versions of those models based on that fine-tuning, and hopefully improve your modeling output. 
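A tuned run might look something like the sketch below; the argument names `bayes_iter` and `random_evals` are assumptions based on the forester docs and may differ by version, so check `?train`:

```r
# Hypothetical tuned training run on the data from the talk.
# `ehr_table` stands in for the tabular comorbidity data; the target
# is the binary `cancer` column mentioned above.
tuned <- train(
  data         = ehr_table,
  y            = "cancer",
  bayes_iter   = 10,   # Bayesian optimization iterations (the slow part)
  random_evals = 20    # random search evaluations (comparatively cheap)
)
```

Leaving both at zero on a first pass gets you baseline models quickly; turning them on afterwards is the fine-tuning step described above.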
And this gives you a ranking; it gives you an idea of which models worked best and where to go, and you can start implementing. But that's obviously not where you want to stop. You want to know, hey, how was my model deciding things? What's going on with my model? How do I explain to somebody in medicine which features are being used? There are two ways to do this. One is with the DALEX package. It's really useful for multiple types of modeling, not just forester; it works with other modeling packages too. A couple of lines of code can give you some really nice plots that help you explore both the model comparison and the features used by different models. So you can take a look at what's driving your ranger models, what's being used in your XGBoost model. But if you're not the kind of person who wants to be playing around with code and all the options and features in your data set, forester has a built-in function called report. It will generate a report that contains the ranked list of the best models, grouped plots comparing the best models, and the data check report we did at the beginning; that gets thrown in there too. It can be a really quick way to explore your data and your models and figure out, hey, what's the jam here? What's going on? Ultimately, that's just how easy it is to use. There are other AutoML packages out there for R, and I know a lot of people in the community probably use Python for machine learning. I know I do a lot of my machine learning in Python, but with my groups and the stuff I teach, I teach a lot of R, so I do think this is a really robust, easy way to do it in R. And what makes it better, and "better" is a strong word, what makes it something I promote more than the other AutoML packages out there is this data check and pre-processing step at the beginning. 
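The report path can be sketched in one call; the exact argument names here are assumptions based on the forester docs, so check `?report` in your installed version:

```r
# Generate the bundled report: ranked model list, comparison plots,
# and the initial data check, all in one document.
# `tuned` is the train() output from earlier; the output_file argument
# name is an assumption -- verify against ?report.
report(tuned, output_file = "forester_report.pdf")
```

For the DALEX route, you wrap a trained model in an explainer and plot feature importance from it; the DALEX documentation at modeloriented.github.io walks through that pattern for arbitrary models.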
Because when I teach my students, my PhD students, the postdocs I'm training, the biggest barrier to entry for them is this pre-processing and data checking. They can learn machine learning algorithms. They can learn TensorFlow. They can learn support vector machines. They can learn random forests. They can get their heads around that pretty easily with a little bit of reading and some example code. But then when they get down into it, the thing that stops them is the pre-processing. And this really helps them get around that quickly and jump right into the modeling. So I really like it for that reason. And I'll just quickly end on a little plug. For those who know me, I'm involved in the Bioconductor community. I'm one of the organizers for the Bioconductor conference coming up in August. It's not too late to sign up for that. It's a hybrid format, so you can join us in a nice little Zoomy space like the one we're in right now, or you can come join us in Boston and we'll hang out. We've got workshops, package demos, all that good jazz. You can follow me on Twitter if you want to bug me or ask me any questions about this kind of stuff. And I also live stream my informatics work on Twitch. It's mostly transcriptome stuff, spatial, CITE-seq; it may not be of interest to everybody here, but if you want to learn how to do that, you can come join me over there as well. And I sped through that pretty quick, but for 10 minutes, not bad. Awesome. That was very efficient, Wes. Thank you so much for your talk. And we have a couple of requests just to get a link to your slides whenever you get a chance. And then we have a question from, let's see, David: do the evaluation metrics include the Matthews correlation coefficient? They do not. Out of the box, they do not. All right. Any other questions? We have about another minute before our break. How do you compare with tidymodels? 
I find this easier to use than tidymodels. tidymodels actually uses some of the same backend R packages to do some of the modeling, so from that perspective, the outputs of the models tend to be very similar, especially with the hyperparameter tuning. That being said, I like the pre-processing better on this one. Excellent. Okay. Well, this concludes the first part of the day here for R/Medicine, this final day. So we'll go ahead and take a break now. Wes, thank you so much again. Excellent presentation. And we are coming back at 36 after the hour for our first panel. So 36 after the hour, we'll be back here.