Hello, and I'm bringing our next speakers to us, Neil and Kyle, who are going to talk about a Bayesian hyperparameter tuning algorithm for healthcare models built with sparklyr. It's a video, so as the video plays, you can ask questions in the chat, and when we come back, we'll do the rest of the questions then.

Good evening. My name is Kyle Armstrong, and we also have Neil Dixit here with us. We're from the Advanced Analytics team at Independence Blue Cross. Today we'll be presenting a Bayesian algorithm we developed in R, which we use for tuning Spark machine learning models.

Machine learning models offer an amazing opportunity to learn complex relationships from data. Many of these models require setting non-learned parameters known as hyperparameters. It's been shown that these settings can greatly affect model performance, but for even the simplest of models, finding an optimal combination can be intractable. At Independence, we wanted to ensure that these settings were being optimally chosen across various projects and models, so we created a model-agnostic Bayesian approach using R and Spark to search hyperparameter spaces and maximize the chance of returning a high-performing model.

Let me briefly go over a roadmap for today's talk. We'll introduce Independence Blue Cross, discuss a little bit about our analytics infrastructure, go through some algorithm motivation, highlight the algorithm overview, review some findings, and finally present some closing remarks.

Independence Blue Cross is a regional health insurance company headquartered in Philadelphia, Pennsylvania. In our local region, we insure about 2.5 million people. Nationally, we also serve many employers, including Comcast, NBCUniversal, Urban Outfitters, and QVC. Our vast network of insured individuals means that we work with hundreds of hospitals and tens of thousands of health care providers. At Independence, we work with a diverse set of data.
Some of the data that we have includes demographics such as age, gender, location, market segment, and product. Most of our data comes from claims, which includes cost and utilization. We also utilize chronic conditions, procedure and diagnosis code roll-ups, and risk scores. We have pharmacy data that includes NDC and GPI codes and pharmacy costs. We have lab data that comprises LOINC codes and test results. And we have data around benefits, such as coverages and costs.

Let me briefly go over the analytics infrastructure that we have at Independence. We have an R box that has 32 cores, 188 gigabytes of RAM, and an NVIDIA P4 GPU. We also have a Spark cluster that has 10 data nodes, 320 cores, and 3.7 terabytes of RAM. We use sparklyr to interface R with the Spark cluster.

Here's a brief example to explain the algorithm motivation. Again, machine learning models have many tunable hyperparameters. For instance, sparklyr's random forest has seven. Some of the hyperparameters include the number of trees in the forest, the data sub-sampling rate, the max depth of the forest, and so on. We list a typical number of levels for each of these hyperparameters; a level represents a unique value that you would set for the hyperparameter. For instance, you might try 20, 50, or 100 trees for your random forest. The total number of combinations is the product of these levels, and you can see here that you have an estimated 134 million different combinations to search over. At one minute per model, grid search would take you roughly 255 years. Finding an optimal combination is an intractable problem. Now I'll hand it over to Neil to discuss the algorithm.

Thank you, Kyle. Our algorithm aims to make it feasible to search these intractable hyperparameter spaces. We'll now go over how it works. We start by initializing the process, which includes defining a hyperparameter space, such as the one we just reviewed for a random forest.
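As an aside, the back-of-the-envelope grid-search estimate quoted a moment ago can be checked directly. A quick sketch (the per-hyperparameter level counts aren't listed in the transcript, so we start from the quoted 134 million total):

```python
# Grid-search cost estimate for an exhaustive hyperparameter sweep.
total_combinations = 134_000_000  # approximate grid size quoted in the talk
minutes_per_model = 1             # one minute to train and evaluate each model

total_minutes = total_combinations * minutes_per_model
years = total_minutes / (60 * 24 * 365)  # minutes -> years of sequential training

print(f"{years:.0f} years")  # roughly 255 years
```

At one minute per model this works out to about 255 years of sequential compute, which is the figure in the talk.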
Initialization also includes loading our data, which typically comprises training, validation, and test data sets; setting the number of epochs to run, which equals the number of models the algorithm will try; and finally setting how many space updates we want to perform. The updates are how we hone in on the highest-performing hyperparameters. Our example will show how the algorithm works epoch by epoch, as if we're going to run 100 models and perform five space updates. We'll also highlight which parts of the algorithm are performed in R, Spark, or both.

We start with epoch 1, which we'll categorize as a training step. This means no updates will be made to the hyperparameter space. The first step is to sample one set of hyperparameters. Next, we take our training data and sampled parameters to train a model; sparklyr's ml_random_forest classifier is an example of this. After training is complete, we calculate several model evaluation metrics on all three data sets, such as precision, recall, F1, and AUCPR. Finally, we save our results. This includes saving things locally in R, such as our model evaluations, but also saving the trained model to HDFS, which is done using Spark. With that, we move to the next epoch.

Now we're going to jump ahead to our first hyperparameter space update, which occurs at epoch 20. The first four steps are the same as before: we take a sample, train a model, evaluate it, and save all of our results. Next, we use the results from the last 20 epochs to perform an update to our hyperparameter space. This includes fitting our Bayes GLM, pruning the space, and saving results.

Now let's dive into the update step. When we first enter it from the training loop, we load the model evaluation results and hyperparameters. We only load the prior 20 epochs of information, and we're only interested in the validation results. Next, we train a Bayesian generalized linear model, or Bayes GLM for short.
This model will learn how the hyperparameters impact our model performance objective. We're training it on the results from our validation data set, which our random forest model never saw. Once the GLM is trained, we run it through the arm package's sim function to simulate our model coefficient priors. The priors are saved so we can reuse them in our next update step; this can be thought of as the model's memory as it learns its coefficients, since we only ever load the latest 20 results. With the finished GLM, we use it to estimate our model performance objective across all hyperparameter-space combinations. We compare these estimates to our current best model and remove any combinations that are predicted to perform worse. Similarly to our priors, we save the pruned hyperparameter space. This completes the update step, at which point we exit back to our main loop.

Now, here's a reminder of where we were in our training loop. We'll perform 20 more training steps after this before we re-enter the update step. The final piece of our algorithm can be illustrated with the second update step, at epoch 40. Similar to our first update step, we load our last 20 validation results and hyperparameters. Unlike our first update, though, we train the GLM using our saved priors as well. This allows the algorithm to converge much faster than if we had used all of the results, something we realized from extensive testing. We proceed by re-simulating our priors using our new model. These should now be closer to our true coefficients, which will help us identify the highest-performing sections of our hyperparameter space. We finish this step by saving our new priors. We save all priors and pruned spaces separately so we can go back and understand how the algorithm performed. Next, we re-estimate the space objective using the pruned space we saved during our first update step. Using the same methodology as before, we further prune the space.
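The control flow just described (sample, train, evaluate, save, then prune the space every 20 epochs) can be sketched compactly. This is a hypothetical Python illustration, not the team's R implementation: sparklyr's ml_random_forest is replaced by a caller-supplied `train_model` function, the arm::bayesglm / arm::sim Bayesian surrogate is replaced by a plain least-squares fit, and the carrying of simulated priors between updates is omitted.

```python
import itertools
import random

import numpy as np

def sample_params(space):
    """Draw one hyperparameter combination uniformly from the current space."""
    return {name: random.choice(levels) for name, levels in space.items()}

def update_space(space, recent_results):
    """Fit a linear surrogate on the last batch of (params, metric) results,
    then drop hyperparameter levels that appear only in combinations the
    surrogate predicts will underperform the best model seen so far."""
    names = sorted(space)
    X = np.array([[1.0] + [params[n] for n in names] for params, _ in recent_results])
    y = np.array([metric for _, metric in recent_results])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # stand-in for the Bayes GLM
    best_seen = y.max()
    survivors = [combo for combo in itertools.product(*(space[n] for n in names))
                 if coef[0] + np.dot(coef[1:], combo) >= best_seen - 1e-9]
    if not survivors:  # surrogate predicts nothing beats the best: keep the space
        return space
    return {n: sorted({combo[i] for combo in survivors}) for i, n in enumerate(names)}

def tune(space, train_model, n_epochs=100, update_every=20):
    """Epoch loop: sample, train, evaluate; prune the space every 20 epochs."""
    results = []
    for epoch in range(1, n_epochs + 1):
        params = sample_params(space)
        results.append((params, train_model(params)))  # validation-set objective
        if epoch % update_every == 0:
            space = update_space(space, results[-update_every:])
    return max(results, key=lambda r: r[1])  # best (params, metric) seen
```

For example, tuning a toy objective where performance falls with regularization, `tune({"alpha": [0.0, 0.5, 1.0], "reg": [0.0, 0.1, 1.0]}, lambda p: 1.0 - p["reg"], n_epochs=40)` quickly prunes the space down toward `reg = 0`. The main simplification versus the talk is that the real algorithm refits its Bayes GLM from the saved priors each update, rather than from scratch.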
The update step is completed by saving the pruned space as we exit back to the main training loop. Now that you have a good idea of the algorithm procedure, how does it look in action? For illustrative purposes, we'll review a few batches of updates for a logistic regression model. The main takeaway is that we see increasing model performance with successive updates.

The graph on the left shows our original space, with bounds of 2.56 and 1. This represents about 2.5 million combinations. The main objective when tuning a logistic regression is to find the right amount of regularization so that the model does not overfit the training data. In our graph, we populated 20 points that show which hyperparameters the algorithm sampled for the first 20 epochs. The colors and sizes of these points indicate the AUCPR performance of each model, which we'll use as our space objective for the Bayes model. We've re-graphed the results into a histogram on the right. You can see the performance of the models varies widely, from a low around 0.1 to a high around 0.8.

At this stage, we feed the 20 results and hyperparameters into our update step. We're left with the region that the algorithm now recommends we search to maximize model performance, indicated by the tiny red triangle in the bottom left corner of the graph. In this case, our space was reduced by about 99% of its original size. For the next 20 epochs, we limit our samples to this newly reduced space. You can see the bounds have decreased to 0.193 for our regularization parameter and to 0.049 for our alpha parameter. The results from our previous batch are still visible on the right; this will serve as our cumulative look at how the algorithm performed. On the left, we've again populated the 20 points that show which hyperparameters were sampled, now for the second set of 20 epochs. You'll notice the AUCPR values have dramatically increased, far exceeding the results we observed in the first batch of models.
This is most apparent in the histogram on the right. The distribution for batch 2 models is far to the right of batch 1, with a range between 0.97 and 0.98. We're now up to our second update step. We'll utilize the priors simulated during batch 1 and the model results from batch 2 to update our space estimates. Following this procedure, we're left with the hyperparameters outlined by the red triangle, again in the bottom left corner of the graph. The reduction is not as pronounced as the first, but the algorithm is telling us that we don't need much regularization for this model. For the next 20 epochs, now epochs 41 to 60, we limit our samples again to the newly reduced space. You can see the bounds have again decreased, now to 0.076 for regularization and to 0.037 for alpha. One last time, we populated the 20 hyperparameters that were sampled, now for the third set of 20 epochs. The AUCPR values are slightly higher than in the previous batch, and the algorithm is showing us that we need little to no regularization for this model. While it may seem trivial, by using the algorithm to make this decision, we can rest assured that we are choosing the parameters based on information and insight, rather than just blindly assuming the model does not need to be regularized. Ultimately, the final model had a test AUCPR of 0.9791. More than the final results, however, the impressive nature of this algorithm is its ability to sift through low-performing sections of the hyperparameter space and hone in on sections that are more likely to produce high-performing models, as indicated by the histogram on the right. With that, I'll hand it back to Kyle to wrap up our presentation.

Some of our limitations include that the algorithm is very slow on large data sets, and that currently we're only learning a linear boundary. For our next steps, we're planning experiments to see how random sampling affects model performance.
We're asking, you know, would this algorithm benefit from a non-linear variant, and we're conducting experiments to better understand why this algorithm works. Wrapping up here with some closing remarks: identifying optimal hyperparameter combinations presents an intractable computation. We developed a model-agnostic Bayesian approach using R and Spark to search hyperparameter spaces. Our algorithm tends to increase the chances of maximizing model performance. In addition, we performed a meta-model analysis that we used to reduce our initial search space configurations. That's all the time we have for tonight. Thank you very much.

Oh, Bob, I think you're on mute. Sorry about that. Thanks, Kyle and Neil. That was great. I enjoyed your presentation, and I got to see some of the questions that scrolled by in the chat. There's only one question posted, so I'll read that: how long did it take you to design and implement your R programming environment?

It's an ongoing process. Yeah, we're constantly refining our algorithm and the code that we use to streamline this process. In fact, Neil here has just released to our team the next iteration of this algorithm so that we can further streamline and test it. We're very excited to test this. Okay, great. We're hoping to get it written into a package to publish for public use, so that's part of that effort. To answer the question, I would say it took probably about a year to get set up with all of the infrastructure that we use and to be able to work efficiently through our projects. I would say a year is probably a realistic amount of time.

Okay. Are you pulling data directly out of your data warehouse?
Yeah, we do have an enterprise information team that manages a lot of our data for us. They populate the data from our warehouse into our Hadoop environment, and then for all of the models that we run, we're pulling tens or hundreds of millions of records from there to build these models.

Okay, it looks like we have one more question, and then I guess we can wrap up a little ahead of schedule. While developing the solution, did you consider search methods other than basic grid search?

Yeah, I would say that this is an ongoing design; we're continuing to refine how it's done. Kyle, I think you've tried a k-folds-type search, right? We've done random-search-type algorithms. I think this is kind of the culmination of... Yeah, I think in a sense this is a good marrying of a combination of different designs. You have grid search, which searches over all of the possible hyperparameters, and you have random search, which searches at random. But then we have our algorithm, which selectively picks various spaces and, from these trials and errors, reduces the search space. That's how we arrived at this algorithm.

Great. All right, thank you so much for your talk, and we'll move on from here to the next one. Thank you. Thanks.