Hello, my name is Victor Castro. I'm a data scientist at Mass General Brigham in Boston, and the title of my talk today is Assessing Machine Learning Model Performance in Diverse Populations and Across Time. The focus of the talk is lessons learned in developing and replicating a machine learning model that we built in the earlier part of the pandemic, back in March.

The model was developed around June of 2020 based on hospitalizations in the Mass General Brigham healthcare system, which encompasses six hospitals, both academic medical centers and community hospitals. We used predictors from laboratory test results, prior medical history, vital signs, and other demographic information to train the model. The model itself was trained on a dataset of 2,511 admissions. We fit a LASSO regression using the tidymodels package, which I highly recommend; it keeps things nice and clean and it really helped with replication six months later. Specifically, we used the glmnet engine within the tidymodels workflow. Again, the predictors included demographics, prior medical history, lab results, and vitals at presentation. Typically that was at the emergency room, but it could also have been when a patient was transferred from another hospital.

For the training and test split, we took advantage of how our hospitals feed into each other: two community hospitals feed into one academic medical center, and we have two of those groupings. We used one grouping, one academic medical center and two community hospitals, as the training set, and the other grouping, again one academic medical center and two community hospitals, as the test set.

These are the demographics, and the training set was about two thirds of the data. I show this table because it becomes an important point later in the talk. You'll see there is a fairly wide distribution: as is widely known, especially in the earlier part of the pandemic, older people were being admitted more often, and certain races and ethnicities were also overrepresented compared to our typical hospital population.

The model was trained to predict two things, so there are two different outcomes and two different models. The first is COVID-19 severity, defined as ICU admission, mechanical ventilation, or death. The second is mortality alone, with death as the outcome. Overall, 11% of patients died and 18% had a severe outcome.

Here is how the final models looked; this paper was published in JAMA Network Open last summer. Quite a few predictors were fed into the LASSO engine, and these are the final models: one column for the severe illness model and the other for the mortality model. You'll see age is a strong component, along with baseline oxygen, some lab tests, and prior history of respiratory infections, COPD, and the like. Nothing very surprising here. The models performed pretty well: we had an AUC of 0.807 for the mortality model and 0.847 for the severe outcome model. I show one graph here, Kaplan-Meier survival curves, splitting the model predictions into quintiles.
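To make that quintile view concrete, here is a minimal sketch, not our exact plotting code, of how predictions can be cut into quintiles and plotted as Kaplan-Meier curves. The data frame `risk_df` and its columns `pred_severe`, `days_to_event`, and `severe_event` are hypothetical stand-ins for the model's continuous risk score, the follow-up time from presentation, and the event indicator.

```r
# Sketch: split continuous risk predictions into quintiles and draw
# Kaplan-Meier curves, one per quintile (hypothetical column names).
library(dplyr)
library(survival)

km_df <- risk_df %>%
  mutate(risk_quintile = ntile(pred_severe, 5))   # quintile 5 = top 20% of predicted risk

km_fit <- survfit(Surv(days_to_event, severe_event) ~ risk_quintile, data = km_df)

# One curve per risk quintile; the lowest curve corresponds to the top-quintile group
plot(km_fit, col = 1:5,
     xlab = "Days since presentation", ylab = "Event-free probability")
legend("bottomleft", legend = paste("Quintile", 1:5), col = 1:5, lty = 1)
```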
The bottom line here is the top quintile of risk, which is generally what we use as our cutoff, the top 20% of predictions. The x axis is time since presentation to the emergency room or admission, and this is the severe outcome model. You'll see it separates quite nicely, so we were really excited to see this.

That paper was published in October, based on data through June. Then, in the earlier part of this year, we decided to look back. The pandemic had changed quite a bit, we had gone through different waves, and at the time we were in a second, winter wave in New England, which was different from the first. So we asked how well the model performed given all those changes. We recently published a follow-up paper as a research letter, again in JAMA Network Open, looking at exactly that. That is the focus of this talk and the lessons learned, and I'm hoping to get into some of our code and share some tools, again using the yardstick package, which is part of tidymodels and is really nice.

This is again a table one, comparing patients from the initial training period, primarily March, April, and May, with the replication period, which is the subsequent time. As we expected, there were big differences: differences in age and, specifically, in ethnicity. There were quite a few more Hispanic patients in the first wave than in the subsequent waves. Notably, there was a significant drop in our outcome. As you'll remember, 18% of patients had severe illness in the first wave, but in the subsequent time period that dropped to 8.3%. Similarly, the outcome of death was almost half of what it had been. That was of course great news, but we wanted to see how our model performed, especially because linear models like LASSO regression are quite sensitive to changes in outcome rates.

This is another view of how the outcome changed. The graph on the left is the proportion of severe outcomes by month, and the blue box marks the time period of the initial training set. You'll see the earlier months were quite a bit higher, then slowly dropped and became fairly stable after the initial training period. Again, that's really good.

And this is the crux table of the talk. Aside from the outcome changes, we wanted to look at how the model performed across different subgroups. We defined subgroups by gender, age group, race, ethnicity, and whether patients were admitted to an academic medical center or a community hospital. The left half of the table describes the original performance in the original test set, and you'll see the AUCs were sometimes better, sometimes worse. We actually didn't look into this the first time around, and this is one of the big lessons learned: some groups were quite underrepresented in our initial set. In the initial wave there were actually no patients under 50 who died, whereas in the subsequent waves there were seven, and we still had to make predictions for them.
But there was really no way the model could do well there, because there were never any patients in those groups in the training set. Similarly, there were few Asian patients, and some really small numbers with the outcome in that initial set. So while the model performed really well in the test set from the training period, and again these are PPVs, specificities, and sensitivities at the top-quintile cutoff, things changed when we got to the evaluation set, the subsequent period. The AUCs actually held up pretty well, except in the group where we had no patients under 50 with the outcome, where the AUC was pretty poor. The biggest drop we saw was in PPV, and that is probably because of the change in outcomes, the significant drop in outcome rates. There was also quite a bit of difference across subgroups: a PPV of 0.13 for Hispanic patients versus 0.23 for non-Hispanic patients, whereas in the original set those values were a little closer together. I encourage you to look at the paper for the full results.

Since this is an R conference, I wanted to show you how I did this and how relatively easy it is with the yardstick package, which has been a game changer for our work. This is a dummy version of our dataset, viewed in RStudio: one row per admission per patient, and then the categories by which you want to break the data down. Then there is the true outcome, whether they actually had the outcome, which in our case was either death or severe illness before discharge. The model class column is the model prediction, based on the top-quintile cutoff. You can also keep a continuous prediction, which you can use to calculate the AUC; for brevity I won't show that here.

I will mention that making sure you maintain the same parameters from training through to evaluation is not trivial. There are quite a few steps, as we know, from generating the data to transforming the data, and the tidymodels package makes this great because you can create the workflow while you're doing the training and then save it. You want to save not just the model, but also the transformations of the predictors and the cutoffs from your training set, and pull those through, because you don't want to take the top quintile of the predictions in the evaluation set, for example; that wouldn't be fair.

So again, the yardstick package is what we use, and it takes just a few lines of code. You load the yardstick package, and I have the URL there. Then you define your class metrics, the ones you saw in the table: specificity, sensitivity, PPV, and NPV. That's a metric set. Then I'll walk you through a little of this code. The d is the table you saw earlier. We pivot it longer so you have one row per patient per subgroup, group by the subgroups, use the dplyr nesting function, and then map across the nested data, applying the class metrics function and passing in the truth and the estimate, which are the last two columns in the data frame you saw. Once you do that you basically have your results, and you can unnest them and pivot wider, roughly like the sketch that follows.
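Here is a minimal sketch of that pattern, not our production code. The data frame `d` and the column names (`sex`, `age_group`, `race`, `obs_class`, `model_class`) are hypothetical stand-ins for the dummy dataset on the slide; `obs_class` and `model_class` are assumed to be factors with the same levels, with the class prediction already made at the training-set top-quintile cutoff.

```r
# Sketch: per-subgroup classification metrics with yardstick (hypothetical column names)
library(dplyr)
library(tidyr)
library(purrr)
library(yardstick)

# The class metrics from the table: specificity, sensitivity, PPV, NPV
class_metrics <- metric_set(spec, sens, ppv, npv)

subgroup_results <- d %>%
  # one row per patient per subgroup variable
  pivot_longer(c(sex, age_group, race),
               names_to = "subgroup_var", values_to = "subgroup") %>%
  group_by(subgroup_var, subgroup) %>%
  nest() %>%
  # apply the metric set within each subgroup, passing the true outcome and
  # the class prediction carried over from the training-set cutoff
  mutate(metrics = map(data, ~ class_metrics(.x, truth = obs_class,
                                             estimate = model_class))) %>%
  select(-data) %>%
  unnest(metrics) %>%
  ungroup() %>%
  select(-.estimator) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

# If the continuous risk score is kept (e.g. a hypothetical .pred_severe column),
# the AUC can be computed the same way inside map():
#   roc_auc(.x, truth = obs_class, .pred_severe)
```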
And this is how it looks, similar to the table, in pretty few lines of code, and it works efficiently even with large datasets. You can also calculate AUCs, which is a little trickier but not much more so, and confidence intervals can be calculated as well.

So this brings me to some conclusions. Number one, our COVID-19 risk prediction model maintained good discrimination: the AUC was comparable between the training period and the replication period. But calibration was diminished by the sharp reduction in the outcomes of severe COVID-19 disease and death, and to fix this we would probably need to recalibrate the model.

The second point is that assessing model performance over time is quite useful and interesting. It was really worthwhile to understand what was happening, and we would advocate for a lot more of it, especially in the literature, where we see quite a few publications of clinical risk prediction models but very little replication or assessment at other sites, or even within the same site across time.

The last point is probably the most important: assess risk stratification models across patient subgroups. We didn't do this the first time around. We didn't look at what our data looked like by subgroup, and specifically at how well each subgroup was represented with respect to the outcome, which had a significant impact on performance in those subgroups in both the test set and the replication set. Hopefully I've shown you an approach to do that with your own modeling; it's pretty straightforward. And it's very important, because we need to pay attention to make sure that modeling approaches operationalized in clinical settings don't adversely impact subgroups that might be underrepresented.

I also want to call out one interesting paper in the literature by Mark Sendak at Duke, where he presents something like a drug label for models, which I thought was really nice. He has an example in his paper, which I link at the bottom. It looks like a drug label and basically says: this is how the model works, this is how it was trained, these are its limitations. Understanding the limitations of the population a model was trained on can be communicated this way, and it's far more detailed than, for example, the TRIPOD recommendations for publishing models in the literature. I think this is really cool, and I'm hoping to look at this approach more in the future.

Finally, some thoughts from a nice discussion I saw recently on Twitter, which I link here, about how modeling choices impact performance in subgroups that are underrepresented in the data. The excuse usually given is that there's not enough data for these underrepresented subgroups, and it is true that there is typically less data, for a number of reasons. But that's not the only reason bias exists: modeling choices have quite a bit of impact, as do transformations of the predictors and other things. There's a nice diagram of where all those biases can be introduced and ways to approach them, and I encourage you to look at it. And then there are some thoughts about figuring out ways to optimize models for representativeness in underrepresented populations, and I know there's quite a bit of work happening on that.
So I look forward to that. Thank you for listening, and I'm happy to take any questions. Thank you.