Hi, my name is Karandeep Singh. I'm an assistant professor of Learning Health Sciences, Internal Medicine, Urology, and Information at the University of Michigan. I also chair the Michigan Medicine Clinical Intelligence Committee, which oversees the implementation and governance of machine learning models across our health system. I'm excited to be talking to you today about bringing machine learning models to the bedside at scale. One of the first questions that comes up in healthcare operations is: why machine learning at all? I like to frame this as a component of the learning health systems approach, from the standpoint of the learning cycle. When we talk about the learning cycle, we're talking about forming a learning community around a health problem of interest, and taking the performance and practice that we're currently doing and turning that into data that we can routinely record. Then, in the data-to-knowledge step, we take that data and learn from it. Then, based on what we find, we turn that knowledge into performance, or into practice. In other words, we intervene on our current workflow and change it with the idea that the change will improve things, and then we measure our performance to see if it worked. Within this context, we need to be able to run machine learning models in our electronic health record, which means we have to integrate them so that they can routinely generate data, even if that data is just being silently recorded and not shown to anyone, as part of our ability to figure out how good these models are. This requires infrastructure. The main use case of machine learning in healthcare, in my view, is that it lets us identify high-risk and low-risk patients at scale. If you're a clinician seeing a patient right in front of you, and you see that the patient's heart rate is rising and their blood pressure is dropping, you can tell that that patient is sick. But if the question is, who are the 20 sickest patients in the hospital right now who aren't in the ICU? That's a question that's very difficult to answer at scale. You would basically need to call up all the charge nurses on all the different units and try to figure out who the sickest patients are. That takes a substantial amount of time, and this is where machine learning can play a role. If you can figure out who those high-risk or low-risk patients are, you can then allocate resources more effectively. This is the knowledge-to-performance step, where, either by adding an intervention or by changing your workflow, you can align your resources a bit better. You can treat higher-risk patients more aggressively, or you can reserve lower-acuity care for lower-risk patients and prevent unnecessary testing on them. For example, one of the areas where we want to identify low-risk patients is in enrolling patients in our hospital care at home program. These are patients who come into our ER. They're high enough risk that they need to be hospitalized, but low enough risk that they can be hospitalized safely at home, with remote monitoring and nurses coming to their house. This is an example where we're essentially saving a hospital bed by sending these patients home, but we need accurate models that can find the patients who can safely go home without needing a higher level of care.
Many of the machine learning models that we're using in the health system are actually early warning systems, which I view as essentially machine learning models that run on a fixed patient population. Some of these models run every few minutes, like models that detect sepsis, deterioration, or falls. Some of these models run every day, like models to predict which patients aren't going to show up to their clinic appointments. And some models run on every visit or every encounter, such as a model to predict hospital admission for patients in the emergency department. What I'm showing on the right side is a model that predicts deterioration. Each of those boxes represents a different patient. In the top two boxes, you can see a patient who came into the hospital and needed to be transferred to the ICU within 48 hours, indicated by the blue line. The red line indicates a patient who needed mechanical ventilation, and the bottom two panels show patients who did not need to be transferred to the ICU or started on mechanical ventilation. The black line that you're seeing here is the Epic Deterioration Index, which is one of the deterioration indices that we use in our hospital right now. So we're going to talk through a lot of the challenges that we face in trying to get machine learning models to the bedside. We'll start off with: do we have the infrastructure to even implement models? Then we'll move on to figuring out whether we should implement a model. Once we actually go about implementing a model, how do we then measure its performance? Then there's probably the toughest question we face: is a model good enough to use? And then: do users actually agree on how to use that model in the form of a clinical workflow? Finally, if we're able to agree on how to use a model, is the model actually effective in improving clinical outcomes or quality measures in the way that we'd like? And then we'll close with a discussion of governance and what governance looks like for machine learning models within health systems. So in many ways, we're going to open the talk by talking about technical infrastructure and finish the talk by talking about social infrastructure in the form of governance. So the first question that comes up, if you're at a health system that's newly entering this space, is: is there technical infrastructure to even implement and run machine learning models? Let me set the scene for you. Let's assume we're at an academic medical center and a researcher says, we want to pilot our group's readmissions model, a model to predict readmissions. So you, as the chief medical information officer or someone working in that role, say, yeah, let's go ahead and build a web interface to your server. You run the model on your server, we set up the web interface, and that way you can update the model as you please and the interface should be pretty stable. A second researcher comes along and says, we want to pilot a prioritized scheduling model in clinic. You say, great, we did something that worked for one of our researchers, we're going to go ahead and do the same thing for you. Let's build another web interface to your server too. Now, a few weeks go by and, in the interim, a new electronic health record update is installed. One of these multiple models breaks, and the question comes up: why is my model not working? An infrastructural problem needs an infrastructural solution.
And when you set up separate infrastructure, essentially separate web interfaces for all these different researchers, you're treating an infrastructural problem as a series of one-off problems, which introduces risk. And although a lot of what those researchers were asking for were ways to deploy models, implementing machine learning models involves a lot more than just deployment. It involves managing the entire lifecycle of those models. So you need to be able to store those models, map the variables, and manage versions. And there are actually several options to do this on the technical side. I've only named a handful here, but examples of open-source mechanisms include MLflow and Knowledge Grid. There are also electronic health record-based vendor clouds, and there are other cloud computing platforms that exist. And then there may be internal computing resources, either at your hospital or at your academic medical center or university, that provide ways of versioning objects and running them from a web service. At our institution, we run a cloud-based machine learning infrastructure that is a component within our electronic health record. This infrastructure comes with a set of models that we can deploy, but probably most importantly, we can run any machine learning model securely in the cloud as long as it can run in Python. Unfortunately, R is not on the roadmap for running models, but there are still lots of ways that we can train models in R and deploy them in Python.
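To make the lifecycle piece concrete, here's a minimal sketch of what storing, versioning, and serving a model might look like with MLflow, one of the open-source options I just mentioned. The model name, features, and data below are hypothetical, and this is just one way such tooling can be set up, not a description of our actual infrastructure.

```python
# A minimal sketch (not our production setup) of model lifecycle management
# with MLflow: log a model, register a version, and load a pinned version for
# scoring. The model name, features, and data here are hypothetical.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# The model registry needs a database-backed store; sqlite works for a local demo.
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("readmission-risk")

# Toy stand-in for retrospective EHR features and a readmission label.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 1] + rng.random(500) > 1.2).astype(int)

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("algorithm", "logistic_regression")
    mlflow.log_metric("train_auc", roc_auc_score(y, model.predict_proba(X)[:, 1]))
    # Registering creates version 1; retraining later creates version 2, etc.
    mlflow.sklearn.log_model(model, "model", registered_model_name="readmission-risk")

# A scoring service can load a specific registered version, so an EHR upgrade
# or a retrained model never silently changes what is being served.
scorer = mlflow.pyfunc.load_model("models:/readmission-risk/1")
print(scorer.predict(X[:5]))
```

The point of pinning a named, versioned model rather than pointing at a researcher's server is that updates become deliberate, auditable events instead of silent changes.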
So once we have infrastructure in place to actually implement models, the next question comes up: should we implement a model? When I say implement in this talk, what I really mean is integrate the model into the electronic health record and run it silently. When I say use, what I mean is to intervene based on the model's predictions, such as by linking it to a clinical workflow. So the decision to implement a model is important even though the model is not being shown to anyone and no action is being taken. The reason it's important is that every time you go to implement a model, even just to run it silently, it requires a fair bit of effort on the part of an analyst to connect up the model. You're also committing yourself to taking a look at that model after it's been running for several months, or has accumulated enough data that you feel you'll have a stable estimate of model performance. So this decision is important, and it needs to be made based on imperfect information. Why do I say the information is imperfect? Well, even if your model is developed at your own institution, typically the validation numbers that you'll have are based on retrospectively calculated performance. And there are a number of reasons why that retrospectively calculated performance may not measure up when you actually go to run that model prospectively. Performance calculated at other institutions may also not generalize to your institution. So in many cases, with a model developed by a vendor, they'll say, we've tested it at several institutions and it works well. You'll have to compare your institution against those institutions, including practice patterns, to figure out whether you think that model will generalize to your institution as well. And any time we go to implement a model, we're usually working from a model information sheet that contains information about that model, its performance, what variables go into it, et cetera. But many of these model information sheets come with missing information. For example, many of them don't tell you where the model was trained or where it was evaluated. If the model generates a score from zero to 100, the sheet often doesn't tell you what a score of 50 actually means. Or, in cases where the model acts in the presence of missing information, it doesn't tell you how the model handles missing values. And underscoring all of this is that many times the model information sheets you're dealing with are not peer reviewed. They're internal documents meant to be shared with clients only. So when you're faced with an important decision based on imperfect information, it's important to have people with the right expertise available to help make these decisions, including folks with statistical expertise. How complete are these model information sheets? This is a preprint by Jonathan Lu and colleagues out of Stanford. If we just look at the row for the TRIPOD guidelines, which are a pretty standard set of guidelines that we use when we write up model development and validation papers and submit them for review, you'll see that on average, only about half of the items that the TRIPOD guidelines ask for were reported in this series of 12 proprietary model information sheets for different Epic models. And this is not an issue unique to Epic. This is an issue for any proprietary model, and even for some of the models that are published in the literature. So once a model is implemented and running in the background, how do we measure its performance? Well, first of all, do we need to measure its performance? Why should we measure model performance? When you're a model developer, you're used to thinking from the point of view of a model developer. And as a model developer, you're taught that internal validation is okay, but external validation is good. Well, when you're sitting in the seat of a model implementer at an institution, it's great to know that a model worked elsewhere, but it's even better to know that the model works here. And that's probably one of the biggest reasons to measure local model performance. I'll do a case study of this that relates to the local validation of the Epic Deterioration Index during the first wave of COVID-19. So in March of 2020, our situation at Michigan Medicine appeared to be pretty serious. I think it was March 24th when this chart was released by our health system. They were anticipating that, under a best-case scenario, we were going to reach a peak hospital census that was more than three times our hospital's capacity. Our hospital has about 1,000 beds, of which probably 800 or so are licensed for adults. And so you can see this blue dot, denoted by the red arrow, where we were expected to exceed our hospital capacity on April 7th. So the question came up: can we find a model that will help us figure out which patients to send to a field hospital if we were to open one around that time? On March 31, it was announced that a field hospital was being planned and would likely be built. It was set to open on April 9th or 10th, just around the time that our hospital was expected to reach maximum capacity, and it was going to contain 500 beds. And the question came up: which patients should go there?
Well, on April 1, the day after, Stanford announced that it was running a test of existing models to figure out if any of them could be useful for COVID-19 patients, to figure out which patients are high risk or low risk. And even though here it says it was an accelerated test of AI, they were really evaluating the Epic Deterioration Index, which is a widely implemented deterioration index model, to see if it could be suitable for this task. On April 1, Stanford's Institute for Human-Centered Artificial Intelligence (HAI) also held a conference whose sole focus was COVID-19 and artificial intelligence. It was a virtual conference. And in that conference, Dr. Ron Li from Stanford presented the deterioration index score from a patient to illustrate how a patient whose score was getting worse and worse could potentially be flagged as a patient who needs to get transferred to the ICU, and how this could be used to allocate resources more effectively. Fast forward a few weeks, and there was an Epic press release on April 22nd, where Epic said that this model was helping to save lives. "The model predicts which patients are getting worse and will need more care. It shows us if things are changing rapidly," was the statement made by a physician at Confluence Health. And if you look at the very end, in that highlighted portion in the red box, many organizations such as Confluence Health were reported to have completed their validation of this tool and were already using it for these patients. Except there was nothing in the literature and nothing publicly available talking about how this tool should be used. A couple of days later, it turned out that not everyone was using it the same way. Even though dozens of hospitals were using the system and had it available, this article by Stat News showed that at Parkview, doctors analyzed data from nearly 100 cases and found that most of the patients who eventually needed to be transferred to the ICU were in this middle zone of 38 to 55. And thus at Parkview, they were recommending that attention be paid to patients scoring between 38 and 55. Whereas in our early experience, we found that this tool was identifying high-risk patients at the higher levels of the score, not in the middle levels. So why should we measure model performance? Well, sometimes the performance of the model matches what is reported by the vendor for our intended use case. In this case, even though Epic didn't publicly release or publicly talk about the performance of the Epic Deterioration Index, they did internally share some area-under-the-curve numbers. And the area under the curve of 0.79 that we found in our study, which we published in the Annals of the American Thoracic Society, fairly closely matched what Epic had shared internally with clients. So in other words, sometimes the performance, when you evaluate it independently, does match what's reported by the vendor for that intended use case. Other times it does not. If your use case or your outcome of interest is different from the outcome of interest on which the model was trained and evaluated by the vendor, then your results may be discordant with those of the model developer. Even when the performance actually matches, there is something useful to be learned. So this is a plot showing the Epic Deterioration Index in COVID-19 patients at the University of Michigan. This plot was drawn from our paper published in the Annals of the American Thoracic Society.
And I refer to this as a threshold performance plot, mainly because, on the x-axis, it shows all of the possible thresholds. And in each of those boxes or panels, it shows the performance: the sensitivity, the specificity, the positive predictive value, and the negative predictive value. Below the threshold performance plot is a histogram showing the distribution of predictions. Now, for this particular plot, we were trying to answer the question: what does a score of 50 mean? So just to help make sense of this plot, the distribution is not a distribution of all of the Epic Deterioration Index scores. It's actually a distribution of just the maximum score for each patient. Because we wanted to figure out not so much what any given score means, but rather, if a patient ever exceeded a score of 50, what does that mean in terms of whether the patient would need to go to the ICU, become mechanically ventilated, or pass away? And so looking here at this blue line, the Epic Deterioration Index score of 50: if you look at the positive predictive value and negative predictive value panels, the light gray bar indicates how many total people were classified as positive or negative, respectively, at each of the thresholds. And then that black line indicates the measure, with the darker gray representing the 95% confidence interval. So at an Epic Deterioration Index of 50, if you ever crossed a score of 50, there was an approximately 45% chance, looking at that positive predictive value, that you would need to go to the ICU, become mechanically ventilated, or die during the remainder of your hospital stay. And the negative predictive value was relatively high, at approximately 90%. You can see that at that same threshold, the sensitivity of the model was approximately 80%, and the specificity was around 50%. So this was actually helping us figure out what a score of 50 means. Another way you could look at what a score of 50 means is to look at the distribution of all of the predictions. In this case, we were showing a calibration plot. So what have we done? We've taken all of the scores and rescaled them from zero to one, so a score of 50 translates to a predicted probability of about 0.5 here. And we wanted to see, for any given score, what is the probability that a patient would experience the outcome during the remainder of their hospitalization? You can see it's a much smoother distribution on that histogram below. And if you track the x-axis and look for that predicted probability of 0.5, corresponding to the score of 50, you'll see that the chance that the patient will actually need to go to the ICU or experience the outcome is about 33%. So a score of 50 here doesn't mean a 50% chance of experiencing the outcome during the remainder of the hospital stay. It actually means a 33% chance. Now, it might seem like this model is miscalibrated. And certainly the way we described this phenomenon in our paper was to say that the model is miscalibrated. But technically, I would say that the model is not miscalibrated, because it wasn't fit with a binary outcome. It was fit with an ordinal outcome that was a composite of several different events that were somewhat arbitrarily weighted, with arbitrarily decided look-ahead periods. And so, knowing how the model was trained, if a patient gets a score of 50, how should a clinician actually interpret this? How should they make sense of this?
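To make the mechanics of these two views concrete, here is a small sketch of the arithmetic behind a threshold performance plot and a rough calibration check. The scores, event rate, and outcomes below are simulated for illustration; they are not Epic Deterioration Index data.

```python
# A small sketch of the numbers behind a threshold performance plot:
# sensitivity, specificity, PPV, and NPV at each candidate threshold, using
# each patient's *maximum* score during the encounter. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
outcome = rng.random(n) < 0.15                            # ICU transfer, ventilation, or death
max_score = np.clip(rng.normal(45 + 15 * outcome, 15, n), 0, 100)

for threshold in (30, 40, 50, 60, 70):
    pred_pos = max_score >= threshold                     # patient "ever exceeded" the threshold
    tp = np.sum(pred_pos & outcome)
    fp = np.sum(pred_pos & ~outcome)
    fn = np.sum(~pred_pos & outcome)
    tn = np.sum(~pred_pos & ~outcome)
    print(f"threshold {threshold}: "
          f"sens={tp/(tp+fn):.2f} spec={tn/(tn+fp):.2f} "
          f"ppv={tp/(tp+fp):.2f} npv={tn/(tn+fn):.2f}")

# Rough calibration check: rescale scores to 0-1, bin them, and compare the
# mean predicted probability in each bin with the observed event rate.
prob = max_score / 100
bins = np.digitize(prob, np.linspace(0, 1, 11))
for b in np.unique(bins):
    mask = bins == b
    print(f"bin {b}: mean predicted={prob[mask].mean():.2f}, "
          f"observed rate={outcome[mask].mean():.2f}")
```

This is the same logic the plot encodes visually: the threshold sweep answers "what happens if we ever act above a score of X," and the binned comparison answers "what risk does a given score actually correspond to."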
And looking at that calibration curve, you can already see that there was something funny going on, because at low scores the observed risk was actually higher than at some of the more intermediate scores. Getting to this question of what a score of 50 means: in the original Epic press release, they provided an example of a score and what it meant. In their press release, a score of 37 was labeled as being in the danger zone. Interestingly, if you look at the calibration curve that I just showed you, a score of approximately 37, or a predicted probability of 0.37, actually corresponds to a relatively low risk, not a high risk. So this is again just showing you that even though the score was being touted as a score with lifesaving potential, the messaging around what a number on the Epic Deterioration Index translates to with regard to patient risk was actually not so clear. And the messaging around risk is really important, because oftentimes we're making decisions based on what we think is the predicted risk of an outcome. Interestingly, in the original Epic press release, they made it look as though the score was a probability. They labeled it 37% and said that it was high risk and labeled it danger. A couple of days later, they did go and update their press release to correct this, pointing out that the score was just a number. It was 41.9, or approximately 42, and it corresponded with medium risk, not the danger zone. Because again, 37 was actually relatively low risk in our evaluation. So what does medium risk mean? We can't really say. And I think that's one of the challenges: despite the model having a good area under the curve, the numbers themselves were difficult to translate into a probabilistic risk that a clinician can then use to guide decision-making. So how did we end up operationalizing the Epic Deterioration Index? Well, during our first wave of COVID, we instructed frontline clinicians on how to add the score to their sign-out, which is the list of patients that they're covering, so they could see patients' scores at all times. We showed them how to inspect trends, and we showed them how to interpret the numbers. This is a one-page sheet that we prepared and sent out to all hospitalists, providing some basic explanation of what the numbers mean based on our earliest analysis, prior to our publication. And then we suggested two ways of incorporating this into their routine clinical care. One was to consider rounding first on the highest-risk patients, because those are the patients potentially most in need of going to the ICU. The other was to prioritize those patients during handoffs and spend more time on them, so that issues that might come up overnight get worked out in advance. So why might a model's performance not match the expected performance? In this case, we talked about the fact that the outcome on which the model was trained, which was ordinal, was different from the outcome we were looking at, which was a composite of transfer to the ICU, mechanical ventilation, or death. In lots of other situations where you expect the model to be totally fine when you run it prospectively, things come up unexpectedly. And local validation can uncover problems with the model.
Even when you've trained the model locally on retrospective data and validated it locally on retrospective data, issues with timestamps can still cause your model to not work as well when you start running it prospectively, because information that's charted as of a given time doesn't always become available until after the fact. In the simplest example of this, a vital sign might be checked by a machine at 1 p.m., confirmed by a nurse at 1:10 p.m., and then retroactively charted at 1 p.m. based on the time it was checked. So in a retrospective dataset, it might look as if that 1 p.m. vital sign was available in the system at 1 p.m., even though it wasn't, because it hadn't yet been confirmed by a nurse. Variables might also be mapped differently in the derivation and local validation sets. This can be a problem even when the model is developed at the same institution, because in many cases your research data warehouses are different from your production or operational electronic health record databases, and there may not be a one-to-one variable mapping across the two. If the model is trained elsewhere, then a different acuity of patients in your local cohort might cause the model to not work as well. Additionally, many early warning systems rely on clinical workflows as predictors, and because workflows differ between institutions, a model that works at one institution might not work at another one that has a different clinical workflow. Here's a nice paper by Erkin Ötleş, Jeeheh Oh, and Dr. Jenna Wiens, who are my colleagues at the University of Michigan, which shows that a model that was trained and evaluated retrospectively at the University of Michigan had a slight drop-off in performance when running prospectively, likely because of slight changes in the patient population, changes in clinical workflows that occurred during this time, and changes in infrastructure, in that the infrastructure used to train the model was different from the one where the model was ultimately deployed and prospectively evaluated. Dataset shift is increasingly being recognized as an important problem. This was a letter to the editor that we wrote in the New England Journal of Medicine, where we highlighted an extreme case of dataset shift: the performance of a sepsis model during the COVID-19 pandemic. Specifically, we had deployed a sepsis model. The sepsis model was in clinical use on three different units in our health system, and on these three units, during our first wave of COVID, nurses requested that the model be disabled because it was generating excessive alerts. Now, why did this happen? Was it the fault of the sepsis model? Not necessarily. This was probably the fault of a shift in the underlying patient population, where the relationship between fevers and bacterial sepsis was altered in the presence of COVID-19. Ultimately, this led our governing committee to decommission the model's use, and it wasn't restarted until several months later. Now, this is an extreme example, but many causes of dataset shift are more subtle. So I'm showing you just a part of Table 1 here, which highlights one example of changes in technology that could lead to dataset shift.
But this table in its complete form contains lots of different examples of why this might happen, how it might be recognized, and what some of the earliest steps are that one might take to mitigate it, and many of those involve governance. So why should we measure model performance? One of the most common problems with models is miscalibration. If we don't measure model performance, we won't know if it's occurring, and if we don't know it's occurring, we can't fix it. And this is just an example of some of the different ways that models can be recalibrated, according to the textbook Clinical Prediction Models by Dr. Steyerberg. All right, so once we've justified the fact that we do need to measure how these models are running prospectively, how do we actually go about measuring that performance? Let's say we have implemented a model to predict inpatient clinical deterioration. How do we measure its performance? First, we need to define a time zero for the outcome. Now, that might be a change to ICU status, or that might be a physical transfer to the ICU, but you want to choose the earliest event that indicates that the outcome is inevitable. In the case of COVID-19, the change to ICU status was a very important marker, because many patients were not actually physically transferred to the ICU. Instead, their beds were converted into ICU beds. So we included change to ICU status as an earlier marker of going to the ICU, because some of those patients ultimately didn't physically move beds even though they were transitioned to ICU status in the same bed. Once you define a time zero, in general you want to exclude all predictions after time zero. There may be rare exceptions to this, but by and large you don't want to include predictions made after the outcome has occurred, because those predictions have the benefit of knowing that the outcome has occurred, especially if the outcome is included as a predictor. And then we typically use one of two strategies to measure model discrimination. Strategy one focuses on encounter-level performance, where each patient contributes only their highest value to the calculation and each patient is equally represented in the calculation. For example, the outcome in this case might be transfer to the ICU during the hospitalization. A second strategy is what I call prediction-level performance, or what others may refer to as time-horizon-level performance. In this case, each patient contributes every single prediction to the calculation, and each prediction is treated independently. So the outcome may be transfer to the ICU during the next four, eight, or 12 hours, but the outcome is reset after every prediction. In other words, each prediction has its outcome within the next few hours calculated separately, even if the prediction is bouncing up and down. So let's take a look at this first strategy of encounter-level performance. In this setting, models are judged based on the highest prediction during an encounter. If you look carefully at those boxes on the right, you can see that the X's mark predictions that occurred after the outcome had already occurred. So we first throw those out. Then we take the highest value prior to the outcome occurring, which is represented by that pink circle. There's nothing special about that highest value, and I know you may be thinking, well, how would you know that highest value in advance?
The highest value is really there to represent the fact that you want to know whether a score ever exceeded a certain value. And so, by scoring each patient by their highest value, you can calculate the AUC based on whether each possible threshold was ever exceeded. In this setting, if a patient didn't have an event, all it takes is a single high prediction to bring down the AUC. Look at that bottom right box. A single high prediction would have changed the highest value for that patient and would have brought down the overall AUC. And from a clinician's perspective, if a model generates an alert when the score exceeds 65, then the encounter-level AUC penalizes models that ever alert for non-cases, even if most of their other predictions are good, because in reality that patient would have generated an alert even if all the other predictions were just fine. Now, you can get an artificially high encounter-level performance if the alert fires, let's say, only a handful of minutes before the outcome. An AUC of 0.99 is completely meaningless if the first prediction that exceeds the threshold occurs minutes before the outcome, assuming that's not enough time to actually do anything about it. So these are ways in which encounter-level performance can artificially come down or come up in ways that make it not that useful. So any time you look at encounter-level performance, you have to look at a lead-time analysis, where each of these notches represents a time that the model exceeded the alerting threshold. Now, in reality, you wouldn't necessarily send out alerts every 15 minutes, but this is just to show you that there are a handful of patients who were consistently exceeding the alerting threshold throughout, represented on the top half of those graphs, whereas the bottom half shows patients who alert for the very first time only minutes before experiencing the outcome, where the model is probably not all that helpful. And panel B is basically a zoomed-in version of panel A, focusing on the 24 hours prior to the event; all of the patients shown here experienced the event. Strategy two is to look at prediction-level performance, where models are judged based on every single prediction. So a single high prediction for a non-event wouldn't dramatically change the AUC, even if it generated an alert, because the remainder of the predictions are totally fine, like in that bottom right panel. However, the timing of the outcome relative to what I would refer to as the prediction horizon does affect the AUC. So you can see here a prediction horizon of four hours, represented by that pink bar, versus 10 hours. Well, if that patient exceeded the alerting threshold, then according to the 10-hour horizon they experienced the outcome, and so that's a correct prediction. According to the four-hour pink horizon, they hadn't experienced the outcome within the next four hours, and so this would be considered an incorrect prediction. So you can see how prediction-level performance and encounter-level performance provide different types of information. I like to use both strategies when we look at models, because they both measure relevant but different characteristics.
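Here is a rough sketch of how those two strategies could be computed from prediction-level data. The data frame, column names, prediction interval, horizon, and event rate below are all simulated and purely illustrative.

```python
# A sketch of the two discrimination strategies: encounter-level AUC (highest
# pre-event score per encounter) versus prediction-level AUC (every prediction
# labeled by whether the outcome occurs within a fixed horizon). Simulated data.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
rows = []
for enc in range(300):
    event_time = rng.uniform(24, 72) if rng.random() < 0.2 else np.inf
    for t in np.arange(0, 72, 4.0):             # a new score every 4 hours
        if t >= event_time:
            break                               # exclude predictions after time zero
        base = 60 if np.isfinite(event_time) else 40
        rows.append({"enc": enc, "t": t, "score": rng.normal(base, 15),
                     "event_time": event_time})
df = pd.DataFrame(rows)

# Strategy 1: encounter-level -- each encounter contributes its highest
# pre-event score, labeled by whether the outcome ever occurred.
enc_df = df.groupby("enc").agg(
    max_score=("score", "max"),
    event=("event_time", lambda s: np.isfinite(s.iloc[0])))
print("encounter-level AUC:", roc_auc_score(enc_df["event"], enc_df["max_score"]))

# Strategy 2: prediction-level -- every prediction is labeled by whether the
# outcome occurs within a fixed horizon (here 8 hours), then treated independently.
horizon = 8.0
df["label"] = (df["event_time"] - df["t"]) <= horizon
print("prediction-level AUC:", roc_auc_score(df["label"], df["score"]))
```

In practice you would pair the encounter-level number with a lead-time analysis, as described above, so that an alert firing minutes before the event doesn't get counted as a win.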
So once we've got a model implemented and running in the background, and we've measured some initial performance with it running silently, how do we know that the model is good enough to use? Now, in reality, when you read machine learning papers, there are lots of papers where the AUC is 0.95, the model is perfectly calibrated, and there are absolutely no questions about whether that model is good enough to use. But the reality is that most of the things we're trying to predict are not that predictable to begin with. So if you have a model that can predict sepsis upon admission, and the area under the curve is 0.70, and the model is slightly miscalibrated, it's really not immediately clear whether that model is good enough to use. And if you ask a clinician this question, they'll say, well, it depends, but they often won't know what it depends on, or they'll say it depends on the performance, and it becomes a circular argument. What it actually depends on is how many patients you're willing to evaluate to capture one case of sepsis. This question, and the resulting performance of the model, can be displayed visually in the form of a decision curve. Decision curve analysis displays this in the units of net benefit. So for example, if you're willing to evaluate between 10 and 20 patients to capture one case of sepsis, that means you would want to be about five to 10% certain before you go ahead and evaluate a patient. How did I get that five to 10%? I just mathematically derived it from the 10 to 20 patients that you're willing to evaluate, because one over 20 equals 5% and one over 10 equals 10%. So that's essentially the relevant range of thresholds you should consider when implementing that model. In this case, implementing the model, which is represented by that red line, appears to have a higher net benefit than a strategy where you evaluate all patients, represented by that gray curve, which actually looks like a line here but is technically a curve. And that black line is basically a net benefit of zero, which represents a strategy where you evaluate no one, and thus there's no benefit that could possibly arise. This can't necessarily tell you if the model is better than clinical practice in its contemporary form, but you could in some cases chart current practice as a curve and see whether a model-driven practice would be expected to achieve a higher net benefit than your current clinical practice. There's a lot I haven't talked about here, and there's a huge literature on decision curves that I would point you to, but we've actually started using decision curves and this kind of net-benefit thinking when we first triage requests for prediction models, because we'll start by asking people not only how they plan to use the model, but how many people they would be willing to screen or evaluate to capture one case. And it can be really instructive. If you have an intervention that's not very effective, you may only want to screen two to three patients, because you don't want to spend too much time screening patients if the intervention isn't very effective. But if you have an effective intervention, or the consequences of missing a case would be severe, you would be willing to screen many more patients. And you can calculate net benefit at the clinically appropriate range of thresholds based on that number willing to evaluate.
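As a quick sketch of the arithmetic behind a decision curve: net benefit at a threshold probability is the true-positive rate minus the false-positive rate weighted by the odds of the threshold. The predicted probabilities, prevalence, and thresholds below are simulated, not from any of our models, and the 5 to 10% range just mirrors the "willing to evaluate 10 to 20 patients per case" example above.

```python
# A minimal sketch of net benefit, the quantity plotted in a decision curve.
# Net benefit at threshold pt = TP/n - FP/n * pt / (1 - pt). Simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
y = rng.random(n) < 0.07                                   # ~7% prevalence of sepsis
p = np.clip(0.07 + 0.10 * y + rng.normal(0, 0.05, n), 0.001, 0.999)  # model's predicted risk

def net_benefit(y, p, pt):
    """Net benefit of evaluating every patient whose predicted risk is >= pt."""
    act = p >= pt
    tp = np.sum(act & y) / len(y)
    fp = np.sum(act & ~y) / len(y)
    return tp - fp * pt / (1 - pt)

prevalence = y.mean()
for pt in (0.05, 0.075, 0.10):                             # thresholds from 1/20 to 1/10
    nb_model = net_benefit(y, p, pt)
    nb_all = prevalence - (1 - prevalence) * pt / (1 - pt) # "evaluate everyone" strategy
    print(f"threshold {pt:.3f}: model={nb_model:.4f}, treat-all={nb_all:.4f}, treat-none=0.0000")
```

The decision curve is just this calculation swept across thresholds, which is why the clinically relevant threshold range, derived from the number willing to evaluate, is where the comparison actually matters.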
So let's say you've got a model, and you've decided it's good enough to use because it has a positive net benefit and looks like it adds value over current clinical practice. What do you do with it now? How do you incorporate it into a workflow? In other words, do users actually agree on how to use that model? First of all, do physicians even agree on the definition of the problem and the right course of action? I would say that, on the surface of it, sepsis appears to be a problem that people should generally agree on, if not on the definition then at least on the right course of action. I say that because there are several papers in the literature that look at prediction of sepsis, whether it's sepsis-related mortality, sepsis in the ICU, the accuracy of a sepsis model, or an evaluation of a sepsis model's performance in a non-ICU setting. But of course it starts off with: what is the definition of sepsis? There's the Sepsis-3 consensus definition, which sounds really promising because it has the word consensus in it. Then there's a definition used by the Centers for Disease Control, which is the adult sepsis event definition. The CDC uses that definition to track the incidence of sepsis year over year. And then there's the Medicare SEP-1 definition, which is what Medicare uses to define cases of sepsis for whom they assess the quality of sepsis care based on the SEP-1 bundle. So here's the publication on the Sepsis-3 definition, and here's one on the Medicare SEP-1 definition. But as you can see, even though Sepsis-3 is the consensus definition, if your health system is being judged on its quality by the Medicare SEP-1 definition, it's certainly possible that you could improve the care of patients who meet the Sepsis-3 definition but not the SEP-1 definition, and it would look to the outside world as though your sepsis care hasn't improved at all. So in a health system, you need to know which definition you are primarily trying to improve. And any one of these three, I think, would be a reasonable one to choose, because they are all formal ways of coding for sepsis. Even if you agree on which patients have sepsis, do physicians agree on the right course of action? I really enjoyed this study by Downing and colleagues, where they randomized patients to an electronic health record-based clinical decision support alert for patients with severe sepsis. They delivered this alert to, I think, over 1,000 patients, and they found that there was absolutely no difference in primary or secondary outcomes, including in the amount of fluids administered. Interestingly, when they went and asked the clinicians why they didn't administer fluids even though they got an alert, only two out of the 26 physicians had administered the recommended 30 milliliters per kilogram of IV fluids in cases where there was chart-review-confirmed sepsis with hypoperfusion. The physicians reported that the primary reasons they didn't give fluids were that they were worried about fluid overload, or that they felt the patient wasn't sufficiently hypotensive to warrant that degree of IV fluids, although some also disagreed with the diagnosis of sepsis itself. So just because you have an effective model, or a rule-based system that detects a problem early, doesn't mean that you can actually improve even process measures when physicians don't agree on what the right processes are.
Although we've primarily been talking about alerts as a way to notify users, alerts aren't the only way that we inform users of problems. So here I want to talk about several of the different ways in which the output of a model can be operationalized. Obviously the most interruptive way to notify a clinician of a high-risk patient, for example, would be a page or another interruptive notification. A page is probably the most interruptive because it doesn't matter if you're at home, as long as you're carrying a pager, and it doesn't matter if you're eating lunch; a page will go off regardless. And if the situation is important enough that it needs to interrupt you, then a page is probably the most appropriate way to deliver that information. Most clinical decision support alerts are not interruptive in the sense that they don't interrupt you at lunch, but they are interruptive in the sense that if you open a patient's chart, or open the electronic health record and log in, they'll interrupt you to tell you that something is going on that you need to take care of. Most of these are only active when you actually click on a patient's chart and are trying to do something else, and then they'll remind you that a patient is missing a specific order or that there's some issue that needs to be attended to. Another way is a column in a clinician's sign-out or on a clinician's schedule, where they can just see the score. It's visible in front of them, but it's not actually changing their workflow; they can continue to move on and do what they need to do. You can also dig in and look at a score just like you can look at a vital sign. In fact, in some cases where we've had adverse events, like following surgery in the PACU, we've had sentinel reviews where we look into those cases and try to figure out what went wrong and what could have been detected earlier. And as part of that chart review, we've increasingly started to look at these various scores to figure out whether any of them could have identified the patient earlier than they actually came to clinical attention. You can also look at a dashboard of patients sorted by acuity and then intervene on the patients at highest risk. Our rapid response teams currently do this for deterioration. They come in in the morning, sort patients from highest to lowest risk of transfer to the ICU, and then go check in on the top 10 or 20 patients proactively to see if they can either facilitate ICU transfer or try to prevent it. And then of course, at the bedside, you can use decision aids, and these are completely voluntary, which is why I say they're the least interruptive way of using a model's predictions. How you use a model obviously has major implications for equity. If you've got a no-show model that predicts which patients are not going to show up to clinic, you could use that model to double-book patients, or you could use it to cancel visits, arrange for transportation, or switch the visits to virtual, and how you use the model will determine the downstream impacts on equity experienced by your community. So you have to be really careful about how you actually use these models. So let's say you've agreed on how to use a model, you've implemented it, and you've started using it. Now you want to know: when you use it, is it actually effective?
In other words, you probably implemented a model to try to improve some performance metric or some quality measure. So the question is, did the model actually achieve that? Did the model actually improve patients' clinical outcomes? The question of whether the model plus intervention improves care requires you to first evaluate the model and make sure the model is good, and then often to nest that model within a randomized controlled trial to figure out whether the intervention, when linked to the model, actually improves the outcome you were trying to improve. This was a paper we wrote where we were trying to reduce the number of ER visits by identifying high-risk patients and then randomizing them to an intervention involving case management. Interestingly, the model was relatively good, but when we actually put it into clinical use, linked it to an intervention, and studied it in a randomized controlled trial, the intervention was ineffective even though the underlying model was good. So this just goes to show you that the story isn't finished when you look at the model. The model being good is maybe a precursor, but only if the intervention linked to the model is also good do you have a package that's worth sustaining. So once you've got these models up and running, you have to govern them. You have to have a process in place to evaluate these models. How does one go about governing them? You can think of that governance as a form of social infrastructure to balance out the technical infrastructure that you needed upfront to even get started. In our health system, the way this arose was that we had a clinical decision support committee that was managing all the rule-based decision support alerts in the electronic health record. So if you have a patient with diabetes and there are certain quality measures that need to be checked and certain things that need to be acted on based on clinical rules, that's the kind of rule that would go to our clinical decision support committee to get triaged, evaluated, and then eventually endorsed, approved, and built. In July of 2018, our health system, Michigan Medicine, established the Clinical Intelligence Committee as a sister committee to the decision support committee, but focused primarily on overseeing the implementation of machine learning models operationally. So if we think about the spectrum from research to translation to operations, we already had a lot going on in the research world. We had an AI laboratory that was world-class. We have the Michigan Integrated Center for Health Analytics and Medical Prediction, also known as MiCHAMP, which brings together researchers across our multiple campuses and our multiple schools and colleges focused on AI research. And then there's the precision health initiative launched by our university, which focuses on taking these research discoveries and getting them to the bedside in the form of studies and research. I refer to that as the translation phase. But then there's day-to-day clinical operations, which involves things we need to predict, like sepsis, that are also of relevance and interest to the research community, as well as various workflow-related issues that may or may not be as interesting to those same research communities.
So we needed a place that could decide whether a model should actually get implemented in operations and whether we should start using it. And that's where the Clinical Intelligence Committee comes in. But it's not enough just to have a committee. You need, as part of your social infrastructure, a reporting structure where that committee actually reports up through a normal chain of command. In our health system, that chain of command goes through our IT organization and then back up through health system leadership, and you can see it follows a relatively similar path to our Clinical Decision Support Subcommittee. Our Clinical Intelligence Committee, unlike some of the other committees that we have, is a highly multidisciplinary committee. It's made up of physician informaticists, including our Chief Medical Information Officer and several of the associate CMIOs. Our Chief Nursing Information Officer and other folks from nursing informatics also make up our committee. Because we pull in live information from labs, we have pathology informatics as part of our committee. And then we have analysts with expertise in how to integrate models into the electronic health record and where the operational data lives. We have quality analytics folks and folks with expertise in learning health systems. And we recently added research informatics, because we recognized that, although we initially saw our role as primarily being in operations, the work and the people that go into operationalizing models are the same work and people that go into implementing models for the sake of prospective validation and research. So we have a prioritization structure that encompasses both operations and research. And of course, you need ad hoc members for each of these models who understand the clinical problem, to make sure that the model, when linked to a workflow, is actually expected to solve that clinical problem. Lots of different governance issues come up. Questions arise like: who can request a model? Who evaluates whether these models work? Is it the vendor that evaluates, or do we evaluate? Who approves models to be integrated into the EHR? That takes substantial time and effort, so you need a group of people who can prioritize an analyst's time over other things on their plate so that these models actually get built and integrated. Who prioritizes which model is integrated first? What if there are competing models developed by competing research groups, or by a research group and a vendor? We have a situation right now where we have two research groups and a vendor model all competing against each other, and so you have to be very clear about setting fair ground rules for looking at these things. You don't want one model to look better simply because of the way its performance was calculated; rather, you want the process to be fair. What if someone updates the model? Does that require re-approval? What happens if a model stops working as intended? What if it starts sending pages in the middle of the night every 15 minutes? What processes are you going to have in place? What level of support is going to be in place so that these models can be appropriately deactivated, fixed, and put back online? And then, what happens if a model's performance is dropping over time? Whose responsibility is it to keep an eye on that?
Because implementing and using a model is only the beginning of the journey, not the end. There's a really interesting paper from JAMIA which shows that a model trained several years ago at a VA hospital to predict acute kidney injury was perfectly calibrated in year zero, where the observed-to-expected ratio is right at one. But if you let that model continue to run, then because of changing rates of acute kidney injury in the hospital, the model will become systematically miscalibrated. You can see here how the model's observed-to-expected ratio is dropping. In other words, the model is expecting to see a lot of AKI, and far fewer cases are being observed than expected. So implementing machine learning models within a health system is fraught with challenges that go well beyond model discrimination and calibration. Technical infrastructure is critical to get started on this journey, but governance is needed to monitor and sustain these efforts, to make sure things are working, and, when they're not, to modify or de-implement models. I want to thank a number of folks here, including Fadi Islain, who was the first chair of our Clinical Intelligence Committee, several other collaborators listed here, and many others who I haven't named, including members of my lab. Thank you, everyone. Feel free to reach out to me by email or message me on Twitter, and I'll stick around for questions. Thank you so much.