So, welcome back. It's my pleasure to introduce our first external speaker of this summer school, Finale Doshi-Velez, from Harvard University. Finale did her master's at Cambridge, her PhD at MIT, and then a postdoc at Harvard Medical School. And now she is the John L. Loeb Associate Professor in Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences. She's a star in the field. She has won numerous awards, a Sloan Research Fellowship, for example, and an NSF CAREER Award. She's, according to IEEE, one of the top 10 in AI to watch. The list goes on; she's a faculty award recipient as well. And her research interests lie at the intersection of machine learning, health care, and interpretability. Importantly for our community, she's one of the people behind the machine learning in health care meeting that has been recurring in the US over the last few years, so an important meeting in our specialization. So I am very happy to have you here, Finale, and to listen to your talk about machine learning in health care. Welcome. Thanks for having me. It's really exciting to be able to join, especially because sometimes it's hard to travel, but I'm glad that I can join remotely. As I get started, one thing I just wanted to say is that I really welcome questions throughout the talk. The advantage of me giving this talk live versus recording it in advance is that you can jump in and ask questions. We can meander a little bit. We can go on different topics. So please, I really welcome interruptions and questions as I go through. All right. So I'm going to be talking about reinforcement learning in health care. And what I want to do is just jump in. I'm going to give you an example of a simple and effective insight that came out of work in our lab that's helped out in two areas.
So the first question that we were looking at was how to sequence drug cocktails in the context of managing HIV. And the second question we were looking at is how to manage hypotension for patients in the ICU. And both of these questions, they look very different. But what they have in common is that they're both sequential problems with long-term effects. So if you are, for example, sequencing these cocktails, you want to make sure that you don't get drug resistance later on because of a choice that you're making now that removes your options. If you're trying to manage hypotension in the ICU, well, there might be ways that immediately change the patient's blood pressure, but have a longer-term negative effect in the context of the days that they're in the ICU versus the few minutes after the treatment is administered. So we're in a situation where there are long-term delayed effects and then a sequential set of decisions that have to be made. So RL seems like the natural thing to be doing. So before I go into that simple and effective insight, let me just mention a bit of background terminology on the RL setting. And by the way, most of this talk is meant to be at a reasonably high level, because I believe that a lot of the audience is starting out your PhDs, which is great. But I'm happy to jump into more technical details as we go. So this is structured as ideas, and if you want more, we can always, again, take questions so we can jump into more detail. All right, so the basic setting of RL is that we have the agent and we have the world. And the agent sends actions out into the world. We don't actually know what's going on inside the world itself. It's a black box to us. The world is sending back observations and rewards. And there are two main things that we can be thinking about that live inside the agent. The agent has some sort of memory or model, something to store information about its past experience. And the agent also has some notion of where it is now.
So usually we call that the state. We may call it something else, but state is the most common term. And then observations and rewards both update the way the agent thinks the world works, and they update the state. And those two things together are used to select the action. So for example, in the case of HIV management, we could say that there's some unknown patient state. We get measurements like the patient's CD4 counts, their viral loads, the mutations in the virus. And different drugs move us from state to state, which affects the different observations that we see. We have some rewards associated with where we want those viral loads and those mutations to be. And based on this reward and knowing how the model works, we could try to optimize. We could decide what is the best sequence of drugs to give to be able to get the highest possible rewards. So this would be a classic RL sort of way of attacking the problem. Another thing you could do is just look for patients that are very similar to yours. So you could say that here is my patient's history over here. This patient happens to be right here. Here's another patient who looks very similar to my patient right now. And let me see what treatments worked for this patient. Or what treatment did this patient try? Did it work? And the advantage of something like this is that now the memory, that memory cell in that first slide, is really just keeping track of all the patients in your database, because when you need to choose an action, you're just trying to find the patient that's closest. And the advantage of this is that making a model, a parametric model, might be quite difficult. But here you're just trying to find patients that are similar, and you hope that they have similar properties. Now our key insight was that these two ideas, whether you make a formal model or you just look for similar patients, have complementary strengths and they can be combined.
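To make that combination concrete, here's a minimal Python sketch of the fallback logic, not the actual method from the paper: the function and variable names, the distance threshold, and the majority-vote rule are my own illustration of the idea that you use neighbors when close ones exist and otherwise fall back on a parametric model.

```python
import numpy as np

def mixture_policy(patient_features, cohort, model, k=5, distance_threshold=2.0):
    """Hypothetical sketch: act from nearest neighbors when close ones
    exist in the retrospective cohort, otherwise fall back on the
    parametric disease model."""
    # Distances from this patient to everyone in the cohort.
    distances = np.array([np.linalg.norm(patient_features - other["features"])
                          for other in cohort])
    neighbor_idx = np.argsort(distances)[:k]
    if distances[neighbor_idx].max() < distance_threshold:
        # Lucky case: close neighbors exist, so vote with the actions
        # that worked for those similar patients.
        actions = [cohort[i]["best_action"] for i in neighbor_idx]
        return max(set(actions), key=actions.count)
    # Unlucky case: no close neighbors, so use the model instead.
    return model.best_action(patient_features)
```

The real work (the kernel and the POMDP discussed next) is of course far more involved; this only shows the complementary-strengths switch.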
So if you are lucky enough to have nearest neighbors, then you can just find your nearest neighbors. And that's fantastic. If you're unlucky and don't have nearest neighbors, you can fall back on this model that you've built about how the disease evolves and use that instead. So again, very simple idea. Basically, we're saying that there are multiple ways of trying to solve this problem of choosing actions. And sometimes it's good to build a model of the disease. And sometimes it's good to just look for patients that are similar to yours. So in this case, we had finding nearest neighbors, that's the kernel, and making a model, that's the POMDP. You put in some patient statistics and you get out the actual action. And I'm moving through this relatively quickly because this is just a story to get us into the main act, which is validation; that's the main part of this talk. But this is an example of a line of work that we started doing that made us think about validation. We're going to get to that. So we took this relatively simple model, a simple idea of mixing these two parametric and non-parametric models together. And over here, we applied it to a database of patients from the EU, about 33,000 patients. The action space, notice, is pretty large, because there are a lot of relatively common drug combinations being given. And we compared to a variety of policies. So here are the neighbor policies; think of those as kernel-only. So what happens if you try to recommend drug cocktails by just looking for nearest neighbors? And the model-based policy was: what if you recommend actions just based on building a model of how the viral loads will evolve over time, so an actual POMDP model of, time step by time step, what will happen? And you'll notice these estimates of the reward; we'll get back to what they mean. The DR stands for doubly robust. We didn't actually go out and experiment on patients, obviously. So how did I get these numbers?
I got these numbers by using an off-policy estimator, which I'll talk a little bit more about later. But you notice that based on this off-policy estimator, the model doesn't do great. Looking for neighbors is much better. But if you combine the models and the neighbors in various ways, you can get significantly better results. The model does end up helping out if you use it just at the right time, fitting into the theme of precision medicine. So that was great. You'll notice that this work was first published in 2017. We started it probably in mid-2015 or so, this line of work. This summer, our most recent work along these lines has been an extension that I'm really excited about, where we did some transfer between cohorts. Cohorts from the EU typically are measured much more regularly. Patients in the EU often have their virus sequenced on a regular basis, so you can actually know which mutations are happening as they occur in the virus. So you get these very well curated cohorts. And our question was, could we transfer from these well curated cohorts to less well curated cohorts in South Africa? And again, the idea was that you can't always transfer, because the strains are different. Not all the strains are present in both places; there are more strains present in South Africa than are present in Europe. So we can't just directly transfer. But we can do a mixture, where we transfer sometimes, when we believe that it's useful, and we don't transfer when it's not useful. And again, you see across the bottom, we have better numerical results. The final example that I'm going to give in this early string of, here's a cool idea, and look, it at least seems to do well, is an example applied to sepsis management, or probably more accurately, hypotension management. So we're focusing on vasopressors and fluids, which are used to manage circulation, in particular hypotension.
And here we took a cohort of patients with sepsis from the MIMIC data set. And our goal was to reduce 30-day mortality. So again, we had to come up with a reward function. We had measurements now from patients in the ICU. And we tried to optimize to reduce their mortality, or their odds of mortality, over the next 30 days. Same idea, a slightly different instantiation in terms of the model and the neighbors, but the same basic idea of using mixtures. Again, it performed significantly better. So here we tried two different ways of compressing the data. You can ignore the recurrent and non-recurrent distinction for now, but we have: how well did the physicians do? How well did they score? How well if you only looked at your neighbors? How well if you only used a model, in this case a DQN instead of a POMDP, but some sort of parametric model? And how well do you do if you use a mixture? And you find that the mixtures do better. So far, we're seeing nice numbers, right? Nice numbers make you feel good. And to back up those numbers, or to just make those numbers more concrete, we've made these sorts of plots. And now I do want you to pay a little bit more attention, because so far the story has just been, hey, we have better numbers, right? So here's what these plots are saying. What we did is we looked at the difference. The x-axis here on this plot is showing the difference between what we recommend and what the doctors did, right? So if we're at zero, it means that the doctor did what the algorithm recommended. And if we are far from zero on the x-axis, that means that the doctor did something very different from what we were recommending, right? That's what this plot is showing on the x-axis. And on the y-axis is the mortality rate for patients in that category.
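As a rough sketch of how such a plot is computed, here's hypothetical Python (the function name, binning scheme, and bin count are my own, not from the talk). Keep in mind the talk's own caveat: this kind of curve is a sanity check, not a validation.

```python
import numpy as np

def mortality_by_agreement(recommended, administered, died, n_bins=5):
    """Group patients by how far the clinician's dose was from the
    policy's recommendation, then average mortality within each group.
    A dip at zero difference is what makes these plots look convincing."""
    recommended = np.asarray(recommended, dtype=float)
    administered = np.asarray(administered, dtype=float)
    died = np.asarray(died, dtype=float)
    diff = administered - recommended  # 0 => doctor did what the policy said
    edges = np.linspace(diff.min(), diff.max(), n_bins + 1)
    bins = np.clip(np.digitize(diff, edges) - 1, 0, n_bins - 1)
    # Mortality rate per agreement bin (skipping empty bins).
    return {b: float(died[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)}
```

Nothing in this computation depends on the policy being good; any policy, including a random one, can be dropped in as `recommended`, which is exactly the issue described next.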
So in this plot, what you're seeing is that when the doctors happened to do what the agent recommended doing, what our algorithm recommended doing, the mortality rates seem to be lower, right? These plots are both kind of U-shaped. There's this funny thing over here; we'll get to that, and more than that. But they have this kind of U-shapedness, suggesting that when you look at this, you think to yourself, okay, when the doctor follows our recommendation, mortality rates are lower. Maybe these estimators, these statistical, doubly robust sorts of things, seem complicated, but here's a sanity check that looks very convincing. So this is where we were. This was about 2017. And then I was teaching a class on off-policy RL, and some of the students said, well, what if we plot just a random policy, or a policy that does no action at all? How would it look? Clearly that should do badly, right? We shouldn't get the U-shaped curve. And this is what we get, or this is what we saw. So when we plotted, now you're seeing overlaid in green over here the random policy. In black is just taking no action whatsoever. And they also kind of look U-shaped, right? And so this was our eye-opening moment, where we realized that some of our evaluation measures are not nearly as robust as we wish they would be. And when you see some nice looking plots, it's not necessarily the case that we found a much better way of treating hypotension, because by that argument, the random policy here would also look like a great way to treat hypotension for patients with sepsis in the ICU. So in particular, this issue happens because of the spread of what treatments are actually being given. Most of the time, nothing is being done to the patient.
So there's a lot of data that's kind of sitting around here. And most of the time patients are doing okay. So you're getting biases from where the data are actually located. That's why this particular plot happened. And I point you to the arXiv post that we have to get more details on this. The main point that I want to make is that this is a weirdness, and this issue is just one example. When we're computing these off-policy estimators, there are a lot of assumptions that we end up making: can we predict the behaviors of the clinicians? How well are they known? To what extent are there confounders in the data set, et cetera. So since 2017... If I may, there are two questions that have received a number of votes here on Slido. Let me read them out. So the first one is, how big do the cohorts need to be for effective application of reinforcement learning? So that's a great question. I think it really depends on the decision that you're trying to make. The way we think about it is that there are places where you want to recommend maybe a change from what you're doing now. And at those places, you need two things. You need to have enough variation, because if the clinicians always take one action at a particular point in time, you have no evidence for what would happen if you took a different action instead. And that needs to happen sufficiently often. So what we're actually doing currently is searching inside our cohorts to find places like, okay, here's a place where there are 100 patients and two different actions taken with 30% or 70% probability or something like that. Here's a place where we can actually make a recommendation. So I think if you have a smaller cohort, what will end up happening is that you'll just have potentially fewer places where you can actually make a recommendation statistically responsibly.
Whereas if you have a larger cohort, maybe with sufficient variation, you'll have more opportunities to make recommendations. I see. Thank you. And the second question is, in reinforcement learning, handling continuous actions is a problem. Are you confronted with continuous actions? If yes, how do you handle them? That's a great question. So we have discretized our actions so far. I don't necessarily think that handling continuous actions is a problem, in the sense that I think sometimes they're easier, because you can differentiate through a policy gradient or just a Monte Carlo rollout when you have continuous actions. I think sometimes the optimization is actually more straightforward with modern RL methods and continuous actions. You can borrow a lot from control theory also, which has been dealing with continuous actions for a long time. I think we discretize because we just don't feel like we have the data to give us the resolution for continuous actions. And we talk to the clinicians, and especially in these sorts of situations, fluids are given in boluses. So you're never going to recommend giving someone 359 milliliters of fluid, because fluids come in bags. So our action space naturally becomes discrete, even though milliliters is a continuous quantity. Very good, thank you. Cool, yeah, thank you. All right, so these sorts of realizations, and there are many, again, in that arXiv post that I'm listing here and the Nature Medicine commentary that we have, really made us think carefully about the fact that, for the most part, I started out in this field as the build-AI-systems person. That's kind of often what we want to be doing. We want to be creating the new system that's going to be providing value to patients and to clinicians and all of these other people. So that was where we started.
But what we quickly realized is that we needed to think much more about validation. And so today, I'm going to be focusing on validation with batch data, the idea being that you're given a cohort. It might be this EuResist cohort of patients with HIV that have been monitored over many years. It might be this MIMIC cohort. Any cohort that you get from your hospital or registry or other source. And I think this is a relatively common situation, because there are a lot of situations where we are keeping records of what the doctors are doing. And now we want to mine those records, basically, to try to find out what's working and what's not working. And of course, I want to emphasize that that's not the end of the road by any means. This is just a place where a lot of us in this area end up having to do some work. And then if you find something promising, then maybe you start moving toward prospective things like a silent trial, a clinical trial, post-deployment monitoring, et cetera. But for today, what I'm going to be focusing on for the rest of the talk is validation. Again, using the motivation of our own lab, where we started out as the system builders, the innovators in terms of the AI methods. We had this cool idea. We were applying it all over the place. And it's still a cool idea, right? It's still a very valid idea. But what we realized is that we also needed to be spending time on validation. So here's the roadmap. And as I mentioned at the beginning, this talk is going to be relatively high level and broad. I'm happy to go into any particular area in more detail if people ask questions. But here's the picture that I want you to have. When it comes to validation, there are two major categories. There is validation that comes from statistics of various kinds. And then there is human-focused validation: get an expert to somehow check what you have. And these are both needed, right? It's not like you can do just one or the other.
And then within both of those, there are a couple of key distinctions. There are standard sensitivity and robustness types of things you can do in statistics. Then there's off-policy evaluation, which I briefly mentioned before. When you have human experts, you can ask them to validate a particular decision, or you can ask them to validate a whole policy. So these major areas, sensitivity and robustness, off-policy evaluation, inspection of individual decisions by experts, and inspection of whole policies by experts, that's what I'm gonna go through for the rest of the time here. And again, my goal is gonna be to give you a flavor of each of these. So if you end up in a situation where you're applying RL, or thinking about applying RL, to a batch cohort and you want to do validation, you have a menu of options. And many of these things apply for validation more generally. So even if you're not doing RL, for general validation, maybe in health or for retrospective data, this is a useful roadmap of different choices. So let's start out with sensitivity and robustness. On this one I'm going to be relatively brief, because I think it's fairly obvious, but I'm still gonna name the sorts of things that we can do. First of all, we can ask the question of whether the results replicate across sites and across different measures. So here we have an example: when we were doing our HIV work, we didn't look at just one cohort, we looked at two cohorts. And we also looked at three different types of off-policy evaluation metrics: a doubly robust estimator, an importance sampling-based estimator, and a weighted importance sampling-based estimator. And again, it's not important for you to know what those different estimators are; what is important is the fact that we tried multiple different estimators, and the results were consistent across cohorts and across estimators. We've also tried adjusting the reward function.
If you remember from a slide way back, our reward functions are really funky looking, right? And these come from a lot of back and forth with the clinicians to try to figure out what's the right form. But many times the clinicians aren't sure, right? Because it's very hard to convert a sense of what it means for a patient to be well into a number, a single scalar. So that's another axis that one could imagine playing with. And again, these sound like super simple things to do, and they are super simple. All I'm saying is that this is a very important category of validation, because if things like this don't check out, then you know that there's potentially some kind of trouble. So in addition to the reward function, we do all sorts of other sensitivity analyses. For example, this doubly robust estimator, the way it works, requires you to guess what the value of your policy is going to be. And it's doubly robust, so it's supposed to be robust even if you get that value estimate wrong, in the long run, right? Asymptotically, as N goes to infinity. But what about the short run? In the short run, parameter choices, even if you have statistical estimators that are consistent and valid, these parameter choices are gonna matter. So here again, it's not important for you to understand exactly what the plots are; I'm gonna give you just a sense of what we did. This is me showing you our homework, right? This is our scratch paper before we published some of those later papers, after we discovered the issues that can possibly go wrong. So what we did is we tried several different ways of producing that estimate of the value of the policy, which was an important parameter. And for each one of those, we ran many, many bootstraps of the data set. And we looked at how often our policy looked like it was doing better across all the bootstraps for that particular choice of parameter. And we looked across different parameters.
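That bootstrap check can be sketched in a few lines of Python. This is my own illustration under assumed names: `estimator` stands in for whatever off-policy estimator is being stress-tested, and the procedure would be repeated for each setting of the value-estimate parameter.

```python
import numpy as np

def bootstrap_win_rate(trajectories, estimator, baseline_value,
                       n_boot=1000, seed=0):
    """For one hyperparameter setting of an off-policy estimator,
    resample the cohort with replacement many times and count how often
    the learned policy's estimated value beats the baseline."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    wins = 0
    for _ in range(n_boot):
        # Draw a bootstrap resample of the whole cohort.
        sample = [trajectories[i] for i in rng.integers(0, n, size=n)]
        if estimator(sample) > baseline_value:
            wins += 1
    return wins / n_boot
```

Running this across a grid of parameter choices and seeing a consistently high win rate is the kind of robustness signal described here; a win rate that swings wildly with the parameter is the warning sign.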
And when we found that many times our policy looked like it was doing better for different choices of the parameter across all the bootstraps, we were like, okay, at least it looks like this signal is robust, right? It's not super sensitive; it's not like we had to get this particular parameter chosen exactly right. And that makes you feel good, because you don't know in real life how to pick these parameters exactly, right? So I'm hoping that that is, maybe it's obvious, it's probably obvious, hopefully it's obvious. These are just important things to do, because many times when you do these sorts of checks, they sound super simple, but you still uncover issues that are funky and you have to dig a bit more. Cool. All right, so that's the basics. Then the more complicated thing that many times we turn to is this notion of off-policy evaluation. And the key idea here is that when we have data, the data is collected under what we're gonna call the behavior policy, π B. And the behavior policy is the policy that is being followed by the clinicians. So this registry, this electronic health record, this data set, it consists of data that's just measuring the day to day, right? The clinicians are not doing any particular experiment for you. They're not being told to try this out on this patient and try that out on the other patient. They're just doing their thing. They're just trying to help people the best they can. And we're looking at that to try to glean insights, right? So that's the batch setting that we're in. So here we have different patients. We have their sequences of observations and actions. And then based on that, we build the AI system, right? That was the first block before we got to the validation block. We come up with a new policy, a policy π E, that's our evaluation policy. And we want to know: what sort of trajectories will that create? Will it create better trajectories? How do we know that it will create better trajectories?
That's the question that off-policy evaluation is trying to answer. So there are two major categories of ways of solving this problem. The first way that this problem ends up being solved is a category of techniques that are based on importance sampling. And the idea here is that if we looked at the original data set, all the trajectories look the same. There's no reason to suggest that patient one is more relevant than patient two. But now what we're going to do is look at all the trajectories in our data set and compute this importance weight. This is the simplest version, where our importance weight corresponds to the ratio of the probability of an action under the evaluation policy, the action that we're suggesting, over the probability of the action under the behavior policy, under what the clinician was doing. So if you understand importance sampling, this ratio makes sense; if you haven't seen importance sampling, the idea here is still pretty intuitive. If π E assigns a very low probability, that means the clinician took an action that you thought was unlikely, you wanted to take some other action; the π E is close to zero, so we're not going to look at that trajectory. And the trajectories that are going to end up with high weights necessarily have to have reasonably sized π E's. In other words, those are trajectories where the clinician, perhaps just by chance, ended up taking the actions that you would have recommended. So we're weighing the trajectories based on how likely those trajectories are to have been produced by π E. And again, for those of you familiar with importance sampling, this is exactly importance sampling. Nothing new, just applied to this off-policy, or this RL, setting. So that's the main concept here. There are lots of ways to try to reduce the variance, because that's the big issue.
Like, what if only a few patient trajectories have our recommended action? Especially because of this product, many times the weights can quickly get very small, right? Not only did the clinician have to take the action I recommended at time one, but they have to take it at time two, time three, time four, all the way through the trajectory, because any time they take an action that's different, you get a single zero in this product and the entire product goes to zero, right? So you can lose a lot of trajectories. Your data set might have thousands of patients, but you end up with only a few patients that have non-zero weights, or non-trivial weights, after you've done this weighting. And that leads you to high variance. So actually, to the question that was asked before of how large the cohort needs to be: a common evaluation measure in off-policy evaluation for these sampling techniques is effective sample size. And what we've found is that very easily you can have a cohort with 18,000 patients and end up with effective sample sizes that are a few hundred. And what we do in practice, and I think I have a slide on this later, is we limit the types of policies we consider to make sure that the effective sample size is big enough. But the big issue here is variance, right? You can lose a bunch of your data. Another way you can attack this problem is to say, okay, well, I have all this data. Why don't I build a model based on the data that I have? And once I've built the model, I can simulate outcomes to my heart's content. No more variance, because I've built the model; I can simulate from the model a zillion times and get the exact mean expected outcome of a particular policy, a particular decision I make. Of course, the trouble with that is if the model is off, then our results will be off.
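Here's a minimal sketch of those per-trajectory importance weights and the (Kish) effective sample size. The names are my own; `pi_e` and `pi_b` are assumed to return the probability of action `a` in state `s` under the evaluation and behavior policies, respectively.

```python
import numpy as np

def trajectory_weights(trajectories, pi_e, pi_b):
    """Per-trajectory importance weights: the product over time steps of
    pi_e(a|s) / pi_b(a|s).  One disagreement where pi_e is zero kills
    the whole trajectory's weight, which is where the variance problem
    comes from."""
    weights = []
    for traj in trajectories:
        w = 1.0
        for (s, a) in traj:
            w *= pi_e(s, a) / pi_b(s, a)
        weights.append(w)
    return np.array(weights)

def effective_sample_size(weights):
    """Kish effective sample size of the normalized weights: close to N
    when weights are uniform, close to 1 when one trajectory dominates."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)
```

With long trajectories the products shrink fast, which is exactly why a cohort of 18,000 patients can end up with an effective sample size of a few hundred.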
So there's a huge, huge literature on off-policy evaluation. And many recent advances in off-policy evaluation are basically finding clever ways to tackle this trade-off between the bias and the variance, the classic statistical trade-off. But the basic intuition is always the same. So if these kinds of pictures make sense to you, then you know what's going on even with more complicated and more recent advances. And just to share what some of these more recent techniques look like, I'll give you one example from work that we've done. In this case, we are using mixtures again; because I started out with mixtures, I thought, let me give you another example that uses mixtures. So here, let's suppose that you have two different ways of modeling a trajectory. In green here is the trajectory that you want. That's real life, but we don't know that. We have these two models that supposedly are kind of like real life, and we're trying to figure out which ones to use when. So it may look like right now it's probably good to use model one. But maybe if you use model one, then afterwards the modeling just goes off the rails, because remember, the models can have different levels of bias. And if you had chosen the model that had less error, sorry, more error at the beginning, you might have ended up overall sampling a trajectory, this is my guess of the trajectory, here's the real trajectory, that was closer to what would have actually happened. The counterfactual, the trajectory that you never get to see because the doctors never did it, but that you wanna be able to simulate anyway, because that's your recommendation. So, as I mentioned, I'm keeping this high level, so I'm not gonna give you the bounds and the maths behind it, but we found a way to mix between these two.
If you have two choices, or N choices, of different models, how can you choose among those models in a way that balances your long-term error rather than any sort of short-term error? And we did, and we get some nice results where using the mixtures, those are these darker ones over here, does a lot better than choosing either model alone, those are the yellow and the red. So that's just one example of these different off-policy evaluation techniques. But off-policy evaluation is very tricky, because a lot of assumptions have to be made. Most critically, in most of these techniques, you're making some strong assumptions about the states, what's observed, and what you have access to. So what I want to turn to now is this: all of these statistical things are really great because, first of all, many of them are grounded in a lot of really good theory. And also, we can compute them very easily, or at least relatively easily, right? So if you're iterating on some new AI system, you can get your off-policy estimate for different versions, different types of proposed policies, and you can check them, right? So that's really appealing: you don't have to go off and bug a clinical expert, especially these days when doctors are busy treating patients, and that's very important. So we want to be able to use our statistical methods to at least prune options. And that brings me to the last point on the statistical side, which is that even as we recognize the limits of these statistical methods, there's still one more important computational thing that we can do, which is to identify the most promising options to be presented to a clinician for further vetting, right? Because we're gonna have to turn to humans at some point to check for all the things that the statistics can't do. We also want to respect their time.
So the cartoon of this is that we have, let's say, a start and a goal, and there are lots of ways to maybe try to get from the start to the goal. And maybe with statistics we can say that some of these are bad, right? And there are only two options where we really aren't sure which is the better of the two choices. And if we can do that, we've already provided an important step towards reducing the space of choices that a clinician has to vet. So we did this sort of thing in two works from the last year, year and a half or so, where our goal, in terms of the equation, is to find a set of high-quality policies that are also different from each other. So here, different is defined as a KL divergence between the policies' action probabilities. And we also added this term here, which we term safety. And this goes back to this question of, what are the options we can even consider? In our case, we just put a hard boundary and say that if the clinician rarely took an action in a particular state, then we should not be considering that action. We are only gonna consider actions that are sufficiently commonly taken in a particular state. Because otherwise we could be hallucinating all sorts of crazy things. The model's gonna be unreliable; there's no way to have trained a reliable model in a place where we have no data. So as we do this, we try out different versions of safe and diverse, and you notice that our effective sample sizes are in the few hundreds on a data set of about 1800. So you might think, wow, we've lost a lot of data, but if you try this out yourself, you'll realize that a few hundred is actually pretty good for this data set. And we tuned this safety parameter in part to be able to get these larger effective sample sizes. And what I wanna show here is that the policies do look different.
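A minimal sketch of that safety boundary and the KL-based diversity measure might look like the following. The counts, the 5% threshold, and the projection step are all illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Empirical counts of clinician actions per state (rows: states, cols: actions).
counts = np.array([[40,  5,  1],
                   [ 2, 30, 10],
                   [ 0,  3, 50]], dtype=float)
freqs = counts / counts.sum(axis=1, keepdims=True)

# Safety rule: only allow actions the clinicians took at least 5% of the time
# in that state; anything rarer is off the table for the learned policy.
allowed = freqs >= 0.05

def project_to_safe(policy, allowed):
    """Zero out disallowed actions and renormalize each state's distribution."""
    safe = np.where(allowed, policy, 0.0)
    return safe / safe.sum(axis=1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    """Average KL(p || q) over states: the 'different from each other' term."""
    return np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=1))

# Two candidate (random, purely illustrative) policies, projected to be safe.
rng = np.random.default_rng(1)
pol1 = project_to_safe(rng.dirichlet(np.ones(3), size=3), allowed)
pol2 = project_to_safe(rng.dirichlet(np.ones(3), size=3), allowed)
```

In the real method the policies are optimized for value subject to this support constraint, with the KL term encouraging the set of returned policies to disagree; the sketch only shows the two ingredients being discussed.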
So here are the marginals over the different actions that are taken. In black is the behavior policy. So here you're just seeing histograms of action probabilities, and you see that the different agents have different histograms. I'm not expecting these to have meaning for you as machine learning folks, but these are things that let us start to say, okay, these policies are different. And now we can try to explain those differences to the clinician and say, it looks like there are a couple of reasonable choices here. What do you think is the right choice? Why do you think these differences are happening? And it's really important to have something to start that conversation. Which basically brings me to the second half of this roadmap, which is human-focused interpretability. Because at some point, the statistically focused methods are going to reach their limits because of their assumptions. And I'll just mention a few more things that go wrong, just to convince people. Measurements are taken at all sorts of times and then recorded at other times. So you may have issues with the quality of the data, right? The numbers that you see may not have happened at the times they're recorded at. That's a common issue. Many things might be observed by the clinician that affected the treatment choices they made but are never recorded in the electronic health record. They just see the patient and they're like, man, this person needs this sort of thing, or, here's my gut sense based on all these other things that I'm getting just from looking at the patient. There's all sorts of things that are unmeasured, and there's all kinds of errors. There might also be situations, especially in the ICU setting, where the reason why we did this for this patient is that it was crazy in the ICU.
I was dealing with three other critical patients, and this action was the simplest and safest, right? Maybe there was another action that could have been marginally better, but I had to take this one because I needed to keep in mind that I had four people I was taking care of, right? So there's lots of things going on in these data sets that you don't see. And for that reason, there always needs to be a human check of what's going on. So how do we do those human checks? Well, one thing that we have done is consider: what if we show the clinician a particular setting of the model and then ask them, does this make sense? Or we give them a recommendation and some explanation, and we ask them, what would you do in this situation? So here I'm gonna present just one example from a study that we ran last fall, which I was really excited about. In this study, we got around 200 psychiatrists off of Facebook; there's a psychiatrists' Facebook group, which is fantastic, and we advertised there. We said, we wanna do a study where we're gonna give a description of a patient, that's here at the top, and we're gonna provide recommendations. This was all Wizard of Oz, so the recommendations were fake. We provide recommendations of which drugs might work. So this panel shows whether they're likely to work, and this panel here shows, are they likely to drop out because they don't like the drug? And we provided this explanation of why the system made these particular recommendations. So here it's because the patient has diabetes or high blood pressure. Another form of "why" that we looked at was rules, right? So you notice similar things, like QT prolongation is also over here. Here it's listed just as a score, but it's very glanceable. Here it's listed as part of a reasoning, right?
So these are ways that we can, at the point of care, provide some information to the clinicians about what the system is trying to do, and hopefully that helps the clinician make a better decision. Well, what did we find? We found that the clinicians have a very hard time catching errors here. Since this was Wizard of Oz and we were making it up, we could create situations where the recommendations were bad and check: can the clinicians notice that the recommendations are bad? And the answer was, not really, only sometimes, right? And if we provided an explanation that was harder to read, they paid more attention and they did catch errors more often, but still they did worse with the system than without the system. So if we had presented none of this over here, if we had only presented the information about the patient, then they would have done better. And the second thing that we noticed, in follow-up work that we've been doing over the last spring and summer, is that part of this may be coming from the fact that clinicians at the point of care expect the system to be accurate. And this is not a crazy assumption. They're like, well, we wouldn't have been given the system by the hospital if the hospital administrators and staff had not vetted the system to be fairly accurate, right? We're super busy at the point of care, and we expect the system to work well. So maybe going forward, with more experience with AI systems, clinicians will change their view on the accuracy, right? They'll learn that the systems are not perfect. But they made a very fair point: they do expect the systems to be fairly accurate. Which turns us to the question of how, instead of checking at a particular point of decision, we might get a human to vet the overall quality of a system, right?
Because the idea is that somebody, before the system is rolled out, which is actually the very sensible thing to be doing, needs to check: is it good enough, or when is it good enough? Somehow that determination needs to be made. So one way we can do this is to provide a bunch of examples, right? A priori, not at the point of care, and ask panels of experts whether they agree, and we've done that. And I'm also showing you this to emphasize it: in those initial slides at the very beginning, I was just showing you numbers, right? Like, hey, our scores are better. But we do the homework, right? We follow up and we do all these sorts of validation checks to see if these things are reasonable. So we can ask experts, but that raises the question of when you have checked enough examples, right? If the experts have checked 30 examples, is that enough? Do they have to check a hundred examples out of your system? Is that enough? Like, when is it enough? And that's a tough question. So I'm gonna very briefly do a little detour. This is also how our research went. We started here and then we said, how do you know if you've checked enough examples? And so we did a bit of a detour in our own research on how to summarize a treatment policy: basically, how many examples do you need to show, and can we make those examples the right examples? So if we give these 10 examples, is it the case that they are better at giving a clinician a sense of what the policy is than these other 10 examples, right? That's the question that we asked. And the way we tested it is that we would show a partial policy on one side. So we're showing, with arrows, what action would be taken in some number of parts of this grid.
And then for other parts of the grid, we're asking the person to fill in what would happen, to test whether, based on a couple of examples, they're able to extrapolate to other cases. And we did a similar thing with an HIV simulator where we said, here are the patient stats, here's the treatment, what happens in a new situation? And what we found is that humans use different methods of extrapolating in different scenarios. Sometimes they act like an imitation learner. Sometimes they do something more goal-directed, so that's IRL. Other times it's really hard to tell; that's the non-specific category. And so what this made us realize is that maybe examples are not the best way to validate. So then we get to validation by rules, or presenting the system by rules. And this is the last thing that I'm gonna mention, and then hopefully leave plenty of time for some questions. So, work that we've done in this space. What we realized is that if we want a clinician to vet the system, yes, we could try to do it by examples, but it's tricky to make sure that we have covered all the bases, especially because a clinician might look at certain cases and say, I feel good, but they might be extrapolating in a way that doesn't actually match how the system is going to work in cases that they haven't seen. So they might be feeling really confident when they shouldn't be. So the alternative would be: is there a way to somehow show them everything, right? Or nearly everything. And so that is the question of, can we summarize a system with rules or something compact that they can look at? So here are two examples. We've done some work in terms of distilling complicated black-box models into much simpler structures like decision trees, and jointly learning the decision tree and the black-box model together.
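As an illustration of the distillation idea, here is plain post-hoc distillation, a simpler cousin of the joint training just described; the synthetic data set and the model choices are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic classification task stands in for the clinical data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Black-box model (a small neural net here).
black_box = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                          random_state=0).fit(X_train, y_train)

# Distillation: fit a shallow tree to the black box's *predictions*, not the
# raw labels, so the tree mimics the model we want to explain.
teacher_labels = black_box.predict(X_train)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, teacher_labels)

acc_bb   = black_box.score(X_test, y_test)       # black-box accuracy
acc_tree = tree.score(X_test, y_test)            # distilled-tree accuracy
fidelity = (tree.predict(X_test) == black_box.predict(X_test)).mean()
```

The point from the talk is visible in this kind of experiment: when the domain is noisy, the shallow tree's accuracy tends to land close to the black box's, so the interpretability comes at little cost in quality.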
And what we find, and this really depends on the domain, is that in many clinical domains there's a lot of noise, right? So it isn't that the complicated models aren't giving us some edge, but the edge is not that much. And so if we can make them look like decision trees, we actually don't lose very much in terms of quality. So that's one thing that we've observed. Another thing that we've done is try to build models that are small by making them relevant. So in very recent work at AISTATS this year, we tried to learn a POMDP model, the very first model that I showed you at the beginning of this talk, where we had states, we had actions going in, we had observations coming out. This is as classic as it gets. And we trained it with five discrete states, that's all. But we trained it such that when you optimize that model, you get high-quality policies for this hypotension management problem. And what we find is that if you just try to train a POMDP with five states to explain the data, it doesn't crash and burn exactly, but it doesn't do very well; that's over here in terms of the quality. But if you learn a model that explains the data and also yields a policy that will perform well, you can get policy values that are higher, while the model can still be inspected, right? And that's the key point that I wanna make. The thing that is so exciting about having a tiny model that still performs quite well is that now I can show you the tiny model, right? It's a little more complicated, but remember, we're not gonna be showing this to a clinician at the bedside with a patient. We're gonna be showing this to an expert collaborator who's gonna help us vet that system. And we can say, here are the different action choices.
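To see why a five-state model is so easy to both optimize and inspect, here is a sketch with a hand-built tabular model standing in for the learned one. This is value iteration on a fully observed 5-state MDP, a deliberate simplification of the partially observed case, and the states, actions, and rewards are made up:

```python
import numpy as np

# A hand-built 5-state, 2-action MDP standing in for a learned tiny model
# (states might be "stable", "mildly hypotensive", ...; actions "wait"/"treat").
n_s, n_a, gamma = 5, 2, 0.95
rng = np.random.default_rng(2)
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a] = next-state dist
R = rng.normal(size=(n_s, n_a))                   # R[s, a] = expected reward

def value_iteration(T, R, gamma, tol=1e-8):
    """Optimize the tiny model exactly; feasible precisely because it is tiny."""
    V = np.zeros(n_s)
    while True:
        Q = R + gamma * T @ V          # Q[s, a], batched over (state, action)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V, policy = value_iteration(T, R, gamma)
# With only 5 states, the whole policy is a length-5 vector: small enough to
# print on one line and hand to a clinical collaborator for inspection.
```

With thousands of continuous state dimensions none of this is possible; with five discrete states, the transition matrices, the values, and the policy are all small objects a collaborator can actually read.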
We can see the distributions of states and how the action choices are distributed amongst the different states. And people can look at this and see whether they make sense. So that's the summary, right? As I said, my goal in this talk was to give a really high-level presentation of the menu of different options that you have when you're trying to do validation. You can do some basic sensitivity and robustness checks. You can do off-policy evaluation. As you do these things, it's really important to recognize the limitations. And when you move over to human-focused methods, well, humans are limited too. So certain versions of this vision, like, we'll just provide the right explanation and the clinician will make the right decision for the patient, may not actually be accurate, because again, clinicians are busy, and there needs to be the right way to present this information. Presenting a global summary for someone to vet also has its nuances. So none of these methods are perfect, but if all the lines of evidence point in a positive direction, or at least a do-no-harm, safe direction, then maybe we've finally found a policy that came out of our AI system, that we've validated in all the ways that we can with batch data, and that is ready to move on to future steps, like silent prospective trials or clinical trials. So with that: RL in health has a lot of potential. I really believe that careful validation is really important. And if you plan with validation in mind, there are a lot of really exciting technical questions. Our work over the past several years would have taken a totally different direction if we hadn't noticed that bug in 2017, and I'm very excited about all the really hard technical problems that we've had to solve since we discovered this very real-world issue. So I'll leave you with two quotes that I really like. With great power comes great responsibility: we have to be careful.
But perfect is also the enemy of the good. We're never gonna have perfect policies coming out of our models, and what we need are ways to get them out into the world slowly, carefully, responsibly managing risk as we go. So I'll leave it at that and take some questions. Thank you very much, Finale. That was a great talk. Thanks a lot. Now we have some time for questions. Let me see. Is there any question from the network? If so, please raise your hand. Not yet. So then I move to the Slido questions, and one was just posted. I'll read it out. How can one effectively translate interpretable models into medical practice? We struggle even with simple decision trees with respect to survival analysis. That's the question. Yeah, so I think this is the big question: who needs to vet the model? A practicing clinician probably doesn't have the bandwidth, and never needed to have the bandwidth, to understand what even our simple models mean. What we've found is that there are specific collaborators, people who can listen to us talk about machine learning terminology and kind of follow along, just like I can listen to them talk about medical terminology, who are ready to follow along and learn from us. And it's a relationship that we build up with these people over multiple years. So those are the folks who are at the front lines of this validation, right? They're the ones who can look at these decision trees or survival models and say, hey, this is kooky, there's no mechanism that would make sense for this, or, this is sensible due to this mechanism. And then they can be the people who help translate that to the next level of practice. So I think you really need those allies and colleagues, just like we're their allies and colleagues from the machine learning world. Thank you.
There is a question by Damian Rokairo from the network. Please, Damian. Hi, thank you very much for your talk. My question is also related to the validation aspect, and maybe asking about your experience, particularly in the US. Do you know of examples where these systems have been implemented in a clinic, at the point of a clinical trial? And the second part is, how well do they generalize? So if you train on data from one particular hospital, do they do well in other cities? Those are both great questions. So the path, especially in the US, to getting these things out into the world is slow. Some examples I can give: at Duke, there have been a number of experiments with early warning systems, and the same with Johns Hopkins. Intermountain Health in Utah has done some of these as well, and NYU Langone. So there are places that are definitely exploring, not necessarily with RL, but at least machine learning systems being integrated into the clinical workflow. And I think it's slowly happening as people figure out how to work with all of the IT infrastructure and all of that. So in the short or medium term, I'm hoping that as machine learning systems in general become more integrated and people know how to use them, RL systems will fit in. But right now, I know of ML systems, but not RL systems, that are out there being used in practice. Control systems, maybe, but that's, I think, a separate category. Things like insulin pumps maybe count, I don't know if you count that as RL, but maybe that's another example. So that's definitely a challenge, and the systems are slowly coming around. What was your second question, sorry? It was about the generalization. Right, so we've looked at that.
And actually we have a commentary and a work in submission about how things don't necessarily generalize very well across hospitals, and a major factor, or at least one major factor that we've looked at, is that different hospitals have different protocols for when measurements are taken. So a hospital that's very well resourced might take certain measurements as a matter of course, we always get this measurement for every patient who comes in, whereas another hospital might say, we only collect that lab if it looks like we really need that information. So I think this opens up a really big and important question. A lot of the machine learning community wants models that are fully general, that will work everywhere. Certainly industry wants models that are gonna work everywhere, because then you can train a model once, put it inside your EHR, and sell it everywhere. But I think realistically that's just not how things are gonna pan out. My vision, or rather our goal as researchers, is to develop models that can be quickly adjusted to a new setting. So think of it as a transfer learning problem, because this is how a clinician would respond too, right? If you teleport a doctor who is used to treating in a very well-resourced area and put them in a lower-resource setting, or a setting with different protocols, they're going to adapt, right? They're not gonna use the same policy in the second hospital that they used in the first. So creating systems that can adapt to a new setting is, I think, both a very important practical problem and an exciting technical problem. Thank you. And thank you from all of us. We send a round of virtual applause to you. Thank you for an excellent talk. Thank you for having me. We enjoyed this very much. Thank you, Finale. Thank you.