So now it's 4 p.m., time to start the final talk in our session today. And it's my great pleasure to welcome David Sontag here from MIT. David, despite his rather young age, has been working on machine learning in medicine longer than many others in the field. He's a real leader in this domain. To give you some facts about his career: in 2010, he did his PhD in electrical engineering and computer science at MIT and won an award for this PhD. He then joined Microsoft Research in New England as a postdoc, became an assistant professor at NYU, and then moved back as a tenured professor to MIT, founding the Clinical Machine Learning Group there in the Computer Science and Artificial Intelligence Laboratory. He also won an NSF CAREER award, and he's the chief health strategist and principal scientist at ASAP. We are very happy to have him here to deliver the last talk in our session. David, we are very much looking forward to your talk. Well, thank you so much for the invitation, and I wish I could have been there in person, as I imagine all of you do. It was also a pleasure to listen to the last talk by Dana Baer, and I hope to live up to her high standard. The work I'll tell you about today is joint work with most of my lab: my PhD students Michael Oberst, Hussein Mozannar, and Christina Ji; two recent master's students in my group, Soorajnath Boominathan and Helen Zhou; and our clinical collaborator Sanjat Kanjilal, who is an infectious disease clinician at Mass General Brigham in Boston. As a computer scientist, it's a really exciting time to be working in the health care field, because we suddenly have lots and lots of data collected by electronic medical records. That data makes it possible to use machine learning algorithms, but also, really importantly, it makes it possible to start to deploy those machine learning algorithms by building on data that's available at the point of care.
That data takes a variety of forms, which we'll be talking about in today's talk, everything from unstructured free-text notes to imaging data, genomics, vital signs, and so on. Machine learning can serve a number of different downstream use cases with this clinical data. For example, some of the quickest advances we expect to see are going to be in areas like pathology and imaging, where recent advances in deep neural networks and computer vision immediately translate into decision support tools for clinicians and nurses, helping them both bring health care closer to patients and reduce the cost of health care. In work that my lab and many others have been doing over the last few years, we've shown how one can take data that is captured just as part of the practice of medicine, such as longitudinal health insurance claims data, and change the way that we think about risk stratification, for example, identifying individuals who are likely to have undiagnosed type 2 diabetes. Rather than having it be something that's done using a form, like I'm showing here in the bottom left corner, it's something that can be done behind the scenes, using that readily available data, for millions of individuals, by using machine learning algorithms, such as in our recent AAAI paper, which shows how one can use transformer architectures on this longitudinal claims data to get much better predictive performance. Other use cases of machine learning in health care are more on the administrative side, really improving workflows. For example, in my lab recently we've been working on changing the way that documentation in electronic medical records is done. Clinicians will continue to type, but now we provide what we call contextual autocomplete: for clinical concepts like symptoms, diseases, and treatments, we can automatically suggest what the completion might be.
As a result, we both keep the original workflow, letting clinicians type in free text, which is much easier and faster for them, and also get some structured data that's really valuable for downstream use cases, such as comparative effectiveness studies. And finally, I want to talk about what will be the focus of today's talk, which is taking all of that longitudinal data and looking at records of how clinicians have actually treated patients. For example, in this little sketch on the bottom, we see that this physician, after talking to a patient and reviewing their medical history, selected drug B for treating this patient at this point in time. This provides a lens which allows us to think about how one could improve treatment strategies, by understanding both what clinicians are currently doing today and what outcomes result from those actions. It also gives us an opportunity to try to help standardize care and improve best practices by identifying treatment variation both within sites and across sites. So that will be the focus of today's talk. And we'll use a running example throughout the talk coming from our recent work on empiric antibiotic prescribing. This work looks at the condition of urinary tract infections, which are extremely common: it's estimated that one in two women is affected in her lifetime, and it's the third most common cause for antibiotic treatment. The way that treatment for urinary tract infections typically works is that a woman will come in to her doctor's office with symptoms that are indicative of a UTI, or urinary tract infection. Often a sample is taken. That sample is sent to a laboratory. And over a period of time, which could be as long as a week, the sample is cultured.
A number of different antibiotics are tested against that culture in order to assess what's known as the antibiotic susceptibility profile, which tells you, for each of these different antibiotics, which of them would be successful in resolving the underlying infection. The reason is that many, many bacteria today are resistant to antibiotics, and so treating this condition can be very difficult, because one has to find an antibiotic to which this particular bacterium is not resistant. That said, you typically can't wait a full week to get the results of the antibiotic susceptibility profile back. And so the question which faces clinicians is what's called empiric antibiotic prescribing. At the point of care, even before you get those antibiotic susceptibility profiles back, clinicians typically prescribe a first antibiotic. And then, if they see that it's not working, or when the antibiotic susceptibility profile comes back, they might switch to a different one. So the question that we're asking here is: could we help guide that decision for what that empiric antibiotic prescription should be, in order to help choose the antibiotic which is least likely to run into resistance? Just going back one slide, here I'm showing four different antibiotics. The first two, which I'll denote as NIT and SXT, are what we consider first-line antibiotics. The second two, CIP (Cipro) and LVX, are commonly used as second-line antibiotics, which means that they are much broader spectrum; there tend to be many fewer bacteria that are currently resistant to them. If at all possible, we'd like to reserve those second-line antibiotics and not use them, because the more those antibiotics are used, the more resistance is likely to grow, and they wouldn't be as useful as a backup or reserve. And so here I'm showing you the results of that profile.
Again, the two quantities we'll be working with throughout today's talk are what I call resistant and susceptible. Resistant is bad, susceptible is good: we'd like to find an antibiotic that the bacteria are susceptible to. OK, so in the rest of today's talk, I'll go through the following. First, I'm going to give an overview of how to learn and evaluate treatment policies learned using electronic medical records. This is really meant more as a tutorial, helping us think through what best practices are in doing this and what type of data is helpful. Second, I'm going to give an application to empiric antibiotic treatment. We'll return to many of those themes that were raised in the first part in that specific case study I just talked about. And finally, I'll return to a broader question: in many clinical scenarios, one isn't interested in just treating a condition at one point in time, but rather there's a series of treatment decisions that need to be made, which is known as sequential decision making. We'll talk about how one should be thinking about that as well, from the perspective of learning from electronic medical records. The data that's commonly used in clinical scenarios for machine learning in health care includes structured data that's commonly captured, such as problem lists, past billing codes or diagnosis codes from previous visits, medications both from the past medical history and, if a patient is hospitalized, the medications that are being given or procedures that have been performed during that visit, and laboratory tests. In many cases, we'd also be interested in using the unstructured data, such as clinical text. In the first part of my career in this space, many of my papers were about the value of using data in clinical text. Clinical text tends to provide much richer context around why a patient is coming to the clinic today and their past medical history.
But at the same time, using that for both research and deployment tends to be a lot harder, because in many commercial electronic medical record systems, getting access to that data appropriately, de-identifying it for research, and then building algorithms that can be deployed using that clinical text all tend to be very challenging. In the bottom here, I'm just giving you a visualization to get a sense of the data I was just referring to. On the left is an example of a clinic visit, where we might, for example, have the patient's blood pressure recorded, we might get a billing code for hypertension, and we might observe a medication being prescribed. And there might be some notes written as a result of that visit. During a hospitalization, you tend to have all the same data and more: lots more clinical notes, often repeated vital signs if patients are on beds with continuous vital sign monitors, and just a much greater density of data. Now, the machine learning philosophy is a little bit different from the way that clinical researchers and biostatisticians have usually thought about how to use this data for some of those risk stratification type problems I mentioned earlier. Traditionally, one would derive a small number of features, which are both clinically motivated, perhaps with a mechanistic understanding, and use those to carefully derive a very simple risk score for a problem. The machine learning philosophy, on the other hand, is a little bit different. We say: let's throw everything in, throw in the kitchen sink.
On the one hand, this seems really promising because it gives us the opportunity to discover new predictive signals. Even if some of the factors that you might like to have available are not recorded in the data, it also gives you an opportunity to find surrogates for those factors, so that one can do machine learning using the data that's actually readily available rather than data which was carefully derived for the purpose. And the third reason why this tends to be a very powerful option is because it allows one to then more easily deploy on existing data, rather than having to derive data for input into machine learning models. But it also comes with lots of challenges, and that's going to be a major focus of my talk today. In particular, the moment you start to throw in huge amounts of data without a very good understanding of the data provenance, its noise processes, and so on, it tends to make generalization much more difficult, even within a single institution. For example, using data like clinical text, we find that the language used from one department to another might be very different. Using data like laboratory tests might seem very straightforward, since the results are often standardized, but the processes underlying the creation of the data tend to introduce a lot of bias. For example, different institutions have different guidelines for when to perform certain tests, and your machine learning algorithms tend to pick up on that information, which doesn't always generalize from one institution to another. And finally, when one uses an approach like this, you have to really worry about whether you're appropriately deriving the outcome, and that's going to be, again, a major focus of today's talk.
So in that empiric antibiotic prescribing application, the data that we used were patient demographics, comorbidities derived from past diagnosis codes, laboratory tests, the location where the patient is coming into the clinic to get their empiric antibiotic prescription, and past antibiotic usage and past resistance when available; that last one is an example of data which is often missing. We also derived a number of population-level statistics, such as total antibiotic usage over a period of time and colonization pressure. To use these machine learning algorithms, we need the input, which is what I'm showing on the left here as features derived from the input window. We need to know when the prediction time is; that's what I'm calling the index date here. For this problem, the prediction time is going to be the same time when a clinician would need to prescribe an antibiotic, our idea being that we're going to be providing decision support at that specific moment; it could be, for example, a pop-up where we give a suggestion. And so we only assume that we have access to data prior to that point in time. And then there's the outcome, which is going to guide our treatment policy; in this case, the outcome is going to be the result of the antibiotic susceptibility profile. And we can't cut corners here. This is where I find that most of the time in the machine learning lifecycle is spent: making sure that we've aligned our prediction task very well with the actual clinical use case, and, secondly, that we derive the outcome cleanly. If you get the outcome wrong, then the policy you learn as a result is going to be wrong. So there are three scenarios that I want us to be thinking about for how to learn treatment policies. The first scenario is a very optimistic one: there's just a single treatment decision that we need to make. For example, we give a woman an antibiotic at one point in time, just one point in time.
And we have fully observed outcomes. What I mean by fully observed outcomes is this: there are a number of different treatment decisions, and in our training data, we have access to what would have happened under each one of those different treatment decisions. As you'll see in my next bullet, that's a very uncommon scenario to be in. But for this antibiotic susceptibility problem, we actually are in that scenario, or close to it, because the susceptibility profile I referred to, although it's not available at the point of care, is available later on for machine learning. So we do know, for each of the different antibiotics, how it would have done at resolving the underlying infection. Now, if we're in that scenario, then to learn a treatment policy, one could simply do a regression. One takes as input the data that you have prior to the index date, so all of the data you might derive from the electronic medical record. You also take as input what I'm showing in red here, the treatment decision. You could represent that, for example, as a one-hot vector, which is an indicator of which treatment decision you're considering. And the goal of the machine learning model, this blue box I'm showing here, is to predict the outcome. In this case, the outcome is: is the bacteria causing this particular patient's infection susceptible or resistant to the treatment T? One can then take the data that I mentioned, learn this machine learning model, and use predictions from this model to guide the treatment policy. For example, one might say: for this patient, we're going to predict the outcome for each of the four different possible treatments. Then we're going to choose only among the treatments which are predicted to be susceptible, and of that set we're going to choose the lowest-spectrum antibiotic.
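To make that recipe concrete, here is a minimal Python sketch of turning per-antibiotic outcome predictions into a treatment policy. The antibiotic ordering, the probabilities, and the 0.5 threshold are illustrative assumptions on my part, not values from the talk; in practice the probabilities would come from the learned outcome model f(x, t).

```python
# Sketch: turn per-antibiotic susceptibility predictions into a policy.
# Antibiotics listed from narrowest to broadest spectrum (first-line first).
ANTIBIOTICS = ["NIT", "SXT", "CIP", "LVX"]

def choose_antibiotic(p_susceptible, threshold=0.5):
    """Pick the narrowest-spectrum antibiotic predicted to be susceptible.

    p_susceptible: dict mapping antibiotic -> predicted probability that the
    infecting organism is susceptible (the output of the outcome model).
    Falls back to the highest-probability antibiotic if none clears the
    threshold.
    """
    for abx in ANTIBIOTICS:  # narrowest spectrum considered first
        if p_susceptible[abx] >= threshold:
            return abx
    return max(ANTIBIOTICS, key=lambda a: p_susceptible[a])

# Example: NIT looks resistant, SXT looks susceptible, so the policy prefers
# SXT over the broad-spectrum CIP/LVX even though they score higher.
preds = {"NIT": 0.30, "SXT": 0.80, "CIP": 0.95, "LVX": 0.93}
print(choose_antibiotic(preds))  # SXT
```

The key design choice is that the preference for narrow-spectrum antibiotics is encoded in the iteration order, not in the model itself.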
We want to prefer, if possible, not to use the broad-spectrum antibiotics, to which bacteria are almost always susceptible, but rather a lower-spectrum antibiotic. So that would be an example of how to derive a treatment policy from a machine learning algorithm in this scenario of a single treatment and fully observed outcomes for training. Now, the second scenario is a much more typical one. Here, again, there is a single treatment, but we only observe the actual outcome. For example, if the clinician prescribed nitrofurantoin as the antibiotic, we would get to observe, after waiting some period of time to see if the infection actually resolves for the patient or not, whether there was resistance or susceptibility. But we wouldn't get to observe any of those values shown in gray here, because those treatments weren't tried for this patient. This is the more typical scenario. And the reason why I'm breaking it up in this way is because the algorithm that you can use in this scenario is actually the exact same algorithm I showed you a second ago. Your training data, instead of having four data points for every feature vector, one for each of the four different treatments, has only a single data point, but you can still learn the same machine learning model in that setting. What gets much more challenging is what you do for evaluation, because there could be bias in who in your training data got which treatment. Evaluation in this setting requires techniques from causal inference; in particular, one can use what's known as importance reweighting in order to reweight the data points so that you're evaluating the policy with respect to the correct distribution. I'm not going to go into more details of that in today's talk. The third scenario, where again one only observes actual outcomes, is the sequential decision-making setting.
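For readers unfamiliar with importance reweighting, here is a minimal sketch of the idea (the data layout and the assumption of known propensities are mine; in practice propensities would themselves be estimated from the logged data).

```python
# Sketch of off-policy evaluation via importance (inverse-propensity)
# reweighting: estimate a new policy's average reward from data logged
# under clinicians' behavior.

def ipw_value(records, policy):
    """records: list of (x, action_taken, reward_observed, propensity), where
    propensity is the probability the logging clinicians chose that action
    for x. Only data points where the policy agrees with the logged action
    contribute, each reweighted by 1/propensity."""
    total = 0.0
    for x, action, reward, propensity in records:
        if policy(x) == action:
            total += reward / propensity
    return total / len(records)

# Toy example with two actions and a policy that always picks "a":
logged = [
    ("patient1", "a", 1.0, 0.5),
    ("patient2", "b", 0.0, 0.5),
    ("patient3", "a", 1.0, 0.25),
]
print(ipw_value(logged, lambda x: "a"))  # (1/0.5 + 0 + 1/0.25) / 3 = 2.0
```

The reweighting corrects for the bias the talk mentions: patients who were more likely to receive an action under clinicians' practice are down-weighted, so the estimate reflects the correct distribution.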
That is the scenario where there's been a lot of work in recent years on using reinforcement learning techniques to try to learn treatment policies, and we'll return to it toward the end of today's talk. So we talked about data, we talked about how to learn, and next I want to talk about evaluation and deployment. A common way to think about evaluation is to compare the learned treatment policy to what clinicians actually do. I'll go into this in more detail later, but there are often a number of different criteria that one cares about. For example, in the antibiotic prescribing setting we care about which type of antibiotic is prescribed: is it a broad-spectrum, second-line antibiotic or a narrow-spectrum, first-line antibiotic? We also care about, which is shown on the x-axis here, whether it is a successful prescription; specifically, will it resolve the infection or not? If it doesn't resolve the infection, we call that ineffective therapy. And so your goal would be to be where that star is: you'd like to use as few second-line antibiotics as possible, and you'd like to have as few ineffective antibiotics prescribed as possible. Now, it's not necessarily possible to get there, and you can look at where clinical practice is today, which I'm showing in this caricature with this black dot. For clinicians today, using your data you can compute what fraction of the time they give second-line antibiotics and what fraction of the time the infections are actually resolved. And then you can look to see what your machine learning algorithms would be able to achieve. Here we don't have a single criterion but two criteria, and so one can draw a curve and think about quantifying, for each of the possible places you'd like to be on the curve, what's achievable by a machine learning policy.
Now, what's really important in this is that we have an apples-to-apples comparison, because the types of statements that you want to make are, for example, that this black dot is much higher than the red line. The challenge here is that when you're making these comparisons using retrospective data, a clinician might not always have access to the same treatment strategies that your algorithm thinks it has access to. For example, suppose that allergies to medications are not recorded anywhere in the electronic medical record, or are recorded very noisily, but suppose that clinicians always have a verbal communication of these allergies from the patient. They talk to the patient: have you ever had this antibiotic in the past, and have you ever had a reaction to it? And the patient might tell the doctor: yes, I had an adverse reaction to this antibiotic two months ago when I was last given it. In that case the clinician wouldn't prescribe that antibiotic; it wasn't a feasible treatment option. On the other hand, if that data was not recorded in your electronic medical record, which could have happened, for example, if the patient had been to a different hospital system, then your algorithm might think that antibiotic is a good choice to give, because the antibiotic susceptibility profile looks pretty good: it looks like that antibiotic does resolve the infection, even though it also led to a different adverse outcome. So your algorithm might prescribe it, and your algorithm might look like it gets a win, even though the clinician would never have been able to consider that option, and the antibiotic the clinician actually prescribed might have led to resistance. We're going to return to this later today, but this I view as perhaps the biggest take-home lesson from today's talk: our community needs to be really careful about this point when doing retrospective evaluations of treatment policies.
So finally, deployment. Machine learning, as I mentioned already, can be really challenging to deploy in commercial electronic medical records. For example, in the United States one of the major vendors is Epic, and a few years ago they launched a system where one can input a machine learning model using a markup language called PMML, the Predictive Model Markup Language. That allows one to use retrospective data to train a model, import that model into the electronic medical record, and then have it trigger a number of downstream decision supports. But those models, for example, only take as input a very limited number of features; they don't take as input any clinical notes, and so it can be very limiting to build models that have to be deployed in that way. The second question we often think about is: what would that intervention look like? This is really about human-AI interaction, or in this case, clinician-AI interaction. My lab has been spending a lot of time in the last couple of years working very closely with other researchers in human-computer interaction to think about these questions much more deeply. In today's talk I'll give you an example of one scenario we've been thinking about, which is when one should make a recommendation versus not making a recommendation, recognizing that if your machine learning algorithm is making recommendations too often, it could be a distraction. It could lead to what's called alert fatigue, and maybe we should just be popping up alerts when we think it's really important. That's one example of the interfaces we'll get into in more detail later. So next we'll dive into this application, and I'll speak about each of those earlier points in this context. The algorithm we're going to use to learn a treatment policy in this single-time-step, fully-observed-outcome scenario is going to be a little bit different from what I alluded to earlier, although there will be a connection point.
Here we're going to directly learn a policy. We're going to learn a model pi, where that model takes as input the patient features x and has to output a treatment decision, which in this case is going to be one of these four different antibiotics. We're also going to assume that in our data we have access to a reward vector for every data point x. So our data consists of these vectors x and rewards, where the reward vector is of the same length as the number of actions, and it tells us how good each one of those actions actually was. For example, if all we cared about was whether an antibiotic was resistant or susceptible, then we would set the rewards for these four antibiotics in the scenario shown on the left as follows: we would set the first one to zero and the last three antibiotics to one, indicating that a higher reward, a reward of one, is a good thing. If your policy outputs any of the three antibiotics other than the first one, it gets the higher reward, and so that policy would be scored highly. Once we've defined this reward function, and we suppose that we have data consisting of features and rewards, we're left with the problem of how to actually learn that treatment policy pi. Mathematically, what we'd like to do is search over treatment policies pi for the one with the highest average reward over the training data. Already here, by the way, you should be asking yourself: does this make sense? Do we care about the highest average reward? Maybe what we should have said is that the 90th percentile of the reward should be high, rather than the average. So this is already something where you should be checking your assumptions. Here we'll go with the average reward, and even the average reward comes with computational challenges: we have to think about how we're actually going to optimize this problem. Well, let's rewrite the policy pi, as is typically done in machine learning, in the following way.
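The objective just described, the average reward of a candidate policy over fully observed reward vectors, can be written down in a few lines. This is a toy sketch; the action names and the single-patient data set are illustrative.

```python
# Sketch: scoring a candidate policy by its average reward when the full
# reward vector (one reward per action) is observed for every patient.
ACTIONS = ["NIT", "SXT", "CIP", "LVX"]  # illustrative action set

def average_reward(data, policy):
    """data: list of (x, reward_vector); reward_vector has one entry
    per action in ACTIONS. Returns the policy's mean achieved reward."""
    return sum(r[ACTIONS.index(policy(x))] for x, r in data) / len(data)

# Reward vector from the slide: first antibiotic resistant (reward 0),
# the other three susceptible (reward 1).
data = [("patient1", [0, 1, 1, 1])]
print(average_reward(data, lambda x: "NIT"))  # 0.0
print(average_reward(data, lambda x: "SXT"))  # 1.0
```

Policy learning then amounts to searching over policies for the one maximizing this quantity, which is the hard optimization problem the talk turns to next.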
We're going to suppose that for every action, which I'll denote as little a, we have a different function f_a, which tells us, intuitively, how good that action is predicted to be. Our policy pi is then defined as follows: given each of these functions f, we predict the action whose score is highest. We can now plug that in, and we have a clear-cut optimization problem: we want to maximize, with respect to these functions f_a, this expected reward. And you can see this looks a little bit like minimizing the zero-one error in classification, which is typically a very hard optimization problem. As an aside, this does connect very nicely to what I mentioned earlier when I was showing you that blue box, taking as input your features and the human's treatment decision T to predict an outcome Y. The only difference from that scenario is that this can be viewed as cost-sensitive classification, because every output here, that is, each choice of action, has a different cost associated with it; the cost is what I'm calling the reward. There's a whole literature in the machine learning community on how to efficiently solve cost-sensitive classification problems, which one could directly apply to the scenario I'm showing you here. However, those algorithms actually aren't very good. In recent work from my lab, which appeared in a KDD paper and an ICML paper, we gave a new formulation for this cost-sensitive classification problem, equivalently the expected-reward maximization problem (the two are identical), which shows that one can reduce that difficult optimization problem to a simpler one which is convex any time your functions f are given by linear models.
And if you solve this simpler optimization problem, one can actually prove that, in the limit of enough data and rich enough function families f, it will recover what's known as the Bayes-optimal policy. So the solution to this second optimization problem is actually as good as a solution to the first, despite being computationally much easier to find, under a few assumptions. And that's what we're going to use for the experiments. Okay, so I already talked about what the reward function might be. The reward function is going to be a little bit more complex than just looking at whether the antibiotic succeeds at resolving the infection, which is captured by the IAT rate. In our case, we're going to want to look at two different criteria: that first criterion I just mentioned, which I'm going to denote now as Y, and a second criterion, which is whether it is a broad-spectrum antibiotic, that is, a first-line or second-line antibiotic. So now I'm going to suppose that the training data consists of these two quantities Y and C instead of the reward R. That's going to allow me to talk about a family of reward functions, which I'm going to show on the right-hand side, parameterized by this parameter omega. As omega increases, it's going to prefer the first criterion, which looks at whether the antibiotic is successful at resolving the infection. As omega decreases, it prioritizes more the second criterion: are you using a first- or second-line antibiotic? And just to be very clear: in your training data, C is always exactly the same vector, 0011, because the first two antibiotics are always first-line and the second two antibiotics are always second-line.
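One plausible parameterization of the reward family described here is a convex combination of the two criteria; the exact functional form used in the paper may differ, so treat this as an illustration of the trade-off, not the paper's formula.

```python
# Illustrative omega-parameterized reward family: omega in [0, 1] weights
# treatment success y_a (1 = susceptible) against the second-line penalty
# c_a (1 = second-line antibiotic). Order: NIT, SXT, CIP, LVX.

def reward(y, c, omega):
    """Per-action reward: omega rewards success, (1 - omega) penalizes
    second-line antibiotic use."""
    return [omega * y_a - (1 - omega) * c_a for y_a, c_a in zip(y, c)]

y = [0, 1, 1, 1]  # susceptibility outcomes for one patient
c = [0, 0, 1, 1]  # the fixed 0011 vector from the talk: second-line flags
print(reward(y, c, 1.0))  # only success matters: [0.0, 1.0, 1.0, 1.0]
print(reward(y, c, 0.5))  # balanced: [0.0, 0.5, 0.0, 0.0]
```

Sweeping omega from 0 to 1 and re-learning the policy at each value is what traces out the trade-off curves shown later in the talk.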
Okay, so now we've got a parameterized family of reward functions, and with it we're going to be able to try different values of omega and really draw those curves I showed you, where we understand all possible trade-offs that are achievable by your machine learning algorithm. Next, I'll tell you about the data that we used. We created what we call the Boston Infectious Disease Cohort, which has data derived from Mass General Hospital and Brigham and Women's Hospital in Boston, from all patients who had antibiotic susceptibility profiles performed from January 1st, 2000 to the present: over 300,000 unique patients. However, although we could have, and did, run our machine learning algorithms on that full data set, it's not useful for thinking about the particular question of what a good policy is, because those patients had very different clinical scenarios, and so it wouldn't allow us to do an apples-to-apples comparison with the treatments that were actually available to clinicians at the point of care. To really start to do an apples-to-apples comparison, we went really narrow and focused on that specific urinary tract infection scenario, and moreover we focused even further on what are called uncomplicated urinary tract infections, which, for example, exclude women who have recently had surgery. The reason we focused on uncomplicated urinary tract infections is because in that scenario there are typically only four antibiotics that are prescribed, at least in these hospitals, and those are the four that we're considering. Usually all four of those are fair game to use, with one exception that I'll tell you about later. It also gives us a very concrete index date that we should be paying attention to, which is the time at which a woman with an uncomplicated urinary tract infection was prescribed an antibiotic for the first time.
So this allows us now to do an apples-to-apples comparison: we can talk about how good clinicians are compared to how good our policies are. We trained using data from 2007 to 2013, and we evaluated our models using test data drawn from 2014 to 2016. This is another important design choice: we didn't just use held-out data from the training window, we also took future data. The reason that's really important is because in health care we have lots of non-stationarity, meaning the distribution over the data changes across time. All of the data changes across time, but importantly, one of the biggest sources of non-stationarity here is actually in the outcomes themselves, because over the 15-year period across which we gathered the data, there was a lot of shift in antibiotic resistance. Some antibiotics had very low resistance rates in the earlier parts and very high resistance rates later on. And so what we wanted to be able to assess is how well our models generalize even under changes in underlying resistance patterns. Now, in addition to comparing to clinical practice, we also compared to clinical guidelines. The reason is that clinicians don't always follow the clinical guidelines, and it could be that existing clinical guidelines are much better than clinicians today. We would like to understand how our algorithms behave relative to clinical guidelines. If our algorithms just perform the same as clinical guidelines, then we wouldn't necessarily want to change practice using our machine learning algorithms; rather, the right policy change might be to put incentives and practices in place to increase adoption of the existing clinical guidelines.
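The temporal train/test split described above can be sketched as follows (the record layout is an assumption for illustration; the year ranges are the ones from the talk).

```python
# Sketch of a temporal train/test split: train on 2007-2013, evaluate on
# 2014-2016, so the test set reflects later, possibly shifted, resistance
# patterns rather than a random hold-out from the same window.

def temporal_split(records,
                   train_years=range(2007, 2014),
                   test_years=range(2014, 2017)):
    """records: list of (year, example). Returns (train, test) examples;
    anything outside both windows is dropped."""
    train = [ex for year, ex in records if year in train_years]
    test = [ex for year, ex in records if year in test_years]
    return train, test

records = [(2008, "a"), (2013, "b"), (2015, "c"), (2016, "d"), (2005, "e")]
train, test = temporal_split(records)
print(train, test)  # ['a', 'b'] ['c', 'd']
```

The point of the design is exactly the non-stationarity argument: a random hold-out would leak the test period's resistance patterns into training.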
So in this case, the clinical guideline for uncomplicated urinary tract infection in our population, where the resistance rate to the antibiotic SXT always exceeds 20%, actually simplifies to a really simple policy, shown on the right-hand side, where one looks at whether the patient had recent resistance or exposure to the first-line antibiotic NIT in the past 90 days. If the answer is yes, always prescribe the second-line antibiotic ciprofloxacin; if the answer is no, prescribe that first-line antibiotic NIT. So this is the clinical guideline that we're going to compare to. Here's now that curve that I was showing you, now with real data. Clinicians are over here on the right-hand side: roughly almost 12% of women who are prescribed antibiotics are given an antibiotic which is ineffective, so it does not resolve the infection, and clinicians are giving a ton of second-line antibiotics, roughly 30 to 35%. Now, the first thing we compare to are those clinical guidelines, shown by this purple dot at the very bottom. And the first thing you should observe is that the clinical guidelines are indeed way better than existing clinical practice, and that should already start giving us pause. The clinical guidelines would, in this population, give almost no second-line antibiotics and would have inappropriate-therapy (IAT) rates of about 11%. Now, there are lots of reasons why clinicians might not want to adopt those clinical guidelines. One of them has to do with a risk-benefit trade-off: it could very well be that their risk threshold is a little bit different from the one the guidelines were designed around; maybe they don't want to risk as many patients getting ineffective antibiotics. And so the second thing we compare to is what we call adjusted guidelines, which are shown by these gray data points.
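The simplified guideline the speaker describes amounts to a one-line decision rule. A sketch, with my own function and flag names (NIT and CIP standing for nitrofurantoin and ciprofloxacin, as assumed from context):

```python
def guideline_policy(nit_resistance_or_exposure_past_90d: bool) -> str:
    """Simplified guideline from the talk: recent resistance or exposure to
    the first-line agent NIT means prescribe second-line CIP; otherwise NIT."""
    return "CIP" if nit_resistance_or_exposure_past_90d else "NIT"
```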
These adjusted guidelines take the existing guidelines and, in essence, mix in more second-line antibiotics. They allow you to change that risk threshold, and give us a way to compare what the performance of the clinical guidelines would be if you wanted to match the amount of second-line antibiotics. The next thing we compare to is work that was published recently in Nature Medicine by a group in Israel, which also looked at predicting antibiotic resistance in urinary tract infections. There were two big differences. The first big difference is that they weren't focused exclusively on uncomplicated urinary tract infections, and thus they weren't able to do the type of apples-to-apples comparisons that we're showing here. Secondly, their algorithms were quite a bit different. They had an unconstrained algorithm, which used a machine learning model to predict resistance and then selected an antibiotic to which the organism was predicted to be susceptible. As you might imagine, that would almost always choose a second-line antibiotic, so it gets really low IAT rates, but as you can see in this top-left corner, it gives a ton of second-line antibiotics. The second method they introduced is known as the constrained method, which attempts to match the second-line antibiotic usage of clinicians; on our population it also does much better than clinicians, but actually worse than the clinical guidelines. Our algorithms, which are learned using this direct policy approach, get the frontier shown in blue. The reason we have lots of points here is because, remember, we have this parameterized reward function with the parameter omega that we could shift to change the trade-off between giving ineffective antibiotics and giving second-line antibiotics. So we can now get to part of the punchline.
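The unconstrained baseline described above, predict resistance per antibiotic and then pick the least-resistant option, can be sketched as below; the probability values and function name are stand-ins:

```python
# Sketch of a predict-then-select policy: choose the antibiotic with the
# lowest predicted resistance probability. With no penalty on second-line
# agents, this tends to favor them, which is the behavior described above.

def unconstrained_policy(p_resistance: dict) -> str:
    """p_resistance maps antibiotic code -> predicted resistance probability."""
    return min(p_resistance, key=p_resistance.get)
```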
One choice of these parameters would get us a policy where 25% fewer women get ineffective antibiotics for the same amount of second-line usage, or we can imagine a different policy where we nearly eliminate second-line antibiotic usage. It's also interesting to look at what clinicians actually did compared to what our algorithms would recommend for existing patients. This is all shown on held-out data. We see that in the 1,323 cases where clinicians gave second-line antibiotics, and in the 1,245 of those cases where they were appropriate, meaning those second-line antibiotics actually did resolve the infection, our algorithm would almost always give a first-line antibiotic, and this light color means it was also almost always appropriate. So what we're seeing here is that we're getting a lot of the wins by switching from a second-line antibiotic to a first-line antibiotic. You can look at the bottom here to see what happened when clinicians gave first-line antibiotics, which occurred about 2,600 times. Most of the time those first-line antibiotics were appropriate, and indeed we also give first-line antibiotics, and they were also mostly appropriate. But some of the time, a bit over 15% of the time, the first-line antibiotic was inappropriate, meaning the infection was resistant to it. And you can see, if you look at this third row, that in that case about half the time we switched to a different first-line antibiotic which was appropriate where the clinician's choice was inappropriate, and some of the time we give a second-line antibiotic which is also appropriate. Now, I want to finally return to that question I raised earlier, which is: is this really an apples-to-apples comparison? Do clinicians really always have access to those four treatments when they're prescribing for uncomplicated urinary tract infection? To study that, I think one has to do a much deeper dive into the data.
For example, a randomized controlled trial would be one way to assess this; you could also imagine doing a prospective study. In our case, we wanted to get as much out of our retrospective data as possible, and so we did manual chart review. We chose 20 patients where clinicians chose a second-line antibiotic but our algorithm correctly chose a first-line antibiotic; that was one of the scenarios from the previous slide. And we very carefully read all of the free-text notes surrounding that visit and recent visits to understand the clinical decision making, and to see whether the choice of antibiotic that we would have given really was a feasible one or not. What we found is that indeed there were some cases where it wouldn't have been feasible. In particular, three out of 20 times the first-line agent was contraindicated, either because of suspicion of pyelonephritis, which is a life-threatening condition that should always be treated with a second-line antibiotic, or because the patient actually had an allergy to one of the first-line antibiotics that we would have given. In two of the 20 cases there was insufficient documentation for us to conclude whether the policy's action was feasible or not. And 15 out of 20 times we deemed that it was a feasible action. This finding gave us a bit of pause, and I said, well, maybe we need to recompute some of the comparisons we did earlier. So we went back to our data and much more carefully derived, from the available data, both allergy information, now including the free-text data, since allergies are not typically captured very well in the structured data, and the complications that we need to be thinking about. We found that this changed roughly 7% of the original decisions. So now we have what we feel much more confident is a real apples-to-apples comparison to clinicians, and we see that the ML policy now uses many more second-line antibiotics than it did before.
We went up by something like 6% in absolute value. But it's still about half the amount of second-line antibiotics that clinicians give, for a substantial reduction in inappropriate antibiotic prescribing. Okay, so we've talked about how one does the machine learning and how one can do an apples-to-apples comparison to clinicians. But now I want to start talking about some of those human-computer interaction questions. Specifically, we've been thinking a lot about when one should make a recommendation. This of course has to be tested in a prospective evaluation, but it's conceivable to us that we might get much more uptake of our algorithms if we only surface suggestions when we're really confident and we think that our decision is actually better than what the clinician would suggest. And so we've been formalizing that in the language of what's called learning to defer. By deferral here we mean that either our algorithm makes a decision, or we defer the decision to the clinician. That showed up in both this KDD paper and an ICML paper that we wrote which was explicitly on that problem of learning to defer. This direct policy learning algorithm that I told you about earlier is actually really nicely designed to deal with this deferral problem. We simply add a new action: instead of having just four antibiotics, we now have a fifth pseudo-antibiotic that we're going to call the defer action. The reward for the defer action is equal to the reward of the action that was actually taken by the clinician in the retrospective data, because in historical data there was no machine learning algorithm, so we know what they would have prescribed had they ignored the machine learning algorithm; we just look to see what they actually did. So we let the reward of the defer action be the reward of the action A taken by the clinician, plus a constant capital P, where capital P can be thought of as an incentive to defer.
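The defer construction just described can be sketched in a few lines. The names, reward values, and function signature are illustrative assumptions; the key idea, a defer action that inherits the clinician's reward plus an incentive P, follows the talk:

```python
DEFER = "DEFER"  # the fifth pseudo-action added alongside the four antibiotics

def augmented_reward(action, clinician_action, base_reward, P):
    """Reward used for training with deferral.

    base_reward(a): retrospective reward of prescribing antibiotic a.
    If we defer, we receive the reward of what the clinician actually did,
    plus a constant incentive P that tunes the deferral rate."""
    if action == DEFER:
        return base_reward(clinician_action) + P
    return base_reward(action)
```

Training the same direct policy learner over the five actions and then sweeping P is what produces the deferral-rate curves discussed next.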
So capital P gives us another knob that we can tune in order to change the deferral rate, and then we simply apply the same learning algorithm. That then allows us to create plots like this: as you change that constant capital P, you get different policies out, and you can look at those two criteria, the IAT rate and second-line antibiotic usage, shown in the left and right plots respectively, as a function of the deferral rate. The higher the deferral rate, the more we are letting clinicians make decisions, meaning the fewer recommendations our algorithms are making. These plots evaluate the performance on the subpopulation where we do make a recommendation, and what we're asking here is: are we able to identify a subpopulation where we actually do much better than clinicians? In orange here is actual clinical practice, and what we see is that as we identify smaller and smaller subpopulations to make our recommendations on, those tend to also be the harder populations: we can see that clinicians are both prescribing antibiotics that have higher rates of resistance and prescribing more second-line antibiotics, shown on the right. If you look at the blue dots, those are the policies learned by our algorithm, and we see that the gap between orange and blue tends to grow as you focus more on this particular subpopulation. All right, in the last five minutes of the talk I want to turn to this question of sequential decision making. Sequential decision making, as we mentioned earlier, is important because many decisions have to be made in sequence, and, for example, earlier choices might rule out later actions. So we now want to learn a policy that doesn't just make a recommendation at a single point in time but makes a series of recommendations across time.
As a running example to motivate this, let's look at managing sepsis, which is a complication of infection and one of the leading causes of death in hospitals. When you want to manage a patient with sepsis, in addition to giving antibiotics you have to manage many of their symptoms. For example, a patient might have breathing difficulties, and because of that you might decide to put them on mechanical ventilation at some point in time. Because they are on mechanical ventilation, you might need to sedate them because of some of the discomfort of being on mechanical ventilation. It's very possible that due to the sedation their blood pressure might drop, and so we might need to do a different intervention to artificially raise their blood pressure. And you see how there is a series of different actions, each really depending on the previous choices, that a clinician has to make in order to optimally manage this patient. Each action that was taken corresponds to a particular path in a very large tree, and you can imagine the alternatives, which are what I'm showing you here as these other little paths in gray. Those are paths corresponding to different treatment decisions whose counterfactuals we never get to observe in the data; we only get to observe the series of actions that were actually taken, to use for machine learning. So in the training data you now have trajectories where you have states; those states S are analogous to what I was calling X earlier, the patient features. You now have a series of actions A, and you again have rewards, but now you have a reward at every time point. We can use those observed trajectories to learn a new policy pi, and that goes by the name of off-policy reinforcement learning. Now, the important point I want you to take home here is that learning sequential policies from observational data is subject to the same pitfalls that we described earlier.
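As a small concrete anchor for the trajectory formulation above, here is a sketch of a trajectory's rewards and the discounted return that off-policy methods typically optimize; the discount factor gamma is my assumption, not a value from the talk:

```python
# A trajectory is a sequence of (state, action, reward) steps, one reward
# per time point. The discounted return summarizes a trajectory's outcome.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(list(rewards)):
        g = r + gamma * g
    return g
```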
I spent so much time talking about having an apples-to-apples comparison to make sure that when we talk about doing better than clinical practice, it's actually actionable. And unfortunately, many of the papers that have been published recently in the machine learning community for this problem in healthcare don't ask that question. I think that's a really big limitation, and I'm hoping to motivate all of you to start really thinking about apples-to-apples comparisons when you're writing your experimental results sections, but it's very hard to do so. In the previous work, I showed you how we did chart review: we read the notes for a patient to see whether the specific action that we recommended was a feasible action or not. In the dynamic treatment regime, or sequential decision-making, setting, it's a little bit less obvious how one would check those counterfactuals. To do that, we developed a new technique that was just published, called trajectory inspection, for helping us as machine learning folks, together with domain experts, start to sanity-check sequential policies. I'll walk you through this approach over the next few slides. The first step is to do your reinforcement learning. Here we're going to need two things: a model that allows you to simulate, so, for example, learning a Markov decision process would work, and the new policy, which we'll call pi. So you get the ability to simulate data and you get the policy itself; you need both of these things to do what I'll describe. Next, we're going to look at the actual data and try to find trajectories that have surprisingly aggressive treatments.
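One simple way to operationalize this search for surprising trajectories, sketched here under the assumption that we already have per-state value estimates from the learned model and from the observed data, is to rank states by the gap between the model's expected return under the new policy and the average observed return of patients who actually reached that state:

```python
def surprising_states(expected_under_pi, observed_avg, k):
    """Return the k states with the largest (model-expected minus observed)
    return gap; large positive gaps flag trajectories worth chart review."""
    gaps = {s: expected_under_pi[s] - observed_avg[s] for s in observed_avg}
    return sorted(gaps, key=gaps.get, reverse=True)[:k]
```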
For example, at the very top here I'm showing an example of a patient who was given a very low volume of fluids, shown as the green dot, and was given no vasopressors, but the learned policy would give a lot more fluids and a lot more vasopressors; that's what I'm showing with the blue cross. So we might want to ask: for that specific patient, was that treatment reasonable? Now, that's more of a single-time-step question. You might also be interested in the multiple-time-steps question. There, for every state, if you were to get to that state and follow your new policy, you can compute the expected reward from following that new policy. Then you can look at the average reward, in the data, of all patients who reached that state; meaning, compare the reward of the real world to the expected reward if you apply the new policy. And look at cases that are very positive: you look at the difference between those two values and focus on this right-hand side, which are cases where we think our algorithm gets much better performance than the real world. Those are the trajectories that we really want to dig into. Then, to dig into these trajectories, as before, one does chart review. One does a couple of things: first, one finds patients for which the recommended action is different from what was actually done; simulates forward in time to see what our model thinks would have happened to this patient; compares that to what actually happened to the patient, shown in the black line; and then reads the notes. So here's an example of a patient with a surprisingly positive outcome under the quote-unquote optimal policy that was learned. This patient, when they were admitted to the intensive care unit, was actually diagnosed with stage three lung cancer during that visit.
I believe there was a decision made to withhold life-sustaining treatment for this patient. The gist of it is that the patient actually died due to their cancer, and after talking about this case with our clinical collaborators, they tell us there's really nothing that a clinician could have done differently to save this patient, beyond short-term survival. And so this... David, just one quick moment. You have five minutes in the one-hour slot. Okay. Well, I have two slides left, so I'll finish up. Thanks. I want to also look at the simulations. You see here the same thing: in black is what actually happened for that patient; they ended up dying. And here we can see the actions that were actually taken by the policy. This again enables that conversation, because it helps clinicians look to see: does that make sense? Would those actions actually have had a chance to save the patient or not? All right. So to conclude, electronic medical records are a rich new source of data for machine learning for treatment suggestions. I gave you a case study from empiric antibiotic prescribing, where we found that our algorithms can substantially reduce second-line antibiotic usage while at the same time actually reducing the amount of inappropriate treatments. And I emphasized the importance of doing an apples-to-apples comparison to clinical practice, which I want to be the real take-home message from today's talk. The data that was used in this talk is actually publicly available, and you can get to it by going to my website, clinicalml.org. I also want to mention, if you're interested in this field of machine learning for healthcare, we have an online course that's freely available and is launching on Monday on the edX platform. Thank you, and Karsten, I'm ready for questions. Thank you, David. This was a very exciting talk about the use of machine learning in clinical treatment. Thank you very much for that.
Are there any questions from our network, from our students? Tawani, please go ahead. Hi, and thank you for the talk. I really appreciate it. I have a couple of questions. One of them might actually be very obvious and simple, but in a sequential decision process, how do we properly take care of the time between the steps? Maybe there's not a regular interval between the decisions. Do we just simply include it in the state, as we would for a normal reinforcement learning process, or is there any other consideration we could make? That's a great question. There has been some work in the biostatistics community on how to tackle that, but not nearly enough work in the machine learning community on how to tackle it using methods like deep Q-learning. Many of the recent applications have done something quite naive, where they will discretize time, let's say in four-hour intervals, for example, and associate an action with the most recent time bucket. But as you might be wondering, that in itself can cause problems. For example, if the dynamics of the effect of an action actually play out on a much shorter time scale than the discretization, then that could inadvertently introduce a kind of confounding, and you can get biased estimates. In fact, that's one of the many failures of the existing papers in this space, and one of many reasons why you should be critical and think about whether the policies that we learn actually make sense. Thank you. Yeah, thank you. That's close to what I was thinking. And I have a second quick question. In my limited experience, one of the issues with electronic records is often the amount of missing data, because maybe the gathering is not complete. Is that an issue in your case, or is it just not that much of a problem because, once the model is trained and we use it for predictions, we expect the values to be recorded correctly? That's a great question.
The most important thing is that you are able to derive the treatments, the interventions, and the outcomes; we would ideally like to have those perfectly. In terms of the features that drive your policy, machine learning algorithms tend to work really well with missing data, and in particular, that missing data might even be informative. So you might not want to impute the data; you might want to take the missingness pattern itself into consideration. What we need to keep in mind, however, is that there's a huge caveat: that only holds if, at test time, when you're using the model prospectively, the distribution over missingness doesn't change. If the distribution over missingness changes in one of your deployment sites, then suddenly all bets are off and your policy could have disastrous consequences. And that is really the reason why I often think about motivating imputation, not in the traditional sense of imputing because you have missing data and missing data is bad, but because it's one of many tools in our toolbox to try to deal with dataset shift. That's a great point. Thank you. David, there's one question from the YouTube audience. Do you have experience with modeling combinations of treatments? This is very common in oncology and seems to be the way forward in clinical practice. That's a great question. I've started thinking about it. We posted a paper on arXiv a few weeks ago called, I think, neural pharmacodynamic modeling or something of that sort, where we're looking at disease progression in the cancer multiple myeloma, where there are often combination treatments. We've designed one approach for this problem where we are using a particular representation of the combination treatment, similar to the one-hot representation I was describing earlier, and we learn ways of correctly combining those treatments to predict treatment effects.
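One common way to represent a combination treatment, in the spirit of the one-hot encoding mentioned above, is a multi-hot vector over a fixed drug vocabulary. This is a sketch under my own assumptions; the drug list is illustrative, not the paper's:

```python
# Illustrative multi-hot encoding of a combination treatment: each position
# in the vector indicates whether that drug was co-administered.

DRUGS = ["bortezomib", "lenalidomide", "dexamethasone"]  # example vocabulary

def encode_combination(given):
    """Map a set of co-administered drug names to a fixed-length 0/1 vector."""
    return [1.0 if d in given else 0.0 for d in DRUGS]
```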
But of course, we're just scratching the surface here. Thank you. I have one question. I hope I didn't miss it in your talk, but most of what you described seems retrospective to me, in that you look at existing data. How hard is it now, in the next step, to go to a prospective validation of what you are proposing here? So for the empiric antibiotic prescribing, I think it's going to be pretty straightforward now to do a prospective study, and that's definitely one of our next steps. The challenge tends to be with the workflow for deployment: if you need a ton of features that can only be derived from the electronic medical record, then someone has to build that pipeline of data so that you can actually get decision support at the point of care. One of the things we have been doing recently is trying to simplify our models. I skipped those slides, but we came up with a method, which hasn't yet been published, for building hierarchical models, which uses the direct policy learning approach to learn as sparse a model as you can for the treatment policy. What we found is that we're actually able to simplify the models I told you about earlier to only a handful of features. So few, in fact, that it's conceivable to me that we could now just print out the new policy on a piece of paper and hand it around to different outpatient clinics to consider for a prospective trial. But that's something we're actively working on. Very exciting. Thank you very much for this fascinating talk. If there's no further question from inside the network, then I will conclude this presentation. Even though it wasn't possible for you to be here in person, David, you did a great job. This was wonderful. Thank you so much for joining us.