It's my great pleasure to open the afternoon session. Our first speaker is Jenna Wiens, a rising star in machine learning in medicine. Jenna obtained her PhD from MIT in 2014, then joined the University of Michigan, where she's now Associate Professor of Computer Science and Engineering and Co-Director of the University of Michigan Precision Health Program. She has received a number of awards and honors: an NSF CAREER Award in 2016 and a Sloan Research Fellowship in 2020. She was listed in 2015 as one of the Forbes 30 Under 30 in Science and Healthcare, and in 2017 on the MIT Technology Review's list of 35 Innovators Under 35. Her research interests match the theme of this summer school: machine learning in healthcare and precision medicine. And she's also one of the main organizers of the annual Machine Learning for Healthcare meeting. We are very happy to have you here, Jenna, and we are looking forward to learning more about your research. Great, well, thank you so much for inviting me to come and speak to you all today. Today I'll be talking about, from diagnosis to treatment, augmenting clinical decision making with artificial intelligence, or AI. As we know, hospitals today are collecting an immense amount of data about our patients. In my research group, we develop and use machine learning techniques to detect patterns in these data that can be used to sort patients from low risk to high risk for a particular adverse outcome or event, or to match patients to appropriate treatments, with the ultimate goal of augmenting clinical care and improving patient outcomes. Given the challenges that today's health care system is facing, there's a critical need for this kind of clinical decision support. Currently, the demand for clinical care far exceeds the supply, and economists anticipate this is only going to get worse in the near future: by the year 2030, they anticipate a shortfall of over 100,000 physicians in the US.
And this shortage exacerbates what's already a serious issue in the field: clinician burnout. At this point, the increasing computerization of the field is widely cited as a contributor to the problem rather than a solution. Clinicians are spending more and more time entering and studying data about their patients, while still ignoring the vast majority of it. This burnout, combined with a lack of tools to make sense of all the data, has contributed to a large number of medical errors. A study published in the British Medical Journal in 2016 estimated medical error as the third most common cause of death, after heart disease and cancer. These issues highlight an important need, but also an opportunity, for AI. AI could help alleviate some of these issues. However, the goal is not necessarily one of automation. We know that models are error prone. Moreover, models can rarely capture all of the nuances of a complex clinical situation, so we aim to augment rather than replace clinicians. In today's talk, I'm going to describe two projects along these lines. The first focuses on augmenting clinicians in making diagnoses, while the second presents a novel clinician-in-the-loop framework for clinical decision making. So first, let's consider the problem of diagnosing patients who present with acute respiratory failure. There are several underlying causes of this condition, but the most common include heart failure, pneumonia, and chronic obstructive pulmonary disease, or COPD. It's important to get the diagnosis right, since each of these causes requires a different treatment. However, this is a very challenging clinical problem, and in up to 30% of cases patients may receive incorrect treatments. Tools that help clinicians in making diagnoses could ultimately improve outcomes. In recent years, deep learning has been successfully applied to medical imaging data, from applications involving the eye to the brain.
In particular, it's achieved near human-level performance on a number of tasks involving chest X-rays, like the system developed by Pasa et al. for detecting tuberculosis. So along these lines, we're going to explore using deep learning applied to chest X-ray data to learn a model that estimates the probability that the patient has pneumonia, COPD, or congestive heart failure (CHF). And here, the estimated probabilities do not necessarily sum to one, since patients could have multiple underlying etiologies, or causes, of their respiratory failure. Our study population consisted of nearly 1,300 patients recently admitted to Michigan Medicine, our hospital at the University of Michigan, who went on to develop acute hypoxic respiratory failure. We extracted chest X-rays for these individuals from our picture archiving system, and each patient admission was reviewed independently by at least two physicians, who assessed each patient on a scale of one to four for COPD, CHF, and pneumonia. These scores were then averaged and thresholded to produce our labels. In addition to the chest X-rays, as input to the model we extracted covariates from the electronic health record (EHR), including things like demographics and vitals, or even where they were in the hospital. We extracted this information from the time of admission up until the time the patient meets the criteria for acute hypoxic respiratory failure, since this is when a clinician needs to make a decision about how to treat the patient, and that will depend on the underlying diagnosis. While there's been a lot of recent work applying deep learning to chest X-rays, much of it has focused on X-rays taken in an outpatient setting, whereas our cohort included many critically ill individuals, some of whom cannot get out of bed, and this results in greater heterogeneity in the chest X-rays. So here on the right, I'm showing a typical chest X-ray from the Stanford cohort of patients.
So this is the CheXpert data set, and many of these were taken in the outpatient setting, and you can see it's a really nice chest X-ray. On the left-hand side, I'm showing a picture from our population, and we can see that there is much more variability in the positioning of the patient, because they're a sicker population. And while this heterogeneity makes the problem more challenging, it also makes it more realistic. So in addition to tackling this more challenging problem, we're also going to take a different approach than previous work, in that we're going to leverage not just the chest X-ray but also the EHR data. A lot of prior work has focused on just the images. So why are we going to use a multimodal approach? Well, this mimics how a clinician uses both EHR and imaging data to make a diagnosis. When the physicians were reviewing the patient charts, they weren't looking at just the images. Similarly, they weren't looking at just the EHR. So we're considering a fused model here, where we combine the EHR data and the imaging through a DenseNet, and this model then produces a probability for each of the three outcomes. We trained this model on our data, pre-training on other data sets: MIMIC-CXR and the CheXpert data set from Stanford. And then we applied the learned model to a held-out test set of 213 patients from our cohort. Here I'm just showing you the prevalence of the different outcome labels in our population. We see that CHF and pneumonia are most prevalent, while COPD has a prevalence of about 9%. And many patients actually have none of these. So to measure the value of a multimodal approach, we first compared to an approach that only considered images, ignoring the EHR data, and also an approach that only used the EHR data. On the task of diagnosing pneumonia, the fused model performed better than either modality alone.
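To make the fusion idea concrete, here's a minimal NumPy sketch. The features are hypothetical stand-ins (random vectors in place of a DenseNet's image representation and real EHR covariates); the point is the late fusion by concatenation and the independent per-label sigmoids, which is why the three probabilities need not sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical stand-ins: in the real system the image features would come
# from a DenseNet encoder and the EHR covariates from the health record.
img_features = rng.normal(size=(4, 1024))   # e.g. penultimate-layer activations
ehr_features = rng.normal(size=(4, 40))     # demographics, vitals, location, ...

# Late fusion: concatenate the two modalities into one feature vector.
fused = np.concatenate([img_features, ehr_features], axis=1)

# One independent sigmoid per label (pneumonia, CHF, COPD), so the three
# probabilities need not sum to one: patients can have multiple etiologies.
W = rng.normal(scale=0.01, size=(fused.shape[1], 3))
b = np.zeros(3)
probs = sigmoid(fused @ W + b)

print(probs.shape)  # (4, 3)
# Each row of probs can sum to anything between 0 and 3.
```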
And we saw a similar trend for heart failure, where the area under the ROC curve, which measures the discriminative performance of the classifier, was better than with imaging or EHR alone. However, for COPD, we noticed a slightly different trend: the EHR data added little over the imaging data alone. And this could be due to a number of reasons. Remember that COPD was the least prevalent of the diagnoses in our population. Moreover, it's one of the easier cases to diagnose based on a chest X-ray, because it's associated with the large, hyper-inflated lungs of these barrel-chested individuals. Overall, we found this really encouraging, but it wasn't as easy as I made it sound. Given the small number of labeled training examples (remember, we only had 1,300 patients in our entire cohort, and we had to split that into training, validation, and test) and the large number of parameters associated with a DenseNet, millions of parameters here, we had to leverage additional data sets. However, the existing models didn't transfer out of the box. The CheXpert data set that I mentioned from Stanford also came with a model that predicted pneumonia. Now, this was using just imaging data, but when we applied it to our cohort, the performance was only slightly better than random. A model trained on the data from MIMIC, which comes from the Beth Israel Deaconess Medical Center and consists of a cohort of patients in intensive care, did slightly better, probably because those patients are more similar to our cohort, but still, we didn't really get to a reasonable level of performance until we trained on data from Michigan. So we pre-trained on MIMIC-CXR and CheXpert and then fine-tuned on Michigan data. And we weren't the first to identify this kind of behavior, this lack of transportability or transferability.
So in 2018, Zech et al. published a study that assessed how well convolutional neural nets, or CNNs, generalized across different hospitals on the task of detecting pneumonia based on chest X-rays. And they found that internal performance was significantly higher than external performance: how well you do at your own institution, if you trained at your own institution, was better than applying the model elsewhere. And the prevalence of pneumonia varied so greatly across the hospitals they considered that merely sorting by hospital was enough to achieve a good AUC, good discriminative performance. In such cases, the model could simply learn to identify where the chest X-ray was collected rather than any relevant clinical features. So depending on the composition of the training and validation data, a model can take advantage of shortcuts, yielding seemingly good validation performance but really poor results when applied to out-of-distribution test data that doesn't exhibit the same skew. Geirhos et al. recently published a really nice survey on this, showing how shortcuts can arise across many different domains. In the context of chest X-rays, consider training data that are skewed. So here I have a correlation between being female and being diagnosed with heart failure, and it's an extreme correlation of one: every single female in the data set is diagnosed with heart failure, and no one else is. Well, given this training and validation data, a model can just learn to recognize whether or not the patient is female based on the chest X-ray, identifying things like breast tissue or differences in the skeletal structure near the scapula. Now apply it to a test set where this correlation doesn't hold, where an equal proportion of females and males are diagnosed with heart failure, and we see the performance of the model plummet to below 0.7, because the model didn't learn any clinically relevant features.
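This failure mode is easy to reproduce with synthetic data. In the hypothetical sketch below, a plain logistic regression is trained where a spurious "site" feature tracks the label perfectly; it looks excellent in-distribution and degrades badly once that correlation is broken:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def make_data(n, corr):
    # Column 0: bias. Column 1: weak, noisy "clinical" signal.
    # Column 2: "shortcut" (e.g. site of acquisition), correlated with y.
    y = rng.integers(0, 2, n).astype(float)
    clinical = y + rng.normal(scale=2.0, size=n)
    shortcut = np.where(rng.random(n) < corr, y,
                        rng.integers(0, 2, n).astype(float))
    return np.column_stack([np.ones(n), clinical, shortcut]), y

Xtr, ytr = make_data(2000, corr=1.0)   # shortcut tracks the label perfectly
w = np.zeros(3)
for _ in range(2000):                   # plain logistic regression by GD
    w -= 0.1 * Xtr.T @ (sigmoid(Xtr @ w) - ytr) / len(ytr)

Xte, yte = make_data(2000, corr=0.0)   # correlation broken "elsewhere"
acc_tr = ((sigmoid(Xtr @ w) > 0.5) == ytr).mean()
acc_te = ((sigmoid(Xte @ w) > 0.5) == yte).mean()
# Seemingly excellent in-distribution, near chance out of distribution.
print(acc_tr, acc_te)
```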
And the shortcut phenomenon has been identified by others. For example, Beery et al. showed how a cow in an unexpected location could be mislabeled; the model is just learning to recognize the green pasture background rather than the features that you and I might use to recognize cows. And this happens because the training data are biased in a specific way: they only present images of cows on these green backgrounds. In the context of learning to diagnose based on chest X-rays, we set out to identify potential shortcuts (what shortcuts might the model be able to exploit?) and characterize the ease with which models could exploit those shortcuts when making diagnostic predictions. This is work that we presented this year at the Machine Learning for Healthcare conference. So first we looked at how easy it was to learn whether or not a patient was female based on a chest X-ray. If this were a really challenging task, then it wouldn't necessarily be a shortcut. However, based on the chest X-ray alone, a DenseNet can identify female patients with near perfect discriminative performance. And this holds not only in our smaller cohort, but also in the larger MIMIC-CXR and CheXpert data sets. Beyond sex, we also found that deep learning models can learn to map chest X-rays to a number of potentially sensitive attributes, including age, body mass index, and even race. So when these attributes are correlated with the outcome of interest, a model can use them as shortcuts in its predictions. And this creates a problem when transferring across institutions, where demographics might shift. And the potential of shortcuts related to sensitive attributes could lead to bias amplification, in which human biases are not just replicated, but actually amplified by a model. In 2018, for instance, Amazon stopped using a tool to filter applicant resumes when it was revealed that the tool was leveraging shortcuts related to gender.
The problem is, once a model identifies a predictive feature like gender or sex, even if it's just an artifact of the training data, the model's decision rule may depend entirely on that shortcut. This presents issues when transferring models across institutions or populations. So suppose we have a set of patients, males and females, and we train a deep learning model on their chest X-rays to predict heart failure diagnoses. Applied to our test set from the same hospital, it looks like our model does great, correctly classifying those with or without heart failure. But now our colleagues at the Veterans Affairs hospital down the road, which serves a mostly male population, are excited to try out the system. When they apply the model to their population, it fails miserably, classifying everyone as no CHF. This is because the model was relying on a shortcut. It might seem like an extreme example, but there are other ways in which shortcuts can show up in diagnostic tasks. For example, consider these two images of the same patient. There's a slight difference between them, and I'm wondering if anyone can tell what the difference is. I'll give you a couple of seconds just to study these two images. The people watching on YouTube can also put their comments on Slido if they detect the difference. So, I'd be very surprised if any of you were able to perceive a difference, but a model can do this with near perfect accuracy, or near perfect AUC. The difference is that we've pre-processed them using slightly different Gaussian filters. For the one on the left, the Gaussian filter had a standard deviation of 0.4, and for the one on the right, a standard deviation of 0.5. So there's an almost negligible amount of additional blur in the one on the right.
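That kind of pre-processing difference is easy to simulate. Here's a small sketch (with a hand-rolled separable Gaussian blur, so it needs only NumPy) showing that sigma = 0.4 versus 0.5 yields images that differ by only a tiny, but perfectly consistent, amount:

```python
import numpy as np

def gaussian_kernel(sigma, radius=3):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian blur: convolve rows, then columns.
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

rng = np.random.default_rng(0)
xray = rng.random((64, 64))      # stand-in for a chest X-ray intensity image

blur_a = blur(xray, 0.4)         # "institution A" pre-processing
blur_b = blur(xray, 0.5)         # "institution B" pre-processing

diff = np.abs(blur_a - blur_b)
# The per-pixel difference is tiny, yet deterministic, so a model can use it
# as a proxy for the source institution if label prevalence differs by site.
print(diff.max())
```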
And if pre-processing is applied differently across institutions, something as small as a 0.4 versus 0.5 standard deviation in a Gaussian filter can lead to very strange shortcuts in a model, if there are also differences in the prevalence of diagnoses. This is particularly concerning since pre-processing steps, in addition to not being uniform across institutions, are sometimes never even reported in studies or publications. So, recognizing the potential for deep learning approaches to exploit shortcuts, we investigated transfer learning approaches to try to mitigate such behavior. In short, we want to encourage the model to rely on clinically relevant features that are more likely to generalize across populations. We were inspired by the way in which a clinician reuses common radiological features when diagnosing a disease. So in our setup, we aim to solve a target task (for example, here we want to diagnose whether or not the patient has heart failure), but we have only a small, potentially very biased training set. If we leverage just this training data, we will almost certainly learn shortcuts and fail to generalize. So we turn to a source task; this is a different but related task, say diagnosing pneumonia. Our approach is two-staged. In the first stage, we learn all the parameters of a DenseNet on this unbiased source task (we've checked the source task to see that there's no bias among the attributes we're interested in). In our DenseNet, we have an encoder parameterized by theta and then a linear output layer parameterized by W. So we put in our chest X-ray and we output our prediction for the source task, whether or not the patient has pneumonia, and we train this on the source population. Then we turn to our target task population, our very small, potentially very biased training set, and we freeze all of the layers associated with the encoder.
So we keep the theta parameters the same, but we fine-tune the last layer, so W changes to W prime, and we're now focused on our target task, which is predicting CHF. So, assuming a target task of CHF in which there's a one-to-one, so an extreme, correlation with age greater than 65, fine-tuning all layers on the target task (that's the no-transfer setting here on the left) does significantly worse than first training on the unbiased source task and then fine-tuning just the last layer on the biased target task. Now, you might be wondering why performance isn't completely random when the source task isn't used, as in the case on the left. Well, in both cases we're initializing the network with the MIMIC-CXR and CheXpert data sets, and this helps to a certain extent, but not as much as including another source task. We found similar trends across other potentially sensitive attributes, like being female and the body mass index of an individual. So while transfer learning is a well-explored area, it's a surprising result, at least to me, that the last layer simultaneously has the capacity to learn an expressive model while not having enough capacity to overfit to the bias. The full set of results are available in our MLHC paper, and you can download all of the code from our GitLab. And I'd like to just highlight my co-authors on this paper, in particular Sarah Jabbour, the PhD student who led the work, our clinical collaborators, Dr. Sjoding and Dr. Kazerooni, and David Fouhey, a colleague in computer science. So, takeaways. In conclusion, we should be concerned about these shortcuts. It's surprisingly easy to learn things like demographics, treatment, like whether or not a patient has a pacemaker (I didn't talk about that, but it's in the paper), and image pre-processing differences. Beyond computational approaches to mitigate such shortcuts, we're now looking at human-based strategies to mitigate them.
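The two-stage recipe (freeze the encoder theta, re-fit only the last linear layer W) can be sketched with a toy stand-in, where a fixed random projection plays the role of the pre-trained encoder and we fit the last layer by gradient descent on a logistic loss. All data here are hypothetical; the sketch only illustrates the mechanics of last-layer fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stage 1 stand-in: a frozen encoder theta, here just a fixed random
# projection in place of a DenseNet trained on the source task.
theta = rng.normal(size=(32, 16))
encode = lambda x: np.tanh(x @ theta)

# Tiny target-task training set (hypothetical data).
X = rng.normal(size=(100, 32))
y = (X[:, 0] > 0).astype(float)

# Stage 2: fine-tune ONLY the last linear layer W -> W' on the target task.
H = encode(X)                      # theta stays frozen throughout
W = np.zeros(H.shape[1])
for _ in range(500):               # plain gradient descent on logistic loss
    p = sigmoid(H @ W)
    W -= 0.1 * H.T @ (p - y) / len(y)

acc = ((sigmoid(H @ W) > 0.5) == y).mean()
print(acc)  # the single linear layer can still fit the target task
```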
And if the training data are skewed, these shortcuts can be exploited, and that will lead to poor generalization or, worse, bias amplification. But the good news is that transfer learning is surprisingly effective at mitigating such shortcuts. Though, as I mentioned, probably not enough, so we're now looking at human-based strategies: can humans identify when a model is biased, and how might they correct it? So, beyond diagnosis, high-quality patient care requires delivering the right treatment to the right patient at the right time. A supervised learning setup like the one we just looked at typically focuses on predicting some diagnosis or outcome, whether or not the patient will survive, or whether or not they have CHF. And while this information might be useful, it's not necessarily immediately actionable in all cases. In addition to knowing who is unlikely to survive or who might develop heart failure, clinicians need help in selecting the right sequence of treatments. So in the second part of today's talk, instead of focusing on a mapping from patient states to the probability of an outcome, we're going to try to identify the best sequence of actions, or treatments, in terms of downstream consequences. We consider a reinforcement learning (RL) framework, where at each time step t, the agent perceives a state s_t, selects an action a_t, and receives some reward before transitioning to the next state. Here, the overall goal is to identify the policy, or mapping from states to actions, that maximizes total reward over time. In terms of our problem setting, the perceived state may correspond to different vital sign measurements, while the action corresponds to a treatment. Thus, the problem directly maps to identifying the best sequence of treatments.
More formally, the problem is modeled as a Markov decision process, composed of a state space, an action space, a transition function, and a reward function, which specifies what reward we'll receive if we take a specific action in a specific state. As the agent, or clinician, interacts with the environment, they generate a trajectory consisting of (state, action, reward, next state) tuples. Given a trajectory, the return is defined as the discounted cumulative reward. Our goal, then, is to learn which actions to take in each state. However, the actions I take impact not only the immediate reward, but also the next state, and thus future possible rewards. And while the state and action spaces are fairly straightforward to define, when it comes to defining the rewards there are numerous challenges. Should we optimize for long-term quality of life or short-term stabilization of vitals? What if I'm willing to undergo an invasive treatment over taking a drug that might have negative side effects? Designing a single reward function that captures the goals and objectives of many different individuals is challenging. While there's been some work on augmenting rewards via reward shaping, oftentimes researchers in healthcare will default to rewards based on survival, since it represents a clear goal and is straightforward to measure. However, using survival as the reward is likely to induce many near-equivalent actions. For example, different doses of a drug might perform similarly with respect to survival. And if we simply apply a typical RL framework and learn the single best action, we'll end up ignoring these near-equivalent actions. In reality, there could be many actions that lead to similar outcomes with respect to survival but differ due to a variety of factors not accounted for in the reward. For example, some medications might be associated with high costs or severe side effects.
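As a concrete reference point, the discounted cumulative reward of a trajectory can be computed by folding the rewards from the end backwards (plain Python, hypothetical numbers):

```python
def discounted_return(rewards, gamma=0.9):
    """G = sum over t of gamma**t * r_t, accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Two steps with no reward, then a terminal reward of 1:
g = discounted_return([0.0, 0.0, 1.0])
print(round(g, 4))  # 0.81, i.e. gamma squared times the terminal reward
```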
And while it's certainly possible to incorporate these into the reward function, as I mentioned earlier, it remains challenging to define a universal reward that captures the goals of different individuals. Not to mention, the goals of an individual might change over time. So, recognizing these challenges, we propose learning a mapping from each state to a set of near-equivalent actions. This presents the patient or clinician with a set of good choices while optimizing for survival, and in turn, individuals can incorporate other aspects of reward that might otherwise be difficult to quantify. So our approach learns a set-valued policy: as opposed to learning a mapping from each state to the single best action, we learn a mapping from each state to a set of near-equivalent actions. Let's consider an analogy where we're in a grid world and we want to get from the starting state S to the goal state G. Instead of identifying the single best route, a set-valued policy will map each state to multiple near-equivalent actions, thereby finding multiple routes to navigate the world while achieving a cumulative reward close to the optimal value. And while learning near-equivalent actions might seem straightforward, the sequential nature of decision making in reinforcement learning makes it more challenging. Recall that the choice of action impacts not only the immediate reward, but the next state and therefore possible future rewards. For two actions to be considered near-equivalent, not only should they be similar in the short term, meaning they have similar instantaneous rewards, they should also be similar in the long term and lead to similar futures. Before we formalize our notion of near-equivalency, let's first recall how we define value in an RL setting. Assuming we always act optimally, we can define the value of a state-action pair, or Q value, recursively.
It's the instantaneous reward plus the value of taking the best action in the subsequent state s'. From this, we can define the value of a state with respect to the optimal policy pi* as the Q value associated with the best action. However, in our setting, a policy maps each state to a set of actions instead of a single best action. Thus, the notion of value must account for all possible downstream choices. This suggests a worst-case analysis. In particular, V^pi is now defined as the minimum Q value over the actions in the recommended set of near-equivalent actions. So pi(s) gives us a set of actions that we believe are near-equivalent, and we take the worst case among those. And again, Q is defined recursively, only now it's the immediate reward associated with taking action a in state s, plus the worst-case future reward. Now, despite the applicability of this framework, there's been relatively limited work on learning set-valued policies. Most notably, Fard and Pineau proposed a solution for finite-horizon planning, but their solution requires knowledge of the Markov decision process model and relies on an exhaustive search over all state-action pairs. So in light of these limitations, we propose a new algorithm for learning near-optimal set-valued policies that applies in a model-free setting and can be solved efficiently. To quantify how good our set-valued policy pi is compared to the optimal policy pi*, in line with past work, we define near-optimality using a multiplicative constraint. Assuming V* is non-negative, a user can specify zeta as the sub-optimality margin. As we increase zeta from zero to one, we trade off optimality for more choice in our set of actions. So if zeta is zero, we just default to the optimal policy, and as we increase zeta, we get larger and larger action sets. Given a zeta, our goal is to learn a set-valued policy that satisfies this near-optimality constraint.
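One plausible reading of the multiplicative constraint is that, for each state, we keep every action whose Q value is at least (1 - zeta) times the optimal state value. Here's a minimal sketch of that selection rule on a hypothetical row of Q values; it thresholds against Q* directly, ignoring the worst-case recursion, just to show the zeta trade-off:

```python
import numpy as np

def near_equivalent_actions(q_row, v_star_s, zeta):
    """Actions a with Q(s, a) >= (1 - zeta) * V*(s).

    zeta = 0 recovers only the optimal action(s); larger zeta trades
    optimality for more choice. Assumes V*(s) is non-negative.
    """
    q_row = np.asarray(q_row, dtype=float)
    return np.flatnonzero(q_row >= (1.0 - zeta) * v_star_s)

# Hypothetical Q values for one state; V*(s) is the max, 1.0.
q = [1.0, 0.97, 0.80, 0.40]
print(near_equivalent_actions(q, 1.0, 0.0))   # [0]
print(near_equivalent_actions(q, 1.0, 0.05))  # [0 1]
print(near_equivalent_actions(q, 1.0, 0.25))  # [0 1 2]
```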
Now, there's a trivial solution here, where for each state we only recommend the optimal action. This respects the constraint, but it's not very useful; remember, our goal is to identify near-equivalent actions and provide choice to the clinician, or human in the loop. Recall that an optimal policy is greedy with respect to its own Q function: for all states s in our state space, the best action corresponds to the action with the greatest Q value. For set-valued policies, we formulate a similar equation and seek the fixed-point solution. Here we include all actions for which the Q value respects the near-optimality constraint. We can think of this as near-greedy action selection with respect to the worst-case Q function of pi. Using this equation, we can derive a family of value-based algorithms for learning set-valued policies, and specifically here, we'll examine a temporal-difference variant based on Q-learning. Like Q-learning, we iteratively update our estimated Q values by sampling transitions from the environment: we take action a, we observe the immediate reward and the next state. However, instead of greedy action selection, where we just take the max, we use the near-greedy heuristic as the action selection strategy to compute the updated Q values. Note that we require V* as an additional input, but in practice, one can run a separate Q-learning procedure first to estimate V*. Like Q-learning, this algorithm is off-policy and does not require a model of the environment. However, unlike Q-learning, in the general setting convergence is not guaranteed. To see this, consider the following example, in which we have two states, S1 and S2, and the goal is to get to S_end, the goal state. We can do this by taking actions R (right) or L (left). If we assume a discount factor of 0.9, then we can calculate the Q values and state values associated with the optimal policy, which is to always go right, regardless of what state you're in.
In this toy example, there are only two candidate set-valued policies: the trivial solution, which is just the optimal policy, and one that encompasses all the possibilities. So pi_1 is the trivial one, and pi_2 is the one that encompasses all possibilities. When our optimality margin zeta is greater than 0.19, neither of these set-valued policies is a fixed-point solution; when zeta is less than 0.19, we default to the trivial solution. To check whether a policy pi is a fixed-point solution, we can calculate Q^pi, construct a new policy pi', and see if it changes. If it's a fixed-point solution, then it shouldn't. So, assuming an optimality margin of 0.2: in the first state, we'd add R, since it's exactly equal to the optimal value, while in the second state, we'd add both L and R, since both are within 0.2 of one. But notice the policy changed; this corresponds to the second set-valued policy. So let's check if that's a fixed-point solution. Notice that the value of going left from state two is zero, since we have to assume we pick the worst case over all downstream actions: if I'm in state two and I go left, I could go right, but then left again, and then right and then left again, and I'll end up cycling, with a value of zero. So when it comes time to construct pi', we end up excluding L from our set of near-equivalent actions, and pi changes again: not a fixed-point solution. And while our algorithm is not guaranteed to converge in the general case, reassuringly, we can prove that for directed acyclic graphs (DAGs) with non-negative rewards, our algorithm is guaranteed to converge to a unique solution. Moreover, despite the lack of strong theoretical guarantees, through a series of experiments we can show that the proposed algorithm performs reasonably well under a variety of settings, including non-DAGs. So here we highlight two experiments.
First, on a non-DAG problem, we demonstrate the empirical behavior of the algorithm and show that it's able to converge to useful solutions, though it's not guaranteed to. And second, on a real clinical problem, we show the algorithm helps discover meaningful near-equivalencies among treatment actions. So, back to our FrozenLake environment. Our goal is to get from the starting state S to our goal state G without falling into any of the black holes. In each state, you can move up, down, left, or right, and we assign a reward of one once you make it to the goal state G. To induce near-equivalent actions, we assign a small reward to each transition, so that routes of the same length will have slightly different returns. And as we increase zeta, which trades off optimality for more choice, we should end up with a larger set of potential routes. So here I'm plotting the different routes that we get, consistent with each zeta. With zeta of zero, we end up with just the optimal solution. But as we increase zeta, we're trading off optimality for a larger average policy size, that is, the average number of actions for each state, and this continues to increase as we get more and more choice. So this demonstrates that even in the non-DAG setting, the proposed algorithm can converge to useful solutions. In our second experiment, we considered the challenging clinical task of treating patients with sepsis, and we used data from MIMIC. So this was the data set I mentioned earlier, only now we're just looking at the structured EHR data, not the chest X-rays. We use a problem setup similar to past work by Komorowski et al. from 2018, where the goal is to learn optimal treatment strategies for patients with sepsis in the ICU, and each patient is defined in terms of 48 physiological signals measured at four-hour intervals.
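To see the zeta-versus-choice trade-off numerically, here's a toy gridworld sketch. It computes Q* by value iteration and then, as a simplification of the full algorithm (thresholding against Q* rather than solving the worst-case fixed point), counts how many actions per state clear the (1 - zeta) bar. The environment and the tie-breaking reward jitter are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deterministic 3x3 gridworld: reach the bottom-right goal.
# Actions: 0=up, 1=down, 2=left, 3=right. Small random per-step rewards
# break ties so that equal-length routes have slightly different returns.
H = W = 3
goal = (2, 2)
gamma = 0.9
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
step_r = 0.01 * rng.random((H, W, 4))  # tiny tie-breaking rewards

def step(s, a):
    if s == goal:
        return s, 0.0
    r, c = s
    dr, dc = moves[a]
    nr, nc = min(max(r + dr, 0), H - 1), min(max(c + dc, 0), W - 1)
    reward = step_r[r, c, a] + (1.0 if (nr, nc) == goal else 0.0)
    return (nr, nc), reward

# Value iteration for Q* (goal state is terminal with value 0).
Q = np.zeros((H, W, 4))
for _ in range(200):
    for r in range(H):
        for c in range(W):
            for a in range(4):
                (nr, nc), rew = step((r, c), a)
                Q[r, c, a] = rew + gamma * Q[nr, nc].max() * ((r, c) != goal)

# Larger zeta => larger action sets on average: more choice, less optimality.
for zeta in (0.0, 0.05, 0.2):
    V = Q.max(axis=2)
    sets = Q >= (1 - zeta) * V[..., None]
    print(zeta, sets.sum() / (H * W))  # average actions per state
```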
And the action space consists of 25 discrete actions, corresponding to five different levels of vasopressors and five different levels of IV fluids. For illustration purposes, I'm going to show results when we set our optimality margin, zeta, to 0.05. So to investigate the applicability of our approach, we looked at what our algorithm identified as near equivalent in the action space. So we visualize the action near equivalencies in a five-by-five grid. On the horizontal axis here, I have five different levels of vasopressors, increasing from left to right. And on the vertical axis, five different levels of IV fluids. And each cell then corresponds to a combination. So here I'm giving no vasopressors and no fluids, and here I'm giving all the vasopressors and all the fluids. And the red numbers here indicate how often each action is considered optimal by the learned policy over all states. So here I'm marginalizing over all states, or I'm summing over all states, so that I can fit this on one slide. And within each of these boxes, I'm going to plot another five-by-five grid that's going to show me how often, for that particular action, the other actions are considered near equivalent. So here's the heat map with the results. To understand this overall picture, let's start by interpreting a small slice. So let's just pick this one here, where I am giving no vasopressors and a small dose of fluids. When a low dose of IV fluids is the optimal action, this action is always included in the set of actions, which is good. And about 35% of the time, the action of no IV fluids is also included in the near-equivalent set, while the remainder of the time, a larger dose of IV fluids is near equivalent. Here we're collapsing these frequencies across all states, but remember near equivalency will ultimately depend on what state we're in. So if we're in a healthier state, maybe these are the actions that are near equivalent.
And if we're in a sicker state, these might be the actions that are near equivalent. So to further interpret these results, we worked closely with our co-author, Dr. Michael Sjoding, a critical care physician at Michigan Medicine who frequently treats patients with sepsis. And he's going to tell you a little bit more about how to interpret these results. Let me just make sure I've got this set up. So, the results of this figure are consistent with what I would expect based on my clinical expertise. First, take a look at the first column from the left. These actions correspond to when fluids are administered but no vasopressors are administered. And you can see vertical bands in each cell, which means that similar doses of IV fluids without vasopressor treatment are frequently considered near equivalent. And this corresponds to what I would expect physiologically: when patients are given similar doses of a medication, we would expect them to respond in a similar manner. Also, if you look at the second row from the bottom, which corresponds to a low dose of fluids and a range of doses of vasopressors, these actions are also considered near equivalent. Finally, the top right cell may seem a little inconsistent with some of the other results. It suggests that when the action is to administer a very high dose of both fluids and vasopressors, a near-equivalent action may be to do nothing, or give no treatment. This may seem counterintuitive, but it's actually also consistent with my clinical experience. In some very sick patients, no matter what we do, they end up having a similar bad outcome. And that's largely because, by the time we're able to treat them, their disease has already become irreversible. So overall, on the clinical task of learning near-equivalent sepsis treatments, I think our proposed algorithm is able to uncover clinically meaningful actions that are near equivalent.
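As a concrete sketch of the 5x5 action grid and the marginalization behind the heat map just discussed: the flat-index encoding and the random stand-in near-equivalence data below are assumptions for the example, not the paper's actual results.

```python
import numpy as np

# The 25-action space is a 5x5 grid: a flat index 0..24 encodes a
# (vasopressor level, IV fluid level) pair, level 0 meaning "none".
N_LEVELS = 5

def decode(action_idx):
    """Flat action index -> (vaso_level, fluid_level)."""
    return divmod(action_idx, N_LEVELS)

# The heat map marginalizes per-state near-equivalence over all states.
# near_eq[s, i, j] is a random stand-in for "action j is near equivalent
# when i is optimal in state s" (synthetic, 100 states).
rng = np.random.default_rng(0)
near_eq = rng.random((100, 25, 25)) < 0.2
heatmap = near_eq.mean(axis=0)   # fraction of states where j ~ i

print(decode(0), decode(24))     # (0, 0) and (4, 4): none vs. max of both
```

Each row i of `heatmap` corresponds to one inner five-by-five grid on the slide: how often each other action j is near equivalent when i is the optimal action, collapsed across states.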
So in summary, we proposed a new algorithm for learning near-optimal set-valued policies. This represents an important step for clinician-in-the-loop, or human-in-the-loop, decision support. Humans can now incorporate additional knowledge to select among near-equivalent actions, knowledge that might be really difficult to bake into a reward function or an objective function. And there is potentially broader impact here beyond healthcare applications, perhaps to applications in education or transportation. Again, I'd like to highlight my co-authors: Shengpu Tang, a PhD candidate in my group who led the work; Aditya Modi, another PhD candidate in Michigan CSE; and my clinical collaborator, Dr. Sjoding. And you can find all of the code for this paper on our GitLab, or check out the paper. It was published at ICML this past July. So in conclusion, recognizing that AI is far from perfect, I argue that we need solutions that seek to augment rather than automate clinical tasks. I highlighted how deep learning approaches are susceptible to shortcuts and how this can lead to not just strange but dangerous behavior. To address this issue, inspired by how clinicians make diagnoses, we presented a transfer learning approach to mitigate potential shortcuts when learning from images. We also presented a novel algorithm for learning good options for patients and clinicians during decision making, giving them more choice. I'd like to highlight the students who worked on these projects and our collaborators, experts in critical care, radiology, and computer vision; without them this work would not have been possible. And I'm happy to take questions now. Thank you. Thank you very much. We send a round of virtual applause to you for this presentation. Thank you, that was great. You highlighted extremely well how difficult the application of machine learning is in reality in the clinic. I'm sure we'll have a number of questions about this. Let me start with a question from the network.
Yeah, Vesna, please go ahead. Hi, great talk. Thank you so much. My question is about these shortcuts in diagnostic tasks involving chest x-rays. Maybe you mentioned it and it wasn't really clear to me. But how do you find these attributes associated with biases in your data sets? So for example, you showed these two x-ray images where you said there were subtle differences in image processing. But in other learning tasks, such as estimating BMI or age, how do you find these shortcuts associated with biases? Great question. So for our experiments in the paper, we injected a lot of the bias in order to see just how easy it was to exploit it. So we asked: if the shortcut's there, would the model take it? But oftentimes you don't know if there are shortcuts in the data. You don't necessarily know if there are biases. And you can check to a certain extent by looking at, say, the correlation between individual attributes and the outcome label. But the really nice thing about the proposed transfer learning approach is that, so long as you have a source task that you've checked over for the biases that are relevant to your problem, you don't necessarily need to know which biases are in the target task. Okay, thank you. There is a very related question on Slido, which I'll read out. Do you have any recommendations to more or less reliably check a model for the existence of these shortcuts? Yeah, so that's a really interesting and challenging question. I think it's really difficult to guarantee that there are no biases. That's why I say mitigate, or I used the word mitigate when I was presenting this. Obviously there are checks you can do with respect to correlations, like I just mentioned. The other thing is visualization. So when you're using these deep learning frameworks on imaging data, visualizing what the model's actually using to make its prediction. So that was the figure that I showed. Let's see, where was it?
I'm trying to pull it up again here, where it ends up using features related to the skeletal structure, where there are a lot of differences between females and males, or breast tissue in the image, rather than any clinically relevant findings. And this is where sitting down with our collaborators in radiology and critical care, who look at a lot of these images, helps; they can quickly say, no, that doesn't make sense. And that's in fact the next stage of this project: looking at such human-based strategies. So what kinds of explanations are enough? We don't know; that's something that I'm really interested in studying. Sorry, I was muted. There's another question on Slido. Could you comment on the choice of metrics for assessing the quality of the diagnostic AI models, in particular variable weights for type one and type two errors? Yeah, so we get this question a lot, in part because oftentimes in my talks I'll just show the area under the curve, which is a summary statistic, right? So the area under the curve is just the area under the receiver operating characteristic curve, which trades off your false positive rate and your true positive rate. And we need to pick something to measure the performance of these models, but oftentimes it depends on the use case. And in addition to looking at the area under the ROC curve, we'll frequently look at the area under the precision-recall curve, because oftentimes what clinicians care about is the precision, or the positive predictive value. So of what I call positive, what's actually positive, and this is related to false alarms. So I think there's been some work showing that at a positive predictive value of around 60%, so around two thirds of the time the model alarms it's actually a true alarm, a model is clinically useful. But it can really, again, depend on what the intervention is.
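The metrics in this answer can be made concrete with a small synthetic example; the prevalence, score distributions, and alarm threshold below are all invented for illustration, not drawn from any of the models in the talk.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic illustration (no real patient data): for a rare outcome,
# AUROC can look strong while the precision-recall view, tied to
# positive predictive value (PPV), is far more modest -- and the random
# baseline for PPV is the prevalence itself, not 0.5.
rng = np.random.default_rng(0)
y = (rng.random(50000) < 0.02).astype(int)        # ~2% prevalence
scores = rng.normal(0.0, 1.0, y.size) + 2.0 * y   # positives score higher

auroc = roc_auc_score(y, scores)                  # random baseline: 0.5
auprc = average_precision_score(y, scores)        # random baseline: prevalence

# PPV if we alarm on the top 5% of scores, and its lift over baseline.
threshold = np.quantile(scores, 0.95)
ppv = y[scores >= threshold].mean()
print(f"AUROC={auroc:.2f}  AUPRC={auprc:.2f}  "
      f"PPV={ppv:.2f}  lift={ppv / y.mean():.1f}x")
```

The lift figure is the "times the baseline risk" framing from the answer: a PPV several times the prevalence can be clinically useful even though it looks small in absolute terms.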
So we have a project that I didn't talk about today that looks at predicting healthcare-associated infections, in particular infections with Clostridioides difficile, and it's a relatively rare outcome. So only 1% of patient admissions develop these infections, and because it's relatively rare, when you look at positive predictive value, your baseline positive predictive value will be the prevalence. It's not like the area under the curve, where random is 0.5; random is now the prevalence. A positive predictive value better than 1% is better than random. So in that work, we have a positive predictive value of about 6%. So we're able to identify patients at six times the baseline risk, and that can be useful depending on the intervention. So if the intervention is, say, give patients probiotics, and it's relatively inexpensive and low risk, then you can tolerate more false positives. But if the intervention is put everyone in a private room, then there are obvious constraints there and there's a smaller number of false positives that you can tolerate. So it depends on the use case. And this is where it's so important to work really closely with clinical collaborators or hospital leadership to understand how the models will be used or integrated into clinical workflows. Thank you. Is there another question from inside the network? If not, I will formulate one. It's more of a perspective question. So yesterday, we also had a very exciting talk by Mihaela van der Schaar, who presented her work on AutoML. And both talks impressed me very much. Still, the messages are a bit different. So in Mihaela's talk, the message is that we are advancing rapidly in automating machine learning pipelines. In your talk, one message for me is that it's so difficult to transfer a machine learning model from one hospital to another, and it's so important to keep the clinician in the loop to get a really clinically useful prediction from the model.
Like, where do you see, or where is your position, or what's your take on these two, I would call them opposites, or different perspectives, on automating versus integrating the clinician? Yeah, that's a great question. I wrote a perspective piece about a year ago with some of my colleagues titled Turning the Crank for Machine Learning: Ease, at What Expense?, where essentially, you're right, we have these pipelines now. Essentially, you know, they do model selection for you. And they're nice from a user's perspective. But what are the issues once we deploy them, right? And I think that, at least in the short term, it's really important that we include the domain experts, the clinicians, and the patients in the loop, precisely for the reasons I gave related to shortcuts, until we have automated tools that can identify these shortcuts reliably. Or even consider what I said about complex clinical situations: we're doing a much better job than we were just a couple of decades ago in terms of collecting data. We have rich data inside the hospital now. We can integrate electronic health record data with chest x-ray data. But health doesn't end at the hospital. You go home and you still have vital signs, but we're not necessarily measuring all of those, right? So we don't have all of the data on an individual. So that's where asking the individual is really important, keeping them in the loop, keeping the physician in the loop, because they can sense things we can't. So I don't think we're quite there from a data perspective, a sensing perspective. And we're still learning about all of the unexpected ways in which these models can fail. So I argue that, at least in the short term, AI plus clinician is better than clinician alone, and AI plus clinician is also better than AI alone. Thank you very much, a very interesting answer, very informative. Good.
If there are no further questions, then I thank you very much for taking the time to give a talk here. This enriched our summer school very much. So again, a round of applause for you; even if we cannot clap our hands all at the same time, we do so virtually. Well, thank you so much for having me. I would love to visit sometime in person. Yes, absolutely. We'll do this when the world returns to normality. Yes. You're most welcome. Next year, the school is supposed to happen in Estonia. We hope it can happen in person. Okay. Great. Thank you very much. Thank you very much. You too.