Well, first I want to thank the organizers for asking me to be here and talk about the work that my group has done. As has already been alluded to, my interest is myopic in the sense that I am a clinician: I'm a cardiologist who maintains an active, though somewhat limited, practice, and my group is heavily involved in using machine learning for clinical decision-making. Oftentimes, in conversations with computer scientists working at the forefront of this area, there is not a full understanding of the specific requirements that will allow machine learning to be adopted by the clinical community to the benefit of patients. So I wanted to focus on that today, based on some work that we've done, and open up a broader colloquy that hopefully will continue past this talk.

So what does health care look like today, in 2019? We have the clinician, and we have a patient. This is, of course, a very simplified picture. The patient gives some information to the clinician, and that information comes in a variety of forms. Here we have a patient who has been admitted to the hospital, hence the gurney. There are laboratory studies, and there are physical exams performed by a number of health care professionals as this patient traverses the health care system. The clinician then has to process these data to decide what actions are in the best interest of the patient. The questions in the clinician's mind are: is this somebody who is really going to have an adverse event in the near future? How can I process the information to make a decision that is in his or her interest?

The relay of information is really not as simple as it is depicted here. It's really more like this. This is a patient in an intensive care unit: there are multiple sensors, multiple pieces of data, and the clinician is bombarded with this information. But the questions are the same. The clinician asks: what is this patient's risk of adverse outcomes? Is this somebody I have to worry about? Is this somebody on whom I have to expend resources to lower his or her risk? And moreover, once I decide whether this is a high-risk patient or not, what's the optimal therapy?

These decisions are fraught with a lot of uncertainty, for a variety of reasons. First, there are limited resources. We can't do all tests on all patients, and I'm not sure we'd want to in any case. In addition, we don't have optimal guidelines for the majority of the patients that we see. There are lots of clinical trials; lots of effort, money, and resources have been invested in finding optimal therapies through randomized controlled clinical trials. But in practice, the patients that we see, the very sick patients, wouldn't be enrolled in any of those trials, so it becomes unclear how the observations that arise from those trials are useful for clinical decision-making. And, as is depicted here, we have data overload.

So where can these methods, AI and machine learning, really make an impact in clinical decision-making? We have all of these complexities that arise from the care of patients, but data science can help us identify patients who are at increased risk, and can provide intelligent clinical decision support and changes in monitoring and therapies. There's a lot of promise here.
So when we talk about where these techniques can be most fruitful, we often simplify the picture: we have the physician being bombarded with lots of data, and we replace the physician with something. Something that processes these data, finds trends, and learns important relationships, which may be quite complex, between the different variables that arise from the patient, and that provides simpler, distilled information to the clinician so that he or she can make informed clinical decisions.

Then you look at the literature, even the clinical journals nowadays, and you'll find many papers that purport lots of advances in this area. And recently there really have been lots of advances. I think the most progress has been made in looking at medical images and picking up very unique and subtle inferences from complex data, such as the diagnosis of cancer. There was a recent paper on arrhythmia classification using the type of data that I, and I think some of the other cardiologists in the room, use all the time, electrocardiographic data, and there has been work on cancer prognosis as well. But here is the interesting thing: if you do a literature search over the last five years, even just within PubMed, there are ten to twenty thousand papers that purport to use these methods and make very significant findings. Yet if you go into the hospital and talk to the boots on the ground, you'll recognize that this is not reflected in practice. So while everyone recognizes that machine learning in the healthcare sphere is important, it has really not been embraced by the clinical community. And the question is: why is that?

Well, returning to this paradigm, you have some very abstract animal that takes all of these signals arising from the patient and hopes to give these data to the clinician so that he or she can use them for clinical decision-making. But if you look inside this box, it can contain very complex models, very complex relationships between input features and outputs that are of clinical interest. And at the end of the day, things like deep artificial neural networks are typically complex, with many modifiable parameters, and are notoriously difficult for the lay person to understand and for the expert to explain to the clinician.

And why is that a hindrance? Well, when the clinician sees this black box, there is very little intuition on how the model arrives at a particular result. The expert may train a particular model and say: well, I've tested it on a variety of different types of data, it has these performance metrics, and so I think this is something you should adopt. But the clinician would say: well, the first medical school, I think, was established in 1765, so we have over 200 years' worth of medical science, right? How does what you have created agree with these hundreds of years of data that we have amassed, with what I learned in medical school, with all of these prior observations? And even though the performance of the model is good, how can I be sure that it'll work on my specific patient? I know what happens to the cardiac output if you increase the heart rate. I know what will happen to the blood pressure assuming a certain resistance. I have these physiologic relationships in my mind, and they are quite durable: they've been borne out over time in various animal models and in patients.
We see this all the time; that is something I believe. But it is difficult to reconcile this model with my prior understanding. Moreover, and I think this is the related question: is this model consistent with what I know about human disease? Let's say I gave it features corresponding to a patient who I thought was very, very sick. Is the model going to tell me what I think should be the result? If you can't begin from that premise, then I think it's very hard to even begin the colloquy with the clinician, and even harder for clinical acceptance to follow.

So I'm going to look at these two different animals and give them different names. The latter question falls under the rubric of explainability, which was discussed somewhat, I think, by the previous speaker. The former I'm going to call trust: can I trust that the model will work on the different subgroups, the different types of patients, that I think are important?

When you look at how machine learning models are typically evaluated in the literature, at least the literature that I'm familiar with, it's by statistical measures of performance: the accuracy, the discriminatory ability, all of these metrics that we are familiar with. But the point here is that in the healthcare sphere, accuracy does not, nor should it, mean that the resulting model will gain clinical acceptance.

So, getting back to trust and explainability: these are the two paradigms that I think one has to reconcile a model with in order to have a conversation with the clinician, before the model can really be embraced by the clinical community.

First, let's talk a little bit about trust, that is, trusting individual predictions. What do I mean by that? Going to the engineering literature, we talk about understanding failure modes. Why is this important? Unlike other, non-clinical domains, incorrect predictions can have disastrous consequences for a particular patient. When I speak to friends of mine who are attorneys, their first thought is: well, somebody's going to sue you, right? But that's not the main concern, because to the family involved, it's the health and welfare of their loved one. For example, if you predict a patient to be high risk when they're really low risk, you may obligate them to invasive therapies or dangerous maneuvers that themselves entail some risk. If you predict a patient to be low risk when they're actually high risk, and the intervention itself is not very risky, then you've missed an opportunity to do some good, potentially to save a life. These are the questions we often deal with in the cardiovascular arena.

So what we would like to know, for any model, is: what are the characteristics of the data associated with incorrect predictions? And when can we really trust the output of the model? Given a specific patient, how do we know that the model is really applicable to that patient? Now, this is not new; people have talked about this before, mainly in other settings. We have statistical measures of performance, such as sensitivity, specificity, and precision, some kind of aggregate statistics about trustworthiness. But it's still difficult to know whether those are really applicable, and how to use them, for a single prediction or for a particular patient subgroup.
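Just to make concrete what I mean by aggregate statistics, here is a minimal sketch; the function name and return layout are mine, not from any particular library. The point is that these numbers summarize a whole test set and say nothing, by themselves, about any single prediction.

```python
import numpy as np

def aggregate_metrics(y_true, y_pred):
    """Dataset-level trustworthiness statistics for binary predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true positives caught
        "specificity": tn / (tn + fp),  # fraction of true negatives caught
        "precision":   tp / (tp + fp),  # fraction of positive calls that are correct
    }
```

Two patients with completely different feature profiles contribute identically to these aggregates, which is exactly why they don't tell you whether to trust the prediction for the one patient in front of you.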
So what is often done, or what can be done, is this: if you train a model on a particular training set, so that you know the data the model has looked at to learn whatever it has learned, and then a new patient comes along, you can compare that new patient to the training data. If that patient is very different from the training data, you can say: well, I have no guarantee that the performance of the model is going to be good in this case. These are all viable things to do. The problem is that clinical data sets are hard to come by. Individuals spend a lot of time gathering, cleaning, and wrangling data sets on which models for specific tasks are trained, and getting access to those training data, to really know whether a new patient is different, is challenging: you would have to ask the person who holds those data to perform such analyses. Whether that is how things should be is a separate conversation; this is a conversation not about how things should be, but about how things currently are with respect to data, and about addressing these issues within that reality.

So we took a look at this some time ago. A very talented student in my group, Paul Myers, did some work on the question: how can we get an estimate of reliability for a given patient when we don't have access to the training data? Let's say you have a model that has been trained on a particular data set, and you have a model prediction for a given patient. Suppose we could use another method for the same outcome and generate a new prediction, and to do this we use a generative model. What does that mean? We don't have access to the actual training data, but we can simulate data. If we have some general statistics about the training data, not the precise training data itself, we can make some quick calculations about what the prediction would be if we generated data from those statistics and used another model to make the prediction. When the two predictions disagree, we say the training data are insufficient to give a robust prediction for patient X. Very simple. This process leads to an unreliability score, which we can calculate for each patient, and we can then identify patient subgroups for which we expect the model to perform poorly, simply because the training data do not admit a robust assessment for those patients.

You can actually derive this score analytically, and this is just to show you that it can be done. We make some assumptions about the prior distributions of positive and negative patients; this is for a binary classification problem. And we get a score between zero and one: the higher the score, the more unreliable the prediction.

Now, you give this to a clinician, and what kinds of questions will they ask? Reliability is an interesting concept, but how do I use it in practice? The statements that I think are the most dispositive are: if a patient has a high unreliability score, your prediction is likely to be wrong, and the discriminatory ability for patients with that score is likely to be reduced. So those are the questions we asked, for patients with high unreliability scores, using well-established data sets.
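To show the shape of the idea, here is a minimal sketch of the scheme as I've just described it, with Gaussian class-conditional summary statistics standing in for the inaccessible training data. This is illustrative only, not the exact analytic score we derived, and all names here are mine.

```python
import numpy as np
from scipy.stats import multivariate_normal

def unreliability(x, model_prob, mu_pos, cov_pos, mu_neg, cov_neg, prior_pos=0.5):
    """Disagreement between a deployed model's predicted probability and a
    generative surrogate built only from summary statistics of the training data."""
    # Class-conditional Gaussian likelihoods from the summary statistics
    p_pos = multivariate_normal.pdf(x, mean=mu_pos, cov=cov_pos)
    p_neg = multivariate_normal.pdf(x, mean=mu_neg, cov=cov_neg)
    # Posterior probability of a positive outcome under the surrogate (Bayes' rule)
    surrogate_prob = prior_pos * p_pos / (prior_pos * p_pos + (1 - prior_pos) * p_neg)
    # Score in [0, 1]: the more the two predictions disagree, the less robust
    # a prediction the training data support for this patient
    return abs(model_prob - surrogate_prob)
```

The surrogate is just Bayes' rule applied to the two assumed class-conditional normals, so its disagreement with the deployed model's predicted probability lands naturally between zero and one.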
We have access to an old registry that was used to develop a clinically established score for estimating the risk of patients who have an acute coronary syndrome. One simple way to think about an acute coronary syndrome is that it's like a heart attack. Patients who have this phenomenon get enrolled in the registry and are followed over time. We tested our approach on a subset of about 70,000 of these patients, using an established risk score that clinicians use all the time. The question is: if you use this risk score on these patients and you look at the ones with high unreliability scores, what does the model do on that patient subset?

Here is a look at accuracy. Our accuracy measure is the Brier score, which is the mean squared error of the prediction; we know the truth, so this is a supervised machine learning problem. The red bars are the patients with very high unreliability scores, and the black bars are the patients without. At the far left are the patients with the highest unreliability scores, the top 1%, and the model is much less accurate on them, because the error is much higher than for those in the lowest 99% of unreliability scores. As you broaden the group, to the top 5%, 10%, 25%, 50%, the model gets better, but it is still significantly worse than on the portion of the data with low unreliability scores. Similarly, if we look at the discriminatory ability, the AUC: when the unreliability score is high, we have much reduced discriminatory ability, with the AUC going down to about 0.5, essentially a random guess.

So the upshot is that the unreliability score is a very simple metric. We really just had to write down some math, make some assumptions about the underlying distribution, and we got a robust result. The assumptions are quite simple: we assume an underlying normal distribution for both positive and negative cases. In addition, unreliable predictions are the most inaccurate. If you look at calibration curves, the patients with low unreliability scores are not perfectly calibrated, y = x would be perfect, but they are certainly much better calibrated than the patients with high unreliability scores. The red curve is the high-unreliability group; the blue is the low-unreliability group. And although we don't have the exact training set, if we look at all of the patients with very high unreliability scores, they form a distribution distinct from the patients with low unreliability scores. What's plotted here is a difference: how different is a patient with a high unreliability score from everything else in the data set? High unreliability scores correspond to the patients who differ the most from the training set.

One of the points I want to get across is that there is nothing really fancy here. We made some simple assumptions about the underlying data and developed a score on a reasonable scale, a score with which one can have a discussion with a clinician about the outcomes he or she thinks are important: the discriminatory ability and the accuracy. Reliability metrics that identify potential failure modes form, I think, an important part of any clinically useful score.
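The stratified evaluation I just described is straightforward to reproduce for any model that outputs probabilities. Here is a sketch, assuming scikit-learn metrics; the function and its return layout are my own illustration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def performance_by_unreliability(y_true, y_prob, u,
                                 top_fracs=(0.01, 0.05, 0.10, 0.25, 0.50)):
    """Compare Brier score and AUC in the most-unreliable top fraction of
    patients against the remaining low-unreliability patients."""
    y_true, y_prob, u = map(np.asarray, (y_true, y_prob, u))
    results = {}
    for frac in top_fracs:
        high = u >= np.quantile(u, 1.0 - frac)  # top `frac` by unreliability
        results[frac] = {
            "brier_high": brier_score_loss(y_true[high], y_prob[high]),
            "brier_low":  brier_score_loss(y_true[~high], y_prob[~high]),
            "auc_high":   roc_auc_score(y_true[high], y_prob[high]),
            "auc_low":    roc_auc_score(y_true[~high], y_prob[~high]),
        }
    return results
```

In the registry data, this is the pattern we saw: markedly worse Brier scores in the high-unreliability strata, and AUCs falling toward 0.5 in the most unreliable group.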
And a clinically useful unreliability score should really identify patient subgroups where the model's performance is compromised, where the model is not as applicable to that particular subgroup as one would assume from looking at its performance over the entire data set as a whole. So that is one example of trust, a concept that I think is important, really paramount, in the use of machine learning models in the healthcare sphere.

But explainability, I think, is another one. You save the hard stuff for last, because that's when people start to doze off and the coffee starts to wear off, so you don't pay as much attention, and you don't realize that it's a really hard problem that I don't have that much to say about.

Now, lots of people have thought about explainability, including the previous speaker. There is a paper from several years ago by Zach Lipton, often quoted, in Communications of the Association for Computing Machinery, the ACM, which asks: when we talk about explainability, what does it really mean? I hear many people talk about explainability, in the healthcare sphere and elsewhere, but it's really hard to know what it is, and whether one can devise an objective definition of it. At a very basic, very high level: when you say that a model is explainable, you can really say how the model works, and you may be able to learn from the model; you can ask, what do you have to tell me that I didn't know already?

There are some concepts that Zach talks about that I think are really insightful. Transparency: for a model to be fully understood, a human should be able to take the input data together with the parameters and, in reasonable time, step through every calculation required to produce a prediction. Being able to do that for a deep learning model is a challenging task. Decomposability: each part of the model, each input, parameter, and calculation, admits an intuitive explanation. These concepts make sense at a very high level, but how to convey them in a discussion is challenging. There are also post-hoc methods that address explainability. In natural language processing, people have generated text explanations from very complicated models. There is visualization. And there are saliency maps, in particular used with CNNs on images, which help you identify what parts of the data are the most dispositive for making any given prediction. All of these concepts have been talked about in the literature, and this is only a small snapshot of them. That's all well and good, but trying to apply these notions in the healthcare sphere is a little different.
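To ground one of those post-hoc terms before going on: a gradient-based saliency map is nothing more exotic than the derivative of the model's output with respect to its input. A minimal sketch in PyTorch, purely illustrative and not from any of the work cited here:

```python
import torch

def saliency_map(model, x):
    """Post-hoc gradient saliency: |d(output)/d(input)| highlights the parts
    of the input that most influence the model's prediction."""
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    model(x).sum().backward()  # .sum() reduces the output to a scalar for backward()
    return x.grad.abs()
```

Even a one-liner like this, though, presumes a shared vocabulary of gradients and tensors, which is precisely the problem that comes next.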
So, methods do exist to understand what a deep learning model has learned. I'm going to focus on deep learning; that's where most of the literature in the clinical sphere is. And we can describe deep learning in a very, very simple way: we have some input features, we have a latent space, some abstract representation of the data, and that eventually goes to some output that we care about. So when the computer scientist wants to have this discussion with a clinician, he or she might begin, with some words built on the previous talk: by looking at activations of the hidden layers, we can do X. Or: saliency maps can help us understand what information is most dispositive. And those are all English words. But the clinician hears something different. Anybody know what language this is? I think someone got it: it's Swahili. The clinician hears something that makes absolutely no sense.

So, in a sense, what really constitutes an explanation? To some, this is an explanation; to the clinician, it's far from it. Explanations are inherently subjective, and they can differ from specialty to specialty; I think this was alluded to in the previous talk. As a cardiologist, I may have a dialogue with another clinician, walk away, and say: that made absolutely no sense, that was no explanation whatsoever, because I come from a different frame of reference. It is challenging, consequently, to arrive at an objective definition of what constitutes an explanation. I don't know how to write down what an explanation is as a criterion that others could follow and apply in a principled way to whatever model they create. But I can think about it a little by considering necessary conditions for an explanation; in other words, when is an explanation bad? First, an explanation encompasses a discussion in which all participants speak the same language: not necessarily that both people come from the same background, but that they share the same corpus of words with which they communicate. Second, and I think this is what makes it particularly challenging clinically, the clinician will ask: does the model make reasonable inferences in light of my current understanding of human pathophysiology, of medical science? So while it is very hard to state objectively what a definition of explainability is in the healthcare sphere, I think it's more useful to say: I know that these criteria have to be met, because if they are not, the model won't be explainable at all.

A quick example along these lines: we once developed a model that uses the electrocardiogram for risk prediction, and I'll go through it briefly, because I think it's insightful. The electrocardiogram: I think most people who watch Grey's Anatomy, if that's still on TV, or one of those shows where people go into the emergency room and have surgery and all sorts of nonsense, will have seen one. It's a continuous signal that represents the electrical activity of the heart. We as cardiologists use it a lot; I think there are some electrophysiologists here, actually, who are more familiar with it than I am. We often look at parts of this signal to determine the risk of a patient when we first see them, when they have signs consistent with a heart attack or other myocardial injury. There is a region of the electrocardiogram that we typically focus on, called the ST segment. When it's elevated or depressed and a patient has chest pain, we are often concerned that the patient is having a heart attack, an acute coronary syndrome. So we focus, laser-like and with a somewhat myopic view, on the ST segment throughout the electrocardiogram for clinical decision-making. So what we did was ask: can you do this automatically? Because what we can look at by eye is limited in resolution, whereas the computer can look at things at an arbitrary level of resolution.
So we segmented the signal, extracted the ST segments, fed them to a neural network, combined that with other patient features, and built a deep learning model to predict the risk of death at some time after presentation with an acute coronary syndrome. I'm not going to go through all the laborious details, but the model does pretty well. If you look at the univariate hazard ratio for the neural network: your risk of dying within 14 days, if you are predicted to be high risk, is significantly higher than if you are not, and the same holds at 30 days, 60 days, and so forth. All of this is just to show you that the model works, and that it works whether you are mature (greater than 65) or naive (less than 65); it remains applicable in many different patient subgroups.

But the question at the end of the day is, again: what has the model learned? We began from a dictionary that the cardiologist understands; I think that was the point there. We start from a set of features with which you can at least begin to have a dialogue, so that the language is the same. And the model incorporates lots of patient features that the clinician recognizes as playing a part in identifying high-risk patients. But is that enough? Because the inner workings of the model are difficult to decipher. So the question, getting back to the second criterion, is: does the model make predictions that are consistent with what I know about disease?

What you can do is generate a large list of synthetic patients. There are a number of generative models one can use, the simplest being a normal distribution, as opposed to the various types of GANs that have started to be used in the clinical sphere. So the synthetic patients come from some underlying distribution that one either assumes or learns. Feed them into the model and look at the model outputs; here, the model output is a probability of death within some period of time. Then you can compute all different sorts of marginals: what's your risk of death if you're over a particular age, or if you have a certain phenotype? The clinician can ask: with this particular phenotype, what would be the risk of death? And you can ask reasoned questions about the insights and inferences that the model makes.

This is just one example of the sorts of plots you can make. You can put age on one axis, and on another the extent of ST-segment changes, whether the ST segment is elevated or depressed. Remember, this is one of the things I said we as clinicians look at, and we know from a variety of different data sources that those changes are associated with adverse outcomes. And the model says that as you get older, you are at higher risk of dying; that's the risk of death on the z-axis here. And as the ST segment becomes elevated or depressed, toward either end of that axis, your risk of death is higher. So it makes sense. You can do this to verify that the model has learned something consistent with prior knowledge, and you can also learn new relationships, because you can plot different things; there could be different hypotheses to pursue going forward. And we've applied this approach in other settings as well.
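The marginal surfaces I just described can be produced with a few lines of code. Here is a sketch under assumed names: a multivariate normal stands in for the generative model, and the risk model is assumed to expose an sklearn-style `predict_proba`; neither is our exact pipeline.

```python
import numpy as np

def marginal_risk_surface(model, mu, cov, age_values, st_values,
                          age_idx, st_idx, n_samples=10_000, seed=0):
    """Marginal predicted risk as a function of age and ST-segment deviation,
    averaging over a synthetic cohort for all other features."""
    rng = np.random.default_rng(seed)
    cohort = rng.multivariate_normal(mu, cov, size=n_samples)  # synthetic patients
    surface = np.empty((len(age_values), len(st_values)))
    for i, age in enumerate(age_values):
        for j, st in enumerate(st_values):
            patients = cohort.copy()
            patients[:, age_idx] = age  # clamp the two features of interest
            patients[:, st_idx] = st
            # average probability of death over the synthetic cohort
            surface[i, j] = model.predict_proba(patients)[:, 1].mean()
    return surface
```

Plotting the surface against the two grids gives the kind of risk landscape I just showed, and the clinician can immediately check it against intuition: risk should rise with age and toward either extreme of ST deviation.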
For instance, we have another study looking at a type of valvular disease called aortic stenosis, where we have been trying to predict the risk of death or aortic valve replacement within six months of the first diagnosis. You can do the same sort of thing. We start from a large list of features that may have an impact and apply a feature selection method, the bootstrap lasso, which many of you may be familiar with, to select a subset of features, because it's easier to have this colloquy with the clinician when the number of features is reasonable; if you've got a million different features to start out with, it's hard to have an informed discussion with a healthcare professional. Then you can put the selected features into a neural network and make the same sorts of plots. That, at least, has been our approach to meeting the two criteria outlined before.

So, in sum: standard statistical measures of performance should not be the only metric of success when evaluating a machine learning model, or any AI technology, for clinical use. Machine learning models are more likely to be embraced by clinicians when they are accompanied by additional metrics that identify failure modes, and when one can verify that the model is consistent with one's own prior understanding of disease.

Just to acknowledge the people who did all of the work: Paul Myers, who I think is here and has a poster, is the graduate student who did the trust work and the explainability work. Wangzhi Dai is also here and has a poster as well; he did some other work, on oversampling, that I didn't talk about. And of course our very talented collaborators from the MIT-IBM Watson AI Lab: Kenney Ng, Kristen Severson, and Uri Kartoun. Okay, two minutes to spare. Thank you.