It's my great pleasure to welcome Mihaela van der Schaar from the University of Cambridge, the Alan Turing Institute in London, and UCLA. Mihaela is a star in the field of machine learning for personalized medicine, as you can see from the many awards she has won in her career. I'll name a few here: the Oon Prize in Preventative Medicine, the National Science Foundation CAREER Award, three IBM faculty awards, the IBM Exploratory Stream Analytics Innovation Award, and many other awards and best paper awards. Her work has also led to 35 US patents, and since 2019 she has been the most cited female AI researcher in the UK. I mentioned at the beginning her list of affiliations, which also mirrors her great success. She has for several years now made important contributions to the field. It's a great pleasure to have her here this afternoon and to hear about her work on AI in medicine and her perspective on how the field will develop further. Thank you, Mihaela, for joining; we are happy to have you here. The floor is yours.

Thank you, Carson, for a wonderful introduction. Thank you so very much, and thank you very much for joining me. I'm going to tell you today about the work we have done on automated machine learning, and since you cannot do machine learning for medicine without worrying about interpretability, I hope to have the time to tell you briefly about that work as well. The research agenda in our group, more generally, is developing machine learning methods for next-generation healthcare, as well as understanding and augmenting clinical decision making such that machine learning can truly be included in the ecosystem of clinicians and machines. More recently we have also started to look at machine learning for genomics and drug discovery, but the focus of my presentation today will be on AutoML and its role in developing clinical analytics at scale. This is definitely important for next-generation healthcare, but it may play a role in the other two topics as well. Overall, what we are hoping to achieve is to develop machine learning models that are highly effective but also interpretable, such that humans can take them into consideration to make better decisions.

While today I'm going to tell you about our work on AutoML, in our lab we are working on a variety of other methods and models to further this agenda of machine learning in healthcare, including interpretability and explainability, dynamic forecasting from time-series data, longitudinal trajectories of disease, causal inference, especially as applied to estimating individualized treatment effects, inverse reinforcement learning, trustworthy machine learning, and so on. If you are interested in seeing more of our research, please look at our website.

I always like to start by saying why I believe that machine learning for healthcare is different from other areas of AI and machine learning. Of course we have seen a lot of advances in the last couple of years on a large variety of topics in machine learning, but healthcare is really different, because the problems we are trying to tackle are complex: they may need new ways of thinking, new ways to formalize these problems into concrete machine learning problems that can be solved. And how the solutions provided by these methods can be integrated into the human-machine ecosystem is yet another issue.
It's sufficient to think about COVID-19 and the role that machine learning may play in identifying patients who may deteriorate and may need to be hospitalized, identifying when these patients should be hospitalized, potentially what type of treatments they should receive, whether they should be admitted to the ICU, when they should be discharged from the ICU, and, within the ICU, whether they benefit from being on a ventilator, and so on. These are all complex decisions that need to be made over time by clinicians, and ideally machine learning should be able to support them in this complex decision making. But this is challenging, both because of the data we have at our disposal and in terms of really coming up with solutions that are truly interpretable and trustworthy. So the goal of developing methods for healthcare is to support and augment human decision making: not replacing clinicians and clinical personnel, but supporting them. In that process a lot of new challenges arise, and some of them I'm going to highlight today. But if we are able to do this over the next years, there is also an enormous potential, because machine learning can really deliver precision medicine at the patient level, help us understand the basis and trajectories of health and disease, and inform and improve clinical pathways, thereby leading to better utilization of resources, which is clearly relevant in the current pandemic. Finally, it can also transform population health and public health policy. I'm not going to talk about this today, but an important impact that machine learning can have is determining when somebody should be screened for a particular disease, how often, and potentially with what technology, thereby moving from a one-size-fits-all approach to screening and monitoring a population to a more individualized and targeted approach that may be more efficient and more cost effective.

In order to be able to do this at scale, automated machine learning may play a role, and unlike other areas of AutoML that have gained prominence in the past couple of years, the focus here is really quite different, as I'm going to highlight. Some key challenges come from the healthcare data, which is often biased by clinical practice, and we often cannot go into the field and acquire more data; no experimentation is possible. The data is also unstructured, there may be a lot of missing data, potentially noisy data, and predictions need to be made on the basis of multimodal and at times high-dimensional data. At other times, data may not be available due to privacy issues. These are all challenges that we will need to overcome. On the model side, as I'm going to show you in a little bit, there is often not one model that is best for a particular clinical problem. Depending on the type of data and the type of predictions you want to issue, different models may be best. Hence a key challenge, whether for somebody implementing these machine learning models in a clinical setting or for a clinical researcher trying to use machine learning models to uncover new factors that may drive disease, is which machine learning model to choose. We, the machine learning community, are very creative and come up with many models, and that represents a key challenge for people who want to use these models in selecting the best one for the task at hand.
We also need machine learning models that not only perform well but are also interpretable and explainable, such that they can provide the right insights to clinicians but also to patients. Last but not least, and I'm only going to briefly touch upon this, it's important to have trustworthy machine learning: we need to be confident in our predictions and know when machine learning has confidently issued a prediction and when not.

Machine learning solutions in healthcare have shown promise for numerous diseases, and they have the advantage that they are data driven and require relatively few assumptions. But as I mentioned before, a key question is which model to use, and if you look here you see that different models may be best for different types of data sets. Out of, let's say, 50 models that we have tried on these different data sets, different models may end up on top for different data sets. These are all data sets that have something to do with cardiovascular disease; I'm not going to go into the details. What's also interesting to see is that not only do different machine learning models end up being best, but their performance gain with respect to the best clinical risk scores or statistical models may also vary. And this is not known in advance; it is known only after we have tried these models. So there is no one-size-fits-all solution for all these different problems, no one method that will always be the best. The question then is: can we really predict in advance which machine learning model will be best, and can we potentially do better than any individual model itself? Could we integrate, through ensembles for instance, a variety of machine learning models to craft the best machine learning model? Finally, because the data sets we have at our disposal and the problems we deal with are often imbalanced, and maybe only very few patients suffer from the particular disease we want to detect, we may not only look at the area under the receiver operating characteristic curve; we may be interested in other metrics of performance, such as precision-recall, and possibly even quality of well-being.

And you would like to do this at scale. You would like to really determine which methods are best for a new data set. If you look at the clinical literature, and at the work that has happened in the last couple of years in the scientific literature more generally, you will often see a bunch of methods, maybe some deep learning models, XGBoost, maybe random forests as some of the candidates, and people running their favorite models and selecting one or two to report. What we are advocating here is a more holistic approach where you do this at scale. The question really is which machine learning model to choose, because we have many diseases, the data that may be fed in may be quite diverse, and we have various needs; and, as the case of COVID-19 shows, all of this may change rapidly over time, both because the disease may be changing and because the technology to address the disease may change. A brute-force search over machine learning models and design parameters would be prohibitively expensive, and it would also require expertise that some of these clinical settings may not have.
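To make concrete the kind of per-dataset benchmarking being described, here is a minimal sketch (not the speaker's code; the data and the model list are illustrative placeholders) of comparing a few standard classifiers on both AUROC and average precision with scikit-learn. Doing this by hand for every disease, data set and metric is exactly what does not scale.

```python
# Minimal sketch: benchmark a few candidate models on AUROC and average precision.
# Dataset and model list are illustrative placeholders, not the talk's experiments.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# imbalanced classes, as with rare clinical outcomes
X, y = make_classification(n_samples=2000, n_features=30,
                           weights=[0.9, 0.1], random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    auroc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    auprc = cross_val_score(model, X, y, cv=5, scoring="average_precision").mean()
    print(f"{name:22s}  AUROC={auroc:.3f}  AUPRC={auprc:.3f}")
```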
Instead, we would like to resort to an automated machine learning approach to determine which machine learning models should be used to craft these risk scores. And again, as I mentioned, AutoML here is really quite different from other areas of AutoML. The focus is not only on prediction, and not only on classification and regression problems, but also on time-to-event analysis, survival analysis. In many settings we are interested in competing risks and longitudinal data, as well as in estimating treatment effects. So the types of problems we are trying to address are really quite different; missing data is crucial; and the metrics of performance with respect to which we try to select a model may be diverse. We want to make sure the models are interpretable and explainable, and that may by itself be a reason to select a certain set of models, or we may post hoc decide to explain the predictions of the models. We also want to make sure the models are trustworthy and that we have uncertainty estimates associated with the predictions. What is key in this setting is that we want to generate reproducible results: in machine learning for healthcare, unlike in some other areas, it is really very important that our results are not only accurate but also trustworthy and reproducible.

Let me start with the simplest of problems, which is risk prediction. If you look at a clinical ecosystem currently, even in the most advanced countries, usually when a patient goes to their general practitioner there will be a number of risk scores that pop up when this patient shows up. What we would like to do is develop a holistic view of the patient, where, as the patient comes to see their GP, there may be many risk scores for many conditions that need to be evaluated for this patient. We would like to do that at scale, and the key challenges in doing so are that we have a huge design space and that different models may be best for issuing these different risk scores. Also, as I mentioned, we may not want to deal only with prediction; we may need to learn these models from data with missing values as well as high-dimensional data. So what we would like is to build entire pipelines that include data imputation, feature processing and dimensionality reduction, prediction, and calibration.

A first automated machine learning framework that we developed in our lab for crafting clinical scores at scale was AutoPrognosis; that's an older paper, from ICML 2018. What we tried to develop here, as I mentioned, were pipelines of imputation, feature processing, classification and calibration. A key challenge is that we have multiple algorithms with multiple hyperparameters, and, depending on the algorithms that precede or follow a particular stage, different models may be best. So it is a complex pipeline selection and hyperparameter optimization problem. And the challenge is that it is not only a combinatorial optimization problem but also a hard learning problem, because we do not know the utility of different configurations before trying them; we need to learn that as well.
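To give a rough feel for what a single point in this search space looks like, here is one candidate pipeline written with generic scikit-learn components (purely illustrative; the component choices and hyperparameters are placeholders, and this is not the actual AutoPrognosis code):

```python
# One candidate pipeline out of an AutoPrognosis-style search space:
# imputation -> feature processing -> classification -> calibration.
# Component choices and hyperparameters are illustrative placeholders.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

candidate_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),            # data imputation
    ("features", PCA(n_components=10)),                      # feature processing
    ("classify", CalibratedClassifierCV(                     # prediction +
        GradientBoostingClassifier(), method="isotonic",     # calibration
        cv=3)),
])
# AutoML then has to decide, for every stage, which algorithm to use and with
# which hyperparameters -- a combinatorial pipeline-selection problem.
```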
What comes to our rescue is formalizing this as a Bayesian optimization problem. This is often done in this type of setting, but as you are going to see in a little bit, there are some key challenges associated with it. So, for instance, we have multiple algorithms that can be used for a particular new data set, where we want to predict, say, cardiovascular disease or cancer risk for the patients in the data, and different models may lead to different performance, for instance in terms of precision-recall. What we do is place a Gaussian process prior, and then we try to balance exploration and exploitation by exploring models with maximum posterior mean and posterior variance. For instance, I have here a decision tree; I have evaluated its performance, and I have an expected performance for the other models; I explore the model that I expect to be most informative, and I continue in this way until I have discovered, with sufficient confidence, the accuracy of the different models for this particular new data set.

The trouble with doing this is that we have many dimensions to optimize over. We have many algorithms at every stage, those algorithms have different hyperparameters that need to be optimized, and we do this not only for one stage but for imputation, feature processing, classification and calibration. Bayesian optimization, as is well known, does not work well when we have many dimensions, usually more than about 10, and the challenge here is that we have hundreds, possibly thousands, of dimensions. One way to go about that is to use structured kernel learning, and this is what we have done in AutoPrognosis. The focus here was to determine which algorithms are correlated, and here I use the word correlation as a metaphor: by correlation I mean algorithms that have similar performance on a particular data set. Then I can take a divide-and-conquer approach, where I only jointly optimize the algorithms that are found to be related to each other for this particular data set. How do I do that? By making the following observation: not all algorithms have correlated performance. It may be, for instance, that XGBoost and random forests have similar performance on this data set, and I want to learn jointly across them, while a neural network does not. In this way I learn which algorithms I want to jointly optimize and which I do not, and I learn a structured kernel. To visualize this idea, consider this example: I have here different groups of algorithms with correlated performance, and all I need to do is perform Bayesian optimization only within the groups of algorithms that are correlated; I do not need to do it across all the different algorithms, thereby lowering the complexity associated with doing Bayesian optimization.
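To recall what the basic exploration-exploitation loop looks like before any kernel structuring, here is a toy sketch of Bayesian optimization over a discrete set of candidate configurations, with a GP surrogate and an upper-confidence-bound acquisition rule. The encoded configurations and the scoring function are made-up stand-ins for training and evaluating real pipelines.

```python
# Toy Bayesian-optimization loop over a discrete set of candidate configurations,
# using a GP surrogate and a UCB acquisition function. evaluate() is a placeholder
# for "train the pipeline and score it on the clinical data set".
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
configs = np.linspace(0, 1, 50).reshape(-1, 1)   # stand-in for encoded pipelines

def evaluate(x):                                  # hypothetical expensive scorer
    return float(np.sin(6 * x) * 0.5 + 0.5 + rng.normal(0, 0.02))

tried_idx = [0, 25]                               # a couple of initial evaluations
scores = [evaluate(configs[i, 0]) for i in tried_idx]

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(configs[tried_idx], scores)
    mean, std = gp.predict(configs, return_std=True)
    ucb = mean + 2.0 * std                        # favour high mean or high uncertainty
    ucb[tried_idx] = -np.inf                      # do not re-evaluate
    nxt = int(np.argmax(ucb))
    tried_idx.append(nxt)
    scores.append(evaluate(configs[nxt, 0]))

best = tried_idx[int(np.argmax(scores))]
print("best configuration index:", best, "score:", round(max(scores), 3))
```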
The challenge, though, is this: on the positive side, if I'm able to do that, I reduce the complexity of doing a full Bayesian optimization significantly, so this becomes a doable problem; but the structure of the kernel is not known, so I need to learn, for a new data set, which algorithms I should jointly optimize and which I should not. I do that through a hierarchical Bayesian approach, where I place a prior on the decompositions and then, as I gather more information about the performance of the different algorithms, compute the posterior over decompositions. In this way I learn which algorithms have similar performance, and hence would benefit from joint optimization, all the way to learning the kernel used to optimize these different functions.

Now that I'm able to do that, I can determine and predict the performance of a pipeline for a particular data set. But I could decide not only to issue a report of the performance of the different methods; I could go one step further and issue predictions given by ensembles, where I issue predictions jointly across multiple pipelines. The advantage of this is that I can also attach uncertainty to these predictions, and also, because some of these data sets may have a limited number of samples, I prevent information loss. How do we aggregate these different pipelines into an ensemble? We use a conventional Bayesian model averaging approach, where the ensemble is created using the posterior distribution of the performance, and the weight of a particular pipeline is simply the empirical probability of it being the best.
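That weighting rule can be sketched in a few lines: given posterior samples of each pipeline's performance (made up here for illustration), the weight of a pipeline is the fraction of draws in which it comes out on top.

```python
# Toy Bayesian model averaging: weight each pipeline by the empirical probability
# (under posterior samples of its performance) that it is the best one.
# The posterior samples below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
# posterior draws of AUROC for 3 candidate pipelines, shape (n_draws, n_pipelines)
draws = np.column_stack([
    rng.normal(0.74, 0.02, 5000),
    rng.normal(0.76, 0.03, 5000),
    rng.normal(0.72, 0.01, 5000),
])
best = draws.argmax(axis=1)
weights = np.bincount(best, minlength=draws.shape[1]) / len(best)
print("ensemble weights:", weights.round(3))

# The ensemble prediction is then a weighted average of the pipelines' predictions:
# p_ensemble(x) = sum_k weights[k] * p_k(x)
```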
We have used AutoPrognosis for a variety of clinical data sets so far, ranging from cardiovascular disease to cystic fibrosis, cardiovascular disease in the UK Biobank, and breast cancer, and most recently we have used it for COVID-19: together with NHS Digital here in the UK we have partnered to use this automated machine learning model to help hospitals plan resources better for COVID-19. If you'd like to learn more about that, you can see our website dedicated to our various COVID-19 projects.

The challenge, though, is that, as is the case for COVID-19, the disease may be changing over time, and the technology may be changing over time as well; for instance, the way in which patients are treated may change, as again is the case for COVID-19. So we need methods that enable us to take these automated machine learning models for risk scoring and dynamically adapt them to changing environments, whether because the disease is changing or the technology is changing. For that we have developed a method we call lifelong Bayesian optimization, which enables us to update the optimal model and to determine when a new model needs to be trained and new analytics need to be issued. We do that very efficiently by leveraging the past optimization to update the current optimal model. In this way we are able to learn, as the days and years pass, when we need to update the model, and when we can no longer rely on the predictions made by past clinical analytics trained with past machine learning models and a new model needs to be issued as the state of the art.

We are going, though, beyond classification into the other areas I mentioned before. Prediction alone is often not enough; we would like to do survival analysis, look at competing risks, treatment effects, or longitudinal temporal models, and the problems get harder as we start to look at these. If we look at time-to-event analysis, survival analysis, a key challenge is that different models may be best at different time horizons. If you look at a particular data set, you will see here that, in terms of the time-dependent concordance index, different models are best at different time horizons. So it is no longer a question of selecting the single best model, because multiple models are best: we would like to select different models at different time horizons. But we cannot do so in a simple way, because we need to be able to construct a valid survival function. Unlike in the predictive case, where I can simply form an ensemble that weighs the different models, here I need to issue a valid survival function, and I need to do so across all the different time horizons while also making sure it is well calibrated in the population.

For that we have developed a model we call Survival Quilts, which is again an AutoML model, but this time with one additional challenge: because the ensembles select and weigh different models differently at different time horizons, the problem becomes more challenging, and it also becomes a constrained optimization problem, because we need good calibration. The problem is now to maximize a utility function, in this case the time-dependent concordance index, subject to a constraint on the Brier score, which measures calibration. We turn this into a constrained Bayesian optimization, and what we are doing here is building what we call quilts; you can see why we call them quilts if you look at this figure: the weights associated with the different models change at the different time horizons, so we are building quilting patterns. In this temporal quilting process we construct a valid survival function with different weights at different time horizons, and we carry these models forward for future time predictions, such that we ensure a consistent risk function.
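As a toy illustration of the stitching idea (one simple way to keep the combined curve a valid, non-increasing survival function; not the exact construction used in Survival Quilts), one can combine the models' conditional survival over each time segment with segment-specific weights and carry the product forward:

```python
# Toy "quilting": stitch several models' survival curves together with weights
# that change across time segments, while keeping the combined curve a valid
# (non-increasing) survival function. Illustrative only; not the construction
# used in the Survival Quilts paper.
import numpy as np

times = np.array([0., 6., 12., 24., 36., 60.])          # months (toy grid)
# survival curves S_k(t) from 3 hypothetical models, rows = models
S = np.array([
    [1.0, 0.95, 0.90, 0.80, 0.72, 0.60],
    [1.0, 0.93, 0.88, 0.82, 0.75, 0.65],
    [1.0, 0.96, 0.91, 0.78, 0.70, 0.58],
])
# per-segment weights over models (the "quilting pattern"), one row per segment
W = np.array([
    [0.6, 0.3, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

quilt = [1.0]
for j in range(len(times) - 1):
    ratios = S[:, j + 1] / S[:, j]                 # each model's conditional survival
    quilt.append(quilt[-1] * float(W[j] @ ratios)) # carried forward, stays non-increasing
print(np.round(quilt, 3))
```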
As I mentioned before, we turn this into a constrained Bayesian optimization, and what is interesting, yet challenging, is that the quilting pattern is made robust by introducing endogenous time-horizon splitting. The quilting pattern is not changed at fixed, predetermined time horizons; rather, we learn from the data when we need to change the quilting pattern. So this is not done in a fixed manner but in an endogenous manner, learned on the basis of the data.

Let me focus just briefly on the constrained Bayesian optimization, which is the second step of the temporal quilting procedure. What I want to highlight is that, unlike in the previous case of AutoPrognosis, where we just had prediction, now, because we have a constrained Bayesian optimization, we need to put GP priors on both the utility function and the constraint, and we turn this constrained optimization problem into an unconstrained one that we solve using an augmented Lagrangian Bayesian optimization with GP priors. When we solve this problem, we need to make sure that we select solutions that fulfil the calibration constraint: you see here, for instance, that the first two solutions do not fulfil the constraint and so are not selected, while the last one does. So this is a more challenging problem than Bayesian optimization for conventional prediction.

How well does it work? As you see here, the performance of Survival Quilts is quite significantly better across a variety of data sets, while fulfilling the calibration constraints. What is also interesting to see is that it pays off to build survival quilts that endogenously determine when to switch patterns. If you look at survival quilts where the timing is exogenous, with K equal to one, two or three (there is a small mistake on this slide), so where the switching times are predetermined, you see that they perform less well than in the endogenous setting, where the time to switch the quilt is determined on the basis of the data and learned by the algorithm.

Let us go one step further and look at causal models, moving beyond predictions to treatment recommendations. We don't only want to predict what will happen to the patient; we also want to know whether it is time to intervene. It is well known that treatment effects are often heterogeneous, so we would like to determine, for a particular patient such as Bob, which treatment would be best for him, and that may be different from what is best for Mary or for the general population. What we are interested in here is solving a causal inference problem, and the challenge is that, unlike in the previous settings, we do not observe counterfactuals: if a patient in the data set was treated, we observe the associated outcome, but we do not observe the counterfactual. So we need to be able to select models without the ability to do cross-validation; that is one challenge we need to overcome. Another challenge is that patients are not randomly treated or untreated, as in a clinical trial. We are usually using observational data, which is larger in scale and so enables us to learn heterogeneous effects, but the disadvantage is that we need to learn in the presence of selection bias and without observing the counterfactuals.
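To make the setup concrete, here is the simplest possible baseline on synthetic data, a two-model approach sometimes called a T-learner. This is not one of the methods from the talk; it is shown only to illustrate what an individualized treatment-effect estimate is and why evaluation is hard when counterfactuals are never observed.

```python
# Simplest baseline ("T-learner"): fit one outcome model on treated patients and
# one on untreated patients, and take the difference of their predictions as the
# individualized effect estimate. The methods in the talk are far more
# sophisticated; this only illustrates the counterfactual/selection-bias problem.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 5))
true_effect = 1.0 + X[:, 0]                      # heterogeneous treatment effect
propensity = 1 / (1 + np.exp(-X[:, 1]))          # treatment depends on covariates
T = rng.binomial(1, propensity)                  # -> selection bias
Y = X[:, 0] + 0.5 * X[:, 2] + T * true_effect + rng.normal(0, 0.5, n)

m1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])   # treated outcome model
m0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])   # control outcome model
cate_hat = m1.predict(X) - m0.predict(X)

# Only because the data is synthetic do we know the true effect; with real data
# it is never observed, which is exactly why model selection is hard here.
pehe = np.sqrt(np.mean((cate_hat - true_effect) ** 2))
print("PEHE on synthetic data:", round(pehe, 3))
```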
This has been a very active area of research over the last couple of years, and a variety of methods have been developed in the machine learning community to solve this causal inference problem. What you are seeing here are some of the papers from the last couple of years at ICML, NeurIPS and ICLR; the ones with an arrow are models from our own lab. A challenge that occurs, again as in the previous settings, is that we have many models: which one is the best model for a new data set? We would like to select, for each observational study, which causal inference model to use, and the reason this is challenging, even more challenging than in the previous two problems, is that we cannot do cross-validation, because, as I mentioned before, we do not have counterfactuals.

We addressed this problem in an ICML paper last year. The way in which we determine which model is best is by looking at the mean squared error of a causal model, which is often called the precision in estimating heterogeneous effects. We would like to select the model that achieves the best such performance and then use that model to issue predictions. The challenge, as I mentioned, is that we learn from data where we do not observe counterfactuals: if, for instance, Mary and Bob have been treated and untreated respectively, we do not know what would have happened to them otherwise. So how can we validate causal models in this setting, without being able to observe the causal effect?

We model the model accuracy as a statistical functional. What I mean by that: a functional is a function of a function, and a statistical functional is a function of a distribution. What we are trying to minimize here is the mean squared error of the causal model, and the model accuracy is the statistical functional we are trying to estimate. What we want to estimate, in order to determine the best causal model, is the empirical measure of the precision in estimating heterogeneous effects, or PEHE for short. But, as I mentioned, we do not observe both factuals and counterfactuals, so this is an empirical measure we cannot compute. How do we get around this? We take inspiration from a simple idea we all know from high school, the Taylor series approximation: the value of a function at a given input x1 can be predicted using its value and higher-order derivatives at a proximal point x0. In this way we can determine the value of a function at x1, which we do not observe. In analogy with the Taylor series approximation, in this work we treat the performance of a causal inference model as a functional of the data-generating distribution; as I mentioned, this functional is a function of a function, and we are interested in its value at a new point where we do not observe the counterfactuals. We do have the observed data distribution, theta one, but what we use to determine the error at theta one is a synthetic distribution with known counterfactuals, theta zero. Just as we use derivatives in a Taylor expansion, we use the von Mises expansion from functional calculus, with influence functions playing the role of derivatives, and in this way we can predict the performance of a causal inference model using the influence functions of its loss, the mean squared error, on a synthetic data set for which we are able to compute counterfactuals.
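In symbols, the analogy being drawn is roughly the following schematic (my notation, not the exact statement from the paper): a Taylor expansion predicts a function's value at a nearby point from derivatives, and a first-order von Mises expansion does the same for a statistical functional of a distribution, with the influence function playing the role of the derivative.

```latex
% Schematic analogy only; notation is illustrative, not the paper's exact statement.
% Taylor expansion of an ordinary function around x_0:
f(x_1) \;\approx\; f(x_0) + f'(x_0)\,(x_1 - x_0) + \tfrac{1}{2}\,f''(x_0)\,(x_1 - x_0)^2 + \dots
% Von Mises expansion of a statistical functional \psi around a distribution \theta_0:
\psi(\theta_1) \;\approx\; \psi(\theta_0) + \int \dot{\psi}_{\theta_0}(u)\, \mathrm{d}(\theta_1 - \theta_0)(u) + \dots
% Here \psi(\cdot) is the PEHE-type loss of the candidate causal model,
% \theta_0 is a synthetic distribution with known counterfactuals,
% \theta_1 is the observed data distribution, and \dot{\psi}_{\theta_0} is the
% influence function, the analogue of the derivative.
```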
So how do we estimate these models' performance in practice? By synthesizing counterfactuals. In a first step we compute plug-in estimates, and then we do a bias correction. In this way we estimate a causal model's performance, and we are able to estimate the otherwise inaccessible empirical measure: the PEHE, which we could not estimate before because we did not have counterfactuals, can now be approximated as a first-order expansion, using von Mises calculus and influence functions. I'm skipping a lot of the details, but please take a look at our ICML 2019 paper if you want to see how this works.

Now, equipped with this, we are able to do AutoML for causal inference, and we are able to get around the fact that we cannot do cross-validation. What you see here is a study we performed: on the 77 benchmark data sets used for the Atlantic Causal Inference competition, no single model is best for all the data sets, yet our influence-function-based automated machine learning framework is able to select the best model among the variety of models we used 72 percent of the time, and that is better than using any single model by itself. The single model that performed best across all the different data sets is a model introduced by ourselves in a previous ICML paper, but that model is the best only 17 percent of the time. So with AutoML we are able to select the best model for a new data set. Again, if you are interested in looking more at our work on causal inference, please take a look.

Now let me say that we would also like to go beyond a setting with static covariates into a time-series setting, where we want to issue time-series forecasts. The challenge here is that different models may be best at different time horizons, and as we have temporal distributional shifts and risk factors change, different models may be best. What we are trying to determine in this time-series setting, where we have a variety of, for instance, recurrent neural networks, is which models are best to use given what we have seen so far. For that we have developed a new AutoML approach that we call stepwise model selection for sequence prediction. Let me tell you briefly about it. The challenge is that if I want to do time-series prediction, I may have a variety of recurrent neural networks, and different architectures may be best at different moments in time: some may require GRUs, some LSTMs, some may require attention, some may not, and some may have different types of memory. There is a vast space of models to select from, and the question is how to select these different models and how to combine them to issue effective predictions over time. One solution would be to treat the performance at each time step as its own black-box optimization, that is, treat the different time steps independently, and formalize a multi-objective Bayesian optimization where the model performance at each time step is its own objective, as I showed you before. The trouble is that, if we do this, I need to find one model for all the different objectives, for all the different time horizons, and that is one challenge.
Also, this is expensive, because I need to compute the hypervolume gain with respect to all the objectives at all the time frames. Multitask Bayesian optimization may seem a good approach here, and it has the advantage of a warm start, because at every new step, as we progress in the time-series setting, we can use information obtained in the previous Bayesian optimization step. So it may sound like a good approach to model this multi-objective optimization as a multitask Bayesian optimization where, as we accumulate more data, a new task is defined. But it has the disadvantage that it requires evaluating deep learning models on large data sets over time, and it requires solving these separate Bayesian optimization problems, which is complicated to do at scale; it also does not take full advantage of the information from all the acquisition functions of the different Bayesian optimization runs.

So in our work, which we call stepwise model selection, we solve this by formalizing these multiple black-box optimization problems jointly and efficiently, not as a multitask approach but as a joint approach, by learning and exploiting the correlations among the different black-box functions, and we do that using deep kernel learning. The idea behind doing this jointly is that we can create feature maps to measure similarities between the different data tuples and how these data tuples change over time as the sequence evolves. The steps involved are as follows: first, through a recurrent neural network, we create an embedding matrix; second, we obtain a permutation-invariant embedding, because we want an embedding that is independent of how the individual rows of h_t are ordered; and as the final step, with a multi-layer perceptron, we produce the final feature map that takes this longitudinal data and determines the features associated with it.

How does this perform? It performs very similarly to, and at times better than, sophisticated RNNs crafted by hand by experts. The advantage is that we do not do anything manually; we do this in an automated way, and in this way we select different models over time and create a dynamic predictive model that capitalizes on a variety of automatically selected machine learning models trained for the time-series setting.
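A rough sketch of the three-step feature map just described (an illustration in PyTorch, not the paper's code; mean pooling is used here as one simple permutation-invariant operation). The resulting feature vector is what would feed the kernel in a deep-kernel-learning GP surrogate.

```python
# Rough sketch of a feature map for deep kernel learning over sequences:
# (1) an RNN embeds the observed sequence, (2) the embeddings are pooled in a
# permutation-invariant way across rows, and (3) an MLP produces the final
# feature vector that feeds a GP kernel. Illustrative only.
import torch
import torch.nn as nn

class SequenceFeatureMap(nn.Module):
    def __init__(self, n_inputs, hidden=32, out_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_inputs, hidden, batch_first=True)   # step 1: embed
        self.mlp = nn.Sequential(                                # step 3: feature map
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, x):            # x: (batch, time, n_inputs)
        h, _ = self.rnn(x)           # (batch, time, hidden)
        pooled = h.mean(dim=1)       # step 2: mean pooling -> permutation invariant
        return self.mlp(pooled)      # (batch, out_dim), input to a GP kernel

phi = SequenceFeatureMap(n_inputs=5)
features = phi(torch.randn(8, 20, 5))   # 8 toy sequences of length 20
print(features.shape)                   # torch.Size([8, 16])
```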
Now we would like to put all of this together, not only to do prediction at scale but also treatment effects over time, and also to determine when a patient should be screened next. In a package that we call Clairvoyance we put all of this together; the associated software is online, and you can play with it and take a look. With this we are able to do longitudinal predictions as well as treatment effects as well as personalized monitoring. In the Clairvoyance package we can select among a variety of AutoML engines, whether it is the deep kernel learning I described to you before or the structured kernel learning of AutoPrognosis. Yet this still represents an important frontier: how we should do automated machine learning at scale in a setting like this is still an important research question that is not completely solved.

Finally, I'd like to say that we don't want to just issue predictions; we want to turn them into actionable intelligence. For that we need risk understanding and transparency, but we also need to know what we know and what we do not know. There are many kinds of interpretability, and no single one covers all of them. In some settings we may want interpretability in the sense of determining which features are globally important for the entire population; in other cases, which features are locally important for this patient; in yet other settings we may want to look at how features interact, or whether the model is linear or non-linear and what the impact of the non-linearity is. The desiderata in interpreting these automated machine learning models are that we would like to do it in a general way, being able to interpret any black-box model, including models crafted by AutoML, and we would like to do it in a post-hoc fashion: we should not interfere with the model training, which may introduce bias and compromise accuracy.

Just to give you an example: I may have Mary here, and AutoPrognosis may issue a prediction that she is at high risk of dying from COVID-19. The question will be, why did AutoPrognosis issue this prediction? We have developed a technology that we call INVASE, which is able to identify the features that, for Mary specifically, have led to this prediction. This is individualized: we are trying to determine the particular features that led this black-box model to this prediction, and for Mary they may be different from those for the general population and different from those for another patient. How can we learn which features are most important for a black-box model? This is one form of interpretability: trying to identify which features led to this particular prediction by this black-box model. What is interesting about INVASE is that we use a reinforcement learning approach to determine which features are most important; in fact we use a simple reinforcement learning framework based on actor-critic, where the actor selects, through its action, which features may be important, and the critic determines the reward, which is the loss in performance if the predictive model did not have access to these features. So in INVASE we build, as I mentioned, an actor-critic framework where the predictor network receives the selected features, and we learn over time which subset of features to select that is critically important for the predictor, which plays the role of the black-box model, which may be a neural network or AutoPrognosis, and we learn how this affects the performance of the predictive model.

In view of time I'm not going to go into the details. I'll just say that this is one of many ways of doing interpretability. INVASE has some important advantages over LIME or SHAP, because it is individualized, and it also has an important advantage over other personalized and individualized interpretation models such as learning to explain, because it goes one step further: INVASE is able to discover the number of relevant features for each instance, so it may be that for Mary three features have led to this prediction, but for Bob only two.
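A much-simplified sketch of this idea (not the actual INVASE architecture, which among other things uses a separate baseline network as the critic): a selector network proposes a per-instance feature mask, a predictor is trained on the masked input, and the selector is updated with a REINFORCE-style gradient whose reward combines predictive loss and a sparsity penalty.

```python
# Very small INVASE-style sketch (policy-gradient feature selection), to show the
# actor/critic idea: a selector proposes a per-instance feature mask, and its
# reward is how well a predictor does with only those features, minus a sparsity
# penalty. This is a simplification of the actual method.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 512, 10
X = torch.randn(n, d)
y = (X[:, 0] + X[:, 1] > 0).float()          # only features 0 and 1 matter here

selector = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))
predictor = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(selector.parameters()) + list(predictor.parameters()), lr=1e-2)
bce = nn.BCEWithLogitsLoss(reduction="none")
lam = 0.05                                    # sparsity penalty

for step in range(500):
    probs = torch.sigmoid(selector(X))        # per-instance selection probabilities
    mask = torch.bernoulli(probs.detach())    # sampled feature subset (the "action")
    pred = predictor(X * mask).squeeze(-1)    # predictor sees only selected features
    loss_per_example = bce(pred, y)

    # reward: low predictive loss and few selected features
    reward = -loss_per_example.detach() - lam * mask.sum(dim=1)
    log_prob = (mask * torch.log(probs + 1e-8)
                + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(dim=1)
    selector_loss = -(reward * log_prob).mean()       # REINFORCE-style update
    predictor_loss = loss_per_example.mean()

    opt.zero_grad()
    (selector_loss + predictor_loss).backward()
    opt.step()

with torch.no_grad():                         # ideally higher for features 0 and 1
    print(torch.sigmoid(selector(X)).mean(dim=0))
```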
That being said, are we really done? The answer is no, because we may not want to do only that. In many settings in machine learning for healthcare it is not enough to say that these features led to this prediction; we really want transparent risk equations that describe the prediction of the model and can be interrogated. So we need to go one step further. Different users of such a model, whether they are clinicians, medical researchers or policymakers, may want different forms of interpretability, so the question is: can we have it all? And the answer is yes. In a recent NeurIPS paper, just last year, we proposed a new way of demystifying any black-box model with the help of symbolic metamodels, which take a black-box model and represent it as a transparent function describing the prediction of this black-box model. The advantage is that the metamodel needs only query access to the trained black-box model: the black-box model can remain a black box, we do not need to know what is inside it, it can be patented, and still we can identify what the model has really learned.

In view of time I'm going to skip a lot of the details of how to do that, but let me just say that we train the symbolic metamodels in a way analogous to the way we train neural networks: we learn which functions are representative, and we use for that Meijer G-functions, which represent a large class of interpretable functions, and we learn them using gradient descent, in a fast way, similar to the way we learn neural networks. In this way we are able to take something like AutoPrognosis and not only issue predictions but also interpretations; more than that, we can throw away the AutoPrognosis black box and use just the white-box model that interprets it to issue these predictions. Now that we have an interpretable equation, we can go one step further than models like INVASE or LIME or learning to explain or SHAP, because we are now able to see exactly what the relative importance of a feature is given other features, and how this relative importance changes, and we can do that at the individualized level.

This is more or less the last technical slide. What I'm trying to say is that with these metamodels we are moving beyond interpretability, by providing interpretations that may be different for a clinician, who may want to interrogate the AutoML model to determine which treatment is best for the patient at hand and why it has been recommended, and for a patient, to whom we may be able to say, this is the reason this prediction was made, which may inform lifestyle changes or informed consent. Let me just say that if you are interested in this topic of interpretability, which is complementary to today's talk, I gave a Turing talk on it some time ago, in March; so if you are interested in our work on interpretability, which I think goes hand in hand with the issue of crafting machine learning models, please take a look and please reach out to us.
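As a toy illustration of the metamodelling idea (this is not the Meijer G-function machinery from the paper, just the general recipe of fitting a small transparent surrogate by gradient descent using only query access; the black-box function here is made up):

```python
# Toy metamodelling sketch: with only query access to a trained black-box model,
# fit a small transparent surrogate g(x; theta) by gradient descent on the
# discrepancy between the two, then read the surrogate as an equation.
# Not the Meijer G-function construction from the paper.
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):                        # stand-in for e.g. an AutoPrognosis model
    return 1 / (1 + np.exp(-(1.5 * x[:, 0] - 0.8 * x[:, 1] + 0.5 * x[:, 0] * x[:, 1])))

X = rng.uniform(-2, 2, size=(2000, 2))   # query points
y = black_box(X)                         # only queries, no access to internals

# transparent surrogate: sigmoid(a*x1 + b*x2 + c*x1*x2 + d)
theta = np.zeros(4)
def surrogate(X, th):
    z = th[0] * X[:, 0] + th[1] * X[:, 1] + th[2] * X[:, 0] * X[:, 1] + th[3]
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(3000):                    # plain gradient descent on squared error
    p = surrogate(X, theta)
    dz = (p - y) * p * (1 - p)           # chain rule through the sigmoid
    grad = np.array([np.mean(dz * X[:, 0]), np.mean(dz * X[:, 1]),
                     np.mean(dz * X[:, 0] * X[:, 1]), np.mean(dz)])
    theta -= lr * grad

# should end up near the coefficients hidden inside the black box
print("recovered equation coefficients:", theta.round(2))
```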
Finally, I'd like to say that we don't want to only issue predictions; we also want to issue trustworthy predictions. While models such as Monte Carlo dropout or ensembles provide some form of uncertainty estimate, the challenge with using them in the healthcare setting is that they are not post hoc, so they may affect the performance of the model, and they do not provide coverage guarantees, which are at times needed for the adoption of these models in practice. I'm not going to tell you much about this, but our lab is doing a lot of work on developing trustworthy machine learning and on providing any machine learning model with uncertainty estimates and coverage guarantees that can be obtained post hoc. If you are interested in this topic, please take a look at the two papers mentioned here, one for the static setting and one for the time-series setting.

With that being said, I'd like to invite all the students following this summer school to join what I believe to be a revolution in medicine using machine learning, which can really develop tools that are useful for clinicians and for medical research, which can include a variety of omics data in addition to the clinical data I highlighted here, and which finally may lead to new and better drugs. I'd like to thank the brilliant postdocs and PhD students in my lab whose work I highlighted here, and to say that if you are interested in learning more about it, please take a look at our website. If you are a student interested in this area: together with my students, we are trying to build a community of young researchers interested in these topics, especially now, in the time of COVID, when we cannot meet face to face at conferences; we hope to reach out, build a community of interested young researchers, connect, and develop tools together to really further this research agenda. So thank you.

Thank you, Mihaela, for this great overview of the current frontiers in machine learning and medicine. Are there any questions? Questions from inside the network? There is a question, please go ahead.

Yes, thank you for the talk. I have a practical question: could you explain, or explain again if I missed that part, to what extent the program AutoPrognosis is flexible or customizable? For example, can we set a fixed feature-processing step or validation measure and test different imputation and classification methods? So, is it possible to have the benefits of this general pipeline while at the same time fitting the specificities of our problem, for instance via parameters in the software or something like that?

Thank you so much. One part is a methodology question to some extent and one is a practical question, and both of those are goals of AutoPrognosis. What we are hoping is to add a variety of models, some of them existing but also some new ones. For instance, if you come up with a new machine learning model for data imputation or for feature processing, it should be relatively easy to add it within the AutoPrognosis ecosystem and assess its value against the other methods. So we believe this is both a solution for building more and more powerful tools and a solution for researchers like you who may come up with a new idea and want to identify when this idea pays off, for which types of data sets, and in conjunction with which other methods at the other stages of the pipeline.

Okay, thank you. Now I'll have a look at the Slido board of questions. The most popular question is the following: what is your stance on imputation of missing patient data in risk prediction and survival analysis?

Thank you for this question. I gave a keynote talk in a workshop at ICML 2020 on missing data imputation; we have done a lot of work on missing data imputation, including data that may not be missing at random, because what is interesting in healthcare settings is that the fact that certain data was not collected can be informative by itself. If you are interested in that, please take a look at the keynote talk I gave in the missing data workshop at ICML 2020, or email me and I'll send you a link.

Thank you very much.

Sorry, given the time I think that's the easiest way to answer it.
But what is interesting is the fact that in machine learning for healthcare, missing data may be informative.

I think that's exactly one of the things I wanted to ask about, but you took the answer away, so good. From inside the network there is another question, by Giovanni.

Thanks for the talk first of all, it was amazing. A question that I have is that many post-hoc explanation methods like LIME and SHAP have turned out to be quite susceptible to adversarial attacks. Is that a consideration you have had to make for INVASE or the metamodel you just presented, or are they somewhat more robust?

This is a very good point. Again, unfortunately I didn't talk too much about interpretability and explainability, so please take a look at the Turing talk. But the idea there is that, especially for the symbolic metamodels, because we now have an equation that explains the black-box model, we are able to interrogate it and see whether it issues consistent predictions, and predictions that make sense to somebody who has knowledge in the field. So the issue here becomes how robust this metamodel is, and how consistent it is with either a causal model or the a priori knowledge of the clinicians; that may be one way to think about it. That being said, it is an interesting research agenda for all of us to go one step further. I believe this is one way forward, because it is easier to interpret and I now have an equation, but will this completely solve the problem? Probably we need to think further.

Yeah, thank you. There is another popular question on Slido, by Marcos: can you make statistical inferences in your ensemble methods? Can you select optimal variables in all models individually?

Whether I can select optimal models or optimal variables: variables, I guess, so I take it there is an issue of variable selection. One thing that I didn't touch upon here, and I thank you very much for this question, is the value of information. In healthcare, unlike in other settings, gathering data for a patient is costly: doing additional tests for this patient in order to issue a prediction, a diagnosis or a trajectory for them is costly. So an interrelated question is what information is useful to acquire for this patient, and when, given what other information I may acquire, which may come at a lower cost. That is part of an AutoML framework such as Clairvoyance, where we not only issue predictions, longitudinal predictions, but are also starting to look at what information is valuable to inform these predictions over time. So the answer is yes; I think this is where AutoML plays yet another important role, because it enables us to really identify the value of information.

Thank you for this answer as well. Is there another question from the network? Then I will close the question and answer session, and thank you very much, Mihaela, for giving this talk at our summer school. It was a great pleasure to have you here, and I think we all learned a lot.

Thank you, thank you very much. And again, I want to reiterate that if you are interested in joining my lab and my students in these areas and discussing with us, please sign up for the Inspiration Exchange on our website. Thank you very much, Carson, for the invitation.

You're very welcome; we're happy that you could make it.