Welcome back to the second presentation of the morning. It's my great pleasure to introduce Felix Agarkov. Felix is one of the leading experts when it comes to translating machine learning into healthcare. He holds a PhD in machine learning, he is a successful entrepreneur who has co-founded two SMEs, and he is the director and owner of Fematics, which is one of the companies represented in our network and was also represented in the predecessor network MLPM, which ran from 2013 to 2016. Translating machine learning into healthcare, into medicine, really into the clinic, is an extremely difficult and effortful process, and Felix knows a great deal about this. He is really one of the experts who are working on this and turning it into a success, who really make it happen. This always impresses me when listening to Felix. We often come from basic research and think that machine learning solutions will have a deep clinical impact, and sometimes we underestimate how long the path is from the computer to the bedside. Felix knows how long that path is and masters many steps along it with his successful company Fematics. So I'm very much looking forward to his presentation and to gaining new insights, which I do every time I listen to Felix. Thank you very much for being here, and we are looking forward to your talk, Felix.

Thank you very much, Karsten, for the amazing introduction. Machine learning and modern healthcare. When I think about a quotation that describes the process of how we innovate in healthcare, I always think about this timeless one: "He who innovates will have for his enemies all those who are well off under the existing order of things, and only lukewarm supporters in those who might be better off under the new. The lukewarm temper arises partly from the fear of adversaries who have the laws on their side, and partly from the incredulity of mankind, who will never admit the merit of anything new until they have seen it proved by the event." It's a very apt quotation, and I think it really explains the process we have to go through when we try to innovate in healthcare.

So let me highlight some challenges, and I'll start with the history. Please bear with me: the start of this talk is not so much about algorithms but mostly about processes. As an example, let's look at telehealth, the remote monitoring of patients. The first mention of this appeared in 1879, about the use of the telephone to reduce unnecessary doctor visits; most of the space in the article was actually about the fear of the telephone and how people were afraid of making phone calls. In the 1920s, radio was used to provide clinical support on ships. In 1925, there was a drawing in one of the magazines showing remote communication between a doctor, a nurse and a patient, basically imagining some kind of video-radio link to diagnose the condition; of course, it was completely made up. Later, radiological images were sent by telephone across Pennsylvania to Philadelphia, so it took roughly 70 years from the time remote monitoring was first mentioned to the time images started to be sent. In 1959-60, there was a telecommunication link between the Nebraska Psychiatric Institute and Norfolk State Hospital. In 1961, the USSR and the USA used remote monitoring to test animals in space, and then in the 1970s telehealth was actually used to provide healthcare in rural areas in Arizona.
Later we can see how long it actually took for telehealth to get established, but in 1985-88 telehealth was used to provide care in Mexico after the earthquake, so it needed a push for something to start happening. In 1993 the American Telemedicine Association was formed, and then around 2013 there was increased interest in telehealth. There was a major trial in the UK, and in fact some other major trials, showing that telehealth was not cost-effective compared with usual care. In fact, in December 2018, I think, there were guidelines issued in the UK saying that for some conditions telemonitoring was not more cost-effective than usual care: there was no convincing evidence for the effectiveness of telehealth monitoring to reduce admissions, in this case for chronic obstructive pulmonary disease. Then in 2020, in the new post-COVID world, there was a 78-fold (not 78 percent) growth in telehealth in just two months, from February to April 2020, and this level has now stabilized at about a 37-fold growth compared with pre-pandemic levels. The growth was largely driven by patient and provider uptake and also by developing regulatory changes. There are recent estimates saying that 26% of patients are in fact interested in using telehealth, 57% of providers now look at it more favorably than before COVID, and there are new reimbursement codes, so health systems, payers and insurance companies are more willing to reimburse telehealth. But it took quite a long time, probably over 140 years, for this technology to become established.

Now, if we look at the challenges faced by this telehealth intervention, we can perhaps learn a little about what can be done to help us develop AI and machine learning solutions, AI interventions, and ship those interventions, bring them to the providers, to the front line. And we should say that telehealth interventions which are really just about shifting to remote consultations, by telephone for example, are relatively low-tech compared with artificial intelligence and machine learning, because care is still provided by clinicians and is only delivered virtually. Yet it was a long path to adoption, and there are still some major provider concerns, even now, and some patient concerns. Providers, now in 2021, ask questions about security, about workflow integration, effectiveness, comparison with standard care, and reimbursement, and patients may have some of the same questions: questions about effectiveness, accessibility, how easy it is to use telehealth, and whether it's covered by insurance. So I would say that artificial intelligence and machine learning interventions will face many of the same problems, plus many additional barriers, because they are more advanced interventions in many ways.

So let's look at the path to adoption. It's a major challenge. When we are doing research and development, even if we manage to develop a good, well-performing, robust predictive model, it's far from clear how we can convert the predictions of the model into safe, efficacious, cost-effective, timely, practical, scalable interventions. We can make predictions, but is it going to be safe, is it going to be timely and effective to actually act on that prediction, and what could be the best way of doing this? There are regulatory challenges: the development, before we adopt something in healthcare, needs to be done to certain standards.
Also, our products — the algorithms, and the devices and software which are built on top of those algorithms — need to be certified, and this is basically to demonstrate safety. Even if the algorithms are robust, it's important to show that they are safe not only in the lab but in the real world. What's going to happen if data are not available, what's going to happen if some variables are missing, what's going to happen if there are outliers? The algorithms should be robust to all of these. Once we can demonstrate safety, it's still quite important to demonstrate efficacy and cost-effectiveness, so we will need to perform clinical evaluations, quite often in prospective trials. And it might happen that even if we run a trial and the trial is successful, we still need to be able to demonstrate added value over standard care. We may have a useful algorithm, but what is the added value on top of what's already implemented? Is it really worthwhile for health systems, providers or payers to invest in new products? And there are new developments and new requirements appearing in the marketplace around data governance, privacy, interoperability and good practice. We may have gone through all the clinical trials and safety assessments, but if we haven't developed a product — an algorithm and software — which respects local or national laws and adheres to good practice, then it's of no use. And the final point is reimbursement. Even if we have done everything well, it may happen that no one really wants to pay for it and no one wants to use it. The reason for this might be that we created a very good product for clinicians, but that product might, for example, need some data from the patient, and the interests and needs of patients were not considered. So patients don't use our app or our system and don't supply the data which our AI models need, and in that case the product becomes of no use to clinicians either, because we haven't really taken the needs of all the stakeholders into account.

This is just a snapshot of the path to adoption and digital technology assessment which we have to go through, after clinical trials for example, to get adopted in the NHS. There are additional questions about evidence and outcomes — this usually comes from clinical trials — and about clinical safety, data protection, security, usability and accessibility. Have you thought about what is going to happen if patients are unable to input data or provide some readings, or clinicians are unable to do this? Or, if we are building a patient-facing product, what happens for people with disabilities, and so forth? In terms of interoperability, we need to integrate into workflows, and for technical stability there are questions about what's going to happen to our models under data shifts, and so forth. And only after we've been through this process can we access marketplaces, and the NHS might endorse a model in the UK.

It's important to understand that there are multiple stakeholders. Some of them are providers, the people or institutions which provide healthcare services. The needs of those providers and of the other stakeholders will be quite different in different locations, and it's very important to actually try to understand those needs. And some of you might be asking yourselves why we are speaking about this.
Why are we not really speaking about machine learning at this stage? Providers and vendors and payers — what is all this about, why do we need this? Well, we need to understand the needs of multiple stakeholders so that we can solve the right kinds of questions, address the right kinds of problems, and define the objective functions of our models in a more sensible manner, one which takes the interests and real needs of multiple stakeholders into account, if we pay a little bit of attention to this.

So, providers: many of them want to spend less time and money to provide high-quality healthcare. We know many providers are busy, especially now post-COVID, with big backlogs. If we can develop something which comes to the right conclusions quickly and perhaps at lower cost, that's going to be good; providers will probably like it. Providers may also want to increase their market share by reducing false negatives. Of course, it depends on the reimbursement system: for example, if providers need to compete for patients, as in the United States, false negatives — not providing care to people who need this care — are going to be bad, and you may potentially lose those patients. Vendors — developers — want to be able to develop as quickly as possible with minimal hurdles: questions of data access, questions of regulation. If we can somehow develop models or algorithms or AI-powered solutions that help vendors develop better products, vendors and developers will probably like this; they also want to increase and retain their market share. Patients want to stay healthy as long as possible, ideally with the least effort and at no cost or very low cost, so we need to be able to engage patients; they are actually key stakeholders in many cases. And payers — insurers, employers, sometimes a national or state system — want to avoid unnecessary costs; quite often that means reducing false positives: can we avoid providing expensive care to those who don't need it? Again, false positives and false negatives may matter differently to multiple stakeholders here, and it's quite important, for every system and every condition, to try to understand what the real needs are.

Just a very simple example. If we are trying to develop, say, a self-management product — a product or an algorithm which helps patients to self-manage — it's quite important to be able to engage the right kinds of stakeholders for it. If, for example, a provider organization — a GP practice, say — is paid by the number of patients they serve, they may well be interested in a product or algorithm which helps patients to self-manage: patients can stay at home and self-manage, they don't go to the GP so often, the GPs still get paid, and they can free up time and provide services to the people who really need them. But payers, on the other hand, will not be very interested in this kind of product, because payers — insurers or health systems — will be paying the providers anyway based on the number of patients being served. So for every intervention, for everything we develop, for every kind of product, condition or use case, we really need to try to understand who the stakeholders are, who would be the right people to engage, and how we should define our objective. I will now give a few outlines of the challenges from the point of view of clinicians and key opinion leaders.
This is based on a recent survey and interviews by McKinsey — more than 237 interviews and surveys of providers, investors and vendors — and some of the barriers identified from those interviews, specifically for AI and machine learning, were about evidence of safety, efficacy and cost-effectiveness. The key principle is "first, do no harm": not doing harm is more important in this space than innovating, and any innovation will need to add value over established processes in the real world. We must demonstrate the quality of AI and machine learning solutions: robustness to new populations, completeness of the underlying data sets, biases need to be handled, data leakage needs to be handled. For example, it sometimes happens that the code for a condition, or for an event happening in a hospital, is recorded a few days after the patient started receiving care; then, for example, a medication given to treat the condition will be identified as a predictor of the condition itself, because the condition was only recorded several days later. This kind of data leakage is common, and when we develop our AI solutions it's important to really understand the process of how data are recorded and what happens to them, because it will affect quality. Then there is lack of multidisciplinary development. COVID is probably not going to be a trigger here, in contrast to telehealth, but rather a challenge, because it is now so difficult to engage with clinicians, at least for many long-term conditions, because of the backlogs. Solutions need to be driven by real needs and not only by the availability of data; we also need to integrate within workflows, and whatever we develop needs to be easy to use. And then there is uptake: if there is limited clarity on how decisions are made by models, then people will not necessarily trust them.

So here are some words from the interviewed stakeholders — this comes from the 237 McKinsey interviews and from some of our own surveys of clinicians and patients in relation to AI and machine learning. "We are holding lives in our hands; we need proof that it works, and you have to convince people with the results." "It's difficult for regulators to trust something that is difficult to assess." "I don't know how feasible AI predictions would be in, say, severe patients; we're stuck with crazy definitions of outcomes. I'm not sure about the ability of AI to predict the risks." This is a very good point, and it relates, for example, to the challenge of how we define outcomes: sometimes the outcomes we are trying to predict may be subjective, and clinicians may have concerns about using machine learning models for them. "If your AI solution works, it would be brilliant, but we tested some solutions in the past and had problems with them" — so past negative experience is of course bad. "I don't understand machine learning models. The models I've seen don't use medical knowledge. I don't know what these models are telling me and why." "It's easy to be driven by what we can do with the data, rather than by the clinical need." "We could have helped to design something much more useful and likely to be used, but we were not involved upfront, and now it's too late." These are just some typical examples. And there are additional challenges which vendors, investors and frontline people identify.
When it comes to startup executives, it's lack of interoperability and also data sharing — data issues really dominate for startups and for investors. For healthcare professionals, it's lack of skills, education and funding that is mentioned quite a lot as an additional challenge. When it comes to executives — 2,400 C-level executives across multiple sectors, not only healthcare, in several KPMG surveys — there are additional challenges: only 35% say they trust their own AI and machine learning solutions and data analytics processes; 93% of healthcare executives agreed about the need for a code of ethics; 92% of healthcare executives questioned the trustworthiness of data and models and were concerned about reputation and litigation. There are also concerns about data quality and privacy violations, and potential biases are mentioned very often as well. Once we understand those kinds of needs and pain points, we can start to think about which AI and machine learning models and approaches could offer opportunities for solving those challenges.

But let's speak about ethics first. These concerns come from policy makers, and we can now see a lot of policies around the integrity of solutions — basically, inference and learning should be done properly. People care about the appropriateness of how data sets are used, control of data quality, no data leakage; about interpretability, that is, clarity about how predictions are made; and about robustness — robustness under new conditions, new populations, new interventions, new outcomes, new clinical settings. For example, a new intervention may still involve the same kind of drug, or the same kind of model, but used slightly differently — without, say, the support of a clinician in place — and then it is a new intervention, and we need to show that whatever validation we did, and whatever evidence we collected for safety and cost-effectiveness, still holds, and that our downstream developments on top of the algorithms and models are still robust. And fairness: the absence of biases and prejudice, ensuring, for example, that protected variables are not associated with the predictions, or that protected variables cannot be recovered from the model's outputs, and some kind of uniformity, high-quality performance across multiple strata of the population. We may already start to think about various ways of addressing those needs.

So let's think about opportunities for AI and machine learning at the research and development stage. What can we do, what kinds of opportunities might exist for us when we develop new solutions, in relation to evidence? Remember, it's about added value over established processes and solutions. One question we can ask ourselves is: how can we configure our models so that they formally generalize established processes? For instance, there may already be established risk scores which clinicians use in practice. We can potentially use those established risk scores as features, priors or factors in a meta-model, in an ensemble of some kind, and then, if we are careful, we may be able to demonstrate that whatever we develop formally improves on what's currently being used. Quality: robustness to new settings, accounting for situations where the training data is not the same as the test data, possible shifts in PICO — population, intervention, comparison and outcomes; the comparison relates to the baseline: what are we going to compare our models with?
Co-development: nothing really replaces co-development — we still need to involve stakeholders to co-develop — but perhaps we can use information from guidelines, and we can also use clinical trials and primary research articles to constrain our hypothesis space, and maybe we can be clever about how we aggregate this prior knowledge by using machine learning. Practicality and data issues, access to data, robustness: we can already think about data efficiency using prior knowledge, pre-trained models and published results. And uptake: we can consider explainable models where possible, or at least we can try to approximate complex models by simpler models.

So let's look in a bit more detail at this last point, how to improve uptake, because if we are careful enough we can probably handle evidence and we can probably handle quality, but how can we ensure uptake by the end users? Let's look at explainable AI, or XAI, a little. This relates to one of the statements made by a senior clinician whom we interviewed, who said: I don't understand machine learning models, I don't know what those models are telling me and why. Perhaps uptake can be improved by improving explainability. Many machine learning models are black boxes, not really explaining predictions and decisions in a way which is understood by experts, and there are multiple procedures clarifying how such black-box models may work, often based on producing a post hoc approximation of the complex model. This explainable AI needs to be differentiated from constraining models to be interpretable by design, for example by using those model classes which we think are a bit easier to understand, or models which use domain knowledge. Sometimes people speak about interpretable machine learning, explainable machine learning, XAI, and there are some differences between them, but the terms are often used interchangeably; we will speak mostly about explainable AI.

How can we open black boxes? We can try to explain what's happening inside, or we can try to design a transparent model. When we are trying to explain what's happening inside, we can try to explain the model globally: take the outputs of our model, which could be difficult to explain, and fit a global surrogate to mimic the behavior of the black-box function f. There are local explanations, outcome explanations: can we construct a locally explainable model g(x), mimicking f(x), the black-box model, for some input — for example, why does the model classify patient x as a case or a control? These are the kinds of questions which may potentially address the concerns of clinicians: why, for this patient, is the model telling me something, and what can we learn from this? And we can also think about model inspection: varying the inputs and seeing what happens to the predictions of our black-box model. In some cases this can elucidate and explain, to some extent, what's happening inside.

One family of methods commonly used in explainable AI is saliency maps — relatively old, but still a very useful visualization of the image areas explaining predictions, for example for convolutional neural networks. One use is to visualize the notion of a class captured by a trained model.
If S_C(I) is the score of class C computed by the model on image I, what we really care about is to construct an image which maximizes that score, subject to a penalty, optimizing with respect to the image. So for a given class we may want to construct, in this case, an image, by optimizing the score with respect to the image and not with respect to the parameters. This can give us a visualization of what's happening: what could be a typical representation of cases, or a typical representation of progressors. We often use unnormalized scores, there are some technicalities here to get better visualizations, and there are many extensions of the approach. The figure is from Simonyan 2013, where we get representative images of many common classes — bell peppers, kit foxes, huskies — by optimizing the scores with respect to the images.

We can also go conditional and try to visualize the saliency map for a given image and class. For example, if we have a radiology image and the patient is classified as, say, a patient with cancer, what in the image is responsible for that? Given an image I0, a class C and a trained model assigning the scores, we may want to rank the pixels of the given image by their influence on the score. For linear scoring functions — a deep neural network is of course not linear, but if we had a linear model — our interpretation of the importance of pixels would be very simple: we would just take the weights with the highest magnitudes. If we are dealing with a nonlinear model, we can instead use a first-order Taylor approximation at I0: we produce a linear approximation of the score by Taylor-expanding around the original image, and this tells us which pixels need to change the least to have the most impact on the class prediction for the given image. The computation of these derivatives can be done by backpropagation, and there are many ways to improve this and get more intuitive visualizations. So, again from Simonyan 2013, we can get the saliency maps corresponding to given images: which pixels drive the classification of a given image into its class. This is potentially quite useful; saliency maps may help to identify the parts of the images predicting a certain class.

This example is from Zech 2018, a rather famous example, where people found that when classifying pneumonia, the parts of the image responsible for the classifications — which were actually quite good on the training set — corresponded to artificial tokens, markers which radiographers place on the images of the patients, here near the shoulder, so no radiology features, or very few of them, scored highly in the saliency map explanations. So this is an example of how we may potentially identify biases, and similar things can be used to identify various spurious correlations in images. Now, this looks good, and we sometimes use it to try to understand what's happening with our predictions. But does it really answer the question of explainability? There are recent arguments that saliency maps, while potentially useful, may also sometimes give a false sense of understanding: we know where a model is looking to make a prediction, but we don't really know what the model is doing with that region of pixels.
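As an aside, here is a minimal sketch of the gradient-based saliency computation described above, written in PyTorch. It is purely illustrative and not the speaker's implementation: the pretrained ResNet, the random stand-in image and the class index are assumptions made for the example.

```python
# Minimal gradient-saliency sketch, in the spirit of Simonyan et al. 2013.
# Assumptions: a pretrained torchvision classifier and a random stand-in for a preprocessed image.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder for a real, normalized image
target_class = 207                                       # hypothetical class index

scores = model(image)                  # unnormalized class scores (logits)
scores[0, target_class].backward()     # dS_C(I)/dI computed by backpropagation

# One saliency value per pixel: maximum absolute gradient across color channels
saliency = image.grad.detach().abs().max(dim=1)[0]       # shape (1, 224, 224)
```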
What's shown here is from Rudin 2019: the evidence for an animal being a husky, and the evidence for it being a flute, a musical instrument, and we can see nearly identical saliency maps. So we may get very similar saliency maps, and therefore very similar explanations, for quite different classes, and this is especially an issue for noisy predictions of machine learning models in medicine. We know where we are looking, but we don't really know what's happening in the region we are looking at, so there are arguments that this is not really a true explanation.

Other families of methods which highlight and elucidate feature importance, and maybe explain what's driving our predictions, include the LIME method — local interpretable model-agnostic explanations — which uses simpler functions to explain a complex function in its locality. For example, we may define families of relatively simple functions — linear classifiers, trees, rules of various kinds — and we can impose various constraints on what we expect from a good explanation. A good explanation should provide a qualitative understanding of how the input variables relate to the outcome, and it may also need to take the users' constraints into account. For example, sparsity may sometimes be very important, but sometimes it may not be enough: for very high-dimensional problems, sparsity may not be a sufficient condition for explainability, and we may need to impose additional constraints on G, the family of explainable models that we use. Explanations also ideally need to be easy to understand, which is not necessarily achieved using the original features x; what happens in LIME, and in many feature importance methods, is that we project the original features x into a simpler space of, for example, binary features x'. We should also accurately mimic the complex predictions, at least locally — remember, we are trying to find a linear or decision-tree approximation of a complex model — so we want local fidelity, local consistency and accuracy in approximating a complex function by something simple. And ideally we want a method which is model-agnostic, so that f can be treated as a black box.

So what's the idea behind this family of methods? Let's think about it in a bit more detail, because it is used often, and it gives rise to a family of more sophisticated methods which are quite popular nowadays. The idea is to find locally accurate approximations of a black-box function defined on a high-dimensional space. We assume that the complex function f maps from a high-dimensional space to, for example, a scalar; x is a vector of features; x', a set of more easily interpretable features, lives in a binary space of dimension p', which could be lower-dimensional; and z' is some random perturbation of x'. So the technique is to explore how sensitive our function is when we perturb some of these explainable, easy-to-interpret features, and we can also define a mapping back from z' to the original space.
That way we can reconstruct inputs in the original space, see how good our approximations are, and we can impose penalties on this approximation: for example, we can penalize the number of non-zero elements in a linear regression, or use lasso penalties, or the depth of a decision tree, depending on which family of simple functions we use. For example, x could be an electronic health record, or some representation of an electronic health record; x' could be a simpler, binary representation which records the presence of terms describing medical conditions; z' could be a random perturbation of x'; and z could be a reconstructed electronic health record with some perturbations in the medical terms.

The optimization in this case is a simple approximation of the complex function using local perturbations. This is the LIME objective: over functions g in the family G of explainable functions, we minimize a loss between the ground truth given by the complex model f and our simple approximation g, weighted by a proximity measure between the perturbed and the original samples, with a complexity penalty added. We can define various kinds of losses: for example, in regression we want a small squared distance between the predictions of the black-box model and those of our simple explanation, and the proximity is basically defined by how close the original electronic health record is to the electronic health record we obtained by perturbing some explainable features. The penalty could be, for example, the number of non-zero elements — we constrain how many elements are non-zero — and there are various tractable ways of approximating such constraints.

So, in this specific feature importance example: we allow f to be a complex model predicting outcomes from, say, an electronic health record x. We convert the electronic health record into a simpler vector x' saying which terms are present and which are not. We sample z' by turning off some medical key terms, to see what happens to our approximation. We map z', this simpler representation of the medical terms, back into an electronic health record z in the original space; some of the key terms will be missing from this representation. We then compute f(z) — we ask what the black-box model would predict if some of the explainable features were turned off or perturbed — and we penalize the discrepancy between what the black-box model does in this case and what the simpler model would do, subject to some complexity constraints on the parameters of the simple model. What we get is a locally linear (or otherwise simple) approximation of the complex function: a simpler function, a linear model or a decision tree for example, and also simpler features, those features x' and z'.
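To make this concrete, here is a compact sketch of such a local surrogate under stated assumptions: the fitted classifier `black_box`, the user-supplied `reconstruct` function and the binary term-presence vector are hypothetical placeholders, and a weighted ridge fit stands in for the sparsity-penalized fit described above; this illustrates the idea rather than the LIME package itself.

```python
# Sketch of a LIME-style local surrogate for one record, assuming `black_box` has predict_proba,
# x_prime is a 0/1 NumPy vector of term presence, and reconstruct(mask) returns a single-row
# input in the original space that the black box accepts.
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(black_box, x_prime, reconstruct, n_samples=500, kernel_width=0.75):
    rng = np.random.default_rng(0)
    x_prime = np.asarray(x_prime)
    p = len(x_prime)
    # z': perturbations obtained by randomly switching off some of the present terms
    z_prime = rng.integers(0, 2, size=(n_samples, p)) * x_prime
    # f(z): black-box predictions on the reconstructed, perturbed records
    f_z = np.array([black_box.predict_proba(reconstruct(z))[0, 1] for z in z_prime])
    # pi_x: proximity of each perturbed sample to the original (kernel on Hamming distance)
    distance = (z_prime != x_prime).mean(axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # g: simple weighted linear model fitted locally; a lasso penalty or a cap on the number of
    # non-zero coefficients would enforce the sparsity discussed above
    g = Ridge(alpha=1.0).fit(z_prime, f_z, sample_weight=weights)
    return g.coef_  # importance of each interpretable (binary) feature
```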
When this was applied to various data sets — this is from Ribeiro 2016 — we can see which words, for example, in a text classification task turn out to be predictive of the various classes; here it's a binary classification of atheism versus Christianity. We are basically trying to elicit which features, which keywords, drive the predictions of black-box models, in this case an SVM with RBF kernels, and we can see which words are responsible for predicting Christianity and which words are responsible for predicting atheism. And the argument is that this particular model is not very trustworthy: when we look at the explanation, we can see that "Posting", "Host", "Re" — terms which appear in the headers of emails — appear to be driving the predictions of the model. And this is despite the fact that the model turned out to be very accurate on test data, over 94% accuracy for a well-balanced data set. So, without looking at what's happening with the features, we could keep pushing the accuracy or other performance metrics towards 99%, but if it turns out that the key terms driving those predictions are not very meaningful, then maybe we should really consider other kinds of models. The other model is a little bit better, but we can still see that words like "anyone" could be drivers of the Christianity class, so potentially we can try to improve this.

LIME is an instance of an additive feature attribution method: the explanation is a linear combination of some kind of binary features. So this is the same style of explanation as we have seen: z' are features in a potentially lower-dimensional binary space, some kind of simplified input features — is this medical term present or not, is such-and-such a feature present or not — and it doesn't matter whether the original variable was categorical or not, because there are ways to construct such representations.

Now, an approach which is used more and more frequently is SHAP, Shapley feature importance, which assigns feature importance according to the expression shown here. For a sample x and a black-box model f, we assign an importance to feature i based on the difference between the output of the black-box model when feature i is included in the simplified feature set z' and the output when this feature is excluded, weighted by certain coefficients, and we repeat this over all possible subsets of our explainable features — roughly, the attribution for feature i is a weighted average, over subsets z', of f_x(z' with i) minus f_x(z' without i), with combinatorial weights. This expression is not easily computable, but there are efficient approximations of it, and the intuition is very simple: if excluding a feature leads to big changes in the output of our black-box model, then it should get a higher weighting. We could in principle re-evaluate the model an exponential number of times to get exact feature attributions; done that way it's intractable, but there are ways to compute it approximately.

It turns out that feature importance values defined in this way have several very important and attractive properties. There is the property of local accuracy: if we take our simplified features x' projected back into the original space, the prediction of the black-box model is the same as that of the approximation g, so the approximation is faithful to the model. If features are missing, they have no impact, which is also what we would expect: if a feature is not present, it shouldn't really contribute to how we make predictions. And there is a property of consistency: for any two models f and f', if feature i makes a bigger impact on the predictions of f', then it gets a higher weighting for f'. These are all desirable properties, and it turns out that the only additive feature attribution method which satisfies all of them is exactly the SHAP attribution defined here.
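For completeness, here is a brief, hedged sketch of how such attributions are typically computed in practice with the open-source shap package; the random-forest model and the synthetic data are placeholders for illustration, not the speaker's case studies.

```python
# Illustrative SHAP usage on a synthetic tabular problem; not the speaker's pipeline.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                                   # stand-in for patient-level features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)       # fast Shapley value computation for tree ensembles
shap_values = explainer.shap_values(X)      # one attribution per feature per sample

# Model-agnostic alternative for an arbitrary black box (slower, sampling-based):
# explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 100))
```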
SHAP is used very, very often to elucidate what's happening with models, and it is also model-agnostic. And we have literally just scratched the surface here: there are so many different ways of trying to explain the predictions of models. This is from a survey from 2018, I think — SHAP is not even mentioned in it — and there are many more approaches now, depending on what kind of black-box model is used; sometimes they are not model-agnostic but specific to, say, deep neural networks and images, and so forth. But we can ask ourselves the question: how relevant is all of this to improving clinical uptake? If we can explain the features, is that really enough to engage clinicians, is it what clinicians want?

It's important to try to understand the criticism of these feature importance approaches. There is a Nature Machine Intelligence 2019 paper by Rudin, who argues that we should stop explaining black-box machine learning models for high-stakes decisions and use interpretable models instead. The observation is that rather than creating models which are inherently explainable, we create a post hoc model to explain a complex black-box model f, and this may lead to unreliable and potentially misleading explanations if the approximation is not faithful to the original model. Say we have a complex black-box model and we create an explanation, but the explanation is a poor fit to the original model; then whatever we are trying to explain cannot really be trusted, people who don't trust the explanation will not trust the black box either, and according to these arguments we shouldn't really be using this approach. In contrast, it's argued, models which are inherently interpretable provide their own explanations, faithful to what the model actually computes, so we can trust the explanation of an interpretable model as long as it fits the data well. Quite often explanations also don't provide enough detail to understand what's happening inside the black box: yes, we may know that some features are important, but we don't really know how those features are combined in the black-box model. Also, when data are structured and the features are meaningful for our task, it's argued that models which use meaningful features may perform similarly to more complex models, and the difference in performance may be outweighed by the ability to interpret the results. So there are many advantages to using inherently interpretable models, but explaining what's happening inside a black box is still quite useful; for example, we can sometimes see that something really needs our attention.
Perhaps, when we look at a saliency map for an image, for instance, or when we look at the SHAP predictors: if those predictors make a lot of intuitive sense, we don't necessarily need to trust the model just because of that, but if we see that the predictions or explanations don't really make much sense, it's a flag for us to go and try to improve our models or to see what's happening in our data, and so forth.

Now, what about medical explainable AI? We are looking at explainability — remember, as a possible way of improving uptake and the trust of clinicians, payers, patients and multiple other stakeholders. But what do stakeholders really mean by explainability? There are no universally accepted metrics, and perhaps there are none to be had: maybe this is so domain-specific and so application-specific that it will always be difficult to define what is meant by explainability. Still, although there are no formal criteria, many stakeholders, including policy makers, regard explainability as an important factor, and accuracy and performance are widely viewed as insufficient for clinical uptake at scale, so some way of explaining what we are doing is still useful. There was a recent survey — or rather, interviews with ten clinicians — to try to understand the specific questions in explainable medical AI, by people trying to find answers to which explanations are needed by end users and when, how those explanations could be produced using machine learning, and what could be good explainability metrics. It's a potentially interesting read; we get some intuitions about what clinicians need, but these questions are not fully answered.

What do people expect? Clinicians quite often need to justify their decisions to other stakeholders, and they would like the models to provide similar kinds of justification. Clinicians expect alignment with evidence-based medical practice. They also expect transparency about when to use and when not to use the models: inclusion criteria, exclusion criteria, interventions — what's happening with the patients when we're using our models. And clear definitions of the predicted outcomes: sometimes we may be dealing with the same condition, but the outcome is defined differently in one hospital than in another hospital or a different geographic location, and those issues need to be taken into account. Models are allowed to make some mistakes, as long as it's clear why those mistakes are made; for example, if a model makes a mistake but we can say that it happened because we're dealing with a different definition of an outcome, or with a different population, maybe that's okay. Models should be repeatably successful for personalized, patient-level predictions in the real world. And it's also important to be able to understand what drives predictions and why those predictions are made. Explainable AI outside medicine is trying to address some of the same questions — there are arguments that we are not completely there yet — but clinicians certainly want those kinds of explanations. I guess the key take-home, which we should consider when we are trying to explain our solutions in medicine, is really this: if a feature is important in medicine, clinicians and other stakeholders would expect it to be important in our AI models.
If there is very solid evidence for an important feature — for example, from clinical trials or from evidence reviews — we would like to see it in our AI models. And conversely, if a feature turns out to be important in our machine learning models when we come up with our explanations, it shouldn't be completely meaningless in medicine. Like in the example we looked at: tokens placed on the patient clearly carry no radiological information and are probably useless in medicine, yet they turned out to be important to the AI — that's an indication that maybe it's not a genuinely important feature, and that we should go and do something with our models or with our data.

So, more on medical explainable AI: how to construct models exploiting medical evidence. This is what we usually do in health, and it's quite important: we are not really trying to construct an explainable model here, we are not constraining ourselves to simple model classes which are easy to explain, but we are building the constraints which clinicians want into a potentially complex function, to try to make that potentially complex function more explainable. One very simple idea is to regularize around published findings. For example, we can use informative priors, we can use informative regularizers. Most of the standard regularization methods penalize deviations from zero — ridge regression, lasso and so forth — but we can regularize around values which are known to be potentially useful, known from clinical trials and published research, and w0 here could be the magnitudes reported in the literature. In that case, we will only move away from the prior, which is validated by the existing evidence, if the data really push us away from it (a small sketch of this idea follows below).

We can also try to enrich feature spaces. In genetics, polygenic risk scores are often used: we can look at linear combinations of various SNPs, with weights w0 which we can get from the literature. We can do the same with clinical risk scores and various other models which appear in the literature, to enrich the feature space. If we use those established clinical predictors — like comorbidity scores known to drive poor outcomes — as additional predictors in our ensemble, we can make sure that the remaining part of the model explains the residual structure.

We can sometimes use shape constraints: diseases are associated with deviation from the norm, and this may be used to constrain the shape of a function. Let me explain this a bit. This plot shows the probability of readmission according to a black-box model as we vary the heart rate, and we can see that what's happening here kind of makes sense: the risk grows when the heart rate increases, but the risk also grows when the heart rate decreases. With linear models — simple, explainable models using heart rate as a feature — you will probably see a line; it would say that a low heart rate is good for you, but of course a heart rate of zero is not very good for patients. Similar non-monotonic trends, deviations from the normal range, occur in many vital signs: for heart rate, the corresponding conditions are bradycardia and tachycardia; we also have hypotension and hypertension, hypoglycemia and hyperglycemia, where similar non-monotonic shapes could be expected. We can then impose constraints based on this prior knowledge and on the clinical understanding of the marginal risks in our models — we can call these shape constraints — and, for example, penalize deviations from those shapes.
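As mentioned above, here is a minimal sketch of regularizing around published findings: an ordinary ridge penalty centered on a vector w0 of literature-reported coefficients rather than on zero. The data X, y and the prior w0 are placeholders, assumed to be on compatible scales; this is an illustration of the idea, not the speaker's implementation.

```python
# Ridge-style fit centered on evidence-based prior coefficients w0 instead of zero.
import numpy as np

def fit_prior_centred_ridge(X, y, w0, lam=1.0):
    """Minimize ||y - Xw||^2 + lam * ||w - w0||^2 in closed form."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    b = X.T @ y + lam * w0
    return np.linalg.solve(A, b)

# With a large lam, the fitted coefficients stay close to the published values w0;
# the data only pull them away from the prior when they carry enough signal.
```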
So let's go back to further opportunities for AI and machine learning. We've spoken about uptake and explainability; one other question of very high importance is the question of quality. How do we account for possible data differences, for the fact that the training data is not the same as the test data? This can arise at multiple different levels. A solution from machine learning could be transfer learning: we can look at that family of methods. Say we train a model using a source data set with some inputs, some features, some outputs — for example, a randomized controlled trial of a telehealth intervention, just for the sake of the argument. In a randomized controlled trial, patients are closely monitored and reminded to submit patient-level data. Also, the people included in the trial could be predominantly ill people who would rapidly progress to the worst stage of the condition without the intervention, without telehealth monitoring: it makes sense to focus a trial on very ill people so that we can more easily demonstrate what happens in the intervention arm compared with the control arm without any intervention. But after we run the trial, we really want to run the model on a target data set of less severely ill people, who will not be reminded to submit daily data by whoever is running the trial, and who will be based all over the world, not just in the lab or in an academic health center running a clinical trial for us. The standard assumption of many machine learning models is that the training data and the test data come from the same distribution — that examples are identically distributed — and in our case this assumption doesn't hold anymore. There are differences in age, severity, socioeconomic status, ethnicity, whether patients are reminded or not, disease status; many, many things might change. This can be addressed by transfer learning and domain adaptation, those kinds of techniques.

The idea is to improve performance in the target domain by transferring knowledge from related but different source domains. Various things can differ. The inputs can differ: the population in the target task is not the same as the population in the source task; that's a covariate shift problem, our domain has shifted, our inputs are different. Sometimes the labels may shift in their prior probability; that may correspond to different disease severity rather than different patient populations — for instance, patients may be less severely ill when we take our telehealth intervention from a clinical trial to the real world. There may be a concept shift: for example, a different type of intervention in the randomized controlled trial and in the real-world setting — when patients are not reminded daily to submit their readouts and follow the intervention, maybe the intervention is not quite as effective anymore, so the labels, or rather the mapping from inputs to labels, may shift. And everything may potentially shift between the source and target distributions.
Basically, we want to derive models on our source data set that will have a small error on the target data set, which we know is different from the source. There are many ways of addressing this. For example, we can reweight the training instances so that our loss under the reweighted training instances resembles what would happen on the target data set (a small sketch of this idea follows below). We can look for shared feature representations, perhaps pre-processing our data so that in the feature space the target and source tasks look similar. We can share parameters: quite often, especially in the deep neural network literature, people use models trained on completely different tasks and adapt the parameters a little, so there is a lot of parameter sharing, but perhaps not complete sharing. And multi-task learning is a related area, where instead of optimizing for a single target task we want good performance over multiple tasks.

Now, this is an important result. It's a bit technical, but it's quite important to understand its meaning. R_T is the risk in the target domain, which we want to minimize, and R_S is the risk in the source domain. The distance term here is a specific definition of distance between the target and the source domains, and there are additional terms; roughly, the target risk is upper bounded by the source risk plus this divergence between the domains plus the remaining terms. What this really tells us is that to achieve good performance on the target task, we need to perform well on the source task — for example, in the randomized controlled trial setting — but we also want to learn a representation in which both domains are as similar to each other as possible. So, to repeat: to perform well in the real world, we want to perform well in the randomized controlled trial setting, for example, and we also want to ensure that whatever features, representations and samples our models are using, what happens in the real world is as close as possible to what happens in the trial.

So how can we do medical transfer learning? Due to variation in protocols, patient populations, reimbursement strategies, designs and so forth, it's very common that there is a distribution shift: our source data is not the same as our target data, and this is really fundamental in healthcare — whatever we develop in a lab, or even in a trial, is not going to be the same as in the real world. Among clinicians and in medical research — epidemiology, clinical trial research and so forth — there are some heuristics which actually make quite a lot of sense, and those heuristics can be justified from the expression we have just seen, where we want robust performance in both the source and multiple unknown target populations. What people tend to do is to use features which behave similarly in the target task and the source task. For example, if we are creating a model and we know that certain features are going to look very different where the model will be used, maybe we should down-weight those features somehow, not pay very much attention to them; in clinical research people often simply exclude such features. Sometimes people define an indication so that models are only used for target subpopulations which are as similar as possible to the source population.
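Returning to the instance-reweighting idea mentioned above, here is a minimal sketch under assumptions: we suppose unlabeled covariates from the target (deployment) population are available, estimate how "target-like" each source example is with a simple domain classifier, and use the estimated density ratio as a sample weight. The variable names are placeholders, not the speaker's implementation.

```python
# Covariate-shift reweighting via a source-vs-target domain classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target, clip=10.0):
    """Weights ~ p_target(x) / p_source(x), estimated from a domain classifier's odds."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])  # 1 = target domain
    domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = domain_clf.predict_proba(X_source)[:, 1]
    w = p_target / (1.0 - p_target)          # odds approximate the density ratio (up to a constant)
    return np.clip(w, 0.0, clip)             # clipping keeps the reweighted loss stable

# Hypothetical usage, given source features/labels and unlabeled target features:
# weights = covariate_shift_weights(X_source, X_target)
# outcome_model = LogisticRegression(max_iter=1000).fit(X_source, y_source, sample_weight=weights)
```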
People also transform the features, using quantiles or discretizations of various kinds, to decrease the discrepancy between the covariates in the source task and the target task. For example, instead of taking the age variable, people can use a binary indicator, age above 65. Or, instead of taking income level in euros, people ask: are we in the top 25% quantile or not, or which income quantile are we in? It's one way of ensuring that in the feature space there is more similarity between the source and target data sets. Now, it's important to understand that quite often there are no patient-level samples from the target data set at all. When we do transfer learning in machine learning, we have some samples — quite a lot of samples sometimes — from the source task, and we may have some samples from the target task; sometimes those samples are not labeled, or we only have a few labels from the target task. But in medicine it may often happen that there are zero samples from the target task. So what do we do in this case?

Let me now show a few case studies; I'll try to go through them very quickly, and unfortunately, because some of this work is relatively sensitive, I will be limited in the technical details I can present, but hopefully it will still give you the flavor of what we are doing and what may potentially be done. Let's go back to a statement by one of the interviewed clinicians: "It's easier to let ourselves be driven by what we can do with the data rather than by the most pressing clinical needs. We see many AI solutions addressing the same tasks, because those are the tasks for which the data are available." This was said by one of the professionals interviewed by McKinsey. When we first read that comment, we thought: okay, it's probably true, but it is because we, as vendors, as developers of AI models, struggle so hard to get access to clinical data sets; surely we shouldn't be held responsible for not being able to address real clinical needs if we cannot access real clinical data sets. But then we asked ourselves the question: how can we address clinical needs without accessing data sets? How can we make predictions if we have zero samples? How can we construct something which could be useful to clinical stakeholders? The answer is that if you don't have data, then you have no choice but to rely on prior knowledge — at least this is one of the answers, and it is the answer which we gave ourselves.

So the concept of what we did was to try to extract pre-trained predictive models, for a given condition and outcome, from the biomedical literature. It's almost like completing the loop: quite often, people who are experts in AI and machine learning go to the medical community and say, look, we're experts, we have algorithms, we know how to train them, give us some data and we can help you solve medical tasks. What we are trying to do instead is to mine medical evidence so that we can construct potentially more sensible and more explainable AI models, models which clinicians may actually want to use. Whatever we can identify from the published literature can be used as features, constraints, evidence-based priors for patient-level AI models; this potentially helps to improve data efficiency and robustness, and we are literally building on many years of accumulated medical research rather than trying to gain all insights from clinical data alone.
So the process is relatively simple. We start with millions of publications. We identify potentially thousands of publications related to the condition and outcome of interest. There are various ways of ranking those publications and retrieving information. In what we call MedAI — medical knowledge plus AI — we are trying to convert medical knowledge into predictive software, which we can quickly validate on data sets where those data sets are available, or which we can still provide, with appropriate caveats about coverage, where they are not. And we can build on a lot of clinical research and construct extensive libraries of pre-trained models and risk scores on the basis of the literature. When we look at this information-discovery approach and at the clinical needs we discussed before — evidence, for example: can we ensure that whatever we develop builds on priors, on whatever is used in clinical practice, on whatever clinicians have already identified as important? — it addresses many of the questions around quality and data efficiency. For example, if we don't have enough data, or enough co-development with clinicians, nothing really replaces that, but maybe we can at least get an idea of what kinds of questions to ask the clinicians by identifying which features appear to be important. Practicality and data issues: many of these questions are addressed, or at least partly addressed, by this information-mining approach. So this is roughly how it works. We specify a condition and outcome, and we search for multiple models. Often we can identify thousands of publications; we search public repositories for PICO terms (population, intervention, comparison, outcome); we use semi-supervised classification to identify the results that are reliable and relevant; and then we run a crawler to access the information — PDFs, links, and whatever other information we can mine. I'll not go through this in detail; it's basically a short aside on semi-supervised learning. Most people will know about this, but for completeness of the tutorial: the idea is that we have an unlabeled data set, which may be quite big, and a labeled data set, which may be relatively small. The question is how we can improve the quality of predictive models trained on the labeled data by also using the unlabeled data, and this is a common setting we have to deal with. We have many unlabeled papers — we don't know whether each one is a good paper or a bad paper, whether it's relevant to the task we care about, the construction of a predictive model, or not — and we have our in-house labeled data set, which is small. So, with semi-supervised learning, we have a few labels and quite a few unlabeled examples. We try to draw a decision boundary — it may be in a word-embedding space, it may be something else depending on the task — and if there is a new paper, we want to say: is it a good paper, a bad paper, how informative is it, and so forth. There are various ways of solving this: we can construct generative models exploiting the clustering structure in the unlabeled data; we can use graph methods — construct a graph between the points and propagate the labels over that graph; and there are also agreement methods, where classifiers generate labels for each other.
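To make the paper-screening step concrete, here is a minimal self-training sketch in scikit-learn; the abstracts, labels and threshold are invented placeholders, and the talk does not say which semi-supervised method was actually used:

```python
# Semi-supervised relevance classification of abstracts (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

abstracts = [
    "Prognostic model for COPD exacerbation in primary care ...",   # labeled: relevant
    "Case report of a rare dermatological presentation ...",        # labeled: not relevant
    "Risk score for 30-day hospital readmission ...",               # unlabeled
    "Cohort study of inflammatory biomarkers ...",                  # unlabeled
]
labels = [1, 0, -1, -1]  # -1 marks unlabeled papers

model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8),
)
model.fit(abstracts, labels)
print(model.predict(["External validation of a COPD admission risk model ..."]))
```

Label propagation over a similarity graph of embeddings, or co-training with two classifiers labeling data for each other, would slot into the same place in the pipeline.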
And we mostly ended up with low-density methods — we tried many different things, but with low-density methods we basically say: we could start drawing this classification line over here, which may result in a decision boundary cutting through dense regions, but we instead favor solutions whose boundary lies in the low-density areas over here. So once we do the search and classify our publications, what happens next — and I'll go through this very, very quickly — is text parsing. We identify tables, because quite often results are presented in tables. If we are dealing with HTML, we can sometimes find tables simply by parsing; sometimes we need to treat papers as images, use table detection — variants of R-CNNs and cascades of R-CNNs — to identify the tables, run optical character recognition on them, and then use some human proofreading to really get the results out of a table. Then we classify tables: are we dealing with a regression model, or is it a table of population statistics, and so forth. So we can potentially come up with thousands of tables, and these get converted into an internal model representation. Given that internal representation, we can then auto-generate predictive functions, and we can also auto-generate back-end API instances and web-based front ends. There is a human in the loop for quality control everywhere, especially for the development of exemplar files and documentation, checking whether what we found in a publication is useful or not. And what we can get, for example for chronic obstructive pulmonary disease, is a library of many different models using different kinds of predictors, based on what has been published in the medical literature. And this becomes a piece of Python code, or PHP, or whatever; it is integrable with front ends, and it can also run in batch, as I said, so that we can really call it for predictions. I'm limited in what I can say about how well this works. We had six case studies across provider organizations. The meta-model is an ensemble of literature-based models. The provider organizations had AutoML commercial software packages — very complex black boxes — identifying the best models, and we also looked, in internal validation within the provider organizations, at what could be said about these meta-models. What I can show is an average — I cannot give details of the case studies, and I cannot really say what those case studies are. But we see that overall, on average, those models are more or less the same: the meta-model, which combines literature-based scores and predictive models, hits an AUC of 0.76 on test data versus 0.77 for the very complex model, and we get about the same level of improvement by combining AutoML with the explainable meta-models as by using the black box instead of the meta-models. Now, when it comes to external validation, it turns out that some of the AutoML software packages we used couldn't really be taken from one provider to another, for availability and various other reasons, so we weren't able to validate those AutoML packages on an external data set. But simply taking the literature-based models — the simple models and the meta-models — we get almost the same quality on an external data set as we got in internal validation.
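As a sketch of what auto-generating a predictive function from the internal representation might look like — assuming, and this is my assumption rather than something stated in the talk, that many extracted models reduce to logistic-regression-style coefficient tables — with invented coefficients and feature names:

```python
import math

# Hypothetical internal representation of one literature-extracted model:
# logistic-regression coefficients mined from a published table (values invented).
copd_admission_model = {
    "intercept": -3.2,
    "coefficients": {"age_ge_65": 0.8, "fev1_pct_predicted": -0.02, "prior_admissions": 0.5},
}

def make_predictor(model: dict):
    """Auto-generate a predictive function from the internal representation."""
    def predict(patient: dict) -> float:
        z = model["intercept"]
        for feature, beta in model["coefficients"].items():
            z += beta * patient[feature]      # assumes every feature is present
        return 1.0 / (1.0 + math.exp(-z))     # predicted probability of the outcome
    return predict

predict_copd_admission = make_predictor(copd_admission_model)
print(predict_copd_admission({"age_ge_65": 1, "fev1_pct_predicted": 45, "prior_admissions": 2}))
```

Such a generated function is trivially wrapped in a batch job or behind an API endpoint, and an ensemble (meta-model) is just a weighted combination of several of these functions.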
So, I know I need to go very quickly. I've got one other case study, which will take about three or four minutes, or we can go to the questions. Shall I continue, or shall we go to the questions now? — I would go on for another three to four minutes, as you said. — Okay, thank you very much. So, I'll show another thing we are doing; it's still work in progress. It's a pre-clinical screening project. We are trying to combine organs-on-chips — brains-on-chips — with artificial intelligence to drive drug discovery in Alzheimer's disease. It's got many components: organs-on-chips, or brains-on-chips, and AI, and Alzheimer's is probably one of the most complex conditions. So what is the idea here? We've got candidate medications, and we also have brain organoids in organ-on-chip devices, where we are trying to grow — not we, but our partners; it's a partnership of three companies and one university — tiny little brains from stem cells. Our partners extract stem cells and grow tiny brain organoids from them. And then we bombard those tiny brains with drugs known to work, or known not to work, at various stages of clinical trials for Alzheimer's disease. And we record what's happening in those organoids — we get structures resembling human brains — and from videos of those structures we try to predict, for a drug for which we know the ground truth from in vivo human clinical trials, whether the drug is toxic or not. It helps us to fail early and weed out the drugs which are unlikely to work in humans, by using organoid devices. Now, there are some biological artifacts. Here is a stationary image, this is how it looks, and a plot of the same image. And this is what happens if we take the exact same image and plot it in 3D: when the image intensity is capped at a certain level, we can start seeing curved structures over here, and potentially more happening in the image. So there are multiple different ways of pre-processing those images. Sometimes there are funny artifacts from the microscope — you get backgrounds of this kind — and there are ways of improving this by fitting functions that model and remove the background. There are various ways of pre-processing the images: using bounding boxes to identify cells, trying to see what could be happening between the cells, and looking at what happens in time. We are dealing with time series here: there are spatial features, there are temporal features, and we can compute spatial and temporal gradients to enrich our feature spaces. Generally we are trying to formalize the features for organoid devices, and there are various ways to augment our data — filters, blurring, the usual things we do in machine vision. And the model looks like this. We realized very quickly that it's actually very difficult for us to get a lot of stem cells and to grow a lot of brains, so we needed auxiliary tasks. So, starting with those videos of what's happening in the brain, we defined auxiliary tasks: identifying cell types, what's happening between the cells, cell locations, and spikes. The ground-truth data for this comes from semi-automated packages, which cannot be used fully automatically but which can help to generate labeled data for the auxiliary targets.
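To give a feel for the pre-processing just described — intensity capping, background removal, temporal gradients and simple augmentation — here is a minimal sketch; the cap percentile, filter widths and augmentation choices are my assumptions, not the project's actual pipeline:

```python
import numpy as np
from scipy import ndimage

def preprocess_frame(frame: np.ndarray, cap_percentile: float = 99.0) -> np.ndarray:
    """Cap intensities and subtract a smooth background estimate from one video frame."""
    frame = frame.astype(float)
    cap = np.percentile(frame, cap_percentile)
    frame = np.clip(frame, 0.0, cap)                         # intensity capping
    background = ndimage.gaussian_filter(frame, sigma=50)    # smooth background estimate
    return np.clip(frame - background, 0.0, None)            # background removal

def temporal_gradient(frames: np.ndarray) -> np.ndarray:
    """Frame-to-frame differences as a simple temporal feature (frames stacked on axis 0)."""
    return np.diff(frames, axis=0)

def augment(frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Blur and flip augmentations, the usual machine-vision tricks."""
    out = ndimage.gaussian_filter(frame, sigma=rng.uniform(0.0, 1.5))
    if rng.random() < 0.5:
        out = np.fliplr(out)
    return out
```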
And by solving the auxiliary tasks of predicting those features — predicting the cell locations, for example — we improve the quality of prediction on the target. There are multiple models we can consider: using only the brain-on-chip video to predict the Alzheimer's outcome, or using the brain-on-chip video plus the cell features to predict the outcome, or saying that the cell features and cell locations are also part of our output. And the losses — there are many different ways to define and weight those losses, but basically we've got multi-task training. And again, the data we have for the auxiliary tasks is quite a lot bigger than the data we have for the target task, and there are multiple auxiliary tasks. There is a target-task classification loss for the response to the drug in Alzheimer's disease, and multiple auxiliary objectives: cell segmentation, types of cells, and cell displacement. Sometimes we cannot really see an individual cell, we can only see a cluster. And the model is really something like this. It's still ongoing work, but perhaps this precision-recall curve is the most interesting one to look at. At 20% recall — and it's well-balanced data — we're getting nearly 90% precision, which basically means that if we are happy to identify only 20% of the drugs which are going to work well in humans, our accuracy in identifying those drugs successfully, based on the data you can see, is nearly 90%. So we may miss some potentially useful drugs, but when we say that a drug can potentially be useful and should be tried in the next stages of clinical trials, we can say it accurately. I will skip all the future themes and go to the last slide, the summary. So I would say that AI and machine learning in healthcare is probably not in its infancy; it's probably in its childhood. It's a regulated space. There are many constraints — safety, efficacy, cost-effectiveness, uptake — and multiple stakeholders with multiple objectives; unique challenges, but it's still probably the most exciting time to be in this space. Thank you. Thank you, Felix. These were very wise final words here. It's the most exciting time to be here, I fully agree on that. There's a lot of applause coming here, virtual applause. There's also a number of questions already, so I'll do my best to fairly share the time here. The first one was on Slido, in fact, and I'll just read out these two by the same person, Helena. In your opinion, are there other fields in which explainability is as important as in healthcare? Maybe a quick one. Yes, any time we make critical decisions — for example in law — it may be quite important. Some people will say finance, some people will say the military; there are many applications where various biases can be a big detriment. Thank you. And the second half is: are the models you use to parse the existing literature explainable themselves? No — and that is fine, because what we use those models for is to prioritize the experts' time. It's not a critical application for us to identify a table in an image, or to say that an expert should first check the results from the model in paper A rather than paper B; we don't treat this as a critical application. Whenever we make critical predictions, we try to build explainability into the models by building on prior knowledge.
So, among other things, on a slide I showed here, we try to make that model explainable, because there we are dealing with patient data. Thank you, Felix. And now we have questions from the network. I saw them in this order: Diane, Vesta, and then Lukas. So Diane, please go first. Okay. Thank you very much for the very interesting presentation. I can see at least two different types of projects that we can develop using machine learning in healthcare: on one hand, software or applications, and on the other hand, real objects such as machines or devices. I'm wondering what you think are the main differences in the process of developing a virtual tool versus a physical tool, and in the process of moving from development to actual use by patients and clinicians. Great, good question. Thank you, Diane. So, software as a medical device — that's the answer: software, at least in the EU, is going to be regulated in the same manner as devices, as long as it's making clinical predictions or is going to have clinical use. So from the regulatory perspective, and in terms of what you need to go through, as long as your software is going to have clinical use, it's going to be largely the same process, though there will probably be some hurdles specific to software. And in the UK — going very quickly back to the slide about the process in the NHS, which is about data security and data privacy — it's highly likely that those kinds of challenges are going to be somewhat software-specific: data protection and interoperability are going to be additional challenges specific to software. In vitro devices, on the other hand, will have requirements about how biological samples are handled. But both will need to be regulated, and the process is largely the same. Thank you. Thank you, Diane and Felix. Now, Vesta. Thank you so much for the presentation, it was very good and very relevant for my research. My question is about the two feature-importance methods you mentioned, LIME and SHAP. In my lab we use Shapley values a lot to understand the contribution of each feature in electronic health records — which features helped the most for the prediction of the outcome — and I was not familiar with LIME. I was wondering if there is an advantage of LIME or SHAP over the other, or if there is some kind of data for which it is better to use LIME or SHAP. My answer is: I think SHAP is preferred nowadays, because, at least theoretically, when we can compute those feature attributions exactly, there are good theoretical guarantees. With LIME we don't necessarily have this: there may be consistency problems, depending on what kinds of models we use, and there are also the proximity measures in LIME. On the other hand, LIME is quite fast to use and does help to identify features. The question of which is better is difficult, because it's very difficult to define what we mean by good explainability or what a robust explainability metric would be. LIME is simple to use — it's not that it's a bad idea to use it — but SHAP is preferred nowadays. And then Lukas, please. Hey, Felix, nice to see you, it's a pleasure as always. Just a quick question regarding the literature scraping that you mentioned. I was wondering — I don't know if you mentioned it and I missed it — how you assess the reproducibility of the papers that you include in the scraping.
Okay, there are several ways to answer this: reproducibility of our methods and of how we prioritize papers — which we can assess normally, because we've got the labels — and then reproducibility of the findings, of the results of those papers, which is the good question. In the specific projects which we had, we went to real-world clinical data, and clinicians were on board for those case studies which we discussed. So in those six case studies the outcomes were in real NHS data, where we were able to run a model on a data set and see whether a model described in a paper as performing to a certain level really performs the same way in clinical practice. Thank you, Lukas. Now we have another question in the chat that I will read out. Thank you for your exciting talk. I have a question about one of the papers you mentioned when discussing saliency maps for explainability, in the slide showing the results from Zech et al. I understood that the saliency maps showed that the machine learning model wasn't actually predicting pneumonia based on a real signal, but rather because the radiologist had placed a token on the shoulder of the patient. Could you explain again how the token was predictive of pneumonia? Did the radiologists only place the token on patients with pneumonia, and patients without pneumonia did not have a token? That's a good question. I think this is what happened — this is not our work, and this is on the basis of what we were able to understand. The radiologists, or whoever was dealing with those images, were placing those tokens differently: some of the radiologists, maybe in intensive care and emergency departments, followed a different process, and for whatever reason — which we haven't tried to verify ourselves — those spurious correlations arose. Similar findings appear in other work of the same kind: different kinds of equipment or different processes are used in different pathways of care, so in intensive care versus regular inpatient or outpatient care something else was happening, and these spurious features end up being predictive. What exactly was happening there I don't know, but the paper is a useful one to read. Thank you, Felix. And the last question of the morning session comes from Giovanni. It's very relevant to what I do, and I wanted to ask a question related to that. There is a lot of excitement in the machine learning community about attention models, and many people claim that they are actually a new type of interpretable model that we can try to use, although I am a little bit skeptical about whether they are truly interpretable. Would you share your opinion on that — whether you have done something close to this in the work that you do, whether they are actually a possible pathway towards interpretable models, or whether they are not as interpretable as many people claim? Thank you for the question, Giovanni. The reason why it is difficult to answer is that it's quite difficult to say what we mean by interpretability — it's very domain-specific. And people do use attention methods — some of those are in this list, and there are many more attention methods — to explain what's going on. The criticism which some people would have would be pretty much the same as Rudin had in this major paper over here: we would still be trying to explain a black box, trying to impose some kinds of constraints on a black box.
We can focus our attention on certain parts of images, but we don't necessarily know what's going to happen. And unless we start building prior knowledge into our models, there will be some stakeholders who are highly likely to say that the level of explainability we are getting is not sufficient, because what we are getting does not necessarily match what is known from clinical practice or medical evidence. But these are useful methods to know about. Thank you. Thank you, Felix, for this very insightful talk and discussion round — that was excellent. A wonderful start into our summer school this morning with the talks by Magnus and Felix. Thank you again to both speakers. And now we have a lunch break until quarter past one, and then we continue with a talk by Chloe Agat-Arsenkot. Thank you very much. And thank you very much, Felix. Thank you.