It is my great pleasure to introduce Joaquín Dopazo. Joaquín is a founding member, even of the first of our two ITNs, so he has played a key role in both networks. It is almost unnecessary to introduce him: he is a thought leader and a prominent figure in bioinformatics. He is interested in functional genomics, systems biology, and the mechanistic modeling of omics data and their exploration. He is the director of the FPS in Sevilla and one of the key figures in bioinformatics in Europe. We are very proud that he was part of our two networks and that he is opening our two-day symposium here with a talk on modeling and machine learning applied to massive drug repurposing in rare diseases. Welcome, Joaquín.

Thank you very much, Gaston, and thank you very much to all of you for giving me this opportunity of opening this last session. We were commenting last night that it was a pity that COVID crossed this ITN, because, especially for you students, networking is very important. We were saying that you have to be good, because you have to be good for your future; but it is not only that — people must know that you are good. That means you have to do networking; you have to make yourself visible in this field. So it was a pity that you could not do the networking that these ITNs are designed for. But try to squeeze this last meeting we are having here to do that last bit of networking, and remember that you have to build the network. So I am going to talk a little bit about some of the things we are doing — I am going to make a mixture of things. I will talk about rare diseases, but I will also exemplify some of the methods with COVID, where we have been able to demonstrate the efficiency of some of the methods we are using. So, you know that we are part of the Spanish Network for Rare Diseases.
Our contribution, typically, is related to the analysis of undiagnosed cases: patients for whom you have an exome or a genome, and there is no mutation characteristic of the disease, so they remain undiagnosed. They send these hopeless cases to us, we do some research, and we have about 30% case resolution, which is quite OK compared to the literature. But now I am going to present something more philosophical, which has to do with the way research is done in rare diseases. These are some facts on rare diseases. By definition, a disease is considered rare if it affects fewer than one person in 2,000. Many of them affect even fewer people; some ultra-rare diseases affect one person in 2 million, et cetera. So individually these are diseases which are rare, by definition, but in the end there are about 7,000 of them — actually, some literature says there are 9,000 rare diseases. Collectively, they affect 6% to 8% of the population, so taken collectively it is like a prevalent, regular disease. Most of them have a genetic basis, and there are very few treatments available: only about 400 rare diseases have a more or less efficient treatment. What are the consequences of this? Mainly that the classical approach we use for facing diseases is not very useful in rare diseases. Typically there are lots of people working on diabetes, on lung cancer, on whatever; but we have at least 7,000 diseases, and we do not have 7,000 sets or groups of scientists working on each of them. So what happens is that the research ends up very scattered. And in terms of treatments, obviously, it is the same.
Pharma companies do not invest in rare diseases because, if you consider them one by one — if you consider treatments for one specific disease — there is no market niche for them, since there are very few patients. So in the end we should change a little bit the way we do research and the way we try to cure rare diseases. Maybe what I am proposing is not the panacea, but we need a change in the paradigm. First, we probably need to focus on disease mechanisms more than on specific diseases, because many of these diseases actually share some of their disease mechanisms. What is the advantage? The advantage is that, in that way, all the knowledge we gain on mechanisms will break the disease barrier. The problem is that we do not have much detail on the mechanisms of rare diseases, because they are mostly unknown. So we try to use mechanistic models combined with causal machine learning to gain some insight into rare diseases. From the point of view of translational application, we are going to focus on drug repurposing. Why? Because in that way we will be dealing with drugs for which the safety profile, mechanism of action, et cetera, are already known. The only thing we have to do is prove that the drug is efficient in this disease, and a lot of the steps in the regulatory part of drug approval are already solved. The problem, then, is that the relationships between the targets of these drugs already in use and the disease are unknown. What solution can we use? Again, causal machine learning in combination with mechanistic models. I mentioned this in a previous talk, but it is good to refresh it because probably you do not remember: what are mechanistic models? We use mechanistic models to provide a quantitative representation of the functionalities of the cell.
Basically, what we do is use biochemical pathways, in which we have the functional relationships between proteins — how proteins interact, not only physically but functionally — and what these proteins do at the end of the pathway. In some cases they trigger cell death; in other cases they trigger other outcomes — this is the way the cell decides what to do, the fate of the cell. Something interesting is that in these pathways we can define functional endpoints: at the end of each of these pathways there is a function which is triggered in the cell. The idea is to have this framework of how genes interact with each other, together with some mathematical modeling that we can use, and then to use measurement data on the condition of the cell. Typically we use gene expression, which nowadays is quite accurate and cheap to obtain — there are lots of these data — and it is a sort of readout of what the cell is doing at that moment, like a snapshot of what the cell is doing in a particular condition. So here is a very simplified picture of how one of these models works. This would be the pathway: you have interactions between some proteins on the left that receive some input, some signal; they communicate with each other like in a circuit; and finally they trigger a function. There are other types of pathways, the metabolic pathways, in which the functionality would be the generation of metabolites, but conceptually they are more or less the same. So the idea is that we put the gene expression data we mentioned, in different conditions, onto this map, and we see what happens. There is a recursive formula by means of which we can say: OK, what would happen if there is a signal here, given these states of activation? In that case, the bulb will light.
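The circuit-and-bulb idea can be sketched in a few lines of code. This is a deliberately minimal toy, not the actual HiPathia formula: gene names, expression values, and the way signals combine are all invented for illustration — a node's activity is its own (normalized) expression times the signal arriving from upstream, with activating edges passing the signal on and inhibiting edges attenuating it.

```python
def propagate(expression, edges, order, inputs):
    """Toy signal propagation over a small signalling circuit.

    expression: dict gene -> normalized expression in [0, 1]
    edges: list of (source, target, sign), sign +1 = activation, -1 = inhibition
    order: topological order of the genes
    inputs: receptor genes receiving the external signal (signal = 1)
    """
    signal = {}
    for gene in order:
        if gene in inputs:
            incoming = 1.0
        else:
            # combine upstream signals: activations multiply in,
            # inhibitions attenuate as (1 - upstream activity)
            incoming = 1.0
            for src, dst, sign in edges:
                if dst == gene:
                    s = signal[src]
                    incoming *= s if sign > 0 else (1.0 - s)
        signal[gene] = expression[gene] * incoming
    return signal

# Hypothetical chain: receptor A activates B; B activates effector C ("the bulb")
expr = {"A": 0.9, "B": 0.8, "C": 1.0}
edges = [("A", "B", +1), ("B", "C", +1)]
activity = propagate(expr, edges, ["A", "B", "C"], inputs={"A"})
```

With full expression along the chain, the effector C ends up active; setting B's expression to zero reproduces the "short circuit" condition in which the bulb does not light.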
So in the end it is like an electrical circuit that we are simulating here. And what would happen in the second condition? In the second condition we have a short circuit here, so the bulb will not light. This is essentially what we have. If you take real pathways and real data, you get things like this. This is real data from TCGA, The Cancer Genome Atlas. We put the data on the modeled pathways, and we see, for example, some functionalities that you can easily map to cancer hallmarks — for example DNA replication, which is how the cancer grows. If you take samples from many cancer biopsies — in that case kidney cancer, I think; we chose kidney cancer because there are lots of data on it — you measure the activity that you infer from the model based on the gene expression, and you make a survival plot. You can see that patients with DNA replication highly activated have a bad prognosis — significantly bad; this is very significant. You can identify other cancer hallmarks, like anti-apoptosis: patients with anti-apoptosis activated die more. Patients with an inactivation of cell adhesion, which enables metastasis, have a bad prognosis as well. Patients with activated angiogenesis die more. So in the end, if we identify — in that case cancer hallmarks, but in other cases some function related to our disease or our phenotype — we can easily follow the activity state based simply on the gene expression levels. But there is something about these models that I like even more. The models are very interesting because you can simulate conditions that do not exist yet. You can take a condition and simulate a new condition in which you knock out a gene, for example, and compare, and say: OK, what is the difference between the previous condition and the condition with the knockout?
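The in-silico knockout described here can be sketched very simply: replace a gene's expression by (nearly) zero and recompute the pathway endpoint, then compare with the unperturbed condition. Everything below is a toy — the gene names and the multiplicative endpoint are invented stand-ins, not the actual model computation.

```python
def endpoint_activity(expression):
    # Toy endpoint: the effector fires only if the signal
    # survives the whole hypothetical chain A -> B -> C.
    return expression["A"] * expression["B"] * expression["C"]

def simulate_knockout(expression, gene, low=1e-6):
    """Simulate a protein knockout by substituting its expression
    with a very low value, as described in the talk."""
    perturbed = dict(expression)
    perturbed[gene] = low
    return endpoint_activity(perturbed)

expr = {"A": 0.9, "B": 0.8, "C": 1.0}
baseline = endpoint_activity(expr)
# effect of each single-gene knockout on the endpoint
ko_effect = {g: baseline - simulate_knockout(expr, g) for g in expr}
```

In this toy chain every gene is essential, so each knockout nearly abolishes the endpoint; in a real map, knockouts that do and do not affect the function can be distinguished this way before doing the experiment.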
And see what happens. So you can sort of forecast what would happen in different situations. You can simulate, for example, the activity of drugs, or whatever you like. If you try to simulate something with 20 knockouts and 15 over-expressions, you will probably distort the system too much, and I am not sure about the result; but for just one or two knockouts, it can be interesting. So what you do is take the first condition and do the knockouts. How do you do a knockout here? It is very easy: you simply substitute the value of gene expression by 0, or by a very low amount. In that way you simulate a protein inactivation, which is actually like a protein knockout, like removing the gene. So you can force a situation in which you have a knockout that does not affect the function, or a knockout that really affects the function; before doing the experiment, you can sort of see what would happen. And actually, this is what we did. I like this paper very much, and I always say that when I published this paper I felt very happy, because this is what we call the revenge of the bioinformatician: one case in which we came up with the hypothesis, tested the knockouts in silico, and then went to an experimental group and said, please, can you test whether this is real or not? They tested it, and it was real. We predicted the knockouts that would not kill the cell but would clearly damage its ability to replicate. So this is something that can work, and we actually published it in Cancer Research. We went to the cancer battlefield and we won with our predictions. It was not very well accepted in the beginning, but it was nice. Here is another very nice example, in which what we did was use single-cell data to see what happens in a population of cells. In that case it is glioblastoma. Typically you give the treatment, which is bevacizumab.
Typically you remove the cancer, so there is no cancer visible, and in nine or ten months the cancer comes back: there is a relapse, and the cancer is resistant to the treatment. So you say, oh, the cancer has acquired resistance. But if we simulate the knockouts caused by bevacizumab in the cell population, what we see is that most of the cells were killed, but a few of them were not. And why were they not killed? Because bevacizumab is mainly anti-VEGF, and VEGF is at the beginning of the VEGF pathway, which triggers a lot of processes related to cell proliferation. What is happening in the resistant cells is that they do not express VEGF. Actually, they are expressing this other protein, PDGFD, which also triggers the same pathway. So you are trying to inhibit a protein that is not there. This is probably a silent clone: the silent clone was there from the beginning, not very successful, but once you kill the dominant clone, the silent clone grows up. And when it takes over the brain again, you try to kill it with an anti-VEGF drug, but these cells are not using VEGF to grow. So these are very nice ways in which you can dissect what is going on at the cell level. But what is the problem with this approach? The problem is that we rely on the circuits that are drawn in the pathways, and only one-third of the genome is part of these pathways. So for two-thirds of the genome, if those genes are relevant for the disease, we cannot include them in the model. It is a pity. I must say that this very nice cartoon was made by Cankut Çubuk, who was our student in the previous ITN network, and I use it very, very much. So we have this problem: this kind of modeling needs a lot of biological information that we do not have for most of the genome.
The problem is that the generation of this biological knowledge — drawing one of these arrows — takes a lot of time. Typically it is several laboratories working to demonstrate that this gene actually activates that other gene, and only then can you finally draw a causal line: this gene causes the phosphorylation of this protein, or whatever. So this is a problem; we cannot wait 50 or 100 years for all these arrows to be drawn. One thing we thought was: would it be possible to use machine learning to learn the biology? To let the system learn the biology. Well, the problem is that machine learning has been applied in many scenarios in which you have a very good balance between the variables you have to learn and the samples. In biology, if we try to learn all the biology — meaning all the possible direct interactions between proteins — we still have a lot of variables and comparatively few samples. And actually it is not only a matter of the curse of dimensionality; it is that the relationships between genes are much more complex than the relationships between pixels in a picture. A pixel in a picture is related to the pixels around it and makes sense with the pixels around it, but genes sometimes have crazy, long-range connections. So it is not a problem at the same level. The second thing we can do is reduce the dimensionality of the problem. We are not interested in ruining the careers of all the biologists in one month and discovering everything; but something we can do is say: OK, let us see whether some proteins of interest can be related to the current knowledge we have. That is a problem which is probably affordable, because the dimensionality is much lower. So now I am going to switch to COVID.
We had some COVID funds to do some things, and one of the things we did — well, we were participating in the COVID-19 disease mapping effort, in which they drew up a very nice and very detailed map of the whole process of virus infection and all the downstream consequences: inflammation, how the virus triggers the immune system, et cetera. In that case it was very easy, because we had a very detailed map; we did not need to do anything to obtain the disease map. What we wanted to know was: what are the connections between the targets of drugs approved for other diseases and the COVID disease map? And not to all parts of the map, but to the specific part of the map we were interested in. The advantage of having these maps, with these functions at their ends, is that you can focus on specific parts of the disease — inflammation, the immune system, or whatever. So we modeled this map, and actually we have a version of the HiPathia model that specifically includes the COVID map. And what we did — Carlos is here if you want to ask for specific details on the methodology — but in the end the idea was: let us try to explain the behavior of the disease map, in that case the COVID map, as a function of the different targets of drugs that are already in use. If we can manage to explain the behavior as a combination of one or several drug targets, then probably those drugs will have an effect on the map. That was the idea. We used Shapley additive explanations (SHAP) to look for the specific relevance of specific variables — in that case, drug targets — and we drew maps of the activity of these drug targets. We found different situations. For example, the famous chloroquine acts on the map, but it acts on absolutely all of the map, as do many other drugs.
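The SHAP idea used here can be illustrated with a toy exact-Shapley computation. The real analysis uses the SHAP library on a trained model; this sketch just enumerates all orderings of an invented three-target "pathway response" function (names T1–T3 and the response values are hypothetical), which is only feasible for a handful of targets.

```python
from itertools import permutations
from math import factorial

def shapley(players, value):
    """Exact Shapley values: average marginal contribution of each
    player over all orderings of the coalition."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            phi[p] += value(coalition) - before
    n_fact = factorial(len(players))
    return {p: v / n_fact for p, v in phi.items()}

def pathway_response(targets):
    # Invented response: T1 contributes 0.5, T2 contributes 0.3,
    # hitting both adds a 0.2 synergy; T3 does nothing.
    v = 0.0
    if "T1" in targets: v += 0.5
    if "T2" in targets: v += 0.3
    if {"T1", "T2"} <= targets: v += 0.2
    return v

phi = shapley(["T1", "T2", "T3"], pathway_response)
```

By construction the attributions sum to the full response, the irrelevant target gets zero, and the synergy is split equally between the two interacting targets — the properties that make Shapley values attractive for ranking drug targets by their effect on the map.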
I mean, probably burning a cell is very efficient, but it burns you as well: it is efficient for combating the disease, but also for combating the patient. So we were focusing on drugs that were more specific to certain processes, and we managed to produce a list of drugs. By that time we saw a publication reviewing the trials that were testing treatments for COVID, and all the drugs in trials whose target was known had been predicted by our method. There were treatments that were less specific, like gas inhalation, for which we do not know the target; but for the drugs with a known target, all of them were predicted. OK, that could be good, but we wanted a stronger proof. So we used the database that we have in Andalucía. Andalucía — you know, the south of Spain — is the largest region in Spain and actually the third largest region in Europe. It has a population of 8.5 million, so it is the same size as Switzerland or Austria; it is like a medium-sized country in Europe. An advantage we have is that the whole health system is digitalized, and the whole health system dumps its data into a large database, and these data are stored in a way that can be queried — they are structured data. We also have unstructured data, but we have lots of structured data. We have records there for 13 million people, so if it is not the biggest, it is one of the biggest databases with detailed clinical information that exist. Something we did was look for COVID patients there, for the first wave — we started with the first wave.
We had on the order of 17,000 patients, and some of these patients were receiving other treatments for other reasons — vitamins, whatever. They were on these treatments, they got infected, and we compared what happened to the patients who were on a given treatment with the patients who were not, taking into account all the covariates. Actually, we managed to set up a very nice circuit, because you know that accessing clinical information is not easy in general; it is protected, and obviously it has to be protected — nobody wants their own life exposed. I understand that, but at the same time it is a problem for doing a lot of studies. So we thought: what is the problem? The problem is extracting the data from the health system. OK — what if we do not extract the data from the health system? We managed to put some computing facilities inside the health system, so we can analyze the data within the health system, and then they say: OK, we are happy with this. So we set up this circuit in which, essentially, we propose a study, we pass the three data committees, we write the evaluation of the impact on data protection, and then — since there is no impact on data protection and we have the approval of the committees — they provide the data and we can do the analysis. The only things we make public are the results. That was very nice; I am not going to talk about it here, but it has completely changed the way we can do research in Andalucía, and we are trying to open it to everybody. Finally, what we saw is that there were 21 treatments that were highly effective — they clearly protected the patients — and actually there was one which is very bad; it is counterproductive.
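A minimal sketch of this kind of real-world-evidence comparison is below: compare outcomes in treated versus untreated patients while stratifying on a confounder (here a hypothetical age grouping), using the Mantel-Haenszel common odds ratio. All counts are invented for illustration; the actual study adjusted for many more covariates with more sophisticated methods.

```python
def mantel_haenszel_or(strata):
    """Common odds ratio across strata.

    Each stratum is a 2x2 table (a, b, c, d) =
    (treated & died, treated & survived, untreated & died, untreated & survived).
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Invented counts, stratified by age group to control that confounder
strata = [
    (5, 95, 20, 180),   # younger patients
    (15, 85, 60, 140),  # older patients
]
or_mh = mantel_haenszel_or(strata)  # < 1 suggests the treatment is protective
```

An odds ratio below 1 (with an appropriate confidence interval, omitted here) is the kind of signal that flagged the 21 protective treatments; a ratio well above 1 would correspond to the counterproductive one.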
And actually, for most of them, since we also have analytical data, we can follow, for example, the lymphocyte counts, and we see that for these patients the lymphocyte counts were also compatible with an improvement in health. The interesting thing is that among these treatments there is an enrichment of our predictions — not all of them; for example, the first one was not predicted, but the second one was. For me this is the definitive proof that the predictions were relatively good, because there is a statistically significant enrichment of predictions made using the model. So if we know that this model is good, we are in a very nice situation, because we have walked the whole road of scientific discovery — the scientific method proposed by Galileo Galilei, in which you formulate the hypothesis, do experiments, and check whether the observations fit the hypothesis — without doing an experiment. Why? Because there are lots of data available, so we can do everything without doing a new experiment. That does not mean that experiments are useless, because these data were obtained by previous experiments. What I mean is that we now have so much data produced by experiments that in many cases you have the data right there, which is very important. So, just to finish, we are applying this concept to rare diseases, with the idea that, instead of focusing on diseases one at a time, we focus on the diseases as perturbations of the whole cellular mechanism. We say: OK, this disease is characterized by mutations in these three genes; these three genes are in this part of the pathway; so we have a small representation — not complete, but a small representation — of the disease map of this disease.
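Building such a partial disease map can be sketched as a sub-network extraction: keep only the pathway edges that touch the genes mutated in a given rare disease. The gene and edge names below are invented placeholders, not real pathway content.

```python
def partial_disease_map(disease_genes, pathway_edges):
    """Return the sub-network of pathway edges incident to any disease gene,
    plus the nodes those edges touch."""
    sub = [(src, dst) for src, dst in pathway_edges
           if src in disease_genes or dst in disease_genes]
    nodes = {gene for edge in sub for gene in edge}
    return nodes, sub

# Hypothetical pathway: receptor R1 signals through G1 and G2 to an effector,
# and an unrelated branch R2 -> G5 elsewhere on the map.
pathway = [("R1", "G1"), ("G1", "G2"), ("G2", "EFFECTOR"), ("R2", "G5")]
nodes, sub = partial_disease_map({"G1"}, pathway)
```

A disease with mutations in G1 yields the small R1-G1-G2 fragment and leaves the unrelated branch out — a partial, possibly incomplete, representation of the disease, exactly as described above.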
Maybe we have missing parts, but at least we have a part of the disease. We had a little more than 150 rare diseases with mutations within the known pathways. So we managed to treat all the diseases as parts of the map of cell behavior: this disease is here, this disease is here, this disease is here. I will show you another slide on this later — let me check the time... OK, I am almost finished. The idea, then, is to do something similar to what I showed before: using gene expression data, we try to learn whether each specific disease map can be explained by a combination of the targets of the drugs. And the idea is to do this systematically. We have the whole map, and disease by disease we map the genes onto parts of the map, we build these small partial maps of the diseases, and we try to see which drugs could be acting there. It is not perfect, but it is something that can be done systematically, and in one shot you can propose a lot of treatments for a lot of diseases. So the idea is: we have the genes and the current knowledge; we have the specific disease maps and the models; we do the machine learning for each specific disease; we look for the most relevant targets affecting it; and then we go for the validation. Interestingly, if you have a look at the clustering, you see that there are different subclusters. So in the end, as we suspected, many rare diseases share mechanisms, and probably drugs can be used for more than one rare disease.
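One minimal way to see that mechanism sharing is a Jaccard similarity between the gene sets of each disease: diseases hitting overlapping parts of the map cluster together, and drugs may transfer between them. The disease names and gene sets below are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two gene sets."""
    return len(a & b) / len(a | b)

# Hypothetical rare diseases and the pathway genes they mutate
disease_genes = {
    "disease_A": {"G1", "G2", "G3"},
    "disease_B": {"G2", "G3", "G4"},   # overlaps A: shared mechanism
    "disease_C": {"G9", "G10"},        # unrelated part of the map
}

# Pairwise similarities (each unordered pair once)
pairs = {
    (d1, d2): jaccard(disease_genes[d1], disease_genes[d2])
    for d1 in disease_genes for d2 in disease_genes if d1 < d2
}
```

Feeding such a similarity matrix into any standard clustering method produces the kind of subclusters shown on the slide; in practice the similarity would be computed on the modeled pathway activities rather than raw gene sets.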
So we have a couple of validations. This was published a couple of years ago: two treatments that we predicted for Fanconi anemia were validated experimentally, and now there is a systematic validation of all the treatments we proposed for retinitis pigmentosa. Since we work with people in the Spanish Network for Rare Diseases, we are collaborating with several groups that are doing the specific validations. It will take time, but in the end what we provide them with is a list of potential drug candidates they can start with, instead of having to guess what the drug could be. Well, this is a bit of publicity: we have some software you can use if you want to use the models. This is the people, and our supporters, and let me just show you this last slide about a workshop. This is the list of people — for example, Antonio is there, so he is going to participate, and some of you are attending — and this is another place where you can do networking, which is important. So thank you very much, and if you have questions I will be happy to take them.

Thank you, Joaquín. I have questions from the audience. Giovanni?

Thank you for this very interesting talk. I have a couple of questions on the whole talk. One of them is about these explanation methods like SHAP, which I am a bit familiar with. The problem is that these post hoc methods are often a little unreliable, or vulnerable to adversarial attacks or other issues. Have you tried to find alternatives — other than just making predictions and then validating the drugs — like using multiple of these interpretation methods, or something that could give you an idea, before the experiments, of whether what you are hypothesizing as a reason can be considered valid?
Yeah, well — apart from the fact that Carlos can give you a more detailed explanation later — the reason why we focused first on SHAP is that, typically, we go very fast; we need to solve problems quickly. It was simply a way of seeing the contribution of each of the variables, which was more difficult to obtain from the model itself. I do not think adversarial attacks are relevant here; that is not the case here. Since we are making predictions based on predictions, et cetera, we were not going to be very picky with the methodology chosen. It was only the necessity of figuring out which of these variables was having a bigger or stronger effect on the pathways.

Thank you. And, a curiosity, if I can ask quickly before somebody else asks a question: you showed that you used, for example, KEGG pathways, and the few times I have looked into them they are a bit of a hodgepodge of genes and metabolites and maybe sub-pathways. How do you actually convert something so complicated into a relatively simple model like the ones you were showing?

Yeah, well, I did not mention that. Dealing with pathways is a nightmare. Actually we are having problems using KEGG pathways, because KEGG has become — not private, but you have to pay for rights or something, which is a bit problematic. The point with KEGG is that there are these metabolites, but the metabolites can be easily removed; the pathways we use are essentially signaling pathways, with proteins acting on other proteins. We would have preferred to use, for example, Reactome, because we have a lot of relationships with the EBI and they are pressing us to use Reactome. The problem with Reactome is that it has not only metabolites; many parts of the map represent, for example, how different proteins form a complex.
All those arrows cannot be modeled, because what we have is a snapshot of the gene expression. Our view is: if we have all the proteins, the complex is there; we need only one node containing the different proteins. So it is very difficult for us to convert all those arrows that are not functional activities but other representations of biological knowledge — something like 50% of the arrows in Reactome are this kind of thing. That content is not useful from the point of view of what we want to do with the map, which is to put gene expression data on it and see what happens. So it is not as immediate as putting the map into the models; you have to do a lot of manual curation.

Thank you.

OK, thank you very much, Joaquín, for being here and for the great presentation. I will keep it short in the interest of time, but just two quick questions. The first goes back to those mechanistic models you showed at the beginning: is there any work on longitudinal aspects of these, like how the connectivity evolves over time, during the development of organisms, for example? And the second is about this huge database of healthcare data in Andalucía that you showed: what is the prospect of actually accessing that database? Thank you.

OK, I am going to answer the second one first. We have some instructions for using the data. It is something very similar to what I drew there. First, you have to ask permission from the ethics committee; if your study is reasonable, you will get the permission, for sure. The second, most problematic, step is to pass the evaluation of the impact on data protection. Typically you fill in a series of questions: is the data going out? If it is going out, how do you guarantee that the data is not spread, that you are not going to try to re-identify patients, et cetera, et cetera.
What happens is that if you take the data out of the health system, you tick one of the most horrifying boxes, and then you do not get the approval. So what we did was set it up in a way which is not perfect, but it works: essentially, you have to ask us to do the job. What we are trying to do now is build a data system by means of which, once you get the approval of the ethics committee, you can manage the data without having direct access to it — something like using a virtual monitor: you can do things, but you cannot copy the file outside. This is only a technical problem, and we are trying to see how to solve it. As soon as it is solved, it will probably be more open to you, because we want to become leaders in the exploitation of clinical data. And the first question was... I do not remember, sorry — something about connectivity?

The longitudinal evolution of these models — how the connectivity evolves during development, for example.

What connectivity, if you...

The connectivity that is represented in the mechanistic models that you showed at the beginning. I was wondering whether there is any work on how those evolve over time, during development, for example.

No, not as far as I know. I remember that we did a very simple study using Gene Ontology enrichment, where we saw how the functions evolved over time in a developing organism, I think. It is interesting to see how functions move across time. But no, as far as I know — there are probably some studies out there, but I do not know of them.

Thank you very much, Joaquín. Thank you very much for asking the questions, Otto. Thank you for opening our symposium. So, a round of applause. Thank you.