Thank you to everyone for your patience and for still being here after a very long day of deep learning and related approaches. What I'm going to present is a case study: we have been working on a specific problem, and after trying several approaches we also applied deep learning, and that is what I will present today. First, let me thank my team and the people involved in this project, mainly Chabna Matai, who is also following the course today and who applied the deep learning in this context. The outline of my presentation is four points: a short introduction to the research question, the methodology, then some results, and conclusions and perspectives. It's quite systematic; rather than surveying many works, I will present one specific project where we applied a fairly systematic method, hoping it shows how we work in some cases. The research question is related to phage therapy. Phage therapy is an alternative way of dealing with pathogenic bacteria, or bacteria in general, using their natural enemies. It was more or less the standard before antibiotic therapies at the beginning of the 20th century, and once antibiotics came into the first line, phages were more or less forgotten. The concept is simple: there are viruses, the bacteriophages, that infect bacteria in order to reproduce. They are the bacteria's natural enemies and have co-evolved with them for millions of years. If you can identify a phage that is able to infect a specific bacterium, you can use it to reduce that bacterial population.
The good thing is that these phages are highly specific to their target bacteria, so they usually don't pose problems for the people receiving the therapy. But finding the right phage can be relatively hard: as bacteria evolve, phages are also evolving, so it's not easy to say "this is the phage for that bacterium", because both are changing. So we need some kind of phage selection process, which can be relatively demanding in laboratory resources. This is the context we are interested in. There are computational approaches that are not necessarily based on machine learning; some solutions go through phage banks, databases, and knowledge bases that allow some kind of selection. But in recent years there have been many predictive approaches that try to say which phage could be the most active against a given bacterium. Most of these are based on similarity, and more and more are based on predictive models. I'm not going to present the state of the art of all of them, but I will say a little about our own predictive approaches, which are intended to work only on genomes. That was a decision we took at the beginning of the project with our partners; it is not necessarily the only possibility, but it is the condition we decided to accept. We first developed a more or less conventional, in quotes, machine learning approach based on feature engineering: extracting features from these genomes and then processing them with classifiers. Once we obtained results with that approach, we saw some of its limitations and difficulties.
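To make the feature-engineering baseline concrete, here is a minimal sketch of one common genome featurisation, k-mer frequency counting. The talk does not specify which features were actually extracted, so this is only an illustration of the general idea, not the project's real pipeline:

```python
from collections import Counter
from itertools import product

def kmer_features(genome: str, k: int = 3) -> list[float]:
    """Count all overlapping k-mers in a DNA string and return a
    fixed-order, length-normalised frequency vector (4**k entries)."""
    genome = genome.upper()
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = max(sum(counts.values()), 1)
    # A fixed alphabetical ordering of all possible k-mers keeps the
    # feature vector comparable across genomes of different lengths.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total for w in vocab]

# A phage-bacterium pair can then be represented by concatenating the
# two vectors before feeding a conventional classifier.
pair_vector = kmer_features("ACGTACGT") + kmer_features("GGGCCC")
```

Such fixed-length vectors are what makes conventional classifiers applicable, but choosing k (and the features in general) is exactly the engineering burden that motivated the move to deep learning below.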
We then decided to go for a deep learning approach, not only because it has become quite normal to use deep learning in many contexts, but also because in principle deep learning allows us to avoid the feature engineering part: we are never sure that we are looking at the right features, so let's see whether deep learning can take care of that. That takes us to our research question: how could we predict the interactions between phages and bacteria based only on their genomic sequences, using deep learning? For that, we need data. The methodology is a conceptually simple pipeline: collect the data, prepare it, make decisions based on some exploratory data analysis, apply the models, and finally analyze the results. For the data collection we used two sources. The first is what we call the public data set, extracted from public databases such as FHDB and GenBank. We were trying to predict at the strain level, and from that source we obtained almost 100 bacterial strains, and almost double that for the private data set. The private data set is one we obtained with our partners in a previous project, where we were partnering with the University of Lausanne and an institution in Bern. From them we obtained a much more structured data set, with more bacterial strains but, as you can see, a much smaller set of phages, and a more or less equivalent number of interactions. I will come back to that a little later. After that, we needed to prepare this data.
Some of the preparation operations were removing phages, or bacteria, whose genomes were not of a plausible size, because those were errors in the data set, and removing duplicated entries, where different identifiers sometimes refer to the same genome, and so on. Cleaning these data sets reduced the number of interactions by around 10%. I concentrate on interactions because that is what we want to predict: whether a bacterium is infected, or potentially infected, by a given phage. The interaction is the center of our interest, and we had almost 7,000 interactions at our disposal for this project. We also looked a little at the data; we did more than that, but the first thing that comes to mind is that for the public data set we created a more or less balanced data set, knowing that the public databases report no, or almost no, negative interactions, while in the private data set the negative interactions were validated experimentally in the laboratory. I will come back to that later. The second thing is the genomes: their lengths are quite variable, much more so in the private data set, while in the public data set we have many bacterial genomes close to 7 million bases. On the phage side the distributions are relatively similar, although the number of phages is very different. That gives a first picture of the genome sizes we are dealing with: several millions of bases in one case, and up to hundreds of thousands of bases in the other. To continue: for deep learning, in the first kind of model we tried, we needed fixed sequence lengths.
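The cleaning operations just mentioned, dropping implausibly sized genomes and collapsing duplicate identifiers, can be sketched as follows. The length thresholds and record layout here are hypothetical placeholders, not the project's actual values:

```python
def clean_records(records, min_len, max_len):
    """Drop genomes outside a plausible length range and collapse
    duplicate entries that share the same sequence under different ids.

    records: iterable of dicts like {"id": ..., "seq": ...} (assumed layout).
    """
    seen = set()
    kept = []
    for rec in records:
        if not (min_len <= len(rec["seq"]) <= max_len):
            continue  # genome size outside the plausible range: likely an error
        if rec["seq"] in seen:
            continue  # same genome already kept under another identifier
        seen.add(rec["seq"])
        kept.append(rec)
    return kept
```

Interactions pointing at removed genomes would then be dropped as well, which is where the roughly 10% reduction in interactions comes from.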
So we had to make some decisions. We decided to limit the sizes to 200,000 bases for the phages and 7 million bases for the bacteria; note that we have two kinds of organisms with very different genome lengths, which is itself a challenge I will come back to in one or two slides. To finish here: shorter sequences were padded with zeros, and longer sequences we had to truncate, the chosen sizes being intended to lose as little information as possible. Finally, we encoded everything, and that is the data set we used for training the approach. As usual, we separated it into training and test sets, in this case 65% and 35%. I won't spend much time here, but then we needed to come up with an architecture. As I mentioned, one of the challenges is that we have two organisms, whole genomes from both, and very different genome lengths. If we simply concatenated the bacterium and the phage of a given interaction, the information coming from the bacterium would be much bigger than that of the phage; the factor is 35, and in some cases even much bigger. Because of that, we came up with an approach where we use two different neural networks to separately pre-process, if you want, both organisms' genomes, and at the end we combine them for the final interaction prediction. Looking at the architecture more schematically: first, we use two separate stacks of convolutional layers, one for the bacterial genome and one for the phage genome. We call that multi-context modeling, because each organism is learned separately.
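Before continuing with the architecture, the sequence preparation just described can be sketched in code. One-hot encoding is an assumption on my part; the talk only says the sequences were encoded, without naming the scheme:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_genome(seq: str, target_len: int) -> np.ndarray:
    """One-hot encode a DNA sequence into a (target_len, 4) array,
    truncating longer sequences and zero-padding shorter ones."""
    out = np.zeros((target_len, 4), dtype=np.float32)
    for i, base in enumerate(seq[:target_len].upper()):
        if base in BASES:           # ambiguous bases (N, etc.) stay all-zero
            out[i, BASES[base]] = 1.0
    return out

# Lengths chosen in the talk: 200 kb for phages, 7 Mb for bacteria.
PHAGE_LEN, BACT_LEN = 200_000, 7_000_000
x_phage = encode_genome("ACGT" * 10, PHAGE_LEN)   # short genome: zero-padded
```

Zero-padding leaves the tail rows all-zero, so a convolutional layer sees "no base" there rather than a spurious nucleotide.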
But since we are predicting interactions, and not independent characteristics of each organism, we finally merge both of them in a fully connected layer. That may remind you of what we just saw in the last presentation, with partial latent spaces combining different kinds of information; this is something similar. Notice that the bacterial convolutional part is deeper than that of the phage, so that when we arrive at the final prediction layers (there are two of them), the amount of information from each side is roughly the same, and there is no phenomenon of completely losing the phage genome information. That was one of the most important decisions at a given moment, because we were having problems when trying to put everything together, or even when learning with the same architecture on both sides, and that balance of information is quite important for the final decision. So we applied that, and we obtained some results; I won't take too long on them. On the test set, the 35% I mentioned, for the full set of bacterial strains and phages in our data set, we obtained performances from around 72-74% precision to roughly 85-86% accuracy, and around 80% for the F1 score. After that, we were also interested in how it performs for the different species, and at that moment we saw that it is not the same for each one. There is only one model; we were quite ambitious and said, okay, we are going to have a universal model for every bacterium and every phage, or at least those in our databases. And you can see that while, for example, for Staphylococcus aureus we have very good performances, knowing that we only have 68 data points there.
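As an aside, the reason a deeper bacterial branch balances the two sides can be shown with a back-of-the-envelope length calculation. The pooling factor and branch depths below are hypothetical, chosen only to illustrate the effect, not the actual architecture:

```python
def branch_output_len(input_len, n_pool_layers, pool=8):
    """Sequence length after n conv+pool blocks, each dividing the
    length by `pool` (integer division, as pooling layers do)."""
    length = input_len
    for _ in range(n_pool_layers):
        length //= pool
    return length

# With a deeper bacterial branch (6 blocks vs 4), the ~35x longer
# bacterial genome is reduced to roughly the same number of positions
# as the phage genome before the shared fully connected layers.
phage_feat = branch_output_len(200_000, n_pool_layers=4)     # -> 48
bact_feat  = branch_output_len(7_000_000, n_pool_layers=6)   # -> 26
```

With equal depths, the bacterial side would instead dominate the merged representation by the same factor of about 35, which is the "losing the phage information" phenomenon described above.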
And the recall can go down to 27% in other cases; it is quite variable, and these are only four species that we selected here. So this universal model is perhaps less universal than we wanted, but this analysis also shows that in general it is relatively good, with some things still to address. We could go even deeper into the analysis of the results, but let me stop here and move to some conclusions and perspectives, trying not to finish too late. So we consider that our models are relatively good. The performance is okay; we wanted, and still want, higher classification performance, but that is an opportunity for improving these results. The other point is that, because of the clinical context of the project, we are really interested in predicting at the strain level. What we had at the beginning as just a condition clearly became a hard constraint. When we compare our approach with other works, we could see that most of the reported works presenting 95% or 97%, when you look more carefully, work mainly at the genus level or sometimes even higher taxonomic levels. The host prediction problem there is much coarser than what we were looking for; only some of them go to the species level, and we were really looking at the strain level.
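The per-species breakdown mentioned above amounts to computing precision, recall, and F1 separately within each species group. A minimal self-contained version (not the project's actual evaluation code) looks like this:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision/recall/F1 from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def per_species_report(records):
    """records: (species, y_true, y_pred) triples; one metric row per species."""
    by_species = {}
    for species, t, p in records:
        by_species.setdefault(species, ([], []))
        by_species[species][0].append(t)
        by_species[species][1].append(p)
    return {s: precision_recall_f1(t, p) for s, (t, p) in by_species.items()}
```

Grouping like this is also what exposes the small-sample problem: a species with only 68 data points gives metrics with very wide uncertainty.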
In that light we can relativize a little the performances we are obtaining, and we still need to explore how our approach would compare if we used data sets similar to those of the other works. That is something still to do, and it is normal that it is difficult to compare with other works. We are aware of our limitations, and clearly the data distribution is not the best one, but it is what we obtained from these sources. While we tried to keep the public data set artificially balanced, the reality from the private data set, the experimental, controlled one, is that it is very unbalanced: there are many more negative interactions because, as we said, phages are quite specific, so they do not infect just any strain or any species of bacteria. It is normal for the data to be quite unbalanced, and having an artificially balanced data set is perhaps something that should be revisited. The other thing, mainly in the public data, is that we have a lot of information about a few species of wide interest, while several others are underrepresented. The universality we were looking for, taking the biggest possible set of bacteria, came at the price of having some of them poorly represented. That could explain the uneven predictive power, and we need to deal with it to stabilize the models, but the concept still remains relatively satisfactory. The other thing I already mentioned is that negative interactions are largely missing from the public data: unless people do a lot of experimental work only to obtain negative interactions, the absence of an interaction is rarely reported, and negative results are not part of publications.
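One common alternative to artificially balancing the data set is to keep the natural imbalance and re-weight the loss instead. This sketch uses the standard inverse-frequency heuristic; it is a general technique, not something the talk says the project used:

```python
def balanced_class_weights(labels):
    """Inverse-frequency class weights so the rarer class contributes
    as much to the loss as the dominant one (the 'balanced' heuristic
    used by e.g. scikit-learn)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

# A strongly specific phage-host matrix might be ~90% negative:
weights = balanced_class_weights([1] * 10 + [0] * 90)
```

The resulting dictionary can typically be passed to a training loop (e.g. a `class_weight` argument) so that each positive example counts about nine times as much as each negative one here.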
So we were obliged to use some guesstimation about the negative interactions: knowing that phages are highly specific, we could more or less do that, but it is not 100% certain in every case, so the negative cases in the public data set are perhaps not very precise or very representative of reality. Knowing all that, we are exploring what to do next. First, as usual, we will try to improve and expand the reach of the current approach. As I mentioned, the data quality should be improved, but we also want to explore the behavior of our approach at different taxonomic levels, from genus to species to strain. In some contexts we are even dealing with variants that evolve over relatively short, in quotes, times, so it is the same strain in principle but with variants, and the granularity is much finer. We also have several ideas we want to build on top of this approach to explore novel directions, related to the state of the art in phage therapy. It is well known that phages alone are not necessarily very effective against some bacteria; usually you need to use cocktails, to avoid resistance developing in the bacteria and to avoid a relatively mild effect. That is already known, but there are also investigations on the synergy between phages and antibiotics, and there is an interest in engineering phages to improve their capacity; I am going to speak a little bit more about that. One more line we would like to explore is the use of explainable artificial intelligence methods.
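The "guesstimation" of negatives described above is often implemented as negative sampling: pairs never reported as infecting are presumed negative, on the grounds of phage specificity. A hedged sketch of that idea (function name and data shapes are mine, not the project's):

```python
import random

def sample_presumed_negatives(phages, hosts, positives, n, seed=0):
    """Draw n phage-host pairs never reported as infecting and treat
    them as presumed negatives. This is only a heuristic justified by
    phage specificity; the pairs are not experimentally verified, so
    some may in fact be unreported positives.

    Assumes n is smaller than the number of non-positive pairs.
    """
    rng = random.Random(seed)
    positives = set(positives)
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(phages), rng.choice(hosts))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)
```

This is exactly why the public-data negatives may be imprecise: any unreported true infection sampled this way becomes label noise.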
The idea is to extract from the models we have built, or will build, some mechanistic insights: which parts of the phage genomes are responsible for higher efficacy, for a better effect, or also for failing to infect a bacterium. This is something we would like to explore, given that we also work on explainable artificial intelligence. Finally, the second part of the title of the presentation was "towards phage genomic editing", and that is a relatively exploratory idea I would just like to present. The idea is to engineer these bacteriophages so as to increase their antibacterial action, their host range, their infection efficacy, and so on. The common methods are recombination-based, and there are also methods called genome rebooting; all of this is very experimental work. I am not a biologist, so I try to understand how that works from a biological and experimental point of view, but the idea that is more appealing to me is to engineer the genomes: given that we are already working with genomes, how could we modify them so as to improve the activity? And I put it here in red: our idea is to look for potential genomic interventions with in silico methods. From that point of view we have a new question: how could we modify the genomic sequence of those bacteriophages using deep learning models, with the goal of improving their therapeutic value? As an engineer, I have the vision that we can make some modifications and we would like to improve something, an optimization, so that calls for some kind of feedback loop: we have genomes, we have the model I just presented predicting the interaction of those genomes, and I should be able to take information from those predictions so as to modify the genome of the bacteriophage somewhere and improve its action.
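The feedback loop just described, predict, modify, re-predict, can be illustrated with the simplest possible optimizer, greedy hill climbing over single-base edits. This is only a toy illustration of the loop; the actual generators being explored (autoencoders, generative networks, LSTMs) are far more sophisticated, and the toy predictor below stands in for the real interaction model:

```python
import random

def hill_climb(genome, predict, n_steps=100, seed=0):
    """Greedy in-silico editing loop: propose a single-base
    substitution, keep it only if the predictor scores the edited
    phage higher against the target bacterium."""
    rng = random.Random(seed)
    best, best_score = genome, predict(genome)
    for _ in range(n_steps):
        pos = rng.randrange(len(best))
        base = rng.choice("ACGT")
        candidate = best[:pos] + base + best[pos + 1:]
        score = predict(candidate)
        if score > best_score:          # accept only improving edits
            best, best_score = candidate, score
    return best, best_score

# Toy predictor rewarding G+C content, a stand-in for the real CNN
# interaction model; real use would also need viability constraints.
toy_predict = lambda g: (g.count("G") + g.count("C")) / len(g)
evolved, score = hill_climb("ATATATAT", toy_predict, n_steps=200)
```

The single-base moves mirror the "very targeted modifications" mentioned below: small edits keep the candidate close to a known-viable genome, whereas unconstrained rewriting risks non-viable organisms.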
I know it is risky to say we are going to modify whatever we like, because we can end up with genomes that are not viable as organisms. I am aware that there are biological constraints, and because of that we are trying to make very targeted modifications. But the idea is this: we have specific bacteria, perhaps a reduced set of bacteria of interest; we have bacteriophages from our phage bank or other sources; and then we use the predictor we already developed, and its predictions, to drive a generator. This generator can take several forms: we have explored autoencoders, generative networks, LSTMs, and different architectures, and I would be very happy to present that work in a future version of this course or at another event. With that idea, I finish my presentation. Thank you very much for your attention, and thanks to my team, because they are doing the hard work of dealing with all this data. Thank you very much.