I welcome you all to this SIB Virtual Computational Biology Seminar Series. Today we have the pleasure to host Carlos Andrés Peña-Reyes, who is full professor in computer engineering at the University of Applied Sciences Western Switzerland, School of Engineering and Business Vaud, Informatics and Communication Technologies Department. His research team, named Computational Intelligence for Computational Biology, has also been affiliated with the SIB, the Swiss Institute of Bioinformatics, since 2016. Carlos studied computer science, electronic engineering, and biology in Colombia and in Switzerland, and in 2002 he completed his PhD on co-evolutionary fuzzy modeling at the Swiss Federal Institute of Technology here in Lausanne. From 2002 to 2004 he was a lecturer at the Computer Science Institute of the University of Lausanne and a postdoctoral fellow at EPFL in the Logic Systems Laboratory. From 2004 to 2007, Carlos worked as a research investigator and scientific technical leader at Novartis in Basel, in the Computational Systems Biology Department, before moving back to academia and becoming full professor at the University of Applied Sciences of Western Switzerland. The research conducted by Carlos's lab at the School of Engineering and Business focuses on the development of computational intelligence methodologies, mainly derived from machine learning and artificial intelligence, and on their application to real-world problems, such as those encountered in life sciences and biomedical engineering and involving data analysis and predictive modeling. So today Carlos will present some of the computational issues found in biomarker and diagnostic signature discovery. Looking forward to your talk, I want to thank you again for accepting my invitation. The floor is yours.

Thanks to you, and thanks to everyone for coming, physically or virtually. Well, I was just thinking that I put the title Computational Issues, which is quite broad. Perhaps it would be better to say some computational issues, or, to be more precise, two computational issues, one of which takes up most of the presentation. So the quantification of these issues is changing with time. I was thinking of presenting my group rapidly, but what we do has already been presented quite well. What we do is computational intelligence, which is related to nature-inspired computational methods, applied to complex real-world problems, and in particular to computational biology, which gives its name to my group, Computational Intelligence for Computational Biology. From our point of view, computational biology is the general study of biological systems, but by means of data analysis methods, mathematical modeling, and computational simulation; that is the computational part. From that point of view, you will see that we are combining methods from both worlds in several projects. A good way to understand what we do is perhaps to have a rapid look at some of our current projects, going from fuzzy logic for biomarker discovery with a company, or developing a test for monitoring viral diseases in fishes, also with two companies, one from Poland and one from France, or a project where we are trying to predict phage-bacterial interactions based on their genomes, trying to fight against antibiotic resistance, or exploring ways of developing new methods in computational intelligence, like rule extraction from neural networks.
We are also about to start a project investigating the health of vineyard soils, so as to assess how different sources of pollutants could affect that kind of soil. And we are preparing, with partners here in Switzerland and also from Zurich, a project where we intend to work on decision support in intensive care units. So that gives you an idea of the broad span of our projects, but at the same time I'm not going to speak about any of these, perhaps a little bit about one of them. I decided I would prefer to present something more related to projects that are already finished, or one that is still running, and that is biomarker and diagnostic signature discovery. In our context, we are interested in extracting useful features to help make good diagnostic decisions. Because of that, you will see that there are companies developing diagnostic tests which are partnered with us, or even a hospital, which is also interested in developing that kind of test. From this point of view, the vision we have is, let's say we have a blood-based test; it could be another kind of test. We, or the people developing or asking for the solution, are interested in quantifying some biomarkers and using them, with some kind of analysis, to make a diagnostic decision. In two of our projects, the goals were closer to biology. I mean, the goal was mainly to understand something, to identify entities or mechanisms that could be responsible for a given phenomenon: why this kind of cancer develops in that way, or which genes are more affected, and in which way, because of the presence or absence of a kind of cancer. But in most of the projects, the goal was more diagnostic. The goal was not to select exactly the biomarkers related to the actual phenomenon, but a set of biomarkers that could be good enough to predict the presence or absence of a given condition, and to put all of that into a test. And we have other constraints, mainly cost and, of course, predictive performance. So with this point of view, what we developed in all these projects is an approach where, from the data we have at the beginning, the idea is to have some kind of process where we select genes, or whatever kind of biomarker, and then produce a decision. And along that path, we are going to eliminate the biomarkers that we don't really need. I'm going to use two of our projects as examples all through the presentation. One of the projects was related to understanding the role of smoking in lung cancer, not contesting it, but understanding it. The more specific question was detecting non-trivial gene interactions associated with lung cancer. That was, more or less, the question from the biologists. We are not biologists, but we work with teams of biologists, so we try to give them tools to find answers. In the second case, the University Hospital of Geneva was interested in having a new test for classifying different subtypes of leukemia. For that, we used a data set which is already well known in this context. It was a complex multi-class classification problem, and we intended to obtain highly sensitive results. With these two projects, I'm going to accompany my presentation so as to exemplify the approach.
OK, if I summarize the process somehow: we have data, and nowadays we have a lot of data. We have a stage of biomarker selection, selecting the part of these data that could be predictive enough to allow us to produce a diagnostic signature. If you are interested only in the biomarker selection part and not in the exact diagnostic signature, we are speaking only about this block. But if you are interested in the full pipeline, then you also need to deal with the rest. The original data usually contain a lot of variables. It's more and more common to have different kinds of data involved in the decision; the two examples I'm presenting here are based on a single kind of data, that's another problem. For the biomarker selection, the idea is to find a very informative subset of the data that allows us to explain or to make decisions. And the diagnostic signature is a model, a computational or mathematical model, which captures the relationship between the selected biomarkers and the phenomenon of interest, the absence or presence of some condition. If I say all of that computationally, we have two problems: feature selection and predictive modeling. I'm going to address feature selection first and, at the end, one aspect of the predictive modeling part. So, first part, feature selection. For feature selection, there are three important questions that we have to ask first. One is, if we are going to search for biomarkers, for features in general, where and how do we start the search? One approach is forward search. You don't have any biomarker; you find one, then a second and a third, and you keep adding biomarkers up to the moment you are no longer improving. That's very simple to say, but it's not as easy to do, or at least to always obtain good results. The other way would be: OK, I have all the variables, in principle I'm able to do a very good job with them, and I start to eliminate features that are not informative enough. Perhaps at a given moment I'm going to increase the predictive power, then at some point I'm going to start reducing that power, and that's where I can stop my feature selection. Or, and I like this a lot, combinatorial search, where you try different combinations with some kind of strategy, and then you select the combination or subset that is best for you. Another aspect is how to organize that search. Once you decide to have some kind of search, how do you organize it? Exhaustive search would be ideal, but evidently the number of combinations of possible biomarkers is enormous. So you could use some of the search algorithms that are already known to have good behavior or performance. These algorithms might range from simple searches to advanced machine learning algorithms, and there are a lot of algorithms in between. But another question that is very important is how you evaluate these subsets. Because you are looking for subsets; you are not, or not yet, looking for the prediction. You are looking for the subset. So you have to evaluate the subset, and it's easy to mix both goals, and you will need the right metrics to determine how relevant these features, or these combinations of features, are. At this moment I could say, OK, here are the methods for feature selection. But with these methods, again, there is a quantification problem: there are a lot of them.
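To make the forward-search idea above concrete, here is a minimal sketch in Python; the synthetic data, the logistic-regression scorer, and the simple stopping rule are illustrative assumptions, not the setup used in the projects.

```python
# Greedy forward feature selection: start from an empty subset and keep adding
# the single feature that most improves a cross-validated score; stop when no
# candidate improves it anymore. Sketch only; data and scorer are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

def subset_score(features):
    """Cross-validated accuracy of a simple classifier on the candidate subset."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

selected, best = [], 0.0
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    scores = {f: subset_score(selected + [f]) for f in candidates}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best:          # no candidate improves the subset: stop
        break
    selected.append(f_best)
    best = s_best

print("selected features:", selected, "score:", round(best, 3))
```

Backward elimination is the mirror image (start with all features, drop the least useful one at each step), and combinatorial search replaces the greedy loop with a strategy that explores many subsets at once.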
There are a lot of these algorithms, these methods. So the first thing would be to understand a little bit how these methods work, instead of just knowing a list of the 10 most used or whatever. I would like to present here the work of one of my PhD students. In the frame of her thesis, she started doing a review of methods, and there are already several reviews out there. So in the end we decided to do a meta-review, or at least to review the reviews, try to extract some knowledge, and what we produced was a taxonomy of feature selection methods. I'm not going to go through all of it in detail, but there are some axes which are common to almost all the methods, some decisions that you have to make: selection management, how you deal with the selection; what type of evaluation you do, not which metrics you use, but how you organize or use these metrics to do the evaluation; how you use your data, supervised or unsupervised; the dimensionality and the classes; whether the model is linear or not; and whether some kind of prior knowledge is used or not. There are some classes here that I'm going to present a little bit later. When you have to evaluate a method for feature selection, it is interesting to always know how this method deals with each one of these axes. If I start with the axis which is perhaps the most accepted classification, we will see that most feature selection methods are either filters or wrappers. A filter selects features based on intrinsic properties of the data. I mean, you have the data and you have a search algorithm that uses these data, proposes a subset and, based on the information content or that kind of metric, is able to select some of the features. The important thing is that you are not using a specific classifier. Perhaps you are using class information, but you are not saying: I'm using a decision tree, or probability metrics, or whatever kind of classifier; what is used here are intrinsic properties of the data. That's what we call filters in general. Some are very simple filters, some are very complex filters, but most methods are filters. A wrapper, on the other hand, uses a specific classifier as a black box. I decide to use a support vector machine, and then my wrapper is going to propose a feature subset and look at how well this feature subset can be used by the classifier; with that, you can obtain accuracy, which is the most common, but you can also use other measurements or metrics. And then, at the end of this loop, you obtain the selected features. You see the two schemes look similar. In both cases I have drawn an iterative process, although it is not always iterative; for the wrapper it is very clear that it is iterative. Both of them are widely used. There are some others, called embedded, which are close to wrappers. And more recently there is a kind of hybrid, where you mix filters with wrappers, or several types of wrappers, or different approaches. So up to now it's only a taxonomy, and this is a short summary of what you can find when you look for feature selection methods. But there is a big issue there, my big issue today: selection bias. As I mentioned, there are a lot of feature selection methods. There are hundreds of methods already existing, and the number is increasing.
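As a concrete illustration of the filter/wrapper distinction, here is a small sketch using scikit-learn on synthetic data; the particular filter (mutual information) and wrapper (recursive feature elimination around a linear SVM) are just examples of each family, not the methods used in the projects.

```python
# Filter vs. wrapper, as described above (illustrative sketch on synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=8, random_state=1)

# Filter: rank features by an intrinsic property of the data (here, mutual
# information with the class); no classifier is involved in the selection itself.
filter_sel = SelectKBest(mutual_info_classif, k=10).fit(X, y)
filter_genes = filter_sel.get_support(indices=True)

# Wrapper: use a specific classifier as a black box; recursive feature elimination
# repeatedly fits the SVM and drops the weakest features until 10 remain.
wrapper_sel = RFE(LinearSVC(max_iter=5000), n_features_to_select=10).fit(X, y)
wrapper_genes = wrapper_sel.get_support(indices=True)

print("filter keeps :", filter_genes)
print("wrapper keeps:", wrapper_genes)
print("overlap      :", set(filter_genes) & set(wrapper_genes))
```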
Among these methods, there are some which are oriented towards whatever kind of feature selection problem you have, and others that say: OK, for this kind of data, I have this kind of feature selection method, which is very well adapted and was developed for that kind of data, for that specific problem. All of them exist. But there is a problem: each method will inevitably have a bias. Why? Because every method has to make a decision. It has to decide which features to keep and which features to eliminate. For that decision, you need a criterion or a rule. These criteria are applied, and if your criteria are not good, your selection is not good; you rely on the goodness of these criteria. That inevitably introduces a bias. There is also another bias, induced by the metrics that we use, and the best way to explain it is just to notice that if you use two different methods that use the same metrics, they tend to produce more or less the same subsets, the same lists of biomarkers. One should support that with a more theoretical analysis, but speaking in practical terms, both your method and your metrics introduce some kind of bias. If we look at the classification I already presented, some of the axes lead to different kinds of bias. For example, you can have filters, or mainly filters, that are univariate or multivariate, and depending on that you will have a different kind of bias, because the univariate ones rely mainly on the intrinsic properties of the features one by one, while the multivariate ones try to capture relevant interactions; these interactions are the most important part, and even features that could seem not very interesting on their own could be interesting from the interaction point of view. The same for the training approach: it is clear that if you have an unsupervised approach, where you don't use the knowledge of the class, of the output of your process, you are not going to capture the same kind of properties of the features. But unsupervised selection could also be seen as something oriented towards the intrinsic properties of the process, independent of what you have as labels; perhaps it could be less sensitive to the labeling, to the annotation part. Taking that into account, unsupervised methods can produce a different list of features. And the search strategy also matters. Roughly, we can speak about two kinds. Deterministic: you have a method, you have a data set, and then you have a list of biomarkers, and it's always the same because the method is deterministic; that's very good, you run it once and that's it. But there are also a lot of methods that we call randomized, or in general non-deterministic, and the subsets could be very different, or hopefully not very different, but could differ from run to run. Then you need to run them several times to be sure that there are no accidents in your selection. So, with all of this diversity, how do we deal with that? One of the things we are trying to do is to compare different methods, and in his Master's thesis, Gary Marigliano explored the differences between some of the methods.
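One simple safeguard against this run-to-run variability, as mentioned above, is to run a non-deterministic selector several times and average the ranks it assigns. A minimal sketch, assuming random-forest importances as the (stochastic) ranking method; the number of runs and the data are placeholders.

```python
# Average ranking over repeated runs of a non-deterministic selector
# (random-forest importances here), to smooth out run-to-run variability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=8, random_state=2)

n_runs, n_features = 10, X.shape[1]
rank_sum = np.zeros(n_features)
for run in range(n_runs):
    rf = RandomForestClassifier(n_estimators=200, random_state=run).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # rank 0 = most important
    ranks = np.empty(n_features)
    ranks[order] = np.arange(n_features)
    rank_sum += ranks

avg_rank = rank_sum / n_runs
top10 = np.argsort(avg_rank)[:10]
print("top 10 features by average rank:", top10)
```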
Evidently there are a lot of methods, and for this work five filters were selected, plus three wrapper or embedded methods for doing ranking or feature selection. Notice that some of them are non-deterministic, and in those cases we performed an average ranking, because as they are non-deterministic you have to run them several times. OK, if you look at this, it is a matrix of intersections of the lists. We asked each method to produce a list of 1,000 genes out of 54K; whether that is the best option or not can be discussed, but that is how it was done: 1,000, and then we have the intersections here. Evidently, each method with itself has an intersection of exactly 1,000, and the results are, in our opinion, surprisingly low. I mean, having two methods that agree on only 53 genes out of 1,000, that's relatively low; that's low in general. Evidently some of the intersections are high. Two of the methods use the same metrics, and perhaps the way they use them is very similar, because they were exactly the same for all the tests we did; functionally, it's the same algorithm. Gary tried this on two data sets with different numbers of features, and every time the results were the same. Curiously enough, we have SVM twice, once as an embedded method and once within a wrapper, and they are not really that similar, as you can see here. The underlying method is supposed to be the same, but the way it is applied is very different. The minimum redundancy maximum relevance filter seems to be an outlier: it is not similar to any of the other methods and produces completely different lists. That gives you an idea of the diversity of the methods. So, how do we deal with this selection bias? I propose here two possibilities. One is the most common: conceiving novel and/or better methods, but I call that "yet another feature selection method". Still, that's the idea; there are a lot of people developing that. Perhaps novel selection criteria and strategies that reduce the bias, or using multiple criteria for the feature selection instead of a single metric, trying to combine two or three or five or whatever, or having wrapper methods which are robust, based on many runs or long runs, or both. Then you do a lot of computational work, a lot of search; nowadays it's quite easy to do that, you have a lot of resources, and based on that you will have robust statistics that allow you to better assess the quality of some of the biomarkers. We can also imagine having a metric to quantify the bias, based for example on several runs of the methods; this is one of the ideas that Zara is exploring in her thesis. And what we are going to present here is to use several methods at the same time, with the idea to uncover, or even better to compensate, some of these biases. The simplest way is to combine lists of features: each method provides a list of features, and then you have to combine them. In Gary's Master's thesis, we explored some strategies to mix these lists, like the union of intersections: you take two lists, you compute their intersection, which might have, I don't know, 20 or 100 genes; you do that for all the possible pairs, and then you take the union of these intersections. Another option is the union of all the lists.
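The intersection matrix and the two combination strategies just described can be sketched in a few lines; the tiny gene sets below are placeholders for the 1,000-gene lists each method would actually produce.

```python
# Pairwise overlap between the lists produced by different selection methods,
# plus the "union of intersections" and "union of all lists" combinations.
# The lists below are placeholders; in practice each holds ~1,000 gene IDs.
from itertools import combinations

lists = {
    "filter_A": {"g1", "g2", "g3", "g4"},
    "filter_B": {"g2", "g3", "g7", "g9"},
    "wrapper_C": {"g3", "g5", "g9", "g11"},
}

# Overlap matrix: how many genes each pair of methods agrees on.
for a, b in combinations(lists, 2):
    print(f"{a} & {b}: {len(lists[a] & lists[b])} genes in common")

# Union of intersections: every gene selected by at least two methods.
union_of_intersections = set()
for a, b in combinations(lists, 2):
    union_of_intersections |= lists[a] & lists[b]
print("union of intersections:", sorted(union_of_intersections))

# Union of all lists: every gene selected by at least one method.
union_all = set().union(*lists.values())
print("union of all lists:", sorted(union_all))
```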
For example, on two data sets: in one case, out of 54,000 features, we obtain 672. Another possibility is the union of all features of all the lists: you put all the lists together and, given that some features are repeated, you obtain for example around 5,000. You see, it's already a good selection, though a rather simplistic approach. Out of all that you take, for example, the top 100 features, and the way you compute this top is based on counting. Another possibility, which I'm not going to explain today but which we would like to explore a little bit more, is to use weighted lists. The principle is simple: if you have two lists which are very similar, you should use both of them, but each with a smaller weight than if they were alone. Imagine you have two methods which are identical: if you have twice the same list, you give each of them half the weight of another one which is completely different. So you have to measure the similarity of the lists and then derive a weight from that. I say that we want to explore this a little bit more because the first implementation was not really satisfactory; but a master's thesis is limited in time, so we have to explore it further. This is the F1 score, a performance measure, and you see that in general all the methods obtain, well, in that case this one is not as good, but in general you obtain the same performance. So the different combination methods allow you to keep more or less the same performance you already had. With all the features you have 93 here, and you are improving compared with that. So you can say that this approach is able to find a relatively small number of features with a similar performance. That means I'm running a bit long. So, for the two cases, we used the first combination approach, not because it is better than the others, which are just as good, but mainly because it was the first we had in mind when we worked on those projects. In the lung cancer project, we used seven different methods and performed this union of intersections, and with this we were able to go down from 16,000 to 1,000 genes, which were then used for producing the final model after the feature selection. The other problem is a little bit more complex, because we have 18 classes instead of only two. We applied the same method, but with only four selection methods, because each one was taking several days on servers, due to the number of classes and the number of tests that were necessary, and we were using Galgo, which is a very robust wrapper but also computationally very heavy, asking for around one terabyte of RAM to run. But we did the same, and using this method we were able to obtain lists of 500, 300, and 137 genes. These are two examples of what we did. All these methods already exist, and some of them are already implemented in R; Galgo is unfortunately no longer maintained, but most of them are easily available in the different languages, R, Python, whatever, and from the master's thesis we have a number of scripts that we plan to make available at some point.
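As an illustration of the counting-based top-k, and of one possible way to realize the weighted-list idea, here is a small sketch; the weighting formula below is only a hypothetical example of down-weighting near-duplicate lists, not the scheme explored in the thesis.

```python
# Combining lists by counting, plus one possible weighting scheme where lists
# that are very similar to each other contribute with smaller weights.
from collections import Counter

lists = {
    "m1": ["g1", "g2", "g3", "g4"],
    "m2": ["g1", "g2", "g3", "g5"],   # nearly identical to m1
    "m3": ["g7", "g8", "g9", "g1"],   # quite different
}

# Simple counting: keep the genes selected by the most methods (top-k).
counts = Counter(g for genes in lists.values() for g in genes)
print("top genes by count:", [g for g, _ in counts.most_common(5)])

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Weight each list inversely to its average similarity with the others, so two
# near-duplicate lists share roughly the weight of a single list (assumption).
weights = {}
for name, genes in lists.items():
    sims = [jaccard(genes, other) for o, other in lists.items() if o != name]
    weights[name] = 1.0 / (1.0 + sum(sims) / len(sims))

weighted = Counter()
for name, genes in lists.items():
    for g in genes:
        weighted[g] += weights[name]
print("top genes by weighted count:", [g for g, _ in weighted.most_common(5)])
```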
The other part, which I'm going to go through a little faster, is predictive modeling, and the issue identified there was model selection. When you do feature selection, you usually run many modeling instances with different feature sets, and usually you are also obliged to do cross-validation to assess generalization robustness, so you run many modeling instances with different training sets. For each one of these you obtain a model; what we call a model is an instance of a predictor. So you have a lot of predictors. How do you decide which one of them you are going to use as your final predictive model or instance? One thing we have been developing in the different projects is what we call a model selection workflow. Once you have a lot of models, a lot of classifiers or predictors, we apply a series of filters, or model selection steps: one based only on simple thresholds of performance, one based on the frequency of the variables inside the models, and one based on dominance with respect to several criteria. Each time we try to reduce the number of models, so as to have at the end one or a few selected systems. That could ideally be done automatically, but most often it is done manually by the expert. In the lung cancer example, with some criteria coming from the biologists involved in the project and, at the end, some filtering based only on frequency, we kept 320 and then 110 genes, and with another frequency criterion we were even able to come down to 20. Just to comment that the biologists decided to stay with the 110-gene pool instead of the smaller one, because it was much richer and they were more interested in the mechanisms than in the actual test, so they were not aiming for the smallest number; that was the final set. For the leukemia subtyping, we applied the same kind of pipeline, and at a given moment what we used was a threshold that allowed us to have the smallest number of genes and models while keeping the minimum performance required; here you see that if you demand a better performance, you keep very few genes and very few models, and that was not good enough. For this project we were not able to produce a single classifier, because given that we have 18 classes, some classifiers were good in one class but at the same time were producing false positives in other classes. So the idea was to use multiple models, let's say 20, here it's more than 50, and then to say that, out of these 50, if you have more than 20 proposing a class, you take that class, and in that way you are able to reject a lot of false positives. As you can see, model selection also involves how to combine the models. I'm not speaking here about several strategies we also developed for mixing different classifiers in a very competitive manner, but that's one of the projects where, as we are in a university of applied sciences, some of our results are protected by IP and all kinds of agreements, so some of the results, which are very interesting, we are not able to communicate.
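The multi-model voting just described can be sketched as follows; the class labels, the number of models, and the vote threshold are placeholders chosen to mirror the 50-model, 20-vote example.

```python
# Voting over many selected models: a class is assigned only if at least
# `min_votes` of the models propose it, otherwise the sample is left
# unassigned; this is how many false positives can be rejected.
from collections import Counter

def vote(predictions, min_votes=20):
    """predictions: list of class labels proposed by the individual models."""
    label, n = Counter(predictions).most_common(1)[0]
    return label if n >= min_votes else None   # None = no confident call

# Example: 50 models voting on one sample of a multi-class problem
# (subtype names are made up for illustration).
model_predictions = ["subtype_A"] * 26 + ["subtype_B"] * 14 + ["subtype_C"] * 10
print(vote(model_predictions))       # "subtype_A": 26 votes >= 20
print(vote(model_predictions, 30))   # None: no class reaches 30 votes
```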
OK, so I'm not running too late. I covered two major aspects, feature selection and model selection, and as I mentioned, one of them, feature selection, took most of the time. There are also other issues that were not discussed today: for example, dealing with collinearity and correlation, which is not as simple as just computing correlation matrices, and dealing with causality, because causality is present, and in a data set where you are measuring a lot of things, if you are not using that information, perhaps you are not selecting the right features because of that. OK, as I mentioned, this is a team effort. That was my team last year; we haven't taken a new picture. There are three people who are no longer with us, but two others arrived later, so it's a dynamic team. I also have to thank all my partners in the projects, both academic partners and industry. And now, if you have some questions.