Liane did her PhD with Kristal van Sen in Liège and she'll tell us about the outcome of her work. The floor is yours. Thank you. Good morning, and thank you Karsten for the introduction. I'm very happy to be here. The title may seem a bit specific, but in simpler words it means that I'm going to talk about learning from data, and I'll take the specific example of systems in biomedicine. When you are doing data science, there are some generic steps that you will apply in most cases, and here I will take a use case, a question, to show the importance of these steps and how we can deal with them. The question I'm going to take is: how could we better understand the genetic architecture of complex diseases? Complex diseases are caused by the interaction between environmental, genetic, and lifestyle factors, and it's important to understand them because understanding a problem is often the first step towards solving it. In the medical context, for example, it can be the first step to translating biological findings into real-life improvements such as drug recommendations or diagnostic tools. In any data science project, the heart of the project, its foundation, is the data. What data do we need? What data do we have? And how should we represent the data? This last question is actually very important, because the way you represent the data has a direct impact on the model you use and therefore on the results you get. So it's not only important at the output level, it's also important at the input level. In life science, and actually in many domains such as computer science and sociology, we have more and more evidence that elements are often not independent but interconnected. A good way to capture that is to see the elements as a system with interconnected features, and this is where networks come in, because networks are a good way to visualize connected features.
Networks are a data structure with nodes and edges modeling the relations between the nodes. There are different flavors, different families, of networks. For example, networks can be binary: in that case the nodes and edges are either present or absent. But there are also weighted networks, where the edge weights can represent the strength of an association or the confidence we have in it. Graphs are not only useful for a global visualization of the data; they also show the different substructures, so you can zoom in and see important details about them. In the meantime, we have increasingly complex data, and this gives rise to heterogeneous networks. In other words, we have different points of view from which to look at the data. We also have different methods to analyze the data, and these give rise to differences in the networks. This is what I call network heterogeneity, and it complicates the replication and interpretation of findings. The last concept I need to introduce to answer the main question, how could we better understand the genetic architecture of complex diseases, is epistasis. Epistasis occurs when the effect of a combination of genes is not due to their independent effects. In epistasis networks, the nodes are genetic units of analysis, for example genes, and there is an edge between two nodes if the combination of the nodes has an effect on the phenotype. Epistasis detection is a quite complex problem, especially because it involves a lot of statistical tests. Many methods have been developed for it, but they all give rise to different epistasis networks, and today there is no real grip on which aspects of the methodology produce the similarities or dissimilarities between the networks. Okay, so now we have all the concepts, and this figure summarizes the main steps we propose to solve the problem.
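As a small illustration of the binary versus weighted distinction described above, here is a minimal Python sketch, with made-up gene names and weights, that turns a weighted network into a binary one by thresholding the edge weights:

```python
# Toy weighted epistasis network: each edge maps a gene pair to a weight
# (e.g. association strength). All names and values are illustrative.
weighted = {
    ("GENE_A", "GENE_B"): 0.9,
    ("GENE_A", "GENE_C"): 0.2,
    ("GENE_B", "GENE_D"): 0.6,
}

def binarize(edges, threshold):
    """Keep only the edges whose weight reaches the threshold."""
    return {pair for pair, w in edges.items() if w >= threshold}

binary = binarize(weighted, threshold=0.5)
print(sorted(binary))  # [('GENE_A', 'GENE_B'), ('GENE_B', 'GENE_D')]
```

In the binary view only presence or absence of an edge remains; the weighted view keeps the strength or confidence information.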
So first we will see how detecting and understanding the sources of heterogeneity helps us see the differences we would like to reduce in the networks. Then we will look specifically at reducing the heterogeneity between the networks, and finally at how this can enhance interpretability. In this first project we are interested in graph comparison: we would like to create groups of graphs such that within one group the graphs are similar. There are several specificities in our context. First, we would like to derive an algorithm that is, as I said, specific to graphs, meaning for example that the distance we use should be designed for graphs. Then we would like to do clustering, so unsupervised classification, because very often, and especially in the biomedical context, for example in disease subtyping, the group labels are not known. Actually, here we don't even know the number of groups we would like to derive, so the algorithm needs to determine it itself. Finally, we would like to incorporate some notion of significance into the algorithm, because if we say that two groups are different, we want to make sure they are statistically significantly different. So we developed the pipeline presented here, which we called Netanoga. Starting from a list of networks, we compute the pairwise distances between them. Many metrics exist for that, and the choice depends on the context. In epistasis, for example, we don't only want to compare the structure of the networks; the names of the nodes, the genes, are also important, so we need a distance that takes node correspondence into account. We then apply an unsupervised clustering algorithm to get some groups, and to decide where to put the truncation point in the dendrogram, that is, how to obtain the final groups, we apply a recursive algorithm.
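The talk does not name the exact graph distance used, but one simple node-aware choice consistent with the description, where node labels (gene names) matter, is the Jaccard distance between edge sets. A minimal sketch with toy networks, producing the pairwise distances that a hierarchical clustering step would consume:

```python
from itertools import combinations

def jaccard_distance(edges_a, edges_b):
    """1 - |intersection| / |union| of two edge sets; node labels matter,
    so the same structure on different genes counts as different."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)

# Three toy networks represented as sets of (gene, gene) edges.
nets = {
    "net1": {("A", "B"), ("B", "C")},
    "net2": {("A", "B"), ("B", "D")},
    "net3": {("X", "Y")},
}

# Pairwise distance matrix entries, as fed to a clustering algorithm.
for n1, n2 in combinations(sorted(nets), 2):
    print(n1, n2, round(jaccard_distance(nets[n1], nets[n2]), 2))
```

Note that net1 and net2 share an edge and are therefore closer to each other than either is to net3, which has no gene in common with them.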
So we go from the top to the bottom of the dendrogram and apply a customized ANOVA test to see whether the within-group variation is smaller than the between-group variation. We do that for the first two groups; if they are statistically different, we repeat the procedure in each of the two subgroups, and so on. So we are building a tree with two stopping conditions: either the groups are not statistically different, or the groups are too small. To evaluate the algorithm, we ran simulations. We created an original network, disturbed it to create group networks, and then disturbed those again to create individual networks. The goal is to trace back, from the individual networks, which group they belong to. We evaluate the performance with the Jaccard index, which ranges from 0 to 1, where 1 is the best value. It's not shown here, but we evaluated the type 1 error and it is under control: when there are no groups, we don't detect any, which is good. Here is the figure for the power: when there are groups, do we detect them? We can see that some factors influence the results and some do not. For example, the number of networks and the density of the networks do not seem to have a big influence on group detection, but the number of groups or the multiple testing correction that we apply does have an effect. Regarding the multiple testing correction, the takeaway is that we need a not-too-stringent correction, because the algorithm in itself is already a bit conservative. Now we can detect groups of similar graphs, so we can understand the sources of heterogeneity, and next we would like to reduce this heterogeneity. In this project we'll talk about the importance of choosing the variables wisely and cleaning the data. So here we started from two observations.
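One common way to compute a Jaccard index when comparing a recovered grouping to the true one is over co-clustered pairs; whether this is the exact variant used in the evaluation above is an assumption. A toy sketch:

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Set of index pairs that are assigned to the same group."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard_index(true_labels, pred_labels):
    """Jaccard index on co-clustered pairs: 1.0 means identical groupings,
    0.0 means the groupings share no co-clustered pair."""
    a = co_clustered_pairs(true_labels)
    b = co_clustered_pairs(pred_labels)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Six simulated networks in three true groups of two; the prediction
# assigns one network to the wrong group.
true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(jaccard_index(true, pred))  # 0.4
```

The index only looks at which items end up together, so it is insensitive to how the groups themselves are labeled.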
First, biological knowledge is growing, and there is a need to study the combination of this knowledge with epistasis to see if we can get better results. Second, genes are often the natural units of analysis in biology because they are more easily interpretable and can be linked, for example, to biological pathways. In epistasis, however, the unit of analysis is very often the SNP, a smaller unit of analysis, so we would like to study the impact of going from SNPs to genes. At the bottom of the slide is a simplified version of the workflow. We test for epistasis at the SNP level and define a function to map SNPs to genes. It can simply be based on physical distance, but we also tried mappings based on QTL information or chromatin information. Then we convert SNP-level tests into gene-level tests using the adaptive truncated product methodology to aggregate the test results; basically, it aggregates the p-values that are below some pre-specified threshold. We then use biological knowledge to reduce the search space: we focus on gene interactions that are known, and importantly, not interactions known for the disease under investigation, but interactions known in general, for any biological process. Finally, we combine the analysis with a pathway analysis to see the broader context in which epistasis may occur. This is an extract of the results. We applied different variations of the pipeline to an IBD dataset containing approximately 70,000 individuals. Half of them are cases, with either Crohn's disease or ulcerative colitis, and half of them are controls. What we see here is the effect of applying different mappings: the standard physical mapping, or QTL and chromatin mappings, which are more related to functionality. For QTL and chromatin we also filter for biological knowledge.
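As an illustration of the aggregation step, here is a sketch of the plain (non-adaptive) truncated product statistic, which combines the p-values below a pre-specified threshold; the adaptive variant mentioned above and the permutation-based significance assessment are omitted, and all numbers are toy values:

```python
import math

def truncated_product_stat(p_values, tau=0.05):
    """Truncated product statistic: product of the p-values <= tau,
    returned on the -log scale for numerical stability. Significance
    would normally come from a permutation or analytic null
    distribution, which this sketch leaves out."""
    kept = [p for p in p_values if p <= tau]
    if not kept:
        return 0.0
    return -sum(math.log(p) for p in kept)

# Toy SNP-level interaction p-values all mapped to one gene pair.
snp_pvals = [0.001, 0.2, 0.04, 0.7]
print(round(truncated_product_stat(snp_pvals), 3))
```

Only the two p-values below the threshold (0.001 and 0.04) contribute; a larger statistic means stronger aggregated evidence at the gene level.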
We keep the gene interactions that are known to be interacting. We see that we get very different results. For example, on the left, the standard pipeline gives rise to the biggest network, even though we test a lot of SNPs and a lot of genes, so the multiple testing correction is quite strong. For the two networks on the right, yes, we have fewer interactions, but these interactions are more interpretable and more robust, and they also lead to relevant pathways. So we get different points of view by trying all these different pipelines. Another way to reduce the heterogeneity between networks is to aggregate them. For example, we can use the first project I presented to group graphs into groups of similar networks, and then apply an aggregation in each of these groups to get only one representative graph per group. That means we are creating an aggregated graph which is a kind of summary of the group. This is also based on simulations. We created a network that we call the true network and some partial networks that only partially represent the true network. The goal is to predict, from the edges of the partial networks, the value of each edge in the true network: 0 or 1, the edge is absent or present. We tried three models to see if this is possible: K-means, latent class analysis, and similarity network fusion. We evaluate the performance with the F1 score because the data are imbalanced: there are a lot more absent edges than present edges. We tried multiple settings. For example, in panel C we increased the number of nodes in the partial networks, and in panel T we increased the number of partial networks. What we saw is that latent class analysis outperforms the others in all scenarios.
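Latent class analysis itself is more involved, but a simple majority-vote baseline conveys the aggregation idea, together with the F1 score used for the imbalanced edge labels. A toy sketch with made-up networks:

```python
def majority_vote(partial_edge_sets, candidate_edges, threshold=0.5):
    """Baseline aggregation (much simpler than latent class analysis):
    keep an edge if it appears in at least `threshold` of the partials."""
    n = len(partial_edge_sets)
    return {e for e in candidate_edges
            if sum(e in s for s in partial_edge_sets) / n >= threshold}

def f1(true_edges, pred_edges):
    """F1 score on the 'edge present' class, suited to the imbalance
    between many absent edges and few present ones."""
    tp = len(true_edges & pred_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

true_net = {("A", "B"), ("B", "C"), ("C", "D")}
partials = [{("A", "B"), ("B", "C")},
            {("A", "B"), ("C", "D")},
            {("A", "B"), ("A", "D")}]
candidates = set().union(*partials)
consensus = majority_vote(partials, candidates)
print(sorted(consensus), round(f1(true_net, consensus), 2))
```

Here only ("A", "B") appears in a majority of the partial networks, so the consensus is precise but misses two true edges, giving an F1 of 0.5.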
So this is an example of a method we could use if we want to summarize multiple networks into one, to get a representative network that is easier to compare to other representative networks, rather than comparing many networks from many groups. Okay, so now that we have all the methods, let's gather them and come back to the main question we asked at the beginning. We have seen that when we apply different pipelines for epistasis detection, we obtain very different results. For example, here we applied 10 different epistasis detection tools to the IBD dataset, the same as in the second project. The tools already differ in the input they require: some require imputed data, some only deal with binary phenotypes. They also differ in their output: some output statistics or p-values, some just output a ranking of the SNPs. For each of these tools we applied four different variants: either we corrected for population structure or not, and either we included some biological knowledge or not. Here is an extract of the results at the SNP level, so no network information is used yet. We selected the top thousand pairs for each workflow, for each analysis, and then computed the number of pairs in common between each pair of methods. What we see is that EpiBlaster, linear regression, and BOOST are very similar to each other. This means that at the SNP level, the comparison is indicative of the tool we use; it differentiates between the tools, and actually between the modeling frameworks, because EpiBlaster, linear regression, and BOOST are all based on statistics. On the right is a comparison at the SNP level as well, but based on the ranks, and we see approximately the same thing, so I will not go into the details. And now we go from SNPs to genes.
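The top-pairs overlap comparison described here can be sketched as follows, with hypothetical tool names, toy rankings, and a much smaller k than the thousand pairs used in the real analysis:

```python
from itertools import combinations

def top_k(ranked_pairs, k):
    """Take the k best-ranked SNP pairs of one workflow."""
    return set(ranked_pairs[:k])

# Toy rankings from three hypothetical workflows (best pair first).
rankings = {
    "tool_A": [("rs1", "rs2"), ("rs3", "rs4"), ("rs5", "rs6")],
    "tool_B": [("rs1", "rs2"), ("rs5", "rs6"), ("rs7", "rs8")],
    "tool_C": [("rs9", "rs10"), ("rs11", "rs12"), ("rs13", "rs14")],
}

# Count the shared top-k pairs for every pair of workflows.
k = 3
for a, b in combinations(sorted(rankings), 2):
    shared = top_k(rankings[a], k) & top_k(rankings[b], k)
    print(a, b, len(shared))
```

In this toy setup, tool_A and tool_B share two of their top-3 pairs while tool_C shares none with either, mirroring how overlap counts separate similar modeling frameworks from dissimilar ones.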
So for that we do the same thing as in the second project: we map the SNPs to genes using a mapping function, here the QTL one, and we either post-filter for biological knowledge or not. We applied the algorithm of the first project, the one where we were building the groups of similar networks. Here we see that we have a big purple group, and in this group we find all the methods where we used biology, all the biologically driven methods. That means that when we include biology in the methods, we reduce the heterogeneity, which is somewhat expected because we are reducing the search space. On the contrary, when we don't use any biology in the pipeline, we see that it gives rise to many different clusters. For example, all the analyses based on the neural network weights are in a different cluster, and the one with HPM as well. Finally, we applied the network aggregation method that I showed in the third project to all the networks in the purple group. What we see is that it gives rise to this network, where I couldn't write the gene names because there are too many, but many genes are related to the HLA region, and the HLA region is known to be very important in IBD. So this overlaps with the known biology of IBD. The message of this project is manifold. First, it gives insight into the differences between the methods that result in differences in the epistasis networks. It also shows the importance of applying multiple epistasis methods to the same dataset, because we obtained different results, so different points of view on the disease, and it shows the importance of including biological knowledge. To summarize, I showed three different methods and applied them to a specific use case to show their impact. The first thing we can say is about the versatility of the methods, because here I used them for networks.
Networks were my main PhD topic, but actually most of the methods we developed are applicable to many types of data. This also illustrates the important steps we always apply in any data science project. Choosing the data and the variables wisely: here, it was focusing on genes instead of SNPs to get more interpretability. Representing the data: the representation of the data is important, and here we decided to go with networks because we were looking at interactions. Then it's important to look at the sources of variation between the observations; in this case, we did that by grouping similar observations together and then aggregating these similar observations. Finally, it shows the importance of cleaning the data; here, that was related to the way we filtered the edges in the networks by including biological information. I would like to thank my collaborators, and I also would like to thank you for your attention. Let me know if you have any questions. Thank you. Are there questions from the audience? Maybe I start by asking one: at the end, when you compare these different epistasis detection methods to each other, you introduce another level of multiple testing, not only in the space of SNP pairs or combinations but also in the realm of methods. Have you thought about this, and how does it affect your results? Since we compare the networks obtained with the different tools or methods, we didn't correct for multiple testing here. We are not doing any prediction, and we didn't go a level further than the epistasis networks; we started there, so in my opinion there was no need to correct for multiple testing. But have you looked at questions like whether the consensus between these different methods has particular properties? You convincingly showed that the results differ in general, that you can detect differences.
You also looked at whether the common parts, the intersection between the different methods, have any special properties? When we don't use any biological information, sometimes they don't overlap at all. There is no intersection, and that was also one of the reasons we wanted to do this comparison: sometimes there was nothing in the intersection at all. But when we do integrate biology, then there is indeed something in the intersection, and in that case it could be the common biological processes that are probably the most relevant for the disease. And the things that are not detected in most of the epistasis networks but only in one or another are probably less important. That's probably not the correct biological word, but at least they don't seem as essential as the gene pairs that we find with most of the methods. Thank you. Another question from the audience? Then I would have one more. These different methods do not necessarily search exactly the same space. Making up an example: some search pairs of SNPs, others search higher-order combinations of three or more SNPs. And the methods are parameter sensitive, in the sense that you may choose, say, a significance threshold for each of these methods, but this may mean something different in a quadratically growing search space than in a cubically growing one. So I think it's very hard to find the best way to compare these different approaches to each other such that the parameters mean exactly the same thing, and the parameters have an impact on the results that you obtain. Have you considered this point, how to maximize the comparability between the different methods? This is a very valid concern.
What we've done is that we didn't set the exact same threshold for each method, first because sometimes it was not possible: as you said, sometimes the output is just a ranking and sometimes it's a p-value, so we cannot deal with significance in exactly the same way. For each tool, we took the recommended way of doing the significance assessment from the paper where the tool was published. So it's true that significance is handled differently for each method, but since they are so different from each other, there was no way to apply the same threshold. And even if we were applying the same threshold, as you were saying, it could still compare different things, because the methods work differently and are not looking at the same search space. Could you look at something like a ranking of the edges? That is, you try to find one common way of representing the results, and you rank all the edges in the network as ranked by the different methods. I mean, you're looking at subgraphs sometimes, or combinations, but maybe the way to do it is to bring it down to the simplest form, the simplest element or unit, the edges, and then to get a ranking of edges for each method and compare these rankings to each other. Of course, you then lose some part of the interaction if you boil it down to edges, so somewhere you will probably lose some information. If I understand correctly, it's a bit similar to what we've done on the right here: we ranked the SNP pairs. Yes, yes, and then we compared the rankings, but we again needed to set a threshold, because some methods output very few SNP pairs and some output a lot, so we went for the minimum. Okay, okay, that's how you made it comparable, good. Thanks. Okay, so with this we thank the speaker for her presentation and her answers.