So it's my great pleasure to introduce the last speaker of the day, Anaïs Baudot from Aix-Marseille University, the CNRS and the Marseille Medical Genetics Unit, and the Barcelona Supercomputing Center. Anaïs did her PhD in bioinformatics in Marseille, finished in 2007, and was then a postdoc with Alfonso Valencia at the Spanish National Cancer Research Centre in Madrid. In 2010 she joined the Marseille Mathematics Institute, and in 2017 she moved to the Marseille Medical Genetics Unit (MMG) to create a group on networks and systems biology for diseases, which is also the major focus of her research. Since 2019 she has also been an associate researcher in life sciences at the Barcelona Supercomputing Center. Her main research interest, as I already alluded to, is developing computational approaches to study human diseases, with a particular focus on network-based methods. We are very happy to have you here, Anaïs; this is a very exciting and relevant topic, and we will be very happy to learn more about your research in the next hour. The floor is yours. Welcome.

Thank you very much. I'm very happy to be here, and I have seen that you have had a fantastic line-up of speakers, so it's really exciting to be in this context and to discuss with you. I'm going to talk mainly about the work we are doing in the team at the Marseille Medical Genetics Unit, which is overall dedicated to multi-omics data integration for genetic diseases.

The context of this work is that we have a lot of biological data today, and I will focus mainly on omics data, the different types of omics data we have access to. These data are really diverse: we have transcriptomics, proteomics, interactomics, and so on. In addition, all these data sets are very large, mainly in the sense that the number of biological variables is much larger than the number of samples, which is quite often a problem. And all these data are also complex; for instance, they are not independent from each other. So overall, and this is currently a bottleneck, we really do need computational methods to analyse and extract knowledge from all these biological data, independently but also jointly, because we want to integrate the information from the different types of data.

Why do we want to do that? Because the integration of different omics data is really essential. The first reason is that each omics data set has its own biases, limitations and noise, and integrating different data sets is expected to reduce this noise. In addition, the different data sets capture different aspects or scales of cellular functioning, so they complement each other to provide a more comprehensive view of the cellular mechanisms, and of the pathological deregulations in which we are interested here. So we need computational methods for the analysis of omics data, but we also need methods for the integration of different types of data. This is exactly what I want to talk about, and it is the main goal of the projects we are developing in my team in Marseille. There are really two aspects: starting from biomedical omics data, we want to design new algorithms for the integration of different types of data that can be used in different contexts, but we also want to apply these algorithms to study genetic diseases.
We are interested in genetic diseases in general, but more particularly in rare genetic diseases, and I want to stress here that these rare genetic diseases bring real scientific challenges. First, rare diseases are individually rare, but there are many different rare diseases: in France, for instance, there are more than 6,000 rare diseases, and together they affect about 3 million patients, which is more patients than cancer. Then, many patients remain undiagnosed; we have a diagnosis for less than half of the patients, which is called diagnostic wandering or diagnostic deadlock. Also, the phenotypes are highly heterogeneous. We sometimes have the feeling that because rare diseases are often monogenic, caused by mutations in a single gene, as opposed to complex diseases such as cancer or Alzheimer's, they should look like simple diseases, but this is not the case. They are very complex: the phenotypes of different patients can be highly heterogeneous even with mutations in the same gene, and in many cases we do not have a clear picture of the pathophysiological mechanisms, of the cellular deregulations that are happening. And of course, in most cases no treatment exists. So our rationale is that the integration of different omics data sets, and this type of large-scale omics study, is also very important and very relevant for rare genetic diseases, but these diseases need specific methods, because we have very few patients and some specific challenges. So we really want to develop methods, keeping in mind that we want to apply them to study rare genetic diseases.

Today I'm going to tell three different stories, which represent three different frameworks for omics data integration that we have been working on. These frameworks are: the mining of multi-layer networks, where we will focus on interaction networks; active module identification, which is an approach to integrate quantitative data such as transcriptomics (gene expression) onto a network and find subnetworks of interest; and finally, leaving the world of networks, joint dimensionality reduction, another very interesting family of approaches to integrate different types of omics data. For each of these frameworks, I will try to show you some of the algorithms we are developing and how we apply them to better understand rare genetic diseases.

The first framework is the mining of multi-layer networks, and I will start by thanking the people who did this work and developed these algorithms: Alberto Valdeolivas and Anthony Baptista, two PhD students, and Léo Pio-Lopez and Osana Nesichik, two postdoctoral fellows. This work was done in collaboration with the Mathematics Institute in Marseille, with a strong involvement of people from the more theoretical side. So, we are interested in mining large-scale networks.
What is a large-scale network? It is this kind of picture: a huge set of interactions between nodes of interest. Historically, these large-scale interactome networks were mainly protein-protein interaction networks, in which the nodes correspond to proteins and the edges to their physical relationships. It's true that these networks were quite criticized at the beginning because they contain some false positives, but the techniques have evolved a lot. It's also true that they lack any spatio-temporal context, meaning we know that there can be physical binding between two nodes, but we don't know whether the two proteins are expressed at the same time and in the same place. Still, if you consider the amount of interactions, these networks contain a lot of information and knowledge about protein cellular function, even for proteins that are poorly studied, for which we can obtain interaction profiles thanks to large-scale interactome mapping techniques. So my claim is that they are worth studying, because they contain huge knowledge about cellular function.

What is interesting is that nowadays we do not only have these protein-protein interaction networks; we have access to many different interaction sources. The protein-protein interaction network can be complemented by a molecular complex network, in which the interactions are drawn from immunoprecipitation experiments, for instance. You can also fetch data from pathway databases such as Reactome or KEGG and construct large networks of pathway relationships. And importantly, we also have many other types of omics from which we can build or infer networks; for instance, a co-expression network constructed from RNA-seq expression data across many cell lines and tissues, where you simply compute the correlation of expression (this can also be done with more advanced statistical techniques, and with many different types of omics). So we have all these different interaction sources, which overall contain a huge quantity of information about gene and protein cellular function. However, these networks are large, complex and noisy, and the big question is how we extract this information. This is what we are trying to do with algorithms based on graph theory, and we are trying to do it by considering the different interaction network sources jointly, using what is called a multiplex framework.

Let me now define what a multiplex network is. A multiplex network is a network composed of different layers; each layer contains the interactions from one category, for instance molecular complexes or pathways, but all the layers contain exactly the same nodes, here genes or proteins that are considered equivalent across layers. What is interesting in combining networks in a multiplex framework is that you keep track of the individual network topologies and features, and I will try to show that.
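To make the multiplex idea concrete, here is a minimal sketch, not the speaker's actual code, of how one might hold such an object in memory: several interaction layers stored as separate graphs that all share one node set. The gene names and edges are toy examples.

```python
# Minimal sketch of a multiplex network: several layers, one shared node set.
import networkx as nx

genes = ["TTN", "DYSF", "CAPN3", "MYH7", "SGCA"]  # shared node set (toy example)

layer_edges = {
    "ppi":       [("TTN", "MYH7"), ("DYSF", "CAPN3")],   # protein-protein interactions
    "complexes": [("DYSF", "SGCA"), ("CAPN3", "DYSF")],  # molecular complexes
    "pathways":  [("TTN", "CAPN3"), ("MYH7", "SGCA")],   # pathway co-membership
}

multiplex = {}
for name, edges in layer_edges.items():
    g = nx.Graph()
    g.add_nodes_from(genes)   # every layer carries the full node set,
    g.add_edges_from(edges)   # but keeps its own edges and topology
    multiplex[name] = g

for name, g in multiplex.items():
    print(name, g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```

The point of keeping the layers separate, rather than collapsing them into one graph, is exactly what is discussed next: each layer's topology remains available to the algorithms.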
Once we have this multiplex framework, we want to perform classical network analyses. The first one is identifying communities: groups of proteins, or nodes, that share a lot of interactions and are expected to be involved in the same biological processes or cellular functions. But how do we do that from multiple sources of interaction? Maybe the most intuitive approach is to say: I will combine the different networks into a single one, and then apply my very classical community identification algorithm. For instance, we can compute the intersection, keeping only the interactions present in all the networks. These would be very strong interactions, because they have been found by different interaction sources; however, using the networks I presented before, there are fewer than six interactions common to the four big networks, so this does not work. Another approach is to identify communities independently in the different networks and then build a consensus, and this does not always work either (I think you also saw this notion in the talk of Kluia Gadasenko on late integration methods); it is quite often hard to find a consensus. The other possibility is to take the union, or the sum, of the different networks, and the problem then, remember that we had a very big and noisy co-expression network, is that all the relevant interactions from pathways or protein interactions get hidden inside this very big and noisy network. So merging the different networks does not really work either.

That's why we wanted to work with the multiplex framework, keeping track of the individual network layers, and we developed methods for clustering, or community identification, that work directly on the multiplex network. The method optimizes the modularity, but a multiplex modularity, and it allows identifying clusters from a multiplex network without merging and without doing any consensus. We used this method in different contexts, in particular in the DREAM challenge dedicated to the identification of communities enriched in disease genes. What we found is that we do obtain different results than when we merge the networks together; in particular, merging the networks together behaves more or less as if we were only using the big co-expression network. We also saw that using more interaction sources in many cases provides better communities, meaning richer communities, with more annotations and more interpretability in terms of the cellular processes that occur in the cell. However, there is a point of caution: the different interaction sources must be complementary, meaning they should reflect the same biological process, and this is not always the case. For instance, if you combine a protein-protein interaction network and a pathway network, it is the same kind of process: both reflect physical contacts and the flow of information inside the cell. But if you combine a protein-protein interaction network with a network built from epistatic interactions, which often occur between different pathways, they do not reflect exactly the same process, and the combination will not work in this case.
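To make the earlier point about naive merging concrete, here is a small hedged sketch of the kind of sanity check implied above: compare the edge sets of the layers. With real networks the intersection is typically close to empty, while the union is dominated by the largest (often noisiest) layer. The edge lists below are purely illustrative.

```python
# Sketch: how much do interaction layers actually overlap?
# (Toy edge lists; real layers have 10^4 to 10^6 edges.)
def edge_set(edges):
    # undirected edges, order-independent
    return {frozenset(e) for e in edges}

ppi      = edge_set([("A", "B"), ("B", "C"), ("C", "D")])
pathways = edge_set([("A", "B"), ("D", "E")])
coexpr   = edge_set([("A", "C"), ("A", "D"), ("B", "D"), ("C", "E"), ("D", "E")])

layers = {"ppi": ppi, "pathways": pathways, "coexpression": coexpr}

intersection = set.intersection(*layers.values())  # what "intersection" merging keeps
union        = set.union(*layers.values())         # what "union" merging keeps

print("intersection:", len(intersection), "edges")  # usually almost empty
print("union:       ", len(union), "edges")         # dominated by the biggest layer
for name, es in layers.items():
    print(f"  {name}: {len(es)} edges, fraction of union = {len(es)/len(union):.2f}")
```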
This was the first algorithm, one I like a lot, dedicated to community identification. I will now talk about a second algorithm I really enjoy, the random walk with restart. It is an algorithm for network exploration, and it works as follows. You start with what we call the seed; it is really the node you are interested in, which could be the gene mutated in a disease, or your favourite protein. You have your seed, and then you simulate a random particle that walks through the interactome: starting from the seed, it can go here or there, and from that point it can move again, and so on; you are walking through the interactome in this way. The "restart" means that at each step the walker also has a non-zero probability of going back to the seed. If you simulate this process many times, walking and restarting again and again, some nodes, because they are close to the seed and highly connected, will be visited a lot, and other nodes, far away and perhaps less connected, will be visited much less. In a sense, it is a very smart way to define a proximity score from every node in the network to the seed. In this toy model, you can see that it is different from just counting the number of jumps: two nodes can both be one jump away from the seed, but the more connected one gets a better score because it has a higher probability of being visited by the random walker. So the random walk with restart defines a proximity score for all the network nodes with respect to the seed; this is just a toy model, but do not forget that your seed is hidden in a big network, and you obtain a score for every node with respect to your seed.

What is also interesting is that you can have more than one seed; with two seeds, for instance, you obtain a score with respect to both seeds. This is very useful for what we call guilt-by-association: extracting the subnetworks of interest around your seeds of interest. In this example the random walker is anchored in three seeds, which are DNA repair genes and proteins, and the blue nodes form the closest subnetwork extracted by the random walk from the big network around these seeds. We see that almost all the nodes here are also involved in DNA repair, so it is really a guilt-by-association strategy, but we also find nodes that, at least at the time, did not carry the DNA repair annotation in the annotation databases; so it is also a way to predict functions for these nodes.

This was the state of the art: the random walk with restart is widely used in bioinformatics to do guilt-by-association on networks. What we did is extend this algorithm so that it can navigate multiplex, multi-layer networks. Now the random walker can navigate within one network, but it can also jump to another network, and to yet another one; it can jump because, remember, in our definition of a multiplex we have the same nodes in the different layers. The seed is hidden in these big networks, but we have more than one network, and in the same way we can define a score for all the network nodes with respect to the seeds. This is implemented in a package, and it is very useful because the output random walk with restart scores can be used in a wide variety of applications: directly for node prioritization or ranking, to extract subnetworks around seeds of interest, but also as input to other bioinformatics algorithms, for clustering or network embedding for instance.
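Here is a minimal numpy sketch of the idea on a multiplex network. It is an illustrative simplification, not the formulation used in the actual packages: the supra-transition matrix construction, the layer-switch parameter `delta` and the equal split of the restart vector over layers are assumptions made for the example.

```python
import numpy as np

def rwr_multiplex(adjs, seeds, restart=0.7, delta=0.5, n_iter=200, tol=1e-10):
    """Random walk with restart on a multiplex network (hedged sketch).

    adjs    : list of L symmetric (n x n) adjacency matrices sharing the same node set
    seeds   : indices of the seed nodes
    restart : probability of jumping back to the seeds at each step
    delta   : probability of switching layer while staying on the same node
    Returns one proximity score per node (summed over layers)."""
    L, n = len(adjs), adjs[0].shape[0]
    if L == 1:
        delta = 0.0

    # Build the (n*L x n*L) supra-transition matrix, column-stochastic.
    S = np.zeros((n * L, n * L))
    for a, A in enumerate(adjs):
        col = A.sum(axis=0)
        col[col == 0] = 1.0                       # isolated nodes: avoid division by zero
        P = A / col                               # column-normalised within-layer transitions
        S[a*n:(a+1)*n, a*n:(a+1)*n] = (1 - delta) * P
        for b in range(L):                        # layer switch: same node, other layer
            if b != a:
                S[b*n:(b+1)*n, a*n:(a+1)*n] += np.eye(n) * delta / (L - 1)

    # Restart vector: seeds, spread equally over layers.
    p0 = np.zeros(n * L)
    for s in seeds:
        for a in range(L):
            p0[a*n + s] = 1.0 / (len(seeds) * L)

    p = p0.copy()
    for _ in range(n_iter):
        p_new = (1 - restart) * S @ p + restart * p0
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p.reshape(L, n).sum(axis=0)            # aggregate the per-layer scores per node

# Toy usage: two 4-node layers, node 0 as seed.
A1 = np.array([[0,1,1,0],[1,0,0,0],[1,0,0,1],[0,0,1,0]], dtype=float)
A2 = np.array([[0,0,1,1],[0,0,1,0],[1,1,0,0],[1,0,0,0]], dtype=float)
print(rwr_multiplex([A1, A2], seeds=[0]))
```

The resulting vector is exactly the kind of versatile score the talk refers to: it can be sorted for prioritization, thresholded to extract subnetworks, or fed into downstream algorithms.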
May I ask a question here, on the previous slide: when you say you can jump across layers, how expensive is it to jump to a different layer?

This is a parameter. You can either give more probability to exploring the current network, or give the same probability to moving to a node of the same network and to jumping to another layer; it is something you can choose. Usually we do something homogeneous, with the same values to go everywhere, although in some cases, for real applications, we lower the weight of the co-expression network, which we know is noisier, so we put a lower probability of jumping to that network. So it is a hyperparameter that you can adjust based on your knowledge.

Lucas also has a question. A very quick follow-up: once you jump between two layers, when you are exploring the second layer, say, is the algorithm allowed to jump again?

Yes. At each time point it can either restart, and the restart can also be in any layer, with the same weight or with more weight on one layer, or it can explore and move from one layer to another. Okay, thank you.

So, my point here was that these scores are very useful because they are really versatile and can be used in a wide variety of contexts. I will show you later different examples, for instance how we use them for embedding, but now I want to show you an example of how we used them to gain some insight into one set of rare diseases, the muscular dystrophies. Muscular dystrophies are monogenic diseases caused by mutations in different genes that lead to muscle weakness. Among all the muscular dystrophies, we were particularly interested in two sets that affect different types of muscles. It is really interesting: some muscular dystrophies affect only muscles of the distal part of the body, the distal-onset myopathies such as Miyoshi myopathy, and there are other myopathies in which the affected muscles are in the proximal part of the body, such as the limb-girdle muscular dystrophies. By analysing data from a gene panel used for diagnosis, we were able to identify 11 genes that, when mutated, lead only to distal myopathies, and 19 genes that, when mutated, lead only to proximal myopathies. We don't really understand this, because it is all skeletal muscle, so we don't really have a clue why different sets of muscles should be affected. What we did is a guilt-by-association strategy on networks using these genes as seeds. We have three big biological networks, molecular complexes, protein-protein interactions and pathways, and we used as seeds either the 11 distal-only genes or the 19 proximal-only genes, and we extracted the top-scoring subnetworks around the seeds. What we found is that it is not the same subnetwork: different genes are extracted in the two cases. Interestingly, the first subnetwork, obtained with the distal-only genes, is enriched in proteins localized in the sarcomere, the contractile apparatus of muscle cells, whereas the second subnetwork is enriched in proteins localized in the sarcolemma, the membrane of muscle cells, and also involved in glycosylation. So it seems that these mutated genes, which lead to different onsets, are involved in different functions of the muscle cells.
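The "extract the top-scoring subnetwork around the seeds" step mentioned above can be sketched very simply: rank non-seed nodes by their random-walk score, keep the best k together with the seeds, and induce the corresponding subgraph. This is a hypothetical illustration (toy genes and scores), not the exact extraction procedure used in the study.

```python
# Hedged sketch: from per-node proximity scores to a top-scoring subnetwork.
import networkx as nx

def top_scoring_subnetwork(graph, scores, seeds, k=20):
    """graph: nx.Graph; scores: dict node -> proximity score; seeds: iterable of nodes."""
    seeds = set(seeds)
    ranked = sorted((n for n in graph if n not in seeds),
                    key=lambda n: scores.get(n, 0.0), reverse=True)
    keep = seeds | set(ranked[:k])
    return graph.subgraph(keep).copy()

# Toy usage with made-up genes and scores.
g = nx.Graph([("DYSF", "CAPN3"), ("CAPN3", "SGCA"), ("SGCA", "TTN"), ("TTN", "MYH7")])
scores = {"CAPN3": 0.30, "SGCA": 0.20, "TTN": 0.05, "MYH7": 0.01}
module = top_scoring_subnetwork(g, scores, seeds=["DYSF"], k=2)
print(module.nodes(), module.edges())
```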
Now we want to use this knowledge to dig further into other types of myopathies, such as the dysferlinopathies, in which this time the same mutated gene leads to the two different phenotypic onsets in different patients: with the same mutated gene, half of the patients will have a distal myopathy and half will have a proximal myopathy, which is striking. We want to use this knowledge, together with exome data we have, to try to find modifier genes in these diseases. So this is an example of how we apply the random walk strategy to do guilt-by-association in rare diseases.

All the networks I have presented so far were multiplex networks, meaning they share the same nodes, but we want to go beyond that: we want to integrate many more networks, and much more heterogeneous networks. We want them to have different edges, as before, but also different nodes, not only genes or proteins: networks of drugs, networks of diseases, for instance. We also want networks that are weighted and directed, and we want to integrate everything into a common framework. We can do that with bipartite interactions: I have my different networks containing different nodes, and they are connected by bipartite interactions, meaning interactions between two sets of nodes. For instance, I can connect a gene network with a disease network, because I know which gene is mutated in which disease, so that is another network connecting the two; I can also connect the compounds to the genes, because we know which drug targets which protein. This type of network is a universal multilayer network, and we want to be able to explore it, which was not possible with the tools we had. So what we did first is to extend the random walk with restart algorithm so that it can navigate this type of universal multilayer network. This was not easy; it is a bit tricky because we have a huge combination of networks and matrices, and all the parameters to navigate between the different layers of a given multiplex network, between different multiplex networks, and so on. But we were able to do it, Anthony managed to code it, and it is available in a Python package called MultiXrank that can explore these very generic multi-layer networks.

We are very happy with that and we want to apply it to different contexts, but we had to do some tests first; when you develop a new tool in bioinformatics, you have to do some evaluation. The first test, and it is related to your earlier question, concerns the huge number of possible parameters: we wanted to know their impact. So the first thing we did was to explore the parameter space. For that, we used a combination of two multiplex networks and one monoplex (a simple network). We have the gene multiplex network, which corresponds to the networks I have been talking about from the beginning: protein-protein interactions, pathway interactions and molecular complex interactions. We have a drug multiplex network, with different layers of interactions between drugs, i.e. molecular compounds, for instance the fact that they can produce the same secondary effects, and this kind of relationship. And we have a disease network; here it is a monoplex, where diseases are connected by their phenotypic similarities. These three networks are connected together with bipartite interactions: the genes are connected to the diseases, the diseases to the drugs, and the genes to the drugs. We used one random node as a seed, and we tested over 100 parameter sets to see how the output scores behave.
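Here is a minimal sketch of the kind of heterogeneous multilayer object just described: each multiplex has its own node set, and bipartite edge lists connect the different node types. The identifiers (genes, diseases, drugs) are toy placeholders; in the real setting the layers come from interaction, pathway, phenotype and drug databases.

```python
# Hedged sketch of a "universal" multilayer structure: multiplexes plus bipartite links.
import networkx as nx

gene_multiplex = {
    "ppi":      nx.Graph([("DMD", "DTNA"), ("DYSF", "ANXA1")]),
    "pathways": nx.Graph([("DMD", "DYSF")]),
}
disease_monoplex = {
    "phenotype_similarity": nx.Graph([("DuchenneMD", "BeckerMD")]),
}
drug_multiplex = {
    "shared_targets":      nx.Graph([("drugA", "drugB")]),
    "shared_side_effects": nx.Graph([("drugB", "drugC")]),
}

# Bipartite interactions connecting the different node types.
bipartite = {
    ("gene", "disease"): [("DMD", "DuchenneMD"), ("DMD", "BeckerMD")],
    ("gene", "drug"):    [("DYSF", "drugA")],
    ("disease", "drug"): [("DuchenneMD", "drugC")],
}

for (src, dst), edges in bipartite.items():
    print(f"{src}-{dst}: {len(edges)} bipartite links")
```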
What we observed, looking at a principal component analysis of the output scores across these 100 parameter sets, is that changing the parameters does affect the output scores, but there are zones of stability where the outputs remain more or less the same even when the parameter set changes.

Next, we wanted to evaluate the approach on its final goal, which is mainly guilt-by-association in bioinformatics. For that we devised two strategies, an unsupervised one and a supervised one. The unsupervised strategy is a leave-one-out cross-validation: we pretend that we do not know the association between one gene and one disease, we leave out the gene, we use the disease as a seed in the random walk with restart, and we check the rank of the left-out gene. For instance, if we use only the protein-protein interaction network, about 20% of the left-out genes are ranked in the top 100 scores of the random walk with restart. If we use the gene multiplex network, composed of the three layers I mentioned, we increase to about 40%, and the best result is obtained when we use the gene multiplex plus the disease monoplex network together. It is a bit striking that when we use all the networks together, gene plus disease plus drug, the performance actually decreases a bit: it seems that the drug networks do not bring much signal for predicting associations between genes and diseases.

We made the same observation using a supervised classification for the evaluation. Here it is a kind of historical validation: we trained a random forest classifier on the network containing gene-disease associations from DisGeNET version 2 (2014), and we tested the classifier on DisGeNET version 7 (2020). In other words, we checked whether a random forest trained on a network built from old gene-disease associations is able to predict the gene-disease associations that were published afterwards. And we see that the F1 score of this approach is again better if we use only the gene multiplex and the disease network; if we add the drug multiplex network, the score goes down, so we recover the result that the drug multiplex network does not bring much signal in this context. In any case, using these two networks, we were quite happy to see that the classifier is able to predict new gene-disease associations.
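The leave-one-out protocol described above can be sketched in a few lines. The scorer passed in would be the random walk with restart on the networks built without the held-out association; here a dummy scorer stands in for it, so the function names and structure are illustrative assumptions rather than the published evaluation code.

```python
# Hedged sketch of leave-one-out cross-validation for gene-disease prioritization.
def leave_one_out(gene_disease_links, score_genes, top_k=100):
    """gene_disease_links: list of (gene, disease) pairs known to be associated.
    score_genes(disease, held_out): returns {gene: score}, computed WITHOUT using
    the held-out association (e.g. random walk with restart proximities).
    Returns the fraction of held-out genes ranked within the top_k."""
    hits = 0
    for pair in gene_disease_links:
        gene, disease = pair
        scores = score_genes(disease, held_out=pair)
        ranking = sorted(scores, key=scores.get, reverse=True)
        if gene in ranking[:top_k]:
            hits += 1
    return hits / len(gene_disease_links)

# Toy usage with a dummy scorer standing in for the real RWR on the multiplex network.
links = [("GENE1", "disease_x"), ("GENE2", "disease_y")]
def dummy_scorer(disease, held_out):
    return {"GENE1": 0.9, "GENE2": 0.1, "GENE3": 0.5}

print(leave_one_out(links, dummy_scorer, top_k=2))  # fraction of left-out genes recovered
```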
Let me also mention network embedding, which is a really trendy topic in network science. The idea of embedding is really in the spirit of dimensionality reduction, which you hear a lot about: project your network data into a low-dimensional space, and then use this low-dimensional space for many different applications, such as cluster identification or machine learning classifiers. It is quite convenient to have this low-dimensional representation, but it is not easy to represent graph data in a low-dimensional space, and there are different approaches for that. We proposed a new solution, but one that takes a multiplex network as input: it is called MultiVERSE, a network embedding method for multiplex networks. You start with a multiplex network, and you can also handle the multiplex-heterogeneous case, meaning a multiplex network connected to one other type of network; not a full universal network, but a connection to one heterogeneous network. We use the random walk with restart scores to obtain a similarity matrix, which we then use to do the embedding. We tested this embedding approach on the classical tasks for this type of tool, such as clustering, link prediction, node labelling and so on. Now we want to implement this with the universal random walk with restart strategy, and we would also like to assess the real added value of the embedding compared with directly using the scores computed on the network; we have different projects ongoing on that.
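A simplified flavour of the pipeline just described is sketched below: compute a node-to-node proximity matrix with the random walk with restart, then reduce it to low-dimensional node vectors. MultiVERSE itself optimizes the embedding differently, so the truncated-SVD step here is only a stand-in chosen to keep the example short and runnable.

```python
# Simplified sketch: RWR proximity matrix -> low-dimensional node vectors.
import numpy as np

def rwr_matrix(A, restart=0.7, n_iter=100):
    """All-pairs RWR proximity on a single (n x n) adjacency matrix."""
    n = A.shape[0]
    col = A.sum(axis=0); col[col == 0] = 1.0
    P = A / col                            # column-normalised transition matrix
    S = np.eye(n)                          # column j = distribution for restart at node j
    for _ in range(n_iter):
        S = (1 - restart) * P @ S + restart * np.eye(n)
    return S

def embed(similarity, dim=2):
    """Project the similarity matrix to `dim` dimensions with a truncated SVD."""
    U, s, _ = np.linalg.svd(similarity)
    return U[:, :dim] * s[:dim]

A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
vectors = embed(rwr_matrix(A), dim=2)      # one 2-D vector per node
print(vectors.round(3))
```

The open question raised in the talk maps directly onto this sketch: whether the extra embedding step adds anything over using the proximity matrix itself.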
With that, I have finished what I wanted to say about the mining of multi-layer networks, and I will jump to my second part, in which we still have networks, but we try to integrate quantitative data on the nodes of these networks in order to fish the interesting subnetworks out of the big, large-scale networks. This is the work of Elva Novoa, who is now a postdoc in Toulouse. Here is the objective, which is the objective of the whole field of bioinformatics dedicated to active module identification: you have a biological network and you want to integrate it with RNA-seq transcriptomics data, for instance a patient-versus-control differential expression analysis, to find subnetworks of interest, also known as active modules. You really seek subnetworks containing many differentially expressed genes, and this is interesting because you then have access to the function, the biological process, that is deregulated. This is not an easy task, because checking all the combinations in a big network is impossible, so there have been many different algorithms over the years; famous ones include PinnacleZ, which is based on a greedy search, and jActiveModules, which is based on simulated annealing. When we reviewed these different methods, we found that few of them consider the density of interactions: they try to ensure that the output subnetwork is connected, but they do not try to find subnetworks that behave like communities, sharing more interactions among themselves than with the rest of the network. In addition, the methods usually use only protein-protein interaction networks, and they had not been tested on RNA-seq data, because the papers are 15 years old. So we proposed a new solution, and this new solution uses a multi-objective genetic algorithm, which is a quite fancy class of algorithms that I also wanted to show you.

Let me guide you through the genetic algorithm protocol. In this type of algorithm, you start with an initial random population of solutions; in our case, a solution is a subnetwork, so we first gather random subnetworks from the big network. Then you rank this population of subnetworks according to your scoring system; in our case it is multi-objective, so we have two functions to optimize, which I will show on the next slide. Once you have ranked the population, you choose the parents and create the children from the parents, by combining two parents, with the constraint that the two parents share at least one node. We shuffle the two parents to create two children by crossover, and we also introduce some mutations, meaning adding or removing nodes. This creates a new population of children, so you now have a population of twice the initial size, and you rank this new population of parents and children again using the same two objective functions. Then you select your new population by elitism, keeping the best solutions, and you iterate: select the parents, create the children, rank again, select, and so on, until you reach a stopping criterion, which can be a number of generations. If you do that, little by little you explore the full network and converge towards the best solutions.

Everything here relies on the ranking of the population. In our case it is a multi-objective genetic algorithm, and we have two objectives to optimize. The first is the average node score, which captures the amount of differential expression on the nodes of the subnetwork. The second objective is the density of interactions: we want active modules that contain a lot of interactions, so we optimize a normalized density, a simple density of interactions computed over the different networks. We compute these two scores, and then we have to rank the population according to both of them; since we have two scores, we need a specific way of ranking, and it uses Pareto dominance. The idea is that if you plot the density of interactions (my second objective) against the differential expression (my first objective), each dot representing one solution, one subnetwork, then the first Pareto front contains the solutions for which no other solution is better on both objectives. For instance, if I take this solution here, I cannot find any other solution that would be better both in differential expression and in density of interactions. Having defined the first Pareto front, you can define the second Pareto front, the third, and so on, and this is how we rank our solutions, progressively increasing their quality and selecting the parents and the new population. At the end, the output of the algorithm is the final first Pareto front, the set of subnetworks lying on this front: for instance, one module that is a good trade-off between differential expression and density of interactions, but I am also interested in this other solution, which is not so good in differential expression but very good in density, and in this one, which is very good in differential expression but not so good in density. So the algorithm gives me this set of modules as a result. We implemented all this in a tool called MOGAMUN.
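To illustrate the two objectives and the Pareto ranking just described, here is a small hedged sketch. The scoring functions are toy versions (MOGAMUN's exact formulas, in particular the normalization of the density over the multiplex layers, differ), and the candidate values are made up.

```python
# Hedged sketch: two objectives and the first Pareto front over candidate modules.
import networkx as nx

def average_node_score(subgraph, node_scores):
    """Mean differential-expression score of the subnetwork's nodes (toy version)."""
    return sum(node_scores.get(n, 0.0) for n in subgraph) / subgraph.number_of_nodes()

def density(subgraph):
    """Fraction of possible edges that are present in the subnetwork."""
    n = subgraph.number_of_nodes()
    return 0.0 if n < 2 else subgraph.number_of_edges() / (n * (n - 1) / 2)

def first_pareto_front(solutions):
    """Indices of solutions that no other solution dominates on both objectives."""
    front = []
    for i, (f1, f2) in enumerate(solutions):
        dominated = any(g1 >= f1 and g2 >= f2 and (g1 > f1 or g2 > f2)
                        for j, (g1, g2) in enumerate(solutions) if j != i)
        if not dominated:
            front.append(i)
    return front

# Toy usage: score one candidate module, then rank three candidates.
g = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
scores = {"A": 2.1, "B": 1.8, "C": 0.4, "D": 3.0}
module = g.subgraph(["A", "B", "C"])
print(average_node_score(module, scores), density(module))   # ~1.43, 1.0

candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4)]             # (expression, density) pairs
print(first_pareto_front(candidates))   # -> [0, 1]; (0.4, 0.4) is dominated by (0.5, 0.5)
```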
Then, as I said before, when you propose a new bioinformatics algorithm you have to compare it with existing approaches, so we set up a benchmark to compare MOGAMUN to state-of-the-art methods: COSINE, which is also based on a genetic algorithm but with a single objective, PinnacleZ, and jActiveModules. To test the methods, we used a benchmark from Batra et al. (2017): we take one network and select a subnetwork of 20 genes, which we call the foreground genes, and we use artificial expression data to give differential expression to these foreground genes. We had different scenarios, which I will not detail, and we then measured how well the different methods highlight these 20 foreground genes. This benchmark is quite limited, for several reasons: it does not simulate the fact that the foreground genes should lie in a dense region, only that they should be connected; it does not consider multiplex networks; and it considers only one community to be retrieved, one set of foreground genes. But we had to do it this way, because the other methods are not able to use multiplex networks or to find more than one community, which are the kinds of specificity we implemented.

The results look like this: the axes are our two objectives, and each dot represents a solution from one of the methods. PinnacleZ finds a lot of different modules, some of which have very good values for density and average node score, but if we keep only the subnetworks containing at least 15 nodes, everything disappears: in fact PinnacleZ finds a lot of very small solutions, in many cases only two nodes, and two nodes with one edge gives the maximum density, so they look good but are hard to interpret from a biological point of view. On the contrary, jActiveModules and COSINE find very big networks, in some cases 800 to 1,000 nodes, which are also very difficult to interpret. Our approach found sets of modules of a more reasonable size, 15 to 20 nodes, with which you can easily go back to the biologists and discuss the interpretation. Of course, these two axes are our own two objectives, so the comparison is a bit biased towards our method, but we also computed the F1 score to see whether the methods were able to retrieve the foreground genes, and MOGAMUN gave a good trade-off, because its modules were neither too small nor too big.

We applied this approach, in collaboration with our colleagues, to a specific disease called facioscapulohumeral dystrophy, which is a very complex disease because it is caused by the hypomethylation of a specific genomic region; it is not one mutated gene, and the consequences of this hypomethylation are not yet fully described. Our colleagues conducted an RNA-seq study in muscle fibers derived from stem cells, themselves derived from patients, so we have RNA-seq data in patients and controls, and hence the differential expression. We combined it with molecular networks, ran MOGAMUN, and obtained on the order of 10 to 20 different active modules, such as this one. It was then easy to go back and discuss with them, and they picked out the modules whose functions they recognized or were interested in; in particular, they identified several genes involved in the regulation of calcium uptake by the cells. They then did live imaging on these muscle fibers to monitor the calcium fluxes in contracting cells, and they validated a deregulation of this process. We are now continuing the application of MOGAMUN in different contexts; the idea is that we extract, or help to find, the biological processes that are perturbed in RNA-seq data. So this was my second story, on active module identification.
Now we will leave, just for a moment, the world of networks to go to joint dimensionality reduction, because I think these approaches are really complementary to the ones I presented before. This is work done by Laura Cantini, my CNRS colleague at the École Normale Supérieure in Paris. Here the context is a bit different: we have multi-omics data represented as matrices, obtained on the same samples. For instance, the same samples can have transcriptome, copy number, microRNA and methylation measurements, so we have these different omics matrices. Joint dimensionality reduction approaches are a set of approaches, neither late nor early integration but somewhere in the middle, that try to find the joint signal by applying dimensionality reduction. We focused on approaches based on matrix factorization, where the idea is to factorize each omics matrix into an omics-specific matrix and a common factor matrix. Doing so, you obtain this common factor matrix, shared across the different omics, and you can then use it for sample clustering, or to find pathways and processes, and so on. It is really interesting because here the signal is driven by the different omics together.

However, there are many mathematical formulations of this matrix factorization, so we performed a large benchmark of joint dimensionality reduction approaches based on matrix factorization. We used nine different methods and implemented three different benchmarks, on simulated data, on cancer data and on single-cell data, and we evaluated the ability of the methods to find good clusters and to associate factors with survival, metadata and biological annotations. Everything is implemented in a Jupyter notebook, which means that if you develop a new method you can directly plug it in, apply the three benchmarks and compare it with the other methods; and if you have a multi-omics data set, you can plug it in and apply all the different methods to your own data. The results of this study were that if you are sure you want to do clustering analysis on your multi-omics data set, then methods such as iCluster, which are really dedicated to clustering, perform very well. If you do not really know what you want to do, maybe a bit of clustering but also some survival analysis and finding enrichments in biological processes, then you should use MCIA, which was very versatile and quite good in many different contexts, including single-cell analysis. A word of conclusion for this part: these methods are very good, but they are really data-intensive, so they require a lot of samples to work correctly.

So how do we apply these methods to rare diseases? Right now, we cannot, and that is why this is a blank slide; it is the future project of the team. We would really like to have approaches where we can apply multi-omics integration with matrix factorization to rare diseases, and we will do that with transfer learning strategies: learning the factors on big compendia and then projecting them onto small sample sets. This is relevant for rare diseases, but it is also relevant when you want to do personalized or stratified medicine and you have few patients sharing similar profiles.
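A simplified flavour of joint matrix factorization is sketched below: several omics measured on the same samples are standardized, concatenated, and factorized into one shared sample-factor matrix plus feature loadings. None of the benchmarked methods (iCluster, MCIA, and so on) works exactly like this; the plain SVD on concatenated blocks is only an assumption made to keep the illustration short.

```python
# Hedged sketch of joint matrix factorisation: shared factors across omics blocks.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20
omics = {
    "transcriptome": rng.normal(size=(n_samples, 100)),   # samples x genes
    "methylation":   rng.normal(size=(n_samples, 50)),    # samples x CpG sites
    "mirna":         rng.normal(size=(n_samples, 30)),    # samples x miRNAs
}

def joint_factorization(omics, n_factors=3):
    # Standardise each omic so none dominates just because of its scale or size.
    blocks = []
    for X in omics.values():
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
        blocks.append(X / np.sqrt(X.shape[1]))
    concat = np.hstack(blocks)                       # samples x (sum of features)
    U, s, Vt = np.linalg.svd(concat, full_matrices=False)
    factors = U[:, :n_factors] * s[:n_factors]       # shared sample-level factors
    loadings = Vt[:n_factors]                        # omics-specific feature loadings
    return factors, loadings

factors, loadings = joint_factorization(omics)
print(factors.shape)   # (20, 3): one low-dimensional representation per sample,
                       # usable for clustering, survival analysis, etc.
```

The transfer-learning idea mentioned for rare diseases would amount to learning the loadings on a large compendium and then projecting a small new cohort onto those same factors.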
So, in a nutshell, I hope I convinced you that data integration is really important to better understand biological systems; no single omics will save everything, we really need these different points of view. There are many different frameworks for data integration; I showed three, but it really depends on your question and on what you want to do, and you should find the framework that fits your data, not make the data fit your framework. For instance, if you do not have networks, or if it is hard to infer them, maybe it is better to go with a matrix-based framework and that kind of strategy. My final point is that more is not always better: we saw that with the drug network, and you can also see it with different omics. You may have to do a lot of initial exploration and statistical analysis of your data sets, looking at correlations, to be sure that it is worth integrating everything; sometimes it is not, and it really depends on your question. With that, I thank again all the team, and I will take your questions.

Thank you for this very exciting talk, thanks a lot, and I see a virtual applause coming in from the Zoom participants. Jan has a question, please go ahead.

Jan asks: Hi, thank you very much for the presentation. I have a very basic question related to MultiVERSE. Multiplex networks are composed of several layers, and I didn't quite understand whether MultiVERSE generates an embedding for each of the layers, or only one for the entire multiplex.

It generates one embedding for the entire multiplex, and that is why it is interesting: it is not an early or a late integration, but really from the beginning, because the random walks are computed from the multiplex network, so we have values for the nodes from the multiplex network and you can obtain a joint embedding.

Jan follows up: I see. And would that work with each layer being fully connected?

Currently no, because it is implemented on top of the first random walk with restart we had, but it is something we are in the process of doing: Anthony Baptista is implementing it using the universal random walk with restart, which can navigate any combination of weighted, directed layers, and once we have combined that with the MultiVERSE optimization, as a second step, it will be able to work on any multiplex network. Okay, thank you very much. Thank you, Jan. Giovanni is next.

Giovanni asks: Hi, thank you for the very fascinating talk. I have a quick question related to the networks used for these analyses. I wanted to ask how the quality of the networks you start with influences the results. Many molecular networks, or the data you can find, are quite noisy, since they come directly from experimental results, so you may have false positives or other quality issues. Would combining networks like these help to overcome the noise that comes from these experimental challenges, or would the algorithms suffer if a network with more noise is included?

This is a very important question indeed, because we do not want to sum up the noise, we want to subtract it somehow. We do not have any gold-standard network, but what we have shown, in particular with the community identification algorithm, is that combining networks tends to enhance the signal that is present in more than one network: it filters out what is found in only one network, and in particular what is found only in the very big and noisy co-expression network.
For the community identification in particular, at the step of multiplex modularity optimization, the modularity is in a way normalized per network size, so it lowers the weight of the very big and noisy network; but I agree that this assumes the very big network is the noisiest one. Maybe one more point on protein-protein interactions: the techniques for large-scale two-hybrid screens were introduced about 20 years ago, and initially the data were really full of false positives, but these data sets have become better and better; we do not have the same problems that were observed 20 years ago, and we can have good confidence in the data. This has been shown, not by us but by Marc Vidal and his team: when they took as a gold standard interactions published in the literature by different teams at small scale, they showed that the new large-scale two-hybrid screens are quite good. So I think combining these different interaction sources really tries to extract the signal that is present everywhere, and so to fetch more good interactions. Of course, at the end of the day, when we have a result that is biologically very important, such as when we work on the muscular dystrophies, before testing whether a gene is really involved in the disease we go back to the literature to see where the interaction is coming from; but compared with 20 years ago, when I started my PhD, when I do this literature check I now find good interactions more and more often.

Giovanni asks: Thank you. Do you think there is any way you could learn, during the process, the importance of the different networks? Could learn what, sorry, I didn't hear. The importance of the different networks, to upweight, for example, the less noisy ones.

Yes, it is something we did. We did it for the community identification, but also here: it is not presented on this slide, where I compare the protein-protein interaction network with the multiplex, but we also compared protein-protein interactions with co-expression and with pathways. For predicting gene-disease associations, the best single network was the protein-protein interaction network, then pathways, and then co-expression; this comparison of the different networks is in the first random walk with restart paper. What is interesting is that the co-expression network was very bad on its own at predicting gene-disease associations, but when combined with the protein-protein interaction network it still improved the predictions of the protein-protein interaction network, which means it still contains a little bit of signal. I see, thank you, that is very interesting.

Are there further questions? I have a higher-level question myself, which is a bit linked to what Giovanni just asked. In many of these approaches there are a lot of parameters to be set, and you also referred to the very popular task of node embedding, which is often done with deep learning these days. Which role do you see for deep learning in this field? Can you maybe distinguish between scenarios where you think it might be useful and scenarios where you think it might not be? So, how important is deep learning in this context?

Yes, it is quite a high-level question indeed.
We do have graph convolutional networks, which are a kind of network embedding, just another strategy. Here we do not have such big data sets, so it is not what we have used so far, even though, when combining different interaction sources, we do start to run into some computational difficulties. The main project we have that is related to deep learning is that we now want to integrate the different omics data with images. One very fancy piece of work here is from Caroline Uhler in the US, where they use autoencoders on different omics and on images at the single-cell level and try to project everything into the same low-dimensional space. It is something we would like to try with the data we have: multi-omics data, bulk in our case, and images of these muscle cells contracting, with the calcium fluxes. But I would first like to be sure whether autoencoders are really better than matrix factorization at projecting, or reducing the dimension of, omics data; I am not completely sure. Maybe, because they are nonlinear and that kind of thing, but we will have to test that.

So that was more from the omics side, the multi-omics side? Yes, it is more on this question, and we would like to compare matrix factorization and autoencoders for the same task. That makes a lot of sense; I see the potential there. With the graphs and the networks, it is always the question whether you have enough data. For the graphs, right now, I do not really see it, in particular because the random walk with restart is really efficient, really cheap to compute. And it is something we want to compare: the network embedding against the direct random walk with restart, because the embedding is costly, so what exactly does it bring to the table in terms of additional signal or filtering? That is what we have to compare now. That makes a lot of sense; I also find it much harder to see the deep learning application in the graph domain than in the dimensionality reduction domain. Yes, I think so.

Thank you. Are there further questions? Let me just check the chat; but we are also at the end of this hour. Thank you very much, Anaïs, for this excellent presentation, which was also very thought-provoking about the future of biological network analysis and dimensionality reduction; I enjoyed this very much. So we again send you a round of applause here, and we are also grateful that you are now taking another half an hour to meet the doctoral students of the network; this will be in a breakout room. The other PIs and myself will say goodbye here. Thank you very much, that was great. Bye bye. Thank you very much, thank you.