 So let's get started. Welcome back everyone from the break. It's a great pleasure to introduce our second speaker of today, Professor Christel Van Stijn from the University of Liege. Professor Van Stijn holds a PhD in Mathematics from the Hent University and also a PhD in Medical Sciences from the Hasseld University and Maastricht University. Since 2008, she is a professor at the University of Liege in Belgium and she's also an honorary guest research professor at the Hent University. And as all of you know, she's a valuable MLFPM partner, but Professor Van Stijn is also the coordinator of another ITM network called the Translational Systemics. And Professor Van Stijn is a leading expert in developing and applying methods to detect first of all, gene interactions but also gene-environment interaction. And she's also an expert in unifying biological and statistical evidence in genetic epidemiology. And today, her focus of the talk will be system analytic strategy in the framework of precision medicine. And with that, I would like to hand over to Professor Van Stijn. Thank you, Catherine, for the introduction. And indeed, today I'm going to take you on a tour regarding systems analytic strategies in precision medicine. And before I give some challenges and opportunities of computational efforts for precision medicine, I will briefly go over some concepts related to precision medicine and systems analytics. So I will show and motivate that integration and interactions go hand in hand. And I will also motivate a role of network analytics in this context. And in a second part, I will give real-life examples from the lab supporting this companionship between interactions and integration. And I will primarily focus on post-GWAS data following short courses that were given last year and the year before. And also give examples coming from microbiome and transcriptome contexts. So if we want to characterize precision medicine, what is it? Well, we know it's a medical model using characterizations of individuals' phenotypes and genotypes for a multitude of purposes. And it's not only about or the purpose is not only disease management in the sense of post-disease diagnosis follow-up, but also about personalizing risk assessments, maybe prior to disease onset, and delivering person-optimized diagnosis. And whether precision medicine definition then really entails it, that it's important to characterize an individual well. And to do so, you need data, lots of them, and ideally multifaceted. So offering multiple views on the individual. And this is challenging because data belong to an entire ecosystem of clinicians and researchers that surrounded the individual. And there are challenges and there are challenges on intellectual property, data, computational infrastructures, etc. And eHealth, therein, offers several additional challenges and opportunities for integrated precision medicine. For instance, digital therapeutics and that artificial intelligence that is based on guidelines, best practices, experiences of professionals, such that all the data that you collect can be translated into something useful or can increase insights or ameliorate personalized interventions. So what can integration bring us in the context of precision medicine? Well, when data are available on lots of individuals, it can bring more precise predictions of health outcomes for an individual. 
And data collected over multiple individuals also allows investigating similarities and dissimilarities across individuals so that you can identify disease subgroups or groups of individuals that, for instance, would benefit more from a particular treatment regimen than others. An increasing multifaceted data are collected over time. And it's more comprehensive than just demographic, clinical, genomic, and transcriptome data alone, also entailing microbiome samples from different body sites as shown in this study. But typically here is also that these study designs are often challenging in terms of integrative analysis. And so, for instance, in this study here shown on the slide, the participants visits are a mix of planned visits but also spontaneous visits. And the interest here was in the context of diabetes. And you see that in these spontaneous visits, you have a mix of non pre-diabetics and pre-diabetic individuals. So this is already challenging for the analysis. But also, of course, we're dealing with very heterogeneous data where each data type basically comes with its own added margins and sources of variations. Now longitudinal data collection is particularly interesting when you want to optimize precision medicine and then the sample size is one. So only one individual. And here it can help to detect disease early or to forecast progression as the individual serves as its own internal control. And early groundbreaking experiments in the cells include the efforts of Mike Snyder as in the Chen et al paper that I show on the slide. And so where multiple measurements were taken over time, over a period of 14 months, and using an integrated approach and real-time monitoring, Mike Snyder could actually observe the onset of his own type 2 diabetes. But he could then take actions accordingly. Now, of course, heterogeneity is still present like in the previous study. But what they did was actually to rely on Fourier spectral analysis so that he could somehow normalize the various omics data sets prior to trying to look for common patterns. Now I've mentioned before that phenotypes may be important, traits may be important. But what is an endotype? Because an endotype is something that you often hear in the context of precision medicine as well. But clinicians, broadly speaking, often classify patients into phenotypes, which comes from the Greek phenol, that means to show. And people, that means type. And that's why phenotype can be defined as the observable properties of an organism that are produced by the interaction of the genotype and the environment. And the term endotype was basically introduced in 2008 by Anderson in the context of, as Mike, I'm not mistaken. And it's a contraction of endophenotype. It's a combination of the Greek word endon, meaning within, and typos that we saw before. So it's defined as a subtype of disease defined functionally or pathologically by a particular molecular mechanism. So biological mechanism, that's the key here. And it can go beyond mechanisms that are based on genetics. So the goal in identifying endotypes is to really create homogeneous groups of patients that can be very interesting for all sorts of clinical studies like genetic studies or maybe drug trials. And to give an example, asthma endotypes may be broadly characterized into two groups, type 2 high or type 2 low asthma. 
And as you can see here from this slide, it's really still an umbrella classification where you can underneath have different manifestations of the disease and where it's very, very difficult to try to get to these underlying biological mechanisms that are relevant perhaps for precision medicine. Yet it's very important because particularly this group is suffering a lot. These patients have a very poor response to steroids and together with bronchodilators are key to the treatment of severe asthma. And they are also not optimal candidates for some of the newer medications that are around. Another example is shown here on the slide. So again, you have this umbrella behavior and your asthma is presented as a syndrome and a syndrome is a set of symptoms that are linked to specific pathogenesis, your asthma. And you can see here that it's linked to different phenotypes and that the phenotype comes perhaps with different endotypes. And actually you can also have an endotype that can be linked to different phenotypes. So it's a very complicated thing. So the identification of endotypes is really challenging. And I think we can make some advances here if we are investigating the dynamics of these endotypes. If we acknowledge that we are dealing with different systems that can be partially overlapping and perhaps that we also spent increasing amount of time on finding the relationships between disease endotypes and drug endotypes. Now just as a note aside, I was looking in the literature a few days ago. How many reviews do exist? And I was looking for integration occurring in the title and omics in the title abstract. And you see that you see this increase after 2012-2014. So more data become available, people are trying to develop methods to analyze the data and then there is an abundance of methods and you need really to have some overviews there. So it's all logical what we are seeing here. But of course not all data are equally informative and some are redundant. And of course the more data you collect, the more likely it will be that some items will be interrelated. So the challenge is to adequately describe the system of interest. And one of the basic principles of a system is that everything can be connected with everything else and that when describing the system you need to define some kind of a boundary within which you would like to be holistic in a sense. And you need to understand the behaviors of this system. But in a nutshell interactions are intrinsic properties of systems. And it's therefore also no surprise that systems blend themselves perfectly to graph representations and network analytics. So what can interactions bring us to precision medicine? Well what we have learned from over decades of work on genetic interactions is that it's extremely difficult to align analytic modeling with biological relevance or impact. And that can in part be explained by the observation that statistical interactions are based on populations usually involve averaging out of effects as you can see at the right. But in contrast genetic and biological interactions occur at an individual level and are therefore natural instruments to our understanding intra and inter-individual heterogeneity. Now if you look at the population level I will dwell upon this a little bit further in the next slides and what I always call an interludium. What is the role of interactions in these polygenic risk scores that you already saw during one of the previous summer schools in this network? 
Or what is the role of the environment? So this is here a slide on the polygenic risk score and actually this is a term that finds its origin in animal breeding. But what the plot on the left shows is basically a decrease in GWAS hit effect sizes and you can actually see that the odds ratio if you look completely at the right 2019 you see that the odds ratio is really jumping underneath this cutoff so to speak of 1.1. So basically we are approaching more and more towards this Fisher's infinitism model. And this has some challenges of course because how on earth are we going to assess the functional relevance of such an effect? It's just a small effect to the traits of interest. However it can also be beneficial because we will have a lot of these hits now. Each with a very small effect they all contribute to the polygenic risk score. The formula is on the top of this slide and jointly for instance we can assess whether there is an over-representation of pathway one or pathway two to get some further understanding about what is happening with respect to this disease. Now some people have been studying the incorporation of interactions into this scheme into this polygenic risk score and what you can see here is that if you are including interactions as well between these genetic markers between these snips that you typically deal with in a GWAS context you can actually improve the performance as you can see with the area under the rock curve. So if you look at the green line I hope you can see my cursor but if you look at the polygenic risk scores you see that the majority have an AU rock under 60 percent but as soon as you are also including interactions you see that this performance is actually going up a lot of times over 90 percent and you can very clearly see that or get some better intuition with this plot here. So the score that included interactions was actually based on mobile-based multifactored dimensionality reduction. 
This is a tool that we developed in house but without further ado I will just say that at the end you will have an organization of your of your two locus effects in the sense of whether a two locus combination is increasing or decreasing or not really important you know for the disease risk and if you convert these high low or redundant risks into minus one zero or one you can actually come up with a formula that looks very similar to the polygenic risk score and basically if you only have main effects even if you don't include interaction effects but you use this alternative recoding you can already increase the performance and obviously the amount of interaction effects is increasing you also see that the performance in general will be increasing so this is actually good news but of course we have more than gene-gene interactions we also have gene environment interactions and sometimes we also have interference of course or confounding factors to these polygenic risk scores if you would like to know a little bit more about that I'm referring you to this reference of Blank and Barry which is really an eye opener and give some food for thought so as an example of gene environment interactions and how this also may be very important to understand disease risk I give you the example with microbiome data so what we see here is different types of interactions that you can have between microbe and genetic information so maybe host genetics may directly impact you know the phenotype and that may perhaps you know also have impact on the microbiome but it could also be that host genetics is interacting actually with the microbiome and in that way regulates gene expression which then has an impact on the phenotype so there are several possibilities here and it's a challenge really to get a grip on systems like these and to then translate it to something useful for an individual in the context of precision medicine again a note aside I was looking for interaction reviews in a combination with multiomics and I couldn't find a lot so I just restricted to a number of publications where multiomic appeared at the title and interaction in the title and abstract and this is the situation then and it's a little bit you know mind-blowing only eight of these publications in 2020 but it actually shows that what's in a name of interaction maybe most authors are referring to it as association for instance between omics data sets are they talking about omics data interacting or features coming from different omics type interacting or more association now we'll come back to that later on because that's very important but what are the systems analytics what are the characteristics well basically on a nutshell I would say that integration and interactions go hand in hand and this is here an example of the complexity and interconnectivity of omics data sources in a multiomics framework this appears to be a well studied set of multiomics interactions but one can expect a more complex and unknown interactions while integrating multiomics data sets so I was looking a little bit into the literature and reviews on integration focusing on multiomics integration and really you find that these reviews have their own focal points quite often so the first one was basically overlaying a few methods with targeted application context and in this sense I gave an overview or you may have reviewed that actually focuses on how data are actually used others classify methods into early intermediate or late integration or maybe the 
nature of the method I mean is it a supervised method is it unsupervised seem a semi-supervised algorithm that is being adopted or you know bottom up and top down integration where the definition of bottom up and top down is not always that clearly explained and actually I have to say for a lot of the reviews that I went through there is still there are still a lot of questions that arise and where you would say yeah but you know this approach where you would you actually classify it so it's it's really not so so easy to box it in yeah this is the example you know on the the review that categorized on the basis of data use and you can see if you compare B and A for instance so with B one says okay you have these multiomics data sets and you first extract some data and then you do an integrated analysis then with the fusion methods it's you have all your data and you immediately are doing your integration analysis but okay for that last approach you probably have to be pragmatic or you do want to filter out right what is not redundant and in that way end up you know to select some features of interest so it is not so so easy and the same actually for the other reviews and what's the difference I mean correlation is also can also be seen as a similarity here on top of the of the slide you know there are Bayesian methods that are also multivariate so why do you make that distinction it helps if you are more targeted for instance what is a problem that you want to solve and that is actually more in line with what you what you would naturally do right you have a particular problem that you would like to address you collect data and then you find you know the method that most suits you to solve that problem and if it is not there you develop a method on your own or it can be actually data oriented I mean that you would like to focus on integration of non-omics with omics yeah it's also in this context that clinical transformation can be can be discussed or you can actually look at other organisms like microbes they are also interacting with the environment you have host microbe interactions and you have metabolites mediated networks and here you already see that I'm using this network paradigm what I particularly like here is this review for single cells where they basically talk or start from the concept of an anchor yeah so either you can have itself driven you know so the cell can be an anchor which is called like vertical integration you can have the feature taken as an anchor which would imply horizontal integration I was shown here that you have no anchor at all diagonal and where you are actually challenged with you know integrating this orange g matrix with the the green matrix with the the the blueish matrix so it leaves no doubt there is a a tower of bay babel or babel I'm not quite sure how you have to pronounce it in English that we have to face and it really depends on the community that that is working on that particular aspect on so for instance multi-view you often see that in machine learning papers or papers coming from the machine learning world multi-sources more related to you know the data types that are underlying etc etc so we decided to to work on a kind of not a classification but kind of an overview borrowing the strengths of a few of these review papers and that is work in progress but for this for the sequel of this presentation just focus on the upper right part here and we are dealing with networks which are actually multi-layered so for instance think about the 
nodes being genes and they come perhaps from different omics platforms and with this multi-layered visualization we can also look at interactions within each layer or between the layers okay just to be sure networks consist of nodes and edges but nodes can really refer to any biological feature such as microbial toxin gene expression as I will show later but also environmental exposure etc and edges are connections between nodes can be empirically or statistically derived or often just associations and they can be unweighted or weighted to reflect association strength undirected or directed to reflect cause and an effect and then these are a few examples and the ones that we will primarily be dealing with are these multiplex networks which are shown in B now this is a very nice review where actually you know networks can go beyond pairwise interactions by means of hypergraphs if you would like to learn a little bit more about that I invite you to have a look at that paper that was published in in 2020 and then we come of course to the analysis of these networks and there there is of course this increasing role of machine learning I just listed a few reviews here where they indicate and explain the increasing role of machine learning biological network analysis and more targeting deep learning and particular graph neural networks in 2021 by Muzio et al and these graph neural networks are very interesting because they actually act directly on the data that are represented as a graph on the graphical structure so on these interactions as well but there are some challenges especially if you would like if you're dealing with multi-layered networks the interaction heterogeneity is still an an issue sometimes the design of the the GNNs themselves and we should also not forget about the interpretation also which nodes and edges contribute to the results so these are fields of very active research in the in the community I would say now if you want to analyze network of course first you need to have it right you need to construct it so you need to select your nodes and you need to select your or construct your your your edges I will show what I mean by edges in the different context in part two now one way to to construct your networks if you have measurements like for instance gene expression data is to use to take one node as a target and then to see which other nodes are connected with them so which other nodes are basically predicting that target node and that could be done for instance with a with a lasso technique this is actually not new right lasso was used before in this sense by Mineshausen and Bulman in 2006 but we you know evaluated it a little bit in in in on synthetic data and also added a permutation approach to really select you know the best predictors in this via these lasso models on another competitor in that in that sense is actually hierarchical lasso and this is a very interesting tool as well because it does incorporate pairwise interactions to explain cases where two or more genes in my example are expressed together and capture non-additive contributions to the response when you see it's it's not so much behind actually the the lab net the tool that we developed but you know then you have a downplay of the the computation time this is typically what you see as well you may have a very cool method you know to integrate data or to detect interactions or to analyze systems data but yeah it's not feasible because it's computationally to to demand so there is always this 
pragmatic balance of unfortunately we still have to put okay so examples and last time so last summer school you have seen what you was are how you're analyzed and the post GWAS with GWAS outcomes were integrated with pathway information and network analysis of GWAS outcomes was presented and also some discussions about polygenic risk score so I'm going to take it from that point when moving to interactions with that GWAS data as a first example before moving to microbiome and transcriptome data so how to define the edges how to construct them well terminology is quite challenging in this in this context right so this is a definition of Wikipedia and I'm sure you've already read it by now it's not really informative I would say it seems to be a mixture of concepts you have a lot of questions that you can ask yourself well is interaction for instance not related to causality should it be related to causality a kind of action what do you really need okay so let's go actually to the genetics literature then and one of the definitions actually is coming from Bates who introduced the term compositional epistase in the early 1900s where it actually describes the situation which the effect of a genetic factor at one locus is masked by a variant at another locus so in this case in the example on the top you have actually that if you have at least one capital G allele at locus G you see that it really doesn't matter what you have you know at B the color is always gray now so here we say that G is episthetic to B it is not implying that B is episthetic to G so there is non-symmetry going on non-symmetry sorry going on soon as we go to a mathematical model we are more working with penetrance tables like you see at the bottom of this slide and penetrance really refers to the probability of developing the disease or the trait one zero given a genotype combination and an interesting model is actually the heterogeneity model as shown where an individual becomes affected through having a predisposing genotype at either locus now so this actually corresponds to a situation in which the biological pathways that are involved in the disease are influenced by the two loci but in an separate in an independent way but it can also be explained as some kind of a masking and therefore you could say that you can interpret it as epistasis as well um this is just you know these are just a few terms you know of for epistasis and each time referring to something different so if you follow these definitions you would actually have another interpretation of your edge and one should be aware of that so mechanistic is really you know coupling to biology fissures is more closely related to like what you would do in a linear regression model deviations from additivity in such a model statistical epistasis is what we nowadays always use for um you know whenever it has been derived from a model but actually the original definition comes from quantitative genetics where one really wants to see what the contribution is of epistasis of interaction actually to the variance of the trait and this is perhaps one of the most queer definitions ever essential epistasis is when you cannot remove you know this this epistasis by changing the scale of your of your outcome etc etc yeah so let's go down to the problem right because this is what's what we uh this is the natural process right what is the problem that we want to solve and the problem that we want to solve is the picture that we we want to grasp that picture that we see on the 
left of the of the slide so where you have some interactions possible between dna sequence variants that give rise to a particular phenotype the star in the in the slide but it's not a direct relationship it goes through a hierarchy of molecular um compounds or or molecules which may be physically interacting with each other and these physical interactions are shown with these dashed lines and it you you should remember that these type of interactions this interplay is really individual maybe individual specific so in that context if you understand what's going on there it may help a lot you know in these contexts of precision medicine prevention diagnosis or disease management that i mentioned before um but um the problem is of course that when you so want to solve the problem you you want to to model it somehow and you typically collect a lot of data and you work at a population level and there is this um non correspondence between one or the other as i mentioned before so we have been traveling along and winding road to kind of get an understanding about okay how do we approach the problem even i mean when the problem is reduced to only using the genetic information only using the SNPs actually from the g1s it's already a very difficult problem and um how can we increase our belief in these discovered statistical interactions from for instance jiva's data and to that respect you can always borrow a lot of information by looking a little bit outside your your your your own world in which you live and operate and the closest to to the world that i have been describing a few minutes ago is actually the world of genome white environmental interaction studies i think i have a common part which is the genetics part but there are some some differences there not in the least that there are much more errors or potential errors associated to data that are collected describing environmental aspects yeah but you find the same issues with the respect of terminology when it is when is the interaction removable when it is not removable what is the impact of scaling are we scaling the trace do we find interactions are they are they meaningful are they mechanistic you know this is really the same but what you can also learn from that is that even though it was very well known already in quantitative genetics you know sometimes you know people go through the maths again and make it really widely accessible to people this is exactly what happened here in this present in this paper of ashar 2016 where they actually you know kind of they simulated they simulated the situation where there was only a pure gene by environment interaction presence so no main gene effect no main environmental effect and what you can see on the plot is actually the decomposition of the variance explained and so the outcome variance is the target and how much of this variance is explained by the main effects and by the interaction so remember there's only an interaction no main effects but you see that you know the contribution of gene or or the environment can be really substantial yeah and that is because that parameter this beta g e so capturing the effect of the gene environment interaction is really occurring in the formulas of the the the variance explained by the the the singular effects yeah and this has been this has been taken as an argument okay why should we actually look at interactions because you know it's the you know the main effects really you know that that are that are the most important but it really depends 
on the problem that you would like to solve right here the focus is on prediction here the focus is on on you know risk assessment you know you would like to to have maybe some impact on policy changes here risk prevention and these type of things whereas in the GWAS actually the primary focus is often to just understand we're not just to understand molecular mechanisms you know and there it is there is still a lot of room to deal with interactions okay so and of course what I'm telling here you can extrapolate to the larger context as well when you're bringing in multi-omics into the picture as well but the other thing is that we need to make decisions about the unit of analysis you can take SNPs you can take sets of SNPs sets of SNPs for instance that are linked to the gene so older old like in the first the first line here on the on the slide or you can actually you know get some more graphical structure in that set by looking at the the non-independence of those genes so which we call LD if you remember from from last time linkage disequilibrium or you can even bring a third a third data type into the picture maybe gene expression data and you can look at you know the EQTL so you can look at the SNPs that are impacting the gene that gene's expression which may come actually from the quite different locations and maybe they're modifiers and in that way represent a gene by a SNPs that you depict as a graph and then you can start doing some cool things with it because you can define the question for a graph you can define a kernel on that graph you can actually do a kernel PCA to cluster and to have similarities between your individuals on the basis of that of that gene with that particular graphical structure and that of course may facilitate the analysis because you reduce dimensionality reduction maybe it's also enhances the interpretation and is also believed to increase replication so in this case your note would be you know a gene but really comprising a lot of information underneath it and this really paves the way for some some extras as well because sometimes you do not have other data than your SNP data and you can still create very interesting internet interaction networks between a gene you know upon which you can define your graph kernel by looking at for instance synergy using some algorithms to select the the most important notes for instance maximum relevance and minimum redundancy algorithms and then you know with the kernel PCs put them into the models for whatever you would like to do to end this example I would just like to highlight another important challenge in my opinion and that is of course reproducibility where methods and results reproducibility are you know they should really be be standard but where we also spend quite a lot of time if not like you know 60 or 70 percent of our time on inferential reproducibility where that really refers to okay understanding why a conclusion cannot be reproduced yeah and we focus on conclusions because it's that you know that is potentially get having an impact on on society or an impact for for medical purposes right so it's that conclusion that should be consistent and if we don't have it why is that is it because you know our new data set was too different you know from the one that we initially use or is it because of the methods that is being used what is going on and going through these processes really enables you to to better capture you know the data and what you can do with it and so one of the things is if you have 
these to stay in that context if he stays as networks you whatever method you if you take different methods regression and and a neural network a random forest I can guarantee you you know the results when represented as a network will look quite different yeah so what do you do next you know well you would like to have one conclusion or a few conclusions and in this context is that we were looking into some novel network aggregation methods now so this is the work of of Diane in the group who estimated who generated sorry some partial networks that reflect partial views of of so-called true underlying network mimicking actually that real life example and then you know trying to see okay how can we get an aggregated conclusion so one network with predicted links and she did so by having the solutions organized in columns of that matrix to the left and the edges as as rows and basically you have a matrix of ones and zeros ones when an edge was fine found in that particular solution with that method with that protocol etc and then you can do a cluster analysis or different types of clusters analysis on these edges and you also expect that the biggest group will be the group where there are no edges and this is then a way to come up with one solution where you can then you know do follow-up analysis and see whether some edges are more linking to particular pathways or not but of course we are using here we are generating one aggregate network and this may not be that clever yeah because some methodologies may give rise to so different results that it may not be that wise you know to lump it all together and so here we run into that concept of network similarity and how to how to define it in a proper way and there is this very nice paper of Tandardini at all this is a review which we actually took as a starting point a few years ago to think about these problems on where they really divide these methodologies into no node correspondence methods for instance like a delta com method or the unknown node correspondence methods like for instance a spectral methods we are in the situation of a no node correspondence methods right so we have the nodes will always be the same but the edges will be will be different depending on the view depending on the method that we use etc or multiplex network and this is then challenging because if you look into multi-layered network analysis and similarities there there seems to be a tower of label as well you know in in similarity measures and depending on the community you know how something is called and I like this paper here because here they make at least an attempt to look at some relationships between you know these measures that are circulating and coming from different works and they also try to give some guidelines about how to choose appropriate measures given a specific data set and they do that with the so-called property matrix so let us go to a second example enough about interactions let's move to a microbiome data and well microbiomes are increasingly a part of health related studies because for instance host microbiome interactions have already shown to be highlighting very interesting targets for new diagnostics and therapeutics and the data that we had was data from the lucky cohort study so it's an ongoing cohort embedded within the larger lucky cohort study where participants were enrolled in the south Limburg area via professionals but also via the internet and currently there are about 140 newborns and their parents enrolled and then 
there is a microbiome profiling done by next generation sequencing of 16s v3 v4 hypervariable gene regions variants were identified with an old pipeline and then centered lock ratio transformation were done on the data to to normalize the data now these terms may may sound a little bit strange to you especially if you have not been working with microbiome data before so let us go a little bit into more detail what the nature of the data are and so 16s sequencing why is that being done or why does it refer to well we're actually looking at the 16s rRNA gene yeah and this is actually a genetic marker that is very useful to identify or classify microbes yeah and why is that because it consists of highly conserved and hypervariable regions with which are denoted by v1 v9 so that explains already one part on the on the previous slides and then what you get out after such a sequencing is sequencing reads so these are strings of DNA sequence and then these can be analyzed with a bioinformatics pipeline what is also important in these contexts is a notion of OTU that you see at the bottom of the figure and this stands for operational taxonomic unit so basically that's an operational definition that is used to classify microbes based on sequence similarity on that on that marker gene so when there are some problems with microbiome data which makes them so interesting from a model developer's point of view but actually there are three potential biases there and the first bias is that microbiome data are compositional in nature so for every individual you will have a factor of positive real numbers yeah and all negative real numbers these are the counts yeah the abundances yeah so every element refers to an operational taxonomic unit and how abundant is that for that individual and the sum is constrained because the sum is determined by you know your sequencing depth yeah the number of reads so it means that basically you're dealing with an equivalence class yeah and the the factor that I just described is one representative of that equivalence class another unique representative is actually you know living on a unit simplex so we're dealing then with proportions and if you sum them they sum up to one yeah so there is a difference between these absolute abundances or relative abundances we should actually work with relative abundances if we want to compare individuals with each other this is called the compositionality problem or the compositionality bias and you will hear that over and over again and this is one of the most challenging aspects you know also to integrate machine learning into um into the microbiome world so for instance the standard log ratio transformation is one way to deal with compositionality but there are other transformations possible as well and not all of these transformations are giving equal uh you know comparable performance if you integrate them with machine learning tools like rental forest and there is really not so much research done when you integrate it with um neural network paradigms the other one is sparsity so you have a lot of microbes that are not occurring which could be true but it could also be because you don't have enough sequencing depth and of course again uh confounders, pure associations where an edge between two uh tuxa may actually be um false yeah and this of course is also something you would like to avoid because um yeah if you want to do analysis on the network and you can't trust the edge that's not so good this then leads to the the fact that 
you have to most of the time adapt standard approaches in the context of this compositional data like microbiome data i've already mentioned the centered log uh ratio transformation which basically this log of your account divided by a reference which is a geometric mean of the vector um so you define a new distance and this distance and and this way of dealing with compositional with the compositionality of the data is called a coda analysis a compositional data analysis and there are some key features actually of such an analysis that is that it needs to be scale invariant so it really does not matter which um element you you consider from the equivalence class it should give the same results permutation uh invariance it doesn't matter how you organize how you order these uh toxins or these microbes and sub compositional coherence very important because you want to do some feature selection and you don't want that um you know suddenly your uh your conclusions on those same microbes are going to to change okay so now let us look at real data uh analysis results so um what we have here on the on the plot to the left is basically within subject distance in microbiota structure between two subsequent time points yeah so we had all these babies it's microbiome data on the babies we use the correct distance the agents uh distance and then uh we look at these boxcloth and you see that there is a shift you know if you go from month six to month nine how could we potentially interpret this so this was done by our colleagues in in Maastricht by the way well because the diet changes a lot around that time you know between six and nine months and this is also reflected in this um difference in a microbiome constitution now what you typically do with these data is then to try to cluster the individuals on the basis of their microbiome compositions and a very popular technique there is diraclet multinomial mixture cluster and this is what you see at the panel to the uh right yeah these different ellipses yeah these are the different uh groups and then you can start doing some transition analysis to see okay do people stay in the same group in the first group or in the second group yeah and you again see that actually if you transit from month six to month nine that most of the kids are changing their group which is actually called in this community entero types and um with the exception of some kids in in this group too so this is really these are really interesting time points to consider so we consider these two time points and we were looking and okay how can we construct the edges and there is really a lot of material there advanced methods modeling conditional dependence i give some examples below in the machado reference review magma was not yet included but this is also i think a very promising tool to construct the edges but most people are still relying on some notion of correlation yeah it's easy to understand um but from all the methods that were listed in the review and that are correlation based are basically only two um that also handle the compositionality bias yeah and cc lasso was basically constructed for instance also to speed up the process because sparse is used very computationally intensive but that argument is now no longer valid because these days we also have the fastest bar which is basically sparse is see but with a fast implementation so the results that i show will be based on a fast spar but to just you know indicate to you how carefully you have to be and that you just 
can't use the classical measures this Pearson measure for instance have a look at this plot so you see all these edges there are positive and there are negative correlations the negative correlations are in red which are logical because if the abundance of one microbe increases because of this some constraint the others must go down right so you will have a lot of these negative correlations red ones and you see that for this micro for this taxon tree which is highly abundant in the data you see you have a lot of these negative uh associations and you could start making interpretations you know from these but it may not be such a good idea um because if you look at the middle panel these are reshuffled data so you kept actually the marginal abundances but you have um no associations between the between the microbes anymore so you see that a lot of these reds edges at the left are also appearing in the middle so basically these are spurious associations yeah and you see that a lot of those are gone in the sparse cc so just by following the principles of this coda analysis yeah and on top of it you see that you know some of the negative correlations may reveal and one of those negative correlations is shown at the right between tree and 148 it was completely blurred you know it didn't show up in the Pearson analysis so one association measure is no the other you have to think very carefully you know how to define your edge how to construct the edge because if you're doing your analysis on the network and drawing conclusions from that it's of utmost importance and this is then how a global network may look like at month six or at month nine reduced because otherwise you wouldn't see anything anymore so we made some reduction by having a threshold on the on the on the correlations and typically in this field one takes 0.4 or 0.5 yeah now so far I have been talking about global networks I've been talking about networks that have been constructed by pooling lots of individuals together and the weights were population based estimates now individual specific interaction networks are also networks with nodes and edges but specific to an individual and if you look in the literature for these individual specific networks then they are typically dealing with multiple measurements for the same individual the individual serving as its own control and in these scenarios quite often coming from neurosciences you will see you will actually have not so much problem in constructing a network it's quite similar as if you were dealing with lots of individuals and pooling across those individuals now networks can also be made individual specific in the presence of a fixed template yeah like a protein-protein interaction template for instance you take that fixed for all the individuals but you're superimposed individual specific nodes well these you know sends are also individual specific but it's not the ones that we are interested in these are by the way networks that are used post gboss analysis as you also saw in the previous summer school so what are we not interested in well we are interested in networks where the edges the edge weights are individual specific and how do we get to that well it's colleagues in in harvard who suggested one possibility and their formula which shown on the slide is built on repetitively leaving out one individual of the total sample and then capturing the influence this procedure has on the edge weights and then you reconstruct the individual specific edge weight such that in 
the event of infinitely large samples on average the population results are retrieved and one of my students Federico Milograna is working out different recipes machineries if you like to go from influence or perturbation to individual specific association strength and also assessing joint significance of network edges now I don't have to say that you know how potentially important this can be to have these individual specific networks think back of the the mini systems or the systems that I showed quite in the beginning of this of this lecture so all kinds of interactions that are happening at an individual level edges that are happening you know that are appearing at an individual level so trying to construct this is challenging but it can also be very rewarding yeah because you can actually draw interpretations directly from the network that is coupled to that individual rather than deriving it from you know a population-based aggregated network and then deducing it you know to the individual level and one way of making sense of these networks is by looking at for instance modules and here unfortunately we see that you know the validity or validating sorry validating the quality of a community from a real network is not receiving that much of attention and that is probably because it is also not so clear what good means and good what is good in one context maybe bad in another context as well now this is a food for thought for for another lecture so before we go into some opportunities let's make sure that we're all on the same page so basically we are now dealing with multiple measurements for an individual but related to an edge in a network rather than a note in the network but the data organization is pretty similar so it means that a lot of the methods that you would be using ordinarily on the notes can now be used on edges as well and you can actually you know look at you can use one of these methods which is basically perhaps a moderated t-test to compare the mean of edges between two time points one edge at a time and to look at the top 50 of these differential edges and then use that as a filtering approach to then revisit you know your network at a global network at the time point of interest this is what's happening at the left yeah and at the right that was what was shown before so in this way with with different selections of edges different filterations in a way you can get a different view you can get different understandings out of it okay so what are now additional opportunities well distinguishing between stable and unstable individuals across conditions based on individual specific networks can clearly further contribute to precision medicine optimizations and to make such distinctions in systemics analytics way we use individual specific neighborhoods so what we do in particular was we performed a network representation learning to map an individual's binarized multiplex network to a low-dimensional coordinate space using an encoder decoder neural network so what you see there on top of the of the plot are these is this multiplex network with green nodes and blue nodes so the blue nodes is is actually an individual specific network related to time point one and the green network is a network for that same individual right individual specific but for time point two so what we basically do is the inputs and outputs are binary factors that correspond to each node in the multiplex network for a single individual and the at the encoder side we use the information about 
immediate distance distance one to create the input factor and at the decoder side what needs to be predicted as a binary representation of more distance neighbors likely to be reached by a random walker and then the local structure that is learned of this multiplex uh network lives in this low-dimensional space in which you can then do further analysis such as like looking at the angle between these uh two nodes so basically your data is organized as such so that now instead of the edge weights the node weights you have this angle or the cosine of a cosine similarity or a distance between the nodes so for every pair of nodes you will have such an sorry for every node yeah you will have such an an angle when you compare it to the situation at the other time point and you can start clustering individuals on the basis of that and if you do that you find uh on our data two clusters now cluster one and cluster two and if you then look at the most and the least variable microbes in terms of their local neighborhood dynamics between the two time points you can get the plot to the right so you see that there are actually only three film involved here if you look at the most extreme situations and then there is even one setting so for clusters I don't see it because I see the the pictures okay so for cluster one and the least dynamic microbes in terms of their local neighborhood they really have a very pure answers that's that blue line that you can see here going all the way over the classifications like order class and film now the second opportunity lies in the fact that we still want to cluster individuals with respect to these individual specific networks but you know where we don't have repeated measures we don't have time for his data and here actually we also developed an algorithm that once you have chosen your similarity matrix between the networks and on your hierarchical clustering you can actually borrow some information from ecology to implement a distance-based ANOVA so instead of actually comparing nodes to an average you are looking at all pairwise combinations in such a framework this allows you to actually put a p-value to the branching off and basically gives you a strategy to determine the most optimal number of clusters for where you take the cutoff in this dendrogram so when you apply it to these data you have actually two clusters one at time point six and one at time point nine and sorry 10 groups at the month six and eight groups at month nine they can also be shown to be quite different from you know the standard approach and this Dirichlet multinomial mixture model but then you can start looking at also transitions to see in this case that basically all transitions occur in either this upper left part or the bottom right part of of the plot you can also look at you know how good that clustering is and whether you have some misclassifications as is being illustrated here in this instance where you have a negative silhouette value but roughly speaking it does very well what we are interested in actually is what are the main drivers of these clusters so for instance in month nine where we had eight subgroups you can actually look at these edges separately and you feed them into a predictor approach here it was just a simple conditional tree that we used because it gives such a nice visualization and you see that a few edges are really selected here with their corresponding thresholds to kind of differentiate between the eight subgroups that we are detected and these are 
now carried forward for a further analysis but it is clear that you know the impact of selection can be quite dramatic in both ways you know in a positive way or in a negative way and this is shown for instance here in this case of mode of delivery where we first used the prioritization on the individual specific edges with an algorithm called relief so it does use actually it's quite sensitive to interaction information and then we plugged it into the random forest prediction model down sampling to ensure similar number of observations in mode of delivery classes and by doing so then we got you know higher and higher areas under the curve so discrimination performance of course here what we are also looking at or what one should be looking at and this is a network on machine learning is network representation learning so we have been looking into the literature and this will obviously be our our next steps so this is one method mx gnn that was introduced in by liang and co-orders it was recently published and with rather good performance and if you can see here the number of graphs can be quite bigger which we also have in our sample right because for every individual we have a graph so what can these methods bring us what can these bring us to classify to group perhaps these these graphs and can it backwards learn us more about you know the impact of certain choices made all along from the pure data towards the end so in summary here and the edges either we can have and we can start making interpretations by looking at the edges at every individual on an individual basis and try to link it to clinical data and here yeah by doing so we could get acceptable discrimination performance but it's really you know a matter of how you selected the edges you know where your thresholds and of course you may have good discrimination performance but it doesn't mean it also you have well a good calibration if you work with the network as a whole or the larger a larger group of edges going beyond modules we can't find significance so maybe that is because yeah and you have so many choices actually for the kernel and they all look at very particular aspects of these networks and it is a challenge to find you know those aspects that can be most informative right for the problem that you would like to solve and here in this context of course network representation learning may be very useful as well but the common denominator is basically sparsification and I will come back to that later on in very short parts on the transcriptome data this is also unpublished um but we see that um you know gene expression data are routinely used to identify genes that on average exhibit different expression levels between for instance a case group and a control group but um this does not mean that the same genes are also perturbed in a single case individual because what we often see is that there is fairly little overlap between case subjects on the basis of their personalized perturbation profiles um I have had indeed a few additional slides and I'm I think that the slides will be shared as well and they give you some more background actually why or why not we shouldn't use same principles with microbiome data for transcriptome data so I like this publication a lot when at all 2019 who actually try to use or discuss you know some of this compositionality issues in the context of transcriptome data but the challenge that we are really here dealing with is how does this molecular heterogeneity then between the case 
The way Menche and colleagues do it is with a fixed template: the edges are not individual-specific, but you have scores that are individual-specific. And if you restrict attention to a pathway, indicated by the dashed lines at the bottom of the slide, you can see that at the pathway level, with these individual scores and the fixed edges, you can still see some heterogeneity between individuals. But we would like to make use of the individual-specific edges, and here is how we do that in practice. We look for node similarities, or modules, in the aggregate network obtained from all the individuals; one of these modules is highlighted in yellow here. It is a zone, consisting of nodes and the edges between them. For each such zone in the aggregate network, we now want to see whether the corresponding yellow zone in the individual-specific networks, going down the figure, maximizes or minimizes individual-to-individual heterogeneity. How do we do that? We first select a similarity metric between two individual-specific networks, again, but restricted to the yellow zones. That leads to an individual similarity network, shown in gray; the nodes there are individuals. What we then want to know is whether this is a very connected network or whether we see subgroups. The measure we use there is the Fiedler value: we use the algebraic connectivity of this individual similarity network to identify groups of individuals, shown here in that gray network as three groups, light gray, darker gray, and very dark gray.

If we apply this to our real-life data, this is what you get. You get these hot zones in red, and in the hot zones you see that two groups are distinguishable, a lighter red and a darker red. The cold zones, in blue, are depicted in the bottom part of the slide, and there you really do not see a clear separation between the individuals. So a cold zone is a zone that shows no evidence of heterogeneity between individuals, which corresponds to high connectivity in the individual-to-individual similarity network for that zone. The hot zones have potential consequences, because functional interpretations based on the all-sample aggregate network may need to be refined, or at least diversified, for some subgroups of individuals. So this is getting closer and closer to personalized medicine, and the different hot zones that you can detect in this way, prioritized by the modular structure of the aggregate network, really offer different views, as depicted on this slide.

Last but not least, you can also look beyond the spectral picture of a graph, and in that sense I would like to mention the notion of graph filtration. When you have a graph with an edge weight function from the edge set to the reals, a filtration is a series of monotone increasing subgraphs that defines a graph decomposition. As a weight function you can naively take the edge weights themselves: whenever you take a threshold, you can look at how many edges are still remaining in the network, and in this way you obtain the series of monotone increasing subgraphs depicted on the slide.
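Before moving to the descriptor functions, here is a minimal sketch of such a filtration under the naive edge-weight choice just described: sweep a threshold upward, keep edges with weight at or below it so the subgraphs grow monotonically, and record a descriptor at each step. Edge count and the Fiedler value (the same algebraic connectivity used above for the individual similarity networks) serve as example descriptors; the weights are simulated, and the exact construction in the talk may differ.

```python
# Filtration curve over an edge-weighted graph: threshold sweep plus a
# graph descriptor recorded at every step.
import numpy as np

rng = np.random.default_rng(0)
n = 20
upper = np.triu(rng.uniform(size=(n, n)), 1)
W = upper + upper.T                          # symmetric weights, zero diagonal

def fiedler_value(A):
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
    return np.sort(np.linalg.eigvalsh(L))[1] # second-smallest eigenvalue

thresholds = np.linspace(0, 1, 21)
curve_edges, curve_fiedler = [], []
for t in thresholds:
    A = np.where(W <= t, W, 0.0)             # keep only edges with weight <= t,
    np.fill_diagonal(A, 0.0)                 # so subgraphs grow monotonically
    curve_edges.append(int((A > 0).sum() // 2))
    curve_fiedler.append(fiedler_value((A > 0).astype(float)))

# Each individual-specific network yields such curves; averaging them per
# subgroup gives the separation described next, and the per-individual
# curves can be fed back into the distance-based ANOVA sketched earlier.
print(curve_edges[-1], round(curve_fiedler[-1], 3))
```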
What you can then do, with a graph descriptor function that focuses on particular attributes of the graph, is depict that graph in a different way. This was very nicely described and explained in a publication by Leslie O'Bray from Karsten Borgwardt's lab. We applied this to our data as well, on the two hot zones that I showed before. If you then average these filtration curves, taking on the y-axis not only the node degree but also the Fiedler value, you can clearly see the separation between the two groups. It is of course a challenge to put a p-value on this; that is work in progress, where we are looking at normalizations of the filtration curve and at extending suitable distances from point data to distribution data, amongst others. But you can also take the curves you obtain for every individual and again apply the distance-based ANOVA that I showed before, to really see how individuals cluster into homogeneous groups.

And with this I would like to end with some take-home messages, so that there is still some time for questions. Integration and interaction need to go hand in hand. Precision medicine benefits from longitudinal follow-up, but new avenues from machine learning should not be left unwalked. Novel developments are, I think, still welcome in the context of multiplex or multi-layered network analysis. Individual-specific networks are promising in the context of precision medicine and individual heterogeneity assessment, and hopefully can further complement standard analyses. An additional challenge is how to bring causality, or directionality, into the picture for these individual-specific networks. So thank you for your attention, and I have put up a slide with our European funding agencies.

Thank you, thanks a lot; that was very exciting and very informative, and I learned a lot. Are there any questions from the students? Check Slido. There are no questions in Slido. Okay, so while they are preparing their questions, I am going to ask one. I was wondering, with the microbiome data, how large are your data sets? Because as far as I understood, the microbiome is so variable across different people and different diseases, and then you have drugs and nutrition, so I always wonder how large the data sets are that you work with, and how large a data set has to be in order to apply the methods you presented. That should actually be a question for those who are handling the data on a day-to-day basis, but in principle any microbe that exists could appear; I mean, we are not working with microbes in the environment, we are working with the gut microbiome, microbes in the stool, and so on. But if you look at the diversity in the stool in general, these are all potential elements in the vector that you have for every individual. Not everything is observed, because of the sparsity and the sequencing depth, etc., but it can become quite large, and you may not have the power to draw conclusions at the level of individual microbes. That is why we have these operational taxonomic units, to make it more tangible. Good, are there any questions? Yeah, there is a hand raised,
Giovanni, please. Thank you for your talk, it is really fascinating. I have a quick question, and I apologize if it is something that I missed, but I wanted to ask out of curiosity: you mentioned multi-layer networks and multiplex networks quite a bit, but do you also include heterogeneous networks? Because I found it quite a challenge to include certain types of biological networks, like pathways. Yeah, I have not shown it here, because we are not doing it right now, but it is where we need to move to if we are integrating different data sources, especially those coming from the environment. That is what I would be looking at first: environmental exposures, putting them into that context, and how they may impact certain elements. We actually come from the GWAS field, integrating multiple omics on the same individual, and I see the microbiome as a very interesting playground on its own, but also as interacting with these host omics. That is where you get to these heterogeneous networks, and if you make some breakthroughs there, it will probably also be useful in a more generic heterogeneous-network context. Does that answer your question? Yeah, thank you. Great, are there any other questions?

So, in a nutshell: if you do network analysis, you need to select your nodes properly, and you need to spend quite some time on what you really want to capture with an edge. Once you know that, there are still a multitude of options for actually constructing it, and the construction methods really have to be driven by the nature of the data, as I have shown with the microbiome data, which are compositional in nature. Once you have done that, you can start analyzing your data, and that is where these individual-specific networks may offer an alternative perspective.

Good, so on that note, if there are no further questions, and I do not see any hands raised, thanks again for this fantastic talk; you also get a virtual round of applause. With that we are going to close the morning session, and we will reconvene at 1:30 for the afternoon session. Thanks a lot everyone, and have a good lunch.