Good afternoon everyone. My name is Kristel Van Steen, from the University of Liège and KU Leuven in Belgium. Our next speaker is Laura Furlong. She is an associate professor at Universitat Pompeu Fabra (UPF) and a Miguel Servet researcher at the Hospital del Mar Medical Research Institute (IMIM). She heads the Integrative Biomedical Informatics group of the Research Programme on Biomedical Informatics in Barcelona, Spain, and she combines her background in biology and bioinformatics to develop bioinformatics approaches to identify and understand the molecular mechanisms underpinning human diseases, also in relation to drug response. She is particularly active in systems medicine and systems toxicology, text mining, and knowledge management. She has a long-standing record of EU project involvement, covering the former FP7 framework programme of the EU but also the current Horizon 2020, including ELIXIR-EXCELERATE. So Laura, the floor is yours.

Thank you, Kristel, for the nice introduction, and good afternoon to everybody. My presentation today is titled "Enabling comorbidity analysis from real-world clinical data". As you may already know, we are facing an increasing aging of the population worldwide, and this has consequences from the health perspective: we are observing an increasing prevalence of chronic diseases, and with it an increasing prevalence of disease comorbidities. This affects how patient management has to be organized, and also the treatment of the diseases involved in the comorbidities. But let's start by having a clear idea of what we mean when we talk about comorbidities. There are different definitions in the literature, and which one applies depends on how we approach the problem of comorbidity.
Traditionally, the problem was approached from a perspective in which there is one principal disease and the other, concomitant diseases are secondary or less important, and the first definition reflects this spirit: in the earlier studies of disease comorbidity the focus was on an index disease, and the other diseases were considered concomitant or secondary to this principal disease. A second perspective is more focused on the individual patient and on all the diseases that occur in that patient, without giving special importance to one disease over the others. This is a more holistic perspective, and it is essentially the approach we use in our analyses. Notwithstanding this, for simplicity I will sometimes use the term comorbidity even though we are studying more than two diseases, that is, different diseases that may co-occur in a given patient.

Another important aspect to consider when studying comorbidities or multimorbidities is chronology: in which time window of the whole patient history we look to define the multimorbidity or comorbidity. As illustrated in this figure, the two disorders can coexist in time in the same patient, so the individual may carry several diagnoses simultaneously, or the diagnoses can be separated by a certain time frame, and in that scenario you might also have situations in which one disease is the consequence of the other.
So there are different combinations, but the important point is that you need to take these chronological aspects into account when you design a study and define what you mean by comorbidity or multimorbidity. In order to identify comorbidity or multimorbidity patterns in a population, time is really a key factor, and this is one of the main messages I want to convey today.

So why is it important to study the problem of disease comorbidities nowadays? As I said, we are facing an aging of the population, which is increasing the prevalence of comorbid diseases, and this of course has a great impact on healthcare. There is a need to address the problem properly, both to reduce costs from the healthcare perspective and to improve the quality of care given to these patients. It is also important to note that comorbidity is not an exception: most patients nowadays have more than one disease when they encounter the healthcare system. Moreover, by studying comorbidities we can gain a better understanding of the etiology of the diseases involved, identify patient subpopulations with similar characteristics, and even define new disease types, which is very important for the problem of disease classification. It also has consequences for finding more effective preventive approaches and for designing better and safer treatments. So I think the reasons for addressing this important problem in health are quite clear.
In our group we have been working for some years towards addressing this problem, developing different approaches and bioinformatic tools that can be used, on one side, to identify and assess the degree of comorbidity present in patient-level data. For that we exploit real-world data and also different types of omics datasets. We are at a moment in which we can access information describing the molecular features of patients, such as genomics, proteomics and so on, through different initiatives. On the other side, it is important to consider data captured from healthcare systems, and this is what I mean by real-world data: data obtained from the encounters of patients with the clinical system, the hospital, recorded in the electronic health records and other types of records collected during clinical practice. It has been demonstrated in other applications how much information can be extracted from real-world data, and how it can be applied to research but also to translational applications. With this spirit, our group has developed a series of tools to identify comorbidity and multimorbidity patterns from real-world data, listed here (I will later provide a more complete list with the URLs), as well as tools that help to gain insight into the molecular basis of the coexistence of diseases. Today I want to focus on our approach for analyzing temporal patterns in disease trajectories.

A little bit more about using real-world data, to mention some aspects that need to be taken into account: on one side this data is very rich and can provide a lot of information about real patients, but when working with it you have to be aware of some barriers, or aspects that require special attention.
One is that this data has not been collected for research purposes, so it is different from a cohort database, where specific information about the variables under study is collected with great care. Here you are working with data collected for other purposes: patient management, of course, but sometimes also billing. Knowing this, you have to deal with data incompleteness, and you will sometimes find errors or biases in the way the data is registered, which you need to understand in order to control for them in an appropriate manner. Another thing to take care of is how the data is represented. The information is encoded using standards, but depending on the data you work with, one standard or another will be used, and different standards are used for different types of information: diagnoses, medications, laboratory measurements and so on. You need to be aware of all these aspects to process your dataset appropriately. There is also the issue of how the database you are working with is structured, which is important when you plan to combine data coming from different databases or different hospitals.

Having talked a little bit about working with real-world data, let's move on to how we approach the problem of finding these trajectories. First, let's see how we can represent the information we can obtain from the electronic health records of patients from a hospital, for example, in which the different diseases are encoded: the different diagnoses that a certain patient received at different time points. This is something you can find in an electronic health record or a patient registry. The idea is to use this information to represent a single patient, and we are interested in how these diagnoses are assigned in time.
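The representation just described can be sketched in a few lines. This is a minimal illustration with made-up dates and diagnosis codes, not the registry's actual coding; it turns a patient's time-stamped diagnoses into an ordered disease history vector, keeping only the first occurrence of each code:

```python
def history_vector(records):
    """Turn time-stamped diagnoses [(date, code), ...] into an ordered
    disease history vector, keeping only the first time each code appears."""
    seen, trajectory = set(), []
    for date, code in sorted(records):  # chronological order
        if code not in seen:
            seen.add(code)
            trajectory.append(code)
    return tuple(trajectory)
```

For example, a patient coded 401 in 2008, 250 in 2010, and 401 again in 2012 would be represented as the vector `("401", "250")`, since the repeated 401 is dropped.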
So we define these disease trajectories as disease history vectors, in which each diagnosis is ordered according to the time of diagnosis. In the end we have a time series, an ordered sequence of diagnoses, and this disease vector represents an individual patient. If we have a database of patients, each patient is represented with such a disease history vector, and then what we would like to do is, for instance, to find patients that are similar across the whole database, or groups of patients that share similar disease trajectories. For that we perform pairwise comparisons between the disease vectors that represent individual patients, with the goal of finding what we call common disease trajectories: trajectories in which the sequence of diseases is shared among a minimum number of patients, say 10 patients or 100 patients. In this way we identify common disease trajectories that represent sequences of ordered diseases shared by a number of patients. If we perform this procedure for the whole database, we end up with a collection of trajectories, and the next step is to ask whether we can identify common patterns in our population. For that we apply a clustering algorithm based on the dynamic time warping approach, which allows us to identify patterns of commonality between our disease trajectories and assign them to different clusters. Note that our trajectories may vary in duration, in the diagnoses, and in the order of the diagnoses, and the clustering algorithm is able to assign trajectories of different duration and different length to the same cluster, considering the similarities between the individual trajectories. So let's talk a little bit about dynamic time warping.
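To make the idea of common trajectories concrete, here is a simplified sketch. The published method derives them from pairwise comparisons of patient vectors; this toy version instead counts ordered, contiguous sub-sequences of each patient's history vector and keeps those shared by at least a minimum number of patients, so treat it as an approximation of the idea rather than the actual algorithm:

```python
def common_trajectories(patient_vectors, min_support=10):
    """Count ordered, contiguous sub-sequences (length >= 2) of each
    patient's disease history vector and keep those shared by at least
    `min_support` patients. A simplified stand-in for the pairwise
    comparison used in the actual method."""
    counts = {}
    for traj in patient_vectors:
        subs = set()  # count each sub-sequence at most once per patient
        for i in range(len(traj)):
            for j in range(i + 2, len(traj) + 1):
                subs.add(tuple(traj[i:j]))
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}
```

With three patients whose vectors are `("A","B","C")`, `("A","B")` and `("A","B","D")` and a support threshold of 3, only the ordered pair A then B survives.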
Dynamic time warping is a powerful dynamic programming algorithm used for measuring similarity between signals or time series that can have different lengths or speeds. It has been successfully applied in many domains, from speech recognition to gene expression, and we proposed that it can be used to identify disease groups, and especially common disease trajectories. Dynamic time warping is a global alignment method: it calculates an optimal path between two sequences by minimizing the total distance between them. In this way it achieves a more intuitive alignment of the two sequences: it stretches or compresses sections of a sequence in order to find the optimal alignment, even if the sequences are somehow out of phase in the time axis. That is why it is a really suitable approach to apply to the disease trajectories I have already shown you.

Let's look at an example with disease trajectories to see in a bit more detail how the algorithm works. We have two patients represented by their disease vectors, and what we want to obtain first is the common disease trajectory, which is obtained by comparing, in a pairwise manner, all the diseases described in each trajectory. We do that using a local distance matrix that holds, for each pair of elements, so for each pair of diseases, the similarity between the two sequences. The distance metric used here for the diseases can be obtained with different approaches that I will explain a little later. We compute this local distance matrix, and dynamic time warping then works by finding an optimal path, shown here with the red line, that minimizes the distance between the two trajectories.
In the local distance matrix, lower values indicate more similar elements of the trajectories, so what we try to find is the path that minimizes the cost of comparing the two sequences. In practice, from the local distance matrix we obtain a total accumulated distance matrix; the value of its last element is the distance between the two sequences, and this is the similarity metric we will use later in the clustering. We subsequently calculate the optimal path between the two sequences and align them using this warping path. If you are interested, I can give you some pointers: there is a nice, very illustrative example of how the local distance and accumulated distance matrices are calculated, if you want to explore dynamic time warping a little more.

So once we have obtained our common disease trajectories, sorry, too fast, we end up with trajectories of different lengths: trajectories of length two, three, four, five, six, and maybe more. These are sequences of diagnoses shared among a certain number of patients, and they represent our database. What we want to do now is to answer the question of whether there are similar patterns of trajectories that we can identify, and for that we apply an unsupervised clustering approach based on dynamic time warping. Let me explain how it works: we have our collection of common disease trajectories, and we proceed iteratively to obtain our clusters.
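The accumulated-distance computation just described can be sketched as follows. This is the generic textbook formulation of dynamic time warping, not the group's actual implementation, and `local_dist` stands for whichever disease-disease distance metric is plugged in:

```python
def dtw_distance(seq_a, seq_b, local_dist):
    """Dynamic time warping between two sequences of possibly different
    lengths. Fills the accumulated cost matrix D; the last cell D[n][m]
    is the total distance between the sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]  # padded border
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_dist(seq_a[i - 1], seq_b[j - 1])
            # extend the cheapest of the three admissible predecessor cells
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

On numeric toy sequences with `local_dist = lambda a, b: abs(a - b)`, the distance between `[1, 2, 3]` and `[1, 2, 2, 3]` is 0: the repeated 2 is absorbed by stretching, which is exactly the out-of-phase behavior described above.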
The first trajectory of the collection is assigned to the first cluster. Now imagine we have repeated the process several times and we already have three clusters; as you can see, each cluster contains a set of common trajectories that can vary in length but share some properties, such as certain diagnoses and combinations of diagnoses over time. We want to cluster the next trajectory, so using dynamic time warping we compute the similarity between this common trajectory and each trajectory of a cluster, and we repeat the procedure for all the clusters we already have. We then take the average distance between the trajectory and all the trajectories of each cluster, and we find the minimum, that is, the cluster that is most similar to the trajectory. Then we use a threshold, which has to be defined heuristically, to decide whether to assign the trajectory to this cluster: if the average distance is lower than the predefined threshold we assign it; if it is higher, we create a new cluster containing this trajectory. This process is repeated until we have allocated all the trajectories to clusters. As I mentioned, the threshold has to be selected, and this is an iterative process to find a configuration in which the clustering is not very fragmented, but in which we also do not end up with clusters that are so big, containing so many trajectories, that they do not provide a proper separation of the dataset. So this is how we apply dynamic time warping to the trajectories. I have explained how we compare trajectories, but at some point we need to define how we compare diseases individually, that is, which similarity metrics we use for the diseases.
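The iterative assignment just described could look roughly like this; it is a sketch only, with the DTW distance abstracted into a `distance` callable and the heuristic threshold left as a free parameter:

```python
def cluster_trajectories(trajectories, distance, threshold):
    """Assign each trajectory to the cluster with the smallest average
    distance to its members, or open a new cluster when that average
    exceeds the (heuristically chosen) threshold."""
    clusters = []
    for traj in trajectories:
        best_idx, best_avg = None, None
        for idx, members in enumerate(clusters):
            avg = sum(distance(traj, m) for m in members) / len(members)
            if best_avg is None or avg < best_avg:
                best_idx, best_avg = idx, avg
        if best_avg is not None and best_avg <= threshold:
            clusters[best_idx].append(traj)  # join the closest cluster
        else:
            clusters.append([traj])          # open a new cluster
    return clusters
```

The threshold directly controls the trade-off mentioned above: a low value fragments the data into many small clusters, a high value merges everything into a few large, poorly separated ones.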
In this regard, we can relate diseases from different perspectives, because diseases themselves can be defined or described in different ways. In the clinical setting, diseases are described and identified by signs and symptoms, and ultimately they receive a diagnosis code; the diagnosis codes have meaning and are organized according to terminologies, so we can use this clinical perspective. We can also describe diseases from the genetic perspective, using what we know about their genetic underpinnings, and we can describe diseases through the signs and symptoms by which they manifest, that is, their phenotypic manifestations. In our work we use these three perspectives; of course there are other perspectives, or other datasets, that could be used to describe diseases.

For the perspective that uses the diagnoses, so the definitions of the diseases, we derived a similarity metric between diseases based on the terminologies, that is, on the hierarchical organization of the concepts in these terminologies. For this we use the Unified Medical Language System (UMLS), which contains a Metathesaurus of concepts covering all the domains of interest in biomedicine; in particular it has subdomains pertaining to diseases and includes a really large collection of disease terminologies. Each concept, representing a single disease, is organized in a hierarchy of is-a relationships, so a particular disease is related both to a more general disease description and to similar disease descriptions. We can exploit this organization of the terminology to address the similarity of diseases in a more semantically aware way: if we compare, for instance, two diseases that both belong to the cardiovascular system, although they are different
diseases, we will assign them a higher similarity value than if we compared a cardiovascular disease with a neurological disease. This is a representation of the concepts in the hierarchical structure provided by the UMLS, and we apply metrics that have been implemented to exploit the topology of the ontology, or terminology, in order to assess the similarity between concepts, using information content metrics that explore the depth of the ontology.

The second metric, based on genetic information, relies on obtaining lists or sets of genes associated with the diseases we are studying. For that we use the DisGeNET platform, a platform developed by our group that integrates information on genes and variants associated with human diseases. In this way we can obtain, for each disease of the trajectory, a list of genes that have been studied in association with the disease, and we then apply the Jaccard index to assess the similarity of the diseases based on these gene sets.

Let me bring in a question asked by several attendees: for finding common disease trajectories, do you incorporate similarities between unequal diseases, and how do you handle the time difference between the diseases? Well, yes. I am explaining now how we compute the similarities; we use the three different metrics, of which I still have to explain the third, and they decide whether two diseases are similar or not. As for the time, the time factor is considered when we align the entire trajectories: the distance between diseases is used to build the local similarity matrix that compares two disease trajectories, and then the time dimension comes in when we compute the similarity between common trajectories, and this is used for the clustering.
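The gene-overlap metric mentioned above is straightforward to sketch. The gene symbols below are made up for illustration; in the actual pipeline the gene sets would come from DisGeNET:

```python
def jaccard(genes_a, genes_b):
    """Jaccard index between two diseases' gene sets:
    |intersection| / |union|, ranging from 0 (no overlap) to 1 (identical)."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 0.0  # treat two empty gene sets as having no similarity
    return len(a & b) / len(a | b)
```

For example, two diseases with gene sets {TP53, BRCA1} and {TP53, EGFR, KRAS} share one gene out of four in total, giving a similarity of 0.25.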
Okay, thank you. I hope this clarifies the questions; if not, we can comment more on this later. So, maybe to recap: we have this whole process to obtain the common disease trajectories, and once we have them we want to cluster them. A very important point is how we define the similarity between diseases, and we implemented three ways to compute it: one based on the diagnoses and the meaning of the diagnosis codes, one based on the genes, and one based on the phenotypic descriptions of the diseases. For the last one we use the Human Phenotype Ontology, a resource that provides phenotype descriptions for diseases, so each disease can be represented by a vector of phenotypes. In the end we assess the similarity between two diseases by comparing their phenotypes, and we also exploit the ontology that organizes these phenotypes according to is-a relationships when computing the similarities, using the best-match average approach to combine these two pieces of information.

To sum up at this point: we obtain the common disease trajectories as I explained before, we use dynamic time warping for the clustering, and we use three different metrics for the similarity between individual diseases, approaching the problem from different perspectives: the meaning of the diagnosis, the genetic information, and the phenotypic information. What I want to show you now is an example of the application of the methodology, how it looks and what we can actually achieve. The example is in the area of cancer, in particular prostate cancer, and we did a
study to identify the comorbidities of prostate cancer, using a health registry from a region of Spain. The first step is to identify the patients, our index patients, diagnosed with prostate cancer; this is the three-digit code according to the ICD-9 terminology, which is the one used in the registry. We ended up with 21,000 patients diagnosed with prostate cancer, and we then applied our approach to identify the common, shared disease trajectories. This is a summary of the trajectories identified from this registry: we identified trajectories of length two, that is, with two diseases, three, four, five, up to six, and these are the numbers of trajectories in each category, which sum up to a little over 2,000 trajectories in total. This is the set we then used for the clustering algorithm with the three metrics I have already described for the dynamic time warping. Here is a short summary of the results: for the semantic, or clinical, metric we obtained 86 clusters, 41 for the genetic, and 82 for the phenotypic, and these are the mean numbers of patients and of trajectories allocated per cluster in each configuration.

Let's explore some of the results. Here I show some of the most populated clusters in terms of patients. The first one is what we call the metastasis cluster, made up of 6,500 patients. It contains trajectories mainly of length two which, as you can see, start with prostate cancer and then progress to metastasis in different organs and systems, and this trajectory is the most populated one in the cluster, with the largest number of patients. Another interesting cluster was the one we named the COPD, chronic obstructive pulmonary disease, cluster, which contains a first diagnosis of prostate
cancer followed by different diseases that are part of the COPD definition; again, the trajectory with these two diseases was the most populated in this cluster. It is important to note that COPD is a respiratory disease for which smoking is a risk factor, that smoking is also a risk factor for prostate cancer, and that COPD is a predictor of mortality in prostate cancer patients. The next cluster contains diseases related to neurodegeneration, which was also quite interesting due to the large number of patients allocated to these trajectories and to this cluster. And finally, there was a cluster containing several different types of cancer that then lead to prostate cancer, where the trajectory starting with bladder cancer was the most populated. These clusters were extracted from the configuration in which we used the clinical, semantic information of the diagnoses, so you can see that we can organize the trajectories into different clusters and identify different patient subpopulations that follow a different temporal order of diseases.

Another example I wanted to show before finishing is a set of clusters with similar diseases obtained using the different measures, the clinical, the genetic, and the phenotypic. If you compare these two clusters, for instance, they contain similar diseases although they were obtained using different metrics, but there are some diseases that are not shared. This has several explanations, having to do with the genetic description of the diseases or with how the disease codes are defined and organized in the hierarchy. But the fact that some associations are found across the different configurations is also supportive of those associations. It is also interesting to note that when you
use the phenotypic similarity metric, the cluster is larger in the sense that it contains more different trajectories, but most of them are also somehow associated with kidney problems.

To summarize the main conclusions of what I presented today: we have developed a systems medicine methodology for identifying disease patterns from patient disease trajectories; the approach studies associations, or similarities, between diseases from three different perspectives; and, importantly, it takes the dimension of time into consideration in the clustering algorithm. I also want to stress that this clustering approach based on dynamic time warping, which as you could see is an unsupervised learning approach, can be used not only for the case I presented, disease comorbidity and disease trajectories, but also to classify a heterogeneous patient population into different groups using other types of data associated with the population, and I encourage you to try it in your own applications. The work I have presented was published two years ago; it was mainly done by a postdoc in my group, Alexia Giannoula. I have also presented some results from a manuscript currently under review, and the code is available in this GitHub repository. Finally, we have other tools, as I mentioned before, that can be used for the study of comorbidities, listed here. With that I finish the presentation, and I am happy to take more questions. Thank you very much.

Are there questions from the audience? We do have a few questions via Slido. Does the warping method assume that the two trajectories being compared span the same time frame, or can gaps be introduced at the starts and ends? The trajectories do not need to have the same length, and this is a really important point: you can cluster or compare
trajectories of different duration. In fact, the dynamic time warping algorithm is able to deal with these differences in length and in the elements, which in our case are the diagnoses but could be any elements making up the signals. It does this by stretching or compressing the signal, so you do not actually introduce gaps as in other algorithms; you really stretch or compress the signals in order to align them. So yes, it is very good for these kinds of signals that differ in length.

And does the algorithm allow for multivariate time series, so multivariate responses? Well, this is a good question. We haven't tested it, but I think there are examples in other areas where it has been applied to multivariate series. We haven't tested it, but it is really a very interesting question.

Indeed, and I think there is also a question from Thomas Gumbsch. Yes, hello, thank you for the talk. I have a question about the representation of the disease trajectories. You mentioned it is a time series, so if I assume a patient is in a hospital and receives several disease codes during the stay, then there is a gap, and then another hospital stay: when comparing this disease trajectory with another patient's, would the last disease code from the first hospital stay then be forward-filled for the entire gap? Because then that disease code would have a higher weight compared to the other disease codes the patient got during that hospital stay.

I don't think I quite get the question. We don't assign weights to the diagnoses. The only thing we actually do is make an assumption in our method: if a diagnosis is repeated over time, we only take the
first time the diagnosis was made. So if the same diagnosis is assigned again later, and this happens, we only keep the first occurrence. But I think this doesn't answer your question. Yes, from my understanding, if you compare two time series with dynamic time warping, the time difference also matters: there is a time difference between the last disease code of the first hospital stay and the first disease code of the next hospital stay, and that is also part of the dynamic time warping difference to another patient trajectory. So the order in which the disease codes are assigned during the first hospital stay matters in that sense, right? But maybe this is too detailed, I'm sorry. No, no, this is important, and it is something we also discussed internally. The algorithm, in the end, tries to match the diagnoses irrespective of the time gap between them, and sometimes this time gap is something that would be interesting to use. It tries to map them irrespective of the time duration. All right, that answers it, it is an interesting point, thank you.

Right, I have another question here: this warping algorithm struggles with performance on big data; could you explain how you managed to handle it? Well, we didn't have many problems with big data. This example is not a really large database, but in our publication we have a population of 500 patients, and we managed to compute the trajectories on a normal workstation in a reasonable time. The step that is computationally most demanding is the first step of comparing the individual disease trajectories pairwise, but with current computers this is not a problem.

We also have a question from Zuki Lee. Hi Laura, thank you for the insightful talk. My question is: do you consider integrating the three similarity metrics, and if not, how do you choose among them in practice? Well, this is a
really nice and interesting question, and something we are really struggling with: how to combine them. That would be the really interesting thing, to combine the three, or even more, into a single clustering configuration, but we haven't come up with a nice solution yet. It would be the best, because right now we have three different cluster configurations, no one of which is better than the others, and each has its limitations, so it would be good to combine them in order to overcome the biases and limitations of each metric. Okay, thank you.

Has it been investigated to what extent the ordering in a trajectory has an impact on the final results? So the actual question was: does the outcome of the warping clustering depend on the order in which the comparisons are carried out? We haven't thought about this, no.

Karsten, would you like to say something? Thank you, Kristel. Laura, thank you for a very interesting and stimulating talk. I have one question about this clustering on different modalities, or different views of the data. Clusterings often end up in local optima and are very parameter sensitive, so while I fully trust your results, I think that if the field at large does these kinds of analyses, there is a danger that people tune the parameters and the initial configurations of the clustering until there is some overlap between the different modalities. Do you see a systematic, proper way to avoid this kind of fitting of the clusters?

Yeah, thank you for the question. Because of time I didn't show something interesting that relates to your question: we actually implemented clustering evaluation metrics in order to assess the homogeneity of each cluster in terms of the trajectories that end up in it, comparing how similar the trajectories within the cluster are to each other, and we
have a similarity metric in this sense, and another similarity metric that is more focused on each trajectory, to find out whether, within an individual trajectory, the diseases have something to do with each other or not. So yes, we implemented metrics to evaluate the clusters, and in fact we select the threshold for the clustering algorithm taking these metrics into account for each of the configurations. As you know, fine-tuning the parameters of a clustering is something quite tricky, but we implemented these metrics in order to select the threshold in a more objective manner and also to evaluate the homogeneity of the clusters we obtained. Thank you.

Okay, thank you very much. Thank you again, Laura, for this excellent talk, and to all who participated in asking questions; it was a nice discussion, I think. Thank you.