Hello everybody. I'd like to welcome you to the virtual SIB Computational Biology seminar series. As you know, the idea is to bring together computational biologists and to broadcast these seminars in order to share knowledge and expertise from the different groups at SIB, as well as from invited speakers. So this talk will be broadcast, as you know, and today we have the pleasure of having Mark Ibberson. I'm going to say a few words about him. He studied genetics at Nottingham University in the UK and obtained a PhD in genetics from Imperial College London in 1995. Then he moved to Lausanne for a postdoc in Bernard Thorens' lab, where he already worked in the field of diabetes. Then in 2000 he joined a biotech company, Serono, now called Merck Serono, as a bioinformatician, where he worked on drug discovery projects and on developing methodologies for sequence analysis and research knowledge management. In 2010 he joined SIB in the Vital-IT group, where he works now, and he is involved in large-scale genomic data analysis and systems biology for IMIDIA. IMIDIA is an Innovative Medicines Initiative (IMI) project on type 2 diabetes, and he also works on other projects related to metabolic and cardiovascular disease. So Mark, the floor is yours. Thank you very much, Diana. Thank you for inviting me. I didn't get much choice, but it's a pleasure to be here. I'm going to be talking to you about three things: systems biology, networks and diabetes. I picked this broad title because, when I was asked for the title of the seminar, I really had no idea how I was going to present what I wanted to present, but I had three things in my head: systems biology, networks and diabetes. Of course, all three of these areas are huge, so in this talk I'm only going to be talking about very, very small parts of the intersection between them. So what will I talk about in more detail?
I will first talk a little bit about systems biology, for example what it is and what it means to us in Vital-IT. I'll talk a bit about networks and some network concepts, and also how we can use biological networks. Finally, for the main part of the talk, I will talk about how we are using network-based analysis in a systems biology project on diabetes. So when I was looking for slides to introduce systems biology, I found this one from the Institute for Systems Biology, which I really like because it's called the research trinity. We have three different areas: biology at the top, technology of course, and computation. Hopefully we start from a biological question when we do systems biology, and we use new technologies to try to answer these biological questions. These new technologies, because they produce a lot of data, require a lot of computation. Then hopefully, when we do all the computations and analysis, we will come up with new hypotheses after some time and thought, and this will then feed back into the biology. Each cycle of this process will actually generate new technologies, new software from computation, and new hypotheses and new insights from biology. Some of the systems biology projects we are now running or involved with at Vital-IT are the following. I haven't written them all down here, but we have IMIDIA, which is a project on diabetes that I will focus on in my talk. We have AgedBrainSYSBIO, which is a project on Alzheimer's disease. We have SysVasc, a project on vascular disease. We have a couple of SystemsX.ch projects still running, for the Swiss initiative in systems biology; these are HostPathX and LipidX. We also have projects such as Infugus, which is a drug repositioning project in cardiovascular research, and also Sinergia BXD, which is a project looking at the systems genetics of metabolism. So what happens often at the start of some of these projects
is that the project manager or the project coordinator is not a bioinformatician or is not an analyst and, as this caption says, he is saying in front of all the other people in his team: "Let's try to solve this problem by using the big data none of us have the slightest idea what to do with." I wouldn't say it's that bad when we look at our biological projects, but sometimes, when I'm in the first meeting of a project, I get the impression that the people running the project don't really understand what's involved in analyzing the data, and they really don't have that many questions. So it's really up to us to look at this data, to analyze it as best we can, and to feed the results back to them in a way that they can actually interpret and use to drive the project forward. I'm going to talk a little bit more about networks, give you some very simple concepts on networks, and also talk about how we can use biological networks. Just before getting into the networks, I want to come back to this idea of visualization. In fact, visualization is not a recent thing. This is an example of the power of visualization from London in 1854. In 1854 there was a cholera epidemic in London, and one of the physicians living in London at the time was treating a lot of patients. He took the initiative to draw a map of London, or at least of all the areas which were affected by cholera, and he put little points on this map for all the affected cases that he and other doctors were seeing. This is not the full map, it's just a part of it. What he found was that all of the cholera cases were radiating out from one area, which was this water pump on Broad Street. This really shows you that, even at that time, just taking a step back and looking at the big picture can really help you to see where the causes of something are. Of course, he contacted the local authorities, and they put a cap on the pump, and they stopped the cholera outbreak
and they stopped people using the pump. In fact, if you go to London today there is a monument to John Snow, who was the physician at the time, and there's a pub across the road from it which is called the John Snow after him. So that's a little anecdote, but to me the story means a lot, because when we're talking about visualization we really want to take a step back and look at the big picture. So networks are useful for visualizing complex data sets. Networks are everywhere now, and this is an example that I quite like. I like music, and these are the connections between artists in the Last.fm music database. Each of the nodes in this network is a different artist, and the different colors represent what type of music they do: for example, red is rock, green is pop and blue is hip hop. I think everyone will be able to see that, just by looking at the way the nodes, or the artists, are clustered together by the connections and by who they're interacting with, rock artists and pop artists, for example, are closer together than rock and hip hop artists. So this type of network visualization can give you a global overview of what might be occurring, and give you insights into how similar different groups of nodes are. So why do we want to use networks in biology?
Well, first of all, analysis of individual data sets does not maximize their value, and often we find that, if we're doing a big analysis, the number of samples is limited. For statistical analyses we often use arbitrary cutoffs, for example p less than or equal to 0.05. This is something which I personally don't really like, because if you're trying to compare and contrast results, then if you change the cutoff you change the comparisons and the results of the contrasts. An alternative, or not really an alternative but something that I think we should do in parallel, is data integration in a network type of analysis. This is a data-driven approach, which means that the network is generated from the data. Statistical cutoffs are not needed prior to integration. A third point is that the data are integrated at the beginning, and the resulting network contains all the information needed for subsequent analysis. But it's an important point that we can always go back to the individual data sets. So I'm not saying that we should only do network analysis. I think that network analysis can give us clues as to what's going on globally, and then we can go back to the individual data sets to find out what's going on in more detail. So if you take a set of random dots and you join them up randomly, you'll get something like what is shown on the left, the random network. This is not the type of thing that you would see in a naturally occurring network, such as a biological network or a social network. What you see in a social network, for example, is a situation more like the right-hand side, where you have certain nodes which are more connected than the others, and this type of network is called a scale-free network. If you make networks from World Wide Web links, internet links, social networks or biological networks, you will get a network like this. There are two important features of scale-free networks. The first is that the nodes tend to group together to form clusters or modules.
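The contrast between joining dots at random and a network in which some nodes end up far more connected than others can be sketched in a few lines of Python. This is only an illustration, not something from the slides: the function names, the network size and the growth rule (each new node prefers to link to already well-connected nodes, one common way of producing scale-free-like networks) are all assumptions made here.

```python
import random
from collections import Counter

def random_graph(n, n_edges, rng):
    """Join n nodes with n_edges distinct edges chosen uniformly at random."""
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n), 2)
        edges.add((min(a, b), max(a, b)))
    return edges

def preferential_attachment_graph(n, m, rng):
    """Grow a graph node by node: each new node links to m existing nodes,
    picked with probability proportional to their current degree."""
    edges = set()
    # Start from a small, fully connected core of m + 1 nodes.
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            edges.add((a, b))
    # Every node appears in this list once per edge end, so a uniform draw
    # from the list is a degree-proportional draw over nodes.
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.add((t, new))
            ends.extend((t, new))
    return edges

def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

rng = random.Random(0)  # fixed seed so the run is repeatable
scale_free = preferential_attachment_graph(1000, 2, rng)
uniform = random_graph(1000, len(scale_free), rng)  # same number of edges
max_sf = max(degrees(scale_free).values())
max_rand = max(degrees(uniform).values())
print("largest degree, preferential attachment:", max_sf)
print("largest degree, random joining:", max_rand)
```

Both graphs have the same number of nodes and edges, so the average connectivity is identical; what differs is how unevenly the edges are spread across the nodes.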
Some network layout algorithms, such as those in Cytoscape, sometimes display these modules or clusters quite nicely, and in biological networks such modules may represent groups of functionally related genes or proteins, for example. A second point about scale-free networks is that a small proportion of nodes are highly connected, and these can be referred to as hub nodes. We think such nodes are more important, because they have more influence on the network and on the network structure. As an example of what a module could be, this is a protein interaction network, where each of the nodes on the graph is a protein and the links between the proteins are physical interactions. You can see that there are particular areas where, in this particular layout, many nodes are clustered together, and this could represent a module. We could infer that a module in this type of network represents a complex of several proteins interacting together. When I think of hub nodes, I quite like the example of airports; this is in North America. Obviously, the airports with the most incoming and outgoing flights are likely to be the most important airports. So I like to think of hub genes or hub proteins as being like hub airports. So now I am coming to the major part of the talk, which is going to be on diabetes and how we're using network-based analysis for a systems biology project. I'm first going to give an introduction to diabetes, and then I'm going to go through two examples of analysis which we've been doing for this particular project. First of all, for the introduction: type 2 diabetes is a global health problem. In 2013, 382 million people had diabetes; a lot of these cases are only estimated, because they are undiagnosed. 5 million deaths were reported from diabetes, and this had a global cost of about $548
billion, so this is a huge global health problem. What happens in diabetes is that the normal process of glucose homeostasis goes wrong. In normal glucose homeostasis, when you eat food it goes through the stomach and the intestine, and hormones such as GLP-1 are secreted from intestinal cells. These act on the pancreas and, together with glucose, cause the beta cells in the pancreas to produce more insulin, which then reduces the blood glucose. Overall, the effect of insulin is that it causes insulin-sensitive tissues such as muscle to take up glucose from the blood. What happens in type 2 diabetes is that the beta cells of the pancreas become damaged by metabolic stress, which means that they cannot produce as much insulin, and this leads to high blood glucose. A second effect is that the tissues which respond to insulin become less responsive, and this combined effect has disastrous consequences, in that it keeps this cycle of chronic high glucose going. Type 1 diabetes, on the other hand, is when the beta cells in the pancreas are destroyed, so there is either less insulin or no insulin at all produced, and this obviously leads to high blood glucose. Type 2 diabetes is the most common form of diabetes; it accounts for 90-95% of cases. Type 1 diabetes is rare and results from an autoimmune reaction that destroys pancreatic beta cells. In both cases, whether it's type 1 or type 2, diabetes can lead to cardiovascular disease, kidney failure, blindness and nerve damage, among other things. So it's a very serious disease, and recently we've seen a large increase in type 2 diabetes in particular, due to people being more overweight and obese, unhealthy diet and physical inactivity. So these are really lifestyle reasons for the increase in diabetes. This is a schematic showing what can happen if we have too much fat. So, on the left-hand side,
if we have too much adipose tissue, or fat, then this will increase fatty acids in the blood. These will then be metabolized in the liver, and the liver will release glucose and triglycerides into the circulation. This has detrimental effects: it puts a major metabolic load on muscle, which is an insulin-sensitive tissue taking up the glucose and the fat from the blood, and it also has a detrimental effect on the pancreatic beta cells, so beta cell damage will occur. This causes a cycle where the tissues that are normally responsive to insulin become resistant, and the pancreatic beta cells become damaged, which reduces insulin secretion as well. We really don't know very much about the mechanisms of the beta cell damage, so this is a focus of many studies now, and the focus of the study that I'm going to talk about. Finally, the current treatments for diabetes are symptomatic. For example, we can give patients who have moderately damaged beta cells increased GLP-1, which will increase the secretion of insulin from the pancreatic beta cells. For more drastic cases of type 2 diabetes, where we have much more advanced chronic diabetes, we have to give insulin, and this will obviously lower the blood glucose level. But we are only treating the disease; we are not curing it at all. So the holy grail for type 2 diabetes is obviously to go from a symptomatic treatment to more of a cure. The ways we can do this are to try to prevent the disease, by lifestyle changes and so on, but also to try to help the beta cells of the pancreas to cope with new metabolic demands, or even to regenerate beta cell function in the pancreas with new medicines. Excuse me. So now I'm going to talk about this project, IMIDIA, which is a public-private partnership for type 2 diabetes research. This project
is Europe-wide. It's made up of a consortium of 14 academic institutions, 8 pharmaceutical companies and 1 biotech, and the budget is 26 million over 5 years. We're now in the fifth year of the project. The focus of the project is on understanding pancreatic beta cell dysfunction in type 2 diabetes, on trying to identify the pathways which underlie this dysfunction, and also on trying to find biomarkers which will enable us to detect, as a proxy, beta cell function in patients who haven't yet become diabetic. IMIDIA generates data from both mouse and human samples. This is just showing the different work packages in the project. We are at the center of this, and we are involved first of all in the data capture, and then in the integration and analysis of the different data coming out. I really just want to focus on the top right-hand corner, where we have in vivo models for diabetes, which are in the mouse, and the bottom left-hand corner, where we have a human repository of pancreatic tissue. So we are mainly dealing with data from these two work packages, and I'm going to go further into the mouse study a little bit later. One of the first things we did in the project was to develop a database and a web portal to enable the different partners from all over Europe to put their data in the database, along with annotations, for example clinical annotations, functional annotations and so on. So we collected a huge amount of data, and I'm going to concentrate on the data that we collected from the mice, which is basically phenotype data during diabetes, some imaging data, and lipidomics and RNA-seq data. Now I need to tell you a little bit more about the mouse project. The idea behind this project is that if you feed a mouse a high-fat diet for three to four weeks it will become diabetic, or in most cases it will become diabetic, but the thing is
that not all mice respond in the same way. If you take different mouse strains, they will respond differently to the high-fat diet. For example, if you take the three mouse strains here and feed them the same high-fat diet, some of them will remain a normal weight but will get diabetes, some of them will gain weight and get diabetes, but others will remain completely normal. So the idea of the project is to ask what the molecular differences are between these different mouse strains which will determine or predict whether these mice become diabetic or not. In the experiment, six mouse strains were fed a high-fat diet or a regular chow diet, and then we studied the evolution of diabetes over time in these six different mouse strains. For each of these mouse strains, different measurements were taken at different time points. We had physiological measurements such as weight, and how tolerant the animals were to glucose: if we give glucose, does it stay high in the bloodstream or does it actually come down? This is a measure of how diabetic the animal is. We had insulin tolerance, which measures how insulin resistant the mouse is: we give insulin to the mice and we see how much the blood glucose actually drops over time. We also have islet morphology data, where sections of pancreas from these mice have been taken, and we have quantification data on the number and size of beta cells, and in fact alpha cells, in the pancreas. Finally, we have RNA-seq and lipidomics data for the pancreas, and also lipidomics data for the plasma. There are a total of 48 experimental conditions, so there were a lot of comparisons we could possibly make. The first thing we wanted to do was to compare what we see physiologically with what we see molecularly. So in this slide I've divided the types of data that we're collecting into those that can give some sort of physiological phenotype
(things like weight, islet morphology, glucose tolerance and insulin tolerance) and those which are more of a molecular phenotype, which of course are the RNA-seq and the lipidomics. As a first step, we wanted to see whether we could in fact see any correlations between the molecular phenotype, globally, and the physiological phenotype. This brings me on to the first analysis example I'm going to give, which is a sample-centric view of this data, using a procedure called network fusion. Network fusion can be used to combine high-throughput data sets. For example, if we have our two data sets, from RNA-seq and lipidomics, with the samples as rows and the measurements as columns, we can transform this data into similarity matrices, comparing samples with samples. These can then be viewed as a network, where each sample is a node and each of the edges between the different nodes is in fact a similarity score. What's quite cool about this method is that, in an iterative process, it learns a new network based on the two networks from the initial data sets, and at the end you get a fused network which combines the original two networks from the two different data sets. This is a nice way to combine molecular data which is completely different, and to try to find global patterns in this data. We performed this analysis for the lipidomics and RNA-seq data from these mice. This slide shows on the left-hand side the similarity matrix for RNA-seq, and you can see there are about six different squares of similarity on this matrix, which actually correspond to the different mouse strains. If we look at the lipidomics data, we get a more complex pattern. But I think the interesting thing is that, if we fuse these two matrices or networks together, we get a fused similarity matrix where you can see two groups coming up, and you can
see this a little bit with the plasma lipidomics, but it's not very clear, whereas when you fuse the networks together you can see it more clearly on the combined network. If we take this fused network and look at it as a network, we can see that there are in fact two sub-clusters. Here I've colored the nodes by which mouse strain they come from, and you can see that there are three strains represented on the left-hand side and three, well four, strains represented on the right-hand side: there are only two of these yellow nodes, which represent a mouse strain from the left-hand side, that are also on the right-hand side. What is interesting is that the mice in the left-hand cluster are actually more susceptible to high-fat-diet-induced diabetes. So this already shows us that, by combining the data together, we can find global correlations to phenotypic information. The take-home message is that combining or aggregating molecular data can show patterns that are not clear when looking at separate data sets. I'm now going to go through a more gene-centric view, which is a more standard way of doing this type of analysis. As I said before, we have 48 different mouse conditions, but what we really want to find are signatures in the data. Here is an example from the lipidomics work. All of the rows are lipids, the different mouse strains are the columns, and yellow represents lipid concentrations which are relatively high compared to the blue squares. What is clear from this diagram is that we can see patterns emerging, where there are groups of lipids which seem to have similar profiles across all the different mice. What we'd really like to do is to extract these clusters. Obviously, as many of you know, a lot of clustering methods exist, and we chose to use one method for this, which
is weighted correlation network analysis, or WGCNA. Here, if we have for example gene expression data, with expression level on the y-axis and samples on the x-axis, we can potentially find several groups of genes in our data set which have a similar profile across the samples, and we need a method that will be able to extract genes with a similar profile. From such a group of genes we can create a network module, which is a subnetwork generated from the correlation data. I'll go into a little bit more detail on that in the next slide, but the important thing here is that within this subnetwork or module there are central genes, which are more connected within this network and may be influential genes. So, in a little bit more detail, weighted correlation network analysis starts by creating a weighted correlation matrix from the original data, be it RNA-seq, lipidomics or something else. The method then converts this matrix to distances, and the distance which is used is called topological overlap, which is a measure of how similar two genes are based on the number of neighbors that they share. It is a continuous measure of the degree of shared neighborhood between the two genes. From this topological overlap distance matrix we can then do hierarchical clustering and identify modules. The important and interesting point about this method is that you can correlate modules to particular clinical or functional traits. This is interesting because sometimes, when we do this type of analysis, we are trying to correlate individual genes to different functional measures, to find the genes which might be influencing a particular measure or functional trait. Here the idea is instead to correlate an ensemble of different genes which have similar profiles to a particular trait. And for those of you interested, the
metric which is used for this correlation is the module eigengene, which is the first principal component of the module. It is used as a vector and compared, by a normal correlation such as Spearman, to a vector of clinical or functional trait values. So we detected modules with WGCNA from both the RNA-seq and the lipidomics analysis, then we correlated these to the physiological traits in the mouse using Spearman's correlation, and we then selected modules for further investigation. To cut a long story short, we identified a gene module which was correlated to both insulin secretion and glucose tolerance. In this heat map you can see the module names along the side and the different traits along the bottom, which are a bit cryptic in this diagram. What I want to highlight here is that there is one module, highlighted in red, which is correlated to both insulin secretion and glucose tolerance, as shown by the fact that the squares in the middle are very blue, which indicates that these traits are highly negatively correlated to this particular module. So we identified this module, and then we looked into it to see what genes were there. This is a representation of the module. I can't show you the gene names, but each node represents a gene, and the size of each node represents how connected, or how hub-like, that particular gene is. The different colors represent the degree of correlation to the particular trait, which is glucose tolerance in this case. Of course, as I mentioned before, we are not only studying mouse in IMIDIA, we are also studying humans. Here I just wanted to show that we have lots of measures, such as lipidomics, gene expression, functional data and now also networks, which we have in common between the mouse and the human data. Of course, in mice we have different mouse strains, and in humans we have tissue banks from people who have either died and
we got some of their pancreatic tissue, or who had an operation. Again, in the mouse we have the high-fat diet model of diabetes, and in humans we have the real thing: we have type 2 diabetic and normal individuals. And of course we have the physiology, which can be measured very easily in the mouse but can't really be measured very well in the human, so there we are relying on clinical information, which is sometimes pretty sparse. But generally, what we want to be able to do is to find the similarities and differences between human and mouse, to try to understand, for example, which mouse models might be good for a particular clinical trial of a particular drug. To do that, we have to understand the mechanisms that underlie the similarities and differences between mouse and human. On the next couple of slides I'm going to show you one thing we tried, which was really quite exploratory, to put the mouse and the human data together. For this we used the idea of hive plots, and this is a hive plot showing combined mouse and human data. In this particular hive plot there are three axes. The vertical axis represents genes which are changing in the human in type 2 diabetes versus normal individuals. The axis pointing down to the left-hand side shows genes in the mouse, the mouse orthologs of these human genes, which are correlated negatively to insulin secretion. On the right-hand axis pointing down we have genes that are positively correlated to in vivo insulin secretion in the mouse. The links between the particular genes are links which we extracted from our WGCNA network, so these are evidence for co-expression. This means that, if there is a link, the genes have a similar expression profile in the overall network. One thing which we can see from this image is that, if we look at all of the mice together, most of the genes which we find regulated in type 2 diabetes
in the human are positively correlated to insulin secretion in the mouse experiment. Now, this may or may not be important or interesting for us, but I just wanted to show you this illustration because what you can do with this type of visualization is compare different networks with each other. In the next slide I show you networks which have been generated from all six mouse strains at all four time points on a high-fat diet. Here only the genes which are significantly differentially regulated on a high-fat diet are shown in the networks. The strains are along the bottom and the time points are along the side. If we look, for example, at the highlighted mouse strain, DBA/2J, on the right-hand side, this is a mouse strain which becomes very diabetic and obese, and we can see that many genes seem to be activated at late time points. If we look at another strain, BALB/cJ, this is another strain which becomes very diabetic, but here the pattern seems very, very different: we see many genes being activated at early time points and fewer at the late time points. And if we compare this to another strain which is actually pretty resistant to diabetes, we can see that there are very few genes which are changing. So this type of information is still trying to give you a global overview. We can't easily go back now and look in detail at these particular hive plots to see exactly which genes are changing, but it gives you an idea of where you should focus your effort for the next type of analysis. The next slide I want to show you is another global visualization of the mouse experimental data. This is where we basically put all of the data together, and we tried to map it out and annotate it to see what types of things were being highlighted in the mouse experiment as changing on a high-fat diet in these different mouse strains, and we got a picture
like this, where the underlying network is colored based on the different types of relationships between the different nodes. I'm not going to go into the details of exactly what's in there, but all of the data which we could put in, we put in, and we annotated the network basically by looking at what types of nodes were in particular areas of the network. When we showed this to the biologists in the project they got pretty excited, because in fact this recapitulates many of the things which we'd expect to be going wrong in the pancreas in diabetes. So this can give you an idea as to which areas are already just popping up from the data, and we can then put our efforts into them. This has been quite a good starting point for us in IMIDIA, to at least discuss with the biologists and try to find areas which would be of interest for them to continue studying. So, in summary: systems biology, the research trinity of biology, technology and computation, is well upon us, so we really need to get used to dealing with increasingly large and complex data sets. I tried to show you that the visualization of complex data is important to identify patterns, and networks can help here: not just the standard ball-and-stick type of representation of a network, but also new network visualizations such as hive plots can be useful. I tried to show you as well that both sample-centric and gene-centric network methods can be useful for finding patterns in complex data. And finally, we have applied such methods to mouse and human data from a large European project on type 2 diabetes. So finally, in the last couple of slides: where is this all leading us?
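As a companion to the gene-centric method summarised above, the topological overlap step can be written out in a few lines. This is a minimal sketch in plain Python, assuming the commonly published form of the weighted topological overlap (shared-neighbor weight plus direct weight, divided by the smaller connectivity plus one minus the direct weight) and a soft-thresholding power of 6; the function names, the power and the toy matrix are illustrative assumptions, not the analysis code used in the project.

```python
def soft_threshold(cor, beta=6):
    """Turn a correlation matrix into a weighted adjacency matrix by
    raising absolute correlations to the power beta (beta=6 is an assumed value)."""
    return [[abs(r) ** beta for r in row] for row in cor]

def topological_overlap(adj):
    """Weighted topological overlap: two genes score highly when they are
    strongly linked to each other and to the same neighbors."""
    n = len(adj)
    # Connectivity of each node: sum of its edge weights, ignoring the diagonal.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Weight of the paths i - u - j through every shared neighbor u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

def tom_distance(adj):
    """Distance used for hierarchical clustering: 1 - topological overlap."""
    return [[1.0 - v for v in row] for row in topological_overlap(adj)]

# Toy adjacency for three genes: genes 0 and 1 are strongly linked,
# gene 2 is only weakly linked to both.
toy_adj = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
toy_tom = topological_overlap(toy_adj)
for row in toy_tom:
    print([round(v, 3) for v in row])
```

In the toy example the strongly linked pair keeps a high overlap, while the weakly linked gene scores low against both, which is the property that lets hierarchical clustering on `tom_distance` pull out modules.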
Well, what we've realized over the last few years is that metabolic diseases seem to be getting closer together. Here I've just put some of the diseases where we have projects in Vital-IT, and these are all linked to what we could call metabolic syndrome. I think this is the type of view which we're going to see more and more in the future, where instead of looking at individual diseases we're really going to be looking at the overlap between different diseases and at what underlies the fact that some diseases are linked together. My last slide is really just a representation of a disease network. This is from a publication from 2008, so quite old now, but it is based on OMIM genes which have been mapped to different diseases, so things like enzymes which are known to be mutated in the same diseases will create a link between those diseases. Then they put on top of this network information on the actual co-occurrence of particular diseases, and they used clinical records for this. What it shows you is this type of clustering, where you see for example diabetes being clustered together with obesity, hypertension, myocardial infarction and so on. So what we might see in the future is that it would be good to start with a network like this and then try to drill down further and see whether we can identify the common mechanisms, and use this as a starting point for finding new drugs which could maybe be developed for several diseases at once and prevent comorbidities from occurring. So I presented my view of some of the data that we've analyzed, and this is really only a very small part of what is being done in the IMIDIA project. I just wanted to make the point that we are a very multidisciplinary team and it's not just me. We have people who are involved in data analysis, project management and algorithm development. We have Roba, who has been essential for all of the web-based development and data
visualization, which I didn't show today but which has been a huge amount of work. We have a lot of people, such as Fred and Leonore, who are working on the statistics and on data and knowledge management. Dima is now working on restructuring all this data into RDF format so we can reuse it later on in other projects. We obviously have the IT infrastructure, so I put Roberto, but it's the whole team behind him who is making sure that we can still work at Vital-IT. And finally, last but not least, the lipid and protein annotations, because in fact when we're generating this data, in order to generate hypotheses to feed back to the biologists, we need annotations, and people like Ann, Alan and Lucila have been very important in trying to bridge the gap between what we're finding from the computational analysis and what we're actually presenting back to the biologists in this project. So I just have a couple of acknowledgement slides, for IMIDIA Work Package 2a and IMIDIA Work Package 2b, which was the human study. So with that, thank you very much.