Welcome, everyone, to this SIB Virtual Computational Biology Seminar. Today we have the pleasure of hosting Daniel Marbach from the SIB Computational Biology Group of the Department of Computational Biology at the University of Lausanne. Daniel got his PhD in Computer Science and Computational Biology in 2009 at EPFL. From 2009 to 2013 he continued his research with a postdoctoral fellowship from the Swiss National Science Foundation in the Manolis Kellis group at the Broad Institute of MIT and Harvard in the US. Since 2013 he is back in Europe and is now a senior researcher at the Department of Computational Biology here in Lausanne.

The overarching aim of his work is to develop novel methodologies and software tools that integrate large-scale molecular, genetic and clinical data to unravel genomic networks, discover disease pathways and biomarkers, and build predictive models of disease processes. In recently published work, Daniel created the most comprehensive resource of cell-type- and tissue-specific gene regulatory circuits to date, and he found that genetic variants associated with complex disorders disrupt pathways in disease-specific tissues. Over the past eight years, Daniel has also led crowdsourced, open-data competitions in systems biology and personalized medicine as part of the DREAM Challenges. Today, Daniel will take us through some of his work and explain the Disease Module Identification DREAM Challenge. So Daniel, thanks again for accepting this invitation, and the floor is yours.

Thanks for the invitation. For those who don't know them, I will briefly introduce the DREAM Challenges; I will go through this first part very quickly. I will also go quickly over our previous work on tissue-specific networks that Diana mentioned in the introduction, as I presented it here in Lausanne recently, but it forms the background and motivation for the challenge that we launched this summer.
I will give a brief introduction to pathway and module analysis, and then present the challenge and its results.

So, to the DREAM Challenges. Reproducibility in biomedical research is an issue: several studies have shown that a lot of published work is not reproducible. Part of the problem is that data is often not shared, and that researchers are forced to evaluate their own methods in their own publications, leading to a situation where basically everybody presents the new best method, what I call the self-fulfilling prophecy. Often in computational biology, high-value data is generated and paired with some method, a high-profile publication results, and that method starts to become widely used, while other very valuable methods in more technical journals never get this visibility, even though they may be better, and are forgotten.

So what if I told you that there is a way to openly share data pre-publication, to objectively and transparently assess methods, and to build collaborative communities in the process? These open community challenges are a way to achieve these goals. This slide shows the basic structure of the DREAM challenges, and of other similar open-data challenges. In DREAM we have used a lot of clinical data in recent years, so you usually have a ground truth; in this example you may have some clinical data from cases and controls. Instead of locking this data up and making a publication, it is crowdsourced: it is made freely available to the community, and anybody interested can apply their methods and submit predictions, which are evaluated rigorously because the participants don't have access to the ground truth. That's the basic structure of these challenges. The DREAM initiative was founded by Gustavo Stolovitzky and Andrea Califano back in 2006, and Gustavo has been the main driver over the years.
In 2013 there was big momentum with Sage Bionetworks joining, so DREAM is now run together with Sage Bionetworks. Stephen Friend was the leader there; he has since joined Apple, but he was the president and founder of Sage Bionetworks. They developed Synapse, the collaborative platform that is now used to run these challenges, and they also brought a lot of great data for them. There are crowdsourced challenges in other fields too, KDD, CASP and so on, so DREAM is by far not the only such initiative, but we are very focused on computational biology and biomedical research.

As for my role, I have been involved in DREAM from the very beginning, first as a participant, and then in conceiving and leading challenges. I really like these challenges for the reasons I mentioned before, but what excites me most is that we can learn something new: by integrating the predictions from the community we can often derive a better, more accurate prediction than any individual member could have made. That's the principle of the wisdom of crowds, which is not a new idea: Francis Galton explored it already in 1907. He was at a fair where there was a competition to guess the weight of an ox, and he found that the median prediction of the participants was very close to the true weight. We apply the same principle in these challenges. This slide shows results from a gene network inference challenge; I will not go into the details because this is published, but it shows the performance of all the individual teams, and the integrated community prediction achieved the best performance. This is a theme we often see in these challenges, hopefully also in this new challenge, and we will definitely explore community predictions.

That was the introduction to DREAM. Now let me briefly present our tissue-specific networks. The motivation was that we now know many genomic loci that are associated with complex diseases, but making use of this data is difficult without understanding the networks and pathways that sense and propagate these perturbations. We need to unravel these molecular networks to really have a chance to develop novel, better-targeted treatments. And of course these networks are tissue-specific, representing different cell types and tissues, while existing pathway databases largely lack tissue-specific information. So in this project we wanted to show that by deriving tissue-specific networks from data, we can identify modules perturbed by genetic variants.

Again, I will just skip through this briefly. We used data from the FANTOM5 consortium; that's CAGE-seq data, which basically gives you activity levels, or expression levels, for enhancer and promoter regions, i.e. regulatory elements. You can get similar maps from epigenomic data, and you could use the same approach with such data. The cool thing is that this FANTOM5 data is available for 400 human cell types and tissues. So we took these enhancer and promoter regions from FANTOM5, linked transcription factors to these regions using regulatory motif analysis (transcription factor binding motifs in these regions), and then linked enhancers to target genes based on their proximity and joint activity in a given tissue. Here you have a link because this enhancer is close to that gene and they are both active; if the gene is not active, there is obviously no link. So we used quite a parsimonious model, a fairly simple approach to construct these networks, as a baseline that could be improved further in the future.

We validated the networks by looking at the enrichment of motifs in these regulatory elements, using ChIP-seq data to validate transcription-factor-to-regulatory-element links, eQTLs to validate links from enhancers to target genes, and RNA-seq data to see whether these networks are also functional, i.e. predictive of gene expression: the regulatory edges are indeed predictive of gene expression in independent RNA-seq data that was not used to construct the networks.

This slide gives an overview of the networks. When we clustered these 400 networks, you see that they basically recapitulate the whole human anatomy and group nicely by tissue type and function; related lineages share regulatory programs, because their networks cluster together. We used this cut-off here to create 32 clusters of networks, and here we zoom into two of these clusters, one grouping lymphocytes and the other myeloid leukocytes; you see that these clusters group related cell types very nicely. We also used the 32 clusters to create 32 high-level networks: for each cluster we created a network by taking the union of its member networks, so we also have one network for lymphocytes, one for myeloid leukocytes, and so on.

We then tested whether disease-associated genes are more densely interconnected in these networks than expected. To do that, I compiled a large collection of GWAS datasets; here we had about 40 GWAS datasets, although these numbers are not up to date anymore. We then used our Pascal tool, which was also published this year. The method was developed by David Lamparter, and together we developed a software tool that lets you easily run this analysis: if you have to compute gene-level p-values from GWAS data, this tool can easily do it. Briefly, you have the SNP p-values, i.e. the genetic variants and their association with a given trait, and you want to summarize this signal across a gene region. The reason this is difficult is that SNPs are not independent; they are correlated due to linkage disequilibrium, and this has to be taken into account in this type of analysis. Once we have gene-level p-values, we can map them onto our networks. This shows one tissue-specific network, and we can do this for each network: map these values and then ask whether disease-associated genes are more densely interconnected than expected. Here, for example, we see a module of genes that is enriched for disease-associated genes. So we developed a pipeline that computes a global enrichment value indicating whether the disease-associated genes are more densely interconnected than expected.

I will just show one example, for schizophrenia. Running this analysis across the 32 high-level networks that I mentioned before, we found the strongest enrichment for the brain-specific networks; they were basically the only ones that passed the significance threshold, which of course makes sense for a psychiatric disorder. But the really surprising finding was that we could zoom in further and test each of the individual networks. Remember that these high-level clusters each contain a bunch of networks; when we tested each individual network for enrichment, we found that even at this very fine-grained resolution, highly disease-relevant tissues come out at the top. For example, we have here three structures that make up the basal ganglia; the basal ganglia modulate motor, cognitive and emotional behavior, show pathological anomalies in patients, and are the primary target of some antipsychotic drugs currently used to treat schizophrenia. This shows that the GWAS variants really perturb tissue-specific regulatory modules, and these modules are very specific to disease-relevant tissues. We made these networks available on a website. To summarize, what is unique about this collection is that it covers so many tissues, and that the networks are very fine-grained.
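The connectivity test just described can be sketched as a simple permutation test: count the edges among the disease-associated genes and compare against random gene sets of the same size. This is only a toy stand-in for the actual pipeline (which works with continuous gene scores and corrects for LD and node degree), but it shows the basic idea.

```python
import random
import networkx as nx

def connectivity_enrichment(graph, disease_genes, n_perm=1000, seed=0):
    """Empirical p-value for whether disease genes are more densely
    interconnected than same-sized random gene sets (toy version;
    no degree matching or LD correction)."""
    rng = random.Random(seed)
    disease = set(disease_genes) & set(graph.nodes)
    observed = graph.subgraph(disease).number_of_edges()
    nodes = list(graph.nodes)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(nodes, len(disease))
        if graph.subgraph(sample).number_of_edges() >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# toy example: a planted dense module inside a sparse random graph
g = nx.gnm_random_graph(200, 300, seed=1)
module = list(range(10))
g.add_edges_from((i, j) for i in module for j in module if i < j)
obs, p = connectivity_enrichment(g, module)
```

Because the planted module is a clique, random samples of the same size essentially never match its edge count, so the empirical p-value is small.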
We really link promoters, enhancers and genes individually, not just transcription factors to target genes as is often done in gene network analysis.

I included this journal cover here; it is actually not our illustration, it belongs to a different paper in the same issue, but it is very fitting and I really like it. Julio Saez-Rodriguez, who will be here next week (you should see his talk), was involved in the paper that made this illustration. It shows Plato's Cave, an allegory in which Plato imagined us as prisoners in a cave, observing the world only through shadows projected on the cave wall. For us computational biologists it is really the same: we are trying to make inferences about the true nature of things from their shadows in our data.

So those were some highlights of this study on tissue-specific networks. I will now focus the rest of the talk on the new work, the Disease Module Identification challenge, and start with a brief introduction.

First, what is a module or pathway? We can loosely define it as a group of functionally related genes or proteins. Most functions that we study in molecular biology involve multiple genes or proteins; some traits are defined by a single gene, but in most cases we deal with multiple genes, and these can be defined as modules or pathways. We often have data and want to know which genes are involved in a process or disease of interest; we can call that pathway analysis, where we focus on identifying groups of genes. And this is not just for human traits: it is a very fundamental problem in molecular biology in general.

One approach, which I call pathway analysis here, is to start with pathways from curated databases and then test these pathways for enrichment in your data. You have your data, for example differentially expressed genes from your study, you take pathway databases like Gene Ontology, and you test these pathways for enrichment with methods such as gene set enrichment analysis. That is of course a useful analysis, but the limitation is that you rely on existing pathways, and we know that pathway databases are incomplete; in particular, as I mentioned before, they often lack tissue- and context-specific information, and they are heavily biased towards well-studied genes and well-studied model systems.

An alternative approach is network analysis, where you build a network from your data and then identify network modules as a de novo way to predict pathways. This is the step that we call module identification. An example of this approach is WGCNA, weighted gene co-expression network analysis; it is not my favorite method, but it is widely used. You can then analyze these modules, for example by bringing in pathway databases at this point, to see whether the modules are enriched for certain functions, and annotate them. The advantage of this approach is that we do not rely on known, fixed pathways: we can discover novel pathways or expand existing ones. Also, network data is increasingly becoming available, which makes this approach ever more relevant; I have shown the networks from the study I presented before, and novel experimental technologies now allow mapping transcription-factor-gene interactions with ChIP-seq, protein interactions with high-throughput assays, and so on. So there is more and more network data, and this type of approach is becoming very relevant.

I now want to focus on the module identification step. We have dealt with network inference in previous challenges, but here I was interested in comparing module identification methods. Module identification, also called community detection in network science, is a very classic problem.
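In its simplest form, the enrichment test behind this kind of pathway analysis is a one-sided hypergeometric test: given N background genes, a pathway with K members, and n hit genes (e.g. differentially expressed), how surprising is an overlap of k? A minimal sketch, with made-up gene identifiers:

```python
from scipy.stats import hypergeom

def pathway_enrichment_pvalue(background, pathway, hits):
    """One-sided hypergeometric p-value for the overlap between a hit list
    (e.g. differentially expressed genes) and a pathway gene set."""
    background, pathway, hits = set(background), set(pathway), set(hits)
    pathway &= background          # restrict both sets to the tested universe
    hits &= background
    k = len(pathway & hits)        # pathway members among the hits
    # P(X >= k) with X ~ Hypergeom(M=len(background), n=len(pathway), N=len(hits))
    return hypergeom.sf(k - 1, len(background), len(pathway), len(hits))

# toy universe of 1000 genes; pathway of 50; 100 hits, 20 of them in the pathway
background = [f"g{i}" for i in range(1000)]
pathway = background[:50]
hits = background[:20] + background[500:580]
p = pathway_enrichment_pvalue(background, pathway, hits)
```

With these numbers the expected overlap is only 5 genes, so an overlap of 20 gives a very small p-value; methods like GSEA refine this basic idea with ranked statistics rather than hard cutoffs.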
It's a huge field, with hundreds of methods from physics, economics, the social sciences and so on, and it is very relevant here. As an example, this is the network of Game of Thrones characters, and the modules, shown in color, basically correspond to the noble houses of the different characters. You can of course do the same with Twitter networks and so on, so it is a very relevant problem, but here we were specifically interested in applications to biological data. And the problem is performance: what the best methods are for identifying biologically relevant modules is really not well understood. So this would obviously make a great DREAM challenge, I thought, and for several years I mulled it over, because the key problem is of course how to assess the modules: there is really no gold-standard experiment with which you could experimentally test or validate a module.

This is a difficult question, and previous studies have mainly used artificial benchmark networks. You can use stochastic block models, where an edge-probability matrix generates the network: the probability of an edge between nodes in the same module, like these here, is higher than between nodes in different modules. You can generate benchmark networks like this with different structures, for instance with different community sizes, to try to make them more realistic, but in the end biological networks will never really follow a stochastic block model.

Another approach uses metadata, labels on the nodes of the network in addition to the edges, to validate the communities. A classic example is the karate club: each node was a member of the club, and the links are social interactions. When you run community detection you get these two communities, and they actually correspond to two factions that emerged when there was a disagreement in the club and it split up; the community detection methods recover this split. But usually we don't know the true community structure, so you use other metadata: in the Game of Thrones network you could use the noble houses of the characters; in genomic networks we usually use GO annotations or pathway annotations. If a module is enriched for a known pathway or Gene Ontology category, you can say it is validated or supported, that there is some evidence it might be a functional module. Of course this has the limitations I mentioned before: pathway databases are incomplete, and so on.

Our idea was to use GWAS data, which is a novel approach for validating module identification methods, and it has very nice properties. A GWAS provides a genome-wide, unbiased functional annotation of your genes, if you want: it is a genome-wide experiment that does not focus only on well-studied genes or proteins, so for every gene region you really get a score, its association with the trait of the GWAS, and it is available for all genes. More and more GWAS data is also becoming available; there are now hundreds of traits, and we collected 256 GWAS datasets. This covers a lot of pathways, or functional units, many of which may not have been annotated previously, for instance because they act in brain tissues and are not often studied.

So the idea is: we have our modules, and we test them for enrichment in the GWAS data. Here you see a module that is enriched because it contains many trait-associated genes. This can be done with the Pascal tool: as I mentioned, you can use Pascal to compute the gene p-values, i.e. to summarize the SNP association scores at the level of genes, but you can also use it to summarize them at the level of gene sets or pathways. As input you have the gene set, the GWAS, and a 1000 Genomes reference panel for the LD structure, which is needed to compute these scores correctly. Pascal computes a gene score for every gene; neighboring genes can be treated as a single unit, merged into a meta-gene, which is important because they are not independent due to linkage disequilibrium. Many tools ignore this, and then you get inflation of the p-values. So you compute the gene scores, or meta-gene scores, and then a module enrichment p-value: a score at the level of the module that shows the association of that module with the trait.

Sarvenaz did the biggest part of this work, including most of the results I will show now, and she also did the initial exploratory analysis to see whether this actually works. One of the first things she explored: she took pathways from known databases, tested them for enrichment, and counted how many significant pathways you get for each GWAS trait. That is shown here in blue, and you see that for some GWAS traits you get lots of relevant pathways from the pathway databases; these labels are probably too small to read, but they are all immune-related traits, so for those the pathway databases seem to have good coverage. Then she did the same analysis, but with gene sets derived from networks: she applied some standard module identification methods to a network collection that I will show you in the following slides, tested those network-derived modules for enrichment, and again checked which traits give significant hits. You see that for many traits these network modules yield significant hits where the pathway databases gave none, and especially many psychiatric disorders are among them.
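The module-scoring step can be illustrated by combining gene-level p-values within a module. The sketch below uses Fisher's method, which assumes independent genes; Pascal's actual statistic is different and explicitly corrects for linkage disequilibrium via the reference panel, so treat this only as the basic idea, with hypothetical gene names.

```python
import math
from scipy.stats import chi2

def fisher_module_score(gene_pvalues, module):
    """Toy module enrichment: combine gene-level GWAS p-values within a
    module using Fisher's method. (Pascal uses LD-corrected gene scores
    and a different combination statistic; this shows only the concept.)"""
    pvals = [gene_pvalues[g] for g in module if g in gene_pvalues]
    stat = -2.0 * sum(math.log(p) for p in pvals)   # Fisher's combined statistic
    return chi2.sf(stat, 2 * len(pvals))            # chi-square with 2k dof

# hypothetical gene scores: module m1 carries association signal, m2 does not
gene_pvalues = {"a": 1e-4, "b": 3e-3, "c": 0.02, "x": 0.4, "y": 0.7, "z": 0.9}
p_m1 = fisher_module_score(gene_pvalues, ["a", "b", "c"])
p_m2 = fisher_module_score(gene_pvalues, ["x", "y", "z"])
```

The module with consistently small gene p-values gets a small combined p-value, while the unassociated module does not.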
Traits like schizophrenia have zero hits in these pathway databases. This was very encouraging, and we thought we have to go for this challenge, because it has huge potential to discover novel pathways, especially if we crowdsource it: this was just off-the-shelf methods, a quick-and-dirty analysis, but now we wanted to comprehensively get the best pathways from the community.

So now I will describe this module identification challenge and its results. It just closed two weeks ago, so this is very fresh; we announced the best performers only about two weeks ago. I will briefly describe the challenge setup and then the first results and the best-performing methods.

The network collection: we have six networks in this challenge. Two are protein interaction networks, including the STRING network, so we have a network of Swiss quality in there, right? For the next edition we should also get one from here. Then there is a signaling network provided by Julio, who, as I mentioned, will be here, a co-expression network, a cancer co-dependency network, and a homology-based network. The co-expression and cancer co-dependency networks are correlation networks; the others have a confidence score on the edges based on the evidence supporting those edges, so they are more like curated databases. All the networks are weighted, and they are undirected except for the signaling network, which is directed. They are also quite diverse, as you can see: they vary in size and in structural properties, which are not shown here, but they have different degree distributions and so on. I think this is nice for a benchmark, because we want a diverse collection of networks. They are all unpublished, or custom versions; for example, from STRING we removed all the literature-mined interactions, so the networks are not publicly available in the form in which they were included in the challenge.

This allowed us to anonymize the networks: we removed the gene labels and just numbered the nodes, and the participants got the networks only in this anonymized form. So they really could not use any additional data; they had to derive modules based only on the network topology.

We decided to have two sub-challenges. The first is classic module identification: you take each network individually and identify modules, groups of genes, in it. In sub-challenge two, the six networks were aligned: the same gene was assigned the same anonymized label in every network, so participants could use the networks together to derive potentially more accurate modules; you will see how that worked out. Those are the two sub-challenges. The modules had to be non-overlapping, and the number of modules could vary: teams could decide how many modules to submit. The size of the modules was not fixed either, except that we set a limit of 100 genes, because modules of several hundred genes are just not biologically very useful anymore; even 100 genes is pretty big. Also, not all genes had to be included: we thought that if a gene doesn't fit in any module, we don't want to force participants to assign it to one. These are two interesting points, because they are not typical for classical module identification methods, so off-the-shelf methods had to be modified a bit to accommodate them.

The score was simply the number of modules in a team's submission that showed enrichment in at least one GWAS dataset. Of course we corrected for multiple testing based on the number of modules in the submission: if you submit few modules you have fewer chances to get a significant hit, but if you submit more modules there is a higher multiple-testing burden, so there is a trade-off, one that anybody applying these methods in practice would also face. Note that if multiple GWAS showed enrichment in the same module, it still only counted once; that is because these GWAS are often related, for instance the same module might typically show enrichment in several of the immune-related GWAS, so we counted it only once.

The structure of the challenge: first there is a training phase, which we call the leaderboard phase, and then a final submission. During the leaderboard phase the teams could make submissions and see how well they did; they were limited to 20 submissions because the scoring is computationally intensive, as I will show. For the leaderboard phase we used 76 GWAS datasets. After that, the teams had to make a final submission, which we scored on 104 GWAS datasets that are of course different from the leaderboard ones. I also did an analysis to check that they are not correlated with the leaderboard GWAS datasets, i.e. that related traits are not split across the two sets, so that they are really independent sets and we have a good held-out dataset.

Logistically, this was done as follows. As I mentioned, we have the Synapse platform from Sage Bionetworks, which is now used to run these challenges: teams can register and access the data, there is a wiki and a discussion forum, and they can make submissions and see their score on a near-real-time leaderboard. The scoring was done here on the Vital-IT cluster, and Robin was very helpful in setting this up. We built an automatic system (Sarvenaz really did the bulk of this work) that grabbed the submissions from the Synapse platform, scored them on Vital-IT, and updated the scores on the leaderboard.
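The scoring rule just described can be sketched in a few lines: count the modules that are significant in at least one GWAS, with a correction that scales with the number of submitted modules, and count each module only once even if several GWAS hit it. Here the correction is simplified to Bonferroni; the challenge also examined FDR-based cut-offs.

```python
def challenge_score(module_pvalues, alpha=0.05):
    """Number of modules significant in at least one GWAS, corrected for
    the number of submitted modules (a module enriched in several GWAS
    still counts only once).

    module_pvalues: one dict per module, mapping GWAS name -> p-value.
    """
    threshold = alpha / len(module_pvalues)  # burden grows with submission size
    return sum(1 for gwas_pvals in module_pvalues
               if min(gwas_pvals.values()) < threshold)

# hypothetical submission with three modules scored against two GWAS
submission = [
    {"gwas_height": 1e-6, "gwas_cad": 0.2},   # significant: counts once
    {"gwas_height": 0.04, "gwas_cad": 0.03},  # not significant after correction
    {"gwas_height": 1e-3, "gwas_cad": 1e-5},  # significant in both: still counts once
]
score = challenge_score(submission)
```

With three modules the corrected threshold is 0.05/3, so the first and third modules count and the score is 2; submitting more modules raises the score ceiling but tightens the threshold.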
leader board and as I said it's really intensive so we had one submission means that GWAS pathway analysis on six networks times 76 GWAS datasets so we had 400 plus jobs just for one submission and during the leader board phase we actually ran over 400,000 jobs each job was actually quite short only like five minutes so it was feasible but still very challenging and just a brief note because people sometimes ask why would people actually participate in these challenges so an important part of these challenges are the incentives and we did a survey what's the most important incentive and the most important are publication opportunities as you might have guessed and the things that the participants actually get to be consortium co-authors I mean here we said that they have to outperform some off-the-shelf baseline methods to basically at least ensure that you don't conscious submit the random prediction on the paper but basically all the participants that do a serious effort they get to be consortium co-authors and the best performers may be featured more prominently on the author list depending on how important their method becomes in the paper and we have a partner journal always for this which just for this challenge is sell of course that doesn't mean that the paper has to undergo standard peer review process but at least they show interest and this of course helps to drive participation also they can write companion papers and submit I mean they can submit anywhere if they want to but we have a partnership with F1000 Research which is a great new model for publication so access to unpublished data is another important motivation and of course also they can be invited then to give a talk and have a travel grant the conference is very soon in Phoenix so if you want to visit the Grand Canyon you can still register so the participation we have 400 registered participants and then usually some drop out or don't get to make submissions in the end we have 42 teams with 
final submissions that means 42 teams that also submitted detailed methods descriptions I called and that's all already most already available pre-publication especially the best performers they are public the teams can choose if they want to make it public already or not at this point a sub-chance to 32 teams and the forum very active participation and this is really great that you get a very active discussion it's also a bit exhausting sometimes but it's really valuable because we get a lot of valuable input really early on in this project and some very smart people who participate in these challenges and start to see potential issues with the scoring method and so on indeed we did discover a bug like after the first week and had to change something with the background settings for enrichment computation and so on so it's kind of pressure to have basically over 100 people working with your your data and your scripts and potentially finding bugs but it's really valuable input that we got so these are the results for sub-challenge one you see that we have two teams that tie with a score of 60 so that's overall six networks and we did kind of a bootstrap analysis to find, to see what's the if these differences are significant and how often basically a team if you kind of vary the GWAS data set a bit if that ranking changes so we sub-sampled 76 GWAS data sets out of the 104 that's the number of GWAS data sets that was in the leaderboard set so that's where this number and did that a thousand times and we did this base factor which basically just boils down to saying how many times a given team was better or equal than the best performer and that this base factor of three which we defined as a tie if it's smaller than three that means that a team outperforms the best performer one out of four times if you do the sub-sampling and you see here in purple it has actually when we looked at other FDR cut-offs so this shows the results at 5% we actually found here it looks 
quite close but if you kind of look at different cut-offs and also on the leaderboard she was set we see actually clear best performer was this team task which had the most robust performance basically the ranking of the other teams was not very stable but they were just first always and also on the leaderboard she was set and also in sub-chance to their good performance then we had a bunch of teams that also did well and quite competitively I would say with this team this team Aleph and six others depending on the cut-off so we said this tied for a second place now if you look across the networks it doesn't look so pretty anymore so here we have the overall score and then this is the scores on each of the networks and basically in purple are the top performers on each network and you see that there's a lot of variability if you look a bit closer actually the top performer did pretty well on the protein interaction network it's best top here and top here so signaling and protein interaction also here it's actually ranked at the third rank so it seems the top performer is actually doing quite well on the protein networks but other methods actually do better on the core expression network so as on the networks we see that the string actually led to the most disease modules followed by the core expression network so this looks very promising for the community predictions which are a little more robust so I don't yet have results for community predictions because it's quite tricky to do that so for the community prediction for these modular identification methods I do have a slide on that in the end as our performance also varies across the leaderboard so the training set and the whole outside except the best performer again was quite robust but other teams vary quite a bit suggesting that they may be overfitted on the training data so just very briefly the best performer so it's this Jake, Lenore and colleagues and their underlying idea of the method is that basically 
Their underlying idea is that genes connected by paths through low-degree genes have high similarity. Protein networks are small-world networks, which means the path between any two nodes is typically short, because there are hubs that connect many nodes. Of course, being connected through a high-degree node, a hub, doesn't mean much: if Diana and I both follow Justin Bieber on Twitter, that doesn't say much, but if we both follow the same little-known person, that says much more about our social network. It's similar here; this node is the Justin Bieber of this network.

So the metric they use is the diffusion state distance, based on a random walk: a spectral metric based on the expected number of times a random walk starting from one gene visits each of the other genes. It's a metric that this team actually developed themselves. That's not the case for all teams; often teams used published, off-the-shelf methods. This is a team that is very active in the field: they used their own method, improved it, and developed another method to compute this diffusion state distance more efficiently. They then clustered this similarity matrix with a standard clustering algorithm.

They also used another interesting strategy, and we don't yet know its contribution: they looked for dense bipartite graphs, that is, two sets of nodes with many links between the two sets, found a few modules this way, and merged them with the rest. Most of their modules come from the first approach, but they added some dense bipartite graphs. We don't yet know how important that was for their overall performance, but it's an interesting idea, because you may need different methods to capture different types of modules; again, this is why I think community predictions will be interesting.

OK, sub-challenge 2, more quickly. This shows the results for the teams in sub-challenge 2.
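To make the idea concrete, here is a toy sketch of a diffusion-state-distance-style metric: finite-length random walks, with the L1 distance between expected-visit profiles as the similarity. This is my own simplified reconstruction from the description above, not the team's actual implementation.

```python
import numpy as np

# Toy undirected network: node 0 is the hub (the "Justin Bieber" node),
# while nodes 1 and 2 are also directly linked to each other.
A = np.array([
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
], dtype=float)

P = A / A.sum(axis=1, keepdims=True)  # random-walk transition matrix

def visit_profiles(P, k):
    """H[i, j] = expected number of visits to j in a k-step walk from i."""
    n = len(P)
    H = np.eye(n)        # step 0: the walker sits on its start node
    Pi = np.eye(n)
    for _ in range(k):
        Pi = Pi @ P
        H += Pi
    return H

H = visit_profiles(P, k=20)
# DSD-style metric: L1 distance between the visit profiles of two nodes.
dsd = np.abs(H[:, None, :] - H[None, :, :]).sum(axis=2)

# Nodes 1 and 2, linked through a low-degree path, come out closer to each
# other than node 1 is to node 3, which it reaches only via the hub.
```

The hub contributes almost equally to every profile, so, as in the social-network analogy, shared hub neighbours barely separate nodes, while shared low-degree neighbours do.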
Sub-challenge 2 asked for multi-network predictions, and as a baseline we used single-network predictions, which led to some discussion on the forum, because that is a tough baseline to beat. Specifically, we took as the baseline the single-network predictions of team Task, the team that performed best on the leaderboard set; we didn't look at the final data for this choice, but it was also the best on the final data. So basically we chose the best method on the leaderboard and took its single-network predictions, and you see that some of the multi-network predictions merely tie with this baseline. We already realized during the leaderboard phase that it is extremely hard for multi-network predictions to outperform single-network predictions under this metric. Initially, of course, we would have thought that adding all this information from different networks would give much higher scores, but it seems very difficult to actually make use of these complementary networks in a multi-network prediction. So we declared that there was no winner in this sub-challenge.

The honorable mention for the highest score goes to a team from the RIKEN center in Japan. They merged the two protein-interaction networks and didn't use any of the other networks, and it is telling that the method that did best used only two of the six networks, the two protein-interaction networks, which are quite similar. They merged these networks and then used a standard, state-of-the-art clustering algorithm, the Louvain algorithm, which I will not explain here. They got a good score this way, but again, they didn't substantially outperform predictions from single networks.

Let me finish with the outlook. As I said, these are really just the results from the last week.
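The merging step itself is simple; here is a sketch of a plain edge-union merge of two weighted networks, with made-up gene names and weights. The community detection (e.g. Louvain) would then run on the combined graph.

```python
from collections import defaultdict

# Two hypothetical weighted protein-interaction networks as edge dicts.
ppi_a = {("G1", "G2"): 0.9, ("G2", "G3"): 0.4, ("G3", "G4"): 0.7}
ppi_b = {("G2", "G1"): 0.6, ("G2", "G4"): 0.8}

def merge_networks(*nets):
    """Union of edges; weights of edges shared between networks are summed.

    Node pairs are canonicalised so ("G1", "G2") and ("G2", "G1")
    refer to the same undirected edge.
    """
    merged = defaultdict(float)
    for net in nets:
        for (u, v), w in net.items():
            merged[tuple(sorted((u, v)))] += w
    return dict(merged)

combined = merge_networks(ppi_a, ppi_b)
# A community-detection method such as Louvain would now be run on 'combined'.
```

Summing weights on shared edges is just one plausible choice; how shared and network-specific edges are reweighted is exactly where such merging schemes differ.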
We obviously didn't have time yet to do everything we want to do. We want to give a better overview of all the different approaches that were used and how well they did, and generate community predictions. We are currently thinking about how to do that, but the basic approach is to build a gene-gene similarity matrix, where an entry says how many times a pair of genes was put in the same module by the teams; you can weight teams in different ways, or do it unweighted, and then run a clustering algorithm on this similarity matrix. That's the basic approach we plan to take. And then, of course, we have to look at these modules and gain some biological insight, and hopefully follow up on some of these predicted disease pathways, maybe even with experimental collaborators.

So, I've given you a brief introduction to what these DREAM challenges are, shown you some highlights of our tissue-specific networks, and presented this new disease module identification challenge. It's really a new type of DREAM challenge, I would say: there were no module identification challenges before, and it's also new in the sense of the scoring. The scoring in these challenges is usually based on a holdout data set, which is quite easy, because you can just compute a correlation or an area under the curve to see how good the predictions were, by comparing them directly to the holdout data. Here, the evaluation was instead based on this GWAS pathway analysis, which is a very intricate, computationally intensive analysis. That brought many challenges, but I hope it will lead to interesting results as we look at the modules that were submitted by the teams.

I would like to finish by thanking, first of all, Sarvenaz, who really did the majority of this work, with the scoring and the exploratory analysis to even see if this challenge was feasible.
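Stepping back to the outlook for a moment, the unweighted version of that gene-gene consensus matrix can be sketched as follows; the three module predictions below are invented, and each maps genes to team-specific module labels.

```python
import numpy as np

genes = ["g0", "g1", "g2", "g3", "g4"]

# Hypothetical module assignments from three teams (labels are arbitrary
# per team; only "same label within one team" carries information).
team_modules = [
    {"g0": 0, "g1": 0, "g2": 1, "g3": 1, "g4": 1},
    {"g0": 0, "g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g0": 1, "g1": 1, "g2": 2, "g3": 2, "g4": 0},
]

# consensus[i, j] counts how many teams put genes i and j in the same module.
n = len(genes)
consensus = np.zeros((n, n), dtype=int)
for modules in team_modules:
    for i in range(n):
        for j in range(n):
            if modules[genes[i]] == modules[genes[j]]:
                consensus[i, j] += 1

# Any standard clustering algorithm can now be run on this similarity matrix;
# a weighted variant would add a per-team weight instead of 1.
```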
Of course also Sven, our advisor, who was very active in helping to define the scoring metrics, and Zoltán, who was very helpful with GWAS-related questions. David developed the Pascal tool; without Pascal we could not have done this, because other tools for GWAS pathway analysis would not have been fast enough to run at this scale; it would have been completely impossible. Thanks also to Gustavo and Stephen from the DREAM challenges, and to Julio, who is the challenge director and is now involved in the analysis of the results. And of course thanks to Vital-IT and Sage Bionetworks for the computing infrastructure. Thanks!