So, my name is Aya Khalil. I'm a recovering physicist who decided to go into this field around the time the human genome had just been completed. What motivated me was the work that I and my fellow physicists were doing, which applied this framework: if we can go out and actually observe what's happening in the world, quantitate it, and then apply a computational or mathematical model to it, we can make some very accurate predictions about our universe and its origins. I'd posit today that, given that framework, physicists may know more about the origins of the universe than we know about the black box I call my human body. So that's why we started all of this, and the premise here is that big data in healthcare has arrived. We can actually go in and measure what's happening in the system on a granular micro level (your genomics, gene expression, proteomics) but also what's happening on the phenotypic and physiological level. And it's not so much that the data has arrived; what has arrived is our ability to actually collect this data, from the micro to the macro. What we need to do, then, is not only put effort into collecting this data in the right way, in the right patient cohorts and populations, but also put effort into interpreting it. Obviously, there's what I'd call old-school biology: take our pile of data, and let me pick my favorite gene, my favorite corner of the data, as my way of interpreting what the data and the hypothesis are. But what we want to do here is take this to the next level. The first thing we're going to do is take advantage of soaring compute power, as brought on by Moore's law, and the second is advances in analytics. What we want to do with our analytics is really go beyond the paradigm of just looking at correlations.
If I see A go up and B go down, does that mean that A is related to B in some way? You may go in and do a study and find that fit owners are associated with fit pets; does that mean I will lose weight if I make my dog exercise? We'd all get a dog tomorrow if that were true. All right. So, in the language of mathematics, when I say that A and B are associated with each other, I can do an analysis to compute how likely this model is relative to that model, but if all we're doing is correlations, there's no way for me to discern which model is more likely. Does A drive B, or does B drive A? I can't do that. Okay. But now we can actually apply the mathematics of causation, probabilistic causation. Say I have a third variable in the system, a third intervention: that could be a drug, or it could be your genotype, which might be driving variations in your expression and your phenotype. I create these two models and I ask: given that intervention, does C drive A, which then leads to changes in B, or is it the other way around? And the nice thing is that these two probabilities are actually very different, so if I had a data set now, a nice, rich genetic, transcriptomic, and phenotypic data set, I could actually discern which model is more likely, start to sort out what I call probabilistic cause-and-effect relationships, and then make predictions as to what the right intervention might be. Okay.
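To make that concrete, here is a minimal, purely hypothetical linear-Gaussian sketch (not the actual GNS machinery; the variables and effect sizes are made up): data are generated with C driving A and A driving B, and the fitted log-likelihood of each causal ordering's factorization is compared.

```python
import numpy as np

def gaussian_ll(y, X=None):
    """Log-likelihood of y under a fitted linear-Gaussian model y ~ X (or a constant mean)."""
    n = len(y)
    if X is None:
        resid = y - y.mean()
    else:
        X1 = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
    sigma2 = resid.var()
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(0)
n = 5000
C = rng.normal(size=n)               # the third variable, e.g. a genotype
A = 2.0 * C + rng.normal(size=n)     # C drives A
B = 1.5 * A + rng.normal(size=n)     # A drives B

# Model 1: C -> A -> B, i.e. P(C) * P(A|C) * P(B|A)
ll_model1 = gaussian_ll(C) + gaussian_ll(A, C) + gaussian_ll(B, A)
# Model 2: C -> B -> A, i.e. P(C) * P(B|C) * P(A|B)
ll_model2 = gaussian_ll(C) + gaussian_ll(B, C) + gaussian_ll(A, B)

print(ll_model1 - ll_model2)  # positive: the true ordering fits better
```

With enough samples the factorization matching the true ordering scores higher, because the reversed model implicitly assumes an independence (A independent of C given B) that the data violate; without the third variable C, the two orderings of A and B alone would be indistinguishable.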
So a lot of this has been encapsulated in mathematics developed by folks like Judea Pearl. As Magali said, the math is well established, so we're not reinventing the math; we're finding ways to apply it, so that we can start to build models that determine, for a patient population where I may know what the variations in the system are (those variations can be genetic, they can be environmental), what then drives changes on the molecular level. These can be different molecular layers: your RNA, your proteins, your metabolites. Those, in turn, can drive changes in your clinical, phenotypic measures; some of these molecular changes might be causal, probabilistically causal, for those clinical phenotypes, and some might be reactive. What we want to do with this modeling is sort out what we call the causes of the disease versus the symptoms of the disease. The symptoms may be a good measure for diagnosing patients, but when we get to probabilistic causality we can start to predict interventions, and which intervention is right for that patient. So a lot of the work I do and have done at GNS has been about taking the analytics and the computational power and putting them into a computational platform that can automatically learn these models directly from data. And it's not just about learning models that encompass any one of these data layers, but a collection of these data layers, measuring what's happening from the micro all the way to the macro, and being able to harness the power of Bayesian probabilistic math along with cloud computing to build models that can predict outcomes for patients. Those outcomes can include clinical outcomes, but also things like economic outcomes. So there are three components to the platform that are important.
The first thing is the ability, which we've built in, to enumerate and score Bayesian interaction forms at scale. These Bayesian interaction forms deal with different kinds of data (discrete data, continuous data, log data) and can be computed very quickly, so we can go in and compute trillions and trillions of interactions for a given data set. The next thing is to take these building blocks of the probabilistic causal model we're trying to build and aggregate them into networks. This requires a global optimization algorithm that can search through the vast space of network topologies that could best describe your data. But at the end of this we're not going to come up with a single most-likely-fit network, because that would be overfitting, and this is a really important component of learning these models: we're going to learn ensembles of models. Each model in the ensemble represents a likely fit to the data, but no single one of them is the most likely fit. So we're going to learn that ensemble, and then we're going to use the collective ensemble of models to run simulations.
Now we ask what-if questions: what happens if I set a node in the model, and how does something downstream change? What we get are simulations: distributions of answers, probabilities of how likely that intervention is to affect something else downstream. And the goal isn't just to build these models and have a really smart mathematician look at them and ponder how good or useful they are. We actually want to turn these into computer models that the world can use, whether you're a bioinformatician wanting to exploit these models for deeper statistical analysis, or a clinician or researcher wanting to use them to come up with hypotheses that you can then test in the lab; the point is to get them into a form that is useful. Just as a quick example: in work that we did with Biogen early on, we took 77 RA patients. It's a small data set, and this work was done a while back, but we were able to measure their genetics, expression, and disease severity scores. We ran that data through the platform and asked: how do the various phenotypes impact transcripts, and how do the transcripts impact each other? We scored all these models locally, aggregated them into ensembles of networks, and came up with a distribution of networks, populations of networks and models, that underlie this data. We then took those network models and ran simulations against them: we set the genotypes and the transcripts to reflect a specific patient, ran through all possible interventions, all possible ways to upregulate or downregulate these nodes, and asked how each impacts the endpoints, the swollen-joint and tender-joint counts of the patients. And lo and behold, one of the top targets, or top interventions, that came up for patients who weren't responding to anti-TNF was CD86. CD86 actually happens to be one of the pathways that's targeted
by the drug Orencia for patients who don't respond to anti-TNF. So it was a really nice proof of concept, showing that in a completely unbiased way we can take in this data set, learn these models, and actually make predictions about interventions that have already been clinically validated. So can we take this approach now and use it to predict interventions where we don't know what the right interventions are? That's what we're doing in MS right now, through the work with Orion and the Orion consortia. We decided, for phase one of this project, to start with data that's already there, so we went to the Brigham and Women's folks and asked: give us all the data that you can, gene expression, imaging, et cetera, from your CLIMB study. We're taking that data and running it through the computational platform to learn these models. The outcome will be models that we aim to distribute to the community, to biologists and clinicians, so they can go in and ask their own what-if questions and set the nodes in the models themselves. They may have a hypothesis about a patient in that cohort who has severe disease, and we can go in, set those interventions, and ask what the right pathway or the right marker to target in those patients would be. The point is to make these models accessible to the community. A little bit about the CLIMB study that we're working with: as Magali said, we started out with over a thousand patients who had all had their GWAS done, and then we asked how many in that cohort had their gene expression done; that was roughly 363. Then we looked at the overlap between patients who had both SNPs and gene expression changes, which narrowed the data set to a little over 200. We also had clinical measures on these patients, things like their EDSS, which we converted into a rate, and functionality in their left hand and right hand, which we also converted into
rates, narrowing it down to 169. Then we asked what the overlap with MRIs is, to get what I call a coherent data set that actually includes your genetics and molecular changes, because what we want to build are causal models that sort out cause and effect: how the molecular changes impact clinical variables and MRI variables. With the MRIs included, we ended up with a data set of about 108 patients. So that's our starting point for the phase one models. We applied a fairly standard pipeline and QC analysis on the SNPs, which I can go into in detail after the talk, as well as on the gene expression, to get it normalized and reduced down to probes that are actually varying in this data set. The other thing we had to do, and what we're hoping to improve with our phase two study design, was deal with the fact that when the study was first done, patients would come in and get their MRIs measured, and then at some later point they would get their gene expression measured. So we had all these MRI variables that weren't necessarily coincident with the time point at which the gene expression was measured. What we did for this data set was perform some interpolation over the eight MRI measures to get a slope and an intercept that were then used in the model. Okay, so in building these models, what we're going to do is build networks that look at relationships: how the SNPs can drive any one of the clinical, MRI, or transcript changes, and how the transcripts can regulate themselves. All of these components can also feed into a hazard function model that predicts time to events, time to relapse events. As I said earlier, the first thing we do with this kind of data is enumerate the different ways, locally, that these things can interact with each other, score those via a Bayesian framework, and then combine them into global networks, where the structure of the global network tries to assess whether things like the drivers
in the system, whether it's your genotype or what treatment you're on, can then lead to changes in the networks regulating the transcriptional changes that in turn impact the clinical endpoints or phenotypes, in this case things like the rate of change of EDSS or imaging measures. And we're not going to learn just one network model; we're going to infer, and we do infer, populations of networks. At the end you end up with an ensemble of models, where each network in the ensemble is a directed acyclic graph, and we save all of the parameterizations underlying these models so that we can run simulations: go in and set these nodes to reflect any specific condition, whether that condition is a specific patient with a specific genotype or treatment, or a subpopulation of patients. Then we propagate those simulations and run interventions. We can upregulate or downregulate any one of these nodes and ask how those interventions impact the endpoints, and as I said earlier, the answers always come out as distributions, so we get a prediction but also an uncertainty around that prediction, how likely it is. Other things we can do with the model include going in and inferring the regulatory interactions between the genes themselves, so that we can pull out things like transcriptional modules, feedback loops, and directed hubs; this is all done, once the models are built, through the in silico simulations. This is an example output from another project that we did in Huntington's disease, where we built these models from transcriptional data and variation in CAG repeat length, and then ran simulations to ask how that transcriptional regulatory network changes between people who have low CAG repeat length, which means their chance of getting Huntington's disease in their lifetime is zero, versus people, in this case patients, who have high CAG repeat length, who have a high
probability of getting Huntington's disease in their lifetime. So this is a snapshot of what we inferred after we built the ensemble of models and ran the simulations of the regulatory networks underlying patients with low CAG repeat length. As we vary the CAG repeat lengths from low to medium to high, we start to pick up things like hub nodes that weren't there before: new interactions, new genes that are taking on new functionality in the system start to emerge. Hopefully, in the case of our Huntington's work, this will lead to some hypotheses about what to target in these patients as we go from low to high. So that's some of what we're aiming to do with the models here. We're at the point now where we've got the data and were able to get it down to a data frame that, while the patient numbers aren't high, is at least complete in terms of our ability to connect the genetics to the molecular, clinical, and imaging variables. We've just built these models, so they're hot off the presses; we haven't had a lot of time to analyze them and see what's in them, but I'll show a preview of some of the results. I went in and asked the team, without doing the simulations yet (remember, we have ensembles of models, populations of directed acyclic graphs), to take them, do some subnetwork analysis, and show me what's connecting to what. Do we see any transcripts, for example, linked to any one of the endpoints in the model, things like lesion volume or BPF? What we find is that there are a lot of SNP drivers, at least at some frequency above five percent, that link to these things. We haven't yet seen any transcripts on the path from the SNP to the endpoint; that could be a result of what I talked about earlier, which is that this data wasn't collected with the concept of building these models and getting a coherent data set. But we are finding some interesting biology around other
nodes. This is a node that actually showed up in a GWAS paper, and we're seeing some interesting things that that node regulates, regulation acting on the SNP level, and the things we're finding it regulates actually seem to make sense. Okay, we've also taken these models, gone through and run comprehensive simulations, and asked, for every node in the model: when I run an intervention and upregulate one node, does it actually lead to a change in another, and vice versa, and what are the most significant interactions? We pull up a number of nodes that, when you perturb them, seem to regulate and change a lot of things downstream, and conversely, a lot of nodes that are regulated by many other nodes when you perturb what's upstream of them. This is a map of the simulation results, and we're seeing a lot of interesting hub nodes emerge that I want us to go back and analyze in some detail. Now that I have this regulatory network of the system, how does it look across the different patients I've modeled, in patients with more severe versions of the disease versus less severe versions? When I look at those different networks, do I start to see nodes and hubs emerge that differ between those two subpopulations, that we can examine in more detail from a biological perspective? And then finally, I mentioned that we also have within the model drivers of time to relapse. Again, some interesting SNPs are emerging as important drivers, and we're going to go back to the regulatory model and ask what those SNPs regulate within that environment. So that's some of the work we're doing now to wrap up the phase one modeling and pull out some interesting biological results, and some learnings as well about experimental design and what we need to do to get more robust models that actually sort out causality, so we can start to predict
interventions for these patients with MS. So that brings us to phase two and what we're doing there. I think Magali talked a little bit about this, but our goal now is to do a multi-year study that not only does a complete job of measuring the micro, the genetics and the molecular, but also the phenotypic measurements, phenotypic measurements to match, including some of the work that Jamie and other folks are doing. Our goal there is to build a longitudinal model of the system: not only look at the underlying networks that connect up the genetics, the molecular, and the phenotypic at a single time point, say baseline in this cohort, but over time, so that we can start to see how the disease progresses and how these networks evolve, and hopefully get at interventions that will help patients who right now might not be helped by the standard treatments that are out there. One of the folks who's part of the consortium is Curtis Shreiner. He's an Olympian, a father, and an MS patient, and one of our goals is to do better for him. Standard of care might do a reasonable job of managing his current disease state, but over time he is going to progress, and there isn't going to be anything out there for him. So hopefully, with the richer, deeper, longitudinal phenotypic models that we're building, we can start to come up with hypotheses that will actually impact him in real time. And with that, I'm going to stop. I know I talked really fast because I wanted Stephen to have some time to talk, so thank you.