Okay, good morning. Thanks for showing up early. We are a technology laboratory and we work with proteins; we try to measure proteins. This is an interesting question right now, and I'm actually pleased to be here, because it also forced me to think about some things we normally don't think about: how do genomes and cell fate relate to each other, and what role do proteins play in this equation? I think they have a large role to play. I'll first offer some general considerations and then point to a few issues where we think proteins can contribute.

I think we live in a very interesting phase. This was a very interesting meeting for me, seeing some of these very intricate, extremely complicated mechanisms, like trafficking, that have been studied for decades. And then there is the other world, where people generate a lot of data. I think we need to somehow bring these two worlds together; that is the topic of my talk.

So, we are a technical laboratory, but I don't want to talk about techniques. I just want to give a status report to calibrate expectations against capabilities. In the field of protein measurement we have focused for a long time on exhaustively measuring as many proteins in a sample as possible, and I can say this has now been achieved: we can measure virtually any protein in a sample. There are a few glitches here and there, but it works well. If we want to relate genotype and phenotype, however, it is not sufficient to measure one proteome. We need to work with cohorts, because only then can we start to see how the system reacts to specific changes, for instance genomic changes. So work has been ongoing to get to the stage where genomics has been for a long time: a large number of samples, conceivably hundreds, many analytes, preferably all of them, in this case proteins, measured reliably and quantitatively, generating a data matrix with no or very few missing values.

This has been a huge challenge. I don't want to dwell on why, but in the last few years we and other groups have developed massively parallel mass spectrometric techniques that are in a sense equivalent to next-generation sequencing. They allow many peptides to be sequenced at a time and a subset of the proteome to be covered reproducibly (we cannot cover the whole proteome at this time) across many samples: several thousand proteins, with very high consistency, and quite fast. This is the experimental basis I will take off from. Each sample, say a cell extract, a tissue extract, or a biopsy, is converted into a single digital file by a mass spectrometric technique which I am not going to discuss further. This is fast: in the clinic, for instance, we could take a biopsy in the morning and have the results in the evening. We can run about 20 such samples a day on one machine, which is quite fast even in view of genomics, but we cannot do the whole proteome; we can do about 5,000 proteins per sample. But with very good CVs and very few missing values across a cohort, the measurements are quite precise, so we can think of it conceptually as about 50,000 western blots per sample. This is also applicable to modifications and protein interactions.
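As a rough illustration of the kind of cohort data matrix described above, here is a minimal sketch; all dimensions and values are simulated placeholders, not real quantifications. It shows the two quality metrics the talk emphasizes: per-protein coefficient of variation (CV) and the fraction of missing values.

```python
import numpy as np

# Simulated cohort matrix: rows = samples, columns = proteins,
# values = quantified intensities; NaN marks a missing measurement.
rng = np.random.default_rng(0)
matrix = rng.lognormal(mean=10.0, sigma=1.0, size=(200, 5000))
matrix[rng.random(matrix.shape) < 0.01] = np.nan  # ~1% missing values

# Per-protein coefficient of variation (CV) across the cohort,
# ignoring missing values.
cv = np.nanstd(matrix, axis=0) / np.nanmean(matrix, axis=0)

# Fraction of missing values per protein: the completeness metric
# emphasized above (a matrix with no or very few missing values).
missing = np.isnan(matrix).mean(axis=0)
print(f"median CV: {np.median(cv):.2f}, mean missing: {missing.mean():.3f}")
```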
So this is where I would like to start; it is just the technical base, without explaining how it works, and I'm happy to do that if someone is interested. Now, if you read the literature on the large-scale genomic efforts, we read that thousands of genotypes can be measured in a cohort, sometimes by international consortia; cancer versus control is one of the most dominant areas where this is applied. And of course we also have a lot of quantitative phenotypes, from imaging, from the clinic, and from diagnostic tests. So this is the big question, and I think it is one of the questions this session is meant to address: how do we make the link, how do we make projections from the genotypic variants that exist in a population or cohort to a phenotypic space? In the clinical sense this is of course healthy versus diseased, but it is a general problem; it could be any phenotype. We would like to predict phenotypes from their molecular origin, which mostly means from genomic data.

Making predictions is of course one of the hallmarks of science, and many fields, engineering, physics, have developed very high-level skills in making predictions based on models. I use this clock as an example of a system that is precisely understood and where we can make fairly precise predictions. If we know the state of the system, this clock, at a particular time, and we know some of the parameters, then we can say for virtually any time in the future where the clock will be, because that is basically the purpose of a clock. We can also predict quite precisely how the system reacts if we change something in it. For instance, if we make the pendulum a bit longer, we can predict what effect this has on the readout up here. This works well for systems with a moderate number of parts and a basic model of how they interact. It is a rather straightforward system: there are equations from first principles, we can plug in data, play with parameters in a computer, and get a fairly precise outcome, assuming the system operates under idealized conditions, for instance disregarding air friction and so on.
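For the pendulum example, the first-principles prediction can be written down directly. This is the standard small-angle result under the idealized conditions just mentioned (no friction or air resistance):

```latex
T = 2\pi\sqrt{\frac{L}{g}}
\qquad\Rightarrow\qquad
\frac{\Delta T}{T} \approx \frac{1}{2}\,\frac{\Delta L}{L}
```

So lengthening the pendulum by, say, 1% slows the clock by about 0.5%; this is exactly the kind of parameter-to-readout prediction that biology, as argued below, mostly cannot yet make.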
In biological systems we also try, of course, to get to predictions, to generate predictive models. But we have seen over the last few days, and this was discussed quite extensively on Tuesday evening, that this is very difficult to achieve. Statements were made that this endeavor, usually equated with systems biology, has been disappointing, and I would agree with that. There are some success stories where quite predictive models have been established in biology: oscillators, the bacterial motor, which is one of the classic and really well-studied problems, cell cycle regulators. But I would also agree that this has been disappointing from the point of view of generality. Good predictability has been achieved in these relatively limited cases, but they are relatively simple, and the models do not generalize: they would have a hard time explaining a perturbation somewhere else in the cell. How the chemotactic motor reacts if the cell receives two independent signals, for instance, is hard to predict from these models, and they are presently not scalable to really complex systems like trafficking or the other situations discussed at this conference. So while there have been successes, scaling this highly mechanistic approach, based on understanding the wiring and the components of a system, to larger systems is a huge challenge.

Here I just want to make the point of how big the challenge is if we want to go from something very confined, like a bacterial motor, to making statements about a whole cell. A while back we worked together with the group of Jürg Bähler to do a molecular inventory of a cell, an S. pombe cell, which has also featured quite prominently in this symposium. We used what were at the time the best available methods to precisely quantify, basically count, the RNA molecules, the transcripts, and the proteins in pombe cells grown under different conditions, a starved condition and an exponentially growing condition. These were, certainly to me, astounding numbers. We could see that a gene produces between about 30 and a million protein copies per cell, an enormous dynamic range, and of course the question is how this dynamic range is maintained and regulated. The median protein copy number is about 4,000, not quite 4,000, and the median mRNA copy number per cell was about two and a half. These were to me extremely surprising numbers, because they basically mean that transcripts operate entirely in a stochastic domain: in a population, some cells must have none, some maybe three, some five. If we then make predictions from transcript measurements, if we put weight on an increase from two to four transcripts, we have to ask what that means, because we are operating in an essentially stochastic domain. At the protein level we are not: there will not be cells with zero copies of a protein while others have 10,000. There will always be some variation around the mean of 4,000, but very few cells will have none. So this already indicates that, conceptually, we operate in quite a different domain when we work with transcripts than when we work with proteins.
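To see why a median of two and a half mRNA copies puts transcripts in a stochastic domain while 4,000 protein copies does not, assume for illustration simple Poisson statistics for copy numbers (an assumption for this back-of-envelope argument, not a measurement from the study):

```latex
P(k = 0 \mid \lambda = 2.5) = e^{-2.5} \approx 0.08,
\qquad
\left.\frac{\sigma}{\mu}\right|_{\lambda = 4000} = \frac{\sqrt{4000}}{4000} \approx 1.6\%
```

Under this toy model, roughly 8% of cells would carry zero copies of a median transcript at any moment, whereas a median protein would fluctuate only a few percent around its mean.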
The total number of mRNA molecules in these cells was about 40,000, which is basically not many, and means that from an energetic point of view it is very cheap to make these transcripts, whereas there are almost 100 million protein molecules in each of these cells, which means that maintaining and controlling them is very expensive for the cell. Another issue which is often not really considered is that the protein concentration in these cells is more than 300 milligrams per milliliter. That is about a third of what crystallographers achieve when they squeeze proteins into a crystal for X-ray diffraction. It is an enormously high concentration; it is actually a miracle, or at least astounding, that the proteins do not crash out, because any biochemist knows that if you extract these proteins from the cell and do in vitro experiments, beyond, say, 10 milligrams per milliliter the proteins tend to largely precipitate. How the cell maintains this concentration and still carries out its functions is an astounding feat which is not often considered. Most in vitro biochemistry, reconstitution experiments and so on, is done at concentrations about two orders of magnitude below what actually occurs in the cell. All of this just indicates that when we talk about cells, we talk about a very complex system, and the classical mechanistic models have a very hard time reaching any kind of comprehensive prediction here.

Given this situation, and coming back to the question posed, or as I understood it to be posed, for this morning: how are we doing at predicting phenotypes from genetic variation, which can now be measured very precisely? The answer is that we are not doing well at all, and this is not just our group; in general we have great difficulty making this link. There are a few simple questions which we should be able to answer but cannot, and if anyone here would venture to say they have a general answer to them, that would be very good to hear. We cannot predict accurately what the effect of any inherited or somatic mutation is on the phenotype: take a particular background genome, introduce a mutation anywhere, say in a coding sequence, and we should be able to predict, from a model, what effect it has; this has not really been achieved. We do not know how two or more mutations combine: do they cancel each other out, do they synergize, are they neutral? We do not know how the same mutation affects different individuals, which is a huge issue in medicine, particularly in the emerging field of personalized medicine. And we do not know how copy number variations in an individual are processed. These are seemingly simple questions which we should be able to answer, and I believe that until we can answer any of them, or at least have some kind of path to answering them, it is presumptuous and maybe too early to go into the clinic and make statements about genotypic variability and its clinical outcome, with the exception, of course, of some Mendelian traits where the molecular basis and its translation into phenotype are very clearly understood. Most clinical phenotypes are not so simple.

To summarize this part: I think we are operating in an interesting time in the life sciences, and I try to summarize it in this graph.
The x-axis indicates the data that is available to a field, and the y-axis the amount of first principles or theory available. Certain fields, like engineering and health technology, the biotech or biomedtech people who make a device, for instance to monitor heart rate or blood pressure, are in a very comfortable position, because it is essentially engineering: there is a lot of theory, a lot of first principles, thermodynamics, electrodynamics and so on, which are used very widely and work extremely well. For them it is relatively straightforward to get to a predictive model with a limited amount of data. In biology we do not have this luxury. We have very few first principles, I come to this in a second, but we increasingly have data, and of course we also heard about completely different types of data that are generated by cell biologists, from imaging, where you can label and follow a specific molecule, exactly where it goes, how its amplitude varies and so on, which is also enormously dense data. So we now operate in the life sciences and medicine in a space where a lot of data is available, but how these data relate to each other in this genotype-to-phenotype space is to me a big question, and I'll come back to it. I think correlation, simply correlating data, is not going to work; we have to find a way to translate these data into predictive models, and this is the topic I will address in the rest of my talk.

Incidentally, we have whole classes of scientists, or rather people with highly influential and important roles in society, doctors, lawyers, CEOs, politicians, who have neither a theory or first principles for how their system actually works, nor data in the sense that they can run tests and do experiments. A doctor cannot clone the patient and apply a treatment to some clones and not the others. They basically have to accumulate an empirical base, which they then apply. At least in the life sciences we are now in a domain where we can use empirically acquired, strategically positioned data sets that help us make, hopefully accurate, predictions.

So what are the first principles we use in biology? We do not have models like the physicists do, where you vary parameters, simulate, and get an accurate prediction. But we have some principles we can apply. We have Mendel's laws of inheritance. We have Avery's demonstration of DNA as the transforming principle, which is now taught to every undergraduate. We have the one gene, one protein, one function notion of Beadle and Tatum. We have the central dogma. We have Linus Pauling's idea of a molecular disease, which I think is an extremely important insight: that a particular mutation in a particular gene leads to a change in a protein, a change in the structure of that protein, and that this manifests itself as a complex phenotype, sickle cell anemia. We have the notion that proteins only function if they have a three-dimensional structure. And we have the most recent principle, which I think is fundamental for this genotype-phenotype relationship: the notion of a modular biology, that molecules do not act by themselves but in modules or complexes, however one wants to call them. So we tried to come up with a concept that integrates many of these principles and is experimentally addressable. We call this the proteotype model.
This is the notion we are pursuing experimentally: if we were able to define and measure this term, this entity, the proteotype, we would have an extremely informative entity that is fundamental to the translation of genotypic variability into phenotype. How do we define the proteotype? We define it as the composition of the proteins in a cell, basically the inventory, together with the way they are organized in modules. This addresses many of the principles that are monuments of life science research: especially, it addresses the notion of Lee Hartwell and colleagues of a modular biology, and it addresses the issue that genetic variation somewhere affects the structure and the function of these modules, which is Linus Pauling's principle. We would postulate that if we can measure this proteotype, we can make a useful link between genotypic variation and phenotype.

Specifically, and I will expand on some of these points, we would predict that the proteotype, the composition as well as the organization of the proteins, is the result of complex processes on multiple, poorly understood layers. We know there are transcriptional models, models that predict how RNA interference or microRNAs affect gene expression, models of translational control, and of course protein kinases that affect protein phosphorylation. For each of these levels there is information, but everyone would be hard pressed to integrate all of it into a comprehensive predictive system in a computer. We think that the cell actually knows how to integrate the control events at each of these levels and generates one entity, the proteotype, which is the result of control at all of them. We would assume that the proteotype indicates the response of a cell to external or genetic perturbation, I'll come back to this: the cell knows how to react, and if we can measure the reaction, we learn some biology. We would further assume, this is basically the principle of Beadle and Tatum and also of Linus Pauling, that the proteotype determines the biochemical state and therefore comes very close to defining the phenotype.

Further, we do not ask what a gene or protein does; that is largely known. A kinase phosphorylates certain residues, a ubiquitin ligase ubiquitinates certain residues, a protease digests certain proteins; we know that, we can measure it in vitro. The question we try to address is how the proteotype, the system, responds to alterations. It is not just that we want to say a certain element has a certain function; we would like to see how the system reacts if this function is changed. And we would represent this proteotype at different levels of resolution. Eventually there will be high-resolution representation all the way down to crystal structures, the atomic level, but at the moment we have to accept that certain areas of biology are known in great detail and can be represented in very dense data, and others are not. The whole discussion about trafficking is one field where an enormous amount of prior information has been accumulated, and we would like to integrate such data into a larger representation at the level of the proteins.
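To make the definition concrete, here is a minimal sketch of how a proteotype could be represented in code: the protein inventory plus the module organization. All protein and module names are invented for illustration, and the limiting-subunit rule is a deliberate caricature of the modular-biology idea, not the lab's actual model.

```python
from dataclasses import dataclass

@dataclass
class Proteotype:
    abundances: dict[str, float]   # protein -> copies per cell
    modules: dict[str, set[str]]   # module name -> member proteins

    def module_abundance(self, module: str) -> float:
        """Functional abundance of a module, limited by its scarcest
        subunit: a crude stand-in for the idea that complexes, not
        single proteins, are the functional unit."""
        return min(self.abundances.get(p, 0.0) for p in self.modules[module])

# Hypothetical three-subunit core complex:
cell = Proteotype(
    abundances={"KIN1": 5000.0, "LIG1": 3000.0, "ADP1": 800.0},
    modules={"kinase_core": {"KIN1", "LIG1", "ADP1"}},
)
print(cell.module_abundance("kinase_core"))  # 800.0 -- ADP1 is limiting
```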
So this is what we try to achieve, and these are the considerations behind our belief that it will be very hard to make inferences or predictions about how genomic variability, or for that matter environmental insults, affect the cell if we simply make genomic measurements. In the following I would like to expand on a few of these principles with actual data.

The first question I would like to address is how a simple genomic perturbation affects the proteotype. We go to an experimental design where we introduce, on an otherwise invariant genotype, specific mutations into a specific protein. These mutations are derived from medicine: they have been associated with specific disease phenotypes, particularly cancer. We ask what effect these mutations have if they are in the same gene but of different types: how do they affect the modularity, the composition of the respective protein module, and what effect do the changes in this module have on the cell's protein landscape? The experiment is to express mutated forms of a protein and determine the effects on interactions and functions. We use a protein kinase, DYRK2, because it is easy to measure the response to these mutations: the function of this enzyme is to phosphorylate proteins, so we can simply measure whether the mutated forms phosphorylate different proteins, fewer, or additional ones, and we measure the effects of the mutations on the protein itself.

How did we end up with DYRK2? We have a computational postdoc in the group who developed a system we call DominoEffect. She combines the genomic mutations from cancer genomics data, a massive amount of data, more than 10,000 complete genomes from cancer tissue and normal adjacent tissue, and distills them down to mutations which very likely have an effect on protein function. She does that by statistical arguments, looking at the likelihood that the molecule is mutated at a particular site, and then uses prediction tools: does this mutation likely change the conformation of the protein, or an interaction of the protein? She came up with what she calls hotspot mutations, in 156 genes, for which there is a fairly high likelihood that the mutation, if introduced into an otherwise invariant background, would induce a change in either protein folding or protein interaction.

What information is used to distinguish these; what was specific about that; was there some specific study?

The mutations have all been found in genomic data: each has to be mutated in patients that have a certain type of cancer. Many of these mutations are not known to be significant; there are tens of thousands of them, and they are categorized into those which likely have an effect on the basis of, first, frequency of occurrence, and second, that they occur at residues in the folded protein which indicate either that the fold might be disturbed or that an interaction of the protein with something else might be changed.
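The recurrence part of this filtering can be sketched with a simple null model. This is not the published DominoEffect procedure, just an illustration of the statistical argument; the counts and the threshold below are invented.

```python
from scipy.stats import binomtest

# Test whether a residue collects more mutations than expected if the
# gene's mutations were spread uniformly over the protein's length.
def is_hotspot(mutations_at_residue: int, mutations_in_gene: int,
               protein_length: int, alpha: float = 1e-3) -> bool:
    # Null model: each observed mutation hits any residue with equal
    # probability 1/length; ask for enrichment at this residue.
    result = binomtest(mutations_at_residue, mutations_in_gene,
                       p=1.0 / protein_length, alternative="greater")
    return result.pvalue < alpha

# e.g. 12 of a gene's 60 observed mutations on one residue of a
# 500-residue protein is far more than the uniform model expects:
print(is_hotspot(12, 60, 500))  # True
```

In the real pipeline this recurrence filter would be combined with the structure-based predictions described next (does the residue sit in a fold- or interaction-critical region).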
How do you get this information?

I'll give some observations on this protein when I come to the experimental data. This was simply based on structure predictions: you predict the structure of the protein and paint the mutation onto it, and if it lies in a region predicted to be interactive, it is assumed to have an effect. These predictions are of course not very precise, and I'll come to experimental data.

Do you distinguish variants, polymorphisms, and causal mutations?

By the frequency.

So now, this is one of those roughly 160 proteins that came out, the one we followed up: a protein kinase called DYRK2. It is an interesting enzyme. It acts as a module: this is the kinase itself, and one of the proteins it binds is a ubiquitin ligase, so it seems to sit at an intersection of protein phosphorylation and ubiquitination; and there are two other proteins here. The core is a tetrameric complex, and that is what we want to study. It has a number of disease-associated mutations, and some of the subunits have been crystallized as individual proteins, so their structures are known. In this genomic data set there are 81 mutations that map to this protein. Some have no disease association, some do, and Marija filtered this down to a small number of modifications that we then tested. There is a truncation where the C-terminal tail is cut off, an event that happens relatively frequently in cancer; two point mutations at the site where the truncation happens; a mutation in the activation loop of the kinase, therefore affecting activity; a mutation in the catalytic site, thought to render it inactive; and a mutation in a region where we do not know what it is doing. All of these occur frequently in patients with certain diseases.

A postdoc generated cell lines expressing the respective mutated forms of this protein, and then we measured the interactions around the core. We see that each mutation, even though it may just change a particular residue somewhere in the protein whose function we do not really know, has a different effect on the module. The way we read this graph: this is the mutation that renders the kinase inactive, and these colors are the three interactors of the tetrameric core module. If we inactivate the kinase by mutation, there is a substantial falloff, a reduction, in the binding of some of the subunits. Some mutations have very little effect, but each one does have an effect on the interactions, so they all perturb the module in some way, some more, some less. As expected, the biggest change occurs with the truncation mutant, where the C-terminal part is missing, which is fairly plausible. So what we know so far: the DYRK2 mutants derived from clinical information show significant but varying impact on the assembly of this kinase core complex.

How do you measure interactions?
We use two methods. One is affinity purification, where we tag the proteins, pull them out, and determine interactions; the other is called BioID, where we express a modified protein that chemically labels the surrounding proteins. The results of the two are not quite identical, but they largely converge.

And is this in the context of the wild-type proteins? Is this expression on top of the wild type, or a knock-in, or...?

This is expression on top of the wild-type protein; I'll come to a knock-in in a second. The message so far would be this: often we would say a mutation affects a gene, the gene is eliminated or somehow modified, and we would then like to make an inference from this mutated gene to phenotype. What I have tried to show here is that each mutation, even at this level of the organization of the protein within its core module, has different effects.

Does it act as a dominant negative on the endogenous one?

No, it is just expressed on top; it does not affect the endogenous activity.

Because it will take the substrate?

No, it doesn't, because the endogenous protein is basically transparent to these techniques. It could be that there are titration effects, that expressing a tagged protein which mops up interactors changes the equilibrium, yes.

What this is doing is allowing nature to do the mutagenesis, and then a phenotype came out. It would have been equivalent if I had started from scratch with my own mutagenesis, my protein, and my readouts; it just allowed evolution to do the experiment.

These are filtered mutations. We do not say that these mutations cause cancer, but they are statistically associated with certain tumors, and what I am trying to say is that these mutations, which arose through evolution or selection in the cancer, even though they affect the same gene, have intricate and idiosyncratic effects on the module. And if you believe Linus Pauling's theory of a molecular disease, that the mutation affects the structure and the function of a molecule, then it means that each one of these has an idiosyncratic effect.

It seems to me that is the same experiment I would have done in the laboratory. Let's say I'm doing a screen of mutants: I could just drive mutagenesis and select for it.

Yes, of course, but we try to find ways to use the genomic variability that is associated with disease, like cancer, and to help make the link from these mutations to the phenotype. Of course you could take any gene, mutagenize it, and see what happens, but that is not quite the question we ask here, because there would be no phenotype in your experiment.

Why? That's not true. What will you make?
Let's suppose I'm working on clathrin. I have a readout, which is going to be a certain phenotype, not a disease, but a failure to internalize. I accelerate the process because I'm doing it in the lab; I get the mapping through the protein, and then I cluster the properties.

Yes, of course, this has been done. But there is a big difference: in the lab you work in an isogenic background, whereas in nature...

It depends, and here that's the first thing you do.

And second, you can only start to work on your mutation once you have isolated the phenotype, which means you don't let the system evolve further. Generally in cancer you don't have one mutation, you have many: the first is perhaps the one that caused the cancer, the second the one that allowed survival of the first, so you have a very long evolution of a complex pattern.

I understood that, but conceptually I still don't see the difference. To me the difference is that in one case millions of years did it, whereas in the other I just accelerated it. Maybe because I accelerate, I have less time for the broader spread, or maybe I am less subtle.

I agree: if you work with yeast or flies, this has been done a lot, even with mice; there have been huge consortia doing mutagenesis and then seeing what phenotypes arise. The point I am trying to make is that saying "the gene was mutated" is not sufficient granularity to eventually make a mechanistic link: mutations, even in the same gene, in a particular genetic background, have very different effects on the modularity and, as I will show now, also on the function.

But I think there is a much more fundamental issue: when you do it in the lab, you are looking for a strong phenotype. I have done experiments, for example, where I look at a particular pathway, say a perturbation in a gene, and we actually see compensations in the system. Sometimes you don't even see a readout on the phenotype, the compensation is such that there is no readout, but since I know the module, as you are pointing out, I look, say, at the expression level of some other proteins, and then you see there was compensation and no phenotypic readout. But this is generally a rare case; in general, when you do experiments in the lab, you go for the strong phenotype. And what is remarkable is that when you look in nature, you never see the same mutations.

Okay, we will continue later. Sorry. That's okay. No, I mean, there is the obvious difference that you cannot do the lab experiment in people, but I think the principles we are trying to elaborate here also apply to mutations generated by random mutagenesis.

But the point is it's not unexpected, because if you are looking at a very specific function... I mean, this mutagenesis can be done in the lab.

Yes. So, I was just trying to say that these various mutations, which have been selected and associated with a phenotype through statistical arguments, affect the protein differently at the level of its organization. Now we ask what they do at the level of its function, which is to phosphorylate other proteins. We can find a number of phosphorylation sites on this protein; we measure and quantify them, and we see that, again, for each mutation these few measurable phosphorylation sites on the protein show a different pattern.
So not only do the selected and expressed mutations affect the wiring, basically the modularity; they also affect the phosphorylation state of the protein itself. We then also carried out a knock-in experiment, a study of how these mutations, with their presumably perturbed modules, affect the overall function, basically the phosphorylation landscape of the cell. The experiment: take cells, knock out the intrinsic kinase with CRISPR-Cas, knock in the mutant forms of the kinase so that they are expressed, isolate proteins, purify phosphopeptides, and analyze these phosphopeptides in a mass spectrometer. We quantify about 1,200 phosphorylation sites across all these mutants. By simple clustering, the phosphorylation patterns look quite similar overall, which means these subtle mutations in this one protein do not radically change the overall protein phosphorylation landscape; they affect it, but in the same cell hundreds of other kinases are active at the same time. But when we look more closely at which phosphorylation sites are affected, focusing on this panel over here, we see the various mutants again. There is a set of about 30 to 40 phosphoproteins and phosphorylation sites which change in response to the various mutations, and these phosphorylation patterns change, again, idiosyncratically, depending on the type of mutation. The complete knockout, with no more wild-type kinase, has the strongest footprint, here, second to last. The deletion mutant, with the C-terminus deleted, is similar but less strong than the knockout. The kinase-dead mutant, the third one from here, is again similar to the knockout but not identical. And then we have the other mutations, affecting different residues, which have a detectable footprint on the phosphoproteome, affecting specific proteins, but not as strongly as the absence of the kinase.

We can of course then ask what these proteins do, and this provides a link between the activity of this protein, or its modified forms, the effect of a mutation, and specific phosphoproteins. If we assume that phosphorylation is responsible for modulating activity, we can say that specific mutations in this single protein, DYRK2, mutations which arose basically through evolution, affect various areas of cellular physiology: some map to methyltransferases in an epigenetic complex; we have this protein here, a scaffold protein associated with GTPases, and there are probably people here who know a lot about it; nuclear pore proteins, which we also heard a lot about yesterday; and cell cycle regulating proteins.

So what we conclude is this: if we take a number of mutations that have been selected for their association with disease and introduce them into this protein in a cell, in an otherwise isogenic background, they affect both the organization of this module and its function, and they do so in a highly modulated way; and the function of this protein complex, which is a kinase, touches different parts of the cell's physiology. This points to a lot of complexity in how these mutations mechanistically affect physiological processes.
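The clustering step in the knock-in experiment can be sketched as follows: a mutants-by-phosphosites matrix of log2 ratios versus wild type, clustered by profile similarity. The numbers here are random placeholders, not the measured 1,200 sites.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Rows = cell lines (knockout, knock-in mutants), columns = quantified
# phosphosites (log2 ratios versus wild type); simulated values.
rng = np.random.default_rng(1)
log2_ratios = rng.normal(0.0, 0.3, size=(8, 1200))
log2_ratios[0] += 1.0   # pretend the knockout shifts a global pattern

# Correlation distance between mutant phospho-profiles, then
# average-linkage hierarchical clustering, as in a standard heatmap.
dist = pdist(log2_ratios, metric="correlation")
tree = linkage(dist, method="average")
print(fcluster(tree, t=2, criterion="maxclust"))  # two-group cut
```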
This is basically our overall conclusion: the complexity of the cellular response to a simple genomic perturbation, one mutation, is beyond the reach of mechanistic models, because we have no good way to predict a priori which parts will be touched when, for instance, a kinase, a ubiquitin ligase, or a protease is mutated.

Okay, I see I'm of course running very late, so I won't get through everything; I won't get through the third part. Now I would like to extend this. So far we have had one background and one mutation and asked what happens. Now I would like to go to a more natural situation, where we have a number of genetic variants in a population, and ask to what extent we can use this natural variation to make linkages, eventually mechanistic linkages, for predicting a phenotype. This is usually discussed controversially, because there are a lot of people, like the author of this by now very famous article, who espouse the idea that we no longer need hypotheses, that we do not really need to understand mechanisms, that correlation is enough. The idea is that if you pile up enough data, measure enough genomes in genome-wide association studies with large enough cohorts, we will not need to understand the underlying mechanisms; we can simply make correlations and make statements. This view is widely held, also in clinical circles. I would like to show that this is almost certainly an underestimation of the problem, and that correlation will not be enough.

How do we show that?

It's used as markers or biomarkers; it's not meant to be a mechanistic model. If you have 100,000 people with this, this, this, and this, and they have this disease, at this stage of the disease, you can say there are good chances for someone else with the same profile. It's risk; it's not even markers, it's risks.

And I think we know from the now very large GWAS, where hundreds of thousands of individuals have been genotyped, that the larger the cohort, the more genes show a small signal in the Manhattan plots. So additional genes or mutations become associated with a complex disease, but with very, very small contributions, and how these contributions can be used, even clinically, to make a risk assessment is actually very difficult. This is a philosophical point, but I think that if one eventually wants to make a risk assessment or a treatment decision, one needs to know something about the mechanism.

So we would like to explore how we can use systematically collected data sets in populations, together with mechanistic insights and prior information, to learn something about the system. This is the outline, and this is the system we use. We use this system because we think that before we can make any headway, we need controlled, known genomic variability; we would have a very hard time going straight to an outbred human population. This is an interesting collection of mouse strains, generated by an international consortium; we have certainly not contributed to it, we are beneficiaries of it. Two mice, a C57BL/6 mouse and a DBA/2 mouse, were crossed; an F1 and then an F2 generation were generated, and out of that, strains were inbred. Each of these strains is genetically identical within itself; they are inbred.
Their genomes have the property that alleles from either the one parent or the other have been distributed across the strains, and there are about 180 strains in this genetic resource. Because we know the genetic variability, it is limited to the alleles that are present, but the distribution of the alleles differs from strain to strain. We have used these strains, and in this experiment, an early phase of what is now a larger data set which I do not want to discuss, we selected 192 proteins relevant for metabolism and quantified them across the cohort, covering some metabolic pathways. We took 40 of these strains, raised either on normal food or on food that makes them fat. So we have an external perturbation: a genetic axis, the genotype, which is known, and an environmental or diet axis. For 40 strains, measured in duplicate under the two conditions, high fat or low fat, we quantified close to 200 metabolic proteins, and then we want to see how this data set can be used to learn something about the genetic or environmental effects on the behavior of these pathways. This slide just shows that the data look good; this is the data table. It is what I said at the beginning: we now have the ability to measure a number of proteins precisely and quantitatively across cohorts. By today's standards this is a rather low number of proteins, but it is focused, it makes the point, and it is efficient.

Using quantitative trait locus mapping, we can now make a link between the presence of a particular allele at a locus and the abundance of a protein; this is referred to as a protein QTL (pQTL). The assumption is that one allele causes the protein to be more highly expressed than the allele from the other parent, and since we have a sufficient number of measurements, we can relate the presence of a particular allele to the abundance of a protein. What we see is that from these 192 proteins we identify 44 QTLs, loci at which the allele affects the abundance of a protein. Some act in cis, and some in trans, meaning the product of one locus affects the abundance of a protein encoded by a different locus. This by itself is not super remarkable, but when we also look at eQTLs, from the transcripts that were also measured in these mice, we see a rather similar number of QTLs, links between allele and transcript, but with a different behavior: protein QTLs act more often in trans than transcript QTLs. That means transcript regulation is less diverse in the cell than the genetic regulation of proteins. I will not get into the effects of the environment.

Now we try to use this data, plus prior knowledge about these pathways, to learn something that may be interesting biochemically or actually clinically. One of the QTLs maps to an enzyme at the end, the last enzyme, of the degradation pathway of amino acids such as lysine, leucine, or isoleucine. These are degraded in a stepwise manner, exactly as the Beadle and Tatum principle suggests, and each step produces a metabolite as an intermediate product. So we have a QTL, a genomic locus, that affects the abundance of this enzyme, which is accordingly either high or low.
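In its simplest form, the pQTL association just described reduces to asking whether protein abundance differs between strains carrying the B versus the D allele at a locus. Here is a minimal sketch with simulated genotypes and abundances; the real analysis scans all loci and all proteins with multiple-testing correction.

```python
import numpy as np
from scipy.stats import ttest_ind

# At each locus every BXD strain carries either the B (C57BL/6) or
# D (DBA/2) allele; test whether protein abundance differs between
# the two allele groups. Genotypes and abundances are simulated.
rng = np.random.default_rng(2)
allele = rng.integers(0, 2, size=40)                     # 0 = B, 1 = D
abundance = rng.normal(10, 1, size=40) + 1.5 * allele    # a true pQTL

t, p = ttest_ind(abundance[allele == 0], abundance[allele == 1])
score = -np.log10(p)   # crude stand-in for a LOD/significance score
print(f"p = {p:.2e}, -log10(p) = {score:.1f}")
# Cis vs trans is then decided by whether the associated locus
# overlaps the gene encoding the protein itself.
```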
We can now correlate the enzyme level, which we take as a surrogate for its activity, with the metabolites further up the pathway. We basically treat it like a water hose: if we constrict the hose, if there is less of this enzyme, we ask whether the metabolites upstream pile up, as one would assume; and if there is a lot of the protein, a lot of activity down there, we would assume the metabolites upstream decrease in abundance, because they are processed. This shows that we can do this: the enzyme level is inversely correlated with this metabolite, which is also measured by mass spectrometry; the principle of the water hose, constricted or open. We also see that two metabolites up here correlate very nicely with each other: if one is high, the other is high. So the enzyme down here constricts the whole pathway. We have now made a link between a genetic locus, with the allele controlling whether the enzyme level is high or low, and the presence and abundance of metabolites, explained by the enzyme activity.

Interestingly enough, we can find literature showing that this intermediate product, aminoadipate, a small molecule generated in this degradation pathway, has been found in a large cohort, the Framingham Heart Study, to be a biomarker for diabetes risk. This is an interesting case, because it allows us to make the following statement: through systematic measurements in genetically perturbed animals, we are able to find a link between a genomic variant in a particular gene and an enzyme abundance; this enzyme abundance affects the activity of a metabolic pathway, the degradation of branched-chain amino acids; and if the pathway's activity at the bottom is low, the intermediates pile up, and they have been found to be a risk factor for a complex disease.

Do you know if the change is due to the DBA background or the Black 6 background? Because as far as I remember, DBA is more susceptible to diabetes.

Yes. There is a whole range of disease-phenotype measurements in these mice; it is amazingly complicated. For these BXD mice, more than 300 phenotypes have been measured, including some disease phenotypes, and many of them are quantitative, so you can assign a numeric value to them. In every case I have looked at, the parents, the DBA or the Black 6, are somewhere in the middle: you can plot the numerical phenotypes from strain 1 to 180, and the parents are always somewhere in the middle, while the reorganization of the alleles creates offspring that lie far outside the range of the parents. This of course goes beyond simple Mendelian inheritance, and it holds for all of these quantitative phenotypes.

Okay, I want to summarize this part. The correlation of proteotype and genomic measurements in a genetic reference population indicates very complex relationships between genetic constitution and the eventually expressed information. In very specific and simple cases, where a lot is known about the mechanism, we can use this prior information, relate it to the generated data set, not a super big data set in what I showed, but a rather substantial one, and reach a somewhat mechanistic understanding.
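Going back to the water-hose step: the check is just a correlation, across strains, between the enzyme level and an upstream metabolite, expected to be negative if the enzyme constricts the pathway. A sketch with simulated values standing in for the measured enzyme and metabolite levels:

```python
import numpy as np
from scipy.stats import spearmanr

# Across strains, an enzyme at the bottom of a degradation pathway
# should anti-correlate with the intermediates that pile up when it
# is scarce. Values are simulated per-strain levels.
rng = np.random.default_rng(3)
enzyme = rng.normal(10, 2, size=40)
metabolite = 50 - 2.0 * enzyme + rng.normal(0, 2, size=40)

rho, p = spearmanr(enzyme, metabolite)
print(f"Spearman rho = {rho:.2f} (expected negative), p = {p:.1e}")
```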
And I think we can find ways, this is a big challenge for the future, to systematically integrate large-scale data with mechanistic data of the kind generated by many of the biologists here, who work for years on a very complex biological system: to use these general principles as background, and to determine how this background is modulated in a specific case, in a specific genetic background, or under specific conditions; and that can certainly be elaborated with large data sets. So my conclusion clearly is that correlation is not enough. It is a useful tool, but if we think we can simply correlate large data sets to get mechanistic, biologically meaningful insights, I think this will not work.

I was planning to show how the cell processes gene dosage effects, and I don't have time, but let me summarize it in one picture. We collected a panel of cell lines which, by sequence, are essentially identical or very similar. They are very frequently used in laboratories, in 100,000 publications, but they are genomically unstable, so sequence-wise they are similar, but the genomic landscape is very different. Here we map copy number variation in these cell lines, collected from various laboratories where people do experiments with them. We see that although these cells have the same name and are used in laboratories for experiments, they are substantially different, not in sequence, but in copy number variation, namely the ploidy of genes in specific regions. I want to draw your attention to this picture, two of the chromosomes, where red always indicates high ploidy and green low ploidy. There are very large blocks of chromosomal regions which are amplified in these cells or not; it is a pattern of green and red blocks. When we go to the transcripts, this pattern already gets somewhat diffused, and when we go to the proteins, it gets very diffuse. So the effects of increased or decreased ploidy are interpreted by the cell in extremely complicated ways. What I do not have time to show is that the organization of the proteins encoded in these amplified regions into modules is a big buffer; it is one of the most dominant factors in how the cell modulates the abundance of proteins induced by a higher copy number of a particular gene. If a protein is known to go into a complex, and the other subunits are not also augmented, that protein is buffered down and basically degraded. Through this mechanism, and it is not the only one, copy number variations are interpreted by the cell into a very refined and actually strongly buffered landscape at the level of the proteins, the physiologically relevant level.
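The buffering argument can be caricatured in a few lines: a copy-number gain is largely transmitted to the transcript, but a complex subunit whose partners are not co-amplified is degraded back toward stoichiometry. The attenuation factors below are illustrative placeholders, not measured values.

```python
# Follow a gene whose copy number is doubled and compare how far the
# change propagates to transcript and protein, depending on whether
# the protein is a complex subunit. Factors are invented for
# illustration of the buffering idea only.
def propagate(copy_number_ratio: float, in_complex: bool) -> dict:
    transcript = copy_number_ratio            # dosage largely transmitted
    # Complex subunits are degraded back toward the stoichiometry of
    # their partners; free proteins roughly track their transcript.
    protein = 1 + (transcript - 1) * (0.3 if in_complex else 0.9)
    return {"CNV": copy_number_ratio, "mRNA": transcript, "protein": protein}

print(propagate(2.0, in_complex=True))   # protein ~1.3x: buffered
print(propagate(2.0, in_complex=False))  # protein ~1.9x: transmitted
```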
With that I would like to finish, and this is the topic of this morning: we can now measure, with amazingly effective tools, a lot of genomic variability in very large cohorts, thousands or tens of thousands of individuals, and we have lots of phenotypic information. The way we bridge the two, I believe, needs to involve proteins, not just the abundance of proteins but also their modularity, and if we can make more headway in defining this quantitative proteotype by measurement, we will be in a much better position to link genotypic variability to phenotypes.

These are the collaborators whose work I showed. The last part, which I largely skipped, is the work of Jan Scheglu together with Wolf Hart, a colleague at ETH; the DYRK2 project is the work of Martin Mehnert, a postdoc; and the BXD project is the work of Evan Williams and Yibo Wu, two postdocs, together with the group of Johan Auwerx at EPFL, who created and maintains the BXD mice. Thank you for your attention. We may have time for a quick question.

I'd like to go back to the list of principles you showed at the beginning, and to the predictability of complex systems. Oscillations and the cell cycle are typically the best predicted, and I think they are predicted because what has been modeled is the regulatory layer. Engineers, in any complicated system, distinguish between a regulatory layer, a control or auto-regulatory system, and the basic core process; in the case of the cell cycle we all know the spindle and so on. In each module, whether in a complex man-made machine or a biological machine, it is possible to distinguish the basic manufacturing plant from the control system, and engineers distinguish between those two components. It is relatively possible, let's say, if not easy, to understand the control systems if we work them out. This goes back to what I tried to show this morning when I spoke: once you have identified the modules, the regulatory system, as I said, is not hugely complex. It is possible to break down the complexity of the overall system, the cell, the organism, into modules, the regulatory systems, and the coordination among them. This could be a way to predict a complex system: breaking the complexity down into the modules and then into the regulatory layers, which is basically what engineers do.

Yes, I agree with that, and I think this is certainly the goal. The problem is that we are now reasonably good, not perfect, but reasonably good, at determining the modules that actually do the work, but we do not know enough about the control systems. There are transcription models, which is one level of control; we have microRNA control, translational control, phosphorylation control.

I think the analysis needs to start from a function and understand what the control machinery on that function is, and this is not done; it is very neglected in cell biology.

I think this is true; however, it is also very complicated, because it is not a single level of control that controls the system; many levels contribute. That is why I think we should of course work toward figuring out these control mechanisms, but in the meantime, for the foreseeable future, we are limited to, or better off, doing measurements, taking the point of view that the cell knows which control systems to use and how they are used to control a particular process. If we can make a readout that reflects all levels of integrated control, then we can make a better prediction. So this is a surrogate for having a theory: let the cell do the work, and make measurements that come close to determining the phenotype.

I agree on the rules, but not on the method.

One more question: do you think this proteotype is quite stable, aside from non-genetic variation like transcriptional noise or epigenetic events? Is it a quite stable or a dynamic configuration?

It is quite stable.
I mean, we are not able to make measurements at the single-cell level, of course; we always measure aggregates over a certain number of cells, which can be good or bad, and we can discuss that, though maybe not here. But under specific conditions the proteotype is actually quite stable. It is also strongly reactive: we can show that even a mutation somewhere in a protein, a single amino acid exchange, has a noticeable effect on the proteotype. That is quite remarkable; it is a very sensitive readout. But it is inherently quite stable, through mechanisms like, for instance, the buffering of variability at the level of the modules, which is what really matters for function; this is what I had to skip over.

We'll have to stop now. Thanks, and enjoy your break.