All right. Well, thank you everybody for coming again. It is my pleasure to welcome back Barry Honig as this year's 2016 Steenbock lecturer. Yesterday I neglected to say anything about Harry Steenbock, whose work over time has generated lots of money that allows us to invite all of these speakers to campus. So in the 1920s, Harry Steenbock invented a procedure for irradiating milk, converting a compound in milk into vitamin D, and felt that this was worth some money, and Quaker Oats wanted to license the invention. And so I think Steenbock went to the administration and he said, I think this is worth some money. We should patent this. We should do something. And at the time the administration went, we're a university, we don't care about these things. You're supposed to laugh now, because, right, what a difference a century makes. But in any case, the result of that was the first technology transfer office at a university campus; ours is the Wisconsin Alumni Research Foundation, which is run as a nonprofit. So the proceeds from that and other inventions from the department help us to run nice programs and invite prominent people to campus. So what we have today is the second of the Steenbock lectures. So if you would please help me in welcoming back Barry Honig from Columbia University and the Howard Hughes Medical Institute to give the second in this series. We'd love to welcome Barry. Thank you, Julie. So yesterday, I gave a talk which, although it had a strong cell biological component, was really traditional structural biology applied to particular biological problems. Today I'm going to talk in a very different tone. My goal, as I say in this title, is to find a way to integrate structural biology with systems biology. 
And the reason I want to do that, other than that I find it interesting, is that systems biology, especially people looking at networks of interactions, is basically a field of people who know very little about proteins, who know very little about molecules. In effect, proteins and DNA are points on a diagram. And the hope is that if we have an understanding of the molecules involved, we can find a way to expand both fields and also expand what we as structural biologists know and what we can do with our structures. So an example of what I'll be primarily talking about today is to take one of these interactomes, in which every point is a protein and there are lines connecting the points, and to somehow put a structural face on that sort of interaction diagram. And more specifically, to use three-dimensional structure to infer interactions that you couldn't infer any other way. So that's the goal of my talk, and what I'm going to do is basically this: we published a paper a few years ago describing how to do this. I'm going to go over what's in that paper, then show what's happened since then, and then point to the future in terms of specific applications, none of which are complete, but I want to give you a sense at least of what I think is possible. So, to use structure in new ways: one of the goals, of course, is to use structure on a scale at which it isn't normally used; we really want to be able to use structure on what we call a genome-wide scale. So there are a number of things we have to do. One is use homology models. These are models of proteins derived from known crystal structures. Everyone knows there aren't enough crystal structures to cover every protein sequence. The second thing I'll be showing you and talking about is structural alignments. We're going to use structural similarity between proteins as a way of expanding the connections between proteins that can be inferred. And the third is to use statistical inference. 
Using it in the context of protein structure, and I want to show you how we do that. But to do all of these things, we have to relax our standards of how we use structure. So if you're interested in an enzyme mechanism, you care about every angstrom and every kilocalorie. If you're using structure on a genomic scale, you can't worry about that detail. You want to extract information any way you can. So yesterday had angstroms and kilocalories; today won't. So the first point to make is, if we look at the structural coverage of, say, the human genome, there are about 5,000 proteins in the Protein Data Bank for which there's a structure for at least one protein domain. Not the whole protein, but at least one domain. If you look at databases of homology models, or what one can construct with existing software, you get about 12,500 more. Together, there is some structural information for about 18,000 human proteins. If we look at protein-protein interactions: from about 20,000 proteins, there are 206 million possible interactions; there are about 1,000 complexes in the PDB; and there are about 169,000 interactions in existing databases, which may be taken from the literature or from high-throughput methods such as yeast two-hybrid. So we don't really know how many protein-protein interactions there are in human cells. And indeed, the word interaction can be deceiving. What does it really mean? I'm thinking of it primarily as two proteins that form a complex. But then there may be proteins that are in a multi-protein complex that don't interact with each other directly, but at least are structurally related. And then there can be a third case where two proteins are on a pathway, one affects the other, so they're in a way interacting, but they're not physically interacting. And so even when people say there are so-and-so many interactions, it's not always clear what's meant. The truth is that we don't really know how many interactions of any kind there are. 
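Just to make the combinatorics concrete: the number of possible binary interactions among N proteins is N(N-1)/2, so roughly 20,000 human proteins give on the order of 200 million candidate pairs. A one-line sketch:

```python
# Number of unordered pairs among n proteins, i.e. candidate binary
# protein-protein interactions (order doesn't matter, no self-pairs).
def possible_pairs(n_proteins: int) -> int:
    return n_proteins * (n_proteins - 1) // 2

print(possible_pairs(20_000))  # 199990000, on the order of 2e8
```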
Okay, so I said briefly in the last slide that we can use homology models to extend the reach of structural information, and I'll add that homology models can be very good, and they can be very bad. And at one level I'd like them to be good, but at another level I don't necessarily care, because the hope is that I can extract information even from a bad model, and I'll get back to that. Maybe a bad model has some clues that can be used in a gainful way. The second way we use structure is in finding relationships between different proteins with structural alignments, and I want to show you how proteins are very often classified, and actually argue against classification. So there is the SCOP database of protein structures, and in this blue circle I represent members of a protein family. These would be proteins that have a clear sequence relationship, a clear functional relationship. Then in the orange ellipse here are proteins in a superfamily, and these are proteins that are more distantly related, but nevertheless have a functional and perhaps detectable sequence relationship. And then there's the third classification that people use, which is the protein fold, and a protein fold basically is a way of describing what a protein looks like. And this is where for years I've had a bee in my bonnet: I don't like the term protein fold. I don't mind saying a protein has a fold, but to classify proteins based on what they look like, number one, there's no objective way of doing it, so you're really dependent on somebody's opinion. But it can also obscure information, and in fact I'll show you in a second that there are connections between proteins that are classified as having very different folds, and we want to use that information, because it's that information that allows us to use structure to find distant relationships that can't be identified with sequence. Let me give you an example. 
Here are three proteins that are classified as having different folds. Certainly, looking at them, you might not see any connection, but each one of them has a structural fragment which looks similar, and if you superimpose them, you see that these three fragments superimpose well. And this superposition, this geometric alignment, is basically central to everything I'll be talking about. There are many algorithms that superimpose secondary structure elements. We have our own, but there are many good ones, and it's basically something you might do by eye: superimpose secondary structure elements, find common features. What's interesting in this case is that even though these proteins are different, once you superimpose these substructures, you find that they all bind a cation in the same location. Does this mean that they evolved from some ancestral cation-binding fragment? We don't know if it's convergent or divergent evolution, but this would be an example of extracting information from structural alignment that you would never get from sequence. We use the term structural BLAST to mean the following: when you run BLAST, you run a sequence alignment, and you look for small motifs that align. Here we're looking for small structural motifs that align, and trying to use that, again, to infer function. Structural BLAST will be an integral part of what I'll be talking about. As I said yesterday, and I'll say it again, please feel free to interrupt at any time if there are things I say that you'd like me to clarify or that annoy you. Question? Yes. Are they? I don't know. It's a very good question. I think they're divalent. I assume they are, but I'm not certain. What I'll be doing in the context of this seminar is the following. Let's say I have a protein, and I want to know what it binds to, and that could be DNA, it could be a small molecule, it could be another protein. I have in the database other proteins that bind to something else, say this pink sphere. 
In this case, I might align this green protein to the purple one, and the geometric transformation that turns the green into the purple also moves the sphere onto the surface of the purple. If I do that with other structural neighbors, I might find that all these small spheres bind in the same location, and that might suggest that this is a hot spot for binding small molecules, for example. This is the notion that we're using, and these other spheres could be small molecules, they could be other proteins. But the basic idea is, if we geometrically transform this protein onto this protein, and carry with it its ligands, we have a hypothesis as to where our query protein might bind other proteins. We have to prove that that hypothesis works, but that's the basic notion. The first way we've used this is to take a protein and to predict where on its surface it will bind other proteins. The existing methods in the literature will take this protein and, for example, look for a hydrophobic patch on the surface, or a region of sequence conservation. So the protein is analyzed for its own physical properties: there's a hydrophobic patch here, say, so maybe that's an interaction site. What we're doing is different. We look for a structural neighbor, in this case the red protein, superimpose the green on the red, and just like I showed you in the last slide, infer that perhaps the green protein binds other proteins in this location. This might give us an interface, but of course, if we only do it once, it's meaningless. The algorithm that we designed is the following. We take a protein of unknown function and we look for its structural neighbors. A structural neighbor is a protein that looks like my query protein. We superimpose Q on neighbor one, neighbor two, neighbor three. Now when I do the superposition, if there are two amino acids in the neighbor protein that bind a ligand or another protein P1, we count those. 
So these two appear twice. Then we go to another neighbor and we count the residues that are aligned in the interface for neighbor two, neighbor three. We add them up, and we end up with a hotspot, shown here in red: regions on my query protein that align to the interfaces of other proteins. So I hope that's clear. That's the way it works, and it works very well. I don't like doing this sort of thing, but I can tell you, I have my own figure showing how well we do, but I was delighted to see somebody else did a comparison of the many, many methods to predict interfacial residues, and ours comes out the best, except for theirs. But now we have a new method that's a little better. Anyway, this is supposed to tell you that this method works very well. Better than methods that are based on, say, hydrophobicity alone. We've since added that feature to what we do. I'm not going through the whole slide, but we look at the propensity of every amino acid to be in a protein-protein interface, integrate that with the structural information, and, again, I'm just showing you, we do better than we did before. So my point here is that using this sort of structural alignment, we can infer where one protein might bind to another protein. And it's supposed to convince you that this is a reasonable way to approach the problem. But it doesn't yet tell you how we deal with protein-protein interactions. And the way we do that is just an extension of what I've been telling you. So now I have to tell you how we do it, because that's crucial. Here is a structure of a complex in the Protein Data Bank. So these are red and blue, and we want to know if the cyan protein binds to the purple protein. 
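The voting procedure described a moment ago, counting how often each query residue aligns to an interfacial residue across many structural neighbors, can be sketched roughly as follows. This is a toy illustration, not the published algorithm, and all residue numbers and alignments here are invented:

```python
# Toy sketch of interface prediction by structural-neighbor "voting":
# each structural neighbor contributes one vote to every query residue
# that aligns to one of the neighbor's interfacial residues; residues
# with enough votes form the predicted hotspot.
from collections import Counter

def predict_interface(alignments, min_votes=2):
    """alignments: list of (query_res -> neighbor_res dict,
    set of the neighbor's interfacial residues) pairs."""
    votes = Counter()
    for q_to_n, neighbor_interface in alignments:
        for q_res, n_res in q_to_n.items():
            if n_res in neighbor_interface:
                votes[q_res] += 1
    return {res for res, n in votes.items() if n >= min_votes}

# Three hypothetical neighbors; query residues 10 and 11 keep aligning
# to interfacial residues, so they form the hotspot.
alignments = [
    ({10: 5, 11: 6, 30: 40}, {5, 6}),
    ({10: 7, 11: 8, 25: 50}, {7, 8, 50}),
    ({10: 9, 42: 60}, {60}),
]
print(sorted(predict_interface(alignments)))  # [10, 11]
```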
So we superimpose the cyan on the red, the purple on the blue, get a model for the interaction of the purple with the cyan, and infer, though I haven't told you yet how we do it, that the cyan binds to the purple protein. Yes? [Question:] In the structural alignments, how do you account for gaps, and the fact that you might have substructures that align only partially, and you might have deviations? I'll get to that in a minute. Literally, I just want to show you, this is what we're going to do. I haven't told you how we do it yet. This is the goal: to use known complexes as templates for other possible complexes. So let's see exactly how we do that. So here, this is the crucial slide in our method, and I'll take you through it. I have two proteins, query A and query B, and I want to know if they form a complex. So the first thing I need, if I'm using structural information, is a model for the whole protein or parts of that protein. And this model, model A and model B, can come from the Protein Data Bank if there are crystal structures, or it can come from homology modeling databases or whatever. I have a model. The next step is to look for the structural neighbors of each of these proteins by structural alignment. So we take protein A and we align it to every protein in the Protein Data Bank to see what it looks like. And we find on average 1,500 neighbors per structure. And here I'll say right off that we have very loose criteria for a neighbor: we require that a minimum of three secondary structure elements superimpose, because we're just looking for fragments, or the whole protein, or whatever. So if there are 1,500 neighbors for each, there are about two million possibilities for a complex between this one's neighbors and that one's neighbors; 1,500 squared is about 2 million. And in general we find around 300 complexes per pair of proteins. So we now superimpose the yellow on the brown, say, and the green on the pink. 
And we come up with a model for the complex. So we have a model based on this superposition, and now we have to score it. And the number of complexes we get is enormous. We are using 18,000 proteins, and we find 1.5 billion possible pairs of interactions from this procedure. So we have to score literally billions of models. So the first point to make is that using this procedure of looking for structural neighbors, we get an enormous number of models, much larger than before, and this is how we get to a genomic scale. The question is, how do you score so many models? That's what I'm going to tell you now. The first thing we have is our PDB structure, and we know from the structure what residues are in the interface. So the first thing we do is align our two models to this structure. And the first criterion we have is simply: how good is the alignment? If six secondary structure elements overlap, it's better than three. So we have a score for that. The next thing we ask is: is the alignment in the interface? This addresses the case where the alignment is only over here and over here; then there would be no interface in the modeled complex, and it would get a bad score. And the third thing we ask is: are the residues that we predict to be in the interface likely, statistically, to be interfacial residues? And we get a score for that. And we combine them, as I'll show you in a minute, using some Bayesian statistics, to get a score for this complex. The reason we can deal with billions of models is that we never calculate a pairwise energy. We only calculate the properties of the one on the left and the one on the right. So it becomes a linear rather than a quadratic problem. That's how we can evaluate so many models. So how do we do it? I'm not a statistician, and I hope nobody in the audience is an expert. But the way you do this, I'm learning, is with Bayesian inference, as follows. 
You have a positive test set. These are proteins which we know form a complex, say. And a negative set, which are proteins that we believe don't form a complex. Now if there's a property x, whatever that property is, some of the proteins in the positive set have a certain value for x, and some of the proteins in the negative set have a value for x. And if, given x, you have a higher probability of being in the positive set than the negative set, that means there's some information in that parameter. So this is the probability, given that you're in the positive set, of having property x. It's basically an enrichment in the cases where you know the answer is positive. And then from this you get what's called a likelihood ratio, which is just the ratio of the probabilities of being in the positive set relative to the negative set, given a value of x. The nice thing about Bayesian statistics is that if you have independent sources of evidence, you get the likelihood ratio of their combination by just multiplying the individual likelihood ratios. So for example, if you have a structural score, and, as you'll see in a second, the proteins might also be co-expressed, then you multiply the likelihood ratios to build up a total likelihood ratio. I hope this is clear, but this is crucial to what we do. The next thing we need to do, and I haven't convinced you yet that any of this works, I'm just telling you how we set up the infrastructure, is to create test sets. So for our positives: there are databases of protein-protein interactions. We look at all of them, and we only accept interactions that appear with two literature references, and we call that our positive test set. And our negative test set is basically, in this case, the other 200 million interactions, some of which will be true, but most of which will be false, because it's just everything possible. 
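As a toy illustration of this likelihood-ratio bookkeeping (all feature values below are invented): estimate P(x | positive) / P(x | negative) from the two sets, and combine independent evidence sources by multiplying their ratios:

```python
# Toy sketch of Bayesian likelihood ratios from positive/negative
# training sets. The feature values are hypothetical illustrations.
def likelihood_ratio(values_pos, values_neg, x):
    """LR = P(x | positive) / P(x | negative), estimated from counts."""
    p_pos = values_pos.count(x) / len(values_pos)
    p_neg = values_neg.count(x) / len(values_neg)
    return p_pos / p_neg

# Structure score: x = 1 is four times as common among true interactions.
lr_structure = likelihood_ratio([1, 1, 1, 1, 0], [1, 0, 0, 0, 0], 1)
# Co-expression: x = 1 is twice as common among true interactions.
lr_coexpr = likelihood_ratio([1, 1, 0, 0, 0], [1, 0, 0, 0, 0], 1)

# Independent sources of evidence combine by multiplication.
print(lr_structure * lr_coexpr)  # 8.0
```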
So that's how we define our positive and negative test sets, and here are the sources of evidence we use. The first is the one I just told you about, structural modeling. And just hopefully to make it clear: before we did anything else, we looked at these three scores and calculated a likelihood ratio from structure alone, based on structural alignment and based on these interface properties; we calculate that independently. Then, and I'll get to the orange ones in a minute, phylogenetic profiles ask: do these two proteins exist in multiple organisms, and are they always there together? Co-expression is pretty straightforward: are they co-expressed? And again, we have to use our positive and negative sets to get a likelihood ratio for co-expression. GO terms: GO is the Gene Ontology. It's an annotation of protein function, and if two proteins have a similar GO term, that increases the probability that they'll interact. And finally, orthology asks: do their sequence neighbors in other organisms interact? So each of those is tested independently, and we get a likelihood ratio. Each of these makes a contribution. The two remaining orange ones are in part structure-based. One is this: I've been talking about protein-protein interactions between structured domains, but what about interactions between a structured domain and an unstructured protein? The way we deal with that is two methods. There's a database of such interactions, and we simply ask: does our protein have a structured domain that looks like one in this database, and a sequence similar to a known motif? So we're not being very clever here. We're using existing databases to infer an interaction. Similarly, if there is a protein sequence bound to a domain and we know its structure, we can ask: do our proteins conform to this structure? I don't want to go into this in detail, but this basically is the logic of what we do. I'll skip this one. So this summarizes the total approach. I showed you the left part of the slide. 
This is the structural modeling score. We then take all these other sources of evidence, integrate them with structure in this Bayesian framework, and come up with a score, which we call a PrePPI score. PrePPI stands for Predicting Protein-Protein Interactions. Just for a moment of diversion, I thought it was a cute name, because I know what a preppy is. But most people not born in the United States don't know what a preppy is, and most people under 30 don't know what a preppy is. But for those of you that do, it was supposed to be funny. Anyway, it means predicting protein-protein interactions. So this is what we do, and when we're finished, we have a database which you can access. You type in the name of a protein, and it will give you its interaction partners, give you scores, and tell you where the evidence comes from for each partner. And ultimately, it's meant as a hypothesis-generating procedure. This is sort of the nature of systems biology; it's statistical in nature. But we've succeeded in using structure on a really unprecedented scale, and that's something we're obviously very pleased about. One thing that is perhaps obvious to statisticians, but I found it amusing: if we use structural evidence alone, we make predictions for about 130,000 pairs of interactions. If we use non-structural evidence alone, 115,000. When we combine them, we make predictions for over a million interactions. And our predictions, what we call high-confidence predictions, are based on a likelihood ratio greater than 600. The number will mean nothing to you, except that it corresponds to a false positive rate of about 10 to the minus 3. So why is this true? Let's say we have a weak prediction from structure, a likelihood ratio less than 600, but the proteins are also co-expressed and have a similar function. Then that will amplify the signal from structure, and will give us many more predictions than we get from structure alone. 
Some other features of what we find: if we use the Protein Data Bank alone, we make 47,000 predictions from structure. If we add homology modeling, we go up to 127,000. If we use the other sources of evidence as well: with the PDB alone, 250,000; with homology models, 1.3 million. So we're getting an enormous amplification from homology models. Now, the evidence from PDB structures is much better than the evidence from homology models, but the evidence from homology models is still useful. Another way of looking at this: if we look at close structural homologues, say the same family or superfamily, this shows you that we make 47,000 predictions, and we get very few predictions from distant structural relatives. But if we now combine this with other sources of information, most of our predictions are from proteins that are distantly related. And this is, again, a crucial reason why this stuff, as you'll see, works well. It's the combination of structural information with non-structural information. So everybody has their stories of getting published. We submitted this to Nature, and people wanted experiments, as seems reasonable, except it's hard to do an experiment on 1.3 million interactions. But Cliff Zhang, who was doing the work, had friends in labs around the country. And he basically asked them to give him proteins they were interested in; he would make predictions if they would test them. And he had 19 proteins from different labs, actually most at Columbia, but one from Tony Hunter at the Salk. Of the 19, 15 of the predictions he made were correct. So even though we had statistical studies on a million interactions, this 19 is what got us into Nature. And statistically, it's meaningless. But that was one test that we did. And this is just one example: I talked yesterday about cadherins. We're interested in interactions between protocadherins and other proteins, tyrosine kinases. We made a prediction of where the interface was found. 
There was a key tryptophan there. Maniatis's lab mutated it, and that abrogated the interaction. So when we have a structural model, we are able to make testable predictions in some detail. There are many other databases of protein-protein interactions. The most notable ones are taken from high-throughput experiments, such as yeast two-hybrid measurements from the Vidal lab. BioPlex is a database taken from tandem affinity mass spec measurements. And we obviously looked to see to what extent our database overlaps other databases. And this is a general feature of high-throughput databases: nothing overlaps. It's somewhat disconcerting, but it either means that we're making lots of mistakes, or that there are lots of interactions that just don't get detected, because different proteins are being studied and different methods are being used. But these are the two standard gold-standard high-throughput databases. So I'm going to give you a sense now of how our predictions compare to those. We have to assemble data sets that we test against. One I've already mentioned: the data set with two literature references for the positive set, with the negative set being everything else. But the Vidal group at Harvard has assembled very, very high quality data sets, that we had nothing to do with, of known positive protein-protein interactions and proteins that are known not to interact. So if we plot true positive rate versus false positive rate, and ignore the gray curve, this is our PrePPI prediction, this is the yeast two-hybrid data, and this is the mass spec data. And you can see that we're doing very well. So there's evidence that PrePPI is comparable to high-throughput experimental data. I should say that high-throughput experimental data is lousy. But this is, again, the state of the field. The nice thing about PrePPI is you just have to go to our website and click to get a prediction if you're interested in a particular class of proteins. 
So I'll actually get to that in a second. But I'm taking you through different applications of PrePPI. Yes, they do try to reconstruct protein-protein complexes, and we try to do that as well, and you'll see in a moment how well we do. So here we're just looking at this carefully curated database. This is the likelihood ratio cutoff, and 600 is our cutoff for high probability. And you can see that we reproduce about 60% of the interactions in this database. And this is a database of interactions which don't exist, and you see that at a cutoff of 600, we don't get any of those. So this adds to the reliability of our cutoff measure. BioPlex and the yeast two-hybrid database reproduce a much smaller number, simply because they're smaller. So we are picking up a fairly large number of known interactions without using the information in those interactions. This is going to lead to someplace we're going. There's a database of disease-related genes called ClinVar. There are other such databases. And there are disease mutations, so some of those genes have known mutations. We looked in that database for proteins that are associated with the same disease, and we asked: do we predict that some of them interact with each other? And what this is supposed to tell you is that of these 3,700, we find that 500 are predicted to interact with each other, and 300 of those don't appear in any other database. But if we take the same number of proteins randomly chosen, so that they don't appear in the same disease, we predict that only 41 interact with each other. So this is giving us some indication that maybe PrePPI can be used to infer something about disease-related proteins. This is actually a very striking result, the difference between disease-related proteins and everything else in the prediction that they do interact. And again, the other existing databases make a very small number of these predictions. 
So somehow, I mean, it seems reasonable that if two proteins are associated with the same disease, they'll have a higher probability of interacting than if they're not, and we're picking that up. This is a graph, and I'm sorry I'm sort of overwhelming you with graphs; I'll try to stop soon. The fraction of mutations, of SNPs, predicted to be in an interface: in a series of disease-related proteins, we look at their SNPs, and we see here that about 15% are predicted to be in an interface. If we take SNPs that are benign, a much smaller number are predicted to be in an interface. So this also points to something: can we use our predictions to ask if a particular mutation disrupts or affects a protein-protein interaction? So we have all these results that are very encouraging. And finally, in terms of your question about protein-protein complexes, what we did is exactly what the BioPlex people did. You have a protein-protein complex; in this case, it has four proteins. And we ask: what is the likelihood ratio of an interaction between each of them, and can we recapitulate the complex at a certain likelihood ratio? So in this one, these pairs are found to interact with LR1, LR2, LR4 (somehow 3 disappeared). But these three connections are enough to recapitulate the complex. And we ask: what is the minimum LR that recapitulates a given protein-protein complex in a database called CORUM? And you see that at our cutoff of 600, we get about 70% of the complexes. So in principle, we have the information to recapitulate protein-protein complexes. Again, the other databases just don't have the same coverage. So in principle, just as an aside, we could take this information and try to rebuild a complex from the start. Remember, here I've started with a known complex and asked: can we account for the fact that there is a complex? But if I try to simply take a given protein and all of its interactors, I'd have too much information. 
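The recapitulation test just described can be sketched as a simple connectivity check: given likelihood-ratio-scored pairwise predictions among a complex's subunits, ask whether the subunits remain linked at a given LR cutoff. The subunits, edges, and LR values below are hypothetical:

```python
# Toy sketch of complex recapitulation: a complex is "recapitulated" at
# a cutoff if its subunits form one connected component using only edges
# whose likelihood ratio meets the cutoff.
def is_connected(subunits, edges, cutoff):
    """edges: {(a, b): LR}. True if all subunits are linked."""
    adj = {s: set() for s in subunits}
    for (a, b), lr in edges.items():
        if lr >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = set(), [subunits[0]]
    while stack:                      # depth-first traversal
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return len(seen) == len(subunits)

subunits = ["A", "B", "C", "D"]
edges = {("A", "B"): 5000, ("B", "C"): 900, ("C", "D"): 650, ("A", "D"): 10}
print(is_connected(subunits, edges, 600))   # True: a spanning path survives
print(is_connected(subunits, edges, 1000))  # False: C and D drop off
```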
So actually rebuilding complexes from scratch is something we're working on, but I'm not convinced we're going to do it. We'll see. So this sort of summarizes my attempt to convince you that PrePPI is a useful tool. And I'll just point out that we're using the same structural BLAST approach now to look for protein-small molecule interactions. The goal here would be, given a complex, to predict, perhaps, small-molecule inhibitors of disease-related proteins or whatever. We're working on that, and it hasn't led to a publication yet. So in the time remaining, I'd like to point to some new directions we're taking with this. Although I guess this is something we are publishing: there is something some of you may be familiar with called gene set enrichment analysis. Basically, there are publications where proteins of a known function are classified into a particular gene set, which is associated with the same function. So what we've done is the following. We take a query protein, and this is a list of its interactors, ranked: most likely, second most likely, third, fourth, whatever. And we ask: is there an enrichment in a particular gene set? So if you look at these, the three purple proteins are numbers two, three, and six. So three of our highest-confidence predictions belong to gene set one. In contrast, the gray ones are distributed so that there's no enrichment. So one can take these predictions and ask: are certain protein functions enriched? And use that to infer the function of the query protein Q itself. So the basic notion is: if a protein's interactors belong to some gene set, then the protein itself is more likely to belong to that gene set. So this is a way of inferring protein function, and this tells you that we studied 10,000 proteins; for 2,000 of those, the known function was in the most likely gene set, and for about half, it was in the top 10. 
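The enrichment question, whether a query's top-ranked predicted interactors are over-represented in one gene set, can be sketched with a standard hypergeometric tail probability. The gene names, gene set, and universe size below are invented; this illustrates the statistic, not the published procedure:

```python
# Toy sketch of gene-set enrichment among top-ranked predicted
# interactors, scored with a hypergeometric tail probability.
from math import comb

def enrichment_pvalue(top_k, gene_set, universe_size):
    """P(>= observed overlap) when drawing len(top_k) genes at random."""
    k = len(top_k)
    m = len(gene_set)
    overlap = len(set(top_k) & gene_set)
    total = comb(universe_size, k)
    return sum(
        comb(m, i) * comb(universe_size - m, k - i)
        for i in range(overlap, min(k, m) + 1)
    ) / total

ranked = ["p2", "p3", "p6", "p9", "p12"]       # hypothetical top predictions
gene_set_1 = {"p2", "p3", "p6", "p20", "p21"}  # hypothetical gene set
# 3 of the top 5 predictions fall in the set: unlikely by chance,
# suggesting the query shares that function.
print(enrichment_pvalue(ranked, gene_set_1, universe_size=100))
```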
So this, I think, is a novel way to infer protein function: by asking what the functions of its interacting proteins are. And this is something, again, we've been pushing. Now, the last few slides are going to be, in a very, very hand-waving way, more like a grant application, something that we've started to do now. So there are disease mutations. People do GWAS studies, and they find SNPs that may be associated with this or that disease. The question then is: how do you analyze what that SNP might be doing? And the main method currently used to analyze SNPs is to see whether they might affect protein stability. So if you have a SNP that looks like, say, a buried charge in a protein, and you know the structure, you might infer that it's disrupting the structure. But protein-protein interactions are not directly taken into account. In this terribly complex slide, I just want to give you the notion of what we're working on. This is a protein P which is, say, identified as being associated with a certain disease. We look at its interactome and at the interactions of its interactors; we construct these networks. And then we can ask different questions. Does it have disease-related SNPs? Do its partners have disease-related SNPs? If there are other proteins associated with that disease, do they have SNPs that have a phenotype? We've started doing this, and we're able to assign function to different SNPs, to basically expand the information we have by looking not only at the protein which has a SNP, but at the proteins around it that might have similar functions that are affected. So our goal ultimately is to associate SNPs with protein networks rather than with individual proteins. So this is one direction we're going with this. Another direction, which gets back to the prediction of interfaces, is to look more deeply at individual cases.
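The network view of SNP evidence can be sketched as a simple neighborhood query. This is a cartoon of the bookkeeping only; the function name, the dictionary-based interactome, and the two-shell depth are assumptions for illustration, not the lab's actual pipeline.

```python
def network_snp_evidence(query, interactome, disease_snps):
    """Collect disease-SNP evidence around a query protein: its own SNPs,
    those of its direct interaction partners, and those of second-shell
    proteins (partners of partners).

    interactome:  dict protein -> set of interacting proteins (undirected)
    disease_snps: dict protein -> set of disease-associated SNP identifiers
    """
    partners = interactome.get(query, set())
    second = set()
    for p in partners:
        second |= interactome.get(p, set())
    second -= partners | {query}  # keep only the second shell
    return {
        "self": disease_snps.get(query, set()),
        "partners": {p: disease_snps[p] for p in partners if p in disease_snps},
        "second_shell": {p: disease_snps[p] for p in second if p in disease_snps},
    }
```

A protein whose neighborhood is dense with SNPs for the same disease becomes a network-level candidate, which is exactly the shift from annotating individual proteins to annotating networks.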
Because ultimately, we have this method, but we'd like to use it, and we hope others do as well. So we've become interested in KRAS pathways. Now, KRAS is a protein associated with many cancers, and KRAS mutations are believed to drive many of them, but people have not been able to drug KRAS. It cycles between GDP- and GTP-bound states. And it's known that when it binds GDP, it basically doesn't bind to downstream effectors, and when it binds GTP, it does. These are pictures from our protein interface prediction method, showing that the GDP-bound surface binds known interactors and that the GTP-bound form binds other proteins in a different region. So we're looking closely at the difference between the GDP- and GTP-bound forms of KRAS. We've used PrePPI to find partners of KRAS. There are already many pathways associated with KRAS, and in most of those pathways we've identified the proteins that are known to interact. But now we also have a bunch of proteins that are not known to interact, and we're testing those experimentally. So our goal, because people haven't succeeded in drugging KRAS and inhibiting its function, is to find new pathways and hopefully be able to attack its effects that way. This is a major focus in my lab now: using PrePPI to find new pathways of known oncogenes. And finally, another area we're working on: many of these interactions take place on membrane surfaces. KRAS itself has membrane-binding modules. So many of the interactions take place in the 2D environment of a membrane. This is an example of a KRAS pathway where it activates PI3 kinase, which phosphorylates PIP2 to PIP3, which then attracts other proteins to a membrane surface. And there's a whole cycle of interactions here. Most of them are known, but we were able to add a PrePPI prediction that these two proteins interact on a membrane surface. And again, that's being tested experimentally.
So we think that, again, by looking in detail at particular pathways, motivated by the hypotheses PrePPI has generated, we'll be able to look deeply into specific biochemical problems, as opposed to the more general lists of hypotheses that we provide. So that's it. I just want to mention the people who were involved in this work. Donald Petrey is a senior scientist in my lab who is involved in everything we do. Cliff Zhang wrote the first version of PrePPI; he's now back in China. Nacho Garzon and Howook Hwang did most of the programming on both PrePPI and PredUs. Diana Murray is a senior scientist in my lab who motivates much of this research. And Andrea Califano is my colleague in the Department of Systems Biology at Columbia. Andrea is a card-carrying systems biologist who works on cancer pathways, using, say, co-expression information. He'll look at tumors and see which proteins are over- or under-expressed relative to normal cells, and from that information he uses reverse engineering to generate biochemical pathways, signaling pathways. So we are working with him to introduce structure into that line of research. And that's where we're going with this. So I think the message I'm trying to give is that we're at the very earliest stages of introducing structure into this area of systems biology, but I think it adds a perspective that really hasn't been there before. And with that, I'll stop and thank you for your attention. Can this method detect complexes that only form after a subtle structural rearrangement, and is a lot lost by ignoring that? So in principle, yes. But this is what I was saying at the beginning. If you ask me about alternate conformations of proteins: that's information we don't use. Some proteins undergo conformational changes; I know that. But we're not dealing with that. We're dealing with things at a much cruder level.
Even if a protein undergoes a conformational change, if I superimpose, take the RAS case: there's a GDP structure and a GTP structure, they're different, some loops move. I can use either structure in a crude way to predict whether it binds to another protein, because most of the secondary structure elements are in place. So it's the point I was making: in this sort of problem, we can't worry about detailed structural information, because we're trying to do things on a genomic scale. Once we have a hypothesis, then we can look more closely: run molecular dynamics, do docking, whatever you want. The idea is first to find new hypotheses about which proteins interact, and then to worry about what you're worrying about. But if we worried about it at the beginning, we wouldn't be able to do anything. That's the issue we're dealing with. Yesterday you presented an interaction that was almost undetectable, but what was actually more important was the context of other interactions. I don't know whether binding stability is really part of the PrePPI algorithm; it seems to be a structural component. So it's possible that within your million predicted interactions, there are some that are functionally important but undetectable, maybe not. I'm sure that, in fact, most of our million interactions may not even be physical; they may be just functional. We detect the physical ones by the structural score. But to make the point, there are two interactions I talked about yesterday, a trans interaction and a cis interaction. We detect the trans interaction with PrePPI. We don't detect the weak cis interaction with PrePPI. But remember, that one is really weak, and that's in solution. Yeah, but if the structure is conserved in these cases, so the functional site has been preserved but the interaction is very much on the weak side of the spectrum, you might actually potentially capture it, but I guess not.
I mean, we do capture weak interactions in some cases. The stability, again, would be too much; we don't have that information for tens of millions of pairs. We just don't have it. So this is all based on using structural similarity and saying we can carry that up to a point to generate hypotheses. I mean, it's annoying. All my life I've worked on, as I said, every angstrom; I cared about every angstrom and every kilocalorie. And now, in this research, I don't. I can't. Yeah. So we didn't get a chance to talk about it, but we've started rewriting PrePPI for viral-host interactions. And yes, right now it's all structure-based, so we don't think we'll do as well, but we're hoping to see. If you take, say, Zika virus, and I know where the money is, so Zika virus, we're going to look at what predictions we make for chicken and what predictions we make for human, to see what's different. But I don't expect it to be as effective, just for the reason you mentioned. Interactions on membranes? Yes, yes. That slide I had was too complicated, but yes, that's exactly the goal, to extend it to that as well. First, I'll take the host's privilege; it's a little bit related to the question that Alessandro asked. Some years ago I had a postdoc, and we did a binding-site model for protein-protein interactions, kind of the same functionality as PredUs and PrePPI. And we ran into a problem, which is that we could predict the binding sites correctly, but then we'd get a lot of false positives, and they weren't scattered uniformly; they'd be in a patch on the surface of the protein. We figured this was probably a binding site for another interaction partner. So how do you even know what your true negatives are, when you can't test these things? So it's a great question. All the questions are great questions, actually; I've never said to anyone, that's a really bad question, but this one is really good. So this came up in PredUs, too.
And what we've done is... so we have a data set which people use to test their predictions, and those are known, strong protein-protein interactions; you probably use the same benchmark set. But what we've now done is take our predictions and separate them into patches, so we have our most strongly predicted patch, second, third, fourth. And we're actually thinking that the weaker patches might have something to do with protein solubility. Like, for example, drug companies are interested in antibodies that behave properly, that you could use as drugs. Sometimes those precipitate out of solution, and we're looking at data now to see whether auxiliary patches might be involved in that, or perhaps in another protein-protein interaction. So what we are classifying is the most likely patch, second most likely patch, et cetera. That's how that's going. The benchmark set that you use for training: these are mostly, I would imagine, high-affinity interacting pairs. And if you look at these interactome maps, you usually see these hub nodes. I'm imagining the benchmark set might be dominated by these strong protein-protein interactions. So if you wanted to look for, say, weaker transient interactions, and you kicked some of these proteins out of the test set, I'm wondering whether your likelihood ratios might change, and maybe a new feature set might emerge that captures these kinds of interactions? I think, as we... the trouble is we'd need a data set of transient interactions to train on. So I think you're right that there is a bias towards stronger interactions, and again, I don't quite know what to do about that unless I have a data set on which I can retrain our parameters for the weak interactions.
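The patch classification described above can be sketched as connected-component clustering over predicted interface residues. The data representation (residue scores plus spatial-contact pairs) and the sum-of-scores ranking are my assumptions for illustration; the real method presumably works from the structure itself.

```python
def rank_patches(residue_scores, contacts):
    """Group predicted interface residues into contiguous surface patches
    (connected components of the residue contact graph), ranked by summed
    prediction score, strongest patch first.

    residue_scores: dict residue_id -> interface prediction score
    contacts:       iterable of (res_a, res_b) spatial-neighbor pairs
    """
    adj = {r: set() for r in residue_scores}
    for a, b in contacts:
        if a in adj and b in adj:  # ignore contacts to unpredicted residues
            adj[a].add(b)
            adj[b].add(a)
    patches, seen = [], set()
    for r in residue_scores:
        if r in seen:
            continue
        comp, stack = set(), [r]  # flood-fill one patch
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.add(cur)
            stack.extend(adj[cur] - seen)
        patches.append(comp)
    return sorted(patches, key=lambda c: sum(residue_scores[x] for x in c),
                  reverse=True)
```

The top-ranked patch would be the primary predicted interface; the lower-ranked, auxiliary patches are the candidates for secondary binding sites or solubility-related surfaces discussed above.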
I expect that weak interactions are just like strong interactions, only less so, and that the same parameters might be used for both, but I don't know. So the question is PrePPI for protein-DNA. Yes, that's something we actually want to do. It won't get at the problems of specificity that you and I worked on, but I think it can be used to identify protein-DNA and protein-RNA binding regions on protein surfaces. You start with complexes, superimpose the proteins, see where the DNA is, and perhaps use that together with electrostatic properties to see whether a surface is likely to bind a nucleic acid. So yes, that's something we're beginning to do. We're doing PrePPI for drugs; we're doing PrePPI for DNA, for RNA, and for membrane surfaces, because many proteins interact with membrane surfaces, sometimes specifically with particular phospholipids. So we're even making a PrePPI phospholipid version. We're doing all that stuff. Well, let's give Barry a hearty round of thanks. Thank you.