Well, thank you for coming back for part two, and yeah, it's been an exciting day. I just got the news this morning, so it's very nice to be able to be here and share it with a lot of friends. So yesterday I was telling you about evolution as a design algorithm, and I thought I would continue the story by telling you how everything I told you yesterday is slow and stupid, although I didn't mention that. And I'll tell you about another process that nature uses for exploring novel biological functions, and that's recombination.
Before I do that, though, I'll reintroduce the same character, which is my favorite model system. It turns out to be a very practically useful model system in addition to being a fascinating protein, and that's my friend the cytochrome P450 enzyme, which I like because it can do extremely difficult chemistry: activation of unactivated C-H bonds. Oxygen insertion is one of the reactions, so it's a monooxygenase. It does it at the cost of NADPH cofactor, and it does it very, very well under mild conditions. It's also extremely versatile because it can do a number of other oxidation reactions as well, so there's a long list, and my vision is that we could actually generate a general C-H activation platform just from this one great discovery that nature made.
The mechanism I'm not going to go into; this is not an enzymology seminar but rather one on biological design. But I'll point out that it's a pretty amazing little molecular machine. It has two substrates. One is your hydrocarbon substrate, the one that gets oxidized, and when that binds, it enables the first electron transfer for reduction of the iron. The second substrate, which is molecular oxygen, can then come in, and there is a second electron transfer, so we've got two electrons being delivered one at a time in this catalytic cycle. A highly reactive oxygen species is generated, and then, once the product is released, the enzyme can go through the catalytic cycle again.
And I'll just point out that there's a little shunt in the reaction, which will be important for today's lecture, in which all of these finely tuned electron transfer steps (and I had to show the structure, it's just the heme structure) can be bypassed in the presence of a peroxide. Hydrogen peroxide, for example, can deliver the oxygen that's necessary for oxidation as well, so you can actually go through the catalytic cycle via this hydrogen peroxide shunt pathway. That had been noted many, many years ago, and of course the enzyme doesn't particularly like that pathway, it usually shuts down after a couple of turnovers, but it certainly had been noted.
And I'll also point out that it is a wonderful little machine. A lot of the work is done by a prosthetic group, the porphyrin, and so a number of people think of the P450 as being all this protein goo surrounding the porphyrin: the work is done by the heme, and the protein serves the role of providing an active site, a substrate binding site that allows the substrate to come in. Obviously that's a gross and unfair simplification, because we realize that this finely tuned electron transfer has to take place. And also, as I showed you in the movie yesterday, the active site is opened up when the F and G helices move apart, and when they close down over the substrate, water is excluded from the active site. So it's unfair to think of this as just protein goo that provides a shape. Biomimetic catalyst designers would call it protein goo, and biochemists would appreciate it for the complex machinery that it is. And let me just say that making biomimetics from hemes has been singularly unsuccessful in providing the same kind of chemistry that the P450 can do, so it's really a wonderful object that nature has made.
In fact, the heme domain responsible for catalysis, where the work goes on in substrate oxidation, is just this little green section, and there's a large part of the machinery in the reductase, a highly interactive complex that is responsible for the catalysis. As I mentioned, a fascinating thing about the P450 is that it comprises a very large family of enzymes that can be found in a wide range of organisms, from humans to microbes. If you look at all their capabilities over all of evolution, at least those capabilities that existed in 2008, or probably a few years before, when those things were submitted to the databases, we see that it's a protein with the ability to accept a very wide range of substrates and do this versatile chemistry on them. Humans, it may interest you to know, have more than 50 cytochrome P450s, where they are responsible for first-line defense in xenobiotic breakdown, so we find them in drug metabolism, but we also find them in biosynthetic roles, in steroid synthesis and other biosynthetic pathways, and in a number of other places as well.
What I'm going to point out about the natural evolution of these P450s, that's important for this lecture, is this. Yesterday I pointed out that somebody discovered one a long time ago, and over the process of Darwinian evolution, that is, the creation of mutations and their sorting out through natural selection, this family of proteins that exists today was generated. And I pointed out last time that there can be a very large number of amino acids changing during this process, such that two P450s that exist today can differ in more than 85% of their sequence. This is out of 450 amino acids, more than 350 changes, so there's a lot of mutation that goes on. Yet the three-dimensional structures don't change very much. Biochemistry students learn this early on.
If you look at the families of proteins that are related by evolution, we see that the sequence can vary a great deal. The function also varies, but maybe not as much. All of these are oxidative enzymes, monooxygenases, along with the other chemistry that they can do. But what doesn't change rapidly over evolution is the fold, so in fact the retention of fold is often a very good indicator of evolutionary relatedness. This is important for today's lecture because I'm going to remind you of how evolution can be used as a forward engineering algorithm to make new proteins in the laboratory, and then we'll talk about how this sequence divergence and retention of fold can also be used.
So yesterday I mentioned that we can make new versions in the laboratory by essentially mimicking this Darwinian evolution, but now applying a breeding pressure. I decide who lives and who dies, I decide which ones go on to the next generation, and that's easy to do because I feel little compunction about flushing my bacteria down the drain. I can take the gene that encodes a protein that exists today, rapidly make many versions of it and express those in cells, select or screen for the ones I want, and when I find improvements I can repeat the process, thereby creating evolved proteins with new properties. For example, I can make them more stable or more active on the natural substrates, and look for the activities that I know could exist in the natural world and recreate them in the laboratory. And so the good news is that this directed evolution works. I hope that I convinced you of that with yesterday's talk, but perhaps I also frightened you, because on the order of one new graduate student is required for evolution on one new substrate. It's a labor-intensive process.
I think I described this yesterday: the accumulation of 23 mutations to make a propane monooxygenase. And if you bothered to look at the dates on some of those publications, this has been an almost ten-year project. So if I want to explore the entire functional diversity of the P450 family (as of last year I think there were only 4,000 sequences in the database, maybe 4,000 substrates), it doesn't scale particularly well. In my middle age I'm getting greedy. I would actually like to be able to make P450s that could perhaps cover the entire range of the natural activities of these enzymes, so I don't really want to follow this experiment 4,000 times, and I doubt that the 4,000th graduate student would be terribly thrilled about signing up for my group.
All right, so given that, I'd like to talk about a second, very important diversity-generating mechanism in natural evolution, and that's recombination, basically molecular sex. We can take the products of evolution that exist today, the genes that encode these evolutionarily related proteins. They all share pretty much the same three-dimensional structure, but they're very, very different in sequence. And what we've learned over the years from a very large number of laboratories and biochemical experiments is that if you make chimeras of proteins with crossovers between the genes, say we take the green segment from the green protein and the blue segment from the blue protein and retain the order of the amino acids, many of these chimeric proteins are functional. You can retain the fold and function of proteins if you choose these crossover sites carefully. So often they retain the structure, yet they diversify the sequence very significantly. Chimeragenesis has been used, for example, to ask questions about which parts of the protein are conferring which function.
You can use this in the laboratory, though, as a diversity-generating mechanism by making large families of these chimeric proteins all at once, and that will be the subject of today's lecture. In fact, it's interesting to ask this question: just how conservative would the mutations be? The idea is that you can make chimeras because the mutations that have accumulated in natural proteins (the work of natural evolution is embedded in those sequences) are compatible with the three-dimensional structures of those proteins. Therefore, if you mix mutations from different natural sources, they will be more conservative than mutations you make at random.
We can actually ask this in the laboratory now, because there are so many great methods available. We can ask this question in a high-throughput fashion and say, just how conservative are these mutations? For example, with 13 crossovers between two genes we can make some 16,000 sequences, and if you choose the right genes, such as a beta-lactamase, which confers antibiotic resistance, you can rapidly screen to identify which sequences retain function for that activity. Now, when this work was done, actually five years ago, it was a little bit of work to assemble these. You can buy the oligos from your favorite supplier and assemble them combinatorially to make all 16,384 sequences, and that took about four weeks. And of course, the technology has changed so much that you can now just go and have these genes synthesized for a not unreasonable price by companies. It's amazing how things have changed even over five years. But Michelle took her oligos, synthesized genes, made all those possible sequences, put them back into bacteria and asked, which of these encodes a functional beta-lactamase? And the first thing that she found is that, in fact, recombination is much more conservative than random mutation.
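The library size quoted here is simple counting, and a short sketch makes it concrete. With c crossovers there are c + 1 blocks, and if each block can come from any of p parents there are p^(c+1) possible chimeras. The function names below are made up for illustration; they are not from any published tool.

```python
from itertools import product


def library_size(n_parents: int, n_crossovers: int) -> int:
    """Number of chimeras when every block can come from any parent."""
    n_blocks = n_crossovers + 1  # c crossovers cut the gene into c + 1 blocks
    return n_parents ** n_blocks


def enumerate_chimeras(n_parents: int, n_crossovers: int):
    """Yield each chimera as a tuple of parent indices, one index per block."""
    return product(range(n_parents), repeat=n_crossovers + 1)


if __name__ == "__main__":
    # Two parents, 13 crossovers: the beta-lactamase library in the talk.
    print(library_size(2, 13))  # 16384
    # Three parents, 7 crossovers (8 blocks).
    print(library_size(3, 7))   # 6561
```

Each tuple from `enumerate_chimeras` says which parent supplies each block, so `(0, 1, 1)` would mean block 1 from the first parent and blocks 2 and 3 from the second.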
Yesterday I showed you that as you accumulate mutations randomly in a protein, the fraction that retains function drops exponentially with the number of mutations. We see that kind of behavior. With recombination, however, we can go up to very large numbers of substitutions. And because both of the parents are functional, we know that this curve has to go down and then back up again. So we can actually quantify, using the beta-lactamase experiment, where this curve lies. These are the actual data from Michelle's experiment to calculate the effects of recombination. And I'll just tell you, as you go out to 60 to 90 amino acid substitutions, where this exponential decline takes you down to a very low number, we find many orders of magnitude improvement in the probability of retaining function. This is a highly conservative operation.
So if we look at the fitness landscape on average, here we're plotting the probability of folding as we accumulate random mutations. This is no longer on a logarithmic scale; it's just the exponential decline with the number of mutations, which drops down and essentially goes to very low numbers. With chimeragenesis, you have this ridge, on average, in the fitness landscape. So it's a very special part of sequence space that's rich in folding and function. However, suppose one is actually doing the experiment in the laboratory and trying to accumulate as many mutations as possible (we'll talk about whether that's interesting or not). If we want to create chimeras that are out here at 100 amino acid substitutions, the probability still drops down below what I'd want to work with. Because if you imagine the way that I do these experiments (I don't actually do them, the students do these experiments), we have to screen through each individual mutant one at a time and determine which ones fold and then, from those, which ones still function.
So a probability that drops to one in a thousand is not very much fun, because for every thousand you might find one that still folds, and of the folded ones maybe one in a thousand has an interesting function, so that's really important to me. What I'd really like to do is create large numbers of proteins that have large numbers of amino acid substitutions, and I'm going to do that by recombination. But I'm going to do everything I possibly can to bring this probability up much closer to one. And here's how I'm going to do that. I'm going to contaminate my beautiful experiment with some thought. I'm going to use the same kind of plot that we used yesterday to argue that you really want to keep the mutation level low, but now I'm going to make use of the actual structure.
These are the kinds of fundamental questions you can ask with these experiments. When is recombination not disruptive to the three-dimensional structure? That's a prerequisite for folding and function. You want to keep the structure but accumulate a large number of mutations. If we accumulate a large number of mutations, what kind of functional diversity do we find? In particular, if you take evolutionarily related sequences, most of the mutations happen pretty much on the surface of the protein until you get up to very high levels of mutation. So in the sorts of experiments I'm going to talk about, at about 50% identity, say, those mutations are happening largely outside the active site of the enzyme. So now what you're doing is accumulating lots of mutations and shuffling those around by recombination, but they're mostly outside the active site. So it's an interesting question: what kind of functional diversity can you get in an operation like that? And finally, if you're creating thousands of proteins that fold and function and have hundreds of amino acid substitutions, what do you have? You have a synthetic family of proteins.
So if you've got the synthetic family of proteins, and they're not just sequences in a database, they're actually things you have in your hot little hands, what can you do with that? Can we discern, for example, sequence-function relationships from those proteins?
So let's take the first question. The first question is, when is recombination not disruptive to the structure? And for this I can go to work that was stolen from biochemistry, actually from genetics, and taken into computer science as genetic algorithms, and take it right back into biochemistry. In computer science it was recognized that if recombination is used as an operation in an optimization, or in evolving software, then recombination gives you solutions in which things that interact favorably with each other in the solution are retained. So favorable interactions are not broken by recombination when it's effective. There will be some partition of your object into pieces that breaks the fewest interactions. If you have the wrong partition, you're not going to be able to build up the object from that. And so if we think of a three-dimensional protein structure (and these are not domains or anything like that, this is the three-dimensional structure of the cytochrome P450), what I'm proposing is that there will be some decomposition of this protein into elements that are swappable, which you can take from different colored parents, for example, and rebuild this P450 and still retain its three-dimensional structure.
Why don't we look at this. It was a nice experiment that was done fairly recently by LeMaster and Hernández, who took two related proteins, from a thermophilic source and a mesophilic source, so this one's less stable than that one. They recognized that these two proteins, even though they're single-domain proteins, small compact elements, shared an interface: both of the proteins shared the amino acids across this interface. When you swap these pieces of the structure, so if you swap red into the blue protein or blue into the red protein, this interface does not get disrupted, and therefore that swap is perfectly compatible with the three-dimensional structure. They showed that not only is it compatible, so you can make these kinds of chimeric proteins, but the effect of these swaps is perfectly additive. If you measure the delta-delta G for this mutation and the delta-delta G for that mutation, they're exactly the same as on the other side of the thermodynamic cycle. So these effects are additive even though they're happening over many mutations.
This is a very nice demonstration of what we had been able to show several years before with the SCHEMA recombination algorithm, the one that Chris Voigt did when he was a joint student with me and Steve Mayo. The basic idea was to identify this partitioning of the protein where we're going to recombine elements, and with as many parents as you like, it can be 3, it can be 20, it doesn't really matter, but we're going to do it in such a way that we choose the crossovers to break the fewest number of interactions upon recombination. So the question is, what are those elements that break the fewest number of interactions? And we're going to define the interactions in a very, very simple way. This is where the contamination of the experiment with some thought comes in, only the thought is very simplistic. Yesterday I said details matter, they matter a lot, but I don't understand the details, so I'm going to describe the protein as something very
simple, and that is interactions from a contact matrix and then the sequence identity among the parents. So the C function is just the contacts: amino acids that sit close to one another, and we'll just give it a cutoff. You can draw this protein like this, as a one-dimensional chain where the interactions are these connectors between the residues, so there are lots of long-range interactions in this protein, lots of short-range interactions, and then some in between. You can also draw the same thing in the form of a contact matrix, so there's a one if there's an interaction, otherwise it's zero. Then the delta function tells you what sequence identity the parents share, and that's really critical for this, because remember that we said a conserved interface is what makes it work when there are lots of interactions between those two pieces of the protein.
So if we think of a crossover between the sequence of parent one and parent two to make this particular chimera, this interaction in the chimera, M to A, existed in one of the parents, so that interaction is not broken, and you get a delta function of zero there. Whereas this particular contact in the three-dimensional structure did not exist in any of the parents and therefore is a new contact, or represents one that's broken. When you sum these up over all possible contacts, we can score each chimera by the E value, and then the design becomes a dual optimization where you're trying to minimize disruption for a certain level of mutation. This is a dual optimization problem, and actually we can get the globally optimal solution by dynamic programming, which Jeffrey Endelman did a few years ago. As I said, the details matter, and we don't know anything about the details, so we don't know much about what mutation level is required to obtain new functions, so I'm going to try to maximize that just for fun. This is a very simple model of scoring broken
interactions by one dollar, so it's going to cost you $1 every time you break an interaction, and I'm not going to give you any credit for making new interactions, because I don't know what those provide. So does that tell me anything about who's likely to fold and who's not?
Let's go back to the beta-lactamases, because that's such an easy system. Recombine three that share between 37 and 42% sequence identity, so these are not terribly identical, there are a lot of mutations there. Make all the possible sequences, which for three parents and eight segments, or seven crossovers, is more than 6,500 chimeras. And this is a historical slide. For technical reasons we went to a new method for constructing the library that same year, and this was a lot of work to make exactly what you can order on the internet now as synthetic genes. We don't make these using complicated methods anymore; we just get synthetic genes for these chimeras. So you can make all 6,561 sequences, and you can sequence them very rapidly because there are only 3 possibilities at each of the 8 blocks: it's going to come from either the blue parent or the red parent or the green parent. So if you have a sequence probe for each parent, then if a block comes from the green parent it's going to light up for the green probe, and you can actually do this in multi-well plates and read it out that way. When you do that, you find that somebody made a boo-boo in the library construction and left out red at position number 4, but otherwise it was okay. So we know what the unselected library looks like, and we know that we can in principle be looking through all the possibilities for swapping fragments from each of the parents.
So here are some of the ways of making a beta-lactamase by recombination from this library. First of all, looking at 553 sequences, we found 111 that folded. That means 20%, so not 1 in 1,000; now 2 in 10 will fold and function, at this very high level of mutation, up to
86 amino acid substitutions. I don't know too many people who can accumulate 86 amino acid substitutions in hundreds and hundreds of proteins at the same time. There are a lot of beta-lactamases here that have pretty high levels of mutation; these are the most mutated ones. There are lots of crossovers, so you can find crossovers in multiple locations that lead to functional proteins, and in this set on average there's 4 to 4 amino acid substitutions with respect to the closest parent.
So is there any correlation between this scoring function that I described, which is an incredibly simple one, this E value, this disruption value, and whether the enzyme folds and functions or not? The way we look at this is, we know that mutation is disruptive, so of course things that have low levels of mutation are probably going to have a high level of function, so we expect to see active enzymes down here. But out here at high mutation levels, it's very clear that the ones that have low disruption, that make mutations that do not disrupt the interactions, are the ones that tend to be active. So we know that this scoring function gives us some information for recombination.
If that's the case, we'll go and make P450s that way. So let's take P450s that are fairly closely related to one another. We can take the Bacillus megaterium one that I introduced to you yesterday and its close relatives from Bacillus subtilis, CYP102A2 and CYP102A3. These all share about 60% sequence identity and therefore differ at about 150 amino acids, because this is a much longer protein. These are the blocks if we decide to cut it into 8 segments. This corresponds to 8 blocks that are color-coded here, and I'll show you those in a little more detail in a minute. Once again this gives us 6,561 possible sequences. This corresponds structurally to sequence elements that are more complicated than secondary structures; for example, we're not just swapping a helix in and out. By forcing
crossovers at 7 positions, and also saying you can't have very small segments of structure, the algorithm chooses the crossovers to occur at these points. And you'll see interesting things, like a crossover being placed in the midst of an alpha helix. So it's not even choosing the end of the alpha helix; it's actually saying you get the least disruption from putting it in the middle. You get long meandering sequences that go through there, that probably don't care where you're really cutting, and you get more complex combinations of beta sheets and alpha-helical regions. This is the corresponding space-filling model, and you can see there's plenty of interaction surface in between. This is not just pearls on a string where you cut the string; there's lots of interaction between these elements, but these are the ones that have the most conserved interfaces.
It is interesting, however, to look at how the swappable elements identified using this algorithm compare to some domains in this protein that we know from things like molecular dynamics simulations. By looking, for example, at pieces of the protein that move in a correlated fashion, you can identify some domains in the cytochrome P450 that have these correlated motions. They're spread out along the primary sequence, such that we have orange interacting with orange over here, so recombination couldn't quite capture these boundaries. But if you look at where the SCHEMA recombination points come out to be, we have points that match these very well, as well as you possibly can, such that blocks 1 and 7 together make up this sub-domain, blocks 4 and 5 make up this sub-domain, and blocks 6 and 8 make up that one. So it's trying its darndest to capture those, even though there's no dynamic information put in there whatsoever; it's all structural. So there is clearly some attempt to identify structural domains.
All right, what do these proteins do?
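As an aside on the scoring used to pick these crossovers, the disruption count described earlier can be sketched in a few lines. This is a minimal illustration, not the published implementation: it assumes the parent sequences are already aligned to equal length and that the list of contacting residue pairs has been worked out from a structure beforehand. The function names, the toy six-residue sequences and the contact list are all invented for the example.

```python
def build_chimera(parents, block_starts, assignment):
    """Assemble a chimera from aligned parents.

    parents:      list of aligned, equal-length sequences (strings)
    block_starts: start index of each block; the first must be 0
    assignment:   which parent (index into `parents`) supplies each block
    """
    edges = list(block_starts) + [len(parents[0])]
    return "".join(
        parents[p][edges[k]:edges[k + 1]] for k, p in enumerate(assignment)
    )


def disruption(chimera, parents, contacts):
    """Count contacts whose residue pair is seen in no parent (the E value).

    Each contact (i, j) costs 1 if the pair (chimera[i], chimera[j]) does
    not occur at positions (i, j) in any parent, and 0 otherwise.
    """
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        if not any((p[i], p[j]) == pair for p in parents):
            broken += 1
    return broken


def mutation_level(chimera, parents):
    """Number of substitutions relative to the closest parent."""
    return min(sum(a != b for a, b in zip(chimera, p)) for p in parents)


if __name__ == "__main__":
    # Toy data only: two made-up aligned parents and four made-up contacts.
    parents = ["MKTAYL", "MRSGYV"]
    contacts = [(0, 5), (1, 4), (2, 3), (1, 2)]
    chimera = build_chimera(parents, block_starts=[0, 3], assignment=[0, 1])
    print(chimera)                                 # MKTGYV
    print(disruption(chimera, parents, contacts))  # 1
    print(mutation_level(chimera, parents))        # 2
```

In the toy case only the contact between positions 2 and 3 pairs residues that never sat together in a parent, so it is the single broken interaction. A parent scored against itself always gets zero, and conserved positions never contribute, which is why identity among the parents matters so much for where the crossovers can go.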
That's the fun part. With the P450s it's easy to assay for who folds. That's actually pretty hard to do for a lot of proteins. For this one we have a good folding assay, because it won't bind the heme and allow it to persist unless it's properly folded. And in fact it's a nice assay, because you'll only get the carbon monoxide difference spectrum at 450 nanometers if it is properly folded; if it's slightly misfolded you'll get a peak at 420, and you can distinguish those. So in a 384-well plate we can array out the clones for all the different variants and see very rapidly who folds and who doesn't, with a particular color.
So here are the results for the P450 family. We did a random sampling of 628, and we found that nearly half encode folded proteins. Our estimate is that this represents a 10,000-fold increase in the probability of folding by using the SCHEMA algorithm. The folded ones have on average 72 amino acid substitutions, so these are really significantly different. They may or may not be different in function, but at the sequence level they are significantly different. At 47% of 6,561 sequences, that means there are more than 3,000 folded P450s. When I first came out with this result, the entire database of P450s was only about 3,000, so this had doubled the number of functional, folded P450s, with a very wide range of mutation levels, up to 109 amino acid substitutions. They really are quite different from one another. And the difference between my P450s and the ones that were in the database in 2005 is that mine are all sitting in plates in my refrigerator and I can do things with them. I don't have to synthesize them, I don't have to clone them, they're already all done, and so I can start asking interesting questions about the ones that are folded. And unlike with the ones that you find in nature, I also have the ones that don't fold. You may think that those are not interesting, but if
you have done computational analyses, it's really nice to have real decoy sequences. These are the ones that have gone through the same operation but did not encode a folded P450. So if you think you're good at predicting what makes a protein fold or not, I will offer you my sequences to play with.
So let's ask a functional question, and this is a really interesting and simple question; I know that it interests some of you in the audience. Let's talk about how these different blocks contribute just to the stability of the protein, the conformational stability. A cytochrome P450 does not undergo reversible unfolding, so we're going to be looking at irreversible unfolding. We'll measure it, just for the purposes of this talk, by looking at the temperature at which half of it unfolds and is no longer binding carbon monoxide. So you can ramp up the temperature, see where the carbon monoxide peak is, come up with a nice curve like this, and then come up with the T50.
Here are T50 values for about 200 of the folded chimeras, and there are a couple of things to note from this. First of all, the parents distribute over this range from about 44 to 55 degrees centigrade, so they are not terribly stable proteins, but they are still easy to work with. And the chimeric proteins distribute such that most of the chimeras are within the parents. So half of the children are functional, and of the functional children most lie within the parents' capabilities. But there are outliers. There are some children that are less stable than the parents, and even more interesting, there's a nice tail out here where the children are better than the parents. So this recombination operation has the possibility of giving chimeric proteins that are more stable than the ones that you started with, in fact some of the most stable P450s that have been recorded.
Well, gosh, we've got 190 sequences, we've measured their stabilities, what can we do with this? How can we think about these? Why don't we
start with the simplest possible model. The simplest possible model for correlating stability with sequence is just to assume that the effects of the different blocks are additive. With three parents and eight blocks there are 24 possible sub-elements that are being swapped among all these proteins, and if we do a linear model, that means we've got at most 24 parameters. But as it happens we can use one parent as a reference, so we only have 17 parameters in here, a0 and 16 for these elements, where each fragment x_ij contributes a_ij in terms of stability. That means you can take a chimeric protein and just sum up the stability contribution of each fragment, if stability is additive in this system. So why not, let's give it a try.
We looked at the data for that and were stunned to see a really good correlation for this simple model. If you predict the T50 by fitting all of the, actually, 204 points to the measured T50, you do the model, you find a fit and then you predict, you find a really nice correlation. In fact the correlation is so good that when we looked at the worst outliers and went back and sequenced them, they had sequencing errors. All the other ones were perfect, all the blue ones, the sequences were perfect, but in the worst outliers we found quite a few sequencing errors from the probe hybridization, which has a 3% error rate. So that was pretty good.
All right, if it's a good model, then you can predict things that you did not know before. If you have a good linear fit for what the contributions of the different sequence elements are, you should be able to predict the stabilities of new proteins, of new chimeras that we had not previously tested. So in fact we went and took 20 new sequences, just pulled them out of the soup or constructed them anew, including the predicted most stable chimera, and found an equally good fit for those predicted stabilities. In fact the most stable
P450 has a T50 of 65 degrees centigrade. Wow, that's pretty remarkable. What that means then is, if your model is good, we should be able to take a small sampling of all 6,561, experimentally measure those, and predict the stabilities of all the proteins in that set. And the most interesting ones, of course, are the several hundred here that are more stable than the most stable parent. So if we're so smart, we should be able to measure this, predict the most stable ones, synthesize those, which we did, construct them, and test their stabilities. So we took the 44 that were predicted to be the most stable, I'm sorry, 44 of the most stable from the 300 that were predicted to be more stable than parent A1. So these are their sequences. You can see that there's consensus here that parent number 2 should be at position 1, and that there's a little less height at the other positions. And when we synthesized them, every single one of the 44 was more stable than the most stable parent. That was in September. So we now have 44 highly thermostabilized cytochrome P450s, with T50s ranging from about 55 or 56 to 65 degrees, that can be generated. So that just shows that the SCHEMA algorithm was able to capture the interfaces, capture the cut points that allow these pieces to be swapped, and not only that, but allow these pieces to contribute in an additive fashion. This is just a demonstration of the sequence diversity that's available. We can put this on a ternary diagram where the sequences of A1, A2, and A3 are the vertices of the diagram, and then the sequences of the chimeras are lying in between. And this is just to show that the sequences are different from one another in a significant way as well. They're not 44 of the same protein or 625 of the same protein; these are all significantly different from the parents and from each other. So who cares about stability? Stability's great, but it's only interesting in the context of what a protein does. Having a stable rock is
not as interesting. So what do these do in terms of their catalytic activities? Let me tell you just a little bit about that in the last few minutes. As I mentioned yesterday, BM3, in fact all of the parents for this experiment, have very nice properties in terms of their biotechnological properties: they're well characterized, they express well in E. coli, all the work is done in a single polypeptide. But they have this fatty acid hydroxylase activity that is of marginal interest to me. It's very fast, but it doesn't do anything particularly interesting. And so what I'd like to do is be able to screen them to find out whether these new recombined proteins, which have many amino acid substitutions, have acquired new catalytic activities. For this I'm going to make use of the peroxide shunt, because remember, I've only done this in the heme domain of the P450; I don't have reductases attached to these. So instead of using the monooxygenase activity, I'm going to feed them hydrogen peroxide and drive the reaction through the peroxide shunt. There's a particular mutation that had been reported to increase that activity many fold, and that mutation was made in all of the parents, so they all should, if they are active, have this peroxygenase activity. So how do you find out what the range of activities is? Well, you have to go and measure it. It's like a 64-question type of game: they're not going to tell you what they do unless you have a way of asking the right questions. The answer could be there in that library, even to the energy crisis, but if you haven't asked, it's not going to come out and tell you that. So that just means you have to knuckle under and develop high-throughput screens that give you a nice colorimetric output or a fluorometric output for these activities. And we have those; that's exactly what we've been doing for the last 10 years with random mutagenesis. So we have activities on fatty-acid-like substrates, and we measure it with 1920s German analytical
chemistry. We've got our demethylation assays, where you make formaldehyde and then you get a nice purple color, and we've got alkene epoxidation assays. So we just collected all the assays we developed over the years and started asking each of these chimeric proteins: you've got the heme in there, what can you catalyze the reaction on? And I'll summarize that by saying that most of the ones that fold also function. So that little bit about the protein glue around the heme is valid: if you've got a heme properly inserted into the protein, it will be active on something, to a first approximation. What exactly it's active on you've got to find out, but it will be active on something, which is a nice thing for the cytochrome P450s. So you have to knuckle under. We took 14 of the best-expressed chimeras as well as parents 1 and 2. The heme domain of parent 3 is not functional all by itself; it requires its reductase to be functional on any of these substrates. But we can measure the activities on the substrates. Now I'm not going to go through this data, because it's mind-blowingly boring to look at these kinds of plots, but I will tell you the conclusion that's important for this lecture. On the substrates that we studied, and none of these look much like fatty acids, there's at least one of these 14 chimeras that's more active than either of the parents. So just by checking 14 chimeras we found something for every substrate that's more active than parents 1 and 2. Furthermore, you can find substrates on which the parents don't have any activity. So for 2, 3, 7 there's very little activity or no measurable activity in the parents, and there are sometimes multiple cases where you can find activity. So you can find new activities from these chimeric enzymes. For example, if I introduce you to some of the chimeras: 21113312, which looks like this, is active on zoxazoline. That's a good drug, and it's a substrate for the human P450 6182. 6182
takes just about everything, and this chimera will make the identical metabolite made by the human enzyme; they make the same product. And if you just plot where the amino acid substitutions are, this one is 98 amino acids different from its closest parent, which is green, and the 98 amino acid substitutions, as I told you, are lying all on the surface of the protein. There's nothing near the active site, yet there is new activity. 21313313, which looks like this guy, demethylates a permethylated sugar. I introduced that reaction yesterday, and this is one of the culprits that does it. It differs from its closest parent at 85 amino acids, once again primarily on the surface of the protein, all over here, and this will take the one position and do a selective demethylation on that. And finally there's one more example, 21312332, which looks like this. This one is different from its parent at 84 amino acid substitutions, and yet this will take verapamil and make the authentic human metabolite. So it will demethylate, or it will dealkylate the other position here, to create these two products, and that's an identical reaction to the one made by the rat liver microsomes. So these are functional enzymes and they're actually pretty good. These are also part of the ones that have gone into the connexus sip sip. So let's finish with one last little functional thing. Remember, we're only looking at the heme domain. I've made these recombinants of the heme domain, yet we know that in the whole enzyme complex it's interacting with the reductase. So if I would like to reconstitute the monooxygenase, how can I do that? Which of the three possible reductases should I use, or do I have to recombine the reductases? So we went and made all the parents and 14 of the active chimeras, the ones that we studied before, and reconstituted them with all three reductase domains, and asked which of these will be functional. And what's interesting is that every single one of them functions as
holoenzymes. And why is that? I bet you can tell me why: all of the interfaces are conserved. All of the key interactions between the heme domain and the reductase domain are conserved among the parents, or are almost completely conserved. The blue ones are conserved in all three parents and the red ones are conserved in two out of the three parents. So these key interactions, across hydrogen bonds and, sorry, water interactions, are conserved among all three parents, so that when you swap them it doesn't really care. So here's something: these reductases differ at 50% of their sequence, yet they retain this ability to interact across the domains. Those interactions were retained during evolution and are therefore retained when you do the swapping.
So let me summarize some of the points I've tried to make. The first question was: when is recombination not disruptive to the structure? Well, that happens when blocks are defined by these maximally conserved interfaces, and the SCHEMA scoring function allows you to identify those computationally in a very rapid way, so that when you combine these blocks from different parents you can make proteins that are highly mutated yet still fold and function. So we can make 4,000 or 3,000 cytochrome P450s in a single experiment. What functional diversity is accessible? Well, I've just shown you that in a very superficial way, but I hope that I've told you that there is functional diversity there. They exhibit features of stability that are outside the range of the parents, and catalytic activities that are both better than the parents and allow you to have catalytic activity on substrates that are not accepted by the parents. And notably, these bacterial P450s can acquire activities of the human enzymes. And finally, the last question was: can we discern sequence-function relationships with this synthetic protein family made in the laboratory? I hope I've shown you that. I did not discuss this here, but it's been
published: we can actually find specificity-determining features of one of the parents that were not known before. What I did show you was that the stability contributions are additive in this system, and therefore you can predict the stability of unknown sequences with a high level of accuracy using, as we're proving now, a very small number of sequences. This means that even in the absence of a high-throughput screen, if you can look at even 30 sequences, then you can make a large set of proteins and accurately predict highly stable sequences from that. And I didn't have a chance to tell you about sequence analysis, but I know that some people here are interested in that. These consensus energies that you can calculate from sequence analysis can also be calculated in these non-natural protein families, and this allows you to predict stability just from folding data alone. You don't even have to measure the stability; you can accurately predict stability just from the sequences of folded proteins. If people are interested in that, they can go to the paper that was published in September. So with that I want to quit. I'll thank the people who did all this work, and I'll thank you for your attention.
I was hoping that you could elaborate on the SCHEMA kind of stuff in the active site. On one side, that's where you wouldn't want to disrupt, so the active site must be very important in SCHEMA. But the other side would be that it's probably highly conserved, so that you actually could swap in the active site. Could you clarify that for me? And were there any crossovers in the active site area? One might argue that that would maybe give you more functional diversity.
The only way to get mutations by recombination is if the mutations exist in the parents. And so if you take proteins that are 50% identical, that's a very high level of identity, and very rarely will there be mutations in the active site. So since they don't exist in the parents, they're not going to exist in
the recombinants. They're homologous and there are high levels of identity, so 50%. Now that's high from an evolutionary point of view, but it's very low from a protein engineer's point of view, because you still get very high levels of mutation; you just don't get them in the active site. So this experiment shows that mutations outside the active site can confer these new substrate specificities. Whether that's a special feature of P450 proteins I don't know yet.
In some cases there are homologous proteins that are functionally different. So then one wonders what would happen: is that a way to get a lot more functional diversity, even at some cost to this sort of process?
That's an interesting question. So as you go to more sequence divergence, you can actually find things that have now diverged in function, or maybe converged, so they're not homologous. Right now we're still limited to about 30%; 30% identity is the lowest that we've gone. There are a couple of bold people in the group who'd like to test what you're talking about, with proteins that are not homologous. But as you know, the mutation level still matters, and as you go to 20% and 15% identity the structural divergence has now gotten so large, it goes up exponentially with sequence divergence, that there's a lot of disruption.
I wonder if you could take other fragments and then just focus on the fragment or domain that's important, or the subset of domains that are important for the active site, and keep the rest the same.
Yes, so that's another way, but then you don't get very many proteins that way. So if you just choose one fragment, yes, you can swap in a structurally homologous fragment from another protein, but you just make one protein. I would make 4,000, so I'd have to find 4,000 of these fragments to put in. So I don't get the combinatorial effect.
I guess that's what I'm asking you about, the fragments.
You certainly can. You certainly can, now that we have gene synthesis. You could just as easily make that
library as synthesizing all of these genes, so it becomes more practical to do that. We wouldn't even have thought of doing that a few years ago, because you'd have to make each one of those one by one, whereas I could just make all 6,000 at once with the methods we were using before. But yes, we should go back and look at that. The mutation level is not going to be as high. We get hundreds of mutations because we're swapping lots of fragments at once; if you're just swapping in one fragment then you'll have a lower mutation level. But I think that's interesting.
Yes, the calculations, all the SCHEMA calculations, are available on my website. What is more difficult to do is the RASPP algorithm that does the calculation of the optimal crossovers, but anybody who's ever asked us to help with that, we're happy to do it. Actually, I think there may be RASPP Python code on the website now.
The substrates that you chose to look at: you chose a subset of compounds, and it seems like really anything is possible. Why those compounds, the ones for the 14?
You know why? Because it's really hard to measure whether a steroid has been hydroxylated, whereas if you hydroxylate an aromatic you can use these colorimetric reagents. If someone pays enough I'll go and look at a steroid, but that's why Phinexes has these proteins, because I really don't want to do that. That takes LC and GC and it's not fun.
I think the statistic was that the folded proteins are active, and you showed the functions of 14 of these. So do you measure the closest parent's functions as well?
Oh yeah. Most of the substrates the parents have very low function on, so we know the parent functions on those substrates, and usually we choose things that the parents are not active towards.
But then do they still have the parental activity for the other substrates?
Oh, for the other substrates. We haven't looked at the fatty acid activity for the chimeras. We did that for the PMO that I described yesterday, but we haven't done it for the chimeras.
It's really interesting. Have you thought about doing this with thermophilic proteins, and would you see the same increase?
So as you go to higher and higher stability, there are going to be fewer possible sequences that are more stable than those. It certainly would be interesting to find those few possible sequences, because I think the temperature stability of proteins is dictated not by the physical chemistry but by other biological requirements. I think it is possible to make even more stable proteins than we have now. But it's not going to be as easy to stabilize a thermophilic protein as it is to stabilize a mesophilic protein.
Sort of along those lines of increasing stabilization of the protein: if you run through multiple rounds, do you expect that you'll get another increase? Or, since the blocks that you're using are the same blocks that you used before, presumably that would go...
Yeah, you wouldn't do another round of recombination like this. What you really want to do is take those stable ones and find some other ways to increase their stability, through point mutation for example, or by making more crossovers. Another would be to say, instead of using seven crossovers, make 23 crossovers, but then the combinatorial explosion goes up, and finding the linear model, the linear fit, would take a lot more experimentation. So I think the better thing to do would be this: you've got a nice starting point of 44 new sequences, so find one that has an interesting activity and stabilize that.
You've got a bunch of mutations out there that were affecting the substrate specificity. Have you determined for any one of those whether or not it's a single domain swap that confers that additional activity, or is it the combination of those 75 or 80 mutations that gives you just the right fit to confer that?
So basically you're asking what's the minimal number of mutations required, and we don't have enough data to say. To do machine learning, you'd like to find a structure-activity relationship, but you actually need a lot of data for that.
I think it's extremely powerful, looking at this. You're easily going to mess up the active site with mutations, but once you get out to that secondary shell it seems extremely, extremely powerful.
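The additive stability model described in the talk, with three parents, eight blocks, one parent taken as the reference, and therefore 17 parameters (a0 plus 16 block contributions a_ij), can be sketched roughly as below. This is only an illustration under stated assumptions: the talk says "linear model" without naming a fitting method, so ordinary least squares via NumPy is assumed here; parent 1 is arbitrarily chosen as the reference; the function names are hypothetical; and the T50 values in the demo are synthetic, not the measured data.

```python
import itertools
import numpy as np

N_PARENTS = 3  # parents labelled 1, 2, 3
N_BLOCKS = 8   # eight blocks per chimera, e.g. "21113312"


def encode(chimera):
    """Encode a chimera as [1, x_ij ...] relative to reference parent 1.

    The vector has 1 + 8 * 2 = 17 entries: an intercept (a0) plus one
    indicator per (block, non-reference parent) pair.
    """
    x = np.zeros(1 + N_BLOCKS * (N_PARENTS - 1))
    x[0] = 1.0  # intercept: T50 of the all-reference chimera "11111111"
    for block, parent in enumerate(int(ch) for ch in chimera):
        if parent != 1:
            x[1 + block * (N_PARENTS - 1) + (parent - 2)] = 1.0
    return x


def fit_additive_model(chimeras, t50_values):
    """Least-squares fit of T50 = a0 + sum_ij a_ij * x_ij (OLS is an assumption)."""
    X = np.array([encode(c) for c in chimeras])
    y = np.asarray(t50_values, dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


def predict(coeffs, chimera):
    """Predicted T50: the sum of the contributions of the chimera's blocks."""
    return float(encode(chimera) @ coeffs)


def all_chimeras():
    """Every member of the 3^8 = 6561 library, as strings like '21113312'."""
    return ["".join(p) for p in itertools.product("123", repeat=N_BLOCKS)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    library = all_chimeras()

    # Synthetic "true" contributions and noisy "measurements" (NOT real data).
    true_coeffs = np.concatenate(([50.0], rng.normal(0.0, 2.0, 16)))
    picks = rng.choice(len(library), size=200, replace=False)
    sample = [library[i] for i in picks]
    measured = [predict(true_coeffs, c) + rng.normal(0.0, 1.0) for c in sample]

    # Fit on the sample, then rank the whole library by predicted T50.
    coeffs = fit_additive_model(sample, measured)
    ranked = sorted(library, key=lambda c: predict(coeffs, c), reverse=True)
    print("library size:", len(library))
    print("predicted most stable chimeras:", ranked[:5])
```

Fitting on a measured sample and then ranking all 6,561 sequences mirrors the workflow in the talk, where the 44 predicted most stable chimeras were then constructed and tested.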