Ideally, from a scientific point of view, what we want to do is understand the underlying physical principles that govern the folding, stability, and function of proteins. And we want to use a design-based approach to this as opposed to a traditional perturbation-based paradigm. Typically, if you're interested in understanding how an enzyme works, for example, you might make mutations or perturbations in the active site and then examine what that perturbation does. Using a design-based approach, what we're interested in is to build a model, a physical model, a computational model, of what the enzyme's supposed to do, how the protein's supposed to fold, and then do designs in the context of that model, make new proteins, measure them experimentally, and ask whether or not they work. If they work, then our model must be somewhat accurate. And if not, then we have to go back and refine the model. As engineers, at the end of the day, what we really want to do is make interesting stuff, including things like engineered antibodies, biosensors, enzymes, biofuels, et cetera.
So what I'd like to cover today is outlined here. I'm going to do a brief introduction to computational protein design, spend most of the time talking about computational enzyme design, some published and some unpublished work, and then, if there's time, talk about library design and the production of enhanced fluorescent proteins.
Okay, so just so we're clear on what we're doing and not doing, the difference between fold prediction and design: here I'm showing what fold prediction is. You know the sequence, and you want to make some sort of prediction or calculation about what the structure might be. We can think of this as a mapping from a point in sequence space to some single point in structure space. This is a very hard problem. There has been lots of progress recently, but it is still a hard problem going from sequence to structure.
What we're interested in is essentially the exact opposite problem, the inverse folding problem. That is, we want to start with a structure and some function, and then do a calculation to give an optimal amino acid sequence. So this is the inverse mapping: we're starting in structure space and mapping back to find some sequence, or some sequences, in sequence space, all of which will give rise to the target structure. And of course, this is an easier problem simply because the inverse mapping is highly degenerate. There are many, many sequences, all of which will give rise to the same target structure. So we can think of methods that may not be perfectly exact and that don't fully encompass all possible sequences. As long as we land in that sequence subspace, we're going to be successful. And of course, we're also interested in designing libraries of sequences that we can then use in experiments to do a screen or selection. There, rather than specifying a single sequence, we're going to specify a collection of sequences.
So why do this computationally? It's, I think, easy to demonstrate why a computational approach is, or can be, advantageous. Think about a single protein comprised of p residues, with 20 choices of amino acid per position. There are on the order of 20 to the p different sequence combinations. So even for small proteins or peptides, the number of sequences is astronomically large. If you think about an experiment, for example, for an 18-amino-acid peptide, there are about 10 to the 23 unique sequences. If we could synthesize one molecule of each of the 10 to the 23 possible sequences, it would be about the mass of a baseball, which is almost tractable. But by the time you get to even very small proteins, you're at the mass of the universe in terms of building a combinatorial library that expresses that combinatorial diversity. Computationally, however, we can address these numbers now quite easily.
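To make the counting argument concrete, here is a small arithmetic sketch. It assumes an average residue mass of about 110 Da, which is a common rule of thumb and not a number from the talk; the function names are hypothetical. Under that assumption, one molecule of every 18-mer comes out at somewhat under a kilogram, the same everyday scale as the baseball comparison.

```python
AMINO_ACIDS = 20
AVOGADRO = 6.022e23          # molecules per mole
AVG_RESIDUE_MASS_DA = 110.0  # assumed average residue mass (rule of thumb)


def sequence_count(length: int) -> int:
    """Number of distinct sequences of the given length: 20 ** length."""
    return AMINO_ACIDS ** length


def library_mass_grams(length: int) -> float:
    """Mass of a library holding exactly one molecule of every sequence."""
    moles_of_molecules = sequence_count(length) / AVOGADRO
    grams_per_mole = length * AVG_RESIDUE_MASS_DA
    return moles_of_molecules * grams_per_mole


if __name__ == "__main__":
    print(f"18-mer sequences: {sequence_count(18):.2e}")
    print(f"18-mer library mass: {library_mass_grams(18):.0f} g")
    print(f"100-mer sequences: {sequence_count(100):.2e}")
```

The exponential growth is the whole point: going from 18 to 100 residues raises the count from about 10 to the 23 to about 10 to the 130, far beyond anything that could be synthesized.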
So over the years, we've focused on a combination of methods and applications for computational design. Most computational design methods have these elements in common. They start with some model of a protein backbone, either derived from the protein crystal structure database or from some other means. We use rotamer libraries to capture the conformational flexibility of amino acid side chains. A rotamer is just an explicit conformation of an amino acid, and we build these libraries by looking at the structure database. Our model includes all-atom force fields that capture the physics of the problem. We use combinatorial optimization methods to solve the combinatorial problem. And for many types of design, negative design becomes important as well. That is, it's not always sufficient to do positive design, designing for what you want. Sometimes you also have to consider designing against the things you don't want, like aggregation, for example. So you can use these sorts of methods to ask questions about the relationship of sequence, function, and stability, think about how we should approach a certain function, and make molecules that could be interesting.
Historically, we showed back in 1997 that you could actually do this, that it would work. What's shown here is a result where we took a zinc finger fold, which is shown in red, and stripped off all the side chains. We threw out the requirement for zinc binding, and designed a novel sequence that folds to the target structure, with the backbone fold shown here in purple.
And then, to the extent that our force fields are somewhat accurate, and also that we're doing reasonable optimization, we should be able to design protein variants that are thermostabilized relative to their wild-type counterparts. We showed back in 1998 that you can take a mesophilic bacterial protein, run the design calculations, and generate variants that are hyperthermostabilized relative to the wild type, some of which have melting temperatures well above 120 degrees Celsius.
OK, so how do we do the calculations? Let me just give you a quick flavor for what the calculation entails. Here's our protein. We're going to do a calculation on two positions in the protein, P1 and P2. The first thing we have to do is specify which amino acids are allowed at each of the positions, and which rotamers of those amino acids we're going to include. In this case, we're allowing the same set of amino acids and rotamers for both positions. Alanine has a single rotamer; we're allowing three rotamers for another amino acid and, for this simple case, three rotamers for serine. Then we need to compute various types of energies. The standard methods for computational protein design rely on the fact that we have at most two-body energy terms, so we have a one-body term, a self-energy, and then two-body terms. For the one-body terms, we compute the interaction energy between some choice of amino acid and rotamer, placed on the protein, and the rest of the atoms in the protein, and we save that in a vector. We do this for all single choices, first at P1 and then at the second position, and we just record these energies.
The types of energy terms that we're evaluating include things like a van der Waals energy to capture sterics and, in most cases, a simple Coulombic energy for electrostatics. We can use more sophisticated electrostatic calculations like Poisson-Boltzmann, which is much more expensive and raises other problems you have to address if you want the energies to be pairwise decomposable. We have a hydrogen bond term and a solvation term that captures the hydrophobic effect and also the desire for polar amino acids to be surface exposed.
After we compute the one-body energies, we have the two-body term, where we have to compute all pairs. In this case, the backbone goes away, since we've already calculated the interactions there, and now we're just computing the interactions between pairs of rotamers at the positions in the calculation. We run through the entire matrix and store all of its energies. So at the end, we have a collection of two-body and one-body energies, and the optimization is a simple one: find the choice at P1 and P2 that minimizes the energy of the system. Now, of course, in real calculations, the number of choices that we would have at a position would be on the order of perhaps thousands of different rotamer combinations, to cope with the full conformational flexibility of all the amino acids. So the matrix sizes here would be many, many bytes of data, and we can use standard methods like Monte Carlo optimization to find solutions, or more sophisticated methods, which I'll mention very briefly.
In a simple Monte Carlo optimization, we select a random initial configuration, say the second serine rotamer for P1 and the first serine rotamer for P2. Here are the one-body energies, here's the two-body energy, and we sum them to get the energy. Then we make random moves, subject to the Metropolis criterion for acceptance. In this simple case, we can find the ground state solution, minus 9 kcal per mole, and we can go back and build a model with rotamer 3 of serine at P1 and rotamer 2 of serine at P2. OK, so it turns out that for most cases, simple Monte Carlo optimization is not sufficient, and we spent a lot of time, though I won't talk about any of this in detail, looking at other optimization methods.
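The two-position calculation just described can be sketched in a few lines. Everything numeric here is hypothetical: the energy values are made up (chosen only so that the toy ground state lands on the minus 9 kcal per mole, serine rotamer 3 at P1 and serine rotamer 2 at P2 mentioned in the talk), the choice set is cut down to alanine plus three serine rotamers, and the function names are illustrative, not the group's software.

```python
import math
import random
from itertools import product

# Allowed choices at both positions (reduced set for illustration).
CHOICES = ["Ala", "Ser_r1", "Ser_r2", "Ser_r3"]

# Hypothetical one-body (self) energies in kcal/mol: ONE_BODY[position][choice].
ONE_BODY = [
    [-1.0, -2.0, 0.5, -3.0],  # P1
    [-1.5, -1.0, -2.5, 1.0],  # P2
]

# Hypothetical two-body energies: TWO_BODY[choice at P1][choice at P2].
TWO_BODY = [
    [0.0, -0.5, -0.5, 0.0],
    [-0.5, 2.0, -1.0, 1.5],
    [0.0, -1.0, 3.0, 0.5],
    [-0.5, 0.5, -3.5, 4.0],
]


def total_energy(c1: int, c2: int) -> float:
    """Sum of the two one-body terms and the single two-body term."""
    return ONE_BODY[0][c1] + ONE_BODY[1][c2] + TWO_BODY[c1][c2]


def exhaustive_minimum():
    """Brute-force answer; only feasible because the toy problem is tiny."""
    n = len(CHOICES)
    return min(product(range(n), repeat=2), key=lambda s: total_energy(*s))


def monte_carlo(steps: int = 2000, kt: float = 1.0, seed: int = 0):
    """Metropolis Monte Carlo over rotamer choices; returns best state seen."""
    rng = random.Random(seed)
    n = len(CHOICES)
    state = [rng.randrange(n), rng.randrange(n)]  # random initial choice
    energy = total_energy(*state)
    best_state, best_energy = tuple(state), energy
    for _ in range(steps):
        trial = list(state)
        trial[rng.randrange(2)] = rng.randrange(n)  # change one position
        delta = total_energy(*trial) - energy
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / kt):
            state, energy = trial, energy + delta
            if energy < best_energy:
                best_state, best_energy = tuple(state), energy
    return best_state, total_energy(*best_state)


if __name__ == "__main__":
    (c1, c2), e = monte_carlo()
    print(CHOICES[c1], CHOICES[c2], e)
```

With only 16 combinations the brute-force search is trivial, which is why it can serve as a check here. In a real design, with thousands of rotamers at each of many positions, enumeration is impossible and the quality of the search method starts to matter.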
For many years, we focused on a method called dead-end elimination, which served us well, but more recently we've been using a method called FASTER, which stands for fast and accurate side-chain topology and energy refinement. It's not my name, it's the original authors'; it's a group in Belgium that came up with the basic algorithm, and then we adapted it for design purposes. In general, I would say that the computational component that is just doing the optimization for energy is a relatively solved problem using the current versions of the FASTER algorithm, as long as we limit ourselves to two-body potential functions.
OK, so I want to focus down on some applications, and in particular on enzyme design. There are two types of activities that I'll talk about. One is a simple ester hydrolysis reaction, and that's published work, so I'll go through it quickly. The other is the Kemp elimination, which has been the focus not only of my group but of other groups in the area over the last couple of years.
OK, so for ester hydrolysis, we selected a simple system, para-nitrophenyl acetate, shown here as the substrate, and decided on the mechanism. Of course, if you want to design an enzyme, you have to have the mechanism that you're trying to target. There are only two types of mechanisms for this hydrolysis that we can think about. One is a nucleophilic attack on this carbon, and the other is a water-mediated attack, where we pull a hydrogen off a water molecule and the hydroxide then attacks the substrate. We selected the nucleophilic attack. It goes through a tetrahedral intermediate, which decomposes to release the first product and, in this case, an acylated enzyme intermediate, which then goes through a subsequent hydrolysis to release the enzyme and the other product, the acid. So the way we model this in the calculation is to decide what we want for the nucleophile.
In this case, we select histidine, and then we just model some state, either a transition state or a tetrahedral intermediate on this reaction pathway, as a set of rotamers. So in this case, we're modeling a high-energy state, and we just model it as a set of expanded rotamers attached to histidine. Then we run the calculation. If we place this modified histidine in various spots in the protein, we can ask: what are the other mutations that are required to stabilize the interaction of this system? Those then make up the active site of the protein. We allow the common rotations for histidine, about chi 1 and chi 2, but of course we now have to add additional conformational flexibility to account for the presence of the substrate.
So we can do that, and we're able to find interesting hits in the scaffold of a protein called thioredoxin, which doesn't catalyze this chemistry. Here, in panel A, is where the histidine is. Here's where the substrate lies in the computational model from the design calculation. We also get, serendipitously, a lysine residue that sort of positions itself on top of the active site. And here's what the surface looks like before the calculation; this is the wild-type protein. And then the triple mutant, which comprises the key mutations that introduce catalytic activity. This leucine becomes the catalytic histidine, which is here, and these two aromatic amino acids are mutated to alanine to make room for substrate binding.
OK, so the design actually works. It catalyzes the chemistry. This is a Michaelis-Menten plot showing initial velocity as a function of substrate concentration, and you get classic behavior showing saturation at higher substrate concentrations. Despite the fact that it actually shows activity, the activity's not phenomenal. It's only about two orders of magnitude of rate acceleration, with a Km of roughly 110 micromolar. So the Km's not bad.
The rate acceleration is mediocre at best. But mechanistically, it actually works as we would expect. Here's some evidence to show that it works the way we want it to work. At some substrate concentration, this is just product release as a function of time. For our designed molecule, PZD2, we can see that we get the expected result: an initial burst phase followed by a steady-state phase. If we look at wild-type thioredoxin under the same conditions, it doesn't catalyze the reaction. If we take wild-type thioredoxin and only introduce the catalytic histidine, catalysis is weak to non-existent. If we take our designed protein and knock out the catalytic histidine, that also turns off activity. So we require both the catalytic histidine and the other active site mutations for full activity.
Because the reaction is actually quite slow, we can follow it by mass spec. The desired mechanism includes the acylated enzyme intermediate, and we can see evidence for that on this slide. Here's the mass spec of just the enzyme by itself. We get a peak for the parent protein. This is a MALDI experiment, and you often get a copper matrix adduct, and that's what this peak is. We now add substrate to the reaction and run it over to the mass spectrometer. We see the parent protein, but we also see a peak come in at plus 42, which is the mass of acetylation, as we would expect for the acylated enzyme intermediate. Here's the copper matrix adduct for the parent protein, and we also see the copper matrix adduct for the acylated enzyme. If we take the enzyme minus the catalytic histidine and add substrate, we don't get the acylated enzyme intermediate. We just get noise at the level of the background. So this, I think, demonstrates that we're operating by the desired mechanism, going through an acylated enzyme intermediate.
Furthermore, we can inhibit the enzyme using a substrate analog, shown here, and we get sort of classic behavior in the double reciprocal plot: when you get an intersection on the y-axis, it's an indication of competitive inhibition, which suggests that this analog binds competitively at the same active site. The Ki for the inhibition is weak, only 20 millimolar, but again, mechanistically, it does what we expected it to do. That was published a number of years ago. Compared to early catalytic antibodies, it's sort of the same order of magnitude in terms of rate acceleration and Km. So we were pleased by that, and we looked at a more complicated system to see if we could make more interesting enzymes using more complicated chemistry.
Before we did that, however, we wanted to do a modeling study just to see if we could recapitulate more complicated binding interactions in known systems. So we looked at three different proteins: chorismate mutase, streptavidin binding biotin, and triosephosphate isomerase. The idea here was that we would take an active site region, strip off all the side chains in the active site, and run our calculation to see if we could recover both the active site amino acid identities and the positioning of the substrate, which is known by crystallography. In all three of these cases, we're able to do a pretty good job of recapitulation. In the chorismate mutase case, for example, there's a lysine and there are two arginines that interact with the substrate, which here is actually a transition state analog, and we can recover not only the sequence positions of those amino acids but also their conformations to fairly high accuracy, as well as the positioning of the transition state analog. And we can do that for the other cases as well. So we were actually quite pleased with that, and we then set out to look at a more complicated reaction, although it's still ultimately a simple reaction.
In this case, it's the Kemp elimination, where we're looking at abstracting this hydrogen from the substrate. It goes through a fairly simple transition state, and it goes irreversibly to product. The idea here is that we want to be able to take a protein that doesn't catalyze this chemistry and introduce a new active site that will execute the Kemp elimination.
OK, so we selected the Kemp elimination as a model system, we and others, because of a grant. That's actually true. DARPA had a grant program for doing model chemistry by computational design. My group, David Baker's lab, Homme Hellinga's lab at Duke, and others around the country were on a large grant program, and we had a list of model reactions, and the Kemp elimination was the first one. So we were all working on that. The Kemp elimination was the first one on the list because there was an existing catalytic antibody from Don Hilvert's work, when he was at Scripps, which had a fairly decent rate acceleration of 10 to the 6th. Subsequently, there was a crystal structure of that catalytic antibody with a hapten bound to the antibody, which is shown here. What you can see is that the substrate analog is sandwiched in by aromatic interactions, and the catalytic glutamate is locked in by other polar interactions in the protein. If this were the real substrate, this interaction would be the catalytic contact. In addition, Ken Houk at UCLA had previously computed a transition state model by ab initio methods for this reaction. So we could combine the quantum chemical notion of what the chemistry should be, which captures the transition state, with what we learned from the crystallography about what active sites look like, and then build a new protein that does all of this.
So the active site design we want is to allow for some general base, either Asp or Glu, ideally with other interactions that tie down the position of the catalytic unit and hopefully also modulate its pKa. Here's the substrate, and we want to position the substrate in an active site where we can have at least one π-stacking interaction. And then, most ideally, even though it's not present in the catalytic antibody, we'd love to be able to put a hydrogen bond on this oxygen, because in the transition state this oxygen develops negative charge, and we'd like to be able to stabilize that with a hydrogen bond.
OK, so we pushed the button in the program, got lots of hits on lots of scaffolds, made lots of molecules, and burned out lots of grad students. And none of these worked, as I'll show you in a minute. Some of them look beautiful, so here's one. These are all models based on the calculation. Here we have a substrate perfectly sandwiched between two phenylalanines, with the catalytic residue (I can't tell if that's an Asp or a Glu) perfectly positioned to pull off this hydrogen. There are many designs that look that nice, but none of them actually work. The typical data that we got are shown here: initial velocity as a function of substrate concentration, with the background reaction shown in black. One of our designs, 16.3.1, is even slightly worse than the background, which is not good. The scaffold, of course, doesn't catalyze the reaction at all. It turns out that, just serendipitously, BSA catalyzes the reaction weakly, so our designs weren't even as good as some random protein that isn't supposed to do this chemistry. So we were quite discouraged and really wanted to figure out what was going on, and we decided to go back and look at one of our best designs, analyze it much more closely, and see if we could learn anything from it. Our best design was called HG1, and it doesn't work.
Here's what it looks like, though. Here's the substrate; again, this is just a model. Here's the catalytic unit. It's tied down by hydrogen bonds. We get a tyrosine that makes the hydrogen bond to this oxygen that develops negative charge in the transition state. There's π stacking with this tryptophan. And importantly, there's this hydrogen bond network: here's the catalytic glutamate, and it goes through this histidine, this asparagine, and on to this tryptophan, so the tryptophan is even held in position, probably to help the substrate get in and make the appropriate catalytic contact. We know the molecule's folded. We can tell that by CD, and we solved the structure by crystallography. This is the crystal structure, whereas this is a model of where the substrate should be, so everything else is real except the substrate.
What we observe from the crystal structure is that the active site is too exposed; there was way too much water in the active site. And that's a problem for two reasons. One is that, of course, removing water from the active site would cost you something in terms of the activation entropy for the reaction. But in addition, the presumptive catalytic unit, which is sort of shown in the back here, is almost completely solvated, so the pKa is not going to be appropriately shifted. In order for this chemistry to go, we want to shift the pKa of the aspartate or the glutamate up a little bit so that it's more reactive. But if it's fully solvated, the pKa is going to be 4 or 4 and a half, just like it would be for a regular glutamate. So the active site is too exposed.
In addition, this is a comparison between the x-ray structure in purple and our design. Again, the position of the substrate is just the model. In our design, I pointed out this beautiful hydrogen bond network that goes to the tryptophan. The design is shown in gray.
But in reality, what we see is that we get great positioning of the glutamate, hydrogen bonded to the histidine, but the asparagine rotates 90 degrees and does not form this hydrogen bond to the tryptophan. What happens is that the tryptophan actually rotates out of the active site, effectively disrupting the binding interaction energy that we would expect to have if things were oriented properly. So the π stacking is not as strong as designed.
The third thing we looked at, in collaboration with Ken Houk's lab at UCLA, is MD analysis. Here's a summary of the MD simulation. What we're doing here is tracking the catalytic contact between the hydrogen on the substrate and the oxygen on this glutamate. This is that contact distance as a function of time. Of course, we want it to be catalytically competent, so three angstroms or less. Initially, the contact is maintained, but fairly soon in the simulation it jumps to a state where the distance is way non-catalytic. If we look at this as a histogram, we can see that the desired catalytic contact is here at the green bar, but for much of the simulation we're way out here at distances that are clearly not going to be catalytic. We can see what happens in the simulation shown here. Here's the substrate, here's the catalytic glutamate. Initially it's contacting appropriately, but as the simulation progresses, what you'll see is that the substrate flips over and moves out of the active site, the active site becomes filled with water and, as we saw in the crystal structure, the glutamate becomes much more solvated than desired. So of course now the distance between the hydrogen and the glutamate is way non-catalytic. The substrate has rotated 180 degrees away from where it was seated.
Based on that, we were able to think about what we needed for a second generation design. We wanted a deeper active site. Everything else still looked good.
In terms of the catalytic unit, we still wanted the π stacking, and we still wanted this hydrogen bond donor to the oxygen. So here's a new active site, same scaffold, but now pushed deeper into the protein. Here's the substrate. Here now, rather than glutamate, we have an aspartic acid that's going to function as the catalytic unit. Here's a residue that makes a hydrogen bond to the oxygen on the substrate, and then we have π stacking within this active site. We also get, serendipitously, a wild-type lysine that makes a contact to the nitro group on the substrate. So we repack the active site, giving a 12-fold mutant of the wild type.
We repeat the MD simulation. What we see is that we still get two different conformational states, but they're now both catalytically competent. The design has the contact here: here's the aspartic acid, here's the substrate, here's the hydrogen. In state one, the substrate slides around a little bit and we still maintain catalytic contact. It quickly flips over to the second state, where you can see the substrate has sort of slid down in the active site. The contact is maintained and water has not penetrated the active site, giving us hope that the pKa of the aspartic acid will be appropriate for the chemistry. You look at the histogram for the distance distribution, and that looks great relative to what we saw before, sharply peaked at the catalytic contact.
So we make the molecule, as shown here: we express it and purify it, and here's the purified material. We can show by CD that it's folded. Compared to wild type, it's 20 degrees destabilized, but it's still well folded, with an apparent Tm well above where we're doing our actual chemistry. The wild-type protein is hyperthermophilic, which was on purpose, because we knew that introducing a large number of mutations would destabilize it, which it does. We have crystals, which we'll come back to in a minute.
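The MD analysis described for the two designs boils down to tracking one distance per frame and asking how often it stays within the catalytic cutoff. Here is a minimal sketch of that bookkeeping. The 3 angstrom cutoff is the one quoted in the talk; the two short distance traces are invented for illustration, not simulation output, and the function names are hypothetical.

```python
from collections import Counter

CUTOFF_ANGSTROM = 3.0  # "three angstroms or less" counts as catalytic contact


def contact_fraction(distances, cutoff=CUTOFF_ANGSTROM):
    """Fraction of frames whose base-to-hydrogen distance is within cutoff."""
    if not distances:
        raise ValueError("need at least one frame")
    return sum(d <= cutoff for d in distances) / len(distances)


def distance_histogram(distances, bin_width=1.0):
    """Counts per distance bin, keyed by the lower edge of each bin."""
    counts = Counter(int(d // bin_width) for d in distances)
    return {b * bin_width: counts[b] for b in sorted(counts)}


# Invented traces (angstroms per frame): the first loses the contact early,
# the second stays sharply peaked near the cutoff.
first_generation = [2.8, 2.9, 3.1, 6.5, 7.2, 8.0, 7.7, 8.4, 7.9, 8.1]
second_generation = [2.7, 2.9, 2.8, 3.0, 2.6, 3.2, 2.9, 2.8, 3.0, 2.7]

if __name__ == "__main__":
    for name, trace in [("first", first_generation),
                        ("second", second_generation)]:
        print(name, contact_fraction(trace), distance_histogram(trace))
```

A screen along these lines only asks whether the substrate stays put in a competent orientation. It says nothing about the chemistry itself, which matches how the simulations are characterized later in the talk.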
So experimentally, again, it actually works and catalyzes the chemistry. Here's HG2 in a Michaelis-Menten plot. It does not show saturation. We have later variants, which I'll show you, that do show saturation. The background is down underneath here. The wild-type scaffold shows no activity, and our second generation design is active. Even though it doesn't saturate, we still fit it. This is not good for students; you shouldn't do this. Anyway, it's just to get a rough idea of what the Km is: a Km in roughly the millimolar range, and a kcat which is not that great. But it actually works.
If we look at knockouts, we take our design HG2, which is shown here in blue, and if we remove the catalytic aspartic acid and mutate it to asparagine, it turns the enzyme off. If we remove the hydrogen bond interaction to the oxygen that's developing a negative charge in the transition state, it slows the enzyme down. And then if we compare our design HG2 to KE70, which is a design from the Baker lab that was published about a year ago, we can see that ours is better. [inaudible]
Okay, so in terms of numbers, we had a rate acceleration of roughly 10 to the 4, which compares favorably to the Baker lab design that was published about a year ago. Given this encouragement, we sought to further understand what's happening in HG2, and in particular to see whether or not a transition state analog was actually binding in the active site, which had not been demonstrated in any previous case. Okay, so this is a real structure of a transition state analog in the active site, and we can see several interesting things. The first and probably most important feature is that, even though the contact is kind of tight, the substrate is interacting with the presumptive catalytic aspartic acid.
In addition, the substrate is stacking on this aromatic amino acid as designed, but interestingly, the substrate is flipped over roughly 180 degrees from where it should be. In our design, this nitro group should be way over here interacting with this lysine. It's not; it's actually now flipped over. This is a designed serine position, and rather than interacting with what would be the oxygen on this side of the substrate, the serine is now interacting with the nitro group. So the great news is that we now have, for the first time, a co-crystal structure of a designed enzyme with a substrate analog in the active site showing the critical contact. The bad news, which may be consistent with the low activity, is that the detailed structure indicates that the design is not doing exactly what we want. We get the critical interaction, but we're not getting other interactions that we would have liked to have as part of the design.
Okay, so based on this, and in particular on the fact that the serine is not interacting with what would be the oxygen on the real substrate, we did a third generation design that modulates this position somewhat to fill a void space that we see in some of the MD simulations. We ran the MD simulation on this third generation design, which is now a point mutation of our active design, serine to threonine, so this adds a methyl group. The MD simulation looks great, and then we made this molecule and assayed its activity. The third generation molecule is much better than HG2. It starts to show saturation behavior, as desired, in the Michaelis-Menten plot, and as you can see, it's much more active than our previous design or other designs that were published. Okay, so in terms of rate acceleration, for our third generation design, we're almost at 10 to the 6.
In fact, if we shared this molecule on the lab with Hilferlab, at the Atheha who do demerensimologies, which sort of hack into enzymologists, they show that in a appropriate buffer system that they can get radiocelebrations of 10 to the six, which match the catalytic antibody. So this is the interpret radiocelebration, the best design design that's been described today. And so, one interesting thing I want to point out is that if any of you are familiar with the competition for the design field, there's been a lot of controversy recently around competition for enzyme design, there's an important paper that was crafted from science about a year ago, and so there's now, not my line, of course, there's now a real effort to bring collaborators in to corroborate design. So what we did in this case was to set in clones, the Hilferlab and Rebecca Bloomberg graduate student in Don's lab, using a different purification scheme than we used, getting much cleaner material, was able to show that the enzyme does exactly what we expected it to do, and has kinetic properties that are consistent with our data, it's not slightly better, and now they're also in the process of using direct evolution methods to see if they can improve the, okay, so, some of this part of the talk is that, you know, you can only do this, you can use competition design to design the novel enzymatic systems. I think we're still at very growing stages, but, you know, in terms of introducing novel activity, nothing else would have worked. You can't really use direct evolution to introduce novel activity, and you try to have evolution very effectively to increase activity in the existing enzyme, but to introduce novel activity, this looks like it might be an interesting way to go. I didn't show you a lot of data, there's a paper that's coming out hopefully soon, I'll show more data. One thing that we found surprising, and what I found surprising was that the M.D. 
analysis was amazingly predictive. We have a whole panel of molecules that we designed, and we did blind studies with UCLA where they ran the MD simulations and then told us which ones would work and which ones wouldn't. Roughly 80% of the time the prediction was correct: the things they said would work based on the MD simulation did work experimentally, and the things they predicted wouldn't work didn't work experimentally. So I was actually really amazed at what a simple MD simulation could do. And of course, if you noticed, these MD simulations weren't very long, and there's no chemistry happening in them. This is not a QM/MM calculation; it's a pure mechanics calculation, and the only thing being evaluated is whether the substrate stays in the active site and whether it stays in a catalytically competent orientation. So that was actually really nice, and potentially very valuable going forward for screening perhaps many thousands of designs and only going to the laboratory with those designs that are predicted to be active. Okay, of course I'm very interested in trying to do real stuff with this, and in fact I started a company last year with Monsanto to look at engineered enzymes for biotech applications using this kind of methodology [unclear].
Okay, so I'm going to switch gears and talk about library design. We'll probably only get through the first two pieces and not have time for the third. We've been thinking about coming up with methods for computationally designing libraries, validating them, and then showing an application, a published one, where we made an enhanced blue fluorescent protein.
And we're also really interested in making enhanced red fluorescent proteins that are red-shifted relative to what's available now, which would be much more useful for imaging applications. Okay, so what we're interested in with library design is generating new and improved protein function by screening combinatorial libraries of proteins that have a high fraction of functional molecules displaying a wide range of functional diversity. The problem, thinking about methods like error-prone PCR and recombination experiments, is the issue of preserving function. You have some wild-type function, which may be an antibiotic activity. You make a library of many different molecules, and you'd like all the members of that library to have some functional activity, so you want to preserve function. At the same time, you'd like the library to show a diversity of function. If everything looks exactly like the parent protein, then you haven't done anything. So you want high preservation of function but also high diversity of function. And of course these two things are linked. In an error-prone PCR experiment, for example, if you want high diversity of function, you might use a very high mutation rate. That unfortunately will typically give you low preservation of function, because the more you mutate the protein, the more likely you are to destroy all of its function. Conversely, if you want to preserve function, you might use a low mutation rate. You'll get high preservation of function, but with so few mutations you're hardly changing the parent protein at all, and you're going to get low diversity of function.
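The trade-off described here can be made concrete with a toy model, sketched in Python. It assumes, purely for illustration, that the number of mutations per clone is Poisson-distributed and that each mutation independently leaves the protein functional with a fixed probability; neither assumption comes from the talk, and the function names and numbers are hypothetical.

```python
import math


def fraction_functional(mean_mutations, p_tolerated):
    """Expected fraction of functional clones in the toy model.

    With M ~ Poisson(mean_mutations) and each mutation tolerated
    independently with probability p_tolerated, E[p**M] = exp(-mean * (1 - p)).
    """
    return math.exp(-mean_mutations * (1.0 - p_tolerated))


def fraction_mutated(mean_mutations):
    """Fraction of clones carrying at least one mutation (a crude diversity proxy)."""
    return 1.0 - math.exp(-mean_mutations)


# Hypothetical tolerance probabilities: randomly placed versus well-placed mutations.
for p in (0.5, 0.9):
    for mean in (0.5, 2.0, 8.0):
        print(
            f"p_tolerated={p:.1f}  mean mutations={mean:>3.1f}  "
            f"functional={fraction_functional(mean, p):.3f}  "
            f"mutated={fraction_mutated(mean):.3f}"
        )
```

In this toy picture, raising the mutation rate increases the mutated fraction but shrinks the functional fraction, and the only way to keep both high is to raise the per-mutation tolerance, which is the role the talk assigns to the calculation that chooses where mutations go.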
So of course what we want to do is use computation to optimize both properties, both diversity of function and preservation of function. That is, we want the calculation to allow us to use a high mutation rate but then to show us where in the protein the mutations can go. Now the problem is, and I didn't point this out in the previous part of the talk, that when we think about functional properties of proteins, like catalyzing a reaction, our design calculation is actually not really doing the chemistry. There's no QM in that calculation. So what we're relying on is a hypothesis that basically says that if the protein is folded and stable, then that increases the probability that it's going to be functional. And clearly the opposite must be true: if the protein's not folded and stable, it's not going to be functional. So our hypothesis is fold for function, and we're going to compute fold stability as a surrogate for the real function. All right, so the way we do this is shown here. In our normal calculation we have a structure, we have a parent sequence, a scoring function, we have energies, we have an optimization, and we get something at the end. For the calculation I showed you before, we have two types of energies, a one-body energy and a two-body energy. We run our pair energies, then we run our optimization, and we get a sequence out. For a library we're going to do the same thing but just manipulate the energies. We convert our one-body energies: instead of being focused on a single rotamer, we aggregate them into amino acid one-body energies and then amino acid pair energies. And then we can apply the set constraints to get set one-body energies and set two-body energies, where the sets are just the collections of amino acids that are encoded by degenerate codons.
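The aggregation from rotamer energies to amino acid energies to set energies can be sketched in Python for the one-body term only. The talk does not say how the aggregation is done, so the choices here (best rotamer per amino acid, plain average over the members of a set) are assumptions made for illustration, as are all names and energy values. Pair energies and the real optimization are left out.

```python
from statistics import mean


def amino_acid_energies(rotamer_energies):
    """Collapse rotamer-level energies to one value per amino acid.

    Assumption: each amino acid is represented by its lowest-energy rotamer.
    """
    return {aa: min(energies) for aa, energies in rotamer_energies.items()}


def set_energy(aa_energies, aa_set):
    """Score a set of amino acids. Assumption: average over the set members."""
    return mean(aa_energies[aa] for aa in aa_set)


def best_set(aa_energies, candidate_sets, wild_type):
    """Pick the lowest-energy candidate set that still contains the wild-type residue."""
    allowed = [s for s in candidate_sets if wild_type in s]
    return min(allowed, key=lambda s: set_energy(aa_energies, s))


# Hypothetical one-body rotamer energies at a single position (arbitrary units).
rotamers = {"A": [1.0, 0.5], "S": [0.2, 0.9], "T": [0.4], "G": [3.0]}
# Hypothetical sets, each standing for the amino acids of one degenerate codon.
candidates = [frozenset("AS"), frozenset("AT"), frozenset("AG"), frozenset("ST")]

aa_e = amino_acid_energies(rotamers)
choice = best_set(aa_e, candidates, wild_type="A")
print(sorted(choice), round(set_energy(aa_e, choice), 3))
```

Requiring the wild-type residue in every candidate set mirrors the constraint, mentioned later in the talk, that the libraries always retain the wild-type amino acid at each position.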
So different degenerate codons will code for different sets of amino acids, and we can put all that together to get a calculation where the output is now a library: a sequence of degenerate codons that codes for different combinations of amino acids at each position. Then we can go into the laboratory, make that gene, and express it as a library. So it's the identical calculation, but now we're adding additional constraints for degenerate codons. All right, so we tested this on GFP. We're going to target 15 core positions in GFP, shown here in yellow, excluding the chromophore, shown here in green. We're going to score using fluorescence on a plate reader. We're going to score preservation of function as the integrated fluorescence intensity, so functional just means it fluoresces, and we're going to score diversity of function as the peak position of the fluorescence spectrum. Okay, so there are different libraries that we're going to compare. The method I just described is called Divis [unclear], and we have two different versions of it that I'm using here. One is where we constrain the system to select two-fold degenerate codons at nine positions out of the 15 in this region. So this is a very high mutation rate: we're going to mutate nine positions out of 15, which is kind of crazy. Then a less aggressive implementation, where we use four-fold degenerate codons at four positions, so now we're mutating four positions out of 15. And then a mean-field-based approach that we published many years ago, where the calculation targets the two positions that are most likely to tolerate mutations. So we go from a very high mutation rate to moderate to low mutation rates. And then a random library, where we use two-fold degenerate codons at the same nine positions identified by the calculation.
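To make the idea of a degenerate codon encoding a set of amino acids concrete, here is a small Python sketch that expands IUPAC nucleotide ambiguity codes against the standard genetic code and counts the protein-level size of a library. The function names and the example codons are illustrative assumptions and are not the codons used in the GFP libraries.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def encoded_amino_acids(degenerate_codon):
    """Return the set of residues ('*' means stop) encoded by a degenerate codon."""
    choices = [IUPAC[base] for base in degenerate_codon.upper()]
    return {CODON_TABLE["".join(codon)] for codon in product(*choices)}


def library_size(degenerate_codons):
    """Number of distinct protein sequences encoded (stop codons counted as members)."""
    size = 1
    for codon in degenerate_codons:
        size *= len(encoded_amino_acids(codon))
    return size


# Illustrative library: a two-fold codon, a four-fold codon, and a fixed codon.
example = ["KCT", "RST", "GCT"]
for codon in example:
    print(codon, sorted(encoded_amino_acids(codon)))
print("library size:", library_size(example))
```

A codon such as KCT encodes exactly two amino acids and RST exactly four, which is the sense in which the talk speaks of two-fold and four-fold degenerate codons.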
We don't do a truly random library, because truly random would just give garbage. We also have an error-prone PCR control directed at the entire gene, the entire protein. Again, if we targeted error-prone PCR at just this tight region, we would get too many mutations and we'd get garbage out. So these are good random controls, but they're biased to be better than what you would expect if you did it truly randomly. Importantly, because we do this computationally, we constrain the libraries so that we always maintain the wild-type amino acid at each position in the library. Okay, so here are the different libraries, here are the positions, here are the calculations. You can see here that for the highest mutation rate we have these two-fold degenerate codons, so we have [unclear] threonine and serine available, at nine positions selected by the calculation, and of course we preserve the wild-type amino acid, shown in black. The four-fold library targets four positions and uses four-fold degenerate codons, also maintaining the wild-type amino acid. The mean-field calculation does saturation mutagenesis at the two positions that are most able to accommodate mutation. And then the random control: the same nine positions, but now a random selection of two-fold degenerate codons that contain the wild-type amino acid. Okay, so then you can score these by preservation of function. These are ordered by mutation rate: very high, medium, low, the random control, and then the error-prone PCR control for the entire gene. This is the number of clones that were sampled. These are experimental data. You can see that we get good preservation of function despite the fact that we're using an amazingly high mutation rate, and this is over a stretch of 15 amino acids.
So 12% of the clones have at least half the fluorescence intensity of wild type, and almost half of the clones have at least one-fiftieth the fluorescence intensity of wild type. Now we can look at the other libraries that have lower mutation rates. We get the counter-intuitive result that the lower mutation rate gives lower preservation of function, which is exactly the opposite of what we'd expect but exactly what we want from the calculation. We want the calculation to allow us to use a high mutation rate and still get high preservation of function. This counter-intuitive result is explained here, but I'll skip it because we're running out of time. But we get the desired counter-intuitive effect: a high mutation rate gives high preservation of function. Look at the random control: it does terribly, as we would expect. And the error-prone PCR control, using this mutation rate targeted against the entire 300 amino acid protein, does exactly what we expect. That is, we're mutating it lightly, and we get high preservation of function. And you'll see on the next slide that, because of this, we get low diversity of function. Okay, so let's look at diversity of function now. We take only those clones that have at least half the intensity of wild type, again ordered by mutation rate. What I'm showing here is peak position deviation from wild type in nanometers, so zero would be identical to wild type. The white bar is the median peak position, the red bar is plus and minus 25% of the data, and the black tick marks are the actual data. For the high mutation rate we get the expected result: high mutation rate, high diversity of function, almost a uniform distribution of peak positions across this range. As expected, the lower-mutation-rate approaches give lower diversity of function. They're all sort of right on top of where you expect wild type to be. The random control, same thing.
And the error-prone PCR control, as I showed on the last slide, has high preservation of function because we're using a relatively low mutation rate, but it also has very low diversity of function. All the data are right on top of zero, so we're not changing the fluorescence properties of the protein. Okay, so we can preserve function and get diversity of function because we have the calculation tell us where to put the mutations. And the corollary to that is that designing a library based on fold stability is a good surrogate for the actual chemical function of the molecule. This was published a few years ago. One quick example of using this for something that might be useful is blue fluorescent protein. So here's blue fluorescent protein, here's the chromophore, here are some residues around the chromophore. BFP has a blue-shifted emission maximum, which is useful for cell biology experiments if you want to do imaging. The problem with BFP as it exists is that it has a very poor quantum yield, which means it's very dim; it's very pH sensitive; and most importantly, it photobleaches very rapidly, so you excite it and it just goes away. So we designed a library using this method that targets 12 positions around the chromophore. If you did site-saturation mutagenesis at all 12 positions simultaneously, you would have 10 to the 15 different variants, which is intractable experimentally. The calculation reduces the number of variants to a focused combinatorial library of 10 to the fifth variants, and we can screen those conveniently using FACS, fluorescence-activated cell sorting. So here's the design: here are the positions, here are the wild-type amino acids, here are the wild-type codons, here are the degenerate codons that the calculation pulled out, and here are the amino acids that correspond to those degenerate codons.
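The library sizes quoted for the 12 positions around the chromophore can be checked with a few lines of Python. The only inputs are the numbers stated in the talk (12 positions, 20 amino acids, a designed library on the order of 10 to the fifth); the variable names are illustrative.

```python
import math

positions = 12
amino_acids = 20

# Full site-saturation at all 12 positions simultaneously.
saturation_variants = amino_acids ** positions

# Order-of-magnitude size of the designed library, as quoted in the talk.
designed_variants = 10 ** 5

print(f"saturation library: {saturation_variants:.3e} variants")
print(f"log10 of saturation library: {math.log10(saturation_variants):.2f}")
print(f"fold reduction: {saturation_variants / designed_variants:.1e}")
```

Twenty to the twelfth power is about 4 x 10^15, consistent with the "10 to the 15" quoted in the talk as an order of magnitude, so the designed library is smaller by roughly ten orders of magnitude.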
And you can see that some positions are wild type and some positions have lots of mutations, spread across this whole cluster of positions in the protein. So it's a very non-trivial construction for the library. In a single pass of FACS analysis, we can screen that library and pull out a clone which is a triple mutant of wild type, and it has the following properties. Again, this is a single pass through the screen. We increase the quantum yield from 0.34 to 0.55, which is actually pretty remarkable, and we get a 40-fold improvement in the photobleaching half-life. So here's wild-type BFP, which has a short half-life. Here's our designed variant, called Azurite, which has a much enhanced half-life. And looking at that in real cells, these are mammalian cells. Here's BFP as a function of time. You can barely see it; it's dim because of the poor quantum yield, and of course it gets dimmer over time because of photobleaching. Here's our variant: very bright at time zero, and it maintains its brightness even at longer timescales. So this is, I think, a great demonstration of designing a library for some target function where you cannot compute the function. We can't compute the fluorescence properties in these calculations, but in a single pass we get a molecule which is great. I'll skip all the rest of this. I just want to acknowledge the people who did the work. The unpublished design work was done primarily by [unclear], a graduate student in the lab, and everything else was published. I also want to acknowledge Ken Houk's lab; Gert Kiss, a graduate student, was responsible for all the MD simulations. And the structure work that we do is a collaboration with a facility at Caltech called the Molecular Observatory, which includes a beamline at Stanford; Len Thomas and Pavle were critical in helping us do that. Thank you. I'm happy to take questions if there are any.
I'm curious, the first question is, whether...