Thank you all for coming to the second lecture of the Steenbock series. Again, Joe gave a fantastic presentation yesterday on the evolution of protein-ligand specificity, and today I think he's going to talk about the action at the other end, where the protein and DNA interact. So with that really, really short introduction, I'll let Joe take the stage. Thank you.
Well, thank you, Vetsen. It's been a real treat to be here over the last couple of days. I enjoyed talking with the students today, and talking with the faculty for the last two days has been wonderful. I don't know who's recruiting the new faculty in biochemistry, but they're awesome, so nice job. So yesterday I talked about the evolution of hormone specificity in the steroid hormone receptor ligand-binding domain as a model for the evolution of ligand-regulated allostery. Today I'm going to talk about a different domain in the same protein family, the DNA-binding domain, which is a model for the evolution of transcription factor specificity for DNA. The characteristics of this system make it amenable to certain biochemical techniques which were not available to us with the LBD, so it allows us to gain some different kinds of insights into the nature of protein evolution. Oh, and I want to mention up front today, as I should have done yesterday, the students and postdocs who did the work: Tyler Starr, Alesia McKeown, Jamie Bridgham, Dave Anderson and Geeta Eick were responsible for most of the work that I'm going to show you today. The introduction is essentially the same as last time, so I'm going to assume that most of you were here and run through it very quickly. If you weren't here, tough luck.
So the framework for the work that we do is this: explicitly studying evolutionary history, by using phylogenetic methods to reconstruct the history of sequence evolution and then biochemical and molecular methods to experimentally test hypotheses about the causes of historical evolution, can shed light on essential questions in both evolution and biochemistry. In evolutionary biology we're interested in the nature of the causes and consequences of evolutionary processes. What's the distribution of the effect sizes of mutations? Does it take a lot of mutations to confer a new function or just a few? What's the role of genetic interactions between mutations in structuring the evolutionary process? Does it make evolution contingent on chance events, or is selection always capable of producing the same optimal outcome? That question of optimality is important: is what happened during history the only or the best way that a certain new function, presumably selected for, could have evolved? That is, does the world we live in represent the best possible world or the only possible world given that context? Does protein evolution follow a trajectory from promiscuous to specific proteins? Does it tend to pass through promiscuous intermediates on the way to new functions? Questions like this. In biochemistry, evolution is important because it provides the ultimate explanation for the properties of proteins. If we want to know why a protein has the properties that it does, we have to understand its history, because those properties were all produced by evolution. And then there's this idea that historical analysis provides a very efficient and powerful way of identifying the changes in sequence and structure that matter for producing functional differences between present-day proteins.
I talked about the difference between comparative, horizontal approaches, where we attempt to find the differences in sequence between extant proteins that confer their functional differences, and vertical analysis, where we explicitly retrace the evolutionary process through time. The horizontal approach usually fails, whereas the vertical approach is more powerful. I mentioned two reasons, but I'm going to mention a third, new one. So the two that I mentioned were these. One, the vertical approach allows us to dramatically narrow the number of candidate sequence differences. If we're interested in the difference between purple and blue functions and we're comparing these sequences to each other, we are including all of the sequence changes that have been amassed along this lineage and this lineage. But if the change in function happened only on this branch, we can look at the much smaller number of substitutions along here. Two, if any of the substitutions along here interact epistatically with the function-switching mutations, then swapping them between the extant proteins will often result in a failure to identify the causal mutations. But I want to add one more issue, which is that if we're comparing purple to blue and trying to swap residues to cause the difference, we need to know which function is ancestral and which is derived in order for identifying the causal effect of any mutation to be an effective process. So if we know we're going from purple to blue through sequence changes from, say, state I to state J, and reintroducing J into this ancestor produces the blue function, then we know precisely through this experiment what the cause of the present-day functional diversity is: it's that change in sequence from I to J. But if we're just going back and forth between extant functions and extant sequences, even if we could convert one function into another, we don't know that that's historically relevant.
But we also don't know whether converting purple into blue represents a way of destroying the purple function or of conferring the blue function. If we had a promiscuous ancestor and didn't know it, all that may have happened was a partitioning of functions. So polarizing both sequence and functional changes through time allows us to define precisely the nature of the difference and the direction of causality that we're interested in. That was not as clear as I would like it to be, but you'll see in the story that I tell today a sense in which this is important. All right, so that vertical analysis requires us to be able to characterize the sequences, structures and functions of the ancestral proteins. We use computational methods to infer ancestral sequences at any node on a phylogenetic tree. The maximum likelihood ancestral sequence at any node of interest is the hypothesized ancestral sequence that, of all possible ancestral sequences, is associated with the highest probability of all of the sequence data in the world today evolving. That is, the maximum likelihood ancestral sequence maximizes the probability of the sequence data in extant proteins, the descendants of that ancestor, having evolved. Now this is always, or virtually always, associated with some uncertainty because of multiple changes at sites during the course of evolution. It's often true that some sites are ambiguously reconstructed, where more than one state is a plausible reconstruction, and then we want to know about more than just the best point estimate of the ancestral sequence; we want to know about a set of plausible ancestral sequences. Once we have that in hand, we can synthesize DNA and experimentally characterize those ancestral proteins to test hypotheses about what they did, what their function was. We can identify causal mutations by recapitulating the process of sequence evolution in the background in which it occurred.
We can do structural biology on these ancestral proteins to understand the biophysical mechanisms by which genetic changes produced shifts in function. And there's one more thing we can do, which I'm going to talk to you about today, which is Tyler Starr's work using deep mutational scanning on reconstructed ancestral proteins to understand the topology of the sequence space on which this historical evolutionary process occurred. So we can compare historical evolutionary trajectories to alternative evolutionary trajectories that could have happened but didn't, and that approach allows us to understand not only what happened but why it happened. The model that I've been discussing for these two days is the steroid hormone receptors. It is a family of six proteins in humans; they're distributed throughout the vertebrates, and there are a smaller number in some invertebrates. They are molecular mediators in the sense that they are ligand-regulated transcriptional activators. The ligand-activated complex binds as a dimer to palindromic response elements near target genes and, in this active conformation, recruits co-activator proteins and potentiates expression of the target gene. They have rich phylogenetic signal, particularly in the ligand-binding domain, which I talked about yesterday. Today we're going to talk about the DNA-binding domain, a zinc finger region. This modular domain confers the DNA-binding specificity of the protein and therefore its target gene specificity. Because of the modular structural architecture of these proteins, we can mix and match DBDs and LBDs and therefore study their functional evolution independently of each other. So last time I talked about functional diversity among the steroid receptors in the ligands that they respond to. Today I'm going to talk about diversity in the DNA sequences that they recognize. There are two major clades of the steroid receptors.
One consists of the estrogen receptors, alpha and beta, which bind to the so-called ERE. Here is the palindrome. This is the half-site, six base pairs of AGGTCA, then a three-base-pair spacer, and then an inverted repeat on the other strand. So that's the canonical ERE, and ChIP-seq experiments show that this nicely characterizes the preferred sites. Then the other clade is the receptors for the other steroids: progesterone, androgen, mineralocorticoid and glucocorticoid. They also bind to a palindrome with the same structure, except that it differs at the two central nucleotides of the half-site. Instead of the GT in the ERE, the so-called SRE, the steroid response element, has AA. So instead of making you look palindromically and invertedly from now on, we're going to just talk about the diversification of this core half-site from GT to AA, from ERE to SRE. And we're interested in knowing two things in this talk. First, what happened during history? How did this functional diversity evolve? What was the ancestral state? What were the genetic and structural mechanisms? And second, why did it happen that way, which is a way of asking what else could have happened? That involves comparing historical versus alternative paths through the sequence space around this ancestral protein. So we have to begin by reconstructing the ancestral sequence. This is done on the same phylogeny that I described to you yesterday. Actually, it's slightly different. This is a data set of 213 steroid receptors. We reconstructed the tree using several different probabilistic methods, which give us essentially the same tree, using the model of sequence evolution which best fits this specific data set. And I should mention that the two proteins we're going to reconstruct are the last common ancestor of the SRE-binding proteins, which is called AncSR2, and the ancestor of the entire steroid receptor clade, ERs and SRs alike, which we call AncSR1. These two proteins are both more than 450 million years old.
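The response-element architecture described above (a six-base-pair half-site, a three-base-pair spacer, and an inverted repeat on the other strand) can be written out as a minimal sketch. The half-site sequences and spacer length come from the talk; the function names and the use of `N` for unspecified spacer bases are illustrative assumptions.

```python
# Minimal sketch: build the palindromic ERE and SRE from their half-sites.
# Helper names are hypothetical; only the half-sites and the 3-bp spacer
# come from the talk.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

ERE_HALF = "AGGTCA"  # estrogen response element half-site
SRE_HALF = "AGAACA"  # steroid response element half-site (GT -> AA)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_element(half_site: str, spacer: str = "NNN") -> str:
    """Half-site, spacer, then the inverted repeat of the half-site."""
    return half_site + spacer + reverse_complement(half_site)


def differing_positions(a: str, b: str) -> list[int]:
    """Zero-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


ere = palindromic_element(ERE_HALF)
sre = palindromic_element(SRE_HALF)
print(ere, sre, differing_positions(ERE_HALF, SRE_HALF))
```

The two half-sites differ only at the two central positions, which is why the rest of the talk can treat the ERE-to-SRE transition as a GT-to-AA change in the core half-site.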
This one is older than the last common ancestor of all vertebrates, and this one is older than the last common ancestor of you and an octopus. So on that tree, which was inferred from both the DNA-binding domain and the ligand-binding domain sequences because there's more signal that way, we reconstruct the ancestral DBD sequence at SR1 and SR2. Here's the confidence that we see in these two reconstructed proteins. This is much higher confidence than we were working with yesterday. For ancestral SR2, the mean posterior probability across sites is 0.98. The vast majority of sites have no ambiguity whatsoever or only a very small amount of ambiguity. There are only two sites which have a plausible alternative reconstruction, which is defined as an alternative state having a posterior probability greater than 0.2. SR1, the more ancient ancestor, is a little bit more diffuse in its reconstruction, but it's still a reasonably confident reconstruction. The mean posterior probability across sites is 0.87. There are nine sites that are ambiguous, and we're going to return to that uncertainty later. Now the thing here is that these two ancestors differ from each other by 38 substitutions in a domain that's 82 amino acids long. So they flank one branch on the phylogeny, but that is a long post-duplication branch on which a lot of sequence evolution occurred. So we want to characterize their functions with respect to DNA binding and activation. The first assay I'm going to tell you about is a traditional reporter assay using a luciferase reporter that's flanked by four copies of the response element of interest. We take our ancestral steroid receptor DNA-binding domain, subclone it to produce a fusion protein with a constitutively active activation domain, and transfect this into cultured cells. To the extent that this is a good binder of the response element being tested, we get luciferase expression, or not if it's a bad binder.
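The confidence summary quoted above (mean posterior probability across sites, and sites counted as ambiguous when an alternative state has posterior probability above 0.2) can be sketched in a few lines. The 0.2 threshold is from the talk; the function name, the data layout, and the four-site toy posteriors are hypothetical and are not the real AncSR1 or AncSR2 values.

```python
from statistics import mean

AMBIGUITY_THRESHOLD = 0.2  # alternative states above this count as plausible


def summarize_reconstruction(site_posteriors):
    """Summarize per-site posteriors (list of dicts: amino acid -> probability).

    Returns the maximum-posterior sequence, a sequence carrying the
    second-best state at every ambiguous site, the ambiguous site indices,
    and the mean posterior probability of the best state across sites.
    """
    ml_seq, alt_seq, best_probs, ambiguous = [], [], [], []
    for i, posterior in enumerate(site_posteriors):
        ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
        best_state, best_p = ranked[0]
        ml_seq.append(best_state)
        best_probs.append(best_p)
        if len(ranked) > 1 and ranked[1][1] > AMBIGUITY_THRESHOLD:
            ambiguous.append(i)
            alt_seq.append(ranked[1][0])  # plausible alternative state
        else:
            alt_seq.append(best_state)
    return {
        "ml": "".join(ml_seq),
        "alt": "".join(alt_seq),
        "ambiguous_sites": ambiguous,
        "mean_posterior": mean(best_probs),
    }


# Hypothetical toy posteriors for four sites (illustration only).
toy = [
    {"E": 1.0},
    {"G": 0.7, "S": 0.3},
    {"A": 0.95, "V": 0.05},
    {"K": 0.55, "R": 0.45},
]
summary = summarize_reconstruction(toy)
print(summary)
```

The `alt` sequence here combines every plausible alternative state into one protein, which gives a single conservative robustness test rather than one construct per ambiguous site.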
So we're comparing the canonical ERE to the canonical SRE. We've also done this on alternate versions of the SRE, and the results are virtually identical. Okay, so here are those results. SR1, the ancestor of the whole family: nice robust activation above baseline expression off the ERE, no activation whatsoever off the SRE. SR2: the specificity is exactly reversed. No activation whatsoever off ERE, robust activation off of SRE. So let's color-code those ancestors. The trajectory of functional evolution is from a specific ERE activator to a specific SRE activator. So no, the ancestor is not promiscuous. The idea that evolution always proceeds from promiscuous to specific is wrong. It happens sometimes, but here's an example of evolution from one form of specificity to another form of specificity. So there was a discrete shift in specificity on this one branch. This inference is robust to uncertainty about the precise ancestral sequence. We created alternate versions of both ancestral proteins that contain all of the plausible alternative amino acids at the ambiguously reconstructed sites, all thrown in together. For one protein that's only a few sites; for this one, I believe it was nine. So this alt-all protein differs from the ML reconstruction at nine amino acids. We can think of this as the far edge of the cloud of plausible ancestral reconstructions, and its function is still ERE specific. SR2's alt-all is also SRE specific. So that reconstruction of a discrete shift from ERE to SRE specificity is robust to uncertainty about the ancestral sequences. Alesia McKeown, who was a grad student in the lab and is now a postdoc in LD's lab at Utah, dissected the biochemical basis for that shift. She did this using fluorescence anisotropy: binding of the ancestral proteins to tagged oligonucleotides, both ERE and SRE, where titrating in the protein allows her to precisely quantify affinity.
So this is showing you the macroscopic Ka, so a big bar is higher affinity. The deepest ancestor prefers the ERE to the SRE, with an affinity that is 1,400-fold higher. Across this interval of phylogenetic time to SR2, we have a shift to higher-affinity binding to SRE; SR2 now has a 50-fold preference for SRE. So across this interval, you have a 70,000-fold shift in the relative affinity for SRE versus ERE. Alicia then sought to dissect this binding of a dimer to a palindromic response element into the components of the binding cycle. I've shown you the macroscopic binding of the dimer. This can be broken down into the binding of a monomer to a half-site and the binding of the second copy of the protein to the second half-site in the dimer. And we can describe the cooperativity using this cooperativity coefficient, omega, which describes the fold increase in the affinity of binding to the second site relative to binding to the first site. These can be measured directly. I already showed you how we measure the macroscopic affinity using a tagged palindrome; we can measure K1 using fluorescence anisotropy with a tagged half-site. If those are measured, then we can infer omega, and because this is reversible binding we can exactly calculate the effects of each of these events on the free energy of binding. So I showed you the 70,000-fold shift in macroscopic affinity. Alicia measured the half-site binding affinity. She found that there is a shift in specificity, but SR2 ends up with a lower affinity for its preferred half-site than the ancestral protein started out with for ERE. But that reduction in half-site affinity is offset almost completely by an increase in cooperativity. SR1 begins with low cooperativity, only about two-fold, and there's a substantial increase in cooperativity, up to, I think, 30-some-fold on the SRE.
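The arithmetic behind the binding cycle can be sketched as follows. The 1,400-fold and 50-fold preferences are from the talk. The relation `Ka_macro = omega * K1**2` is an assumed simplification (two identical half-sites, statistical factors ignored) and may differ from the model actually fitted; the temperature of 298.15 K and the function names are also assumptions.

```python
import math

R_KCAL = 1.9872e-3   # gas constant, kcal / (mol K)
T_KELVIN = 298.15    # assumed temperature


def delta_g(k_assoc: float) -> float:
    """Free energy of binding (kcal/mol) from an association constant."""
    return -R_KCAL * T_KELVIN * math.log(k_assoc)


def omega_from(ka_macro: float, k1: float) -> float:
    """Cooperativity coefficient, assuming Ka_macro = omega * K1**2."""
    return ka_macro / k1 ** 2


# Fold preferences quoted in the talk.
sr1_ere_over_sre = 1400.0  # AncSR1 prefers ERE 1,400-fold
sr2_sre_over_ere = 50.0    # AncSR2 prefers SRE 50-fold

# Shift in relative SRE-versus-ERE affinity across the branch.
specificity_shift = sr1_ere_over_sre * sr2_sre_over_ere

# The same shift expressed as a change in relative binding free energy.
ddg_shift = R_KCAL * T_KELVIN * math.log(specificity_shift)

print(f"{specificity_shift:,.0f}-fold shift, about {ddg_shift:.1f} kcal/mol")
```

Because binding is reversible, fold changes in affinity multiply while the corresponding free-energy changes add, which is what lets the macroscopic shift be partitioned into half-site and cooperativity contributions.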
And as a result of this trade-off in the components of macroscopic affinity, we end up with a macroscopic affinity where SR2's affinity for its palindrome is of the same order as the ancestral protein's affinity for its preferred response element. That trade-off among the components of binding becomes important in the evolutionary process, for reasons that you'll see. So that's what happened macroscopically, old-style biochemically. We'd like to know what the genetic causes and the structural causes are for that evolutionary change in function. Our candidate historical genetic changes are the sequence differences that arose on this branch between SR1 and SR2. There are 38 of them, as I told you. That's too many to work with conveniently. So we looked at the phylogenetic pattern to identify substitutions that are diagnostic, in the sense that they're conserved in one state in the SRs that bind SRE and conserved in a different state in the ERs that bind ERE. Unfortunately, there are still 19 of those, so that's still a lot. So we used structural information to narrow down our candidates further. Working with Eric Ortlund and his postdoc, Mike Murphy, at Emory University, we determined the crystal structures of ancestral protein SR1 on the ERE and ancestral protein SR2 on the SRE. So these are crystal structures, and you can see they have very similar overall conformations, as you would expect. These are the dimers. This is one monomer, that's the other monomer. There are some subtle differences. For example, there's a more ordered helix here. And more significantly, there's a loop here that leads from the DNA interface up to the dimer interface, which is ordered in SR2 but not resolved in the SR1 structure. But the overall theme is the same. When we look at the amount of buried surface area in the protein-DNA interface, we see that ancestral SR1 buries more surface area than SR2.
But if we look at the dimer interface, SR2 buries more surface area in the protein-protein interaction than SR1 does. This is consistent with the phenomenon that I described a minute ago, which is that the half-site affinity of the SR1 protein-DNA complex is higher, but in SR2 that's offset by higher cooperativity of binding by the dimer. So what are our best candidates for a change in half-site recognition? It's the recognition helix. Both of these proteins have this canonical binding motif, which is a 10-amino-acid helix that sits down in the major groove. And the two bases that differ in the half-site between ERE and SRE are right underneath it, right here. Among these 10 residues, most are conserved, but there are three differences between SR1 and SR2. These are diagnostic, in that they're conserved in their different states in all the descendants, and they're unambiguous in their reconstruction: they are reconstructed with 100% posterior probability as changing along this branch. OK. So the ancestral sequence is shown here in lower case. The three residues that change go from EGA to GSV. A close-up of the hydrogen bonding interactions between the side chains on this helix and the DNA shows a changing pattern of hydrogen bonds. That includes changes in hydrogen bonding by some of the residues that change, like this Glu, as well as by other residues which do not change from SR1 to SR2, like K and R, and especially this lysine here, which we'll talk about more in a minute, which forms hydrogen bonds in the SR1-ERE structure but not here. So there are some differences here that suggest that these may be causally important. So these three are going to be our first candidates for a shift in specificity. We're going to test this hypothesis in two ways. First, we're going to ask if the derived amino acids at these three sites are necessary for the shift in recognition to SRE. So this is the activation assay.
And the hypothesis is that if we reverse them to their ancestral state in SR2, we will restore binding to ERE and abolish binding to SRE. And indeed, that's exactly what we see. So going back to the ancestral state is sufficient to switch the specificity back to the ancestral function. And we can calculate the effects on the components of binding. This is now shown as the effects of the RH, the recognition helix, mutations on the free energy of binding macroscopically. The three mutations cause a 30,000-fold shift in preference. So these are very large-effect mutations. They do this by decreasing the affinity for the ERE half-site, and they have a much smaller beneficial effect on SRE affinity. So they increase SRE affinity by only a very small degree, but they have a major effect on ERE recognition. And they have no significant effect on cooperativity. So how do they work? Well, we looked at the crystal structures, and Dave Anderson, also in the lab, used those crystal structures as a starting point for molecular dynamics simulations to understand how they affect molecular interactions. First we can ask, how did these substitutions reduce affinity for the ERE? Well, there are two phenomena we need to look at. The first is that there are specific interactions with the ERE, like this hydrogen bond from the glutamate, which changes to a glycine in SR2, and those are abolished. So there's a loss of specific positive interactions. You can see that here: when Dave looked at the number of hydrogen bonds formed between residues on the recognition helix and the DNA, when the recognition helix mutations are introduced on ERE, there's a loss of about two hydrogen bonds across this evolutionary transition. And so you would think that the kind of mechanism that might confer this change would be a shift in positive interactions. You start with hydrogen bonds to ERE, and you switch to hydrogen bonds to SRE.
You do see a loss of hydrogen bonds to ERE, but there's something else that's important, which is that negative interactions are also introduced. So in the derived state... actually, let's just do it from right here. This is the best place to show you. The negative interactions in the derived protein are that it leaves hydrogen bond donors on the ERE with unsatisfied hydrogen bonding potential. You can think of this as a kind of electrostatic clash, where solvation is going to be favored over binding. So it's not a steric clash, it's more an electrostatic clash. And that's introduced when you try to put this protein on here. What about the effect of these substitutions on SRE binding? How is SRE affinity enhanced? Now this is really puzzling, because there are no new hydrogen bonds that are introduced. You can see it here: there's no increase in the number of hydrogen bonds. And when we look at van der Waals contacts, there's no increase in packing. So how do we explain the enhanced affinity of this protein for the derived response element? The answer is that there was an ancestral clash between SR1 and SRE, which is relieved by these mutations. Let me show you what I mean. The SRE has the AA in the response element, and on the other strand there are two Ts, which have these bulky methyl groups, shown right here. If we model in the ancestral SR1 with its glutamate, those bulky methyls clash sterically with that glutamate in its ancestral position. And when Dave looked at molecular dynamics simulations, what he saw is that if he tries to dock the protein with that glutamate on the SRE, that clash causes the glutamate to shift in space to relieve the clash. In turn, it then occupies the space of the lysine next door, which is hydrogen bonding to DNA, and that lysine shifts over by an average, across frames, of about two angstroms.
And it no longer forms the efficient hydrogen bonds to DNA that it used to. This steric clash is therefore not only associated with suboptimal packing, it's also associated with leaving both the Glu and the lysine unsatisfied in their hydrogen bonding potential. So what you have is an ancestral steric and electrostatic clash with SRE that's removed by these mutations. The take-home message from this is that the key determinants of specificity are not positive interactions but negative interactions. The shift in specificity evolves by negative mechanisms, that is, relieving ancestral negative interactions and introducing new negative interactions to different response elements, thus changing the affinity. Now I showed you only a backwards experiment, going back in the derived ancestor at these three key sites to the ancestral state. We'd also like to know: are these three historical substitutions sufficient to explain the evolution of the new specificity? The experiment there is putting the derived states into the ancestral SR1 background. Here we expect to be able to go from ERE specific to SRE specific. And what we get from this is a DBD that does not activate on either. We know that it folds, because we can express it really nicely. We know that it's expressed in the cells. It just doesn't activate. That indicates that there must have been permissive mutations during the same interval of time, because those RH mutations did occur during history and they're conserved in every descendant steroid receptor that's ever been sequenced since. So there must have been epistatic mutations that occurred along the same branch that allowed the protein to tolerate the recognition helix mutations.
So Jamie Bridgham in the lab did a ton of work looking at the other 35 substitutions along that branch, sorting her way through them and identifying 11 permissive substitutions in three structural groups, all of which are necessary for the RH substitutions to be tolerated. When we take those 11 permissive substitutions and introduce them along with the RH mutations, now we get an SRE-specific protein. So they have this rescuing effect on this non-functional version. They are authentically permissive, because when they're introduced on their own they do not change the specificity; this is still ERE preferring. They do increase the activation on both response elements non-specifically. As for where they're located, there's one that's on the dimer interface, there's one on the loop leading from the recognition helix up to the dimer interface, and there's one that contacts the DNA backbone just on the other side of the recognition helix. And then the other eight are in the C-terminus. There's this tail here, which is not resolved in the crystal structure, which is thought to bind the DNA backbone and possibly the minor groove. So we see a kind of evolutionary contingency again. Permissive mutations occurred which moved the ancestral protein into a region of sequence space where the function-switching mutations could be tolerated. The acquisition of those permissive mutations could not have been driven by selection for the new specificity, because they make no direct contribution to the specificity itself. So they may have occurred by drift, or they may have been selected for some other function. In either case, selection for the new function is not sufficient to have driven the evolution of the genotype by which the new function was conferred. So how did they do this job?
Well, yesterday I talked about this model for permissive mutations, the idea being that function-switching mutations may destabilize proteins, so permissive mutations might confer excess stability on them, allowing destabilizing mutations to be tolerated. So we tested that by measuring denaturant-induced unfolding. The 11 permissive substitutions and the RH substitutions have no effect on stability, so that's wrong. What about a possible effect on binding affinity? As I mentioned, the RH mutations reduce the half-site affinity of the protein for the DNA. So it could be that the permissive substitutions non-specifically increase affinity. And that's precisely what they do. Introduce them on their own, and they increase affinity without changing specificity. Add the RH mutations to this background, and you now have affinity that's comparable in magnitude to the ancestral affinity. So they offset the deleterious effect on the stability of the complex. They do this by increasing single-site affinity (this is the delta-delta G of the permissive mutations), and they also affect cooperativity. So they play a key role in that offsetting process of moving to lowered half-site affinity but higher cooperativity. So we can think of the model this way. If we're interested in the stability of the complex of protein and DNA, you have an ancestral protein, and you have specificity-switching mutations that reduce the affinity to such a degree that this is a non-functional complex. The permissive mutations, however, increase that affinity, allowing the function-switching mutations to yield a functional protein with a different affinity. And notice there's no epistasis here at the level of affinity. The epistasis is at the level of function, because there's a non-linearity in the relationship between affinity and activation.
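The point about epistasis arising at the level of function rather than affinity can be illustrated with a toy model. Everything numerical here is hypothetical: the free energies, the midpoint, and the sigmoidal mapping from binding free energy to activation are assumptions chosen only to show the shape of the argument, not measurements or the model used in the work described.

```python
import math

RT = 0.593  # kcal/mol at roughly 298 K


def activation(dg_bind: float, dg_mid: float = -8.5) -> float:
    """Assumed sigmoidal map from binding free energy to activation (0 to 1).

    More negative dg_bind means tighter binding and higher activation;
    dg_mid is the hypothetical free energy giving half-maximal activation.
    """
    return 1.0 / (1.0 + math.exp((dg_bind - dg_mid) / RT))


# Hypothetical free energies (kcal/mol); effects are additive by construction.
DG_ANCESTOR = -10.0
DDG_RH = +3.0          # recognition-helix substitutions weaken binding
DDG_PERMISSIVE = -3.0  # permissive substitutions strengthen binding

dg = {
    "ancestor": DG_ANCESTOR,
    "RH": DG_ANCESTOR + DDG_RH,
    "permissive": DG_ANCESTOR + DDG_PERMISSIVE,
    "RH+permissive": DG_ANCESTOR + DDG_RH + DDG_PERMISSIVE,
}
act = {name: activation(value) for name, value in dg.items()}


def epistasis(values: dict) -> float:
    """Deviation of the double mutant from additivity of the single mutants."""
    return (values["RH+permissive"] - values["RH"]
            - values["permissive"] + values["ancestor"])


epistasis_dg = epistasis(dg)    # zero: free energies add
epistasis_act = epistasis(act)  # non-zero: activation saturates
print(epistasis_dg, round(epistasis_act, 2), {k: round(v, 2) for k, v in act.items()})
```

In this sketch the RH substitutions alone push the complex below the activation threshold, while the same substitutions on the permissive background land back in the functional range, even though their effect on free energy is identical in both backgrounds.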
So based on their effects on the energetics of binding and their location on the structure, we hypothesize that these permissive mutations, shown here in orange, enhanced non-specific interactions with the DNA, such as with the DNA backbone here and here, and enhanced the dimer interface, therefore enhancing cooperativity, through this substitution and this substitution. When we dissect their individual effects on the components of binding, that assignment is corroborated. I'm not going to show you those results here. But the take-home picture here is that novel specificity for DNA evolved by this really complicated mechanism. You might think you would start with positive interactions and end up with a derived state with exchanged positive interactions. But instead, you've got this ancestral state with positive interactions and negative interactions. You take those positive interactions and abolish them, and instead introduce negative interactions to the ERE. You don't create any new positive interactions to SRE, but you offset that with cooperativity and non-specific binding. So it's quite complicated. So why not this simpler mechanism of going from one hydrogen bond to a different hydrogen bond? We speculate that this is because of the mutational accessibility of the two different classes of solutions. If you wanted to create new sequence-specific positive interactions to SRE and reduce affinity for ERE, your substitutions would have a hard bunch of jobs to do. They would have to create new positive interactions. They would be constrained by volume, because they couldn't create a steric clash. They would be constrained by charge, in order to form a hydrogen bond to the DNA. They would have to have the right geometry: they would have to be at the right distance and angle from whichever DNA atoms they're interacting with to form a hydrogen bond.
And they would have to not interfere with interactions of neighboring residues and bases. So that's really, really hard to do. On the other hand, what about introducing new sequence-specific negative interactions like these, or abolishing these negative interactions in the ancestral state? The constraints are much looser. You need some kind of a clash. You could have an incompatible volume, or you could have an incompatible charge, or you could have incompatible geometry. The only thing you have to do is preserve the neighboring interactions that maintain the sort of generic form of binding to the other response element. So our hypothesis is that the reason evolution took this circuitous pathway was because there were no mutational solutions in this category, whereas there are lots of ways to do this and there are lots of ways to make the permissive mutations that would compensate for them. So evolution here might give us some insight into possibilities for designing proteins with desired functions as well. All right, the last chapter that I want to tell you about is Tyler Starr's work. It's not yet published. Why did evolution produce the derived genotype at the recognition helix and the permissive mutations? So to answer this question about why, we have to compare what did happen to other possible histories that could have happened. So was the historical outcome the only way? Was it the best way? Was it the most accessible way in sequence space to achieve SRE specificity from the ancestral SR1? We want to know what was the topology of the sequence space over which the recognition helix sites evolved, and how did that topology, by which I mean the distribution of functions over sequence space, shape the evolvability and robustness of the protein? How did the permissive mutations affect the trajectories that were available to the protein to get from ERE to SRE specificity?
And so to do that, Tyler developed a method for deep mutational scanning of these ancestral transcription factors to determine the DNA specificity of all possible combinations of all possible amino acids at the variable recognition helix sites, and he did this in the ancestral SR1 background both with and without the permissive mutations. So here's what he did. We're trying to do a multi-dimensional characterization of sequence space where we get all possible combinations of all 20 possible amino acids, because that's the only way to find the various possible trajectories through sequence space. You can't do 20 to the N, where N is the full length of the protein, because that's an astronomical number of genotypes. So what Tyler did was he looked at all possible combinations at four key sites in the recognition helix: the three that changed during evolution plus one more, that lysine which is involved in hydrogen bonding to the DNA and which is variable in some nuclear receptors between K and R. So these are the four variable sites in the recognition helix, including the three that confer the change in specificity. And here's the assay. So he produces this library. He clones it into an expression vector with the Gal4 activation domain and puts this into yeast with an integrated reporter construct consisting of GFP that's driven by the ERE. So effective DBD variants will cause GFP to light up. He sorts this library of transformed yeast using FACS and he partitions it into four bins. And then he uses Illumina sequencing on each bin to determine the frequency of any given genotype in the four bins. And from that distribution we can calculate the mean fluorescence of each variant. So this gives us a bulk way of quantitating the activation effects of every possible variant in this library. And he did this with both an ERE reporter strain and an SRE reporter strain.
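The step from per-bin sequencing counts to a mean fluorescence per variant can be sketched roughly as follows. This is only a simple weighted-average illustration, not the estimation procedure actually used in the work; the function name `mean_fluorescence`, the per-bin cell counts, and the per-bin mean fluorescence values are all hypothetical.

```python
from typing import Optional, Sequence


def mean_fluorescence(
    variant_reads: Sequence[float],    # reads of this variant in each sort bin
    total_reads: Sequence[float],      # total reads sequenced from each bin
    cells_sorted: Sequence[float],     # number of cells sorted into each bin
    bin_fluorescence: Sequence[float], # representative (e.g. mean log) fluorescence of each bin
) -> Optional[float]:
    """Estimate a variant's mean fluorescence from its distribution across bins.

    Reads are first converted to an estimated number of cells per bin, so that
    bins sequenced more deeply do not dominate the average.
    """
    est_cells = [
        reads / total * cells
        for reads, total, cells in zip(variant_reads, total_reads, cells_sorted)
    ]
    n = sum(est_cells)
    if n == 0:
        return None  # variant was not observed in any bin
    return sum(c * f for c, f in zip(est_cells, bin_fluorescence)) / n


# Hypothetical four-bin sort with made-up numbers.
totals = [1e6, 1e6, 1e6, 1e6]
cells = [5e5, 5e5, 5e5, 5e5]
bins = [2.0, 3.0, 4.0, 5.0]

inactive = mean_fluorescence([40, 0, 0, 0], totals, cells, bins)
active = mean_fluorescence([0, 0, 10, 30], totals, cells, bins)
print(inactive, active)
```

A variant found only in the dimmest bin gets that bin's fluorescence, and a variant split between the two brightest bins gets a value between them, weighted by where its cells fall.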
Multiple replicates show good correlation from one replicate to the next on both response elements. Most of the scatter is on inactive genotypes anyway. He tested the method for getting from the distribution to the estimate of mean fluorescence by comparing that estimate to the measured mean fluorescence of isogenic clones, and he finds a very nice correlation. And further, he experimentally characterized the affinity of a number of these variants for DNA and finds that the mean fluorescence in the FACS assay nicely predicts, through this threshold relationship, the binding affinity for DNA. So this actually gives us a nice readout primarily driven by affinity. Okay, so here's what he finds. This shows you the active genotypes in the recognition helix library in the ancestral SR1 background. On this axis you have activation on SRE, on this axis you have ERE, and each dot is a protein variant. These are the stop-codon-containing variants. So this is the distribution of fluorescence from known non-activating proteins. And I'm putting on here, in purple, the level of activation by ancestral SR1 on ERE. This is our baseline, along with ancestral SR1 with the permissives and the RH on SRE. And that allows us to classify variants into the following categories. We can color as purple the ERE-specific genotypes, which are as good as the ancestral state on ERE but do not activate on SRE. Green is the SRE-specific ones, which activate on SRE as well as the derived genotype did but don't activate on ERE. And then in blue are the promiscuous ones, which activate above null on both. So what do we see? There are alternative solutions. There are 42 recovered genotypes which are as specific as SR1. There are 41 SRE-specific genotypes as good as the derived genotype, and there are 45 promiscuous ones. So the historical outcome was not unique, and it's also not optimal, because here's the historical outcome and all of these are better on SRE and give no activation on ERE.
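The three-way classification described above can be written down as a small decision rule. This is a sketch under assumptions: the function name `classify`, the numeric cutoffs, and the use of a single null cutoff derived from the stop-codon variants are hypothetical choices for illustration, not the thresholds used in the actual analysis.

```python
def classify(ere: float, sre: float, null_cutoff: float,
             ere_reference: float, sre_reference: float) -> str:
    """Assign a variant to a functional class from its activation on ERE and SRE.

    null_cutoff   : upper bound of activation by known non-functional variants
                    (e.g. stop-codon variants); assumed equal for both reporters.
    ere_reference : activation of the ancestral protein on ERE.
    sre_reference : activation of the derived protein on SRE.
    """
    if ere >= ere_reference and sre <= null_cutoff:
        return "ERE-specific"
    if sre >= sre_reference and ere <= null_cutoff:
        return "SRE-specific"
    if ere > null_cutoff and sre > null_cutoff:
        return "promiscuous"
    return "unclassified"  # inactive, or active but weaker than the references


# Hypothetical activation values on an arbitrary fluorescence scale.
NULL, ERE_REF, SRE_REF = 1.0, 3.0, 3.0
print(classify(3.5, 0.8, NULL, ERE_REF, SRE_REF))
print(classify(0.7, 4.1, NULL, ERE_REF, SRE_REF))
print(classify(2.0, 2.5, NULL, ERE_REF, SRE_REF))
print(classify(0.5, 0.6, NULL, ERE_REF, SRE_REF))
```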
The historical residues that make up the genotype are neither unique nor optimal. So here Tyler has shown the representation in the ERE-specific genotypes of each residue at each position. This is basically a position weight matrix. And here are the historical ones, E, G, K and A. They're neither the only residues to confer ERE specificity nor the most frequently associated with it. For the historical outcome, it's even more so. G comes in third, S you can't even see, K is second, V comes in third. So the genotype isn't made up of optimal amino acids. Here's the sequence space of those. They're colored by their functionality. They're connected if they are accessible by single-nucleotide mutations. If all of their nucleotide neighbors are non-functional, they appear as islands. This ancestral space is somewhat fragmented. You have possible ancestral starting points that can't get anywhere and you have possible derived functions that can't be accessed. But there is a large connected sequence network in the middle, including the ancestral protein. And that ancestral protein could have accessed a number of different green nodes through a continuous walk through sequence space, such as this one. But it's five steps away, which is more than the historical pathway that was taken. And the historical outcome is not accessible at all because the permissives are not here. So the historical outcome, GSKV, is not in this network at all. So there are ways this could have been done without the permissive mutations, but not the historical outcome without the permissive mutations. So then Tyler wanted to know what was the effect of the permissive mutations on the structure of the sequence space. Same type of library, but now he makes it in the background that includes the permissive mutations, and does the same two sorts for ERE and SRE activation. Way, way more genotypes become functional.
About a three-fold increase in the ERE specifics, what is that, a seven-fold increase in promiscuous, and a much bigger increase in the number of SRE-specific genotypes. So the permissive mutations provide a general increase in the number of ways of achieving the derived function. Look at it a different way. This is a scatter plot of activation on the ERE; each point is a genotype, and it's categorized by its activation in SR1 without the permissives and with the permissives. Up here are the ERE-specific proteins. You see there are a large number of alternative genotypes in the network of proteins, associated with more robustness of the ancestral function when the permissives are there. On the SRE there are a huge number of inactive proteins that become accessible, or that become able to confer the SRE-specific function. So what does this say about the specificity of permissive epistasis? First, specificity is limited. The permissive mutations permitted a huge number of different genotypes that could confer SRE specificity. But they're somewhat specific with respect to the response element target. So they're much better at permitting the evolution of the new function than they are at permitting genotypes to be tolerated in the ancestral function. So this is a kind of third-order epistasis between the permissive mutations in the protein, the function-switching mutations in the recognition helix, and the sequence of the DNA that they're binding to. I'm going to skip that. Okay. Here's the sequence space of the permissive background. The ancestral network, when we include permissive mutations, is very large and it's densely connected. There are now only two little islands out here consisting of three genotypes, and you can get from anywhere in here to anywhere else through a continuous walk. You can see where the ancestral protein is. It can get in three steps from the ancestral genotype to the historically derived genotype.
And there are also many other ways of doing this, including single-step trajectories where we can get in one amino acid change, actually one nucleotide change, from an ERE-specific protein to an SRE-specific protein. So the historical trajectory is neither unique nor optimally accessible, and it's not optimally functional either. What about robustness and evolvability? I'm going to show you, I think, just three characterizations of this and then we're done. The 11P substitutions increase the general evolvability of SRE specificity from ERE specificity. Before, I was showing you the trajectories from SR1 to the derived RH genotype. This is now looking at the accessibility of SRE from any ERE-specific starting point in the network. So these are the general characteristics of how this transition could occur in this network of sequences. So what you have here is how many steps are required to get to the nearest SRE-specific node, and the distribution of those steps across all the possible ERE-specific starting points in the network. So each purple node gets tossed into one of these bins based on how close it is to SRE specificity. And you can see that the largest category of ERE-specific starting points can't get to SRE under any circumstances. And then of those that can, the mean is about four steps. When you include the permissives, now there are only two ERE-specific starting points that can't get to an SRE, and the mean path length is only two. So the permissives increase the evolvability of this function from ERE specificity. The historical path was not uniquely short. It took three steps. There are many pathways within this network that involve one or two steps. So this was not the easiest way to evolve it. In the ancestral background without the permissives, every pathway from an ERE to an SRE involves a promiscuous intermediate. There's no way to get straight from purple to green without going through blue. Introduce the permissives.
We now no longer need promiscuous intermediates. Now only about half of the pathways involve a promiscuous intermediate. An equal number involve a discrete switch from purple straight to green. And then there are quite a few ERE-specific starting points that have equally short paths, one of which goes through a promiscuous intermediate and others that are discrete. So the evolutionary pathways, because of the permissive mutations, no longer have to go through intermediates with these different promiscuous functions. In this ancestral space without the permissives, the majority of purple nodes, in order to get to a green node, need one or more neutral steps where they stay purple, where they don't change from ERE specificity. So "true" means that the path to an SRE has a contingent step, meaning it requires this neutral exploration of the purple part of the network. And then these can go straight from their starting point through promiscuous intermediates to an SRE. Incorporate the permissives and we now have a lower dependence on neutral steps through the network. A much larger proportion of ERE-specific nodes are direct neighbors of either promiscuous or SRE-specific genotypes. So this sort of neutral percolation through the network as a prerequisite becomes unnecessary. Okay, so take-homes from this chapter: the permissive mutations are not necessary for SRE specificity to have evolved, but they enable many more ways of evolving SRE specificity. They increase the evolvability by shortening the minimum length of trajectories from ancestral to derived. They remove the dependence on promiscuous intermediates. They reduce the contingency of evolution on neutral steps. So we can conclude that the historical trajectory was not the only or the shortest path to SRE specificity from the ancestral RH, but it is perhaps the one that just happened to occur by chance.
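The network analysis described in this chapter, genotypes connected by single-nucleotide mutations and path lengths counted to the nearest SRE-specific node, can be sketched with the standard genetic code and a breadth-first search. This is an illustrative simplification, not the analysis pipeline from the work: adjacency is decided at the amino acid level (two residues are neighbors if any pair of their codons differs by one nucleotide), and the functional classes assigned to the intermediate genotypes in the toy example are hypothetical. Only the endpoints, EGKA as ERE-specific and GSKV as SRE-specific, come from the talk.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO_ACIDS[i]
               for i, c in enumerate(product(BASES, repeat=3))}


def single_nucleotide_pairs():
    """All ordered amino acid pairs reachable by one nucleotide change (stops excluded)."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        for pos, base in product(range(3), BASES):
            mutant = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if mutant != aa and "*" not in (aa, mutant):
                pairs.add((aa, mutant))
    return pairs


AA_PAIRS = single_nucleotide_pairs()


def are_neighbors(g1: str, g2: str) -> bool:
    """Genotypes are neighbors if they differ at one site by a single-nucleotide change."""
    diffs = [(a, b) for a, b in zip(g1, g2) if a != b]
    return len(diffs) == 1 and diffs[0] in AA_PAIRS


def steps_to_nearest(start: str, phenotype: dict, target: str = "SRE"):
    """Breadth-first search over functional genotypes; returns the step count or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if phenotype[node] == target:
            return dist
        for other in phenotype:
            if other not in seen and are_neighbors(node, other):
                seen.add(other)
                queue.append((other, dist + 1))
    return None  # no SRE-specific genotype is reachable through functional nodes


# Toy network: the classes of the two intermediates are hypothetical.
toy = {"EGKA": "ERE", "GGKA": "promiscuous", "GSKA": "promiscuous", "GSKV": "SRE"}
print(steps_to_nearest("EGKA", toy))
```

Only functional genotypes appear in the `phenotype` dictionary, so removing an intermediate from it is the same as making that step non-functional and can disconnect a starting point from any SRE-specific node.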
So take-home messages: the new specificity for DNA evolved by a few large-effect mutations that affected exclusionary, not positive, interactions, but a large number of permissive mutations were required to make them tolerable. There was massive epistasis that restricted some evolutionary pathways but opened others, and therefore the historical outcomes and trajectories were not unique. They were not optimal. They were not the most accessible. They were simply the ones that appear to have occurred. So those epistatic mutations during evolution restructured the topology of sequence space. They changed the evolvability and robustness of the protein and the accessibility of many possible outcomes. So, the people who did the work: Alicia did all that biochemistry. Tyler did the characterization of sequence space. Jamie and Geeta did the computational work. Jamie did the cell culture-based work, and Victor is the computational guru who implemented all the methods. So thank you very much for attending this, the last lecture. If we have a few minutes, I would be very happy to take some comments, and thank you again.
I'll squeeze in my first question here. This assumes that the DNA sequence does not make a stepwise change, right? So you're taking the double nucleotide to the other, and so how would you speculate your analysis would change if there were a one-step change along the DNA sequence, going from AA to AT to TT? How do you mean, what would be the determinants of... How would the evolutionary pathway of the protein change if the DNA sequence is also undergoing a change at the same time? We actually know that, because there's an alternate, slightly suboptimal, I think SRE-STAR, or SRE-2, we call it, which differs from ERE only by one nucleotide. And the three substitutions are still necessary to confer that change, and the permissive substitutions are still necessary to confer that change. I'm sure there are other...
Oh, in fact, we know, there's a paper published in eLife last year, which is looking at the sequence space of all combinations of ancestral and derived amino acids at the three sites and all 16 combinations of response element, so all the neighbors in the DNA. So it's all 128 possible complexes. And the path to the new specificity of the SRE and the variant SRE still requires all three substitutions. Maybe I understood you correctly. I mean, there are two things that you did. You assumed all the permissives first, and then the substitution. Actually, they would have been intervened. So the reason why... You mean the permissives and the... Substitutions in the permissives, right? And that is why nature may have actually taken the three-step, instead of the optimum two-step. I mean, maybe there were two permissives, a few permissives that took place, and then some few substitutions, and then a few permissives, and that may have been the two trajectory. It's possible. Now, here's what we know. We know that we need all the groups of permissive substitutions to permit the... I think it's to permit any... I have to look at the data, but I think the answer is to permit any of the three historical substitutions to have occurred. There are actually only certain pathways through the three-substitution space. Many of those are non-functional with or without the permissives. So I think we actually need the majority, the vast majority, of the permissives to be present for any of those pathways. But you're right that there are a huge number of possible pathways through the three plus 11 substitutions, and we haven't explicitly characterized all of them. That's right. So we know we can characterize some of those pathways, and many of them are dependent on having all the 11P; some of them probably are not. I think that's true. One way to postulate would have been to assume that nature should have taken an optimal pathway.
I suppose you were saying that there are many possible ways and what actually occurred is not necessarily the optimal one. These are two steps, the three steps. I mean, assume that the three step was somehow the optimal one that could have happened kind of from that back and figured out which is the only kind of sequence that would have happened that would have given the three step that could have prevented the two steps from happening. Yeah, I think I was answering a different comment. I'm sorry for that. So Tyler's characterization of sequence space doesn't get at the ordering of the 11P versus the three. What it gets at is the fact that there are many other genotypes which are direct neighbors in the presence of the 11P, and there are many other genotypes that you can get to from SR1 with far fewer than 11 substitutions or 14 substitutions. So the point is simply that when we look at the entire sequence space, not just the pathways from historically ancestral to historically derived, there are many ways that this could have been done, and there is nothing unique or optimal, functionally or in accessibility, about the pathway that was taken. Yeah. Do some of the optimal mutations that you see that would permit you to get to a function-switching state, do they appear on the surface of the protein, which would enable other factors to interact or not interact? And perhaps that's why they did come in, or is it facing the DNA and nothing else is perturbed at the surface? Are you asking about the alternate solutions in the sequence space? Those are all in the recognition helix that's buried in the major groove. So they're all involved in contact with the major groove. If you were to just design the molecule, if you were to take the DNA binding domain and now make the three switches, do you still need the permissive mutations or can it be tolerated? I'm not a protein designer, so I haven't tried to do that, and if I did I'm sure it would be a disaster.
But I think what Tyler's work shows is if you include the permissives, the effect of which is to non-specifically increase the macroscopic affinity, it buys you lots of degrees of freedom in evolvability and, by analogy, design, I would expect. So that might be a useful strategy in designing transcription factors: to enhance cooperativity and non-specific contacts as a starting point. John. Your ideal case would have been saturation mutagenesis across the whole group, but you chose four for practical reasons. If you chose three, two, or one, use the subset or eliminate one of them now. So there are many three-step. The histograms I showed you showed that there are many more three-step paths with permissives than without. There are many more two-step paths with them than without. And there are some one-step as opposed to no one-step paths. So I think the qualitative conclusions, the sort of take-home messages, would be robust to that sort of jackknifing approach. So I mean they're well resolved given the molecules that are in the crystal. So would a longer piece of DNA look different? It's certainly possible. Neither we nor people who've worked on the crystal structures of the extend complexes have been able to do it with much longer pieces of DNA for technical reasons. I have no idea. Part of the analysis, you kind of reduce it back to like a single number. And so I'm wondering if you've considered that there might have been a large ancestral genome, a bunch of ancestral proteins around. And some of them could have accumulated mutations and been free to go in this other one test. The reason why you reduce it down to a single thing is because the ancestral genome could have been just as big as today's genes. Do you mean a single binding site or a single protein? A single protein. So this is the protein that historically evolved this function. So the project begins by trying to reconstruct how what had happened happened. So now help me understand the question.
No, this is a post-duplication event. You have one that maintains the ancestral ERE specificity and another that diverges to SRE specificity. And there are other members of the family, the larger family, nuclear receptors, that have different binding specificity. The next closest guys bind an extended half site rather than a dimer, and it's longer than six base pairs. There's all sorts of other stuff going on. Whether that created some sort of physiological or selective context that maintained the ERE specificity in one in the ER lineage and drove or allowed the evolution of SRE specificity in the other. I don't know because I can't reconstruct an ancestral organism. So I don't know the physiological or genomic context for why it happened. It is what happened.