All right, so I would like to thank you all for being here for this SIB virtual seminar. Today we have the pleasure of hosting Joshua Payne, who is an assistant professor of computational biology at the Institute of Integrative Biology of the Department of Environmental Systems Science at ETH Zurich. Joshua did his undergrad in mathematics and computer science at Regis University in Denver, Colorado, his master's in operations research at the Rensselaer Polytechnic Institute in Troy, New York, and he obtained his PhD in computer science in 2009 at the University of Vermont. Then from 2009 to 2011 he was a postdoctoral researcher in the Computational Genetics Laboratory at Dartmouth College, and from 2012 to 2015 he was at the Department of Evolutionary Biology and Environmental Studies at the University of Zurich, where he continued as a junior group leader funded by an SNSF Ambizione grant. Since 2017 he has been an SNSF assistant professor at ETH Zurich, and since 2019 he has also been a group leader at the SIB, the Swiss Institute of Bioinformatics. Joshua is very interested in evolution in vivo and in silico. His current focus is on gene regulation, particularly at the level of transcription, as well as genotype-phenotype maps and networks. The group is therefore also interested in understanding the design constraints, robustness and evolution of gene regulatory systems, particularly at the level of transcription, using both modeling and data-driven approaches. So today Josh will tell us more about empirical genotype-phenotype maps of transcriptional and post-transcriptional regulation. So thank you again, Josh, for accepting our invitation, and the floor is yours.

Thank you for the kind introduction and for the invitation. Thank you to everyone who's here in the real world and to everyone who is watching virtually. Right, so the title of the talk is empirical genotype-phenotype maps of transcriptional and post-transcriptional regulation. I'm broadly interested in evolution. 
I happen to have this emphasis on gene regulation right now, particularly at the level of transcription, although as the title suggests I've also done a bit of work at the post-transcriptional level, and I'll talk about that a bit today. I think that this interest in gene regulation and the evolution of gene regulation really comes from a fascination with the kinds of evolutionary adaptations and innovations that are caused by DNA sequence changes that affect gene regulation. And we have loads of examples of this, right? We have examples of pigmentation patterns that have changed via DNA sequence changes that impact gene regulation, including important evolutionary innovations such as the formation of eyespots on butterfly wings, which helps butterflies to avoid predators. We also have examples from morphological evolution, right? So we have examples of changes in the helmet structures of treehoppers, body armor in fish, and entire body plans in invertebrates. These are just a small number of examples. There are also many examples in physiology and behavior where we have real adaptations and innovations that have come about by mutations that affect the level, timing or location of gene expression. And all of these examples serve to highlight the evolvability of gene regulation, and by that I mean the ability of mutation to bring forth different phenotypes, some of which may be adaptive. But of course not all of these changes in phenotype are adaptive, right? So mutations that affect when, where and to what extent genes are expressed are also commonly implicated in disease. A classical example is polydactyly in humans, where you're born with extra digits, and this is caused by a mutation in a DNA sequence that affects gene expression. This particular kind of sequence is called an enhancer. 
And this is clearly not the worst disease that you can be born with, but these kinds of mutations that affect gene regulation are also implicated in truly devastating diseases, including many cancers. And so just as it's important for mutations that affect gene regulation to be able to bring forth phenotypic variability, it's also important that gene regulation is robust to such mutations. And for the past several years I've been studying this interplay between robustness and evolvability in gene regulation. In this talk I'd like to give an overview of a series of studies that I've conducted over this period of time. More generally, my approach to studying evolution, and in particular the evolution of gene regulation, has three parts. One is purely theoretical, so I work with computational models of gene regulatory systems and I use these models to try to answer questions that we can't currently address using experiments. The second approach of my research program is data driven, so I work with publicly available functional genomics data, such as data that describe when and where regulatory proteins bind the genome to affect gene expression, actual measurements of gene expression in terms of the abundance of RNA molecules, and also measurements of chemical modifications of DNA and histones that cause or are caused by gene regulation. And the third approach of my research program is to collaborate with experimentalists. So I'm a computer scientist by training, and I wouldn't want to be part of a lab that would let me in it. But I do work closely with some experimental groups; in particular I work with Yolanda Schaerli, who's here at the University of Lausanne. And this is a really nice opportunity to bring some of the ideas that we have from our theoretical analyses into the lab. 
It also provides us an opportunity to help them make sense of some of the experimental results that they're generating by complementing those results with the kinds of mathematical and computational techniques that we're familiar with. But for the purposes of today's talk I'm actually just going to focus on a series of studies that come from the data-driven side of my research program. And what I'd like to highlight with these studies is how we used some old ideas from evolutionary theory to make sense of how gene regulatory systems are robust and evolvable, and how in so doing we were able to reanimate those old ideas. The first idea has its roots in a letter that John Maynard Smith wrote to Nature in 1970, called Natural Selection and the Concept of a Protein Space. This letter was actually a response to a letter that was written a year earlier by Frank Salisbury, in which Salisbury put forward what he thought was a problem in evolutionary theory. And I think that this purported problem is summed up well by this couple of sentences here, where Salisbury says: if life really depends on each gene being as unique as it appears to be, then it is too unique to come into being by chance mutations. There will be nothing for natural selection to act on. And he sets up this problem by considering a hypothetical protein. This hypothetical protein is encoded by a nucleic acid sequence of length 1,000, and since the DNA alphabet has four letters, there are four to the power of 1,000 possible DNA sequences of this length, which is roughly 10 to the 600, which exceeds the number of carbon atoms in the observable universe by several hundred orders of magnitude. 
And Salisbury said, well, if functional sequences are so rare, then even if one of these functional sequences were to come into being by chance, natural selection would have nothing to work on, because if you mutate that sequence it's highly unlikely you're going to get another one of these exceedingly rare functional sequences. And Maynard Smith said, well, you know, in this argument there's this implicit assumption that those sequences that are functional are just distributed randomly throughout the space of all possible sequences, but sequence space might actually not be organized that way. And he put forward this argument using a really simple and effective word game. This word game works as follows. You take two words from the English language, in this case "word" and "gene"; they have to be the same length. And you try to convert one word into the other by a series of single-letter changes such that all of the intermediate words are also members of the English language. All right, so you can change word into wore, wore into gore, gore into gone, and finally gone into gene. And the analogy with evolution, I think, is straightforward. So if you're talking about proteins, the letters represent amino acids, the single-letter changes represent amino acid substitutions, and this requirement that the intermediate words are also members of the English language is akin to the requirement that the protein is somehow functional or has some particular phenotype. And I think this went a long way to emphasize Maynard Smith's point that once you have a functional sequence, you're highly likely to have other functional sequences in the mutational vicinity of that one functional sequence. But I also think this example helps to put you in Salisbury's mindset. If you think about this, the English language has 26 letters, so that means there are 26 to the 4 possible four-letter words. That's 456,976 possible four-letter words. According to Scrabble there are only 4,175 real English words of length 4. 
So less than 1% of the space of all possible words of length 4 is actually populated by English words. So it's not trivial that you should just be able to draw these paths. I don't think Salisbury was crazy for saying what he said. I think that it was a reasonable thing to think. And the reason that Salisbury was wrong is that language, like biological sequence space, exhibits a correlation structure. So once you have a sequence that has a particular function, it's likely that it's going to be surrounded mutationally by other sequences that have the same or a similar function. In English, if you look for instance at the words around "word" and the words around "gene", again according to Scrabble, you see there are these clusters of words. I think what else is surprising about this particular figure is that these are all English words. These are not necessarily colloquial English words, but they're technically correct. So a sene is a monetary unit of Samoa, for instance. A sord is the collective noun for mallards. You might, I don't know, at your next cocktail party, put that to use. Right, so John Maynard Smith postulated what I would call genotype networks. So networks of genotypes, networks of sequences that all have the same phenotype. In Maynard Smith's example that phenotype was that the protein was functional, but you could be more general about that. So he postulated that these genotype networks would exist, that they would populate the space of possible genotypes, and that this would help to facilitate evolution, because once you had a sequence that was somehow functional, then in fact natural selection would have something to work with, because mutations would be likely to create other functional sequences. 
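As a rough illustration of the word game described above, here is a minimal Python sketch that searches for a chain of single-letter changes between two words. The function name `word_ladder` and the tiny dictionary are assumptions made for illustration only; the dictionary is not the Scrabble word list mentioned in the talk.

```python
from collections import deque
from string import ascii_lowercase


def word_ladder(start, goal, dictionary):
    """Breadth-first search for a shortest chain of single-letter changes
    from start to goal in which every intermediate word is in dictionary.
    Returns the chain as a list of words, or None if no chain exists."""
    words = set(dictionary)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        # Try every single-letter substitution at every position.
        for i in range(len(current)):
            for letter in ascii_lowercase:
                candidate = current[:i] + letter + current[i + 1:]
                if candidate in words and candidate not in seen:
                    seen.add(candidate)
                    queue.append(path + [candidate])
    return None


# Hypothetical toy dictionary, chosen so that the chain from the talk exists.
toy_words = ["word", "wore", "gore", "gone", "gene", "ward", "cord"]
path = word_ladder("word", "gene", toy_words)

# Size of the space of all four-letter strings over a 26-letter alphabet.
total_four_letter_strings = 26 ** 4
```

With this toy dictionary the search recovers the chain word, wore, gore, gone, gene, and `26 ** 4` reproduces the 456,976 figure quoted above.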
But it actually took more than two decades to have any kind of real validation of Maynard Smith's idea, and that first came about in a computational model of a biological system, in this case a computational model of protein folding. So you have amino acid sequences, that's your genotype, and then your phenotype is the particular tertiary structure that the sequence folds into. And these genotype networks were found to populate the space of possible proteins. This was then found in a variety of other systems, all computational models of biological systems, and people started to think about other implications of the existence of these so-called genotype networks. And there are two implications that are relevant for today's talk. The first is mutational robustness. So what I'm showing you here is a genotype network, and this box is the space of all possible genotypes. Each vertex in this network is a genotype, so it's a sequence, and edges connect vertices if their corresponding sequences differ by a single small mutation, such as a point mutation. And I've colored the vertices according to their phenotypes, so these all have some phenotype that's indicated by the color black. So you have all of these different genotypes that are mutationally interconnected with one another, and they all have the same phenotype. And this has an implication for mutational robustness, because if you were to take one of these genotypes and mutate it, there's a chance that you're going to get another sequence, another genotype, that's also on the genotype network, which means it also has the same phenotype. So the phenotype would be robust to that particular mutation. So the existence of genotype networks confers robustness on the genotypes those networks harbor. 
The second implication of the existence of genotype networks is for evolvability, by which I mean the ability of mutation to bring forth phenotypic variation, some of which is adaptive. And the reason that the existence of genotype networks has implications for evolvability is that these genotype networks do not exist in isolation in genotype space. Rather, genotype space is populated by many genotype networks of different phenotypes, and these genotype networks interface and overlap with one another. So as a population spreads out neutrally on one of these genotype networks, subsequent mutations can create new phenotypes. Since these genotype networks spread throughout the space of possible genotypes and bump into the genotype networks of other phenotypes, mutation can bring forth phenotypic variability. So the existence of genotype networks has implications for robustness and evolvability, and it also helps us to understand how these two properties can be synergistic. So how can some system be simultaneously robust and evolvable? This particular question is one that had been addressed in computational models of biological systems, but never in experimental data. And the reason, getting back to Salisbury's argument, is that the space of possible sequences is so vast that if you wanted to characterize this space experimentally, it would simply be impossible, even for small macromolecules. However, we were at this time studying gene regulation, and we thought, well, maybe one way to get around this would be to study a sub-component of a larger biological system. That larger biological system is a gene regulatory circuit, and these are important for driving expression patterns that embody crucial biological functions in development, physiology and behavior, and the important sub-component is the interactions between regulatory proteins called transcription factors and DNA. 
So these proteins are sequence-specific DNA binding proteins that bind DNA to regulate gene expression, either by recruiting RNA polymerase or by getting in the way of RNA polymerase. And importantly, the strength of this binding event is directly related to the regulatory effect, to its activating or inhibitory effect. And in studying gene regulation, we were aware of some data sets that exhaustively characterize the binding preferences of these proteins to all possible DNA sequences of a given short length, specifically all sequences of length 8. So you can think of this as a mapping between genotype and phenotype, where the genotype is a DNA sequence and its phenotype is whether or not it binds a particular regulatory protein. You can also think of the phenotype as the actual strength with which it binds a regulatory protein. So just to give you a feel for these data, all these data come from a technology called protein binding microarrays. This is a chip-based technology, and on these chips you have probes. In each probe you have double-stranded DNA sequences that are chosen in such a way that every single DNA sequence of length 8 is represented on the chip at least 16 times. And so you can get an assessment of the affinity with which a regulatory protein binds to every single possible DNA sequence of length 8 by looking at the fluorescence intensity of the spots that contain the DNA sequence relative to the spots that do not contain the DNA sequence. What's important for today's talk is just to know that for every single DNA sequence of length 8 you have a measurement of binding strength, the binding affinity to that sequence. And you have such data for a large number of transcription factors. In the first studies I'll talk about we worked with on the order of 100 transcription factors, and later, as more data became available, we worked with on the order of 1,000 transcription factors. 
So we can use these data to construct the genotype networks that Maynard Smith had postulated, and I'll show you how we do this by example. So let's say you were to pull down protein binding microarray data for a single transcription factor, in this case a transcription factor called SRY, and you were to look at the distribution of the number of binding sites that have a given binding affinity. You'd get a distribution that looks just like this. The mode of this distribution represents sequences that are bound non-specifically; these transcription factors are just attracted to the DNA backbone. But then you also have this right tail of the distribution that represents sequences that bind the transcription factor specifically. So what you can do is set a threshold on this tail and say, okay, everything above this threshold we consider specifically bound by the transcription factor, and everything below is non-specifically bound. You can then build up a genotype network out of these sequences, and you can do so for a large number of transcription factors to populate this space of all possible genotypes. So I'll show you how we do that by example. Let's say that this particular transcription factor binds just three sequences. You'd represent each sequence as a vertex in a network, and you would connect vertices by an edge if the corresponding sequences differ by a single small mutation. So in this case we have this one point mutation that makes the top sequence differ from the middle sequence. And here you see another single point mutation that makes the bottom sequence differ from the middle sequence. Now obviously this particular transcription factor binds more than just these three sequences. This is a sub-network of a much larger genotype network. This is Maynard Smith's idea borne out in real data. 
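The construction just described, thresholding binding affinities and then linking bound sequences that differ by one mutation, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the sequences are toy 4-mers rather than 8-mers, the scores and the threshold value are hypothetical, and only point mutations are treated as "single small mutations".

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(affinity, threshold):
    """Return (vertices, edges) of a genotype network.

    Vertices are sequences whose score is above the threshold (treated as
    specifically bound). Edges join bound sequences that differ by exactly
    one point mutation (an assumption; other mutation types are ignored)."""
    bound = sorted(s for s, score in affinity.items() if score > threshold)
    edges = [(a, b) for a, b in combinations(bound, 2) if hamming(a, b) == 1]
    return bound, edges


# Hypothetical binding scores for a hypothetical transcription factor.
toy_affinity = {
    "ACGT": 0.48,
    "ACGA": 0.46,
    "ACCA": 0.41,
    "TTTT": 0.10,
    "GGGA": 0.20,
}
bound, edges = genotype_network(toy_affinity, threshold=0.35)
```

In this toy case three sequences pass the threshold and form a chain of two edges, mirroring the three-sequence example from the slide.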
So just to drive the point home, each one of these vertices represents a sequence that binds this particular transcription factor, SRY, and edges connect vertices if their corresponding sequences differ by a single small mutation. And we can study the structure of these genotype networks to ask quite a few questions; for instance, we can ask how the structure of these networks relates to mutational robustness. We can also ask how these genotype networks facilitate evolvability, and we can do that by not just looking at a single genotype network for a single transcription factor, but by populating this space of possible sequences with genotype networks for other transcription factors, which I'm just showing schematically here, and asking how these genotype networks overlap with each other and how they interface with each other. This gives us a feeling for how mutations in these binding sites could bring forth phenotypic variation. In this case, that variation is new binding phenotypes. We can also look at how a mutation can abrogate binding, which is also very important in regulatory evolution. The loss of binding sites can cause important phenotypic variation that is sometimes adaptive. All right, so to show this more quantitatively, what I'm showing here on the y-axis is our measure of mutational robustness. So I can just briefly explain how we compute this. The mutational robustness of a transcription factor's binding sites is the average mutational robustness of the individual binding sites. The mutational robustness of an individual binding site is simply the fraction of all possible mutations to that binding site that create another sequence that also binds the transcription factor of interest. So said more simply, it's how often, on average, a mutation to one of a transcription factor's binding sites leaves binding intact rather than abrogating it. 
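The robustness measure defined above can be written down directly. The following is a minimal Python sketch, assuming that "all possible mutations" means all single point substitutions (other mutation types are left out) and using a hypothetical set of toy 4-mer binding sites; the function names are illustrative.

```python
def one_mutant_neighbors(seq, alphabet="ACGT"):
    """All sequences that differ from seq by exactly one point substitution."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in alphabet
        if base != seq[i]
    ]


def site_robustness(seq, bound):
    """Fraction of one-mutant neighbors of seq that are also bound."""
    neighbors = one_mutant_neighbors(seq)
    return sum(n in bound for n in neighbors) / len(neighbors)


def tf_robustness(bound):
    """Average site robustness over all binding sites of one factor."""
    bound = set(bound)
    return sum(site_robustness(s, bound) for s in bound) / len(bound)


# Hypothetical binding sites of a hypothetical transcription factor.
toy_bound = {"ACGT", "ACGA", "ACCA"}
robustness = tf_robustness(toy_bound)
```

Each toy 4-mer has 12 one-mutant neighbors; the middle site "ACGA" keeps binding after 2 of those 12 mutations and the other two sites after 1 of 12, so the average robustness works out to 1/9.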
So when this robustness value is high, it's very rare that these mutations break binding, and when robustness is low, these mutations very often abrogate binding. Okay, so on the y-axis we have this measure of mutational robustness. On the x-axis we have the size of the genotype network, shown as a fraction of genotype space. So this is just how many sequences are in the genotype network, divided by the total number of sequences of length eight. Each data point represents a transcription factor, and in this case we have the closed symbols representing data from mouse and the open triangles representing data from yeast. And what we see is this logarithmic increase in mutational robustness as a function of the size of the genotype network. And now we do the same thing with evolvability on the y-axis, where evolvability is defined as the ability of mutation to bring forth new binding phenotypes. So for all of the genotypes in the genotype network, we look at all of the one-mutant neighbors and ask which transcription factors those one-mutant neighbors bind. And that's our measure of evolvability. It goes from zero to one because we normalize by the number of transcription factors in our dataset; that's the maximum. And again we're showing this as a function of the size of the genotype network. And here we see an increase that's much more abrupt, such that these genotype networks only have to occupy about one percent of genotype space before they're maximally evolvable. So this study helped us to understand how genotype networks mediate this synergistic relationship between robustness and evolvability, and this was the first time that this particular relationship was demonstrated in experimental data. 
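The evolvability measure described above can be sketched along the same lines. This Python sketch rests on several assumptions that the talk does not spell out: only point substitutions count as mutations, neighbors that already belong to the focal network are excluded, and the count is normalized by the number of other transcription factors in the dataset. The factor names and binding sites are hypothetical.

```python
def one_mutant_neighbors(seq, alphabet="ACGT"):
    """All sequences that differ from seq by exactly one point substitution."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in alphabet
        if base != seq[i]
    ]


def evolvability(focal_tf, bound_sets):
    """Fraction of the other factors that bind at least one sequence lying
    one mutation away from the focal factor's genotype network."""
    focal = bound_sets[focal_tf]
    # One-mutant neighborhood of the whole network, minus the network itself.
    neighborhood = {n for s in focal for n in one_mutant_neighbors(s)} - focal
    others = [tf for tf in bound_sets if tf != focal_tf]
    reachable = [tf for tf in others if bound_sets[tf] & neighborhood]
    return len(reachable) / len(others)


# Hypothetical dataset of four factors and their bound toy 4-mers.
toy_bound_sets = {
    "TF_A": {"ACGT", "ACGA"},
    "TF_B": {"ACGG", "TTTT"},
    "TF_C": {"GGGG"},
    "TF_D": {"TCGA"},
}
evolvability_a = evolvability("TF_A", toy_bound_sets)
evolvability_c = evolvability("TF_C", toy_bound_sets)
```

Here TF_A can reach binding sites of TF_B and TF_D by one mutation but not of TF_C, giving 2/3, while the isolated network of TF_C reaches none of the others.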
And getting back to the schematic here, it looks like this genotype space of transcription factor binding sites is organized in such a way that these genotype networks are overlapping with one another and interfacing with one another. So it's all just a mess; they're all really highly intertwined with one another. In this way they're both robust and evolvable. And we were interested in understanding the generality of this result, especially in the context of gene regulation, and more specifically in the context of regulatory proteins interacting with their nucleic acid sequence ligands. And so we decided to go one step up in the hierarchy of gene regulation and look at RNA-mediated gene regulation, particularly post-transcriptional gene regulation that's mediated by RNA binding proteins. So RNA binding proteins bind RNA molecules to regulate their stability, their transport and their decay, among other aspects of RNA biology. And what's important here is that we can actually do a comparative analysis of these genotype-phenotype maps, where in this case the genotype is now an RNA sequence and its phenotype is its molecular capacity to bind an RNA binding protein. We can do a comparative analysis here because the biophysics of binding at these two levels of gene regulation are highly similar to one another. And what's more, the computational pipeline that's used to go from the fluorescence intensity on these chips to the measurement of binding affinity that we work with is literally identical between these two experimental protocols. So we really have a head-to-head comparison that we can make. So what we want to know is whether or not, at this level of gene regulation, this particular genotype-phenotype map is organized in the same way. 
So now we're looking at a different set of data. Now we have the closed symbols representing transcription factors, here the data are for human and fly, and the open symbols correspond to RNA binding proteins. And again on the y-axis we're looking at robustness, and on the x-axis again the size of the genotype network. And we see once again this logarithmic scaling: as the genotype network gets larger, robustness increases logarithmically. And this was a nice thing to see, because this is something that's been predicted in computational models of biological systems, where there's some nice math explaining it. And this is the first time that we were able to show using real data that this scaling relationship holds across multiple levels of gene regulation. So the relationship between mutational robustness and the size of the genotype network is really similar for these two classes of regulatory proteins, right? But when we move on to evolvability, the story changes. So remember, the closed symbols correspond to transcription factors and the open symbols to RNA binding proteins, and we already saw this really abrupt increase in evolvability with the size of the genotype network on the previous slide for transcription factors. Now we're also seeing this for RNA binding proteins, except the rate of increase is slightly slower, and more importantly the maximum level of evolvability is considerably lower, right? So this hints that the architecture of these genotype-phenotype maps really differs between these two levels of gene regulation. 
And indeed, if we look at the average mutational distance of these genotype networks from one another, which is what I'm showing on the y-axis for fly and human, broken down into transcription factors and RNA binding proteins, what we find is that the genotype networks of transcription factor binding sites tend to be much closer together in the space of all possible binding sites than are the genotype networks of RNA binding protein binding sites. What's more, the genotype networks of transcription factor binding sites tend to overlap with one another to a much greater extent than do those of RNA binding protein binding sites. And I think this is particularly surprising, because in these data sets, and I can show you the numbers, we have fewer RNA binding proteins than we do transcription factors, and we have fewer binding domains in the RNA binding protein data set than we do in the transcription factor data set. Regulatory proteins with the same binding domain typically bind similar sets of sequences. So in fact the data set as it's set up should stack the deck in favor of RNA binding proteins having more overlap, yet we observed just the opposite. So this really suggests that the architecture of these genotype-phenotype maps is just fundamentally different. If we go back to this schematic, it seems that these genotype networks of RNA binding protein binding sites are just farther away from one another in this space of possible binding sites, and in that regard they are less evolvable than are transcription factor binding sites. All right, so that was a lot of detail. We can come up for air now. 
So I had said earlier that I wanted to show how two old ideas from evolutionary theory helped us to think about these data, and how in turn this helped us to bring new life into these old ideas, and now I'd like to present the second idea, which is much better known. So this is the metaphor of the adaptive landscape, which was put forward by Sewall Wright in 1932. It's a metaphor that really pervades the biological sciences and has shaped evolutionary thought ever since its inception. It's not a perfect metaphor, like any metaphor, but in this metaphor the landscape is akin to a physical space, where coordinates in physical space correspond to genotypes in an abstract genotype space and where the elevation of these coordinates corresponds to some quantitative phenotype, or to the ultimate phenotype of organismal fitness. Evolution can then be viewed as a hill-climbing process in these landscapes, where populations tend to move towards adaptive peaks as a consequence of mutation and natural selection. So in the context of transcription factor binding sites, we had been thinking about the phenotype as binary: you know, is this sequence bound or not. If we instead start thinking about the phenotype of these sequences as quantitative, so what is the actual strength with which this particular sequence is bound, we can transform the flat genotype networks that I was showing you before into adaptive landscapes, and we can study the ruggedness of these landscapes. The ruggedness of an adaptive landscape has several important implications for evolutionary processes, ranging from the evolution of reproductive isolation to the evolution of sex to how genetic diversity is generated and maintained, but what's germane to this particular talk is that the ruggedness of an adaptive landscape has important implications for evolvability. So for the ability of mutation to bring forth phenotypic variation. 
And the reason that the ruggedness of an adaptive landscape has implications for evolvability is that both the shape of the landscape and the population's location within the landscape determine the amount of phenotypic variation that mutation can bring forth. So for instance, a population that resides in this particular region of the adaptive landscape may have no problem, just by mutation and natural selection, marching directly up this hill to the global peak in this particular landscape. Whereas a population that is navigating this section of the adaptive landscape might get trapped by this local optimum, which is separated by an adaptive valley from the global adaptive peak. So in this way, a population's location within a landscape really determines the kind of phenotypic variation that mutation can bring forth. Right. So we analyzed the topographies of adaptive landscapes of transcription factor binding affinities for a large number of transcription factors, specifically for 1,137 transcription factors from 129 eukaryotic species, representing, I believe, 62 DNA binding domains, each of which you can think of as a distinct biophysical mechanism by which the transcription factor interacts with DNA. And we could characterize the ruggedness of these landscapes in a variety of ways. Some of these ways are really simple, like just counting peaks: how many peaks are in the landscape, for instance. Other measures pertain to things like epistasis. And then we can compare these measures of landscape ruggedness with a pair of expectations that come from null models. One null model generates very smooth, Mount Fuji-like landscapes, and these kinds of landscapes would pose no obstacle for evolution. It doesn't matter where you are in the landscape; you could always move uphill to the global peak. And we also considered a null model that produces very rugged landscapes, and these really hinder the navigability of these landscapes. Right. 
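Peak counting and the contrast between a smooth and a rugged null model can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the landscape covers toy 3-mers with made-up additive scores, a peak is simplified to a single sequence whose score exceeds that of all its one-mutant neighbors, and the rugged null is produced by randomly reassigning scores to sequences. The talk does not describe how the actual null models were built.

```python
import random
from itertools import product


def one_mutant_neighbors(seq, alphabet="ACGT"):
    """All sequences that differ from seq by exactly one point substitution."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in alphabet
        if base != seq[i]
    ]


def count_peaks(landscape):
    """Count sequences scoring strictly higher than every neighbor present."""
    peaks = 0
    for seq, value in landscape.items():
        neighbors = [n for n in one_mutant_neighbors(seq) if n in landscape]
        if all(value > landscape[n] for n in neighbors):
            peaks += 1
    return peaks


# Hypothetical additive landscape: each position contributes independently.
base_score = {"A": 3, "C": 2, "G": 1, "T": 0}
additive = {
    "".join(s): sum(base_score[b] * (i + 1) for i, b in enumerate(s))
    for s in product("ACGT", repeat=3)
}

# Hypothetical rugged null: the same scores reassigned to sequences at random.
rng = random.Random(0)
scores = list(additive.values())
rng.shuffle(scores)
shuffled = dict(zip(additive.keys(), scores))

additive_peaks = count_peaks(additive)
shuffled_peaks = count_peaks(shuffled)
```

The additive landscape has a single peak by construction, since any sequence other than "AAA" can always be improved by one substitution; the shuffled landscape keeps the same scores but typically has more local peaks.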
So here we can look at these data and how they compare to these null models. We're looking at three distributions here. Each distribution shows you the number of peaks in a landscape. The top panel shows you this distribution for the additive model. The middle panel shows you this distribution for the empirical data, so all 1,137 transcription factors. And the bottom panel shows you this distribution for the shuffled model, which generates really rugged landscapes. And what we find is that the empirical landscapes are much closer to the additive model. They tend to be single-peaked landscapes. They're very different from what you'd expect under this highly rugged model. But there's also variation in the number of peaks. Some landscapes do have more than one peak. The next thing that we could look at is the number of binding sites in a peak. So when you think of a peak you might be tempted to think of just a single sequence being in that peak, but our data don't always agree with that expectation. So here again we're looking at distributions for the additive model, the empirical data, and the shuffled model. And what we find here is that again the distribution for the empirical data much more closely resembles that of the additive model than that of the shuffled model. So the peaks in these empirical landscapes tend to contain multiple sequences. And importantly, there's variation in the number of sequences per peak, and we're going to come back to that in a moment. We also studied the mutational accessibility of the peaks in each landscape, and I'll explain what I mean by that by example. So here I'm showing a genotype network of transcription factor binding sites. I've just chosen this one because it's small. What we'll do is zoom in on this peak here. So this is the peak sequence. And in the zoomed-in view, the number in each vertex represents the mutational distance to that peak sequence. 
And for each mutational distance, what we'll do is we'll look at all possible mutational paths to the global peak. So for instance, here I'm showing one mutational path that goes from a sequence that's two mutations from the global peak to the global peak. And we'll just ask what fraction of all of these paths are strictly increasing in binding affinity. So for this particular path, here on the y-axis we have binding affinity and on the x-axis we have the distance to the peak. We see that binding affinity is just increasing along this path, so we'd say that's an accessible mutational path. Similarly, this other path from the same starting sequence is also mutationally accessible. We're just increasing in binding affinity. And then in contrast, if we start with this sequence and take this mutational path, we have to go through this valley in binding affinity. So we'd say that is not an accessible mutational path. So for any given mutational distance, we'd say mutational accessibility is simply the fraction of paths that are accessible. So the number of black lines divided by the total number of lines. Okay, so I can show you these data. First just let me orient you to the structure of these genotype networks. So first you're just looking at a distribution of the mutational distance to the global peak. So the average is between three and five mutations to the global peak. And now if we look at the mutational accessibility of these peaks as a function of the distance to the peak, we see first of all what we expect. The farther you are from the global peak, the less likely you are to be able to get there by mutationally accessible paths. So we have this decrease. And I think what's actually quite surprising about this trend is that even as far away as possible, there's still around 20% of all possible paths that are mutationally accessible. So in this regard these landscapes are highly navigable, right? They're smooth, they're mutationally accessible.
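The accessibility measure described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: paths are restricted to the shortest point-mutation paths from a starting sequence to the peak, a path counts as accessible only if binding affinity strictly increases at every step, and a path through a sequence missing from the landscape counts as inaccessible. The function names and the toy affinities are hypothetical.

```python
from itertools import permutations


def direct_paths(start, peak):
    """Yield every shortest mutational path from start to peak,
    one per ordering of the positions at which the two sequences differ."""
    differing = [i for i, (a, b) in enumerate(zip(start, peak)) if a != b]
    for order in permutations(differing):
        current = list(start)
        path = [start]
        for position in order:
            current[position] = peak[position]
            path.append("".join(current))
        yield path


def is_accessible(path, affinity):
    """A path is accessible if affinity strictly increases at every step."""
    scores = [affinity.get(seq) for seq in path]
    if any(score is None for score in scores):
        return False  # path leaves the measured landscape
    return all(a < b for a, b in zip(scores, scores[1:]))


def accessibility(start, peak, affinity):
    """Fraction of shortest paths from start to peak that are accessible."""
    paths = list(direct_paths(start, peak))
    return sum(is_accessible(p, affinity) for p in paths) / len(paths)


# Toy landscape: AA is the peak, and CA sits in a valley between CC and AA.
toy_affinity = {"AA": 1.0, "AC": 0.6, "CA": 0.2, "CC": 0.4}

print(accessibility("CC", "AA", toy_affinity))
```

In this toy case the path CC, AC, AA climbs steadily, while CC, CA, AA dips through a valley, so one of the two paths is accessible. Averaging this quantity over all sequences at a given distance from the peak gives accessibility as a function of distance.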
And if we compare these findings to what we expect from our two null models, we see that in this case we're really intermediate between the two null models. You can't really say that the mutational accessibility of these peaks is more like the additive or the rugged null model. Alright. So as I said before, there was variation in the number of sequences in the global peak. And when we were doing these analyses, and actually all the analyses that I talked about today, we were always acutely aware of the fact that we were working with data that were generated by an in vitro assay. So we wanted to know to what extent our findings say anything about the evolution of binding sites in vivo. And in each of these studies we really looked into that, and I'll present one analysis from this study. So since we saw that there is variation in the number of sequences per peak, we hypothesized that those transcription factors that had global peaks that were very narrow would exhibit less diversity in their binding sites than would transcription factors that have global peaks that are very broad. So if you're a transcription factor that has a global peak that looks more like the Matterhorn than it does the Millaisone, we would expect you to have fewer sequences, I'm sorry, less diversity in your binding sites. So yeah, fewer sequences. Okay. So what we did is we looked at 19 Saccharomyces cerevisiae strains, and we calculated binding site diversity in these strains for 23 yeast transcription factors, and we then related that measure of diversity to the size of the global peak. And we found this very striking correlation, and this is a correlation that holds up even after we control for things like the overall specificity of the transcription factor, as measured by the information content of its position weight matrix. So this was one of several analyses that made us feel that these landscapes that we're constructing from the in vitro data actually tell us something about how binding sites evolve in vivo.
And this is surprising, right, because there are just tons of other factors involved in gene regulation in vivo that are completely abstracted away in these in vitro assays. Okay. So that is the research part of the talk, and just to sort of zoom out, I think that these studies have helped us to better understand how gene regulation is simultaneously robust and capable of bringing forth phenotypic variation, such as changing pigmentation patterns and body plans, that really embodies a lot of the diversity of life around us. I know since this is a talk for a bioinformatics audience, I would also just like to highlight a tool that we developed that allows you to take whatever data it is that you're working with, that you think these kinds of analyses might help you better understand, and to use those data to perform all of the analyses that I talked about today in a really easy-to-use format. This web server is called the Genonets Server, so it's a web server that also has an underlying Python package that I'll talk about briefly, but basically all you need is these two columns of an input file: you have to have some genotype, and then you have to have some categorical phenotype assigned to each genotype. So in our case here, the genotypes are DNA sequences and the phenotypes are the regulatory proteins that bind those sequences, and then you can also have a quantitative phenotype, which I'm calling score here, so in our case this is binding affinity, and then you have some measurement of noise. These last two columns are totally optional, though. So to give another example, if you're interested in the relationship between RNA sequences and RNA structures, the genotypes could be RNA sequences, and the categorical phenotype could be which particular structure that sequence folds into. The score could be the folding energy, and the noise could be some measurement of your confidence in that folding energy.
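As a rough illustration of the four-column layout described here, the sketch below builds a small tab-separated table with Python's standard csv module and groups genotypes by their categorical phenotype. The column names, the tab delimiter, and the toy values are all assumptions made for illustration. They are not the server's documented input format, which should be checked before preparing real data.

```python
import csv
import io

# Hypothetical column names: categorical phenotype, genotype, score, noise.
COLUMNS = ["phenotype", "genotype", "score", "noise"]

# Made-up toy rows: two binding sites for one factor, one for another.
TOY_ROWS = [
    ("TF_A", "ACGTGA", 0.48, 0.02),
    ("TF_A", "ACGTGC", 0.41, 0.03),
    ("TF_B", "TTGACA", 0.45, 0.02),
]


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


def genotypes_by_phenotype(tsv_text):
    """Group genotypes by the categorical phenotype they are assigned to."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = {}
    for row in reader:
        groups.setdefault(row["phenotype"], []).append(row["genotype"])
    return groups


tsv_text = to_tsv(TOY_ROWS)
print(tsv_text)
print(genotypes_by_phenotype(tsv_text))
```

Each group of genotypes sharing a phenotype is the raw material for one genotype network, and the score column is what turns that network into a landscape.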
The point is simply to show you this is a super simple input format, and all you need is to put your data in this format, you pipe it into this server, and it will literally perform every analysis that I described today in this talk. Well, not literally, with the exception of the binding affinity one. It doesn't do anything with those data. So it's, I think, a pretty convenient tool. It also has these nice interactive visualizations that you can play around with, so you can look at your genotype networks, you can move your genotypes around and really get a feel for your data. And if you're not the type that likes to work with a web server, the software engineer that worked on this developed a really nice Python package that implements all of the analyses that I described today. I don't expect you to be able to read this code. I put this up here just to show you there are only 13 lines of executable code here, the ones that are not commented out, and those 13 lines are sufficient to reproduce all the analyses that I talked about today. So it's really a nice little tool. So I did not do this work alone. This has all been very collaborative. A lot of the work was done with a Ph.D. student member of this group named José Aguilar-Rodríguez. The Genonets tool and this RNA binding protein research were done in collaboration with Fahad Khalid, and then finally all of this was done in collaboration with Professor Andreas Wagner. These are my funding sources over the period in which we did this work. And finally, if you're interested in reading more, this is really the body of work that I just went over. That's it. Thanks.