Okay. All right. Good morning, everybody. Thanks for coming. Welcome to week four of our course. Our usual housekeeping: the disclosure, for the third time, that I have no financial relationships with commercial interests. If that doesn't satisfy the CME folks, I don't know what will. Okay. So we're going to pick up where we left off two weeks ago and talk about how to further analyze sequences, primarily at the protein level. And that might seem odd in the context of a course called Current Topics in Genome Analysis. But it's really important that we consider both DNA and protein sequences when we're doing our analyses. The advances that we've been able to make over the last number of years in genomic science really make it much easier to find mutations, look at human genetic variation, and study chromosomal aberrations and changes in gene expression. And we need to be able to translate those findings at the nucleotide level to the protein level, keeping in mind, of course, that proteins are the workhorses of the cell. So in the laboratory, hopefully this would lead to better experimental design and allow you to better understand what a mutation is doing, the net effect of that mutation on a protein's structure and function. In a clinical setting, you might detect a mutation in a patient, perhaps by doing genetic susceptibility testing, and you need to be able to understand the net effect, again, of that mutation at the protein level to decipher what's going on phenotypically in those patients. So it gives us good information about what is going on at the metabolic level, what changes are going on that help us understand what is going on in the patients that we see in the clinic. And hopefully, by extension, it will allow us to think about new and interesting therapeutic approaches. So with that, we're going to continue our discussion of what can be done at the protein level by looking at profiles, patterns, motifs, and domains.
We're then going to switch from thinking about things as a string of letters at the sequence level to looking at three-dimensional structures, to consider some other approaches that can be used to address the issue of similarity. And finally, we're going to wrap everything together with a discussion of multiple sequence alignments and how they fit into the overall landscape of sequence analysis: the nuances of how to do them, how to interpret your multiple sequence alignments, and more importantly, how to not over-interpret your multiple sequence alignments. So, as we did in the first week, when we talked about homology searches in the context of BLAST, all of these searches were one against one. You had a query sequence, and that was compared one by one to all of the sequences in a public database, or perhaps in a database that you have collected in your lab. Another approach to doing this is by using things called profile searches. So rather than doing these comparisons one to one, we use the common characteristics of a protein family to bolster the detecting power of these methods. And doing this allows us to identify members of a protein family that BLAST or FASTA itself might not find. So these searches can be one against many, where you start with a query sequence and compare that to entire families of protein sequences all at once, or many against one, where you develop what is essentially a multiple sequence alignment and use that as the basis of comparison to pull out those last few members of a protein family. So we're going to talk about all four of these methods that really capitalize on commonalities in sequence and commonalities in domain architecture to take you one step further than what you can get simply by doing a BLAST search. So like we did two weeks ago, let's start off with some definitions. The first definition today is something called a profile.
And all a profile is is a numerical representation of a multiple sequence alignment. And what this depends upon are patterns or motifs, protein motifs, important secondary structural elements that contain conserved residues. When we look at these profiles, they are intended to represent all of the common characteristics of a protein family. And by doing that, again, they allow you to find similarities between sequences that have little to no sequence identity. So if you'll remember, two weeks back I offered the example of the homeodomain family, a motif 60 amino acids in length. But there's only a small handful of residues across those 60 amino acids that are absolutely conserved. And because of that, BLAST itself might not be able to find for you all of the members of that particular family. And this is a very common problem that happens over and over again. So you have to have some other techniques in your arsenal that are going to allow you to find these distantly related proteins. Okay, so what does one of these profiles look like? So here I just have a multiple sequence alignment, 10 across. If you look at the last three positions, you'll notice that at position eight, you always see a T, a threonine, in that position. In the ninth position, you sometimes see a proline, but sometimes you see a threonine. And in the last position, you always see an invariant glycine. Based on that multiple sequence alignment, we have methods that are going to ask four questions. What residues do we see at each position? What's the frequency of the observed residues at places in this alignment where we don't have absolute conservation? Which places are conserved in the alignment? And finally, we don't have any in this particular example, but where can gaps actually be introduced? Once we answer all of those questions, we can construct something called a position-specific scoring table.
So this should be slightly reminiscent of the BLOSUM matrices we talked about two weeks back. And this is just, again, the numerical representation of this multiple sequence alignment. So we have position one through position ten going across in the alignment, and position one going down here to position ten in the table. Now, we know already that in the eighth position, we have an invariant threonine. In the last position, we have an invariant glycine. So let's look at that last position. And as before, if we line up where the G is with the G in the scoring matrix, you'll see that that converges on a value of 150, the largest number in the table. So whenever you have an exact match of your sequence to the profile, you give it the highest number of points. As before, we want to take into account conservative substitutions that might indicate something interesting biologically. So let's look at that next-to-last position, where we can have either the P or the T. The consensus is the proline, since it's mostly prolines in that position. If we look across, we see another positive value, just not as high as what we had before. So that's to represent that there is some variation in that position. Finally, if we look at the second position here, there's a whole variety of things that can be there. If you match the consensus, you still score, but the number is lower again, to indicate less conservation at that position. And you'll see a smattering of negative values in there that are supposed to represent mismatches to the profile. Does that make sense? Okay, so pretty simple. Once we have this table, we throw the alignment away, and the table is used as the basis for comparison. So we can now take a single sequence and compare it to hundreds or thousands of these profiles, each of which has been constructed to represent a particular protein family. Now, the good news is you don't have to make these. They're already available. And we'll see examples of that shortly.
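To make the idea of a position-specific scoring table a little more concrete, here is a minimal sketch of how one could be computed from an ungapped alignment. Everything in it is an assumption made for illustration: the toy alignment is hypothetical (it only mimics the slide in having an invariant T at position eight, a mostly-proline position nine, and an invariant G at position ten), the scores are simple log-odds values with a pseudocount and a uniform background, and the numbers are on a different scale from the table shown in the lecture. The function names `build_pssm` and `score_sequence` are made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):  # walk the alignment column by column
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_sequence(pssm, sequence):
    """Sum the per-position scores of an ungapped sequence against the table."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

# Hypothetical toy alignment, 10 columns wide (not the one on the slide).
toy_alignment = [
    "ACDKLMNTPG",
    "AEDRLMSTPG",
    "GCDKIMNTTG",
    "ACEKLVNTPG",
    "SCDRLMDTPG",
]

pssm = build_pssm(toy_alignment)

# Invariant column (G at position 10) scores highest, the majority residue
# at a variable column (P at position 9) scores lower, the minority residue
# (T at position 9) lower still, and unobserved residues score negative.
print("G at 10:", round(pssm[9]["G"], 2))
print("P at 9: ", round(pssm[8]["P"], 2))
print("T at 9: ", round(pssm[8]["T"], 2))
print("W at 10:", round(pssm[9]["W"], 2))
print("family member:", round(score_sequence(pssm, "ACDKLMNTPG"), 2))
print("unrelated:    ", round(score_sequence(pssm, "WWWWWWWWWW"), 2))
```

The ordering of the printed scores is the point here, not their absolute values: exact matches to conserved positions earn the most, matches at variable positions earn less, and mismatches are penalized.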
The second definition this morning is something called a pattern. So when we had our profile, you saw what could be at each position. And you also had some idea of what the frequency of each residue could be at each position. The patterns take a little bit of a different approach in that they just give you positional information: what has to be in each position, position by position, in a particular motif. And there are lots of symbols in here, so let me decipher this for you. Okay. At the very beginning, you see a phenylalanine and a tyrosine, and they're put in square brackets. So what the square bracket means is "one of." So at the first position, either a phenylalanine or a tyrosine; not both, one or the other. Then you see an X. The X just says, well, any amino acid can be at that position. We then just see a letter by itself, no embellishment. So that means that that is an absolutely conserved position. So at the third position, you have to have a cysteine. Here's the X again. But now there's a two after it. And that just means two of them. So any two amino acids can follow. All right. Now here we have the curly brackets. So remember, the square brackets mean "one of." And what the curly brackets mean is "neither." So you can have anything at that position except a valine or an alanine. The X, again, in the next position. And here, in the last position, we've got a histidine. The three after it, of course, means three histidines in a row. So it's just a shorthand way of representing a particular motif or a particular domain. Again, positional information, but not frequency information. So both of these are useful. And you should use both of them. And let's just go right into the websites and immediately talk about how you can put these into practice. All right. So the first web resource we're going to consider today is called Pfam, which stands for protein families.
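As a brief aside on the pattern notation just described, the symbols map almost one to one onto regular-expression syntax, which makes the rules easy to check by hand. The sketch below is an illustration only: the function name `pattern_to_regex` and the two test peptides are made up, the pattern string is written the way the lecture reads it aloud, with elements separated by hyphens, and the translation handles only the elements mentioned above (square brackets, curly brackets, x, single letters, and a fixed repeat count). Anything else a real pattern might contain, such as variable-length ranges, is deliberately left out.

```python
import re

def pattern_to_regex(pattern):
    """Translate a simple hyphen-separated motif pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"([^()]+)(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."  # x: any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {VA}: anything except V or A
        # [FY] (one of) and a bare letter (invariant) are already valid regex
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

# The pattern as described in the lecture.
lecture_pattern = "[FY]-x-C-x(2)-{VA}-x-H(3)"
regex = pattern_to_regex(lecture_pattern)
print(regex)

# Two hypothetical peptides: the second has a V at the excluded position.
matching_peptide = "MKFACGGLQHHHK"
failing_peptide = "MKFACGGVQHHHK"
print(bool(re.search(regex, matching_peptide)))
print(bool(re.search(regex, failing_peptide)))
```

Note what the regular expression cannot tell you: it says whether a motif is present or absent, but carries none of the frequency information that a profile does.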
And this is just a collection of all of these profiles, a lot of multiple sequence alignments that represent either protein domains or conserved protein regions. By extension, because they're conserved, Mother Nature is conserving them for some reason. So they either have structural importance or functional importance. When we look at the entries, when we look at our examples, you'll see the actual alignment that went into creating the profile, and some information on domain architecture. So when I use that term, it's the collection of domains in a row that characterizes a particular family. So a series of domains. It'll give you some idea of which species we see these particular proteins in. If there's any solved protein structure, either X-ray or NMR, it gives you that information as well. And then finally, links to other resources. And we'll follow some of those through. There are two flavors of Pfam. The first one is called Pfam-A. And these are based on curated multiple alignments. And this is the better of the two sets that I'm going to describe to you. In this case, somebody actually manually searched the database, made the multiple sequence alignment by hand, and then used that alignment, using these hidden Markov models, to find all of the other sequences belonging to the family. And if you read the documentation, they draw the analogy to a nice handcrafted beer that somebody has taken their time to put together for your satisfaction. Okay. Because of the care that has been given to constructing these, when you use them as a comparator and you have a hit, the hits are very highly likely to be true positives. The difference between Pfam-A and Pfam-B is that, in the case of B, these are not the nicely handcrafted ones. These are just automatically generated from database searches. By extension, they're deemed to be of lower quality because they're lacking that manual inspection.
But if you don't find anything based on the Pfam-A search, it's worth taking a look at what you get out of Pfam-B. Okay. Just by way of reminder, there is a webpage up for all of the sequences that I will be using as the examples, the same way we did two weeks ago in the first sequence lecture and as Tyra did last week for the genome browser lecture. And again, you really, really should go back to your labs and take the time to work through the examples yourself to really reinforce what we're doing here in class. It's the best way to learn how to do this stuff. So there is the URL. Okay. So we're going to pretend that we have now gone to our web browsers. Here is the URL at the Sanger Institute for the Pfam database. And all of your options are in a very concise list here. We're going to start with a sequence. So what we want to do is just click on sequence search. And when we do that, we get a box to put our sequence in. Underneath, there's a link, very easy to miss, for how to get to the options for your search. And you know I'm big on the options at this point. So what we're going to do is click on that to expand this display so you can actually see what the options are. Unlike BLAST, it's not very long. There are only two options here. So we'll take our sequence from the page that I showed you earlier. You can change the cutoff. So remember, we talked about statistical cutoffs for your BLAST searches; the same rules apply here. So in the context of this particular example, the value by default is one, and for purposes of the example, I'm going to leave it at one. But remember, the rule that I gave you two weeks ago was that we want an E value of 0.001 or smaller. Okay. But again, for the example, let's just leave it there. The other choice here is gathering threshold.
And what that just means is that the cutoff will change as the comparison is done from profile to profile to profile, based on that particular profile. So it uses a different statistical measure to assess similarity. At the bottom, you have a checkbox to search for Pfam-Bs. You should always check that box. I've left it unchecked here because I know that there actually aren't any. So we're going to go ahead and submit. And we get an output that's actually very easy to read, and it looks like this. At the very top, we found that our sequence has a P450 domain, and it gives us a graphical view of that. It tells us there are three Pfam-A matches: one significant, two insignificant. So here's the significant one. And here are the two insignificant ones. If we look at the significant one, it tells us that our sequence has a cytochrome P450 domain in it. It tells us where the start and end positions in our sequence are. So from 41 to 500 in our sequence. And here's the E value. So tremendously low. It meets our criteria. No problem. Okay, so let's say we want some more information on this. There are two toggles here. One says show the detailed description of the page. The other one shows you the alignment. So we're going to pretend that we've clicked on both of those to expand the page. And let's concentrate on the alignment part. First, this is somewhat reminiscent of what we saw when we discussed BLAST two weeks ago. Your sequence is the sequence at the bottom, okay, with all of the green on it. Then there are a bunch of other lines here. All right, the line that is marked HMM, the very first line, the hidden Markov model line, is the consensus from the profile, from that multiple sequence alignment that generated the profile with all the numbers in it, and how it matched up to your sequence. The second line gives you the match between your sequence and the hidden Markov model. So anytime you see the letter, it's an exact match.
Anytime you see the plus sign, again, it's a conservative substitution. Now we have some statistical information we haven't seen before. So we've seen the E value, but this also gives you information, position by position, on how good the alignment of your sequence to that profile actually is. And the reason I've expanded this page is so that you have this information in your notes. So PP is the posterior probability, and that just tells you how good the alignment is; higher is better. Okay, so if you see a one, for example, one means from 5% to 15%. A two means from 15% to 25%. You get the idea, going all the way up to a nine, which means 85% to 95%. And the asterisk means anything over 95%. So again, the higher the number, the better. And if we look at all these numbers, these look pretty good. Now, that's fine. Let's take a look at what we can learn about this protein family, now that we know that our sequence of interest contains a P450 domain. So this is the summary page for cytochrome P450. You have a nice executive summary here, so if you didn't know anything about what this domain does or how it's structured, you can very quickly learn a little bit about that, along with some pointers back into the literature and a sample three-dimensional structure of one of the proteins containing a P450 domain that matches this particular profile that we have found. This is an example structure; below it, it says view a different structure, so you can change this to see some other members of the family as well. At the very top, we now have some information about where we find this particular P450 domain. So there are 196 different domain architectures, different arrangements of protein domains that contain one or more P450 domains, covering almost 28,000 sequences. Okay, so you get a sense of how pervasive this is. Let's pretend to go down some of the links on the left-hand side here.
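Going back for a moment to the PP line in the alignment view, the encoding is simple enough to decode mechanically, which can be handy if you want to flag weakly aligned stretches in a long alignment. The sketch below follows the scheme as described above (a digit n covers roughly n times 10, plus or minus 5 percent, and the asterisk covers 95% and up). Treating a zero as 0% to 5% is an assumption that extends the same scheme and was not stated in the lecture; the function names and the example PP string are hypothetical.

```python
def decode_pp(symbol):
    """Return the (low, high) posterior-probability range, in percent, for one PP character."""
    if symbol == "*":
        return (95, 100)
    if len(symbol) == 1 and symbol in "0123456789":
        n = int(symbol)
        # Assumption: '0' is read as 0-5%, extending the scheme from the lecture.
        return (max(0, n * 10 - 5), n * 10 + 5)
    raise ValueError(f"not a PP symbol handled by this sketch: {symbol!r}")

def low_confidence_positions(pp_line, minimum_digit=5):
    """List 1-based positions whose PP digit falls below minimum_digit."""
    return [
        position
        for position, symbol in enumerate(pp_line, start=1)
        if symbol in "0123456789" and int(symbol) < minimum_digit
    ]

# Hypothetical PP string for a short stretch of an alignment.
example_pp = "**98421*"
for symbol in example_pp:
    print(symbol, decode_pp(symbol))
print("weak positions:", low_confidence_positions(example_pp))
```

The threshold of 5 used to call a position weak is an arbitrary choice for the example, not a recommendation from the lecture.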
So if we go to the second one down, this gives us an idea of domain organization. So this is now what I mean by the architectures; you can see some examples. So in some of these arrangements, you've got the P450 being the major part of the protein, but you've got some other domains on the side. It's just, one after the other, slightly different collections of things. So again, the cell being very economical in trying to use the same motif in different contexts, in different metabolic pathways, to perform different functions. And if you wanted to learn more about any of these, there's a show button under each one of these architectures where you can see all of the proteins that are structured like the ones that are shown to you on this page. So a very quick way to amass sequence information on similar proteins. If we go down to the next link in the left sidebar, we have a way to actually view these as a multiple sequence alignment. So there is just a button right there that says view. And if we click on that, that brings up something called Jalview. It is an interactive viewer that gives you a bird's-eye view of what's what in this alignment. Now, all of the colors mean something, and all of these histograms mean something. We're going to come back to this at the very end of the lecture in the context of our multiple sequence alignment discussion, where I'll show you how to use this interactively and what you can learn from it. Okay, so let's pretend we're back on the summary page here. And let's say we now scroll down to the bottom. This gives us some links to other databases that contain additional information about this particular protein domain. And the one I'm going to concentrate on is the one that's right in the middle, four down, that says PROSITE. So if we click on the PROSITE link, this gives us, again, yet another executive summary.
And whenever you see these summaries, you should really take advantage of them, because it saves you a lot of time in not having to collect all of this information from the literature. To be clear, it's not a substitute for reading the literature, but it at least will help you zero in on which papers are the important ones that you should read. Those are all cited throughout. So let's look at the box at the bottom. All right, so before, in the context of Pfam, we were talking about profiles, all those numerical representations of the multiple sequence alignments. Here, instead, we have the patterns. And so you can see here what the pattern looks like; you see the same symbols being used that I described to you before. There's an invariant cysteine here, and it tells you right underneath that that's the heme iron ligand. So obviously very important to the functioning of this particular cytochrome. Now, what I think is really cool about this particular page is a little bit further up, where it says expert to contact by email. All right, so this is now a person who lives and breathes this protein family every single day. They conduct their research on this family and on related proteins, they know the literature inside and out, they've published in this area. And more importantly, they have made themselves available to you if you have any questions about this. When an expert in the field makes themselves available to you in this fashion, if you're studying this particular protein family, you would be foolish not to take advantage of that. So whether we're talking about this family or any of the other ones that are documented in PROSITE, if you need assistance, you've just gone into a new lab, maybe you don't quite have your hands around what the particular system looks like, or you've found an interesting side road that you want to pursue in the course of your research, there is somebody who is available to you just by simply sending an email.
So I think that's very cool. Okay, let's go back one screen. And you'll see at the top there's a series of tabs here, and I've now clicked on InterPro. We're still on the Pfam site, but we're now looking at an entry from another database called InterPro. And so what InterPro is intended to do is serve as an aggregator, to collect information from a whole slew of databases like Pfam, like PROSITE, all of the other ones that were on the list that we just saw, and put them in one place for you. So really intending to be one-stop shopping, a very useful thing to be able to do. So again, yet another executive summary, a little bit more expanded than what we saw before. And if we were to click on this InterPro link, we'll leave the Sanger Institute's Pfam website. We're not going to go very far; we're going to go to the European Bioinformatics Institute and look now at what we can find on the InterPro website. So it gives us some idea of what these signatures look like. So when I use the word signature, it means the same as the word pattern. They are synonymous with one another. What we see here that we haven't seen before is the relationship of this protein family to other protein families. So there are various classes of cytochromes. So this gives you some idea of parent-child relationships, subfamilies that belong to the major family of P450s. If a child has been identified, by definition it has to be part of the bigger group, because it's more specific than the parent, so it implies a match to the parent and the signatures have to overlap. If we look further down, we have some information on GO terms, some of the InterPro annotation again. Here's the first time we get to see the taxonomic span of P450. What organisms actually contain a P450 domain? Now, it's a very interesting representation. I've given you the key here. So the center is supposed to represent the root of the tree.
The inner circles are tree nodes. And the outer circles are where you see the names of representative model organisms. Each one of the numbers tells you how many sequences have this P450 domain in them. If you were to click through, you could actually download the sequences. There's absolutely no significance to the placement of the nodes on the circle, so don't interpret anything about evolution from that. But there is one thing you can interpret from this: by looking at what set of organisms are represented here, you can start to think about the evolutionary history of this particular family. When did it first emerge? Perhaps this protein family was co-opted from something earlier to perform a new role in higher organisms. So you can start to think about where it came from. Okay. Pretending to scroll down again a little bit more, we're going to look at domain architecture in a slightly different way. Here are some example proteins that match this particular protein family, this P450. The blue bar is your protein, the one that we used in the very, very initial search, way back a few slides. Underneath, you will see a bunch of colored bars. Those correspond to the motifs shown in this table. And the ones with the cross-hatches are cross-links to entries in other databases. Okay. All right. So it's pretty easy to get through this. You can start any place you want. I started at Pfam. You could start at InterPro. You could start at PROSITE. You can start any place you want. But it's really a very easy way to find out not just what sequence similarity you have to other proteins, but to look at it more from the standpoint of functional domains and structural domains. And again, by doing that and marrying it with BLAST, you might find things that you didn't find by just doing the BLAST search. As always, some additional reading for you.
So these are in Current Protocols in Bioinformatics; just by way of reminder, this is freely available to you through the NIH Library. You can just get these at your desktop and download the PDFs. The first one gives you a much more in-depth discussion of Pfam, written by the folks who have put Pfam together. The same here for InterPro. This has been written by the developers to give you some more information about how to use these tools for protein domain analysis. Okay, so let's just keep building on this now. There's yet another method I want to describe to you that uses a slightly different algorithm. And this one is at NCBI. This is called the Conserved Domain Database, or CDD. So like everything we've seen so far this morning, this helps you identify conserved domains in a protein sequence. This is something that sometimes is called a secondary database because it uses a whole bunch of individual databases in concert for the searches. So Pfam-A, which we learned about earlier this morning, but not Pfam-B. And then a set of other databases that provide similar types of information. Two of these are housed at NCBI and have been developed by Eugene Koonin's group. So COG gives you some information about clusters of orthologous prokaryotic protein families. KOG, spelled with a K, is the eukaryotic equivalent of this. So really trying to hit this from a bunch of different data sources to pull out as much domain information as we can find. Okay, this is done a little bit differently than it's done over at Pfam. The search here uses something called RPS-BLAST, reverse position-specific BLAST. So the query sequence that you put in is used, again, to search a database of these position-specific scoring tables, but the algorithm is significantly different. It is not the same method that's used by either Pfam or InterPro.
So similar to the advice I gave you two weeks ago when we were talking about scoring matrices, that you should use not one but possibly three when you do your BLAST searches, you should in the same way use Pfam, InterPro, and CDD when you're trying to look into the question of protein domains, just to cover all of your bases. Okay, so we're going to go to the NCBI website. Here is the very long URL. It's just easier to find this from the homepage, but once you get here, it's a very simple interface; all you have is a search box. There are no options here other than to tell it which database you want to search. The default, where it says CDD, encompasses all of those databases that were on the slide that I showed you a couple back. So we're going to just paste a sequence in the box. In this case it's the deleted in colorectal cancer (DCC) protein from human. We're going to click on submit query, and this is what we get. Okay, so at the very top we get a graphical summary of how our protein, the gray bar, lines up with a number of different protein families. So we see a couple of immunoglobulins. These FN3s are called fibronectin domains, and out here we have something called a neogenin domain. Well, let's pretend I have no idea what that neogenin domain is, so let's get some more information. So this is the very first one in the list here. There's a plus sign right there, and if you click on that plus sign it'll expand the view a little bit and give you some more information about what that domain is. Neogenin is a cell surface protein which is expressed in the developing nervous system of vertebrate embryos, in the growing nerve cells. You then see a BLAST-type representation of how your protein, the first line, hits this particular position-specific scoring matrix, and it's just showing you this against the consensus.
Any place you see a red position, those are exact matches; you can see that for yourself, you didn't need me for that. Any place you see blue is not an exact match, any place you have a mismatch. Now, there's more information available. There is a hyperlink right there; it's very easy to miss because of the color scheme, but if you were to click on that CD number, this would take us to the next page, and now this is the page on the CDD website that is devoted to giving you more information about the domain itself. Again, some of this is the same information we saw on the previous screen. To me, the most important part of this particular page is the part that is at the bottom of the page, and there you have a multiple sequence alignment. So pre-made multiple sequence alignments; this is not a bad thing, right? So you can now take this multiple sequence alignment, you can download it, and you can start thinking about adding your own sequences to the set or using this as is to do some phylogenetic analyses, and again, we'll talk much more about that later on today. So I think what you're starting to see is that there are a lot of things available to you that have been pre-computed. Okay, so when you're thinking about doing a multiple sequence alignment, see if there's one out there first, do some manual inspection, and see whether or not that will do what you need it to do, because it'll just save you the trouble of amassing the sequences and doing the alignment yourself. But again, manual inspection is always important. All right, so that takes care of the first part here, the one against... yes, question?

Audience member: Let's assume these databases have some hierarchical organization, where one database, like the last one, has a lot more databases combined together.

Yes.

Audience member: So why not start with the database which is a combination of all the databases, get the information, and if you need specific information, then you go to the parts of that hierarchical system?
Okay, so the question was, if you can do the search on something that pulls in a whole bunch of databases all at once, why not just start there, and then if you need more information you can go to the others. I'm going to fall back on the fact that the way the searches are done is different in each place. Each one has different collections of databases, so in this particular sphere there's not one where I can say to you, okay, this has everything and it's the best method. So you still should be using multiple websites to gather this information. It's not a matter of there being a decision tree that you should follow here. Okay, it's a good question, though. All right, so here we are. So we've talked about one against many with these three resources. Now let's flip it the other way and talk about many against one. So again, in those, we had a single sequence and we were comparing it against these profiles, these patterns, these other representations of protein families. In this method, we're going to actually construct a profile to find the distantly related proteins of interest that correspond to it. And the method we're going to use to do that is something called PSI-BLAST, and PSI stands for position-specific iterated BLAST search. Okay, so we talked about BLAST two weeks ago at length, and the first step of this basically just does a BLASTP search. So remember, the P stands for protein, so we start with a protein sequence, and in round one it's just going to do what it normally does. It's going to search the public databases for any good pairwise matches of your query sequence to the sequences in the database. So exactly the same as a BLASTP search. Once that is done, it's going to take all of those results, use them to make a multiple sequence alignment, and calculate for you the position-specific scoring matrix, and that is now going to be used as the input for the next round.
So that position-specific scoring matrix now replaces the query, and you're going to do that over and over and over again until one of two things happens. Either your search converges, meaning you've gone through rounds and even by doing extra rounds you don't find any new sequences, or it diverges, meaning your initial query was so broad that if you kept going you'd basically pull in all of GenBank. So if you were to see that the numbers are getting out of control, you just make your query a little bit shorter and you make your cutoffs a little bit more stringent. Okay, so how do we do this? BLAST homepage as before; the second one down says protein BLAST, and that's where we're going to find this particular algorithm. This page hopefully looks familiar to you now. At the very top we have a sequence; the sequence I'm using is a DNA-binding protein, the high mobility group protein from human, and the first thing we have to do is tell it what we want to search against. So two weeks ago, you'll recall that I described to you a database called RefSeq, which is NCBI's effort to represent each molecule in the central dogma once and only once: each DNA molecule, each mRNA, each protein sequence. Something that's very similar to that is called Swiss-Prot, so let's take a little bit of a side trip here. Swiss-Prot, as the name would imply, concentrates on protein sequences, and the goal is exactly the same as what's done with RefSeq: represent each protein sequence once and only once, the canonical sequence that's accepted in the field as being the one that you should use. This is a database that's maintained by folks at the Swiss Institute of Bioinformatics. It's been in existence for many, many years, predating RefSeq. Like RefSeq, it is non-redundant by definition, if you're representing each sequence once.
The important difference between what's done at NCBI with RefSeq versus what's done here with Swiss-Prot is that there is curation by the staff at the European Bioinformatics Institute, but there are also external experts, like the ones we saw on that PROSITE entry, who volunteer to keep all of these entries up to date. What those experts also do is this: in the GenBank entries for these sequences that come from Swiss-Prot, there's a set of comment lines, and again it's yet another executive summary, the thing that the expert in the field thinks is the absolute minimum you need to know about that particular protein. So again, if somebody goes through that trouble, you should absolutely take advantage of the expert's knowledge there. As with RefSeq, there's a distinct accession series. Remember, an accession number is the unique identifier for a particular sequence, the sequence's Social Security number. In this case the number starts with either an O or a P or a Q and then has five digits following, so you know which ones are your Swiss-Prot entries. So what I tend to like to do is, rather than search against NR, I will use either RefSeq or Swiss-Prot, because you just get a much tidier hit list at the end, so it's a good place to start your searches. All right, so looking a little bit further down, under program selection we're going to select PSI-BLAST so we can do this position-specific iterated technique. Again, here are the algorithm parameters that are normally hidden, but you're going to get into the habit of always expanding them, and here are some of the parameters we can set. In the first block we can tell it how many sequences to return, the maximum number of target sequences. The default is 500; I've set this to a thousand, and I think you should set that number as high as you possibly can. The expect threshold is the E value. Remember, our cutoff, our guideline for our purposes, is 0.001; the default here is 10, so I've changed that value for purposes of the example.
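As a small aside on the Swiss-Prot accession series mentioned a moment ago, here is a hedged sketch of how you might pick those entries out of a hit list programmatically. The function name and the sample identifiers are made up for illustration. The pattern follows the format UniProt publishes for the original O/P/Q series, which is a little broader than "five digits": some of the five trailing characters may be letters in more recent entries.

```python
import re

# Assumption: the published UniProtKB format for the original O/P/Q series,
# i.e. one of O, P, Q, then a digit, three letters-or-digits, and a final digit.
SWISSPROT_OPQ = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def looks_like_opq_accession(identifier: str) -> bool:
    """Return True if the whole identifier fits the O/P/Q accession pattern."""
    return SWISSPROT_OPQ.fullmatch(identifier) is not None


# Hypothetical mix of identifiers of the kind you might see in a hit list.
hits = ["P12345", "Q9Y6K9", "NP_000001", "1HMF"]
swissprot_like = [h for h in hits if looks_like_opq_accession(h)]
```

A pattern match like this only tells you that an identifier has the right shape. It says nothing about whether the entry actually exists or is still current, so treat it as a quick filter rather than a lookup.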
The matrices are here, the same options that you had with regular BLASTP before, so the various BLOSUM matrices, PAM matrices, and so on. We're filtering for low-complexity regions; by way of reminder, these are the regions of biased composition, such as homopolymeric runs of a single residue, that could confound BLAST. BLAST really does rely on a uniform distribution of residues, and when it hits these low-complexity regions it sometimes messes up because it doesn't really know how to align the two sequences. Finally, at the bottom, again another E value threshold, 0.001; the default is a little higher, so we're going to set this down here so that this one and this one match each other. All right, finally, with all of that, we're checking off the 'show results in a new window' box as we normally do, and then we go ahead and BLAST the sequence. All right, so this looks exactly like what you saw two weeks ago. You've got your sequence here, it's found a couple of HMG box domains, which are DNA-binding domains, and there's the graphical overview of your hits. Let's scroll down and actually look at the hit list. By way of reminder, you're always given the accession number at the front, a brief description (the definition line from that particular sequence entry), the E values in rank order are shown there as well, and then some links to external resources, which are deciphered for you a little bit further up. Let's scroll to the bottom of this, and you'll notice on the left-hand side it says next to each entry: new, new, new. Well, of course they're all new, because this is the first round, but this becomes more important later on. You'll also notice that there are a bunch of boxes, and all the boxes are checked off.
What that means is that in the next step all of these sequences will carry through. So if there is a sequence, especially around the cutoff, that you don't think belongs to this set because of your biological knowledge, uncheck the box. Okay, let's say I had set the E value to be more permissive, so that I actually have things with E values that are not better than 0.001. Those would come up unchecked, but I could check them and put them in. So I might have prior knowledge of something that maybe doesn't meet that E value criterion but should be included, the same way we talked about always looking on either side of the line when we talked about BLAST. Okay, so here's our first round, a run-of-the-mill standard BLASTP search. What's going to happen now is that when we click on this go button, it's going to take all these sequences, make the multiple sequence alignment, make the position-specific scoring matrix, and then use that as the basis to pull these extra sequences out. And let me back up one slide really fast: we're starting in this case with 137 sequences. Okay, so we start with 137, we click on the go button, and we do this two, three, four, in this case nine times until we finally reach convergence. How do I know that? I have a little message here: no new sequences were found above the 0.001 threshold. Okay, I started with 137, and by the time I was done I had found 178. So 41 additional sequences were found using this method that BLASTP by itself couldn't find. So I think this helps demonstrate the power of using this kind of approach, where you use the collective characteristics of the family and you keep building on that to find new sequences, and that just gives you much more power to detect things that are distantly related to the protein that you started with. Okay, that makes sense? We're good? Okay. All right, so that's it as far as the profiles, the patterns, the motifs, and the domains. Let's switch gears a little bit and talk about
things in three dimensions, the first time we're going to do this in this course. I know that people tend to approach anything in the structural realm with a little bit of trepidation, because in most people's heads this is normally the domain of the geek down the hall with the Coke-bottle glasses and the machines with all the knobs on them. But what I'm going to show you is actually very simple to do and very accessible to all of you; you can do this at your desktop. So, just some preliminary tenets. We've all heard this one (Chris Anfinsen won the Nobel Prize for it): the tenet that sequence specifies conformation. Okay, depending on the string of letters you have, that will make, for the most part, a particular structure. The converse, however, is not true: conformation does not specify sequence. There are examples in the literature of protein structures that are similar to one another, that can be superimposed pretty well on top of one another, but when you look at them at the sequence level, the sequence similarity is down in the 25 to 30 percent range and even lower sometimes, so these are things that you would not find if you were doing a BLAST search. So why is this? We know, of course, that structure is conserved to a much greater extent than sequence; the shapes have to be exactly what they need to be in order for the proteins to function in whatever metabolic pathway they are in. And again, by extension, we might not find similarities between proteins using traditional methods, and I mean BLAST by that, so things that are below the 25 percent identity threshold. All right, so now how does this actually work? There are currently about 79,000 structures in the Protein Data Bank. About 73,000 of them are either X-ray or NMR solved structures that are just the proteins themselves, not complexed with anything. And when you have a problem of that size, trying to compare 73,000 of anything, with these complex kinds of structures, to the other 73,000, this
really becomes a computationally, ridiculously intractable problem. So we have to make some assumptions and some simplifications in order to bring some computing power to bear and be able to compare all of the structures to all of the other structures. What VAST does in its very first step is take all of the known X-ray or NMR structures from PDB, something that might look like this, and identify where the alpha helices are and where the beta strands are, and every place it finds one of those it puts a vector right through the middle of it. So I might have an alpha helix; the vector is going to go right through the center of that alpha helix, going from the N-terminal end to the C-terminal end. Same thing with the beta strand: we're going to approximate the path of that beta strand, N to C, by one of these vectors. The method takes into account the directionality of each vector and which vector it then connects to, so also the connectivity between all of the vectors. At that point it throws every single shred of crystallographic information away, and all you're left with is a series of vectors that represent the original structure, and again we know which ones are alphas and which ones are betas. What the folks at NCBI then do is what I could best describe as a glorified game of pick-up sticks. So here we have two proteins, protein one and protein two, and we're going to just assume for the sake of simplification that these are all alpha helices. So here's one, two, three, four; here one, two, three, four, and five is over here. And we're going to try to overlay these every way possible. Of course this is all done by computer, but here, let's say, is our first alignment of all four of these alpha helices to one, two, three, and four in the second protein. They're all going the same way, they're pretty much on top of each other, so we could consider this a good
alignment. So now let's do another one: one, two, three, four from protein one, but now one, two, three, and five from protein two. As before, the first three are overlaid pretty well on top of each other, but five is off doing its own thing over there, so we would not necessarily consider that to be a good alignment. And this just goes on over and over, any way you can combine all of these things; the correspondences are made, and what you end up with is something like this. So this is an example that I've pulled together comparing a leucine-isoleucine-valine binding protein (the accession number is 2LIV) to just a leucine binding protein. So one just binds leucine; the other one binds all three of those residues. And remember what we just did: we threw out every single atomic position, years of somebody's work just went out the window, in order to be able to do this comparison, and as you can see you end up with something that looks pretty good, even though we've made a lot of simplifying assumptions. Just to give you an idea of how to look at these: in this particular view (we'll talk about the views in a moment) the structure is shown as what's called tubes, this type of representation, but the coloring is by identity. So any place we have a red bar or a red position in the alignment, there's a correspondence in colors between the sequence window and the structure window: red is a match, blue is a mismatch. What I've also done here is take my cursor and highlight a bunch of residues here, and whatever I've highlighted appears here in yellow. Okay, so this is a really fast way to see how these two structures compare to each other. Can I start to figure out why one binds only leucine and why the other one also binds isoleucine and valine? If I'm interested in what is in a particular region, I can very simply find out what's where in a structure just by highlighting the residues. It's just as easy as that. So it's a really, really
powerful tool, the Cn3D viewer that I'll talk about. Okay, so before we get into how to find all of this information, just a couple of reminders. It's not the best method for finding structural similarity, because we threw all the crystallographic information away. It was the basis for making these vector calls, but we've reduced all of this to a series of vectors, so it really is a big loss of information, which in turn gives you less confidence in the prediction. But this is a matter of degree, if you look at the example that I just showed you. So regardless of the simplicity of the method, it gives you a really nice, simple, fast answer to the question of structural similarity, and if you find something of interest that you want to pursue, then enter into a collaboration with somebody who might be able to start doing somewhat more complicated things for you. All right, how do we get here? We're back at the NCBI homepage, the URL as always, and at the top is the search engine; what I've picked off the pull-down menu is Structure. The PDB accession number, which takes yet another form than we've seen before, a number and three letters, is 1HMF. When I issue the query I get back one and only one entry, because I used the accession number, so that's what you would expect. In this case what I've pulled out is this high mobility group binding domain, this DNA-binding domain, in this particular case from rat. If I click through just by clicking on the title, it brings me to a page that looks like this, something that we've not seen before. I have three different windows here. The first one just says, all right, are there any known interactions between this particular protein and any of the other ones for which there is a solved structure? In this case all we see is the A, so there are none within that subset of 73,000 proteins. I'm given a quick molecular graphic here, so I get an idea of what this protein looks like, and I have ways to now either view the
structure directly, and we're going to do it a little bit differently, but you could just click on View Structure and go straight to that. What I want to start to think about, though, is what other structures, by virtue of this VAST method, this comparison method, are similar to the one that I started with. Okay, so if I click on that VAST button, it pops up another box here. If there are multiple chains in your protein, or multiple elements, say if it's bound to DNA, it's going to ask you which one you want to do the comparison off of. Here we only have one, so we're just going to pick entire chain, and we get this. Okay, not particularly graphically appealing, but it does the job. All right, so your protein is the one in this purplish-orangish color. It shows you where, again, this HMG box DNA-binding domain is, and then going down from here it shows you all of the other proteins with solved structures that were deemed similar to the one you started with by virtue of this VAST algorithm. You'll notice some of them have no gaps, so the similarity is across the entire length, but then as you get further down the list we start to see gaps. So the point to be made here is that we can find similarities either on a global level, the whole protein to another whole protein, or just in important structural or functional domains; it doesn't have to be all or none. There are ways of recasting this so that you actually get statistical information, and you know that I'm big on that, but for purposes of time I'm not going to go through that today. In this third row of the table it shows you the settings you can use, and I will give you the Current Protocols reference at the end of this section, where I explain in great detail how to actually decipher those tables. Instead, let's just go right ahead and look at this structure. We look at the structure just by clicking on this 3D alignment, and this is going to spawn the Cn3D viewer. The viewer is freely available from NCBI just by clicking on this Download Cn3D button. It'll run on a Mac,
on a Windows machine, and on Linux, so you can just download the appropriate flavor, which should of course be Mac. All right, so when I do this I get something that looks like this. This is the default view, and I call this the seminar view, because if you're giving a talk it's a very nice way of having your audience get familiar with the structure that you're talking about. Just as an aside, on the bottom here there are two settings in the Cn3D viewer, again described in the Current Protocols chapter. One of the settings is rendering, the other one is coloring, and these are the settings that give you this view. Every place you see a Crayola crayon in the middle of the structure, that's an alpha helix; the flat end is the N-terminal end, and the pointed end is the C-terminal end. We don't have any beta strands in this particular structure, but if you did, you would just see a flat brown ribbon. Now again, this is a nice way to orient people to what your structure looks like; it's easy to understand. But of course proteins floating through the cell don't have a Crayola crayon in the middle of them. So what we want to try to do is figure out a way to see the real shape of this protein, the surface-accessible shape of this protein in the cell. By changing these two settings from what's shown on the left to what's shown on the right, we end up with something that looks like this. Okay, every place you see blue, that is a positive charge; every place you see red is negatively charged; the gray is all neutral. You'll recall that I told you this was a DNA-binding protein, and I have purposely rotated this in a certain way to show you that there is a big cavity over here. You can guess what goes into that big cavity: that's where the DNA goes. There are also two aromatic residues, two positively charged aromatic residues, that poke right out into that space. Okay, so now you can start to think, well, how does this thing bind to DNA? Those two positive residues actually
intercalate with the negatively charged DNA backbone to facilitate the binding. Okay, so now, by going to this level of looking at structure rather than looking at the sequence of letters, a whole different set of things opens up to you. You can see which residues are on the surface, which ones are buried, and which ones may be near active sites or binding sites. You can start to design your experiments more rationally. So if you're doing gain-of-function or loss-of-function experiments, instead of randomly choosing residues, which is what we used to do years ago, you can make more selective choices about what parts of the molecule you want to mess up. And you can also start to think about the clinical implications of mutations. Okay, you can highlight particular positions, see where the position of your mutation is, and see, well, is it someplace that actually is in the business end of the molecule? That might in turn start to give you some clues about why you're seeing the clinical phenotypes that you're seeing. So, yes? That's okay. I've never heard of a DNA-binding protein amino acid residue intercalating between the bases of DNA. This is the first time I have heard it. Is it common? I've never heard this before. Yeah, we can talk about it later. This is my prior life; I used to study these things, so we can talk about it later. Okay. All right, so we only spent maybe five or so minutes talking about this particular tool. I want you all to download this tool onto your Macs and then take your protein of interest and play with it. Okay, with the Current Protocols unit that gives you a step-by-step, do this, do this, do this, it's incredibly easy. It's something you shouldn't be afraid of doing just because you're working at the structural level, and again, you're going to start to find things that you just can't even think about finding when looking at those aligned sequences of letters, whether we're talking about just
pairwise comparisons, domains, what have you. So it's a very, very powerful tool to have in your arsenal, one that most people don't have in theirs, so it'll give you a little bit of a one-up over your competition. Of course, there is the question of what you do if there is no solved structure for the particular protein that you want to study. There are de novo predictive methods, ab initio predictive methods, that are available to you. They are beyond the scope of today's discussion, but I do give you this reference if you are interested in delving into that part of the world. Okay, all right, so that's it for number two today. Finally, the multiple sequence alignments, the grand finale. What is great about this topic is that it's going to serve as a nice way to unify everything we've talked about over these two sequence analysis lectures; it brings it all together. All right, so what can you learn by doing a multiple sequence alignment? Obviously, once you've got all of these characters lined up, all of these either nucleotide positions or amino acid positions, it allows you to identify where there are regions of conservation, whether you can identify new patterns, and where the domains are. So, same as before, these will allow you to better design your experiments, because you see where the conserved residues are; they've been conserved for a reason. It helps you to start predicting structure and function, especially if you add an unknown to the set, the same way as when we were doing the one-to-many comparisons earlier in this lecture, and it allows you to confirm whether or not a newly discovered protein is actually a member of a particular protein family. It also gives you a really good basis for predicting secondary structure. There are ways to take sequences and predict where the alpha helices and beta strands are; this provides confirmatory evidence for those predictions. It is an absolute must if you're performing phylogenetic analysis. You can't do it without a multiple sequence alignment. That is
the input to every phylogenetic analysis program. And as we saw earlier, we could generate, if we wanted to, these position-specific scoring matrices to use in our much more sensitive sequence search methods. All right, so what do we need to consider when we're going through the effort of constructing one of these multiple sequence alignments? We are of course going to look for positions where we have absolute sequence similarity, invariant positions where we can line up as many common characters as possible. We're going to look for conserved positions as well. We don't want to just focus on places where there is absolute conservation, because there is always something to be learned from those positions where we see amino acid substitutions, where one residue could possibly substitute for another one without adversely affecting the function or structure of the protein. And finally, we have to think about structural similarity. So even though we're going to amass a set of sequences to make these alignments, you should also look in the structure databases (you've just learned how to do that) to see whether there is other information about where the alpha helices are, where the beta strands are, and where the loop regions are, because that is going to inform you as you go along about where the gaps can and can't go. All right, we tend to do this at the protein level rather than at the nucleotide level because of what's usually called this 20-versus-four thing. If you look at the side chains of the four nucleotides, they're not that dissimilar from one another, but of course when you look at the 20 side chains of the amino acids, they're vastly different from one another. So we tend to go in that direction because of the chemistry behind all of those individual letters in the alignment, and if you did a protein alignment you could then translate back to nucleotide sequences after you do the alignment. Now of course this depends on the context that you're doing this in. So in a couple
of weeks you're going to hear from Laura Elnitski, who's going to tell you about upstream regulatory elements. If the subject of what you're doing is strictly at the nucleotide level, then that's where you work. Okay, all right, so this next slide is just a 30,000-foot overview of all of the things you have to do when you construct an alignment. Don't let that scare you; it's not as daunting as it looks on the slide. Very quickly: first you need an input set of sequences, so we're going to find some sequences through our database searches, which you now know how to do, using some reasonable E-value cutoff. You then actually run the program. You're going to take a look at the output; you're not going to just accept the output as it is. Inspect it and assess the quality of the alignment. If you see that there are problem sequences that really throw off the alignment, where having one in there just messes everything up, you take it out, okay, and then you redo the alignment. Once you have a good alignment, you can then add the problem ones back in one by one. And finally, once you have that nice alignment, you can go about the business of interpreting it. So let's tick these off. We'll start with selecting the sequences. You want to start with a reasonable number. Remember that this is a global alignment method. We talked about global versus local; to remind you, with a global alignment method the alignment is across the entire length of the two sequences being compared. In this case it's more than two sequences being compared, and because of that the compute time goes through the roof; you'll see the equation in a moment. So if you have too many sequences in the set, it just takes a very long time, and sometimes it doesn't yield good alignments. Also, if you're trying to do a phylogenetic study and you have too many things in there, you won't be able to interpret the trees that you get at the end of the day. You have to have a reasonably sized data set. So what I
offer you is a good starting point: start with about 10 to 15. That's a good place, and you can add in as you go. As a ballpark upper limit, and there's nothing holy about that number, 50 or so tends to be a good place. Okay, all right. Sequences, because it's a global alignment, should be of the same length; that's one of the requisites for doing a global alignment. But you may not want to take the sequences directly as they come out of the databases. You may want to trim them down just to look at the regions that are important to you, whether from a structural point of view, an evolutionary point of view, or a functional point of view, and you can do that by relying on the results you get from either BLAST or PSI-BLAST, which will tell you immediately which regions are conserved. All right, when you select the sequences, you want to have a combination of closely related sequences and divergent sequences. The closely related ones are going to tell you where the absolutely conserved or very highly conserved positions are, what's required in that particular alignment. If you go more divergent, it gives you some good information about evolutionary relationships. So again, as a good starting point, you want to be in that 30 to 70% ballpark. If you push too far in the direction of having things that are absolutely identical to each other, well, there's no point in doing the alignment; you haven't learned anything. So it's really about going in the opposite direction to try to find those new relationships. So again, the most informative alignments result when you've got things that are not too similar but not too dissimilar as well. All right, so you now run the program on the sequences you pulled together given those considerations. The program will always give you an alignment; it'll always produce something. The question is whether what it produces for you actually makes biological sense. So, some of the things to think about when you look at the quality of that alignment: you're looking again for
conservation of residues across the alignment. You're looking for conservation of physicochemical properties, so this is again what can substitute for what. You hopefully will see a relatively neat block-type structure; things should line up nicely. You might have some gap regions, but you should have some pretty long regions of alignment, and you want to avoid excessive numbers of gaps. Remember that any time you see a gap, it represents a biological event, either an insertion or a deletion, so you can't just put them in because it makes things look nice; you have to keep them to some sort of reasonable number. As I've already alluded to, if the alignment that you get in the first round is good, just go ahead: you can add some new sequences to the set and then realign them. But if you see that one of them is really messing up the alignment, just take it out. Especially if you see the inclusion of a long gap in a single sequence, just take that sequence out, do the alignment without it, and then try to manually put it back in. And that's an important point. We tend to rely on these methods to do the work for us, but when it comes to a multiple sequence alignment, you're going to have to do manual alignment as well. So you can use the program's output as your starting point, but then, using things like the viewer I'm going to show you, you can actually move residues back and forth, especially to deal with these problems, using the structural information you have and the information that's already in the alignments to line up those problem sequences in the best way possible. Okay, so we've already talked about this: you've already seen examples of expertly created, in quotation marks, multiple sequence alignments that are available to you online. Cross-check what you get versus what they've got, okay, and you can use that information to help refine the alignment you just did. And again, anytime you've got information from solved X-ray or NMR structures, use that to nail down the structurally important regions, to say, all
right, here I can put the gaps and here I can't. All right, so how do you now finally interpret all of this? You've got a nice alignment, you've been very careful about how it's been put together; what can you learn from it? Anytime you see an absolutely conserved position, it's obviously required for proper structure and function. If you see relatively well conserved positions, those are the positions where you can tolerate a little bit of change without really messing up either the structure or the function of the protein. But you shouldn't discount places where you don't see conservation. These are places where I would say you should look maybe more carefully than the places that are well aligned, for clues as to evolutionary innovation: the places that can mutate freely and possibly give rise to new proteins with new functions, so the orthologs and the paralogs that are related to the rest of the ones in the group. The gaps, once again: anytime you see a gap-free block, it probably corresponds to a region of secondary structure. Anytime you see a gap-rich block, especially if we're talking about certain classes of proteins, those are usually unstructured or loop regions. Okay, questions? I just threw a lot at you. All right, so how do you actually do this? The method that I recommend, the most widely used method, is something called ClustalW. It is a great method for doing automatic multiple alignments of either nucleotide sequences or protein sequences. It's very fast; we'll see how it uses the scoring matrices in a moment. You can bias the location of the gaps in advance to take into account the considerations that I just gave you, so if you know in advance that there are places where you can't put gaps, you can tell it, don't put them here. And it works with a very nice Java applet called Jalview; you saw an example of that earlier, and we're going to play with that a little bit in the remaining moments we have. So, to take this again out of the realm of the black box, how does this
alignment method actually work? It doesn't matter how many sequences you have in the set. It will only align two sequences at a time and then just gradually build up the alignment based on all of those sub-alignments. It uses the scoring matrices in a very similar fashion to what is done for BLAST, which we talked about two weeks ago, to calculate the alignment. So because of the way it does this, it's a generally fast method, and what you get back is generally of high quality. So to put a finer point on it, let's say we have four sequences in our input set: A, B, C, and D. The first thing that it's going to do is compare every which way, every combination of those four sequences to one another, and calculate a similarity score. Now, just to give you a sense of the size of the problem, I alluded to this earlier: if we have four sequences, there are six possible alignments when you move them around. But once you get into these bigger numbers, you see how quickly, based on this equation, the number of alignments grows. So that's why you want to keep the number of sequences in your input set to a reasonable number, because it just gets slower and slower and slower. All right. So we're going to compare these every which way. The numbers are here in the table, so A, B, C, D going across the top, the same going down the side. Of course, the comparison to itself, on the diagonal, will always be 100%. Based on the numbers off of the diagonal, it's going to look for the highest numbers in how we compare these sequences with each other. So the highest-scoring pairs here are A with B at 80% identity and C with D at 92% identity. So in the first step, it's going to take A and B and align those two to each other, take C and D and align those two to each other, and then these alignments become fixed. So in the next step, it's really treating A and B as a single sequence, AB, C and D the same way, and it's going to align this alignment to this alignment, going out as far as it has to go. We only have four here, so that's it, but if you
had more, you would just do this over and over and over again, starting at the tips of the tree, going all the way back to the root, trying to put these sub-alignments together into a good global alignment. So that's how it works. What's nice about this is you do the easy alignments first. If you start with the things that have the highest percent identity, the chances are those alignments are going to be correct, and then you can use that information regarding whatever conservation you see at each position to help with the more difficult alignments as you push from the right-hand side to the left of that tree, to the things that are more distantly related. Some of the pitfalls, though: if the initial alignments are on distantly related proteins, it may get it wrong, and then it's going to build on that, and so it's going to keep propagating the error throughout the alignment. So once the alignment's fixed, it's not reconsidered. So again, you're just going to keep pushing this alignment along, something called a greedy algorithm. Now, what's nice about what the folks at Clustal have done is they've provided a way, a "remove first" option, that allows the tree to be recalculated. So if there has been an error at the beginning, you still have a way of fixing it later on. But some of the progressive alignment methods don't have that, so that's just something to keep in the back of your heads if you choose to use a different method. All right. So once we use the method, it's going to calculate these scores, and of course it's going to give us an alignment. If you're interested in doing phylogenetic analysis afterwards, you can output the alignment in any one of a number of formats that are used to do these phylogenetic analyses. You also get two tree representations of your alignment. The first one's called a cladogram. So this is a tree that is assumed to be an estimate of the phylogeny. The word estimate is important here. The branches are of equal length, so it gives you
sort of a first impression of common ancestry, but because the branches are of equal length, you don't have a sense of what the evolutionary time is between each protein or each taxon. The opposite, the phylogram, is also assumed to be an estimate of the phylogeny, but the branches are not of equal length. So we now have something that indicates to us some proportionality in time: the branch length represents the amount of inferred evolutionary change. So these are nice. These are used to construct the alignment, to make the alignment better. But the most important thing to remember is these are not phylogenetic trees. Okay? I will show you how to make one by the end of the lecture today, but these are just two representations. Don't interpret them as true phylogenetic trees. Conservation patterns are shown on the slides, things you would expect because we're keeping physical chemical characteristics in mind. So aromatic residues tend to be lined up, things that carry either a plus or a minus charge, ends of helices, catalytic sites, and the all-important cystine cross-bridge. All right. At the bottom of the alignment, you're going to see a bunch of symbols that represent to you, position by position, how good the alignment is at that particular position. If you have a column in that alignment that is entirely conserved, you're going to see an asterisk at the bottom, and you want to see that at at least 10 percent of the positions. That's a generally accepted indication of a good alignment. If you see a colon instead, that's a conserved position, using the definitions from the previous slide, so things with strongly similar properties that can substitute for one another. And then there's this semi-conserved thing here, which is weakly similar properties, all in that same column. And the easiest way to describe that is, if you're looking at a conserved position, if you refer back to the previous slide, everything in that column is from one of those groups. The semi-conserved, you might
have things from two of those groups. Okay. But it's not obvious from the title there. All right. How do you do it? So we're going to EBI again. Here is the URL for the Clustal W website. I'm using a set of five FOS proteins as my input, and again, these are on the sequence site that I'm encouraging you to do the examples from. All right. So again, we've got lots of options. I'm just going to pretend I've clicked on both of the More Options buttons here to expand this page all the way. The first thing it asks you: do you want to do a slow alignment or a fast alignment? You always want to do a slow one. Okay? It may take more time, but this uses an algorithm that is more careful in doing the alignments and putting things on the tree that I just described to you the right way. So it's worth the extra time, and we're nitpicking with what slow and fast mean here. Underneath, you can pick which scoring matrix you want. When we talked about BLOSUM two weeks ago, you'll remember it was always BLOSUM and a number, and that number was used to tell you something about the evolutionary characteristics of that matrix. You'll notice here it just says BLOSUM, and there is no number. The method itself will choose which of the BLOSUM matrices is most appropriate for each one of those pairwise alignments. So you use these as a series rather than just picking one single BLOSUM matrix, and that's actually a good thing here, because you might have things that are very divergent or very closely related, so it'll make the adjustments along the way. You're then given an opportunity here to reconstruct the tree. So this is that "remove first" thing that I alluded to a little bit earlier. So even though you've now constructed the tree, you have a chance to reconstruct the tree. So if, in this iteration pull-down here, you pick tree, it does it at each step. If you pick alignment, it only does it at the final step. The default number of iterations, of how many times it's going to do this, is one.
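As a reminder of what that tree is built from, here is a minimal sketch of the first steps of the A, B, C, D walk-through from earlier. It is not Clustal W's actual tree-building code. The formula n(n-1)/2 is assumed because it matches the six comparisons mentioned for four sequences; only the 80% and 92% identities come from the lecture, the other four values are invented, and the function names and the average-identity rule for groups are hypothetical stand-ins.

```python
from itertools import combinations

def pair_count(n):
    """Number of distinct pairwise comparisons among n sequences."""
    return n * (n - 1) // 2

# A-B (80) and C-D (92) are the values quoted in the lecture.
# The remaining four percent identities are made up for illustration.
identity = {
    ("A", "B"): 80, ("C", "D"): 92,
    ("A", "C"): 45, ("A", "D"): 47,
    ("B", "C"): 42, ("B", "D"): 44,
}

def score(group1, group2):
    """Average percent identity between two groups (a hypothetical rule)."""
    values = [identity[tuple(sorted((x, y)))] for x in group1 for y in group2]
    return sum(values) / len(values)

def join_order(names):
    """Greedy order of joins: always merge the two most similar groups next."""
    groups = [(name,) for name in names]
    order = []
    while len(groups) > 1:
        g1, g2 = max(combinations(groups, 2), key=lambda pair: score(*pair))
        groups = [g for g in groups if g not in (g1, g2)]
        groups.append(g1 + g2)   # the merged pair is now treated as one unit
        order.append((g1, g2))
    return order

if __name__ == "__main__":
    print(pair_count(4))
    for step in join_order(["A", "B", "C", "D"]):
        print(step)
```

With these numbers the closest pairs (C with D, then A with B) are joined first, and only then are the two sub-alignments joined to each other, which is the "easy alignments first" idea. Once a join is made it is never revisited, and that is the greedy behavior the iteration option is meant to soften.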
I've chosen 10 here. Okay. Once we're satisfied with all of that, we're going to just go ahead and submit the alignment, and there is your alignment. Okay, very psychedelic, multicolored. The colors of course mean something, so in your handouts you have the key to tell you what's what here. So red is small and hydrophobic, and you can read the rest for yourselves. We're going to go across some of the tabs here at the top. So right now we're on the alignment tab, and before I leave here, you'll notice at the bottom here are all the codes that I described to you: the asterisks are the absolutely conserved positions, the conserved positions have the colon, and the semi-conserved positions have the dots. Okay. All right. So we're going to go from alignments to guide trees. So this is now the cladogram that I had described to you, giving you a branching pattern for these five proteins. All of these are of the same length, so we don't have an indication of evolutionary time. If we now click on this Show as Phylogram Tree button, we now do have an indication of evolutionary time. But again, this is not the phylogenetic tree, and I don't want you to interpret these this way. It's just sort of a first approximation and, again, used to calculate the alignment. If we now go from guide tree... Yes, I have a question away in the back. Question, two slides back? Two slides back. What are the character... I can barely hear you. What are the characteristics? Yes. So the question is, all right, I've been very protein-centric here as far as the characteristics that you should use for looking at these. It's a simpler problem again when you're looking at the nucleotide level, because you only have four characters, four letters, that you are trying to align. I would still tell you to use the same set of considerations about looking for closely related versus distantly related, but you're going to do it in a different way now. I mean, you're going to do it as far as looking at organismal spread, things
that are either close together on the evolutionary tree versus things that are distantly related. So it's the same considerations, but now thinking in terms of phylogenetics rather than thinking in terms of physical chemical characteristics. Okay? And we can talk a little bit more about that later. That's a great question. Okay, so let's go back forward. Okay. Results summary: here are all the scores that were used to construct the alignment. But now we have an opportunity to play with the alignment, and we're going to just click on this Start Jalview button to start the Java applet. So it's used to interpret the alignments, and it's used to manually edit the alignments. It has the same types of coloring schemes as we've seen before. You can generate pairwise alignments, so you can select individual pairs of sequences and just see how the two align with each other. You can figure out what consensus sequences are. If you have redundant sequences or recalcitrant sequences that don't align well, you can remove them from the alignment. And again, this is going to be the basis for us calculating our trees. All right. So we've clicked that button. It's going to launch a new window on your desktop. This is the default view. So here are the five sequences we started with and a bunch of histograms here. So here the coloring is similar to what we saw before. The three lines here are intended to represent to you, position by position, the quality of the alignment. The conservation line is, well, how conserved is the total alignment, and a better way to think about this is that it's just an indication of the percent identity at that position. The quality is the alignment quality, pairwise, pairwise, pairwise through this whole thing, based on the BLOSUM scores. And finally, what I think is most important to people is, once you build this, here's your consensus sequence. Wherever it can call a consensus residue, it does. Where it can't, if there's no majority rule, you'll see a
plus sign instead. Okay. Now, in this really teeny tiny type up here, you've got menus. Believe me, you've got menus up there. Okay. So if you now click on Color, and from that Color submenu you pick Percentage Identity, it changes the color scheme. So everything we have here is now in blue, and it gives you an indication of how conserved each position is. So rather than looking at the physical chemical characteristics that we saw in the previous view, we're now looking at just strict conservation measures. The darker, the better. So 81 to 100 percent is the dark blue, and you can see some examples of that here, and the lighter it gets, the percent agreement in that block goes down. When you go back to your labs and do this example, scroll all the way to the right-hand side, and you're going to see some huge dark blue blocks, and that immediately should catch your eye as the things that are most important in this alignment to either the structure or function of these five FOS proteins as a group. Okay. We're going to change this again. So if we go to Calculate in these menus up here, Pairwise Alignments, I've picked two, FOSB from human and FOSB from mouse. So it just takes those two out, does a pairwise alignment, and tells you that they're 95.8 percent identical to one another. And this is the characteristic view that we've all now gotten used to looking at, alignment by alignment, position by position. All right. Here's the tree, finally. So we go back to this menu, to Calculate, to Calculate Tree, to Neighbor Joining Using BLOSUM62. All right. And this now gives you your first true evolutionary tree, and you can see how these various proteins are now evolutionarily related to one another, based on the sequence information that we have from constructing this Clustal alignment. Okay. Now, for those of you who have done phylogenetic analyses before, you know that this is much more complicated than just picking something off of a menu. I would treat this as a first approximation.
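Going back to the consensus row, the percentage identity coloring, and the pairwise percent identity from a moment ago, here is a minimal sketch of how numbers like those could be computed from an alignment. This is not Jalview's code. The toy sequences are invented, the function names are hypothetical, and two rules are assumptions: a tie for the most common residue is shown as a plus sign, and pairwise identity is counted only over columns where neither sequence has a gap.

```python
from collections import Counter

def pairwise_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    same = sum(1 for x, y in columns if x == y)
    return 100.0 * same / len(columns)

def column_summary(alignment):
    """Return (consensus string, per-column percent identity).

    The percent for a column is the share of sequences carrying the most
    common character in that column. Gaps are counted like any other
    character here, which is a simplification.
    """
    consensus, percents = [], []
    for column in zip(*alignment):
        counts = Counter(column).most_common()
        top_residue, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        consensus.append("+" if tied else top_residue)
        percents.append(100.0 * top_count / len(column))
    return "".join(consensus), percents

if __name__ == "__main__":
    # Invented toy alignment, not the FOS proteins from the lecture.
    toy = ["MKLSA", "MKISA", "MRITA", "MKLTG"]
    print(column_summary(toy))
    print(pairwise_identity(toy[0], toy[1]))
```

In the toy alignment the first column is fully conserved, the third and fourth columns have no single majority residue and so get a plus sign, and the remaining columns fall in between, which is the kind of gradient the blue shading is showing you.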
There are methodologies that use things like neighbor joining and some other methods, but do it in a way where you generate thousands of trees looking for consensus trees and use bootstrap methods to make sure that the branching pattern is reliable. But again, this gives you a nice first estimate of what the tree actually looks like. So a couple of things I want to leave you with before we end today. Knowing now how these methods work, you know there's no best method for doing a multiple sequence alignment. This is the one that is most commonly used. I think it's a really good method, but there are other ones out there as well. So you should certainly try to use more than one method, especially if you're going to generate trees at the end of the day, because all of the methods, as I've hopefully tried to get across to you, have their strengths and have their weaknesses. Okay. Some more information on Clustal is in the Current Protocols unit, including how to make best use of that Java applet, the Jalview viewer. There's another method that's also described that's called T-Coffee, which automatically uses some structural information to inform how the alignments are done. Okay. So before I leave you, just some words of wisdom. We've covered a wide swath of methods, and again, I've tried to explain to you what is in the black box here. And I just want to remind you one last time that when you do these analyses, you put the sequences in, some results come out on the other side, but you should always couple what you get out of this box with what you know from what you do in the lab every day, from what you've read in the papers, from what you've learned in seminars, your own hands-on knowledge of these biological systems. Because once you do that, if you couple this with what you know with your biology hat on, that will always, always serve you well. Okay. So with that, a program note: just a reminder, we do not have a lecture next week. There's another group in the hall
next week, so we're going to reconvene in two weeks' time, when Laura Elnitski from NHGRI will take us back to the genomic level. We're going to be talking about regulatory and epigenetic landscapes of mammalian genomes. So I'm happy to take any questions from the podium. Thanks for coming, and we'll see you in two weeks. All right. Okay.