Okay. Good morning, everyone. Thanks again for joining us for this third week of Current Topics in Genome Analysis. We're going to pick up right where we left off last time in looking at different ways to think about sequence data. First, we have to get the requisite disclosure out of the way once again. I don't have any relevant financial relationships to disclose this morning. Okay. So where we left off last week, we were talking quite a bit about homology searches. And in thinking about that, we spent a lot of time talking about BLAST. And you can think about BLAST as essentially a one-against-one way of looking at sequence comparisons. You start with a sequence of interest and you compare that, one by one, to a collection of sequences in some public database or perhaps in a list of sequences that you've compiled yourself. So really, again, one against one. Another approach is to instead look at collections of protein families and look at those in what you could call a one-against-many or a many-against-one sort of way. The reason for doing this is that we can gain a lot of information by looking at the collective characteristics that we find by considering an entire protein family as a single entity. We can identify new members of a protein family that may not have been found using the standard BLAST-type searches. We can also find some commonalities in domain architecture by doing so. So we're going to discuss four methods this morning: two that are one against many, Pfam and CDD, and two that go the other way around, many against one, PSI-BLAST and DELTA-BLAST. So like we did last week, let's just go ahead and start with some definitions. The first term we have to define is something called a profile. And all a profile is, is a numerical representation of a multiple sequence alignment. So we're going to take a set of aligned sequences and represent that as a numerical matrix.
The reason we can do this and capitalize on it in our sequence comparisons is that it allows us very quickly to identify patterns or motifs that are conserved across the entire protein family, the common characteristics that define that protein family. What that in turn allows us to do is find similarity between sequences that have little or no overall sequence similarity, and I'll give you the same example I gave you last week with respect to the homeodomain proteins. If you look at that protein family, the family is characterized by a block of 60 amino acids across a multiple sequence alignment. But there are very few positions in that block of 60 that are absolutely conserved across the entire family. And so with traditional BLAST searches, more than likely you wouldn't identify all of the members of that particular protein family. So when you're trying to look at distantly related proteins, these are now the tools that we're going to use to try to tease those out. So these are really good techniques to have in your arsenal. So how does this actually work? How do these profiles get constructed? Let's consider that we have a multiple sequence alignment here. So we've got all of these sequences, 10 amino acids across. You'll notice some of the positions are more conserved than others. So we've got a G that we always see in the 10th position. In the next-to-last position, we normally see a proline, but sometimes there's a threonine thrown in for good measure. So we look at this multiple sequence alignment and we're going to ask ourselves four questions. Which residues do we see at each position? So which ones can or cannot be at a particular position in this protein family? What's the frequency of those observed residues? How often do you see a given residue? Is it an absolutely conserved position, or is a little bit of wiggle allowed? Which positions are conserved? And where can the gaps be introduced?
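As a rough illustration of those four questions, here is a minimal Python sketch that tabulates, for each column of an alignment, which residues appear, how often, whether the column is fully conserved, and how many gaps it contains. The toy alignment and the function name `column_summary` are made up for this example; it is not the alignment on the slide, and it only collects the raw counts that a real profile-building program would start from.

```python
from collections import Counter

# Hypothetical toy alignment: 5 sequences, 10 columns, "-" marks a gap.
# Column 10 is always G; column 9 is mostly P with one T.
alignment = [
    "MKV-LTAEPG",
    "MRVALTSEPG",
    "MKI-LSAEPG",
    "MKVALTADTG",
    "MRV-LTAEPG",
]

def column_summary(alignment):
    """Summarize each alignment column: residue frequencies, gap fraction, conservation."""
    n_seqs = len(alignment)
    summary = []
    for column in zip(*alignment):          # iterate column by column
        counts = Counter(column)
        gaps = counts.pop("-", 0)           # separate gaps from residues
        residues = n_seqs - gaps
        freqs = {aa: c / residues for aa, c in counts.items()} if residues else {}
        summary.append({
            "freqs": freqs,                              # which residues, how often
            "gap_fraction": gaps / n_seqs,               # where gaps occur
            "conserved": gaps == 0 and len(counts) == 1, # absolutely conserved?
        })
    return summary

for position, info in enumerate(column_summary(alignment), start=1):
    print(position, info)
```

Turning these frequencies into actual scores (log-odds against background frequencies, pseudocounts, sequence weighting) is what the automated tools do; that step is deliberately left out here.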
Once you answer all four of those questions, there's an automated method that will take these multiple sequence alignments and generate something called a position-specific scoring table or a position-specific scoring matrix. And this is now what we're going to use as the basis for comparison to an individual sequence of interest. Let's just quickly talk about what all of these numbers mean, the same way we talked about the BLOSUM matrices last week. So again, if we look at position 10 here, we see that there's always a G in that last position. I want you to imagine that this alignment is now flipped onto its side. So we've got positions 1 to 10 going across in the alignment, positions 1 to 10 going down in the matrix. So if we have a sequence of interest and we see a G in the 10th position of that sequence, we just find the G here. We find the corresponding G there. Find the value at the intersection. And in this case, we're going to assess 150 points for that exact match. In the next-to-last position, we've got again a proline in most places but the occasional threonine. So if we do the same thing and look for the intersection of the proline in the consensus sequence versus the proline in the sequence of interest, here we've got in this case 89 points. So it's the same type of thought pattern that goes behind the BLOSUM matrices: you give the most points for an exact match, and you still assess a positive value for conservative substitutions, just not as many points as you would have for the exact match. Finally, if you have a residue in your sequence of interest that doesn't match the consensus at all, you'll start to see negative values in the table. In the positions where there's really a lot of variability, you'll see the positive values go down. So again, it's a way to represent this multiple sequence alignment as a series of numbers.
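The lookup just described (find the residue, find the position, read the value at the intersection, add it up) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-position matrix below is hypothetical, the 150 and 89 simply echo the two values read off the slide, every other number and the `DEFAULT_PENALTY` are invented, and the names `score_window` and `best_hit` are made up for this example.

```python
# Hypothetical 3-position scoring matrix: one dict per alignment position,
# mapping residue -> score. Only 150 (G) and 89 (P) come from the lecture.
pssm = [
    {"A": 40, "S": 25, "T": 10, "P": -20, "G": -30},
    {"P": 89, "T": 30, "S": -5, "A": -10, "G": -40},    # mostly proline, sometimes threonine
    {"G": 150, "A": -60, "S": -70, "T": -75, "P": -80}, # always glycine
]
DEFAULT_PENALTY = -50  # assumed score for any residue not listed at a position

def score_window(window, pssm):
    """Sum the position-specific scores for a window the same width as the matrix."""
    return sum(col.get(aa, DEFAULT_PENALTY) for aa, col in zip(window, pssm))

def best_hit(sequence, pssm):
    """Slide the matrix along the sequence; return (best score, start index) or None."""
    width = len(pssm)
    if len(sequence) < width:
        return None
    return max(
        (score_window(sequence[i:i + width], pssm), i)
        for i in range(len(sequence) - width + 1)
    )

print(score_window("APG", pssm))  # exact match at every position
print(score_window("ATG", pssm))  # conservative substitution in the middle
print(best_hit("GGAPGA", pssm))
```

Comparing `"APG"` with `"ATG"` shows the point made above: the threonine still earns a positive score at the middle position, just a smaller one than the proline. Real searches also handle gaps and statistics, which this sketch ignores.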
Again, you don't have to calculate these by yourselves, but you can now use these to take a single sequence and compare it to hundreds of these position-specific scoring tables. The other definition for this morning is something called a pattern. So let's go right to the example. We've got this very interesting nomenclature up here on the screen, and we'll just take it position by position. So in the first position here, you see two letters with square brackets around them. When you see the square brackets, it means one or the other, one of the things in the square brackets. So in this case, we have a pattern that begins with a phenylalanine or a tyrosine in the first position. Where you see the x, that means any amino acid can occupy the second position. We just have a C here all by itself, so that means you must have a cysteine in the third position. Then x followed by a number; that just means two of those, so any two amino acids. The curly brackets, as opposed to the square brackets that mean one of the things in the brackets, mean anything but what's in the brackets. So not valine or alanine. Again, we have an open position that can be anything. And finally, three histidines at the end. So the difference between the profiles and the patterns is that the profiles give you information both on position and frequency, where the patterns only give you information about position. So this just becomes a simple pattern-matching exercise to find family members or assign a new protein of unknown function to a particular protein family. Both of these are incredibly useful ways of approaching the problem of assigning characteristics to an unknown protein. So let's go ahead and put these into practice by talking about the first method, the first database, and that's called Pfam, protein families. What Pfam contains is a large collection of multiple sequence alignments based on protein domains and conserved protein regions.
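Stepping back to the pattern notation for a moment: because a pattern carries only positional information, it maps almost directly onto a regular expression. The Python sketch below assumes the pattern on the slide would be written `[FY]-x-C-x(2)-{VA}-x-H(3)`, which is a reconstruction from the spoken description rather than a quoted database entry. The helper `prosite_to_regex` is a made-up name, and it handles only the elements described above (square brackets, `x`, curly brackets, and a single repeat count), not the full pattern syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple dash-separated pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional repeat count, e.g. "x(2)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # anything but the listed residues
        # "[FY]" and single letters such as "C" are already valid regex as-is.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

regex = prosite_to_regex("[FY]-x-C-x(2)-{VA}-x-H(3)")
print(regex)

# Two invented test sequences: the second has a valine at the "not V or A" position.
print(bool(re.search(regex, "MKYACDELQHHHK")))
print(bool(re.search(regex, "MKYACDEVQHHHK")))
```

Note that the match is all-or-nothing: unlike a profile, the pattern gives no partial credit for a near miss.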
And because those regions are conserved, they more than likely have some sort of structural or functional importance. When you look at one of these Pfam entries, you'll see some very good information. You'll obviously see the multiple sequence alignment that was used to derive that matrix. You'll see protein domain architectures, and you'll get an idea of the species distribution, of where that particular protein family can be identified, so whether it's in certain species and not in other species. You'll also see information on known protein structures and links to other relevant databases. As we go through all of the things I'm going to show you this morning, you'll see references along the bottom, and I would encourage you to pull those references to get a more in-depth view of each of the things we're going to discuss this morning. So there are actually two databases within Pfam, Pfam-A and Pfam-B. Pfam-A is the better of the two because its entries are based on curated multiple sequence alignments. So someone has collected information based on their biological knowledge, created a seed alignment, and then has used that alignment to gather in the rest of the members of the protein family. The methodology used to do that relies on a program called HMMER that Sean Eddy at Janelia developed many years ago. And because of the way this construction is done, because of the curated nature of the method, the hits are likely to be true positives. There's also something called Pfam-B. The difference between the two is that, unlike A, which is based on curated alignments, B is based on automatically generated alignments from database searches. So those are deemed to be of lower quality, but you can certainly take a look at those if you don't find anything matching your sequence of interest in the Pfam-A set. The old documentation always compared Pfam-A to Pfam-B in this way. They would say that with Pfam-A, you have a handcrafted multiple sequence alignment.
So you should think of that as like a nicely handcrafted beer. You know that it has much more care given to it, and by virtue of that, it's of much higher quality. As with last week, all of the examples I'm going to take you through this morning, you can do yourself when you get back to your labs. So this is the URL for the webpage that has all of the sequences. Again, I would very much encourage you to do this when you get back, because it's one thing to watch me do it. It's another thing to actually have your own hands on the keyboard, going through the exercises. Okay, so we'll go to the Pfam website. Here is the URL over at Sanger, and it's a very simple homepage. You can do a sequence search; we're going to do that in a moment. But you can also look at members of a family, view a clan, which is just a group of related families, look at structures, and so on. So we're going to go ahead and just do a basic sequence search. Clicking on those words pops up a box, nothing else. Okay, as with the examples we worked last week, there are always options we can set and default values we can change. The way you do that here is that there is a very easy-to-miss hyperlink right here that says "here". Okay, so we're going to go there and take a look at what we get. Not a lot of options, but a couple more things you can do. So we're going to go ahead and put our query sequence in the box from the example webpage. You can run the search using an E-value cutoff. So here the E-value is 1. The rules that I gave you last week for BLAST govern here as well. Okay, so we're really looking for things that are at 10 to the minus 3 or better at the protein level. But just for the sake of the example, we're going to leave that at 1 for now. The gathering threshold really applies to a cutoff that was used by the curator to develop this particular profile, so you can safely ignore that.
If you want, you can search the Pfam-B entries by just checking the box, but we're going to leave that the way it is right now. Okay, so we'll go ahead and submit the search, and this is what we get back. So just a quick overview. This represents your sequence of interest. There's a very light gray bar in the background, and you'll see that it identified in this case a P450 domain, but not all of one. You'll see that this side is rounded and this side is jagged. So it means you've got part of it, but not all of it. If we look at the significant matches, cytochrome P450 was found from position 41 to 500 in our sequence of interest. It matched the profile from 1 to 457. The reason this is in red is to communicate to you that it didn't match the entire length of the motif, just part of it. Below, you'll see what's marked as insignificant Pfam-A matches. So two hits come up. You do want to take a look at those even though they don't meet our statistical criteria. The E-values are here, but you might have, again, biological information from research you're doing in the lab, or from things you've read in the literature, that might justify giving these some additional consideration. So now let's just concentrate on that P450 hit. If we now click on the view link from the previous page, we get an expanded view here and a lot of text. So let's just focus on this part of the slide. The first row is your sequence of interest. It's just marked SEQ. The second row is the consensus sequence that's based on that position-specific scoring matrix. So any place you see an exact match, you'll see the letter repeated. Any place where there's no match to the consensus, you'll actually see a plus sign in this example. I'm sorry, the plus sign is indicating where you have a conservative substitution. On the PP line, the posterior probability, you will see numbers here. The numbers indicate to you how good the match is at each position, and the key is up here above.
So it tells you that if you see a zero, you've got 0 to 5 percent confidence. One means 5 to 15 percent, and so on. So you really want to see as high a number as possible in those positions. The star means that it's the best you can actually get. In most of our positions here, we have asterisks going all the way across. So now we can go directly to some more information about this P450 domain just by clicking on that particular link. If we go to the Pfam tab at the top, we get a nice executive summary. If you knew nothing about what cytochrome P450s were, right there you get a quick summary of what they are, along with some of the most important literature references that you should be reading before you even start doing any work on this particular protein family. You also get a representative structure on the side here. A different one will come up every time, and you can go through them either by using the pull-down menu or just by clicking on "view a different structure". It'll just take you through some of them. Now if we look at the top, it tells us how many hits are actually found in this particular protein family, how many members there actually are. So we find that there are 282 different architectures comprising just shy of 40,000 sequences. So let's go ahead and take a look at some of those things. We're going to go down the left sidebar. We were at Summary. We're going to Domain Organization. And so we see here all of these schematics, all these graphics, showing different combinations of protein domains that are found in subsets of those 39,592 sequences. So in the first case it says that 33,600-some-odd of them have this kind of orientation, this kind of domain structure, where you've got a P450 domain and another domain in front of it that's represented by this little orange box over here. If you wanted to see all of those sequences, you would just click on the show link and it would give you all of those right away.
So sometimes when you don't see absolute sequence similarity, it's good to look at things at the domain level, because you can start to infer function just by asking: do I have a similar domain architecture to the sequence that I'm starting with? From a clinical point of view, it's important to think in these terms because you can look at this and start to ask yourselves: do I have a mutation, a clinically relevant mutation, that maybe knocks out one of these domains, so that the domain no longer forms and you have either a structural consequence or a functional consequence because of that disease mutation? Okay, 39,592 sequences; let's take a look at a subset of those. We now go to the next link down, Alignments, and on this page you can take a look at the actual alignment in a number of different ways. Every place where there's an alignment available, you'll see a check mark, and if you were to take your cursor and mouse over that particular check mark, it would switch over to reading "view", so it'll just change into a hyperlink. The one that I've moused over here is under the seed sequences. So again, the seed alignment is what that expert put together, the known members of the family, as the seed for bringing in the rest of the members of this particular family. If you go ahead and click on that, it will bring up a viewer called Jalview. This is within the context of the web browser; we're going to see it later on in a more interactive fashion.
The colors do actually mean something here. For the first set of colors, the key is on the right-hand side. The coloration gives you an indication of the physicochemical characteristics of the residues in those particular positions, things that are more commonly going to be able to substitute for one another, based again on the entire constellation of proteins and what we know about substitution patterns. So things that are either small or hydrophobic, positions where you see charged amino acids, histidines, tyrosines, and so on. Occasionally you will see rows that are marked with an SS in parentheses, and the SS stands for secondary structure. These are places where there is known secondary structural information, so the positions of the alpha helices or the positions of the beta strands are known. By virtue of that, you get a much better alignment, because you can use that information to better predict what is supposed to be where. So it really goes a long way in helping these alignments be much more robust, and we'll talk about that again towards the tail end of the lecture. The key for that is also here. In the rows where we have that SS designation, any place you see the red blocks, those are the alpha helices; any place you see the yellow blocks, those are the beta strands. Let's go back to the Pfam page and take a species-centric view of this information. This very colorful representation here is supposed to represent all of the organisms on the planet. You see the color assignments on this side of the slide here. If you look very closely, you can see that the metazoa are here, followed by the chordates, followed by the mammals going there, and you can just take your cursor and click on those particular regions to focus in on those particular taxa. What I've highlighted here is an organism called Nematostella. This comes from the phylum Cnidaria. These are organisms that are unified by the fact that they have stinging cells
(cnidocytes) that they use to capture their prey. So this includes hydra and jellyfish and corals and sea anemones. By highlighting that box, it also shows me the lineage up here that gets you from the root to that particular organism, and you can very quickly download all of the sequences from this protein family from that particular organism with a single click of a button. So it's a nice tool to have if you're looking at a particular organism, or a particular class of organisms, a phylum, whatever, and you want to quickly pull all of that information together rather than having to generate those sequence files one by one. We'll pretend we've gone back to the summary page here, and now we're going to scroll down that page to where we have a bunch of external database links. The one I'm going to take you to is the one right in the middle, called PROSITE, and this is where we're going to see the first example of a pattern, the designation that we talked about earlier, where we know what the positional information is for each member of a protein family. If we look at the big box at the bottom, the very first item in that box gives us the consensus pattern for this particular cytochrome P450 family. You have links to other databases as well, but the most important thing, I think, on this page is the part that says "expert to contact by email". Okay, so these are folks who keep these entries up to date and have agreed to serve as a reference point if you have any questions about this particular protein family. You would be somewhat foolish not to take advantage of this kind of resource, because this person is making themselves available to you. If you're studying cytochrome P450 and you have a question that has befuddled you, and you can't find the answer in the literature, click on the link, send them the email, and they'll get back to you. So it's a good way to actually get to a human resource rather than just looking at all of the
database entries. So the next thing we're going to talk about is related: the Conserved Domain Database, or CDD. What we're going to use CDD for, same sort of goal here, is to look for conserved domains in a protein sequence. The main difference between Pfam and CDD is that the calculation of the CDD entries incorporates three-dimensional structural information. By doing that, again, it helps you to define domain boundaries, and it gives you a really powerful tool for refining the alignments. The data comes from a number of sources, including Pfam-A but not Pfam-B, and a bunch of other databases that are focused on collecting information on groups of orthologous proteins. The way the CDD searches are conducted is through a variant of BLAST called RPS-BLAST, reverse position-specific BLAST. Basically, it's just a glorified way of saying we're taking a single sequence, our query sequence, and we're going to search that against a bunch of pre-calculated position-specific scoring matrices, the same way we did with Pfam-A. The important thing to keep in mind is that the method used by CDD is not the same method that is used by Pfam. Okay, so similar to the advice I gave you last week about using multiple scoring matrices when you do your BLAST searches, when you're doing this kind of search, looking for protein domains, you should use both of these methods, just to hedge your bets. Okay, so here we are at the CDD home page, URL again up in the corner. We're going to take a sequence from our sequence webpage. In this case, this is the sequence of the deleted in colorectal carcinoma gene, DCC, the protein that's encoded by that gene from human. We're going to leave all of the options exactly the way they are, but if you wanted to search only one of the databases from the list that I showed you earlier, you could select just a single one. But we're going to just throw everything at it, go ahead and click
on the submit button, and off we go. So here we get a display. The top part is very reminiscent of what we saw last week with our BLAST searches. Your query sequence is represented by the gray bar going across the top, position 1 to position 1447. Every place that a domain or a motif has been found is marked in relation to the coordinates on this line, so there's a one-to-one correspondence between all of these shapes and each one of these lines in the table. We're going to focus in on the very first one, this neogenin domain right here at the C-terminal end of our DCC protein. In order to learn a little bit more about that, there's just a plus sign over there. We're going to go ahead and click on that. It'll expand that part of the page, and we get a very quick, again, executive summary of what this actually is. You find out that it's an immunoglobulin-type domain and how it is actually composed. The alignment shows you your sequence, the top line, aligned with the CDD domain matrix, the second line. Every place you see a red letter, that's an exact match; every place you see a blue letter, that is a mismatch. Okay, we can get some more information about this particular domain by looking at this link; it's very hard to see here, in the blue, following the name of the domain. If we were to click on that, it would take us to the domain page, with again the same information we saw before. The most important part of this page, to me, is the stuff that's all the way at the bottom. So we're going to pretend that we scroll down, and here we have the sequence alignment. You can see which blocks of this particular protein domain are conserved across all of the members of this particular family. You can actually download the sequences from here, and this now puts you in a position of not even having to construct the multiple sequence alignment. We're going to talk a lot about that today, but you now have the tool in your hand to be able to
start doing multiple sequence alignments, and we're going to talk about a lot of that later in today's lecture. So again, it's a good way to be able to pull homologs together without actually having to go through the routines that we talked about last week. So again, something else for your armamentarium. Okay, so we come back here. We've talked about the one-against-many techniques. Now let's flip it the other way and talk about many against one, or many against many, depending on how you want to look at it. The first technique we're going to talk about is another variant of BLAST called PSI-BLAST, and the PSI stands for position-specific iterated. What this allows you to do is to start to identify distantly related members of a protein family that, again, you might not have found in the context of a regular BLAST search. The BLAST searches start to fail around the 25% mark. This is the reason I gave you that cutoff criterion last week for protein-protein searches: one of the criteria you're going to apply is to look for at least 25% sequence identity, because as you get below that, you really can't be too sure that those things are real matches; they might actually be false positives. So this is another thing you can throw at it to maybe push the boundaries a little bit. This is a really nice version of a profile-based search, and the way it works is like this. In the first step, you do a regular BLAST search. So we're going to start with a protein sequence and we're going to search that protein sequence against a database of our choice. What will then happen is we'll get back our hit list, the same way we did last week. But based on what's in that hit list, we're going to take all of those hits, take all of those sequences, align them, make a position-specific scoring matrix, and then that matrix becomes the input to the next round of searches. So this is now an iterative technique, where each time you go around you're going to
gather in more and more sequences. You're going to keep adding them to your alignment, and you're going to keep recalculating the matrix (well, you're not; it does it for you). By doing that, we're going to just keep increasing the resolving power of that position-specific scoring matrix in finding new members of the protein family. All right, so in practice, here we are at the BLAST home page. This hopefully looks familiar to you now. Since this is a BLAST-based search, we're just going to go to the protein BLAST section of the basic BLAST collection that we have. This is again the BLAST search page. We put in a sequence here; in this case, this is a high mobility group protein from human, so this is a DNA-binding protein. As before, we can select which database we want to search, so we have a pull-down here, and this time around I've picked something called Swiss-Prot. So let's talk about that for a minute. What Swiss-Prot tries to do is something very similar to RefSeq, which we talked about last week. You'll recall that RefSeq is an effort by NCBI to represent each and every sequence in the central dogma once and only once, so it would have a single entry for each DNA, RNA, and protein sequence, non-redundant. Swiss-Prot is a similar effort, a long-standing effort that was started many, many years ago by the folks at the Swiss Institute of Bioinformatics, only on the protein side of the house, but pretty similar in what its end goals actually are. So by definition the database is non-redundant, like RefSeq. There's ongoing curation by the staff at EBI, the European Bioinformatics Institute, but also by external experts, people similar to the ones that you saw in the PROSITE entry, who take it upon themselves to make sure that these entries are kept up to date: the folks who are actually working in the field, doing experiments, having their hands in it every single day. What you'll see in those Swiss-Prot
entries is that you'll have some extra fields there, including comment lines, and I always point this out to people because, again, that's the nice executive summary that comes to you from the expert in the field. So again, you'd be foolish not to make use of that information, because the expert has put in there what they think is most important for you to know about that particular sequence. The way you know that a sequence is a Swiss-Prot sequence goes back to the accession number, and again, the accession number is the unique identifier, the social security number, for a particular sequence. So here is how the representation is, and you already know what the square brackets mean: it's one of these letters. So it either starts with an O or a P or a Q and then has a five-digit number following. Okay, all right, so we come back to our page here. What I've also done in this case is I've limited the organism selection here to just the vertebrates. So remember, you always have the ability to search subsets of the database based on what organismal spread you're interested in. The program selection is in the red box. What we were doing all of last week was the BLASTP piece; this time we're going to pick the second one down, PSI-BLAST. As usual, we're going to take a look at the algorithm parameters to maybe tweak things to our best advantage. Here the E-value threshold by default is 10; we're going to change it to 0.001, per our guidelines from last week. We can again filter out our low-complexity regions; the low-complexity regions are those homopolymeric runs, places of biased composition in our sequences, that could confound the BLAST algorithm. We're going to set a second E-value threshold here, a PSI-BLAST threshold, also to 0.001; you'll see the default here. Again, what I like to do is always check this box here at the bottom so that my results come up in a new window. So if I ever have to come back to this search page, it's still there; I don't have to keep hitting the back button to get here,
and then we're just going to go ahead and BLAST it. All right, so hopefully this looks more familiar to you now, given last week's lecture. Exact same format. So we have our query sequence, 1 to 215. Two HMG box domains have been identified here, so two putative DNA-binding regions, and then our overview of the BLAST results by score. We're going to just scroll right down to the hit list. Again, we have our list of alignments here: the description, the definition line, the very short one-line identifier of what each one of these sequences actually is, and our score values. Remember, the score values are important, but we're not going to look at those. Instead, we're going to look at the E-values, which are calculated from the score values, the great neutralizer that allows us to compare any one BLAST search to any other BLAST search. Okay, and these are now ranked from lowest to highest. You'll see this time around two extra columns, and the one that's highlighted says "select for PSI-BLAST", and all of the boxes are checked off. So you'll remember how this works. We start with a sequence; the first round is just a regular BLASTP search. We now have to get around to the business of specifying what is going to go into the next round, what is going to be used for calculating that matrix, which will now be the query for the next round of searching. So we're going to just go ahead and include everything; that's what all the check boxes are there for. They're automatically checked by default. Now in this case, when we look at the E-value column, we're still comfortably meeting the rule that I gave you last week, the 1 times 10 to the minus 3. But let's say we weren't, right? You would see everything above the line checked off and everything below the line not checked off. But again, this is where your biological knowledge really comes into play. You may have information that says something below the line should be included, and perhaps that something above the
line should not. So this is where you have to look at this, check off some additional boxes, and uncheck some of the other boxes if you see something that you know is not correct, that you know is actually a false positive. All right. To now do the next round, all you have to do is click on that Go button. It will calculate the matrix based on all of these sequences. And so here we now have the results page again, but now we've got a new column here, "Used to build PSSM." This just tells us what was used in the last round to get us here. If we scroll down, we're going to see a bunch of yellow lines. There's nothing checked there because these are the ones that were just found. So these are all the new sequences that we found by virtue of having all of this information from the protein family collectively, so we were able to tease out additional sequences here. Most of these are for something called the sex-determining protein, so this is TDY or SRY. These are members of this family, so we're going to go ahead and include these in the next round of searches. And we're going to just go around and around as many times as it takes: two, three, four, and in this case, five, where we just keep hitting that Go button at the bottom. You'll know you're done when it says no new sequences were found above the 0.001 threshold.
So we started in round one with our regular blastp search, and we found 77 sequences. By going through this process, we ended up with an extra 100 sequences. So that's a pretty substantial number, things that we would not have found using just the traditional BLAST search, or even just by using the default values. This is why it's important to know what those defaults do and to change them when necessary. So I think this demonstrates very well how you can use the collective characteristics of a protein family to really find these distantly related members, again, things you wouldn't find otherwise. Always, always, always, please check the
statistics, especially in the percent identity column, because in the criteria that the method applies, we've specified E values, but we have not specified a percent identity cutoff. So you need to do that by eye. If you see things that are too low, just take them out.
There's a newer variant of this, another member of the BLAST family, called DELTA-BLAST. This method is slightly different from what's used by PSI-BLAST, but not grossly different, so it's pretty easy to understand. Instead of doing a blastp search in the first round, with the sequence of interest against, in our example, Swiss-Prot, it's going to take the sequence of interest and search it against all of the entries in CDD, the database that we talked about earlier. It's going to take the profile information that it finds, compute a new position-specific scoring matrix, and now use that as the input for round one. It's an iterative process, just like PSI-BLAST. In round one, it's searching against CDD; starting in round two, it's just successive PSI-BLAST searches. So from that point on, it's exactly the same; the only difference is how the first round is actually calculated. What the authors set out to do with this particular method is really come up with a more robust way to find homologs, and you end up getting some really nice, high-quality alignments even when you're starting to push the boundaries of sequence similarity, where things are getting down right around that 25% twilight zone. The only caution I have to give you here is that while the methodology is very robust, it really depends on having entries in CDD. All of the protein families may not necessarily be represented there; that information on homologous proteins just might not be there. I think currently about 75% of all known protein families are catalogued within CDD, at least based on their documentation. So you just need to keep in mind that you may not be looking at everything that's available to you. So this is yet another case where you've got
multiple methods available to you; absolutely do both. I would encourage you to pull this paper, because it shows you some performance data, the usual Venn diagram representations where you see how many were found by one method, how many were found by the other, and then how many are in the overlap. There are quite a few that fall on either side. It really would behoove you to use both of the two methods, and the authors recommend that as well. Okay.
All right, so that's it as far as individual sequences go. We spent most of the last two weeks talking about this, so now let's take this the next step and think about multiple sequence alignments, which are, in essence, just a whole bunch of pairwise sequence alignments. So what can you learn by doing these multiple sequence alignments? As we've talked about many times, you can start to pick out conserved regions, patterns, and domains. Why is that of use to you? It can start to help inform you about how to design your experiments: where are the regions of a particular protein that you should be looking at, and where are the things that might be a little bit less important with respect to structure and function? Again, as we've harped on quite a bit, it helps you pull in new members of protein families as well. If you don't have these, it's very hard to do everything that's in the bottom part of the slide. When we start talking about predicting secondary structure, yes, you can do this on single sequences, but it's much more powerful when you look at a family of sequences. You cannot perform any sort of phylogenetic analysis unless you have one of these things, so even the idea of looking at evolutionary relationships becomes impossible without having these multiple sequence alignments in hand. And again, as we've talked about, it gives us the input to many of these search methods. Okay, so what do we need to think about when we are constructing one of these multiple sequence alignments? These are themes that
we've hit on several times already in the last two weeks. We want to keep in mind positions in our multiple sequence alignment, so just the columns of the alignment, where we can have absolute sequence similarity. We're going to try to align as many of the common characters as possible from sequence to sequence to sequence. By doing so, hopefully we can start to think about conservation. We'll start to see patterns where maybe there are only one or two or three amino acids in a particular column; that gives us an indication of where you can have conservative substitutions that don't adversely affect the function or the structure of the protein. Finally, it gives us some idea about structural similarity. Again, as I've alluded to several times now, if we have secondary structural information, or information from either NMR structures or X-ray structures, we can use that information to fine-tune the alignment. I know many of you, probably all of you, at some point have tried to put together a multiple sequence alignment, so we're going to go through some general guidelines to help you bring up your game a couple of notches.
All right, so the first thing is, when we do these, we tend to concentrate on the protein level rather than on the nucleotide level, and hopefully it's become a little bit more obvious to you as to why, as we've gone through a lot of the considerations we've talked about. When we look at these protein alignments, they tend to be more informative, quite simply because you have 20 amino acids that you're trying to align on the protein side versus the four nucleotides that you are trying to align on the nucleotide side. And so it makes for more accurate alignments; it goes back to the side-chain chemistries of each of the amino acids and all of the physicochemical properties that come along with that. Now, of course, there are going to be cases where you can't do that. Let's say you're going to be looking at regulatory elements, things like that; you by
necessity have to work with nucleotide sequences, and Laura Elnitski, in two weeks, is going to spend much more time talking about exactly that. All right, so this slide is just intended to give you the 30,000-foot overview of what you have to think about as you do one of these, and we're going to tick each one of these off in order, to give you a better appreciation for how to approach this question in the future.
So you want to put together a multiple sequence alignment; you need sequences. So how do you actually pick them? First is how many, and you want to keep that number relatively reasonable. The reason for that is that the algorithms that compute these multiple sequence alignments are global alignment methods. We talked about the difference between global and local last week; with global alignments, you're trying to align the entire sequence along its entire length with every other sequence in the set. Because of that, this is computationally intensive; it takes a long time. When you start to put too many sequences in your input set, the methods just start to choke, and you might start getting inaccurate alignments. The methods are getting much better with time, and they're able to handle more and more sequences in the input sets, but it's something to be mindful of. The other thing is more of a practical issue: if this is the first step to doing a phylogenetic analysis, you have to keep in mind the next step in the process, that this is now going to produce the input for the program I'm going to use to generate my trees. The same considerations apply there. The more that you have in that input set, the harder it is to actually generate reasonable trees that can be interpreted, so it becomes almost impossible when the data sets get too big. So a good starting point is 10 to 15 sequences. I'm saying the ballpark upper limit here is 50 to 100. I know that's a very wide range, and it really just depends on how closely related or distantly
related each one of those proteins in the set actually is. All right, the sequences should be about the same length, and the way you get them there sometimes is to just trim them down. We talked about domain architectures earlier, we've seen results from our BLAST searches, and we've seen things in CDD that show us where the conserved blocks are. You can use that information, from BLAST, PSI-BLAST, or whichever method you want to use, to now trim back your sequences to have a more tidy input set going into the multiple sequence alignment program. Normally, what will happen is the action tends to be in the middle. You can safely get rid of the N-terminal ends, safely get rid of the C-terminal ends, and come up with the alignment that you want, unless you already know that there's some business going on in either the N-terminal or C-terminal ends. Okay.
All right, the degree of sequence similarity. It depends on what question you want to ask. If you're looking for required or highly conserved amino acids, you want to push towards having much more closely related sequences. If you want to go the other way and study evolutionary questions, you want to use more divergent sequences. So a good place to start, I think, is sequences that are about 30 to 70% similar, because that's going to hedge your bets on both of these things. But I think the last bullet really says it best: the most informative alignments you can actually get are when you have sequences that are not too similar but not too dissimilar. If they're too similar, well, you're not going to learn anything you didn't already know, and if they're too dissimilar, you're not going to be able to align them. So that's how you put the set together. We're going to put it into a program and we get back some results. You have to inspect those, and this becomes an iterative process. What I normally recommend people do is start off their alignments using a small set of sequences, go on the
lower ends of the ranges I gave you earlier, and go ahead and just do a preliminary alignment. Look, just by eye, at how many residues are conserved across the alignment. Do the alignments actually make sense? Do you see a neat block-type structure? And have you been able to get away with very few gaps being included? Remember that the gaps represent biological events; those are either insertions or deletions. So you can't just keep inserting gaps without regard to biology; it's going to end up messing up the alignment and messing up the interpretations. If you've done this on a small number and the alignment's good, fine: add a few more sequences, do it again, and you can just keep going around like that. If you find out the alignment's not so great, take out the sequences where you see tons of gaps being included, then realign. So this is not just a one-shot deal. To do it well, you have to go through a few rounds of fiddling with the input set to get a good alignment. Once you have it, you can use some visualization tools that we have to look for key residues and problem regions. What I always think is a good idea is to cross-check against these expertly created multiple sequence alignments, and we've seen examples of those through our travels over the last two weeks, so you know where to go to look for them. Again, if there's structural information that you have, absolutely use it, because it's going to tell you really definitively where you can put in a gap and where you cannot tolerate a gap. If you have an alpha-helical region, no gaps. If you see a beta-strand region, no gaps. In the loops, fine, you can have them there. Okay.
All right, interpretation. So now you've got it, and it's to your satisfaction. What does the alignment actually mean? What can you glean from it? Any time you see an absolutely conserved position, you know that it's required for either proper structure or function, so it's been conserved for a reason. If you see a relatively well
conserved position, you can tolerate limited amounts of change; you can have a little bit of wiggle room there and not adversely affect the structure or the function of the protein. Now, I think this is where most people stop considering things, since they're looking for what is in common. But I think there's a lot to be learned from looking at what is not in common, the non-conserved positions where you can actually mutate freely, because this is really where the evolutionary innovation goes on, where you could possibly give rise to proteins with new functions. And finally, we've already talked about where the gaps can actually be tolerated and what that might indicate.
So let's talk about a particular program, the one that's probably most commonly used to do multiple sequence alignments, and that's called Clustal Omega. As you would expect from a multiple sequence alignment program, you'll get out a multiple sequence alignment, whether you start with nucleotide or amino acid sequences. It's really fast in the calculations and, as you'll see, it's really easy to use. If you have structural information about where the gaps can go and where the gaps shouldn't go, you can provide it in advance, so it knows not to put them in certain places. And we'll see this in the context of a Java applet where we can actually manipulate the alignment. Okay.
Before we actually get to the program, a little bit of theory. This works using a method called a progressive alignment method. Let's say we have a sequence set with a whole bunch of sequences in it. Rather than trying to align all of them with each other, all at the same time, we have to do it in a little bit more rational fashion. The way the methodology does it is by just taking two sequences at a time, starting with the two most related sequences, and then building up the alignment from there. So it'll start with two, and it'll keep adding and adding and adding to bring in the ones that are a little bit less related
to the ones that are in the rest of the set. So we're going to start in the places where the alignment is actually easy and move to the parts where it's a little bit harder because the sequences are more divergent. I think an example helps here. We're going to start with four sequences, A, B, C, and D. The first thing we're going to do is calculate a similarity score of each one of those to each other sequence in the set. So that's the equation for how many alignments you get. Here we have four sequences, which means we have to calculate six alignments. This is just to illustrate why, if your sequence set is too large, the number of calculations starts to go up really, really fast, so it becomes very computationally difficult for the algorithms. So here are our four sequences again, and we're just going to put this little scoring matrix together to show us how identical each one of these is to every other one in the set. So four sequences, six alignments from the previous slide: one, two, three, four, five, six. By looking off the diagonal, which holds the exact matches, we see the two most related pairs are A with B at 80% and C with D at 92%. So in the first step, we're going to just take A and align it with B, and we're going to take C and align it with D. Once we do that, we're going to take this alignment and fix it, take this alignment and fix it, and now represent AB and CD as single sequences. So we've basically gone from four to two. We're going to take those new, in quotation marks, single sequences, align those, and keep doing that as many times as we have to in order to keep adding the remaining members of the set. Here we only had four, so we're already done, but you get the idea: you're starting at the tips, merging each new pair of sequences as you go towards the root of this tree. Okay.
So why is this good, especially for large sets? You do the easy ones first because, as you already know, those are the simplest to align. But more importantly, you're going to use the information from those easy alignments to
help you align the more difficult ones, the ones that are more distantly related. As you start to pile up these alignments, you're going to get more information, the same way that we did by looking at our position-specific scoring matrices, and that will help us later on in the process to get those tough ones into position. Disadvantages: there's a directionality to this process, and we are fixing these new alignments as we go. If we make an error early on, that error is propagated all the way through. So if that alignment is incorrect in a particular position or several positions, it's just going to stay fixed through the entire rest of the process, and you're going to end up with a possibly bad alignment. Now, the good thing about Clustal and a lot of the other methods is that they've now introduced ways for iterations to take place. So rather than just doing the alignment process once, it'll do it multiple times to make sure that you don't get caught in what's essentially a local minimum. It's going to cost you some increased compute time, but it's well worth doing. Okay.
Now, obviously there are other theoretical underpinnings for different types of multiple sequence alignment methods, and this is really all you need to know about this one. If you're going to use something different, please just take the time to understand how the method works, because you don't want to end up with a bad alignment because you haven't selected the sequences properly or you haven't taken into account some of the fine considerations of how it's done. The last thing you want to do is be in a position where you're going to start to draw improper biological conclusions.
All right, so we're back to the program. What do we actually get? As one would expect, you get a multiple sequence alignment and those pairwise alignment scores that you saw previously when we talked about the progressive alignment. You also get something called a cladogram and something called a phylogram.
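The progressive strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Clustal Omega's actual implementation: it counts the pairwise comparisons for a set of n sequences and then greedily merges the two most similar clusters, using the average score between their members. The 80% (A with B) and 92% (C with D) values come from the example; the other four scores, and the function names, are hypothetical.

```python
from itertools import combinations


def pairwise_alignment_count(n):
    """Number of pairwise comparisons needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def merge_order(similarity):
    """Greedy merge order: repeatedly join the two most similar clusters.

    `similarity` maps frozenset({x, y}) to a percent identity. Similarity
    between clusters is the average over all cross-cluster pairs (an
    assumption made for this sketch).
    """
    names = sorted({name for pair in similarity for name in pair})
    clusters = [(name,) for name in names]

    def average(c1, c2):
        scores = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    order = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        order.append((c1, c2))
        # Replace the two merged clusters with one combined cluster.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return order


similarity = {
    frozenset("AB"): 80, frozenset("CD"): 92,  # values from the example
    frozenset("AC"): 55, frozenset("AD"): 52,  # hypothetical values
    frozenset("BC"): 58, frozenset("BD"): 50,  # hypothetical values
}
```

With these scores, `merge_order(similarity)` joins C with D, then A with B, and finally the two resulting pairs. In the lecture, both pairs are aligned in the same first step; the sketch simply visits them one after the other. The count function shows how quickly the work grows: 4 sequences need 6 comparisons, while 100 sequences need 4,950.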
The cladogram is a tree representing your sequences that is an estimate of the phylogeny. It gives you a branch structure, but all of the branches are of equal length, so you don't have a sense of evolutionary time here. It gives you a first glance at common ancestry, but again, we don't have a sense of how much time has elapsed that separates each one of the taxa on the tree. The phylogram's a little bit better. Again, it's an estimate of the phylogeny, but the branches are not of equal length; the branch lengths are each proportional to the amount of inferred evolutionary change.
When we look at the alignments, we already saw a first glimpse of what it tries to align when we saw Jalview in the context of Pfam earlier. It tries to keep aromatics together, basic and acidic side chains, catalytic sites, and so on. You'll see that the interpretation of these is empirical; there are no rules I can give you here like I gave you for BLAST previously. We're going to just take a look at it and see how good it is. But we do at least have one indicator that's going to appear in the alignment, and that is these symbols. If you have an entirely conserved column, you'll see an asterisk noting that particular position. As a rule of thumb, you want to see that in at least about 10 percent of the positions. If we have a conserved position, a place where the physicochemical properties are strongly similar, you'll see a colon on that position. And if you see a dot, that is what they call semi-conserved, so weakly similar properties, probably encompassing one or more of the classes from the previous slide.
So how do we do it? Here's the URL again; this takes us off to EBI. Here I put in five sequences, in this case all in FASTA format, from our web page that has the sequences for a set of FosB proteins. The input is very easy because it's really just the sequences. We're going to take a look at "More options" just so you can see what's there. Here is where the settings are for the iterations
that I talked about, the things that will make sure that your progressive alignment is less prone to error. We're going to leave those as set. The only thing I've changed here is the output format, to "Clustal with numbers," and that's just going to number the alignment that we're going to look at in a minute. We're going to go ahead and submit that, and this is what we get back. So here's our alignment; our five sequences are aligned. You'll see the numbers along the side, so that's the "Clustal with numbers" bit, so you know where you are in each sequence. The color key matches what's here in the inset, so the reds are the small residues, the blues are the acidic, and so on. Across the bottom, you will see where the absolutely conserved, conserved, and semi-conserved positions are; that's the asterisk, the colon, and the dot, in that order. In places where that's not seen, you just don't have a mark on those columns.
We're going to take a little bit of a different look at this by going up to the top, where it says "Phylogenetic Tree." If we click on that tab, we actually see the cladogram that I was telling you about before, so sort of the first hint at the relatedness of these proteins. You'll notice that the branch lengths are all pretty much the same, so we're not getting any sort of sense of evolutionary time here. If we click on "Real" here, that will now convert this to the phylogram, and you see how the map has actually changed. Now, both of these look very different from what we're going to see at the very end of the lecture, when we actually construct a phylogenetic tree, but it gives you, again, sort of a first glimpse at what we're going to be looking at. To get there, we're going to go to this tab that says "Results Summary." There's a bunch of links to the actual files if you want to download those to take them into another phylogenetic analysis program; they're in the right format that you'll need to go right into those programs. But what we're going to
focus on instead here is this box where it says "Jalview," and so you just click on the button that says "Start Jalview." This will launch a Java applet that we're going to now use to actually take a look at our alignment in a little bit more depth. We can use it to manually edit the alignments. If we don't like how the alignment has been put together, we can actually shove lines back and forth, so we can introduce gaps or we can take gaps out. We can color the residues in a variety of ways, look at pairwise alignments, and do consensus sequence calculations. If we see that we have an over-representation of a particular sequence in the alignment, we can just go ahead and take it out. And the most important thing at the end of the day is calculating the tree.
All right, so we've clicked the Jalview button and the viewer has been launched. You'll see a window that looks like this. This is the default view: our five sequences going across the top, and some indication of the quality of the alignment, position by position, as we go across. Conservation just tells you how good the alignment is at that particular position. It's an indication of percent identity, so the higher the bar, the better the alignment at each position. The quality is another way of looking at that same metric, based on BLOSUM scores. And finally, you have a consensus sequence going across the bottom, and this is again based on the percent identity. Here, the plus sign represents that no consensus residue has been identified for that particular position. We're going to go ahead and change the menu options. If you squint real hard, you'll see that there's a bunch of very small type up here with a bunch of pulldown menus, and that's where we're going to change things, so you'll see the upper part of the slide change. We're going to go to the Color menu, where there is a submenu that's called "Percent Identity." You pick "Percent Identity" and that changes the representation, so everything is now in
different shades of blue. The darkest of the blues are the most conserved regions, going down to the regions that are still just rendered in white. This is a very quick way to identify new motifs, candidate regions that might have structural and functional significance. Here you want to look for blocks that have higher absolute sequence identity, because there's obviously going to be evolutionary pressure to keep these regions conserved. So, going beyond position by position, are there entire blocks that have to stay together? If we want, we could do just a pairwise alignment for any of the sequences in the set. I've just selected the first two, from mouse and human, and so you go to the Calculate menu and pick "Pairwise Alignments," and you would get this alignment. It tells you the percent identity, 95-some-odd percent, and shows the alignment in a format reminiscent of BLAST.
For the final step, we're going to finally generate an evolutionary tree. If we go to this Calculate menu again, pick "Calculate Tree," and from that menu go to neighbor joining using BLOSUM62, we now have what is probably the fastest way to generate an evolutionary tree without having to go to another piece of software. You can see this representation looks very different from what we saw earlier for this same example on the web pages. You can actually put evolutionary distances on these to give you an idea of the branch lengths, but I would use this as a first approximation, because obviously there are methods out there that are much more robust at generating these kinds of trees. But as your first glimpse, this is a great place to start. So what I want to reinforce here is that this is the most commonly used method, and we use it a lot, but it's not the only method. There are a lot of other methods out there; they all have their strengths and weaknesses, and in practice we use different methods to make sure the alignments are as good as we can get. A tool that's going to help you with that is another program that I want
to tell you about, called T-Coffee. The coffee grinder comes from their website. What this actually does is take the multiple sequence alignments a little bit further, because you can combine sequence information, profile information, and structural information all at once, whether you're looking at protein structures or RNA secondary structures. So this really is great for RNA sequences. It has some specialized algorithms that are fine-tuned for specific classes of sequences, so if you're looking at transmembrane proteins or non-coding RNAs and so on, where the rules are a little bit different, this is a good place to look to align those. What I particularly like about this method is that it allows you to take output from other methods, like Clustal and other things that are out there, and combine them into a single master alignment. What that master alignment is going to quickly tell you is where there are inconsistencies between the methods. If there are places, from method to method to method, where you now have an inconsistency, give special attention to those regions; you might have to hand-curate those alignments a little bit to improve upon them. So it's a quick way to zero in. Here is the URL for T-Coffee. The team behind this, in Cedric Notredame's lab, has just published a tutorial, and the reference for that is on the slide, so that will take you through, step by step, how to actually use this method. Okay.
So with that, we end our tour of sequence analysis over these two weeks, and I always feel like it's important to end on this slide, the black box that I alluded to very early on last week. Unfortunately, as we all know, you go to a web page, you see a box, you stick your sequence in, and you hit the button. It becomes a very mechanical approach to sequence analysis where you really don't know what's going on inside the box: put in your sequence, get out your results. Hopefully, what you've started to get a better appreciation for over the last two
weeks is that the inspection is really the most important part of the process, and the only way to be able to do that is to really understand, not at a gross technical level, but to have some understanding of what the methods are actually doing in the background, so that you can take the results from these methods and couple them with your own biological knowledge. Never leave your biological knowledge at the door. It's an important part of this because, remember, these are predictions, and you have real data from the laboratory. Put them all together, and that approach will always, always, always serve you well.
So with that word of advice, I want to just give you one additional resource, which I think is very cool. David Searls, who was at Penn for many years and then went on to SmithKline, recently published this perspective piece in PLOS Computational Biology; the reference is on the slide. Why I love this is because a lot of people don't have access to courses like these, or tutorials, or hands-on workshops, back in their individual institutes. So this is a great way to teach yourselves how to do a lot of the techniques we're going to talk about in this course that we don't necessarily have time to cover. What's very, very nice about this is that David has created five curriculum tracks, and the five are shown on the slide there. So if you're interested in doing analyses, there's a subset of courses he recommends that you should watch online, very much in the Coursera model or the Khan Academy model, where you can just sit at your laptop, at your computer, download some handouts, and learn about all of these techniques. If you're interested in building the tools, there's a bioinformatics tool track, and so on. So this is a must-read; consider it assigned reading. Download this particular paper, and I think you'll find some very, very important and very, very useful things in this manuscript. Okay, so with that, we're going to shift our focus next week back to the level of whole genomes. As I look
over to Dr. Wolfsberg: Tyra's going to take over next week, and we're going to start talking about how we're going to visualize and analyze genome-scale data, identifying important sequence-based features and using that information to its best advantage. So with that, I thank you for your attention. I'll take questions at the podium, and thanks for joining us this morning.