 Okay, good morning everyone, and welcome to week four of our current topics in genome analysis series. This week we're going to continue our discussion of how to effectively and efficiently analyze biological sequence data in a way that can help you advance your own research questions. Before we get going, the obligatory CME disclosure that I have no relevant financial relationships with any commercial interests. Okay, so if we think back to week one of this series, we spent a lot of time talking about BLAST, which allows us to compare individual sequences against a collection of individual sequences in the public databases. And in that discussion we covered a lot of the building blocks that go into doing sequence similarity searches, the kinds of searches that can allow us to hopefully reveal potentially interesting biological relationships. We're going to continue to piece all of those building blocks together this week in slightly different ways. So rather than doing one-on-one sequence comparisons, we're going to use a slightly different approach using collections of protein families rather than individual sequences for our analyses. And by doing that, it allows us to identify new members of a protein family that might not be detected using simple BLAST techniques or to find commonalities in domain architectures by using the collective characteristics of a family of proteins. So in these methods we start with a individual sequence of interest and we compare that individual sequence to collections of what are in essence multiple sequence alignments. So as we did in week one, let's go ahead and start off with some definitions. So the first term we need to define is a profile. And what a profile essentially is is a numerical representation of a multiple sequence alignment where we essentially take that alignment and reduce that to a scoring matrix. To do that and to come up with a nice alignment, it depends upon the patterns or motifs in the sequences of interest containing conserved residues because what we want to do at the end of the day is use these profiles to represent the common characteristics of a protein family. And again, by doing so, we can start to find similarities between sequences that just at a percent identity level are quite distant from one another. A good example of that is the homeodomain protein. So if you think about the consensus homeodomain, it's about 60 amino acids in length, but there are very, very few positions, just a handful of positions in that 60 amino acid stretch that are absolutely conserved. So because of that BLAST itself might not necessarily find all of the sequences in the family. You need to have other techniques that you can put into your arsenal, especially when dealing with distantly related proteins. So how do we actually go about getting these profiles? So we're going to start off with this multiple sequence alignment here. It is 10 residues across. And I want to pay particular attention to the last three positions. So if you look at position 10, you'll notice that there is a glycine conserved across all of the sequences in this alignment. Again, if we look at position 8, same deal. We've got a threonine in every sequence at position 8. But if we look at position 9, we have a little bit of wiggle room. Most of the sequences have a proline in that position, but some of them have a threonine. So starting with this multiple sequence alignment, we're going to ask four questions that are going to allow us to generate the numerical matrix. We're going to take a look at which residues are seen at each one of the positions in our alignment. We're going to ask what the frequency of observed residues are position by position, so looking for conserved positions, semi-conserved positions, non-conserved positions. And finally, we're going to look at where we can introduce gaps. This particular alignment is gap free, but gaps are allowed in these alignments. Once we answer those questions, we can go ahead and construct something that looks like this. It's called a position-specific scoring table or a position-specific scoring matrix. And the way this works is very similar to what we saw when we had our discussion of the blossom matrices in week one. So here we've got our alignment going 10 across in the matrix. The 10 across is now going from top to bottom here. The consensus sequence for each individual position is shown in the leftmost column, and we have the amino acids going across the top. So if we have a sequence of interest that we're comparing to this, and let's say we're looking at position 10, that absolutely conserved glycine residue, if we look at what a score would be if a glycine matches at position 10, we would give that 150 points, a very high positive score, to indicate the absolute conservation at that position. If we look at the next to last position, remember that we can either have a proline or a three-inning at that position, the consensus being the proline. So if we look at how many points a proline in position nine in a sequence of interest would earn, here you were to earn 89 points, not as many as the 150, and that's just to indicate that we have a little bit of wiggle in that position. We can allow for conservative substitutions. Finally, in a case like position two, where there is a lot of different residues that can exist at that position, the consensus might be a proline, but it scores still a positive score, but a lot smaller positive score. So again, same general rules as we saw with the Blossom matrices, you get the most points for an exact match, you still get points for a conservative substitution, just not as many, and you'll notice throughout the matrix a bunch of negative scores, so you would subtract points for a non-conservative substitution. Now the good news is you don't have to calculate these. There are ways that we're gonna talk about this morning where you can take a sequence of interest and compare them to hundreds and hundreds of these profiles. The second construct we're gonna talk about this morning is something called a pattern. So if we just think about the profiles we just saw, the profile gives you both position and frequency information. In the case of the patterns, it just gives you positional information, so really this just becomes a pattern matching exercise. So we have codes that look like this, let me help you decipher these. When you see some amino acids surrounded by square brackets, that just means one of them, so in this particular case, the first position described by this pattern, you would have either a phenylalanine or a tyrosine. In the second position where you see the X, any amino acid could exist. We now see a C all by itself, so that means there is an obligatory cysteine in the third position. So the X comes back here, just like we had an X here, but it's followed by a two, so that just quite simply means in positions four and five, you could put any two amino acids. We now move to this, now we're at position six, curly brackets as opposed to the square brackets. What that means is none of those residues, so you can have any residue at that position except a valine or an alanine. Again, we see the X, so that can be any amino acid, and here we have an H3, so similar to the notation here, we can have three histidines in a row at the end of this particular pattern. So again, a construct you can use to do a simple pattern matching exercise allowing you to assign a protein of interest to a known protein family. Both of these approaches are useful, and we're just gonna go right into putting these into practice. So the first resource we're gonna talk about this morning is called PFAM, and PFAM is led by Alex Bateman's group at the European Bioinformatics Institute, and what PFAM presents to you is a collection of these multiple alignments of protein domains, conserved protein regions, and the usual mantra here, when you have those things, they probably have structural, functional, or evolutionary importance. When we look at the entries within PFAM, obviously we'll see those multiple sequence alignments, but there's a lot of really useful information in there in addition. There's data on protein domain architectures, the species distribution of family members, information on solved protein structures, either by X-ray or NMR, and links to other relevant databases. The web resource is called PFAM, the database itself is called PFAM A, and these are based on curated alignments of known members of a protein family. So what happens here in the construction of these, there are experts at EBI who mine resources like the Uniprot Knowledge Base and other sources for known members of a protein family of interest. They take those sequences, they align those sequences, and then use that alignment as a seed, a seed alignment to find other members of the protein family using a very widely used program called Hammer. This is done in an iterative fashion to try to keep building on that seed alignment over and over and over again, with the goal of trying to find all of the true members of that particular protein family. Given the rigor of the method and the fact that there are experts involved in generating these alignments, when you have a hit to one of these PFAM profiles, it is highly likely to be a true positive. As we go through the examples, the same way as in week one, I very strongly encourage you to go back to your labs, your offices are sitting at home one night when there's nothing good to watch on cable, and go through the examples that we are gonna work through this week as well. If you just point your web browser to this particular URL, you'll find the sequences for all of the examples we're gonna work through this morning. Okay, so let's go to PFAM. The URL is quite simply pfam.xfam.org, a very simple web interface, nice and clean, very easy to find stuff. Three things I'm gonna point out to you where we're going to do a sequence search, but before we do that, you can actually do a keyword search, so you can just put in the name of a protein family and get back the multiple sequence alignments, all of those sequences that are associated with that particular family. The other one that's particularly interesting here is where it says view a clan. And so this is essentially a way to find information on super families, okay, families of families. And PFAM's formal definition of this is just a collection of families that have arisen from a single evolutionary origin. So it gives you some very good information of what came from what. But again, we're gonna focus this morning on our sequence search, so we're gonna go ahead and click on that link, gives us a box, and we could just go ahead and paste our sequence into the box, but you know that I always like to look at the options and show all of those things to you. So we're gonna do that by clicking on the word here, that's right here, okay. We get a slightly different screen now, we're gonna go ahead and paste the query sequence in here. We have just a few options, but they're important options. The most important one is the E value down here. So we spent a lot of time in week one talking about what general cutoffs are for your protein-based searches and your nucleotide-based searches. The default here is 1.0. The default I like to use for my protein-based searches is 0.001, 10 to the minus third. For purposes of the example, we're gonna just leave it the way it is just to see what we get. The other option here is something called the gathering threshold, and this is something that's assigned by the curator. It varies from family to family, but it represents a minimum score that's needed to be assigned to a particular PFAM family. Okay, that's in the box. In practice, you're gonna change this to 0.001, but we're gonna leave that there for now, and we're gonna go ahead and click on submit, and let's see what we get back. All right, so very simple output we get back. Here is our list of significant PFAM A matches. In this case, we get only one, and that's the cytochrome P450. We started off with a sequence that we pasted into our box. Under the alignment columns, it tells us what part of our sequence aligned with this cytochrome P450 domain. Starting at position 41, ending at position 500. Now I want you to imagine that we're gonna click on two links at the same time. We're gonna click on this show link, and this one here is where it says show hide alignments just so you can see everything that's available to you. So clicking on the upper link produces this very nice descriptive summary of what you're actually looking at, and this is one of the things that I really like about PFAM. They've gone through great care to make sure that the documentation is accessible and understandable. The bottom, the second link, the alignment, we see the alignment down here, so let's focus at the bottom of the screen first. The sequence we started with, the one we pasted in the box is the one it has shown along the bottom. We then have three other lines here. The first one is hashtag HMM. So this is the consensus sequence for the profile that your sequence, the query sequence, matched. The line following, hashtag match, shows you where the matches actually are. So this is the same rules as we saw in our BLAST outputs. Anytime you see the letter of the amino acid, that means it's an exact match. Anytime you see a plus sign, that infers that it is a conservative substitution. One of the things we have here that we didn't see in BLAST is this last line, hashtag PP, PP stands for posterior probability. So what this is giving you is a statistical measure position by position of how well your sequence, the one at the bottom, matches the profile, the one at the top. The higher the numbers, the better the match at each individual position. And the key is shown to you up here where it says hashtag PP. So zero means zero to 5%, one means five to 15, and so on. When you get to a nine, that's 85 to 95, and ideally you wanna see a whole bunch of asterisks because that's the highest posterior probability you can get, 95 to 100%. This is also shown to you graphically in the sequence line. You'll see there's various, mostly green, but various other colors here as well. Here is the heat map for the color key corresponding to the posterior probabilities. So we figured out our protein is part of this cytochrome P450 family. Let's learn a little bit more about that family by clicking on the P450 link right under the word family. So this now takes us to the landing page for this family. You'll notice that there's three tabs at the top. You could certainly go to Wikipedia, at like many of us do, and learn some good information there. There is another related database called Interpro that contains much of the similar information that I'm gonna show you here, but the middle tab is the PFEM tab, and that's what we're gonna concentrate on. This always starts off with an executive summary, a nice, high description of what this protein family is, does, is involved in metabolic processes, structural considerations, anything you can imagine that is important about the biology of this particular protein family. So a nice place to start if you really don't know anything about this family. You're then pointed to a number of literature references. These are considered to be the must reads about this particular family. So again, if you're starting out, it's a great place to start to learn more about this particular family. On the right, you see an example structure, and at the bottom of this box, it says view a different structure with a pull-down menu so you can cycle through any one of a number of solved NMR or x-ray structures. And again, these are solved structures, not predicted structures. Okay, at the top, there's a box that I have here in red, just to give us a little bit of a bigger view of the breadth of the cytochrome P450 family. It's involved in 455 different domain architectures corresponding to just shy of 42,000 sequences. So let's take a look at both of those. To get to the domain architectures, you could either click there or you can click over here where it says domain organization. So here, you see a whole bunch of different domain organizations that involve one or multiple copies or partial copies of the P450 domain. Okay, so all of the different types of proteins where you can find this domain. For each one of these, there's a show link. So you can click on that show link, download all of those sequences automatically. So in one fell swoop, you can find all of the sequences that correspond to a given domain architecture. So this is really great information to have since similar domain structures may help you deduce the function or the structure of a previously unknown protein. And from a clinical point of view, it's really important because you can start to ask whether or not a mutation knocks out a critical residue in either a structural or a functional domain. And that might help you explain a clinical phenotype that you're observing. Okay, so let's move from the domain architectures. We'll go down to the next link. There's 40,000 plus sequences. So let's go ahead and take a look at a subset of those. So when you go to the alignments section of the site, you're presented with a wide variety of ways you can look at these sequences. You can look at the seed sequences. Remember that those are the ones that the expert pulled together in order to develop the full set. You can look at it in a number of different formats. Easiest one to look at it in is just HTML format. So I'm gonna go ahead and click on the link under seed next to HTML. And what you get is something like this. Okay, so all of the sequences from that seed alignment are shown to you. This is a pre-made multiple sequence alignment. That's always nice when somebody who is expert in the field will do this for you. The colors represent roughly physical chemical properties and those are shown to you in the key on the right. Now what you'll notice is there are a few rows in here that have in parentheses the letters SS and that stands for secondary structure. So this is a nice little addition because we're not just including the sequence information here but we're starting to incorporate structural information and whenever you can do that that really helps better inform the alignments because it tells you right away which regions can tolerate gaps and which regions are less tolerant of gaps. Usually when you have a secondary structural region, the inclusion of a gap will often break whatever that secondary structure is. So you can see where are the alpha helices, where are the beta strands, where are the coiled regions as well, the random coils. We're gonna talk about how to interpret something like this at the end of the lecture when we talk about multiple sequence alignments. Let's go back to the PFAM page and ask, okay, 40,000 sequences. What's the species distribution of those 40,000 sequences? So this is a seemingly forbidding view here but it's actually a pretty easy to understand. If you start at the middle, you start to see where the metazoans are and then the chordates and then the mammalia and then when you finally get to the outside ring, that's where the individual species actually live in this view. I've highlighted one box here. You can barely see it in the red for an organism called nematostella. So this is an organism in the phylum Nidaria. These are the jellyfish. There's a phylum that is unified by the presence of these organisms all having nighted sites or stinging cells. So again, these are the hydra jellyfish corals and CNM and these why this is a particular interest to me is that my own research group is currently sequencing one member of this particular phylum called Hydractinia and this is a model that's showing great promise to help us advance our understanding of things like regeneration and allure recognition. So you can go in zero in on a particular organism by just clicking the right box. You're shown the lineage in the upper right so a much easier way to understand the lineage of the organism. But more importantly is this selections box here at the bottom where you could just download in this case the 90 sequences that come from this one species that contain the P450 motif. So a very valuable resource from the standpoint of any of you who may be thinking about questions in the realm of comparative genomics. All right, so we're now back to the PFAM homepage for P450. I wanna concentrate on what's down here at the bottom, the external database links and one in particular that's called ProCyte. So now we're gonna shift from having, looking at the profiles to looking at the patterns that I showed you earlier. So much easier to look at just a single page, cytochrome P450 again another nice executive summary with citations that you can go off and read. But in the box at the bottom in the big blue box, the first lines you see consensus pattern in the format that we showed you earlier. So the square brackets indicating one of those residues has to exist and so on. To me the really important thing on this page is right there experts to contact by email. So these are folks who have made themselves available to you. They're the people who have created these entries because they are experts in this field. They know everything about this particular protein family. So if you have a question, this person is available to you. Click on the link, you send them an email. All right, when an expert in the field makes themselves that easily available to you, you should absolutely take advantage of it. So this is very nice to us from the standpoint of not having to always rely on what is available to you in the public databases, you can actually rely on a human resource as well. So that's it for PFAM. We're gonna move to the second resource available to you where we can do these one against many comparisons. And this one is called the Conserved Domain Database or CDD. This is a resource made available by our friends over at NCBI across campus where you can identify conserved domains in a protein sequence just as we did with PFAM. A little bit of a different approach because it incorporates three-dimensional structural information that helps to define the domains of boundaries and refine the alignments. It uses a variety of source data. It uses the data from PFAM A that we just looked at, but it also uses data from four other data sources as well to augment the predictions that it can perform. The way that the profiles are constructed in CDD is using a variant of BLAST called RPS BLAST, Reverse Position Specific BLAST, where you start with your query sequence. It's used to search a database of pre-calculated matrices as well, but the methodology that is used is not the same as PFAM. It's a different algorithm. So this brings back one of the points we talked about in week one. Since each one of these methods uses a slightly different algorithm to make its determinations, you should use both of the methods that I'm presenting to you today, both PFAM and CDD, when delving into an examination of protein domains, just to make sure that you've got your bases covered. So we go to the CDD database. This is the short URL for that resource. So if you go to this website, you'll see a link to CDD. It will bring you to this page. As always, we have our box where we can put in our sequence, and in this case, from the example data set, we're putting in the sequence for a DCC from human, deleted and colorectal carcinoma, to use a clinical example. We have a bunch of options that are available to us here. Again, the most important one is the E-value threshold. The default here is set at 0.01. Again, I would bump that down a little bit to 0.001, but just so you can see what kind of data you'll get, we're gonna leave it the way it is this morning. Once you've made those changes, whatever changes you wanna make here, go ahead and click submit. And we get a display. Certain elements should be reminiscent of what we saw in week one with our blast output. So you have, at the top, a schematic, your sequence in the black bar going from one to 1447. And then a hit list of the domains that are matched by comparing this particular sequence to all of the CDD profiles. They're shown to you in graphic format here, and each one of these boxes corresponds to one thing, one of the items in the list. The very first one in the list, which is marked IG1 neogenin, corresponds to this particular hit, the very first one next to the word specific hits. And if we wanna get some more information about that, we're gonna just click on the plus sign that's right there. That'll expand out the screen. We get, again, a nice little summary of what this domain is, what it does, where it's found, and our alignment. A much simpler view of the alignment here because we just are looking at positions at either match or mismatch. So any place you see a blue residue, that is a mismatch. Any place you see a red residue, that is a match. Your E value is right here, 4.8 times 10 to the minus 51. So we're gonna consider that to be an extremely good match. And it's only, the domain only covers part of our sequence of interest corresponding to what we see here, starting at position 41, going to position 136. Right under the word accession, you have a link. And that link will take you to more information about this particular domain. So as before, again, the summary, a nice set of references that are germane to this particular immunoglobulin-like domain. But to me, the most useful part of this page, we're gonna pretend we just scroll down to the bottom, is this bit down here, where you can get sequence alignments in a variety of formats. This particular format is called Compact Hypertext. It's a compact version showing you conserved blocks. And you can see those blocks very plainly by just looking at the alignment on the screen. Using the links on this page, again, you can just download these sequences for your own use, for downstream studies, such as phylogenetic analyses, or to build more complex multiple sequence alignments. So whenever these resources are available to you, these pre-made alignments, pre-made collections of sequences, you really should take advantage of them because it is a very time-consuming exercise to fish all of these things out of the public databases. It's not a complete substitute, but it's a great way to get you a good way down the road in collecting those data sets. All right, so that takes care of the first bit, the one against many. Now we're gonna flip things around a little bit. In the first case, we were using a single sequence to search against protein families. Now we're gonna do the converse. So the first example of this is something called PsiBlast. So yet another version of Blast from NCBI, Position Specific Iterated Blast Search. And the value of this technique is that it really is a great tool for finding similarity between proteins of interest that you may not find during a standard Blast search or repeating theme for this morning's lecture. So very easy to use method and we're gonna walk through this. You're gonna start off with your sequence of interest and we're just gonna do a run-of-the-mill Blast P search the same way we did in week one of the series. Once we have the results of that Blast P search, the method, the PsiBlast method, will calculate for you a Position Specific Scoring Matrix based on all of the sequences that were found. So it'll have the hit list, it'll make the multiple sequence alignment, it'll make the numerical matrix based on that multiple sequence alignment. Once that's calculated, the matrix replaces the query for the next round of searches. So your original query, the sequence you started with basically gets thrown out at this point and we're just gonna use the matrix at this point for successive rounds of searches and it's just gonna go around and around as many times as it has to until no new significant alignments are found. Okay, so kind of neat method. So how do we actually do this ourselves? We're gonna go back to the Blast homepage. Hopefully this looks very familiar to you at this point. Just by way of reminder, most of the basic Blast programs are found right in this center section here. The second one down is Protein Blast and that's the one we're gonna consider here. Okay, so here is our query page. As before, we're gonna just paste a sequence into the box. In this case, this is the high mobility protein from Human. If we look at the first set of options that are available to us, choose Search Set. For the database, I've selected RefSeq. So by way of reminder, RefSeq is the curated effort being led by NCBI to represent every molecule in the central dogma once and only once. So each DNA sequence, each mRNA sequence, each protein sequence. Under Organism, I put in Deuterostomes rather than just pull everything in from all organisms just to show you that you can limit your search by taxonomic group. Beneath that, it says Exclude and there's a checkbox which I have checked off. It is not checked by default. Models XM slash XP. So in our discussion of RefSeq, recall that in the construction of the RefSeq databases, there are certain entries that are identified to you with their accession number, their unique identifier starting with an X. When you see XM, those are predicted mRNAs based on genomic sequence. When you see XP, those are predicted proteins based on the predicted mRNAs. So these are all models. These are not experimentally verified. So for purposes of doing this type of analysis, I like to check that box just so I can deal strictly with sequences that have been experimentally verified. We then move a little bit further down. The program selection, of course, in this case, is side blast. As always, we're gonna look at our algorithm parameters to see what we can tweak to improve our results. So E values are the thing we always pay attention to. So the default here is 10. I've changed it to .001. Under filters and masking, I've checked off the low complexity regions box. So again, by way of reminder, the low complexity regions are when you see homopalmeric runs of either one or a couple of amino acids or a string of Qs or a string of Gs. Those things can found the blast algorithm. So it's usually best to check that box to make sure that you get a better set of results. Further down the side blast threshold, the default again is .05. I've changed it to .001, my usual cutoff. Once all of those things are done, we're gonna go ahead and blast this. And this is what we get back. So very similar to what we've seen before when we talked about blast in detail a couple of weeks ago. At the top, you have your query sequence that is represented once again by the black bar positions one to 215. There are two instances of an HMG box domain found here. So two DNA binding regions in this particular protein and the overview of the blast results. We're not gonna consider that this week. We're gonna go right down to the hit list. And at the top it says sequences producing significant alignments with an E value better than the threshold, better than our .001 cutoff. Okay. Let me take you through the column structure here. So as always, there is the short description, the death line for each one of these found proteins. Information on the scores, but remember from week one, we're not gonna pay attention to the scores. We're more interested in the E values meeting our cutoff. And for the ones you see on the screen, they obviously do. The percent identity, how many residues exactly match our query sequence. And at this top part of the display, they're very high values. The accession numbers, and you can click on any of these to take you to the individual gem bank entry. And two columns we haven't seen before. The first one says select for side blast. And you'll see that all of them are checked. So in this first round again, all we've done is just a blast P search, even though we were going through the side blast interface. But for the next round, we need to be able to calculate that matrix that I told you about. So everything that's checked off, all of those sequences will be included in the calculation of that position specific scoring matrix that will become the query for the next round. The last column says use to build PSSM. Well, we don't have one of those yet. So that's why there's nothing in that column, but you'll see that populate in the next round. Okay, cause all of them are new. All right, pretend to scroll down. We're at the bottom of our list here. If we look at this and we decide that all of these sequences should go to the next round. And again, keeping in mind the importance of manual inspection of your results, we're just gonna go ahead and click on the go button right there. And while you go pour yourself a cup of coffee, it's going to go ahead and make the multiple sequence alignment, calculate the matrix and present you back the hit list. So this is now round two, same display as before, but now you see all the green check marks. So these were all of the sequences that were carried over from round one. If we scroll down, it looks a little bit different. So we got a whole bunch of yellow going on here. And you notice that there are no green check marks here. So these are new sequences that were found using this method that were not found in the BLAST P search. So right there, it shows you that BLAST doesn't find everything. Sometimes we need to use more advanced techniques to tease out the sequences. It might be a little bit more divergent, but that are still related to the one that you started with. Now, in this particular case, this is where you have to put your biology glasses on because I had given you two rules in week one about what you should be looking for as far as your cut-offs go. E values, 10 to the minus third or better. Percent identity, 25% or higher. And if you look at some of these, especially towards the top of the screen, 21, 22, 19 over here, those are things that don't match our second criterion. So what you need to do is take a look at them, take a look at what it says in the death line, convince yourself whether or not these are truly related to your sequence of interest. If they do belong to the set, if they do, great. Leave the box checked. If your biological knowledge tells you otherwise, uncheck the box so it does not get included in the subsequent rounds. So again, another case of not leaving your biology at the door when looking at these kinds of results. All right, we're gonna click on go again and we're gonna go around two, three, four, five. In this case, nine times. The ninth iteration here, and just to give you a sense of the time, this probably took a half an hour start to finish going through round by round. You know you're done when you see this message right here that says no new sequences are found above the .001 threshold. So fine, you have reached convergence at this point. Even if you were to issue another command, you wouldn't find anything new. In this particular case, in round one, we had found 117 proteins matching our query by using this method, using the collective characteristics of the sequences that we found that kept getting added and added to our position-specific scoring matrix. We ended up with 156. So we found 39 additional proteins that blast itself did not identify. So again, the power of using all of the information that's found in all of the sequences within a family. And as always, my usual caveat, make sure you check the statistics before you go run off to the lab and start designing experiments. So pretty easy to use. A lot of the stuff is done for you in the background. Another similar method also by NCBI is a little bit of a different flavor of this called Delta Blast. It basically takes the same approach where you start with a query of interest, but instead of doing a blast P search in round one, the search is gonna be done against the CDD database that we just talked about. So it's gonna take your sequence of interest, search it against the CDDs, use the results of that to then compute the position specific scoring matrix and use that as the basis for subsequent rounds of searches. So from that point on, it's pretty much the same. Because you're starting with the data from CDD, which includes structural information, it's a method that improves homology detection. You will get some nice high quality alignments even at very low levels of sequence similarity. And it really tries to capitalize on the homologous relationships that are already cataloged within CDD. We're not gonna work an example this morning, I just show you this as something you should add to your arsenal. It's very easy to use, very similar to what we just did with our side blast search. So once again, another one of the cases where because methodologies are slightly different, you should use multiple methods do both in this particular case. All right, so that's gonna be the end of our discussion of pairwise sequence alignments which we spent all of week one talking about in a good chunk of this morning. And we're just now gonna move up one level of complexity going to multiple sequence alignments. And just keep in mind that when we do these multiple sequence alignments, what we're doing in essence is a whole bunch of pairwise sequence alignments that are all compiled together. So why do you wanna do these? What can you learn from these multiple sequence alignments? A lot of the themes that we've been talking about really, really now start to come together when we move to looking at things in this sort of alignment fashion. It becomes very easy to identify conserved regions, patterns, and domains. That's important to you when you start to do experimental design. So it's not just an academic exercise of look at the pretty multiple sequence alignment I just made that I'm gonna make figures three of my paper. You can use this to help you with experimental design. Particularly important if you're doing things like site-directed mutagenesis. It helps you predict structure and function. And as we've already seen in the cases this morning, helps you identify new members of a protein family. It also provides the basis for predicting secondary structure. There are a large number of algorithms available to you where you can start with input sequences and predict where the alpha helices, beta strands, random coils are. It is an absolute must when you are performing phylogenetic analysis because you cannot do a phylogenetic analysis without a multiple sequence alignment. Those are the inputs to these methods. So you have to do that first before you apply the phylogenetic techniques and then hopefully find yourselves in a position where you can start to infer evolutionary relationships such as homology. And as we've just seen in examples in practice this morning, generating the kinds of matrices we use for these more sensitive search methods. So what do you need to think about when you start to go about the business of constructing these multiple sequence alignments? The usual cast of characters comes into play here. So we're looking for things like absolute sequence similarity where we are going to line up as many common characters, as many columns as possible using those parts to pretty much anchor the alignment. So the more absolutely conserved positions that we can align, it makes the alignments of the rest of the sequences easier. Taking into account conservation. Of course, you're not always going to see absolute conservation. We want to take into account residues that can substitute for one another and not adversely affect the structure or the function of the protein. And finally, structural similarity. We started to see this theme come into play this morning any time you have information about the structure of the proteins, either from X-ray or NMR studies. You can use that information to fine tune the alignment. So I know most of you, if not all of you have at some point tried to make or have made a multiple sequence alignment. So I just want to offer you some general guidelines this morning that will hopefully help you up your game a notch when you do this the next time around. There's all the ways the question of what to align. Protein sequences or nucleotide sequences. I prefer to concentrate on the protein level just because there is more what's called information content in the constellation of the 20 amino acid as compared to the four nucleotide bases that are all very structurally similar to each other. The protein alignments tend to be more informative. Now, of course, that depends on the context of what you're doing. If you're doing RNA-seq experiments, you're obviously going to be doing things at the nucleotide level. If you're looking at regulatory elements, you have to look at the nucleotide level. And Laurel Nitsky will talk a lot more about this in next week's lecture that's devoted to the discussion of regulatory elements. So as this cascade starts to come in, this is sort of the general overview of what you need to think about when you're doing these kinds of alignments. So the first step, hopefully, by sitting through these two lectures, you already know how to do this very well, finding the sequences that you want to align that satisfy a reasonable E-value cutoff. You then will run your alignment method of choice and do the inspection. The inspection becomes important because there is always an element of fine-tuning the alignments. The methods are pretty good, but they are not foolproof. They're sometimes where, because of the peculiarities of a particular input set, you're going to have to start moving things around a little bit yourselves. What you're going to look for are sequences that seriously disrupt the alignment. And by that, I mean that introduce a lot of gaps. You take those out, you can redo the alignment, then add back the more recalcitrant sequences based on key residues in the alignment, and then go ahead and do your interpretation. So let's talk a little bit more about each one of these steps. As probably as obvious at this point, the most important part of this is selecting the sequences, especially if you're going to be doing phylogenetic analyses because this is truly a case of garbage in, garbage out. You want to use a reasonable number of sequences to avoid technical difficulties. And that's because when we do these multiple sequence alignments, these are global alignment methods. So recall the difference that we talked about in week one between a global alignment method and a local alignment method. In the local alignment methods, everything we've been talking about up until now, you're looking for short segments of two sequences that you can align to each other, structural domains, functional domains, but not trying to force an alignment across the entire length of the sequences. That is what you're doing in a global alignment. And because of that, because of the nature of global alignments, the compute time increases exponentially as you add more and more sequences to the set. Now, while the alignment algorithms keep getting better and better with time, they tend to start to become ineffective on huge data sets and could give you some inaccurate alignments because of the amount of calculations that have to go through these larger sets. So it just becomes much more computationally intensive. The other thing, consideration's a little bit more practical here, that if you put too many sequences into your set, especially if you're going to do a phylogenetic analysis, it somewhat starts to become intractable. So while the methods and the newer methods can handle hundreds and even thousands of sequences, where I like to recommend people start is around the range of 10 to 15 sequences, you make, in essence, the kind of seed alignment we talked about earlier this morning. And then you can add sequences in to sort of a ballpark upper limit of 50 to 100 sequences. Again, it's possible to go higher with some of the methods, but you can answer the great majority of biological questions that you might be posing within this range. So unless you're really in the business of doing hardcore cataloging, deep characterization of protein families, things like that, for most questions, you can kind of stay within this limit and get the answers that you're looking for. All right, selecting the sequences. Because it is a global alignment method, they should be about the same length. That's where the global alignment methods tend to work best. You can trim the sequences down. So you're not obligated to use your sequences from the N-terminus to the C-terminus in every case. And it's always, it's quite common to find folks and especially the experts in the field trimming those regions back to particular regions of interest. And what can help guide you to which regions you should keep are, is the information that you get from the pairwise search methods and the profile-based search methods we've talked about through the two lectures. You should consider the degree of similarity in the sequence set. Again, depending on what question you're asking, if you're looking to find out what residues are required, which ones are highly conserved, you wanna use closely related sequences. If you're looking to study evolutionary relationships, you wanna look for more divergent sequences. Now, ideally in the first step, you wanna combine these two. So a good starting point is to use sequences that are between 30 to 70% similar to most of the other sequences in the data set. And the reason for that is if you have a data set where all of the sequences are too similar, well, you're not gonna learn anything you didn't already know. And if you have sequences that are very dissimilar from one another, too divergent from one another, you just might not be able to align them. All right, of course, all software will yield an alignment, but again, the question, does the alignment make biological sense? So you wanna examine the alignment looking for conservation of residues across the alignment, conservation of physical chemical properties that you might have knowledge of. A relatively neat block type structure, like we saw in the CDD example, and you do not want to have an excessive number of gaps. So similarly to our discussion with Blast, please keep in mind that the gaps represent biological events. They represent insertions or deletions. So this cannot be approached like a word processing exercise where you just put in dashes to make the alignment look better. You have to have a rationale for doing it. Inspection is absolutely necessary. If the alignment's good, you can go ahead and add new sequences to the data set, then realign. If it's not looking so good, again, remove any sequences that result in the inclusion of long gaps that's always something to be looking for, then realign. So you're starting to get the idea here that it's not throw all of your sequences into the program all at once, start with a subset, and then build out from there. And you end up getting much better alignments that way. To help you with choosing which ones to keep in, which ones to take out, which ones to put back in, there are some visualization tools available to you that can help you identify key residues and problem regions. And we'll see an example of that at the end of today's lecture. You want to cross check against expertly created multiple sequence alignments and just this morning, we've seen several places where you can find those. So those can help you inform your own alignments as well, or you can use those for the basis of alignments. You can start with those and then add your sequences in. And the point that I've already made several times now about trying to use as much structural information as you can, because that again is going to tell you where the gaps can or cannot be tolerated. Nails down those structurally important regions. Okay. Now we're back to putting the thinking caps on interpretation. When you have your final alignment, you're going to take a look at it. And whenever you see an absolutely conserved position, that's telling you that those are required for proper structure and function. They've been conserved for a reason. If we have a relatively well conserved position, that indicates to you that they can tolerate a limited amount of change and they don't adversely affect either structure or function. So you might have aromatics or positively charged residue substituting for each other, but it doesn't disrupt the actual function or structure of the protein. Most people focus on these first two, but I think really the most important one is this third one here, because this is what gives you a clue to biological innovation. When you see non-conserved positions, those are the ones that can, for lack of a better way of putting it, mutate freely. And because of those mutations, it could possibly give rise to new proteins with new functions. So again, don't discount those positions where you don't see discount, you don't see decent conservation because mother nature may be trying to tell you something. Finally, going back to the concept of the gaps, when you see gap-free blocks, regions of secondary structure more than likely, gap-rich blocks probably correspond to the unstructured or loop regions. And again, if you have structural information, you can confirm that. So those are all the considerations before we even start. But once you do that, once you assemble the data set, you have a nice pristine set that you feel comfortable with that can help you answer the question that you're seeking to answer, you now go to a method. And so the method we're gonna discuss this morning is something called Clustyl Omega. This is the most widely used program for multiple sequence alignments. And it just allows you to make these alignments either on the nucleotide level or on the protein level. And it does quick business of it. It aligns the data sets very quickly, very easily. One of the things I particularly like about this method is that you can take a pre-existing alignment, which in their terminology, they call an external profile and align your sequences against that external profile. So again, taking some of the alignments we saw earlier this morning and building on those with new information that you may have. It also works very well with Jalview. This is the app that we're gonna look at at the end of today's lecture. Just so you have some sense of how this method works, how the alignment is constructed at a 30,000 foot level, this is an example of a method called a progressive alignment. The way a progressive alignment is done is you start with your input set and the method is gonna figure out which of the sequences in your set are most related to each other, which ones have the greatest sequence similarity to one another. And it's going to use that information to say, okay, the two most related sequences, I'm gonna align those first. And then I'm gonna just keep building out from there on that alignment using that information. So the easiest way to understand this is it starts with the easy alignments first, builds it out and then when you get to the sequences that are more divergent, at that point, so much of the sequence set has been aligned that it makes it easier to place to make the harder alignments at the end. So it just builds them up from the most related to the least related. The advantages of the method, generally fast and you get alignments that are of really good quality. When we perform this method, we're gonna get a bunch of data back. We're obviously going to get a multiple sequence alignment. We're gonna get the scores that were used to inform the alignment by using that progressive method that I just described to you. And two sets of trees, one called a cladogram and one called a phylogram. Both of them are assumed to be estimates of a phylogeny. In the case of the cladogram, the branches are of equal length, so it can show you common ancestry, but it doesn't give you any sort of indication about the amount of evolutionary time that is separating any one protein from any other protein in your data set. The phylogram is slightly different. The branches are not of equal length, so they are proportional to the amount of inferred evolutionary change. The conservation patterns that are observed by this method are as described on the board. So aromatics are treated as a single group, basic side chains, acidic side chains, and so on. So this comes into play here. The notation that appears under each alignment. So we're gonna see an alignment across the bottom. These symbols will appear. Any column that's an entirely conserved column, you'll see an asterisk and you want that at least 10% of your positions because again, those serve as very useful anchor points for building the alignment. If you see the colon, that means that that is a conserved position. Only strongly similar residues exist there, so if we think back to the previous slide, residues from just one of the groups on that slide. This interesting term, semi-conserved, weekly similar properties and this just means that there are residues from two of the groups on the preceding slide. How this works is very simple. We go to EBI's website. URL is right there. As always, we have our box and in this case, using the example from the sample data set, we have five FOS proteins that are going to be aligned. In the second part here, set your parameters. There's two things we're gonna look at by clicking on more options. The first one is the output format. You wanna make sure that it reads cluster with numbers, just so you can keep track of what parts of which sequences are where it'll just make sure that the alignment is numbered. The other thing I like to do is this very last one where it says order and I change that to input. If you leave it as input, the alignment is presented to you, the order of the sequences is the same as the input order in the query box. If you don't click that, it's going to rejigger the alignment based on how that progressive alignment was performed. Finally, we're gonna go ahead and just click on the submit box. This takes a few minutes to come back but you will eventually get a screen that looks like this. Here are our five now aligned proteins. The alignment's pretty good because we have a lot of positions as you can see by the asterisks that are absolutely conserved. But then again, the ones with the colon are the conserved positions and the semi-conserved positions are the ones with the dot with the color key being similar to the slide that I showed you a few back with the conservation groups. There's a tab at the top that says phylogenetic tree. So let's go ahead and pretend we've clicked on that. And so this is the first of the two trees that this method gives you. So this is the clattergram where all of the branch lengths are equal. So you get a sense of who is more closely related to who but no sense of inferred evolutionary time. If we click on the real radio button that will transform this so that the branch lengths are not all equal. There you have relative distances here giving you some sense of evolutionary time. I will tell you that these both look very different from the real evolutionary tree that describes the relationships between these five that we're gonna make shortly. So consider this just to be a sneak peek. All right. At the top results summary, we're gonna click on that. It gives you the ability to download a number of files the percent identity matrix that was used to perform the progressive alignments. Your input sequence is just in case you've misplaced them but we're going to just concentrate on JALVU over here this Java applet that is going to let us work with the alignment a little bit more interactively. What it allows you to do is to start manually editing these alignments. You can color residues based on a number of properties. You can do a pairwise alignment of selected sequences from within the sequence set, determine consensus sequences, remove redundant sequences and calculate a first pass phylogenetic tree. So again, imagine you've clicked on that JALVU button. You're gonna see something that looks like this. A new screen will pop up. Here are our five sequences going across and you see three histograms at the bottom. The first one is called conservation. So this is your indication of percent identity at each individual position. So the higher the number, the higher degree of identity at that position. You see a quality score underneath. This is based on scores from the blossom matrices and it generally parallels what you see in the conservation line. Finally, the consensus at the bottom, you see shown across the bottom. Again, with the histogram showing you the confidence at each position, there are some places where you see a plus sign and that just means that a consensus residue could not be calculated for that particular position. Now up here in the corner, and I know this is impossible to see and it's very small on your screens as well when you go work the examples, are another bunch of pull down menus. So now we're gonna start to manipulate this alignment a little bit. So one of those pull down menus is called color, British spelling, and if you pick the percentage identity option from that menu, this is now what you see. So just going back one just so you can see the contrast. This is what we started with. This is what we now have. Different shades of blue coloring our alignment and the darker the blue is in any one part of the alignment, the greater the agreement, the higher the level of conservation in your alignment. So the regions that are in dark blue have 81 to 100% identity going down through the blues to the whites. So this is a really quick and easy way to identify your own motifs in a particular set of sequences that you're interested in and find candidate regions that once again may have structural or functional significance. So you're gonna look for those blocks of absolute sequence identity because that's gonna tell you where most evolutionary pressure has been applied to keep those residues conserved. Let's play with the output one more time. I selected two of the sequences, the ones in the red box. You go to the calculate menu and select pair-wise alignments. You get another box just showing you in this case the alignment between the mouse and the human-phosphor proteins and it gives you a percent identity just shy of 96% at the bottom. The last recasting of this that we're gonna do this morning is to construct our first real evolutionary tree unlike the phylograms and the cladograms that we saw before. So we're gonna go back to these menus at the top, pick calculate tree, then a submenu will appear and you're gonna pick neighbor joining using Blossom 62. There are two methods that are available to you there and two different sets of matrices but this is the one we're gonna choose and here is the tree which obviously looks very different than what we saw on the cluster website. So it's the fastest way to generate a tree without having to go to another piece of software. Again, I would use this as a good first approximation so you can see which ones of the proteins in your data set are most related to one another and by how much and then take this off to another method for a more advanced analysis if you're interested. So while we've concentrated on cluster this morning, obviously there's a wide variety of multiple sequence alignment programs that are out there and each one, like all of the methods that I presented to you over the two weeks that I've been with you have their strengths and have their weaknesses. So in practice, you should always use again different methods. The one I wanna point out to you specifically a method that can help you with this is called tea coffee and who doesn't love coffee. This comes out of Cedric Notre-Dame's lab where you can combine sequence information, profile information and structural information in doing the alignments. One of the very nice features of this is that it has a specialized algorithm for aligning regions of non-soluble protein. So there are different rules for transmembrane proteins because as you might imagine, the hydrophobic residues have to get themselves into the lipid bilayer. So the alignments are slightly different. So special rules for special data sets. Also, just going back to my recommendation of using multiple methods, once you do that and you have the outputs from them, you could come to tea coffee and it has a way that you can basically basically pass all of your alignments through the burg rinder and come out with a master alignment and that can reveal to you inconsistencies that you might find between the alignment methods. There's a very nice tutorial that was published back in 14 that I would recommend to you and this is the website for this particular resource. Okay, so today we're gonna end things where it all started in week one and my general disdain for the black box. So websites, we visited a lot of them in our travels over the two weeks. You'll see many more of them in the remaining lectures in this series. And it makes it very tempting to just take some sort of input, paste it into the box, click a button and you take your results at face value. So in goes the sequence, out comes the results, but you have absolutely no idea what happened within the box. And I wanna draw an analogy to what each of you hopefully already do at the bench and use CRISPR as an example of that. A very complicated but very powerful technology that you would never think twice about using without thoroughly understanding it first, okay? You need to do the same thing with all of these computational methodologies as well. So you're gonna take the results that you get from these methods, put your thinking cap on and do the manual inspection. Again, your biological knowledge is just as valuable as the results that come out of these methods. And when you put your thinking cap on, you inspect the results and you couple all of that with your own knowledge of the biological system you're studying, that will always, always serve you well, okay? The last thing I'm gonna offer you this morning, just for those of you who are interested in going the next step and learning a lot more in depth about these computational techniques are these two papers that were written by David Searles. The first one is called an online bioinformatics curriculum and the second one is a new online computational biology curriculum, interchanging the terms. The first one gives you what are called selected, excuse me, suggested curriculum tracks that are tailored to individual needs. So if you wanna learn more about bioinformatic analyses, data mining and so on, it will give you a list of massively open online courses such as this one that are available to you freely that you can take advantage of. The newer article recasts this in what they call a virtual course catalog divided into the departments of an imagined university. And the second paper refers heavily to the first. So I would suggest for those of you who wanna take advantage of these online courses that you download both of these papers so you can tailor a plan to your own interests that you can pursue during those elusive free moments in the lab, okay? So with that we're gonna conclude. Next week we're gonna shift our focus back to the level of whole genome. So Laurel Nitsky, one of my fellow faculty members from the Genome Institute will be joining us talking about her area of expertise, regulatory and epigenetic landscapes of mammalian genome. So with that, I thank you once again for joining us this morning and we'll see y'all next week.