 All right, good morning, everyone, and welcome to week two of this year's current topic series. Let's go ahead and get the formalities out of the way right from the start, so my CME disclosure statement, I don't have any relevant financial relationships with any commercial interest. Okay, so with that out of the way, my job today, in the same way that Dr. Green laid the groundwork for upcoming lectures in this series by providing you a broad overview of the field of genomics last week, I'm going to be providing you a solid foundation for the upcoming lectures devoted to bioinformatics. I mentioned last week that genomics and bioinformatics go hand-in-hand with one another. They're inextricably linked, and Eric really reinforced that sentiment when he showed you this figure from NHGRI's current vision document showing the spectrum of genomic-based research going from the understanding of the structure of genomes all the way through to our efforts focused on improving the effectiveness of healthcare. Now our ability to work across this entire spectrum really rests in large part on being able to apply computational approaches across the board, to analyze the kinds of data that are generated in the course of whole genome sequencing efforts, genome-wide association studies, clinical sequencing efforts, and pharmacogenomics studies, or any of the other many kinds of genomic science you'll be hearing about throughout this series. So given, again, that inextricable relationship between genomics and bioinformatics, what we're going to go through over the next two weeks is as gentle of an introduction as I can give you to some of the major tools used in bioinformatics. You're going to see the tools and the concepts underlying these tools again and again throughout this series, so it's really important to have a basic understanding of the theory that underlies these techniques to start taking these techniques away from the realm of the black box. So I hope you'll find this information useful and that you'll pick up a few strategies along the way that will hopefully be of practical use in your own studies. So with that, let's just start right off by reinforcing the importance of alignments as one of the most powerful approaches that we have in the study of sequence data, specifically with respect to determining similarity and deducing homology. So start off with a very basic question, why bother constructing sequence alignments? What do they do for us? And that question applies whether we're talking about a pairwise alignment where we're taking two individual sequences and aligning them to one another, or to a multiple sequence alignment where we're looking at a collection of presumably related sequences. So in either case, what we're trying to do is get a measure of relatedness between either nucleotide or amino acid sequences. And once we have that measure of relatedness, it allows us to draw certain biological inferences. So we might find structural relationships between these sequences, functional relationships, or evolutionary relationships. Now before we start to talk about the techniques, we need to define a couple of key terms. It's important to use the right terminology when we talk about these phylogenetic relationships. So the first term we're going to talk about is similarity. So what I'd like you to do is imagine that you have two sequences, A and B, and that you align those two sequences to one another. And the easiest way to convey how related those sequences are to one another is to just simply count the number of matches and mismatches along that alignment. So that gives you the quantitative measure. Again, similarity. It's based on an observable. In this case, how many positions are aligned with one another. We usually do express this as a percent identity. And by looking at this, it allows us to quantify changes that we see as two sequences diverge from one another, whether we have substitutions, insertions, or deletions. It'll also, from a functional standpoint, allows us to identify residues that are crucial for maintaining structure and by extension function. If we start seeing very high percentages of sequence similarity, high degrees of sequence similarity, what that might be telling us is that these two sequences in this pair of eyes alignment have a common evolutionary history or possible commonality in biological functions. So for example, we might be able to identify regions that are important from a functional standpoint when thinking about things at the protein level or regions that are important to expression when we're looking at sequences at the nucleotide level. The second term we need to define this morning is the one that is most often misused and that one is homology. So what homology is, is the conclusion that we get by looking at similarity. It implies an evolutionary relationship. When there is homology between sequences, those sequences are termed homologs. So these are genes that have arisen from a common ancestor. And unlike similarity that is the quantifiable term, this is the conclusion again. So either genes are or are not homologous. We don't measure this in degrees. And just to drive home the point, a quote from Walter Fitch, who was one of the grandfathers of bioinformatics. He says, it's worth repeating here that homology, like pregnancy, is indivisible. You're either homologous, pregnant, or you are not. So again, this is the conclusion. Now that term homologue actually breaks down into several subtypes, might describe several different types of evolutionary relationships. And we're going to focus specifically on two of them here. So the first type of homology has to do with orthologs. So orthologs are genes that have diverged as a result of a speciation event. So there is direct descendants of a sequence and a common ancestor. They more likely have similar domain and three-dimensional structure. Because of that, they usually retain the same or very similar biological functions over evolutionary time. And we can use that information to predict gene function in novel genomes, as we're starting to assign functions to newly identified genes. So put very simply, orthologs can be thought of as the same gene in different species. The second term, the second type of homology we're going to talk about is paralogy. So when we have paralogs, these are genes that arose by the duplication of a single gene in a particular lineage. So these might be less likely to perform similar functions, but they can take on new functions over evolutionary time. So this is where Mother Nature gets to be inventive, providing some insight into evolutionary innovations, taking bits and pieces of proteins, restructuring domains in order to accomplish new jobs within the cell. So again, put simply, paralogs can be thought of as similar genes in the same species, adapting or co-opting a preexisting gene product for new biological functions. So just to make this a little bit clearer, because the definitions are a little hard to follow, let's look at a picture. So in this case, we have a gene, gene alpha, in a common ancestor. And along the evolutionary time, there's been a couple of speciation events that have given rise to three species. Each one of them has an orthologue to gene alpha, gene 1, gene 2, and gene 3. So each one of these three genes arose from this common ancestor alpha. Now, if we go back a little further in evolutionary time, we roll back the clock a bit, and we consider the possibility of a gene duplication predating alpha. So we have a gene duplication here at the base of the tree, giving rise to genes alpha and beta in this case. So alpha went along and gave rise to three orthologs. Beta, in this case, underwent some similar speciation events, not exactly the same, but gave rise to three genes, four, five, and six. So how are these things all related to one another? Any of the genes 1, 2, and 3 are orthologous to one another. Genes 4, 5, and 6 are also orthologous to one another, because, again, in each case, they had a common ancestor. But now if we take any pair of 1, 2, and 3 versus 4, 5, and 6, those are the paralogs, because they go way back to this gene duplication event. That's how they're related to one another. So that takes care of the definitions. Now, how do we actually identify the orthologs, the same gene in different species? And the most common way and most straightforward way we have of identifying candidate orthologs is by looking for reciprocal best hits. And when I say candidate orthologs, that's very important. The word candidate really, really is important, because keep in mind that a lot of the things we're going to talk about through this series deal with computational predictions. So I really need to underscore the importance of being able to experimentally verify a lot of the things that you do using these computational techniques. So let's say we start with a gene that we're interested in from a particular organism. And we, as a first step here, we just do a sequence similarity search against all of the genes in a second organism. And what we might find is something that looks like this, a series of hits using some sort of sequence similarity search method that identifies a series of genes that are similar to the one that we started with. Now, I have the first one here marked in yellow. So we're going to just assume that based on some scoring method, this one is the most related to the red one that we started with. OK, so that's the first step. Now, where does the reciprocal part come in? Now what you do is you take this yellow sequence and do the sequence similarity search in the opposite direction. You go now take this and search it against all of the genes in organism one. And if you're lucky, what you'll find is that the best hit in both directions involve these two genes. So the red one is most related to the yellow one in organism two. Then when you use the yellow one as the search query, it's most related to the red one in organism one. OK, so that's the general idea. Now to the specifics. We have two broad types of alignment methods at our disposal. So the first one are called global sequence alignments. So what I want you to do is imagine that you have two full length sequences, either two genes or two protein sequences. And you now align them across their entire length. So we're going to force an alignment over the entire length of these sequences. So this works really well if you have sequences that are of similar length. But the problem now is that as these become more divergent, as they become more dissimilar to one another, the global alignment methods tend to miss important biological relationships. Because you're trying to force an alignment over the entire length of these sequences, it might mean that you're just not aligning important structural or functional domains. So from a biological point of view, a slightly different approach is better. And that's what are called the local sequence alignment. So here, we try to find the most similar regions in two sequences that are being aligned with one another. And the glorified term for this is paired subsequences. It just means that we're just looking for two regions within an entire sequence that align with one another. So any regions when we do these local sequence alignments outside the area of the local alignment, we just go ahead and exclude those. By doing this, we're aligning two sequences. We might find more than one local alignment that could be generated for any two sequences that we compare. So this works really well for sequences that share some similarity, but not high degrees of similarity or for sequences of different lengths. So again, I'll just emphasize that this is really the best way to compare sequences from a biological standpoint, especially if we're trying to deduce evolutionary relationships. So the approach of choice for biological discovery. OK, so let's say we've now aligned two sequences. We talked about the concept of similarity where we're going to try to look for a percent identity sum metric to allow us to quantify how good the alignment actually is. How do we actually deduce that number? How do we measure how good the alignment is between sequence A and sequence B? So to do that, we use something called scoring matrices. And these scoring matrices quite simply is our empirical weighting schemes that we use to assign these numbers, these numerical values. And what's in those scoring matrices is information concerning the physical, chemical, and biological characteristics of either nucleotides or amino acids. So they take into account things like side chain structure, their chemistry, the function of particular side chains. Based on what we know about how, especially on the amino acid side of the house, how these particular amino acids actually are involved in structure and function. So we know that things like cysteine and proline residues help to keep parts of a protein together. Tryptophan has a bulky side chain, so it can only be accommodated in certain parts of a protein. Things like lysines and arginines are positively charged. They're very important, particularly in the context of DNA binding proteins, because you need those positively charged residues to bind to the negatively charged backbone of the DNA. So in the construction of these scoring matrices, we try to take into account two major factors. First is conservation. Which of the residues can substitute for any other residue and not adversely affect the function of the protein? So we know that things like isoleucines and valines can substitute for one another in a protein sequence because they're both small and they're hydrophilic. Things like serines and threonines, they're both polar. So we often see that substitution. So the basic idea here is we want to just keep track of charge, size, hydrophobicity, and other physical chemical factors. Again, to try to see when we can't have an exact match between two sequences, what's the next best thing? The other factor is frequency. If we look at the entire constellation of proteins, all of the protein sequences available to us, how often do we see any one of those amino acids amongst the entire constellation of proteins? Some are more rare, some are more frequent, and that becomes important. Okay, so why should you actually care about any of this? Every single time you do any sort of sequence comparison, whether it's two sequences to one another A to B, or a sequence A to an entire database like GenBank, these scoring matrices are being used in the background. They appear in every single analysis involving sequence comparison. They also, as you'll come to hopefully appreciate, they implicitly represent evolutionary patterns. So the choice of matrix really can strongly influence what results come out of the methods that you apply to these alignments. So the default option may not always be the best choice, and so it becomes important for us to talk about how these scoring matrices are actually constructed, what guidelines go into putting them together. All right, so let's start on the easier side of the house. If we look at a scoring matrix that if we're doing a nucleotide-based comparison, we only have four nucleotides, A, T, G, and C, and if we align two sequences with one another, all we're gonna do is just apply a very simple match or mismatch scheme. So anytime you have a particular position in an alignment where the two nucleotides match one another, you're gonna assign a positive value. Anytime you have a mismatch between the two, you're going to subtract points. So in this case, the match gets you two points, the mismatch loses you three. The matrix, as you'll see here, you have the A, T, G, and C going across the top, the same residues going down the side, and just the way it works, that for every position in the matrix, let's say you have a T aligned with a T, you just look for the intersection, and you'll see that that's, in this case, it's an exact match, so that's the plus two. You'll see the plus twos all across the diagonal, representing the exact matches and the minus threes in the other positions. This assumes that each nucleotide occurs 25% of the time. So this works really well at the nucleotide level because nucleotides are not all that chemically dissimilar from one another, but as we've already alluded to, the amino acids are very different from one another, a much more complex picture, so we now go to something that looks like this. So this matrix is called a Blossom 62 matrix, and again, we have all of the amino acids going across the top, all of them going down the side. So again, I want you to imagine, now in this case, a protein sequence alignment, sequence A with sequence B, we're gonna look at a particular position and see how many points we can assign to a match at that position. So we're gonna imagine where we've got a tryptophanen, sequence one and a tryptophanen, sequence two, so we're gonna look for where the W intersects with the W, and that gives us in this case, 11 points. Now, if you look at the diagonal here, unlike the case we saw before, where no matter which nucleotides we were aligning and getting exact matches were, in the previous case, we were assigning the same value regardless. Here on the diagonal, you'll see that the values change. The 11 is the highest number in the diagonal, and that represents the frequency part of constructing the matrices. This is the most rare of the amino acids, so when you get an exact match, you wanna give it more points than you would if you had a more common amino acid aligning with each other. Okay, now, to the point of substitution. All right, so let's say instead of having an exact match, a tryptophanen for a tryptophanen, this time we've aligned a tryptophanen with a tyrosine, so instead of the W, W, we now have a W against the Y. We take a look here. We see now in this case, we would only give it two points, so it's still a positive score, which represents that it's a conservative substitution. Those two things can substitute for one another, and in most cases, not adversely affect the structure or function of that particular protein, but the lesser value here represents is just not as good as an exact match. Let's just take, as a final example, a mismatch, so let's say you've now aligned a cysteine with a glutamine, those two things are not known to be able to substitute for one another, so here we're going to subtract points, in this case, minus three. So the way the general rules work, just in summary, you get the most points for an exact match with the rare amino acid scoring more points than the common amino acids, you lose points for conservative substitutions and you lose points for mismatches. So the matrix we were just looking at was something called the Blossom 62 matrix. It is part of a family of matrices called the Blossom matrices. These were developed by the Hennikoff's at the Fred Hutchinson Cancer Research Center, and that's part of the good news, you don't have to derive these matrices yourself, you just need to know which ones to pick. So as far as the Blossom matrices go, what they were designed to do is look for differences in conserved, un-gapped regions of a protein families, which the Hennikoff's termed as blocks. These are directly calculated based on local alignment, so these are all based on observables, so using substitution probabilities, again the concept of conservation, and the overall frequency of amino acids. Because of the way these were constructed, using known protein families, they're really sensitive to detecting structural or functional substitutions, and they perform better than other matrix series that are out there, specifically the PAM matrices. The PAM matrices were developed way back in the distant past, as far as bioinformatics goes, at a time when very little sequence information was available, so we only had sequence data available for globular proteins. They were, the matrices were extrapolated from closely related proteins, and you'll often see them as an option for choice in phylogenetic programs, but I would recommend that you pick these Blossom matrices instead, because just bottom line, they really are useful in identifying both closely and distantly related sequences, so that's a big plus. All right, so now here's where evolution comes into play. When I showed you the Blossom matrix earlier, that was a Blossom 62 matrix, so there's always a number that follows the word Blossom, and what the N represents is the degree of similarity of the sequences that were used to construct the matrix, so here you've got, in the case of a Blossom 62 matrix, that means that the sequence is used to construct the matrix, shared no more than 62% identity. In the construction, what happens is that the contribution of sequences that are higher than the threshold are clustered and replaced by a single sequence that represents the cluster, so what's the reason for doing this? Last week, Eric showed you the phylogenetic range of a number of vertebrate species whose genomes had been sequenced, and while we have more and more whole genome sequences in hand from a wide variety of organisms, the genome sequences that we have available to us are not uniformly distributed, we have multiple sequences for some organism, so we really need to just remove that bias from the dataset, and that bias tends to be more towards closely related sequences. We wanna do this because we don't wanna miss anything that might be biologically relevant, so in practice, how does this work? Let's say we start with a protein family here, we've got one with six sequences in the group, any place you have an exact match, you'll see an asterisk at the top of the column. Let's say we're trying to construct a Blossom 80 matrix, so we're going to cluster together any of the sequences that share 80% or more similarity with any of the other sequences in the set, so in this case, we end up with three clusters, one that has two sequences, one that has three, and one that just stays by itself, it doesn't cluster with any of the other sequences. We then now collapse this down, representing these six sequences as three, so these three sequences are now represented by this single sequence, these two sequences are represented by this single sequence, and now we use this to construct the matrix. Okay, so what are the implications of this? When we do the clustering again, we reduce this contribution of closely related sequences, so we remove the bias that I mentioned earlier about substitutions that occur in the most closely related members of a family, but here's the more important thing, when you reduce N, the value of N, you're gonna get more distantly related sequences, if you increase N, you're gonna get more closely related sequences, so here's your first cheat sheet of the morning, put a big star next to this one, and as far as which one of these matrices to choose, the default, if you're going to do a blast search at NCBI, is gonna be Blossom 62, the reason it's the default is just through a lot of very controlled trials, it's been seen as being the most effective in finding all potential similarities, and you're gonna get percent similarities that go down as far as 30 to 40%. Okay, as we move up, the numbers here, as the numbers get higher, we're gonna get sequences that are more closely related, so by the time we get to Blossom 90, we're gonna have maybe shorter alignments, but they're gonna be much more similar to one another in the range of 70 to 90%. If we go in the opposite direction, we're gonna find more distantly related sequences, so longer alignments that might be weaker, and you can push it down to probably around 30%. Now, of course, the question goes back to which one to choose. If you know what you're looking for in advance, if you wanna find things that are highly similar, you're gonna push to the higher end of this range, if you wanna look for things that are more distantly related, you could go to the lower end of this range, but the method that I would strongly recommend you use is something that Steve Altschul, who is the person who developed BLAST, talked about in a paper called, he called it the three-matrix strategy, and so instead of doing a single search, instead, you're going to do three. You're gonna pick 62 as a starting point, pick one that's gonna give you more distantly related results, one that's going to give you more closely related results, and what you're gonna find is you're gonna get three different sets of results back. They'll be highly overlapping, but some of them will pop out certain sequences if the other ones won't. So again, I would strongly recommend you use this three-matrix strategy as you learn these techniques, as you become more comfortable, more familiar with them, but the bottom line to all of this that no single matrix is the complete answer for all sequence comparisons. All right, so one last thing we need to discuss before we get into talking about BLAST is gaps. So again, I want you to imagine sequence A and sequence B, and when we do these alignments, often you'll see we'll introduce gaps in order to improve the alignment, so we get these local alignments better. So again, use to improve the alignments between these two sequences. More often than not, these are used to compensate for insertions and deletions, and this is really important because the gaps, because we're representing insertions and deletions, they represent biological events, so you can't just keep putting these gaps in willy-nilly to improve your alignment because you're not actually reflecting biology by doing that. So you need to keep them to a reasonable number, again, to not reflect a biologically implausible scenario. And a good rule of thumb for this is about one gap per 20 residues, and it depends on the case, but that's a good starting point. Now, if you think again about your alignment, think about the scoring matrices, we've been comparing single positions, a letter in one sequence to a letter in another sequence, but now we have a situation where we have a letter in the one sequence and nothing in the other sequence. We have that gap there, so we can't score this simply as a match or a mismatch. So how do you score an amino acid against essentially nothing? The method we use to do this involves something called the affine gap penalty, and so what this does, all you need to know about this is that anytime you open a gap, you assess a penalty, anytime you extend the length of the gap, you assess an additional penalty. So this deduction is governed by a formula right here, G plus L times N, where G is the gap opening penalty. So as soon as you introduce the first position of a gap in a nucleotide alignment, you're gonna subtract five points. On a protein alignment, you're gonna subtract 11 points. As you make that gap longer, you apply the gap extension penalty. So the second term, L times N, so it's the penalty times the length of the gap. So every additional position you add to that gap in a nucleotide alignment, you're gonna subtract an additional two points. In a protein alignment, you're gonna subtract an additional one point. So let's say we have a gap of length two in a protein alignment. So G here is 11. If the gap is of length two, so L is one times two, the length of the gap. So 11 plus two is 13. You would subtract 13 points for a gap of length two. You can, as you'll see as we go through the method, you can actually change the gap penalties to make the introduction of gaps either more or less permissive. So if you want fewer gaps, you can adjust for that if you want to allow additional gaps, you can do that as well. All right, so with all of that, half an hour later, by way of introduction, we finally get to BLAST. It's finally time to talk about BLAST. And this is really the most important technique in bioinformatics. We're gonna hit this pretty hard today because I wanna make sure you know how to use it properly. Because there unfortunately are many cases in the literature where wrong biological conclusions have been made because of not interpreting BLAST results properly. Either the people doing the analysis didn't know how to use it or didn't really know what the statistics meant. The most important take home message I can give you today that applies to this and any of the other methodologies that you're gonna learn throughout the 14 weeks of this series is that just because something appears on your screen or on a piece of paper in the results does not automatically imply biological significance. So we're gonna talk about what does imply significance in a moment. Okay, so what is BLAST? BLAST, the basic local alignment search tool. What this does is it looks for high scoring segment pairs, HSP. So again, just a glorified term for just two sequences, a pair of sequences that can be aligned with one another. And when we do that alignment, we're going to try to make that alignment as long as possible in a statistically robust way. And I'll tell you about that in a moment. The score has to break a score threshold S before it's reported to you and the alignments can be either gapped or ungapped. Again, because we're looking for these local alignments within entire sequences, the results are not gonna be limited to just the best high scoring segment pair for the two sequences being aligned. And in fact, I'll show you an example that will produce two high scoring segment pairs. There are different flavors of BLAST and the three that are used most commonly are the first three on this slide. So BLAST N, N is for nucleotide. So we start with a nucleotide sequence and we compare that against a collection of nucleotide sequences. BLAST P, P is for protein. So again, the comparison is of a protein sequence to a database of protein sequences. BLAST X is a little bit different here. We're gonna translate the query into a protein sequence before we do the comparison. So yes, we start with a nucleotide sequence. We do the six-frame translation. We take the protein sequences arising from that and then compare them against a protein database. So this is actually a methodology that's much more sensitive than direct nucleotide searches. The last two are on the slide for sake of completeness. I can't remember the last time I ran one of these. The T in these both stand for translated and you can see for yourselves what the query and the target sequences actually are. But the comparisons here are all done at the nucleotide level. All right, so how does the method actually work? So we're gonna start with a sequence of interest. We have an amino acid sequence here, our query. And you'll see I have three letters in the middle of the sequence in red, the P, the Q and the G. And we're gonna start our search with this query word of length three. And what happens in reality is it starts at the one end at the end terminal end and it just keeps moving along the window moves along. So it starts with a GSQ in the SQS and moves along. But just for sake of symmetry on the slide we're going to start in the middle. What we wanna do is now compare this to all of the sequences in a protein database starting off just looking for other occurrences of the P, the Q and the G in that order. But we also wanna look for conservative substitutions because we might find some interesting biological relationships by doing that. So when we do that, we establish something called the neighborhood. So the neighborhood starts off with the exact match, the P, Q, G and that has a score. Or if you were gonna go back to your handouts to the slide that shows the Blossom 62 matrix and did and find the intersection of the P for the P, the Q for the Q and the G for the G you would find that the P would give you seven points, the Q would give you five and the G would give you six for a total of 18. What I've done here on the slide is I've just changed the middle position to show those conservative substitutions. So we've got the P, Q, G, E, G, R, G going down the slide and you'll see that because these matches are not exact at that position, as you would expect because now you know how the scoring matrices work, the score also goes down. And of course you would do this at all three of the positions. Now at some point you have to say, well maybe we're kinda going too far outside of the neighborhood so the method will automatically draw for you a line at what it calls the neighborhood score threshold. Blast computes this for you. In this case, T is 13 given the query word that we've started with. So anything above the line, everything from 13 and up we keep for the next step, everything from 12 and down we throw out. All right, so let's take our neighborhood, put it at the top of the next slide and here's the sequence now that we have that we started with. Here's our P, Q, G, our query word. You'll notice now I have it aligned with a sequence that was found, the subject. But in this case it's not a P, Q, G, it's the P, the M and the G and that's perfectly fine because if we look back at the top, P, M, G is in our neighborhood. Now just qualitatively you can see how good the alignment is by looking at this middle row. So any place you have an exact match you see the letter repeated. Any place you have a conservative substitution you'll see a plus sign. Any place you see a blank is just a mismatch. All right. But we wanna be able to apply numbers to this and we want to try to make this alignment as long as possible. So here we are, we've got this alignment starting at three but you'll see that there's many amino acids. Here the alignment is of length 40, you go as far as you can in both directions aligning as many of the characters as you can. So the question of course then becomes how far is too far? So here's our alignment again and here's just a graph of how the scoring is working out for this particular alignment. On the x-axis here we have the extension. How many positions have we aligned? On the y-axis is the cumulative score. So the scores that are coming from our scoring matrix as we just keep adding additional positions to this alignment. So if we start out here the extension is three. What we started with the query word we're above the threshold cutoff as you'll remember from the previous slide. So as we go in both directions we're gonna assume that we're having more matches than we do mismatches. More positive scores being assessed than negative scores being deducted. So the curve is gonna go upwards. At some point we're gonna break a threshold, the score threshold, okay? So this from here up these are the things that are all reported to you in your blast result and herein lies pitfall number one. Breaking the score threshold does not imply biological significance and I'll show you what does momentarily. We don't pay attention to the scores but the scores are used to calculate the number we are going to pay attention to shortly. Okay, so we're gonna keep going up. We're gonna keep assuming that we're matching more than we're mismatching but at some point we're gonna have gone too far. We're not gonna be able to align any more residues with one another. We start thinking about that matrix again and all those negative values in the matrix. So as we go out further we're gonna start subtracting points faster than we are actually adding them. At some point we're gonna leave the top of this curve and go down a certain amount, the significance delay decay excuse me that represents both the mismatches which assess negative points and the gap penalties which assess negative points. When you've reached the significance decay the method says okay I've gone far enough I'm gonna go back to the peak of this curve that's gonna represent the longest alignment I can put together, the length of my high scoring segment pair. So once we've done that we're gonna have to think about doing a probability calculation. So again why do we use the probabilities instead of the scores? Now that you know about the scoring matrices the answer will hopefully be easier to understand. I want you to imagine two different alignments of equal length. One is composed primarily of common residues one is composed primarily of rare residues. Okay and the composition of those alignments are going to be reflected in the score. Even if the number of positions in those two alignments that match one another is exactly the same you're going to end up with two very different scores because of the frequency information that's captured within the scoring matrices. So a better measure is needed to be able to compare one alignment to the next any two alignments to one another and that again is where the probability comes in. So this is the equation the Carlin-Ultral equation that actually governs that calculation. You'll see that in the exponent here is the S so that is the normalized score and the other factors that go into the calculation of this score. Again you don't have to do this the blast method will do it for you but what that value E represents is the number of high scoring segment pairs found purely by chance. So this is really the number of false positives that you have come out of your blast search. We want that value of E to be as low as possible. Lower values signify higher similarity. So rule number one, again put a big star on this one so what we're gonna look for as far as probability values if we're comparing two nucleotide sequences to one another we want that E value to be 10 to the minus six or lower. If we're looking at protein sequences we want that value to be E to the 10 to the minus third or lower and again keep in mind that when we talk about protein similarity searches that means both blast P and blast X remember the blast X we start with a nucleotide sequence but the comparisons done at the protein level so you're gonna use the protein statistics here. All right so let me take you through actually how you're going to do this when you get back to your offices and to your labs so we're gonna pretend we're at the computer you have the screen captures in your handout you might wanna look at the PDFs on your monitors back at your desk because they'll be a little bit easier to read but the most important thing to do is practice and just take some sequences that are of interest to you and the biological questions you're trying to answer and use the methodologies that I'm gonna show you today. All right so we're gonna start off at NCBI's homepage which there is the URL NCBI.nlm.nih.gov the most popular things that are on this website are shown in the right sidebar and there is a direct link to blast there so we're gonna go ahead and follow that link bringing us to the blast homepage or you could just go directly by typing in the full URL. So let me quickly orient you to this page at the top of the page you can just go directly to full genomes and search each one of those genomes individually. The middle of the page which is where the action is today is where all the basic blast searches are and at the bottom are a series of specialized blast searches which we'll talk about momentarily. So we're gonna start off doing a protein based search so protein blast the link is right there and if we go ahead and click that it will take us to the blast search page but before we do that we need a sequence that we actually are gonna use as the query. So what I put together for all of you is a web page here is the URL that has the query sequences for all of the examples that I'll be showing you both this week and next week. So this is a great way for you to go back to lab and do a little bit of practice on your own reproducing what we're doing in the lecture this morning. Okay so we're gonna take that sequence that first sequence from the page I just showed you and just stick it right into the box. Now you can use the entire length of that sequence that's great if you're only interested in part of that sequence you can actually specify the beginning and ending amino acids you want to consider the start position and the stop position. Okay so the first thing we have to do is pick what do we want to compare this sequence to? What database do we wanna look at? So we have a bunch of choices available to us and I wanna talk about one in particular that's called RefSeq. So what RefSeq is is an effort to provide a single reference sequence for each molecule of the central dogmas so to have every DNA sequence, every mRNA sequence and every protein sequence represented once and only once. And the reason for doing this is it really became obvious that we needed to have a resource like this when a lot of us were doing these glass searches and you might take a gene of interest you'd get back maybe two, five, 10, 20 exact matches and you just don't know by looking at the blast results which one is the right one, the canonical entry for that particular gene or for that particular protein. So this is a curated effort that helps address that problem. So by virtue of the fact that each of these sequences are represented once and only once, of course the database is non-redundant. It's constantly updated by curators to reflect the current knowledge of sequence data in biology and that's a great thing that these records keep up to date with what we know functionally about these nucleic acid sequences or these protein sequences. In the feature tables, the part of these entries that captures these biological attributes you'll find information about the gene, the gene transcript or the protein. They encompass a wide taxonomic range but primarily focus on both mammalian and human species. And again, just to re-emphasize the ongoing updates and curations and you'll see a timestamp showing you when that record was last brought up to date. Now, the way to know what a sequence, whether a sequence is a RefSeq sequence is by looking at its accession number. So anytime I use the term accession number, that is a unique identifier that if you type that in, it would produce back that one sequence unequivocally. So it's basically that sequence's social security number. And there's two series in the RefSeq accession numbers that I'm gonna show you. The first ones, they start with an N and these come from curation of GenBank entries. So the Ns, let's say we start with an NT, that's a genomic contigue. It may have come from a whole genome sequencing project or some other high throughput sequencing effort at the nucleotide level. The NM, M stands for mRNAs, the NPs, P stands for proteins and the Rs stand for the non-coding transcripts. So what these all have in common is that these have all been experimentally verified. Somebody has had these molecules in their hands and have actually directly sequenced them. Now, of course, we crank out sequences at a breakneck pace nowadays and so we don't always have the luxury of having these experimentally verified sequences. So what we do is often apply computational techniques to predict where the mRNAs are and where the proteins are. So this is where this X series come from. So let's say we start with a genomic contigue, these NT entries and we then take those, apply the predictive methods to figure out where the genes are, so gene structure, introns, exons and so on to figure out where we could come up with model mRNAs and then we use those in turn to come up with the model proteins, the XPs. So please keep in mind that again, these are experimentally verified, these come from genome annotation, these are computational predictions. So you have to look at them not necessarily skeptically because the predictive methods are so good nowadays but you just have to keep in the back of your mind that they haven't been experimentally verified. So whenever you have a choice, you're going to go look at the ones in the N series instead. One of the other choices that was on the pull-down menu was Swiss Pro, similar idea and we'll talk about that in greater detail next week. So I like to use RefSeq just because with this non-redundancy, you just get a neater and shorter hit list from your results usually. Okay, so let's go back to the screen here. So now we go a little bit further down. I, what I've done, excuse me for the purpose of the example, I've just left the database at the default which is the non-redundant protein sequence set. So that's every sequence that's in GenBank. As far as the organisms go, you can just leave it blank and search for everything but you can limit by organism or by taxonomic groups so you can put in the word human, you could put in the word bilaterian, you can pick particular classes if you're looking at a specific point in evolutionary time. Okay, as we go down, program selection. So BlastP is already chosen for us but you'll see that there's a whole bunch of other variations of BlastP and we're gonna talk about those in great detail next week. All right, as we get to the bottom of the screen, you'll see that there is a link here that says Algorithm Parameters and if you click on that, it's gonna expand the screen and what we're gonna see looks like this. So this is where we get to fine tune what Blast actually does. First one asks us, well, maximum number of target sequences. How many sequences do you want to see in your results? The default is 100, I always hike this up as far as possible so we're gonna put that at 250 to make sure that we don't miss any statistically significant hits. We go down, word size is three so you already know how that works. If you want three to, if you wanna just use a default, that's fine. If you wanna push more towards exact matches, you're gonna make the word size longer. The E-value threshold right here, it says it's 10. Remember that I've told you that we want to use an E-value of 10 to the minus third for our cutoff. So I'm gonna just go ahead and leave it at that value for purposes of the example but we're gonna apply the cutoff when we see the results. Scoring parameters, we can choose our matrices. Here's your choices. Again, all of the Blossom matrices, remember what the numbers actually mean but you also have a couple of PAM matrices available to you as well. Okay, finally, when we get to the bottom, a couple of new things to talk about. So filters and masking. So the first one says filter, low complexity regions. We're gonna go check that off. What is a low complexity region? So these are regions of bias composition. So you might have an amino acid sequence or a nucleotide sequence where on the nucleotide side, you might have a string of A's, a very long string of A's or on the protein side, a single amino acid repeated over and over and over again. So these homopolymeric runs, you might see short period repeats or the subtle overrepresentation of several residues. The reason it's important to mask these is it actually confuses blast. Blast relies on having uniformly distributed amino acid frequencies. So when it hits these homopolymeric runs, it often leads to false positives because if you think about the alignments and you've got these long runs of the same letter and over and over and over again, sometimes it's not exactly clear how to actually align that part of the two sequences with one another. So not enabled by default, you should just always do that. Okay, finally, the last thing, just as a matter of convenience, I always check off the box at the bottom that says show results in a new window. So once we click the blast button down here at the bottom, either a new tab or a new window will open. Why that's nice to do is if you wanna come back and change the parameters, you can just go back to this window and do that rather than having to keep clicking the back key on your browser. All right, so we'll go ahead, submit this. Here's what we get back. So here we've started with our query sequence. This is represented in this case by the gray bar here. Right under the gray bar, we'll see that there's a block here and that represents a protein family that was identified. So a particular domain architecture. Again, we're gonna talk about that much more next week. You'll recall that the default for how many sequences to return, how many hits to return was 100. In this case, if we look here, it says that 225 blast hits were found. So it's good that we put our value there up to 250 because we just wouldn't have seen all of the results that blast could find for us. Okay, let's push this overview to the top here. So this is a graphic overview of our results. So here are sequence, our query sequence is shown as the red bar and you see just a bunch of reds and purples here. So what does this mean? All of the colors, the color key is at the top. Remember that the colors correspond to the scores not to the probabilities. These are shown, so there's a color key. These are shown in descending score order. So the best score is at the top, the worst score is at the bottom. If you mouse over any of these lines, there's a window here at the top that will tell you what the found sequence was, what that particular line corresponds to, what sequence it corresponds to. You may see these lines in between, so what the gaps will either represent gaps or that you have more than one high-scoring segment pair for the query sequence versus that particular sequence. If there, and you'll see that either as a gap or you might just see this tick mark here, which means the gap is actually in the query sequence rather than in the found sequence. Finally, you might see lines that have these bars on them, but there's nothing connecting them. Those are just unrelated hits, just done that way to save space. Okay, so let's go ahead to see what we actually got. Let's scroll down, take a look at the hit list, see what we can learn. So let me orient you to the results first. You have a description of each one of the sequences that was found that was deemed somehow related to our query sequence. Each one of them has a short description called the definition line or the def line, just a quick one line description of what that sequence actually is or does. You'll see that there are a couple of columns here that give us the score values. We have a query coverage, so we'll see that in a little bit more detail, but the important column here is the E value. So I sorted this by E value. Next to that you'll see the percent identity values and then these are the accession numbers. Again, the unique identifier is the social security numbers for each one of these found sequences. All right, so if we look at the very first one, we see the E value is zero. So what the zero tells you is just because of space, there's not enough room to show the number. So whenever you see zero, it's 10 to the minus 1,000 or less, so you don't even have to think about those. Those are great hits. We look a little bit further down. Here we have a case three E minus 179, so that just means three times 10 to the minus 179. So it's just showing the exponential notation there. I'm going to scroll down to the bottom and remember that this is now a protein-protein search, so the E value cutoff that I'm going to apply is 10 to the minus third. Now in this particular case, again here's the E value column, in this particular case the one at the bottom of the list actually beats the threshold quite handily. It's 10 to the minus 53. So we're going to go ahead and just draw the line and accept all of these values for now. If we saw any that were below that 10 to the minus third threshold, we would go ahead and reject them, but I want to provide a caution here because you might have a case of something that's just below the line that actually is biologically relevant. So this is where you have to put your biology hat on. You've read tons of papers, you've done your own research, you know something about the system that you're looking at, and you may have biological evidence that argues in favor for something being slightly below the line as being related. So anytime you have biological evidence that always, always trumps computational predictions. So the guidelines that I give you here, these cutoffs are starting points, but again, let your knowledge of biology be your guide here. It will never serve you wrong. Okay, so let's take a look at one of the results that came out. And so here we have the classic way of representing a BLAST search so where we can see the alignments. So at the top, we have some scoring information, which we'll come back to in a moment. We can see where this sequence aligns. So in our query sequence, we're starting at position 17. The first line goes from 17 to 76, and you can see the line numbers, excuse me, the position numbers going all the way through this alignment. So the first thing we wanna do is take a look at these values that we have calculated to give us yet another way of determining whether or not these are actually related to one another. So first we have a percent identity number, and in this case, they're 100% identical to one another. The numbers you wanna look at, this is the second criteria, so again, put a star here. If we're doing a protein-based search, we wanna see percent identity greater than or equal to 25%. If we are doing a nucleotide-based search, we wanna see a percent identity greater than or equal to 70%. Right next to that, you'll see where it says positives. In this case, it's still 100%, so what that is is really a percent similarity. It includes conservative substitution. So again, identity means identity, positives means the identities plus the conservative substitutions. In this particular alignment, no gaps, so we don't have to worry about that. But what you will see is that there are some regions of our query sequence that appear a little bit grayed out in lower case letters. Those are the low complexity regions, the regions of biased composition where you have the subtle overrepresentation of a couple of amino acids. Here's a great case where you can just see over and over again, just short period repeats or the same amino acid coming over and over again. Okay, so that is one high-scoring segment pair that came from the comparison of our query sequence to this particular protein, this Prospero isoform. But if we scroll down a little bit more, we'll see something that says range two. So we started with our two sequences and two different high-scoring segment pairs were found. So two different parts of the alignments led to these local alignments. So again, remember the difference between global and local alignments. So instead of forcing the alignment across the entire sequence, we're just trying to find those local regions of similarity in this case, there are two. So we have this second high-scoring segment pair detected. We're gonna look at our identity information again. So 95% identical, the starting position for that particular part of the alignment. Here we do have a gap and that's just represented by the dashes in this case in the subject sequence. Okay, so let's put all of this together. We go back to this graphic. Here are the two scoring lines from the two alignments that I just showed you. So let's think about our cutoff. So first, we wanna make sure that the E value is 10 to the minus third or better. So yep, we've got that. So zeros in both cases. We look at the percent identities in protein base. So we want that to be 25% or better. We have that as well. So 100% for the first one, 95 for the second. And if we just look at where these correspond to the overview that we had, the first high-scoring segment pair goes from position 17 to 704, the second one from 777 to 1403 with a small gap between them. So the vast majority of this sequence is alignable but there's a region in the middle that you can't align the one you started with with the one that you found. All right, so your second cheat sheet for the morning. Again, keep this handy. If we're talking in terms of E value at the nucleotide level 10 to the minus six or smaller, protein level 10 to the minus third or smaller. As far as sequence identities go for nucleotides, 70% or more, proteins 25% or more. Again, don't use these cutoffs blindly. I give this to you as a starting point as you start to learn these techniques and start to get a better sense for how they work. But please, please do pay attention to the alignments that are on either side of that dividing line and never ever ignore what you know about the biology of those sequences that always is the most important thing to factor into these analyses. Okay, so let's take a little bit of a simpler approach here. Let's say instead of searching a database, you have two sequences and you just wanna see how those two sequences align with one another. They're predetermined sequences. So just like a blast search, we're gonna find these local alignments, either nucleotide sequences or protein sequences, all of the blast algorithms are available to you. Again, you can select either Blossom or PAM as the matrix for these comparisons. Same affine gap costs can be applied here. You can adjust them. The input sequences again can be masked. So we come back to the blast home page. In this case, the link to get to this is in the lower section in the specialized blast section, align two or more sequences using blast, blast two sequences. When we click on that link, brings us to this page, looks very similar to the one we saw before for the blast P search. We're gonna look at these tabs at the top first, make sure that we've clicked on blast P in this case to see the search page that's relevant to a protein-based search. If you go to the example sequences page that's up on the current topics website, you're gonna just take the two sequences from those sample sequences, paste one in the first box, paste the second one in the second box. As before, we can specify query sub-ranges or subject sub-ranges if we don't want to use the entire length of these sequences. Once again, at the bottom, the algorithm parameters link, we're gonna go ahead and expand the page now by clicking on that. We'll now see all of the stuff here at the bottom. So the general parameters here as before, how many sequences do you want returned? Well, we already know we've got two, so we can just leave that alone. The E values, the word size as before. Here, I actually don't change anything because you already know that you've got sequence A and sequence B, so you're just gonna, by inspection, not have to worry about any of these things. You're just trying to align the two sequences. But what will come out is the statistical measure of how well those two actually align with one another. As before, you can choose of a series of matrices. We can mask out the low complexity regions, which you should always do by default. Again, we're gonna check off the little box at the bottom where we'll show our results in a new window and then just go ahead and click the big last button at the bottom of the page. So what we get back is reminiscent of what we just saw. We have our overview of the results. Here's our query sequence again as the red line. We only searched it against one sequence, so there is the one sequence. Part of it aligned well, part of it a little less well. Let's scroll down a little bit to see what those results look like. So in our data table up here, again, one and not only one hit, we see the E value and the percent identity, a greater amount of information shown in the lower part here. So here we have our first part of the alignment, the longer part, the percent identities, 46%. So that beats our percent identity threshold for proteins, the expect value, 10 to the minus 26, beats our 10 to the minus third threshold. And again, you can just see what part of the sequence is aligned. So in this case, the query from positions 95 to 178. The second part, that little part at the end terminal end, if we look at the stats here, we're just, it's a very short alignment, as you can see here, it's only six amino acids long. The E value is almost two, so we're just gonna go ahead and reject that just on the basis of that one number alone. We can also look at this in terms of a dot matrix view. This is a traditional way of looking at this. Sequence one is represented on the x-axis, so positions one to 178, sequence two on the y positions one to 134. We saw that there were two high-scoring segment pairs, or at least two alignments, one of them which we actually accepted. Here is that shorter alignment, so it just shows you graphically which parts of the sequences were aligned with one another. It's a little bit more intuitive to look at this, but I show you this just for the sake of completeness. All right, so we've now spent a lot of time talking about things at the protein level, but of course there's a lot we can learn by looking at the same types of information at the nucleotide level. So we're gonna talk about a couple of methods here. All of the nucleotide blast algorithms can be found under the basic blast section by clicking on the nucleotide blast button. And when you get to the program selection part of the search page, you'll see that there are three options available to you. Megablast, discontinuous megablast, and blast-n. So what are those doing for you? What do they actually do? The first one, megablast, is the one you would pick if you were searching against complete genomes or chromosomes. So this is really optimized for aligning really, really long or highly similar sequences, things 95% similar or more. You'll see the word size because we're looking for things that are almost exact, is very high at 28. The affine gap, excuse me, this is the scoring matrix. So for a match, you get one point. For a mismatch, you lose two. This is a linear gap penalty. So unlike the affine gap penalty, just a value is assigned the same value for opening the gap and for extending the gap. Okay. The middle block here talks about discontinuous metal, megablast, and this is what you'll be doing most of the time at the nucleotide level. The results between these two are very similar but megablast will run faster in relative terms. So here we've got discontinuous megablast. Here is blast and our standard, we've got a nucleotide search that we're comparing against a whole slew of nucleotide sequence or a nucleotide database and you'll see that the values are all the same here. Finally, in the lower block, if you've got things that are very short, nearly exact matches, again you can use blast and to do it here but rather than using the defaults, you can just go ahead and change the values here to push in that direction. All right, so enough about blast. Let's talk about something else that you can put into your armamentarium and this is a method called blast which is the blast-like alignment tool. This is a method that was developed at the University of California, Santa Cruz and it's very similar to megablast in that it was designed to rapidly align really long nucleotide sequences so sequences at least 40 in length having a 95% sequence similarity or more. It get reliably confined to exact matches down to a length of 33 but this is really the method of choice when you're looking for exact matches in nucleotide databases. The other thing that's very impressive about it is its speed, it's 500 times faster than blast for most mRNA or DNA searches. The downside, it might miss divergent or shorter sequence alignments because it's not optimized to do that, it's optimized to find these long stretches of similarity. You can use it on protein sequences but if you're looking at protein sequences you should really be using blast P instead. So when do you use this? You've now done a sequencing effort, you've got a brand new gene that you're trying to characterize or a sequence fragment that you're trying to get some information on so you'll be able to find its genomic coordinates to determine gene structure so where the introns or exons might be or find markers of interest in the vicinity of that sequence. You can also use it to find highly similar or identical sequences so if you have a ton of mRNA data you can align those mRNA sequences onto a genome assembly. You can start to identify additional gene family members or you can start to in cross species alignments to identify punitive homologs or going back to the beginning. The other thing that's nice about it is you can take a sequence and show it as a separate track within the UCSC genome browser. So this is a new concept just thinking about a track. We'll see an example of that shortly and Dr. Wilsberg will be spending quite a bit of time discussing the use of the UCSC browser when she presents her lecture in two weeks time. So let's just go ahead and go to the blast, excuse me, the black home page at genome.ucse.edu. The link to black is in the left sidebar so if we just go ahead and click on that brings us to a very simple search page. Across the top you have a small number of selections you can make so you can pick which genome you wanna search against. In this case I'm starting with a rat sequence so we're gonna search against the rat genome. The query type here is DNA and then the other parameters are predefined for you. Then just go ahead and click on submit and we're gonna get back a very easy to understand results table. So across the top we have a number of sequences that were found based on our query. You'll see a score for each one of them so these are in reverse score order the highest one at the top starting and ending positions for each one of those alignments and a percent identity. We're gonna go ahead and click on the browser link for the very first one the best hit in the group just so we can see what this now looks like in the context of the UCSC browser. So we see where our chromosomal position is here by this red bar on this chromosome on chromosome five of the rat genome. Our sequence is the one that's labeled quite self-explanatory your sequence from Blatt search. So there is the sequence and you'll notice some discontinuity in the end so it's not a perfect alignment across the entire length. There's some gaps in there and there are also some regions that are marked in red along the way. So the reds are mismatches. You'll see the color key here at the bottom so red genome and query sequence have different bases at this position and you can read the rest of them for yourselves the other colors are in the key. What we also see here are other tracks. So each one of these lines is a track. So we have information on rat mRNAs from GenBank, rat ESTs, the express sequence tags some conservation information so looking at data from mouse and human and dog and opossum and so on. So it's a really powerful way of not just finding what the chromosomal context of your gene of interest is or your sequence of interest is but a lot of other information about related sequences other types of data that go well beyond just the simple idea of alignment. And again, Tira's gonna spend a lot of time talking to you about this during her lecture in a couple of weeks. All right, if we go back to our search page we have another set of links here called details so we'll click on the details link for our best hit. And so here what we have, the top part of this is our query sequence, the one that we started with from our sample set of sequences on the current topics sequence page. This is the sequence that was found. You'll see that there are things in different colors in uppercase and lowercase, the key is at the top. So anytime you have a matching base in CDNA and genomic sequences, those are blue and capitalized, the things that are in light blue are the boundaries of gaps in either sequences and those are often splice sites. So it's a view, it's not a particularly informative view. The view that I think is much easier to look at is the one that's reminiscent of the blast view where we just have the two sequences aligned with one another. You can see the positions of the gaps along the way and get a good sense for how aligned these two sequences are with one another. Finally, I just want to talk about one last sequence alignment program and that's called Fast A and so what Fast A does is identifies regions of local alignment much like our blast method does but it uses a different algorithm to use it. It uses something called a Smith-Waterman algorithm to find the best alignment between the two sequences. So the methodology that underlives this technique is significantly different than the one that is used for blast. Here are the URLs, if you'd like to check that out and play with it a little bit. It's slower than blast but it's more stringent in certain ways so I think one way to include this in your armamentarium is if you think back to our blast results we would draw a line at our E-value cutoff. You can use it as a second method to rule in or rule out things around the cutoff. So the same thing way that in the lab you might use two different experimental techniques to verify a result. Here you can use two different computational techniques to verify your computational results. Okay, so with that, that's where we're gonna leave things today. We're gonna build upon these basic concepts next week when we start to discuss how we can go well beyond just looking at letter by letter alignments to find commonality between sequences and we're gonna end up talking about how we can use multiple sequence alignments and apply those to your own work. So thank you once again for joining us today and we'll see you again next week.