Okay, good morning everyone, and thanks for coming this morning. A few housekeeping things before we get into today's lecture. For those of you who are joining us for the first time, the first-day handout is in the back of the hall, and this gives you some information on additional resources that are available for this course, including the course's website, how to sign up for the mailing list so that you get notifications about any changes in the lecture schedule, and how to earn CME credits. And if we can throw the slides up please, thank you. Since I mentioned CME, we have to go through the formality of me disclosing that I don't have any financial interest in anything I'm going to talk about today. If you're a physician interested in earning CME credits, please make sure to sign in each and every time you come into the hall or join us at one of the teleconferencing facilities. Finally, before we get into today's lecture, a program note. I want to bring to your attention a lecture later today at one o'clock in the Masur Auditorium by Carol Greider. This is the seventh annual Jeff Trent Lecture in Cancer Research, and it will be given by Dr. Greider who, as most of you may remember, won the 2009 Nobel Prize in Physiology or Medicine for her work on understanding telomerase and the function of telomeres. So I would very much encourage you, if you have time today, to join us over in Masur in a couple of hours. All right, so with that, in the same way that Eric last week laid the groundwork for the upcoming lectures in genomics and gave you a very comprehensive background of the field of genomics up to the current day, my job is to try to provide you that same sort of solid foundation in the field of bioinformatics. And we mentioned last week that genomics and bioinformatics really do go hand in hand with one another.
So this diagram, which Darryl Leja in our institute designed to accompany our vision document a few years back, I think nicely illustrates the point about how the Human Genome Project forms the foundation of this very Frank Lloyd Wright-esque house. We can learn a lot of things about biology by looking at genomics. We can translate that into questions of improving human health and eventually move that up to questions in society as far as health disparities and similar issues go. You'll notice this house, even though it's built on the Human Genome Project as its foundation, has a number of pillars in it. These are cross-cutting technologies that are important to all three floors of this house, and you'll notice right here we have computational biology. So I hope that helps to illustrate the really inextricable nature of genomics and bioinformatics. And so again, my job over the next two weeks is to give you as gentle an introduction as I can to the field of bioinformatics, to give you enough detail so you understand the techniques without going into a ridiculous amount of theory. And you're going to see the things that we're going to talk about over the next two weeks repeated over and over again in subsequent lectures. So it's really important that you have some sense of the techniques that underlie what you're going to be hearing about in this course and not treat these techniques as a black box, as it's very easy to do when you go to a website, you see a box, stick in your sequence, click Go, easy as that. But unfortunately, that will many, many times lead to pitfalls. And hopefully I'll be able to point some of those out to you as we go along. So we've got a lot of ground to cover today. We'll probably go right up until 11:30. But hopefully you'll find this background useful and start to think about ways that you can incorporate these techniques into your own work. So just an overview of what we're going to do today.
We're going to spend some time just defining some of the terms. We're going to reinforce the importance of alignments in bioinformatics and the kinds of things that we can discover by doing these alignments. These are the most powerful approaches we have to studying sequence data. We're going to talk a little bit about how we actually evaluate those alignments, the scoring methods that underlie those alignments. But we're going to spend the vast majority of our time talking about two major techniques, BLAST and BLAT. And we're going to then take this information and build upon it next week to talk about things like profiles, patterns, motifs, and domains, to think about things away from the sequence level and a little bit more at the structural level. We'll actually talk about a visualization tool and then end up next week talking about how to construct multiple sequence alignments. So, all right, I've alluded to the fact that alignments are very important to bioinformatics. So why do we bother to do these things? What do we stand to learn from doing these sequence alignments? At base level, what they do is obviously give us some sort of a measure of relatedness between two sequences, whether they're two nucleotide sequences or two amino acid sequences. And when we do these alignments, it starts to give us some basic information about the relationship between those two sequences. We can start making inferences about commonality in structure, in function, or the evolutionary relationship between those two sequences. Now, in order to do this properly, we're gonna have to spend a little bit of time talking about the terminology. A lot of times the terminology is misused, so we'll take a little bit of time just to go through the definitions. So I want you to imagine that we have two sequences, and we've aligned those two sequences with one another.
And the easiest measure we have to assess how good that alignment is between those two sequences is to simply count up how many characters, either nucleotides or amino acids, match between those two sequences. So the quantitative measure that we have is something called similarity. It's always based on an observable, so in this case, how many positions have we been able to align? We usually express this as a percent identity, how many matches and how many mismatches we have between the two sequences. Doing this allows us to start to look at changes that happen between these two sequences as they diverge. So what are the net effects of substitutions, insertions, or deletions? More importantly, it allows us to start to identify what residues are critical for maintaining structure or function. Where is there evolutionary pressure to not change residues over time? Now, whenever we see high degrees of sequence similarity, high percent identities, it might imply to you one of two things: either a common evolutionary history or some possible commonality in biological function. And this becomes important if you have just sequenced a new gene in the lab. You have a protein translation for that gene, but you haven't been able to assign a function to that particular protein. This is the key way you have to start to make some guesses about what that novel protein of yours actually does. All right, so that's similarity. Now, the other term which is often used, and more importantly often misused, is homology. And homology is a conclusion based on the alignments, a conclusion that is based on the percent identities that we see when we do a pairwise sequence alignment. So genes are either homologous or not homologous. This is not measured in degrees. This does not take a number. So this is a conclusion that implies some sort of evolutionary relationship.
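To make the similarity measure concrete, here is a minimal sketch of a percent identity calculation on two sequences that have already been aligned. The function name, the two short sequences, and the choice to skip gapped columns are all assumptions made for illustration; different tools use different denominators, so treat this as one reasonable convention rather than the definition.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are left out of the count
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments, not taken from the lecture.
seq_a = "MKTAYIAKQR"
seq_b = "MKTGYIAK-R"
print(f"{percent_identity(seq_a, seq_b):.1f}% identity")
```

Note that the result is a number, which is exactly what similarity is; whether the two sequences are homologous is a separate conclusion you draw from it.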
To make sure you don't misuse this term in the future, I like to show this quote from Walter Fitch, who was one of the forefathers of bioinformatics. And I'll just read this to you: "It is worth repeating here that homology, like pregnancy, is indivisible. You either are homologous (pregnant) or you are not." Okay? And so if you keep that in the back of your mind, hopefully you'll keep straight which term applies to which. So again, similarity does take a number, homology does not. All right, now just to further complicate things, there are different kinds of homologs that we need to discuss. So that term, homolog, could apply to two different types of relationships. If we're talking about two genes that have been separated by the event of speciation, those things are called orthologs. They are related by orthology. If they are two genes that have been separated by virtue of a gene duplication event, those things are called paralogs. So let's just delve into that a little bit more. So again, what is an ortholog? These are two sequences that are direct descendants of a sequence in a common ancestor. And by virtue of that common evolutionary history, they more than likely have similar domain structures, probably have a common three-dimensional structure, and more importantly, serve the same or similar biological functions. Paralogs, on the other hand, are related through a gene duplication event. And this is what I like to say gives you some sort of insight into evolutionary innovation. What has Mother Nature been able to do to take a preexisting gene and adapt it toward a new function? So I know these terms are not intuitive, so let's look at a picture, which might clear things up a little bit. So here we have a gene alpha, and let's say that over evolutionary time, gene alpha has given rise to three other genes numbered one, two, and three.
So each one of those, one to two, two to three, and one to three, those are all orthologous to one another. But now let's say there's been a gene duplication event where we have another copy of alpha. We're going to call it beta, and it could be an exact copy or a near-exact copy. But maybe over time, this has evolved into another set of genes along a slightly different path than this one did. So you might end up with three genes numbered four, five, and six. So the relationship of four to five, five to six, and four to six, those are all orthologous to one another. But if you take any one of these three, one, two, or three, the relationship of these three to these three, those are paralogous to one another. So again, genes one to three are orthologous, four to six are orthologous, but any pair of an alpha gene and a beta gene with respect to one another are paralogs that are related through a gene duplication event. Now truth be told, these are terms that you should be familiar with, but unless you're doing a lot of phylogenetic analysis, you probably won't be using them in your everyday work. It's sufficient to know that they're homologous to one another, but if you're starting to look at things having to do with phylogenetic analysis, you need to understand what these terms represent. Okay. All right, so let's move now to a discussion of alignments. And so what I want you to imagine now, again, is two sequences that we've aligned to one another, sequence A and sequence B. And in something called a global sequence alignment, what we're trying to do, as the name implies, is align all of sequence A with all of sequence B. We're trying to force an alignment along the entire length of those two proteins or those two genes. Now, as you can imagine, because you're trying to force an alignment over the entire length of these two sequences, this tends to work better when you've got two sequences of approximately the same length.
The problem with these kinds of methods, though, is that as the amount of sequence similarity goes down, it gets harder and harder to align these sequences across their entire length. So you find that you can't force that alignment, but more importantly, you start missing commonalities in those two sequences, because you're trying to force this alignment across the entire length of the sequences and you're not necessarily keying in on important structural or functional domains that might be present in both of them. Maybe you've got a protein that has an important domain at the N-terminal end, and protein B has the same domain at the C-terminal end. There's no way to align those two if you're trying to align the whole thing. So from a biological point of view, this would be useful in limited circumstances, but what we tend to rely on more are local sequence alignments. In these types of alignments, we try to find the most similar regions in the two sequences being aligned. The glorified term here is paired subsequences. Really, all that means is two domains or two regions that align with one another in the two sequences under consideration. They don't have to be in the same part of the protein. They can be anywhere in the two proteins. So because of that, you might get more than one alignment for any two sequences being compared. But again, we're now just looking for regions of commonality between the two sequences under consideration. So from a biological standpoint, this is really good because it allows us to find those regions of interest. And this is really the method of choice as far as biological discovery goes. Okay, so let's say we have now employed some sort of local alignment method, and BLAST is one of those local alignment methods. How do we actually score the alignment?
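Before getting to scoring, it may help to see the local alignment idea as code. The sketch below uses the Smith-Waterman dynamic programming method, which is an exact local alignment algorithm and not the heuristic that BLAST itself uses; it is swapped in here only because it is short. The match and mismatch values, the linear gap cost of -5, and the two test sequences are assumptions for illustration.

```python
def local_alignment(a, b, match=2, mismatch=-3, gap=-5):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below zero, which is what makes
    # the alignment local rather than forced across the whole length.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing one region that sits at different offsets.
print(local_alignment("TTTTACGTACGTAA", "GGACGTACGTCC"))
```

The shared region is found even though the flanking sequence is unrelated, which is the behavior a global method would struggle with.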
So earlier on, I mentioned we could just count how many positions are in common between the two sequences, but there are better ways to actually come up with a statistical measure of how good the alignment is. And this depends on something called scoring matrices. Again, these are gonna allow us to say how good the alignment is between sequence A and sequence B. These scoring matrices are empirical weighting schemes that are intended to represent everything that is known about the physicochemical and the biological characteristics of both nucleotides and amino acids. So in constructing these scoring matrices that you'll see in a moment, we take into account things like side chain structure and chemistry and the function of those side chains, which of course are very, very different when we look at the side chains of all 20 amino acids. So for example, we would take into account things like cysteine and proline residues, which are obviously important for structure and function. The cysteine residues are involved in the cysteine cross-bridges. Prolines tend to demarcate the end of alpha-helical runs, so those are important for structure as well. Tryptophans have bulky side chains, and very few things can substitute for tryptophans. Lysines and arginines both have positive charges, so we wanna take charge into account as well. So putting all of this together, there are two considerations that we take into account when developing these scoring matrices. The first is conservation. Quite simply, what amino acids can substitute for another amino acid without adversely changing the structure or function of that particular protein? There are some more examples here on the slide, but at the end of the day, the game here is that we want to try to conserve the charges, the sizes, the hydrophobicity features, and other physicochemical factors between those two amino acids. The second thing we take into account is frequency.
If we just count up how often a particular amino acid occurs in every single protein we have the sequence of, we're able to ask the question, how often does it occur? We take that into account, so we would score rare amino acids differently than we would score common amino acids. Okay, so that all sounds well and fine. So why are we spending time talking about this this morning? The reason is that every single time you do any sort of analysis involving lining two sequences up with one another, either in a pairwise fashion or in a multiple sequence alignment, these scoring matrices are at work in the background. You may not know that they're being used in the background, but they certainly are. Whether, again, you've got sequence A compared to sequence B or sequence A compared to a database, they're always being used. And so they always appear in these analyses involving sequence comparisons. More importantly, they implicitly represent patterns of evolution. We're gonna talk a little bit more about how this comes into play. But the bottom line here is that the choice of matrix can really, really influence the results you get back when you use a particular program. And the take-home message here is that a lot of times we are prone to use the default values that are presented to us at a particular website, but those default values may not always be the right choice depending on what you're trying to accomplish. And I'll give you some guidelines as we go through about how to make those choices. Okay, so what does one of these beasts look like? Let's start on the nucleotide side where things are fairly simple. And this is a straightforward match/mismatch scheme. So here we've got the four amino, excuse me, the four nucleotides going across the top and the four nucleotides going down the side.
And again, I want you to imagine two sequences aligned with one another, two nucleotide sequences. And perhaps we have a position where we have two C's lining up with one another. So we have the C here in sequence one, the C here in sequence two. If we look where those intersect, we come up with a value of two. And you'll see that the twos appear every place on the diagonal. So any place you have a match at a particular position, using this particular matrix, you would assess a value of plus two points. You'll notice every place off the diagonal, these are the mismatches. Every place where you don't match correctly, you would subtract points. And in this case, we're subtracting three for every mismatch in the alignment. This kind of matrix just assumes that all of the nucleotides appear at equal frequency, approximately 25% of the time. Now, this works fine at the nucleotide level. We can get away with this because the four nucleotide bases themselves are not that chemically dissimilar. But we know that things on the amino acid side of the house are very much more complex. So we go from something simple-looking like this to something that looks like this. All right, and so the print has gotten very small. Now, this is a matrix called BLOSUM62. This is the default matrix that is used by NCBI when you do your BLAST searches. And let me orient you to what's going on here. Across the top and down the side are each one of the 20 amino acids. And let's take again the case where we now have two protein sequences. We've aligned them to one another. And let's say there's a particular position where we've aligned two tryptophan residues, so that they appear in both of the sequences at the same place. So here's our tryptophan up here, and the tryptophan in the other sequence is here. And if we look at where those intersect, we see the number 11. Now, think back to the last slide on the diagonal.
All of the values were exactly the same. If you look at the diagonal here, and this might be easier to see on your handouts, the values are all very different for these exact matches. So this goes back to frequency. How often does each one of these amino acid residues appear in the entire constellation of proteins? The ones that have higher values appear less frequently. These are the rarer amino acids. The ones that appear more frequently have lower values across that diagonal. Now, we, of course, want to try to match up as many residues as we can, exactly. But we can't always do that, and it's actually OK that we don't always do that, because we want to take into account conservative substitutions as well. So going back to our tryptophan versus tryptophan, let's say now in sequence A we've got a tryptophan in a particular position, and we've now aligned it instead with a tyrosine in the second sequence. In this case, we now see a two. So that's still a positive value. That implies to you that this is a conservative substitution. These two amino acids can substitute for one another and hopefully not adversely change the structure or function of this particular protein. Now let's take a case where you just totally botched things up. Up here, we've got a cysteine residue. And let's say we replace it with a glutamine. You'll notice where those intersect, we have a negative value, minus 3. So those are two residues that are deemed to not be able to substitute for one another. And by virtue of that, we actually take a deduction here. We subtract points any place we've got residues that should not substitute for one another. So put very simply, you earn the most points for an exact match. The rarer residues will give you more points than the more common residues. You'll still earn points for a conservative substitution, just not as many as you would for the exact match.
And finally, you will lose points every time you have a mismatch. OK, clear? OK, good. All right, so here is a BLOSUM matrix. And this is one of a family of matrices called the BLOSUM matrices. These were developed by the Henikoffs at the Fred Hutchinson Cancer Research Center back in the early 90s. BLOSUM stands for blocks substitution matrix. And all of those numbers in those tables are not just pulled out of thin air. Those numbers actually come from doing hundreds and hundreds of sequence alignments in protein families, protein families where we know the members of those families, and looking at the substitution patterns. So this is all empirical. It's based on reality, the observed substitutions that can take place without adversely changing structure and function once again. Because of the way these are calculated, these are very, very sensitive to detecting structural or functional substitutions. So when you mess things up, that will be reflected in the scores. Now, in this last bullet here, I say they generally perform better than PAM matrices. We're not going to talk about the PAM matrices here, but I want to bring them to your attention. These were developed in the early 70s by a woman named Margaret Dayhoff, who really was the very first pioneer in bioinformatics. And she tried to construct similar matrices to the ones we've just talked about. But unfortunately, at the time, we didn't have a lot of sequence data. And the sequences we had for proteins were mostly of globular proteins. So they had limited utility. They were the best you could do at the time. But the BLOSUM matrices are certainly better than the PAM matrices are. I just want to point this out, because sometimes you will see this come up in phylogenetic programs and other places. But whenever you have the choice, always pick the BLOSUM matrices. And I'll give you some background reading if you want to learn more about that later.
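As a small illustration of how these lookups work, the sketch below scores an ungapped alignment position by position. The nucleotide scheme is the plus two, minus three scheme described above. The protein table is only an excerpt holding the three BLOSUM62 values quoted in this lecture, not the full matrix, and the function names and test sequences are assumptions.

```python
def nucleotide_score(x, y, match=2, mismatch=-3):
    """Simple match/mismatch scheme: +2 on the diagonal, -3 off it."""
    return match if x == y else mismatch


# Excerpt only: the three BLOSUM62 entries quoted in the lecture.
BLOSUM62_EXCERPT = {
    ("W", "W"): 11,  # exact match of a rare residue
    ("W", "Y"): 2,   # conservative substitution
    ("C", "Q"): -3,  # residues that should not substitute for one another
}


def protein_score(x, y, table=BLOSUM62_EXCERPT):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (x, y) in table:
        return table[(x, y)]
    if (y, x) in table:
        return table[(y, x)]
    raise KeyError(f"pair {x}/{y} is not in this excerpt")


def ungapped_score(seq_a, seq_b, pair_score):
    """Sum the per-position scores of two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(ungapped_score("ACGT", "ACCT", nucleotide_score))  # three matches, one mismatch
print(protein_score("W", "W"), protein_score("Y", "W"), protein_score("C", "Q"))
```

In real work you would load the complete matrix from your alignment software rather than typing values in by hand.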
So I alluded earlier to the fact that the choice of matrix implies a certain model of evolution. So here is now where the evolution comes into play. Before, we had BLOSUM62. So it's always BLOSUM and a number. What that number means is that we have calculated that matrix from sequences that share no more than n percent identity. Put otherwise, if we have sequences that are above that cutoff, that are more similar than n, we cluster those together and weight them to 1, so they are actually considered as a single sequence. So pictures help here again. Let's say we have here a multiple sequence alignment. You'll notice that certain positions are absolutely conserved, the ones where the stars are. We could certainly calculate a matrix based just on that alignment. But the problem with that goes back to something that Dr. Green discussed in last week's lecture. We have genomic data for a vast number of organisms. But unfortunately, that sequence data is not evenly distributed amongst all of the phylogenetic taxa. So things tend to be in the upper part of the evolutionary tree, but not so much as we go further down. Because of that, we would introduce bias into the calculations, in that we would see things pushed more towards closely related sequences. And if we were not to account for that, we might miss substitutions that might be biologically relevant, that might provide us some insight into the relationship between two proteins. So in this case, let's say that we're gonna make the cutoff 80, so BLOSUM80. Here are our six sequences. And let's say, all right, I wanna now cluster together any of those sequences that have 80% or more similarity to one another. When I do that, I have one that actually doesn't match any of the others at that threshold. I have three sequences here that are 80% or more similar to one another, and I have two also meeting that same criterion.
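The clustering step can be sketched in a few lines. This is a simplified illustration, not the Henikoffs' actual procedure: it assumes ungapped sequences of equal length, links any two sequences that are at or above the threshold, and uses six made-up sequences chosen so that they fall into groups of one, three, and two like the slide.

```python
def identity(a, b):
    """Percent identity of two equal-length, ungapped sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold=80.0):
    """Group sequence indices so that anything linked at >= threshold shares a cluster."""
    clusters = []
    for i, s in enumerate(seqs):
        # Find every existing cluster that this sequence links to.
        linked = [c for c in clusters
                  if any(identity(s, seqs[j]) >= threshold for j in c)]
        merged = [i]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


# Hypothetical sequences, invented for this example.
SEQS = [
    "GGHHKKLLPP",  # matches nothing else at 80%
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "ACDEFGHLKV",
    "MNPQRSTVWY",
    "MNPQRSTVWF",
]

groups = cluster(SEQS, threshold=80.0)
# Each cluster counts as one sequence, so each member gets weight 1 / cluster size.
weights = {i: 1.0 / len(g) for g in groups for i in g}
print(groups)
print(weights)
```

With these inputs the six sequences contribute a total weight of three, one per cluster, which is the bias correction being described.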
At this point, this one is treated as a single sequence, these three are treated as a single sequence, and these two are treated as a single sequence. So by doing this, we remove all of this inherent bias in the collection of sequences that we just happen to have in the sequence databases. All right, so more importantly, what should you be thinking about when you look at those n numbers? When we do this clustering, as I've alluded to, we remove the contribution of closely related sequences, so there's less bias towards substitutions that occur in the most closely related members of a family. But here is the take-home message: reducing n yields more distantly related sequences. So in practice, this is your first cheat sheet of the morning, so put a star next to this particular slide. Which one do you choose? All right, now, the one that I showed you earlier, BLOSUM62, again is the default matrix at NCBI. The reason it's the default matrix at NCBI is that in reverse testing, where you knew what the answer was supposed to be in advance, where you're trying to find all of the known members of a protein family, the BLOSUM62 matrix performed the best. So by virtue of that, that's the one that was selected as the default. And so you'll be able to find similarities down in the 30 to 40% similarity range. If we go up here in the numbers, we start to get shorter alignments, but these are highly similar. So if you're looking for things that are very, very closely related to one another, you would push that BLOSUM number up. And you'll see that reflected here in this percent similarity number, 70 to 90%. Alternatively, if we go in the other direction down to BLOSUM30, we're now gonna get longer alignments, but they're going to be more weakly related to one another.
These are gonna be more distant relatives of the protein that you started with, but what this will help you to do is start to identify what residues are important to structure and function, which residues were preserved over long stretches of evolutionary time in a particular protein family despite the sequences being totally divergent. So that's all well and fine. What do you actually do with all of this now? The approach that I would advocate to you, and this comes originally from Steve Altschul, who developed the BLAST program, is to take a triple-matrix strategy. I know most of you have probably done BLAST searches and used the default value, but really what I would advocate is that you actually do three BLAST searches every time you wanna do a search. Use the default, since that one has been proven to be the most effective, but also pick one that's at shorter evolutionary distances and one that is at longer evolutionary distances, and see what you get. And I would invite you to do this exercise when you get back to your labs, because you are going to get three different sets of results. Now they certainly will overlap with one another, but you might find things, especially with this bottom one here, that you would not have necessarily found if you had just done the single BLAST search. Now certainly if you know in advance what you're looking for, you can do a single search and that is fine. But again, I would recommend you take this approach until you become more familiar with the methods and develop a better gut feeling for what to use when. So at the end of the day, no single matrix is the complete answer for all sequence comparisons, and hopefully you understand a little bit why that is now, okay. Any questions so far? Okay, great. So as far as some additional reading goes, in Current Protocols in Bioinformatics, unit 3.5 talks about these matrices.
This is a unit by David Wheeler that talks a little bit about the PAM matrices that we passed by this morning, more on the BLOSUM matrices, but also some specialized scoring matrices for things like transmembrane proteins and other types of proteins that don't observe the standard substitution rules like most proteins do. All of you have privileges to see the Current Protocols series by just going to the NIH Library. If you follow the online journals link, that will eventually take you to these and you can just download the PDF at your desktop. All right. Now one final thing about alignments before we finally get to BLAST, and that is gaps. So if we think about our sequences again and we think about these scoring matrices, every place we have a substitution of one amino acid for another, we can look up in our table what the value would be for that particular alignment. But what do you do in the case of a gap? Well, you've got a character in one sequence and you've got nothing in the other, okay? So we have to handle these a little bit differently. Now the reason we usually put gaps in is just to improve the alignment. But what I want you to keep in mind is that every time you put a gap into these alignments, it actually represents a biological event. These are insertions and deletions. And because they represent biological events, you have to keep them to a plausible number. You can't just willy-nilly be introducing gaps all over the place because, again, each one represents an insertion or a deletion having taken place at some time along evolutionary time. A good rule of thumb tends to be around one gap per 20 residues. And again, we can't score these simply as a match or a mismatch. Now with apologies for the equation, the way we do this is by something called the affine gap penalty.
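For reference, the affine gap penalty can be written as a one-line function. This is a minimal sketch that assumes the convention used in the worked example in this lecture, where the deduction is the opening penalty plus the extension penalty times the gap length, with an opening penalty of 11 and an extension penalty of 1. Some programs count the first gapped position differently, so check the documentation of whatever tool you use.

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Points deducted for one gap: gap_open + gap_extend * length."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + gap_extend * length


for n in (1, 2, 5):
    print(f"gap of length {n}: subtract {affine_gap_penalty(n)} points")
```

Because opening a gap costs far more than extending one, this scheme favors a few longer gaps over many short ones, which fits the idea that each gap stands for a single insertion or deletion event.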
And all you need to know off of this slide is that anytime you open a gap, you assess a penalty, and anytime you extend the gap, you make the gap longer, you also assess a penalty. So for example, if we have a protein sequence and in making an alignment I've introduced a gap of length two, I would have to subtract 11 points for opening the gap. The gap extension penalty is one. I've already told you that it's a gap of length two, so one times two is two. You add them all together, and you would subtract 13 points for that particular gap. So that just gets added on to the other parts of the score. Gaps always lead to a deduction of points. And I'm gonna be showing you, as we go through this next set of slides, a whole bunch of parameters. You can change any or all of these, or you can leave them the way they are. But hopefully when you understand what they are a little bit more, you can start to fiddle with these a little bit. So you can make these values larger or smaller if you wanna make the gaps less or more permissible. Okay, so with that by way of long-winded half-hour background, it's finally time to talk about BLAST. So how many people in the room have used BLAST before? Okay, the vast majority of you. How many of you actually know how it works? Okay, and a little chuckle too, that's good. And one final question, how many of you who have used it before have actually fiddled with any of the parameters other than just using the defaults? Okay, so a few people, all right, great. So hopefully for those of you who've played with it a little bit, you'll learn a little bit more today. For those of you who haven't done it before, I'll key you into a few things you should be looking at. All right, and I'm really gonna beat this to death today because BLAST is by far the most important technique we have in the bioinformatics arsenal. It really is the workhorse of what we do.
And so what I really wanna make sure by the time you all leave the hall today is that you all know how to use it properly, because the literature is becoming more and more littered with cases where incorrect conclusions were made either because somebody didn't know how to run the BLAST search correctly or they misinterpreted the results they got out of the BLAST search. And one of the most important things you're gonna learn today is just because you see a hit on a particular piece of paper, or something come back in your set of results, it does not necessarily mean that it is biologically and statistically significant, okay? So I'll give you some rules for that as well. All right, so BLAST. The Basic Local Alignment Search Tool. And what this is, again, is a local alignment method, and we're going to look for things called high-scoring segment pairs. And this is just another way of saying we're looking for domains between two proteins of interest or between our protein of interest and a whole slew of sequences in a public database. And we want to maximize the length of those alignments as much as we can. When we do these alignments, we have to hit a particular score threshold and the number S is gonna become very important to you in a moment. The alignments can either be gapped or ungapped. As I mentioned to you earlier, we could have multiple results coming out of a local alignment scheme. So we might get multiple high-scoring segment pairs for any two sequences that we align with one another. All right, so there are five programs in the BLAST suite. You can differentiate between them by just looking at the last letter. So BLASTN, N is for nucleotide. So we start with a nucleotide sequence and we compare that to all of the nucleotide sequences in a particular database. BLASTP, protein versus protein. BLASTX is a little special. What we do here is we start with a nucleotide sequence.
We do a six-frame translation and then, based on the putative protein sequences we get out, the comparison is done protein versus protein. The final two are grayed out because they are only rarely used. I just include them for sake of completeness but these are the three that you'll be using most in your travels. All right, so let's start with a protein sequence, and so here is a sequence and what we are going to try to do in order to effect this local alignment method. Again, we're not forcing the alignment along the entire length of the protein. Rather than try to find long stretches of residues that match with one another, we're gonna start with little tiny bits. And so we always start with something called a query word and by default that query word is of length three, and so I just want you to imagine this window of three moving across the sequence. For sake of argument we're gonna start in the middle here. And I have highlighted here a P, a Q, and a G, so three residues to start our query word. Now what I can do now is compare that P, Q and G to all of the other protein sequences we have available to us, looking for every occurrence of P, Q, G in the rest of those sequences. So when I do that and if I go to my scoring matrix, if you take your BLOSUM62 matrix in your handout and if you go P for P, Q for Q, G for G, you would get seven and five and six for a total of 18 points. So that's fine as far as the exact match goes but of course we wanna take into account conservative substitutions as well because there might be some neat biological relationships to be learned. So you'll see right below here I'm now just changing the middle letter for most of these, so if I have a P, E, G and I do the same exercise I would get 15 points, P, R, G gives me 14 points, going down this list. Now at some point you wanna draw a line and so this is something called the neighborhood score threshold. Okay.
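The word-scoring exercise just described can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST is actually implemented: the dictionary holds only the handful of BLOSUM62 entries needed for these words rather than the full matrix, the candidate list is hand-picked (the word PQA is an extra example added here to show something falling below the line), and the threshold of 13 is the value quoted for this example later in the lecture.

```python
# Only the BLOSUM62 entries needed for this example, not the full matrix.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("Q", "R"): 1,
    ("Q", "M"): 0,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum the position-by-position scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring at or above the threshold, best first."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda ws: ws[1], reverse=True)


for word, score in neighborhood("PQG", ["PQG", "PEG", "PRG", "PMG", "PQA"], 13):
    print(word, score)
```

With a threshold of 13, PQG, PEG, PRG and PMG stay in the neighborhood and PQA (12 points) is thrown out.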
This is automatically set by BLAST for each search, so it'll change it based on what residues are in the query word, but this is basically just the cutoff to say okay, everything from here up is considered to be, again, either an exact match or a conservative enough substitution to the query word that we started with. These things are in the neighborhood and we'll advance to the next step. Okay. Everything else gets thrown out. So let's move this to the top of the slide. Now here is our original sequence from the preceding slide. Here is another sequence that it found just by looking in one of the public databases. You'll notice that our P, Q, G now aligns with a P, M, G. Well that's all right because if you look up here P, M, G is part of the neighborhood. 13 points for that P, M, G. So here we've got our alignment and let me just quickly point out that you have a qualitative assessment here of how good this alignment is just by looking at the line in the middle. So any place you have an exact match you'll notice that the letter is repeated in that middle line. Any place you see a plus sign that's a conservative substitution, a positive value being assessed when you look back in that BLOSUM62 matrix. Okay. Everything else is either a zero or a mismatch, so a negative score. Okay. So we've started with our P, Q, G. We've matched it with a P, M, G because the P, M, G is in the neighborhood. And now as you can see by the arrows we're extending out in both directions as far as we can. But the question is how far is too far? Have we gone too far? If we go too far do we actually start to degrade the alignment? So here is now finally where those scoring matrices come into play. So again, our alignment is here at the top. The graph here, the x-axis is extension. How many residues have we aligned with one another? So how far out have we gone in both directions? The length of the alignment. The cumulative score is on the y-axis.
Here's our neighborhood score threshold T, which in this case was 13. The first point was a value of 13. So we break that score threshold, we can now do the extension. So again, if you don't break that score threshold, none of this takes place. So I want you to imagine just by looking at this sequence what you'll see, we've got a good amount of matches there. And for argument's sake, we're gonna say that the matches outweigh the mismatches. So we're going to be accruing more positive scores than we are going to be deducting negative scores for the mismatches. So as we start to extend, the curve goes up. At some point we break this value of S. So S is the score threshold. What that value represents is what you have to beat in order for that particular sequence to be reported back to you as a hit in your BLAST results. And this is really pitfall number one. Just because it is reported in the BLAST results, once again it does not mean that it is biologically or statistically significant. So this is where people really run into trouble: oh, it's on my list, so it must be related. But actually, a lot of times it's not. And we'll see an example of that shortly. So the take-home message is, the scores are actually not that important. And we'll come back to that. All right, so we're gonna keep extending in both directions. We're gonna assume that the matches are outweighing the mismatches. We keep going up, we keep going up. But at some point, let's say the mismatches start to outweigh the matches. Think back to your scoring matrices, you're deducting points every time you mismatch. If we start introducing gaps, you'll also start to deduct points. So because of those two things, you might drop a particular amount. And this is something called the significance decay. This is also automatically set by the algorithm. Once you drop a particular amount, BLAST will say, all right, I went a little bit too far.
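The extension step can be sketched roughly as follows. This is a simplified, one-direction, ungapped illustration under stated assumptions: the per-position scores are made-up numbers, `drop_off` is a stand-in name for the amount of decay the lecture says is set automatically, and the starting score of 13 is the seed word score from the example. The function keeps a running total, remembers where the peak was, and stops once the total has fallen far enough below that peak.

```python
def extend(position_scores, drop_off, seed_score=0):
    """Extend in one direction and report the best stopping point.

    position_scores: score contributed by each newly aligned position.
    drop_off: how far the running total may fall below its peak
              before the extension is abandoned.
    Returns (positions kept, cumulative score at the peak).
    """
    total = seed_score
    best_score = seed_score
    best_length = 0
    for i, score in enumerate(position_scores, start=1):
        total += score
        if total > best_score:
            # New peak: remember how far out we were.
            best_score = total
            best_length = i
        elif best_score - total >= drop_off:
            # Dropped too far below the peak: stop extending.
            break
    return best_length, best_score


# Hypothetical scores: matches early on, then mismatches take over.
print(extend([5, 4, -1, 6, -2, -3, -4, 2, 7], drop_off=8, seed_score=13))
```

Note that the later positions scoring 2 and 7 are never reached in this example, because the extension has already been abandoned by then.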
I'm gonna back up to this point here, the peak of this curve. Whatever length that alignment is, that is considered to be the length of this high-scoring segment pair, the length of the alignment between sequence A and sequence B, okay. All right, good. All right, so now, the scores are important because we go back to our BLOSUM62 matrices. They allow us to assess how good the alignments are, position by position. But if I now want to go back to that question that I keep harping on, what is biologically significant? What is statistically significant? We need to now do a little bit of math. So think back to the matrices again, and I want you to imagine two sequences, two alignments rather. And in the alignment of one, you might have an alignment that matches position for position, but all of the residues are common residues. You might have another alignment of the same length where all of the residues are rare residues. You already know from the scoring matrices that the net score for the alignment that has all of the rare residues is gonna be much higher than the one with the common residues, even though they're the exact same length, okay. So you can't use score as a determinant of significance. We need to now manipulate that into a probability value. And so the way we do that, there is something called the Karlin-Altschul equation; do not memorize this. So here's our score in the exponential. It takes into account the size of the database, the length of the query, and a whole bunch of other factors to finally compute out a probability score. And so what this score represents is the number of high-scoring segment pairs found purely by chance, the number of false positives, okay. So because it is the number of false positives, we want that number to be as small as possible, okay. So I'm gonna give you your first set of general guidelines here.
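For reference, the Karlin-Altschul relationship is commonly written as E = K · m · n · e^(−λS), where S is the score, m is the query length, n is the database length, and K and λ are statistical parameters that depend on the scoring system. The sketch below is only an illustration of how those pieces interact: the default `lam` and `k` values are placeholder assumptions of the kind reported for gapped BLOSUM62 searches, and it ignores the length corrections that real implementations apply.

```python
import math


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least `score`.

    lam and k are illustrative placeholder values (an assumption here);
    real searches use parameters matched to the matrix and gap costs.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# A higher score gives a smaller E value for the same search space.
print(expect_value(60, query_len=300, db_len=1_000_000))
print(expect_value(120, query_len=300, db_len=1_000_000))

# Doubling the database doubles the number of chance hits expected.
print(expect_value(60, query_len=300, db_len=2_000_000))
```

This is why the same score can be convincing against a small database and unconvincing against a very large one, and why the E value rather than the raw score is what gets compared from search to search.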
If we're doing a nucleotide-based comparison, so a BLASTN search, you want that E value to be less than or equal to 10 to the minus six. Again, just a guideline. If we're doing a protein-based search, this is both BLASTP and BLASTX, because remember, even though we're doing that six-frame translation starting with a nucleotide sequence, the comparison's at the protein level, the rough guideline I'm going to give you is a cutoff of 10 to the minus third, okay. So hopefully that makes a little bit more sense as far as the probabilities go, so the scores are important, but the scores are not what you're gonna look at at the end of the day. The E values are what you're going to look at because from search to search to search, regardless of the composition of the sequences, the lengths of the alignments, or any other factors, you can actually compare that value one to the next and start to develop a set of rules, okay. Questions? All right, good. So let's actually go through an example, and so we're gonna pretend we're sitting at a computer and in your handout you have each one of these screen captures. So what I would invite you to do once you get back to lab or once you get home is to just sit down with the handout and go through the examples. You can just go through them step by step. I know in the printouts it's a little bit small, but if you put them up on your screens back at your desk you can actually blow everything up and hopefully it'll be a little bit more legible. All right, so the BLAST website is at NCBI, so I'll always be giving you the URL. And this is the NCBI homepage. The most commonly used tools can be found here in the upper right and the one we're interested in right now is BLAST. So if we pretend to click on that, that brings us to the BLAST homepage. Now on this page things are organized in three general groups. At the top you can actually BLAST against assembled genomes.
So instead of saying just BLAST against everything in the protein database or everything in the nucleotide database you could select a particular organism here. The basic BLAST searches, the five searches that were on that slide I showed you earlier, are found in the middle section and there's some specialized searches in the bottom that we'll talk about a little bit later. But right now we're doing a protein example. The protein BLAST link is right there so we're going to follow that link. Before we follow that link, for all of the examples that I'm going to work this week and next week you can find all of the sequences at this URL, so if you choose to do the examples on your own you can actually just grab the sequences from here and follow through the examples. Okay, so we clicked on that link, brings us to the BLAST homepage. So the very first thing you need of course is a sequence. So from that example page I've taken the sequence that was marked BLASTP, put it in the box. Okay, so simple as that. Next to that box you'll see something that says query subrange. So let's say you only want to consider part of this sequence. You can certainly do that by saying, all right if I only want to do the search using residues 100 to 175, you put 100 to 175 in the box, it'll ignore the rest of the sequence, so it saves you the trouble of doing manual editing. We also have a choice of what database we want to search. So there's a database pull-down here and these are the ones that are available to you. The default is the non-redundant database, so all of the sequences that are available at NCBI. There's two that I want to draw your attention to. One is called RefSeq and one is called Swiss-Prot. So let's talk a little bit about RefSeq. We'll take a little bit of an aside here.
So many of you have done these searches before, or you've done Entrez-based searches where you might have put a name of a gene into the Entrez search box and you've gotten back one, five, 20, a hundred different entries and you sort of look at it a little bit quizzically and you ask yourself, well, which one is the right one? Which is the canonically correct entry for the gene sequence or the protein sequence that I'm looking for? So what NCBI has done to address this problem is come up with a database called the NCBI Reference Sequence database, or RefSeq. And the goal of this is to provide a single reference sequence for each one of the three molecules in the central dogma. So one DNA sequence, one mRNA sequence and one protein sequence for each entity. What's nice about this database is that by definition it is non-redundant, but it is constantly updated by curators to reflect the current state of biology. So when you look at the feature tables in those GenBank entries you can be assured that those are up to date. The way you can tell which ones of these are RefSeq entries is by looking at the first letter of their accession number. Now whenever I use the term accession number, that is a unique identifier that if you were to type that in and do a search you would get back one and only one sequence. So in essence that is the sequence's social security number. All right, in the first series you'll see that they all start with an N. Okay, so we have NT, an underscore, and six numbers. These represent genomic contigs more than likely coming off the sequencing machines in the kinds of projects that Eric Green described to you last week. NM, these are mRNA sequences, and NP, the P is for protein, these are protein sequences.
The important thing about the N is that it says it's from curation of GenBank entries, but what it really means is that these are experimentally verified sequences, that somebody at some point had that molecule in their hands. Whether it came off of a sequencing machine or was in somebody's laboratory, these things have been verified to be correct. So when you have a choice these are the ones you would gravitate to. There's another series though that I wanna point out to you that start with an X and you'll see it says from genome annotation. So imagine we've got these genomic contigs at the top coming off of these sequencing machines at Mach speed. We can apply computational methods to try to guess where we think the genes are and what the gene structure looks like. So based on those predictive methods we can come up with a series of model mRNAs, then translate those into a series of model proteins. So the important thing to remember here is again the Ns represent experimentally verified sequences and they're the ones that are preferable always because again somebody's had it in their hands; the Xs are predictions. So it doesn't mean you should never use the predicted sequences but just keep in mind that they're predictions, and the prediction methods are pretty damn good but they're not always perfect. So as long as you keep that in the back of your head you'll be fine. All right, so that's RefSeq. Why RefSeq is nice to use in the context of a BLAST search is that you're gonna get a neater set of results: the redundancy has been taken out, each molecule is represented once, so you're gonna just get a more tidy list of hits. The other database, called Swiss-Prot, we'll talk about a little bit more next week. All right, so let's go a little bit further down in the entry here. We can pick a particular organism here so you can limit this to humans or bilaterians or any phylogenetic class that you want just by putting the right name in the box.
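The RefSeq naming convention described above (the N series for curated records, the X series for model records from genome annotation) lends itself to a small lookup. The sketch below is illustrative only: it covers just the prefixes discussed in the lecture, the function name is made up, and the accession strings in the example are placeholders rather than real records.

```python
# Prefixes discussed in the lecture; RefSeq defines others not listed here.
REFSEQ_PREFIXES = {
    "NT_": ("genomic contig", "curated"),
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
}


def describe_accession(accession):
    """Return (molecule, status) for a RefSeq-style accession, or None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions, made up for illustration.
for acc in ["NM_123456", "XP_654321", "AB000001"]:
    print(acc, describe_accession(acc))
```

A hit list could be sorted with a helper like this so that the curated N-series entries are looked at before the predicted X-series ones.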
So this is a little bit of a different approach from the first slide that I showed you where you could actually pick an organism. Here you could pick a set of organisms in particular phylogenetic taxa and so on. Going a little bit further down you'll see that there is a little box down here that says algorithm parameters and people usually miss this because it's below the BLAST button. This is where all of the funky little things that you can mess with, all of the parameters you can change, are located, so if you click on that I just want you to imagine now the screen is going to unfurl a little bit more and it's going to look like this. So the first thing to look at is the maximum number of target sequences. What is the maximum number of sequences that BLAST will return to you in the results? The default is 100. Okay, now in the case that we're gonna be working, when you see the results we're gonna get 130 something results back. So we're gonna miss a substantial number of the results if we leave it at the default. So I always just hike it up as far as it'll go and as high as it'll go is 250. We can also change the E-value threshold. So these are the rules I gave you earlier for the cutoff. You'll see that the E-value threshold says 10. Well, for a protein-based search I've told you that it should be 10 to the minus third or better, give or take. So that's a significantly bigger number by orders of magnitude, but for sake of the example we're just gonna leave it there for now because there actually is a virtue to leaving it a little bit higher. The word size: from the example we did before, our P, Q and G, the word size is three, but you can change that. If you make the word size longer you're going to push in the direction of more exact matches. Okay, let's go down a little bit more. The scoring matrices, so now that you know a little bit more about which ones work when and which ones should be used for different cases, here are the ones you can select.
So if you do employ the triple matrix strategy, one, two, three, there they are, okay? And again the default is BLOSUM62. Right below that it says gap costs, so these are the affine gap penalty costs. You can make the gaps more or less permissive just by using that menu. Below that, under filters and masking, and again something that's overlooked, here I've checked off the box where it says low complexity region. So let's talk about what a low complexity region is. These are regions of biased composition, so these could be homopolymeric runs where you've got the same amino acid over and over again, short period repeats, or a subtle overrepresentation of certain residues. So in the example on the board we have a sequence where in part of that sequence you've got a run of A's and a run of Q's, okay? This poses a little bit of a problem to BLAST because BLAST depends on a fairly even distribution of the residues, so when it sees something like this it quite honestly doesn't know what to do with it, because you can imagine that it could try to align it in any one of a number of ways. So rather than trying to force an alignment that might be incorrect, we just mask these out. So by clicking that little box it'll just write over those residues and not consider them in constructing the alignment. Now just as an aside, where do these low complexity regions come from? The origins aren't really that clear. Could be DNA replication errors, it could be from some sort of unequal crossing over, but again they could confound the sequence analysis and will often lead to false positives, so the filtering is recommended but it's not enabled by default, all right?
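To make the idea of masking concrete, here is a deliberately naive sketch that only catches the simplest case mentioned above, homopolymeric runs such as a stretch of A's or Q's. It is not the filter BLAST actually applies, which also handles short-period repeats and subtler compositional bias. The function name, the `min_run` cutoff and the example sequence are all assumptions made for illustration, and masked residues are shown in lower case.

```python
import re


def mask_homopolymer_runs(sequence, min_run=5):
    """Lower-case any run of min_run or more identical residues."""
    # (.) captures one residue; \1{n,} requires it to repeat n more times.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).lower(), sequence)


# Hypothetical protein fragment with a run of A's and a run of Q's.
print(mask_homopolymer_runs("MKTAAAAAAAALQQQQQQQQGHV"))
```

A search step written against the masked sequence would then simply skip any lower-case position when building query words.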
Okay, so down at the bottom, the last thing you're going to take care of: there's a little teeny tiny check box there that says show results in a new window, and this is just a personal preference. For those of you who have done BLAST searches before, you know if you submit the query you'll get another web page, and you just follow the same web paradigm that you're moving across a bunch of pages. If you click that box it'll open up a new window, and I prefer that because it'll leave my query window where it is. If I wanna go back and make changes I can make changes without having to page back and page back and hopefully still find that particular query page. All right, so with all of that we finally click the BLAST button and we start getting back our results. Okay, so here, this is the top part of the BLAST page. What you have at the very, very top is any time protein domains, either structural or functional domains, are found, those are pointed out to you, so your sequence is represented as this grayish black bar and the various families that were found are seen below. We're gonna talk about this much more next week. In this particular search we had 189 BLAST hits, so remember the default was 100, so we would have lost 89 of those hits had we not upped that value to 250. All right, then we've got a very, very busy table and you all have probably looked at this and kind of wondered what is going on in here, so let's take a moment to deconstruct this table. Now, in this table each one of these lines represents a particular BLAST hit, so we had our query sequence of interest, these were the sequences that were found, and each one of the thicker bars represents one of our high-scoring segment pairs.
Okay, so these are in descending score order, so the best hits by virtue of score are at the top going down from there. So remember I'm not so keen on the score as being an arbiter of what's important but that's just the way this is organized. The color key to tell you what these colors represent is at the top, so the better ones by virtue of score are in red, going down to the black. Any time you see multiple ones of these thick bars, each one of these high-scoring segment pairs, connected by a thinner bar, that implies to you that all of this is a single protein but that part of the protein was not aligned, so either because it was masked or a gap was introduced it's just not part of the high-scoring segment pair. So in the one here in the example we have one, two, three, four high-scoring segment pairs in that protein but the alignment does not encompass the entire length of the protein. Okay, sometimes you'll see high-scoring segment pairs on the same line but they're not connected to one another, and that just is done for an economy of space. Those two are actually not related to one another. Okay, so now down to the more important part of the output, the hit list. All right, so for each one of these items that was found, that was deemed to be similar by our BLAST algorithm, starting with our query word and extending out as far as we could go, we're given the accession number at the very beginning, a small snippet of the definition line to give us some sense of what protein was found, what the name of the protein is, the scores and the E values. And so this is also organized in descending score order, but what we're going to pay attention to again is this final column, the probability value, the E value, the representation of whether or not this is possibly a false positive. Now, the representation here, because there aren't that many characters allotted to this in the formatting, you'll see things like 3e-160 and 7e-118.
What that means: every time you see the E minus, that's just times 10 to the, so two E minus 98 is two times 10 to the minus 98. So something like that certainly beats our cutoff parameters. At the top, you see a bunch of zeros so that just implies that it's less than or equal to 10 to the minus 1000. So those are the ones you don't even have to think about, okay, as being related. All right, now, and finally some of these have U's and G's and S's next to them. If you see an S, that is a link to the structure database. If you see a G or a U, those are links to Gene and UniGene, so these are two gene-centric databases that NCBI offers that give you more information about how those genes are clustered with other genes, how they appear as far as their gene structure goes and similar information. So let's scroll down, we're gonna pretend we scroll down. Here's the bottom of our hit list. You'll remember that we left our expect value, our E value when we did the search, at 10, okay? So we're getting anything that had an E value of better than 10 in our list. So somewhere here we have to draw a line. I've given you a general rule of 10 to the minus third as being a good place for a preliminary cutoff. So if we look down this list, that occurs right about here. So we're gonna draw a line right there. So everything above this line we're gonna accept for now and everything below the line we're going to reject. Now, it's actually good to make that E value a little bit bigger than where you want your absolute cutoff and the reason for that is you should look at what's down here, okay? And look at the names of those things and you might have biological evidence or information from the literature that may actually argue for relatedness even though it's below the line, okay? So one of the most important things to remember whenever you use any kind of bioinformatic technique is that biology always trumps the computational predictions or the results of these programs, okay?
The biology is real. So that is what you have to let guide you as you do these. So these guidelines that I'm giving you are all starting points, but don't put your biological knowledge on the shelf when you use these methods. Put them together to look at these results. And in fact, in this particular case, some of these should actually be included, not excluded. All right. Now, in the score column, you'll notice that they're all hyperlinked. If we click on any of those, that will jump us down in the entry to that particular alignment for that protein. So let's pretend we've done that. We've come down to a particular alignment here. And so what we see here is that we have our query sequence, the one that we used as the input. The alignment starts in position 17 of that. We've found another protein where the alignment starts at position 317, going across. So you'll see the numbers on both sides so you know where you are. Now, above that, you'll see a bunch of scoring information. So you need to pay attention to what's going on here. So for each one of these, you'll see a score value. Here's the expect value, the evalue here at zero. So you know you've beat your cutoff. It's less than 10 to the minus 1,000. So that's good. You'll then see identities, positives, and gaps. So what the identities are is the percent identity, how many exact matches were there. So in this case, 688 out of 688, if they're 100% similar to one another, positives is percent similarity. So this includes conservative substitutions. All right, so that's just the nuance here. So identities is what it sounds like, but the positives includes conservative substitutions and gaps are gaps we don't have any here. So rule number two. So besides looking at the expect values, looking at those evalues, what I want you to also look for is look at that identity number. If you're on the protein level, you wanna at least have 25% sequence identity. 
If you're at the nucleotide level, you wanna have at least 70% sequence identity. So both of those criteria must be met. Okay, all right. So we could scroll down. We would eventually find, oh, I'm sorry, and here you see that there is a bunch of letters that are in lower case. This was deemed to be a region of low complexity. So it's masked out as far as the alignment goes. You can still see what the letters are. They just were put in lower case. In the next slide, we'll see an example of a gap. All right, so this is actually the end of the alignment from the preceding slide. We see another set of statistics here, but we don't see, as we did in the previous slide, all of this identifying information at the top. We don't see what the name of the sequence is that was found. So what that implies to you is that we now have a new alignment starting at 906 in our query, ending at 1370 in the subject. Whenever you just see it go right into another alignment without a def line, that is another high-scoring segment pair between the same two proteins. All right, and again, here is just an example of a gap. We got our same statistical information as we had before. So just putting all of this part together, here is the overview that we saw. I've just cut out the bits, the scoring bits from the two alignments we just looked at. So if we look at the percent identity, so this one had 100% identity, this one was 92. So that certainly beats our guidelines. The expect values, here at zero, here at seven times 10 to the minus 180. So those certainly meet our criteria as well. And just so you can see what these correspond to in the diagram, here's the first high-scoring segment pair, there's a gap here, here's the second high-scoring segment pair there. So they're both connected to one another, okay? So putting all of that together, second cheat sheet for the morning. Okay, so here are your suggested cutoff values.
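The two rules of thumb given so far (an E value cutoff and a percent identity cutoff, both of which must be met) can be captured in a small helper. This is a sketch under the lecture's stated guidelines only: the function and dictionary names are made up, the example hits are hypothetical, and a result of False means "look more closely", not "unrelated".

```python
# Starting-point guidelines from the lecture: (max E value, min % identity).
GUIDELINES = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_guidelines(e_value, percent_identity, kind):
    """True if a hit meets both the E value and identity guidelines."""
    max_e, min_identity = GUIDELINES[kind]
    return e_value <= max_e and percent_identity >= min_identity


# Hypothetical hits: (E value, percent identity, comparison type).
hits = [
    (0.0, 100.0, "protein"),
    (2e-98, 46.0, "protein"),
    (1.9, 38.0, "protein"),
    (1e-8, 65.0, "nucleotide"),
]
for e_value, identity, kind in hits:
    print(e_value, identity, kind, passes_guidelines(e_value, identity, kind))
```

Hits that fail are still worth reading by name, since the biology may argue for relatedness even when the numbers fall on the wrong side of the line.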
Again, we're gonna have a probability measure and we're gonna have a sequence identity measure. At the nucleotide level, we want to see E values that are better than 10 to the minus six, with a sequence identity of better than 70%. Protein level, 10 to the minus third for the E value, 25% or more for the sequence identity. Now, these are just general guidelines and I offer these as a starting point. Again, I wanna remake the point that I made earlier. Don't use these cutoffs blindly, okay? And don't forget your biology at the door. All right, so you want to make sure that you look at the list, consider any biological information you have, pay attention to alignments on either side of that dividing line, so there's nothing holy about these. This was just a starting point to help you better use the BLAST algorithm. Okay, questions? Okay, I'm hoping everybody's following along, okay. All right, just some final things before we leave BLAST. So the cutoffs are based on many, many years of experience of many, many people and are intended as a starting point. So you're asking if there's an alternative set of cutoffs, so there is no alternative set of cutoffs, but this is why, again, it's important to do the practice and to just use this in conjunction with the biological information that you have. All right, all right, just some things to keep in the back of your mind where BLAST may fail you. Equally as important as knowing how to use it is knowing when not to use it, or cases where things might get a little bit messed up. So we've already talked about low complexity regions. Whenever you have repetitive elements, again, it doesn't like things that are not sort of random in nature, so whenever you have LINEs and SINEs and retroviral repeats, you may run into problems. When you're using BLASTN, you should choose filter species-specific repeats to get rid of those repetitive elements.
Alternatively, you could use a third-party piece of software called RepeatMasker, do the masking there, and then bring the sequences in. You also have to keep in mind low quality sequences, so things like expressed sequence tags and single-pass reads from large-scale sequencing. And just to bring you back again to last week's lecture, you'll remember that Dr. Green showed you that Poisson distribution slide. If you only sequenced a particular stretch of the genome once, your accuracy would only be on the order of 63%. You do it a second time, that bumps you to 87%, and so on, pushing us up towards that 12x coverage that gets us to that 99.999% accuracy. So if you have a single-pass sequence, it might actually be problematic as far as BLAST is concerned. All right, let's move along to a related BLAST algorithm called BLAST 2 Sequences. So in this case, we're doing pretty much the same thing, but instead of taking a sequence of interest and comparing it to all of the other sequences in a public database or in a set of sequences that you may have assembled yourself, we instead have two predefined sequences that we just want to align to each other. As before, all of the BLAST programs are available. You can pick whichever BLOSUM or PAM matrix you want, with the same affine gap costs as before, and you can mask the input sequences in the same way that we just did. So we go back to our BLAST homepage, down at the bottom. This is located in the Specialized BLAST section. There is a link that says, align two or more sequences using BLAST. If we click on that, that brings us to this page, and at the top you'll see five tabs that say BLASTN, BLASTP, BLASTX, and so on. So I clicked on the BLASTP tab up there. There are two boxes for two sequences, and these sequences come from that list of sample sequences on the webpage that I gave you the URL for. So you can again do these yourself.
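As a quick check on the single-pass numbers quoted above, here is a minimal sketch. It assumes the slide from last week used the standard Poisson estimate, where the chance that a given base is sequenced at least once at c-fold coverage is 1 - e^(-c). The function name is made up for illustration.

```python
import math


def fraction_covered(coverage):
    """Poisson estimate: P(base sequenced at least once) = 1 - e^(-coverage)."""
    return 1.0 - math.exp(-coverage)


for fold in (1, 2, 12):
    print(f"{fold}x coverage: {fraction_covered(fold):.5%}")
```

Under that assumption, 1x comes out at about 63% and 2x at about 86.5%, which rounds to the 87% quoted, and 12x clears 99.999%.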
So sequence one is in the first box, sequence two is in the second box. As before, we can limit the query and subject subranges to only use parts of those sequences. Going down into the algorithm parameters at the bottom, as before, we can toggle this down below the BLAST button there. We can change again the number of target sequences or the E value. You really don't have to bother with this here because you've already selected the two sequences. So you don't have to worry about not finding the two sequences that you put in the boxes. They will appear. But you can change the word size if you wanna push the alignment once again towards longer alignments. So again, we can pick which matrices we want. We're gonna mask out the low complexity regions as we did before, check off the little box to have our results in a new window, and then we're gonna finally hit the BLAST button. And now we get output very similar to what we got before. Here is our score overview. Because we started with two sequences, there's only one possible hit, so you only have one line in the overview. And this is the one hit, the two sequences aligned to one another. You're then given two alignments for those two sequences with one another. You'll notice the first one, we've got 46% sequence identity. The expect value is a reasonable expect value. So that is a good alignment of our two sequences of interest. The other one is very, very short. You'll notice the expect value is 1.9. We're not even gonna consider that one any further. So this is just a really quick way of taking two sequences and aligning them to one another. But you do have some control over how the alignment is done. There is a link up here that says dot matrix view. And if you toggle that down, you get something that looks like this. And this is a more traditional view of how sequence alignments are represented.
So sequence one is represented by this scale across the bottom, sequence two is represented by the scale across the top. This alignment is this line. So if you were to match up the coordinates to what's in the table, you would see that this one starts here and goes there. This short one is this little bit here. When you see a straight line going across the whole way, that just tells you that the alignment has taken place across the entire length of the protein. If you see it jumping around, that's sort of interesting sometimes just from the standpoint of seeing different domains that are scattered about, but not necessarily in the same order in the two proteins. Okay. All right, so I don't wanna give short shrift to the nucleotide side of the house. The protein side, as you have probably now figured out, is much more complicated because of the scoring matrices and all of the considerations that go into aligning two protein sequences, and all of the physical, chemical, and other characteristics that we have to keep track of. But oftentimes you're going to want to run nucleotide-based searches. And so again, we could go back to our NCBI website. We could click on the nucleotide BLAST button. And pretty much the same kind of page appears as we saw before. Except there's one difference here, we've got something called program selection here. And I know it's small, so let me just read it to you. The first option says highly similar sequences, megablast; then more dissimilar sequences, discontiguous megablast; and somewhat similar sequences, BLASTN. Okay, so let's talk about what each one of those three is. And I apologize for the busy slide here. Let me orient you for a minute. For each one of the methods that are listed here, I'm giving you the default word length, the match and mismatch scores, so remember, this is a plus and minus scheme, analogous to the very first matrix I showed you about an hour ago, and how the gaps are treated.
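Going back to the dot matrix view for a moment, the idea behind that picture can be sketched in a few lines. This is a simplified illustration, not how the NCBI page draws its plot: the function name and the window size are made up, and a dot is placed only where a short window of the two sequences matches exactly.

```python
def dot_matrix(seq1, seq2, window=3):
    """Build a text dot plot: seq1 runs across, seq2 runs down.

    A '*' at column i, row j means seq1[i:i+window] == seq2[j:j+window].
    """
    rows = []
    for j in range(len(seq2) - window + 1):
        row = ""
        for i in range(len(seq1) - window + 1):
            same = seq1[i:i + window] == seq2[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Two identical toy sequences give one unbroken diagonal.
for line in dot_matrix("ABCDEF", "ABCDEF"):
    print(line)
```

An unbroken diagonal corresponds to an alignment running the whole length of both sequences. Diagonal pieces that are offset from one another are what you would see when shared domains appear in a different order in the two proteins.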
So if you've got very, very long sequences and you want to compare your very, very long sequences against complete genomes or against complete chromosomes, megablast is a good way to go. The word length you'll see here is 28. For regular BLASTN, it's 11. By virtue of that, you're going to push towards longer alignments, but it's also going to be much, much faster. So if you've got long, contig-length input sequences, that would be the method that you would want to pick. Just as an aside, the gaps are treated a little bit differently here, in that we have a linear system where we charge for the gap, but it's done in a different way than in the affine method that I explained to you earlier. Flipping to the bottom of the slide, finding short, nearly exact matches, less than 20 bases. So here's where you would use BLASTN. The default again is 11, but we're going to switch that to seven here, okay? And what the NCBI website also recommends is that you set E to a thousand, okay, a much bigger number than we've discussed so far, and turn off all of the filtering. So this is meant as a setting where you can see all of the results, but you might not necessarily want to take all of those results, depending on what's on that list. You're going to look at those. You often would use this kind of technique if you're doing primer design or something similar where you've got to get very, very short sequences to match one another. In the middle is a variant of megablast and BLASTN. So BLASTN works exactly the same way as BLASTP that we talked about. Megablast is a slight variation of BLASTN that is just designed to run faster. So if you are just doing a regular cross-species comparison, you've got a sequence of interest, and you want to compare it to all of the other nucleotide sequences in the database, you could pick either one of these two. The results you would get back will be almost identical, if not identical.
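To see why the word length pushes the results one way or the other, here is a deliberately simplified sketch of the seeding step: find every exact word of a given length shared by the query and the subject. The real BLAST programs do considerably more than this, and the function name and toy sequences are made up for illustration.

```python
def seed_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) for every exact shared word."""
    # Index every word of the subject by its starting position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    # Look up each query word in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Toy 12-base sequences that differ at a single position.
query = "ACGTACGTTAGC"
subject = "ACGTACGATAGC"

print(seed_hits(query, subject, 11))
print(seed_hits(query, subject, 7))
```

With a word length of 11, the single mismatch is enough that no word is shared and nothing gets seeded. Dropping to 7 recovers a seed, which is the reason for the smaller word size when you are hunting for short, nearly exact matches.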
The only difference is that the discontiguous megablast is going to run faster. Faster, of course, is in relative terms when we're talking about computer cycles, but it's faster, okay? So just so you know what those three options are, you now know how to change any of these numbers to push the results one way or the other, okay? All right, so with that, we've got 17 minutes left, so let's get into a different method. We're gonna leave BLAST behind. Does BLAST make better sense now for those of you who have used it before? Okay, so think about how the parameters work. Think about those results that you get back in your hit list. You now have some generalized guidelines for where to draw the cutoffs. But again, couple what you see on these hit lists with what you know about the biology of these proteins before you make your conclusions, and if you do that, you'll be just fine. All right, so the final topic today, we're gonna talk about BLAT. So we're gonna leave NCBI. We're gonna go over to the Genome Center at UC Santa Cruz and talk about a method that's very similar to megablast called BLAT. And BLAT just stands for the BLAST-like alignment tool. A very original name, okay? And so, like megablast, this is designed to align very, very long sequences. So things more than 40 bases in length, having very, very high degrees of sequence similarity. So again, if you're doing more of a genome-wide type comparison, this would be a great tool of choice. It is much, much faster than BLAST is for RNA and DNA searches. You might miss divergent or shorter sequence alignments, but you wouldn't be using this method if that's what you were after anyway. You would be using one of the BLAST algorithms instead. You can use it on protein sequences. In practice, I don't think I've ever actually done that, but it is available to you. Now, so when do you use this?
If you have an unknown gene or some sequence fragment that you wanna characterize, you can find that particular fragment's genomic coordinates by doing a BLAT search. You can start to determine the gene structure. So where are the exons and where are the introns? And you can find other markers of interest in the vicinity of the sequence. So we start to get a genomic view now of where your sequence of interest lies, not just whether it matched something. What else of interest is in that particular genomic region? You use it again to find highly similar sequences. So you can use it to find gene family members or putative homologs. Or you can just use this to display a specific sequence as a separate track. So we now introduce the concept of a track. And so when we look at these genome browsers, all of the data are arranged as tracks, simply lines on the genome browser, which Tyra will talk to you about in much greater detail in week four. So let's go to UCSC. There is the URL to get to the BLAT algorithm. The link is right there, the third one down in the sidebar. If you click on that, you have a very, very simple interface here. So I've taken the BLAT sequence from our list of sample sequences and pasted it in the box. This is a rat sequence, so I've changed genome to rat. It asks me what the query type is, so this is a DNA sequence. And that's all you have to do. Because this algorithm is so highly optimized to work as quickly as it can for a particular case, we don't have to worry about a lot of the same parameters that we talked about earlier this morning. So once I hit submit, I then get back a list of results that looks like this. So these are the various hits that were found. The best one is at the top. So I have, starting with my query sequence, an alignment that starts at position 1 of my query sequence and ends at 733, for a length of 725 residues. So there's actually a gap in here. You'll see the percent identity is 98.1%.
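The step from genomic coordinates to gene structure can be sketched as follows. This is only an illustration of the reasoning, not BLAT's own output format or logic: the block coordinates are invented, and the function name is hypothetical. The idea is that when two aligned blocks are back to back in the cDNA query but separated in the genome, the genomic stretch in between is a candidate intron.

```python
# Hypothetical aligned blocks: (query_start, genome_start, size), 0-based.
blocks = [(0, 1000, 120), (120, 1850, 200), (320, 2900, 80)]


def infer_introns(aligned_blocks):
    """Return genomic (start, end) gaps between blocks adjacent in the query."""
    introns = []
    for (q1, g1, size1), (q2, g2, _) in zip(aligned_blocks, aligned_blocks[1:]):
        query_gap = q2 - (q1 + size1)    # bases skipped in the query
        genome_gap = g2 - (g1 + size1)   # bases skipped in the genome
        if query_gap == 0 and genome_gap > 0:
            introns.append((g1 + size1, g2))
    return introns


print(infer_introns(blocks))
```

A gap in the query itself, like the one in the top hit above, would show up here as a nonzero query gap and would not be called an intron by this sketch.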
The other ones are very, very small, 22, 23, 37 long. So we're just gonna leave those aside for now and just concentrate on that first result. And you'll notice that I have two hyperlinks next to that first accession number there. And so I can click on either one of those. We'll start with browser first, and that indeed takes us to the genome browser. So this is the first time we're seeing an example of a UCSC genome browser page. Our sequence, the one that we just ran through BLAT, is this line right here, which is, surprisingly enough, labeled your sequence from BLAT search. Okay, so you'll remember that it was 98.1% identical to what it lined up with here in the chromosomal coordinates. So that region where it didn't quite align is out here. So you'll see little thinner regions. Those are the gaps. The red regions are mismatches. Okay, and as you look at this, so that's defined as a track, but you'll see other tracks on here as well, where we can start to look at where the positions of RefSeq genes are, and some rat ESTs. All of this information down here is vertebrate multi-sequence alignments that may point to areas of high conservation. Elliott Margulies is going to talk about this in greater detail during his lecture, but all down here, if you were to scroll down, there are many, many, many tracks that you can add or remove from this view. So really now you start to think about things beyond just the alignment of two sequences, and you now put things into a genomic context. Where are the introns? Where are the exons? Where are the SNPs? Anything that would be possibly important to discover about this particular gene's structure or function, okay? If we go back to this page, and we now instead click on the second link, the details link, we get a more traditional view of the data. So this top sequence is the one we started with, the query sequence, and the bottom sequence is the one that it found. This is on chromosome five in the rat, so chr5.
What the various capitalizations and colors mean is deciphered for you above in the text. So matching bases in the cDNA and genomic sequences are colored blue and capitalized, so that's the vast majority of what you see. Light blue bases mark the boundaries of gaps in either sequence, which are often splice sites. If we scroll down a little bit more, we just have a pairwise alignment very reminiscent of what we saw with the BLAST searches, okay? So any questions about BLAT? Okay, and we're gonna keep coming back to BLAT over and over again in almost every single bioinformatics lecture in this series. Okay, finally, in the last few minutes, let me tell you about one more sequence alignment technique called FASTA, and this really was one of the first very widely adopted methods for aligning two sequences to one another. And so this also identifies regions of local alignment. It uses something called the Smith-Waterman algorithm, but the important thing is that the method is vastly different from the one that I described to you that BLAST uses. And in many ways, it's a much more stringent method and might possibly produce a better set of results. The problem is it's very, very slow. So the reason I bring this to your attention, and we're not gonna discuss it any more than this, is just to say you might have a case where you've done a BLAST search and you go back to that cutoff line. Okay, I've given you those general guidelines, but maybe there's something above the line that you're uncertain of, or maybe there's something below the line that you think should be included, and your biological evidence might be a little bit sketchy. Apply the second technique. So in the same way that, in the laboratory, you will often use a second lab technique to confirm your results, you do the same exact thing on the bioinformatics side of the house. Often using a second program will help you rule something in or rule something out.
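For those who want to see what the Smith-Waterman idea mentioned above looks like, here is a minimal sketch that returns only the best local alignment score. It is not the FASTA program, which wraps this kind of scoring in its own search strategy. The scoring values are arbitrary illustrative choices, and the gap penalty here is linear rather than affine.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the table gets filled in, which is why this kind of method is thorough and also why it is slow compared with the word-based shortcut that BLAST takes.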
Okay, so I will leave you with just some further reading. So in my textbook, most of what we talked about today is in chapter 11, assessing pairwise sequence similarity, BLAST and FASTA. In the Mount textbook, chapter six talks about how you search databases for similar sequences. Again, these books are all available for you in the NIH library if you want to learn a little bit more about what we talked about this morning. So we will reconvene next week, using all of what we did this morning to talk about things at the protein level in much greater detail. Let me remind you once again of the Greider lecture, the Trent lecture, today at one o'clock in Masur. If you have any questions, come on down. Thanks for coming today.