 Before we get into this week's lecture, for those of you who are joining us for the first time, if you didn't pick up a copy of the syllabus last week, they're up in the corners by the door, so please pick one up on your way in. Also on the syllabus is the URL for the course website, which once again is the place where you can find information about the course. Most importantly, for those of you who may miss a lecture, this is the place where we post all of the YouTube versions of each one of the live lectures. Our goal is to have the lectures up the Thursday following the live lectures so that if you miss a lecture one week, you at least have some time to watch last week's lecture before this week's lecture. Just by way of reminder, if you are interested in CME credits, please be sure to sign the sign-in sheets in the back of the room. My disclosure for the week is right there. So in the same way that Dr. Green last week laid the groundwork for upcoming lectures in the course by providing you a broad overview of the field of human genomics last week, my job over the two weeks that I'm with you is to give you a solid foundation for the upcoming lectures devoted to bioinformatics. I mentioned last week that genomics and bioinformatics really do go hand in hand with one another, and Eric reinforced that sentiment when he showed you this figure from NHGRI's current vision document that was published in Nature last February showing the spectrum of genomics-based research going from understanding human genomes, just the structure and the actual sequence going through the biology of what's encoded in those genomes, how that relates to our understanding of human disease, advancing that into medical applications to finally hopefully improve the effectiveness of healthcare. And our ability to work across this entire spectrum really rests in large part on our ability to use computational approaches to generate the kinds of, and analyze the kinds of data that are generated in whole genome sequencing studies, in genome-wide association studies, clinical sequencing efforts, pharmacogenomic studies, the whole range of genomic science that you're going to hear about throughout this course. So given this inextricable relationship between genomics and bioinformatics, my job again during the two weeks I'm going to be lecturing is to give you as gentle of an introduction as I can to the major tools that are used in bioinformatics. You're going to see these tools and the concepts that underlie these tools over and over and over again in the next 11 weeks. So it's really important to have a basic understanding of the theory that underline these techniques and to start to take them away from the realm of being the black box. It's very easy to go to a website, you see a box, you stick your sequence in it, you hit a button, you get some results and you just assume that those results are correct and many times, in fact in the vast majority of those times, those results maybe are not biologically significant. So that's one of the major themes that we're going to hit through all of these bioinformatics lectures. So we have a lot of ground to cover today, we're probably going to go right up against the 11 o'clock mark so, but as we start I hope that you find all this background information useful and for those of you who are actively doing these kinds of studies or using these kinds of tools at whatever level, you'll hopefully find something that's of practical use in your own studies. So what we're going to do today is we're going to start off defining some terms. We're going to talk about the importance of alignments as one of the most powerful approaches that we have in the study of sequence data. We're then going to talk about how we actually evaluate how good alignments are and whether we can infer biological conclusions based on those alignments and then we're going to spend the vast majority of our time talking about two big tools that are used in biological discovery, sequence based discovery, BLAST and BLAT. And again we're going to use this as a foundation for my next lecture with you, two weeks from now that we'll now build upon these approaches. So why do sequence alignments, that's what we're going to be talking about for the next 90 minutes, and what these sequence alignments allow you to do is figure out some measure of relatedness between two protein sequences or two nucleotide sequences. And when you go through the effort of trying to align two sequences with one another, you can start to make some interesting biological inferences whether they have to do with whether two sequences have a structural relationship to one another, whether their functions are similar or whether they have a certain evolutionary relationship with one another. As we start to talk about these things, it's important to use the right words, so we have to cover a little bit in the definition department. When we align two sequences to one another, so we have sequence A and sequence B, the metric that we can use to assess, easiest metric we have to use to assess how good that alignment is is to simply count how many positions in that alignment match between sequence A and sequence B. So that observation, that observable is similarity. So the term similarity is usually expressed as a percent identity. You can use that to quantify changes that occur as two sequences diverge from one another by looking at substitutions at individual positions, insertions, or deletions. And you can also use them to identify residues that are crucial for maintaining a particular protein structure or function. And this is very important when you do a multiple sequence alignment. We'll talk about how to do those in two weeks' time because you can really pick out by looking at these multiple sequence alignments what's important from a structural standpoint, what's important from a functional standpoint, what residues do you have to preserve to have the right alpha helices or beta strands or to have the right binding pockets or so on. Now when we see high degrees of similarity between sequence A and sequence B, it might imply to you either some sort of common evolutionary history that they have a common ancestor or some possible commonality and biological function. So you may have an unknown, a new protein or a new gene that you sequence. You have no idea what it actually does in the cell. But if you find that it is similar by virtue of one of these sequence similarity methods to something where the function is known or the structure is known, you can infer a relatedness. You can start to assign function to that new, newly identified sequence that you have. Okay. So this term is pretty easy to understand. The second one though is often misused and that is the term homology, okay? So genes are or are not homologous. This is never measured in degrees. So you cannot say that something is 15% homologous or 50% homologous or 100% homologous with something else. So the similarity takes the number, the percentage, the homology part implies the evolutionary relationship. So to put a little bit of a finer point on this, a quote from Walter Fitch, it's worth repeating here that homology like pregnancy is indivisible. You either are homologous, pregnant, or you are not, okay? So and it goes on to say that if 80% of the character states, in this case the aligned residues are identical, one should speak of 80% identity and not 80% homology. So again, similarity is the metric, homology is the conclusion. Okay? Is that clear? Good. So just to make things a little bit more complicated, there are two types of homology. So let's define both of those. So the term may apply either to two genes that have been separated by an event of speciation, those are called orthologs, or between genes that have been separated by a genetic duplication event, and these are called paralogs. So a little bit more detail. When we talk about orthologs, these are sequences that have come from a common ancestor. And because they've come from a common ancestor, they most likely have similar domain structure, similar three-dimensional structure, and share a common biological function. Okay? When we talk about paralogs, they're related again through a gene duplication event, and these are kind of interesting from an evolutionary standpoint because this gives you some insight as to how mother nature has tinkered around with the armamentarium of genes that are available to start to use genes in different contexts. So the whole idea of evolutionary innovation, taking a particular gene that's already being used for one function, changing it a little bit, and co-opting it into either another pathway or into another function. Okay? I think this makes more sense when we look at a diagram. So let's say we have this gene alpha here, and we're going to call this the most recent common ancestor of three genes that are up here, gene one, gene two, and gene three. The relationship between gene one and two, two and three, or one and three, is that all of those are orthologous to one another, okay, because they rose from this common ancestor. All right. Now let's say a little further back from A, we've had a gene duplication event that gave rise to gene alpha and to gene beta. So perhaps alpha and beta are exact copies of each other. Perhaps they're slightly different, but from that point on, they've had independent lineages arising from them, okay? So beta in this case gave rise to four, five, and six. So these genes, again, because they came from a common ancestor, four, five, and six are all orthologous to one another. But when we talk about the relationship of any of one, two, or three, to any of four, five, or six, or any of the ones in this box, these are all paralogous to one another because they're related to each other way back here because of this gene duplication event, okay? Does that make sense? Okay. So this takes a little while to learn the terms, and when I was in graduate school, I had to post this on the board in front of my desk to finally learn them. But it's important to know what these terms are, because again, these do imply different kinds of relationships, and unfortunately, they're not always used the right way. Okay. So we're going to put the definitions behind us and now start to talk about the alignments themselves and how we look at the quality of those alignments. Okay. So I want you to imagine, again, two sequences, sequence A and sequence B, and we're going to try to align those sequences. And if we try to align those sequences across their entire length, from the N terminal end to the C terminal end of A and B, or if we're looking at nucleotide sequences from the five prime to the three prime ends of both of those, that's called a global sequence alignment because you're trying to force an alignment along the entire length of both of those two sequences. Now this works incredibly well when you have two sequences that are of approximately the same length and that are highly similar to one another because aligning those things over such a long stretch when they're very similar, it's an easy proposition. But as you start to see the degree of sequence similarity going down, doing these global alignments gets harder and harder to do. And when you try to force these global alignments amongst more dissimilar sequences, you start to miss important biological relationships. You may miss the alignment of an important protein domain that's responsible for either structure or function or some other similar features. So from a biological point of view, a slightly different approach is better. If we're interested in doing global sequence alignments, we do local sequence alignments. So here, we still start with sequence A and sequence B and try to align them to one another. But we don't try to force the alignment over the entire length of both of those sequences. Instead, we look for regions that align well. So the goal here is to find what in the parlance are called paired sub-sequences, just a glorified term for short regions that align well with one another. So we're looking outside the area of those local alignments or excluded. And you can imagine, because we're looking for these local alignment areas, that we could have more than one good alignment for two sequences that we're comparing to one another. So really, this is by far much better for sequences that share some similarity or when you have sequences of different lengths. And in the business that we're all in and trying to figure out what various genes do, how they're related to each other, trying to identify members of protein families, this is what we're trying to accomplish here. So not necessarily look for global alignments, but those key regions that are common from protein to protein to protein or from gene to gene to gene. So this is really the approach of choice for biological discovery. OK. So we're going to try to find these local regions. We're going to align the letters in sequence A and sequence B the best that we can. But at some point, we have to have some sort of qualitative measure to tell us how good the alignment is. So we talked about percent similarity before. That's one metric. But there are other metrics that are available to us as well. And in order to calculate these metrics, we use something called scoring matrices. So these are empirical weighting schemes that look at the physical chemical properties and the biological characteristics of nucleotides and amino acids. So specifically, side chains, their chemistries, their function, their structure. So when we look at these scoring matrices, what's being taken into account are factors like looking at the fact that cysteines and proliens are always important for structure and function. You need those cysteines in order to make disulfide cross bridges. Proliens often break alpha helices. So you'll see there are proline or a proline proline at the end of an alpha helix. Things like a tryptophane, big bulky side chain can only be accommodated in certain places in proteins. Things like lysines and arginines, they both have a plus charge on them. Things that you will want to conserve when you're comparing two proteins to one another. So the game here, what we're trying to take into account and trying to capture through these scoring matrices involve conservation. So which residues can substitute for one another and not affect the function of the protein? So things like isoleucines and valines can substitute freely for one another because they're both small. They're both hydrophobic. When we look at serines and threonines, again, we often see those substituting for one another because they're both polar. So we want to try to conserve, in this game, the charge, the size, the hydrophobicity, and other physicochemical factors. The other thing we take into account is the frequency of these individual amino acids, asking the question, how often do those amino acids occur amongst the entire constellation of proteins? Things like glycines all over the place. Things like tryptophanes, the most rare of the amino acid. So we want to account for the frequency of those amino acids as we start to score these alignments. So why do you care about all of this? Why is this actually important to you? When you do a blast search or a blast search or any sort of analysis that involves comparing one sequence to another sequence or one sequence to a set of sequences, these scoring matrices are all being used in the background. So again, we're talking about the black box here. So when you do those blast searches, these scoring matrices are being used. They're always part of a sequence analysis. The selection of them is important because each one of them, there's not just one. There's a whole slew of these scoring matrices. And each one of them implicitly represents a particular evolutionary pattern. So the choice of the matrix really can strongly influence the outcome of your analysis, the results you get back when you do that blast search. And for a lot of things that you might want to do, using the default values on NCBI's blast website may not be the right choice. So we're going to spend a good amount of time talking about these individual scoring matrices and, more importantly, some guidelines about which ones to choose. OK, how are we doing so far? Good? All right, so let's do the easy part first. If we compare two nucleotide sequences to each other, we have here just a simple match mismatch scheme. So we've aligned nucleotide sequence A with nucleotide sequence B, and we're just going to look for exact matches. Every time we see an exact match, we give that particular position two points. Every time we see a mismatch, we subtract three points. So you want to go, obviously, in favor of the matches. So for example, if we look at this, this is the actual scoring matrix. If we have a particular position where a C has aligned with a C, you just look for where these intersect, and there's the two points for the match. And you'll notice on the diagonal, you will always see the value 2. And then off the diagonal, any place there's a mismatch, you see the minus 3. So very simple. Now, this assumes that each one of the nucleotides occurs 25% of the time. So here's where the frequency business comes into play. In the nucleotide world, this is easy. The side chains are very chemically similar to one another. You can have a very simple scoring scheme. But it looks a little bit different when you get to the protein side of the world. So this unreadable slide, I know, is what is called the Blossom 62 matrix. This is the default matrix when you do a blast search on the NCBI website. So how do things work here? So as before, instead of having the nucleotides across the top and down the side, we've got each one of the amino acids across the top and down the side. The exact matches are represented on the diagonal. And it's probably easier to look at the screen than look at your handouts right now. And you'll see that the values on the diagonal, unlike the nucleotide case, are not all the same. When things are infrequent, they get higher scores. And in fact, if we have a particular position in our amino acid alignment, in our protein alignment, where a tryptophan has aligned with a tryptophan, and we see where that intersects, that particular position would be scored with 11 points. It's the highest number on the diagonal because the tryptophan is the least common of the amino acids when you look at this entire constellation of proteins. Now of course, we don't want to just look for exact matches. We want to look for conservative substitutions as well because we might learn something from proteins that are somewhat similar to the ones that we're looking for. We don't want to just look for the ones that are exactly the same. So starting with this tryptophan again, let's say we have a particular position where instead of having the tryptophan, the tryptophan has been substituted by a tyrosine. So we've got the W up here. We've got the Y over here for tyrosine. And if we now see where that intersects, two points. Still a positive value to represent that it's a conservative substitution, that the tyrosine can substitute for the tryptophan, but not as high as if you would have had the exact match. Finally, let's say you just totally mess things up. You've got a position where two things that should not be aligned are now aligned with each other. So let's say you've got a 16 that has aligned with a glutamine. In this particular case, minus 3, you're subtracting points because those two should not be aligned with one another. They don't have the kind of relationship that we're looking for. They can't serve in the same role functionally or structurally. So simply put, you get the most points for an exact match. We score higher for rare residues than we do for common residues. If you have a conservative substitution, you still earn points, just not as many as if you would have had an exact match. And finally, anytime you have a mismatch, you subtract points. Does that make sense? Good. All right. OK. So this is what the basic structure of these matrices look like. Now, I've already alluded to the fact that there are more than one of these. And this is one of a family of matrices called the Blossom matrices. These were developed by Stephen Yorga-Henekoff at the Hutch back in 1992. Blossom stands for Bloch's Substitution Matrix. And what they did was they took all of the proteins that were known at the time, did multiple sequence alignments on those protein families to look for the natural patterns of what could substitute for what, and which amino acids were more frequent than other amino acids. So this is based on empirical evidence from the sequence databases. So they're directly calculated. Because they come from real data, they are very sensitive to detecting structural or functional substitution. So really, these matrices have become the benchmark for whatever we want to assess the quality of any pairwise sequence alignment. Many years previous, there was another series, still is, called PAM, Point Associated Mutations. It was developed by Margaret Dayhoff back in the early 60s. She was one of the founders of the field of bioinformatics. But as you can imagine, in the 1960s, we don't really have very many sequences of any type. Most of what we had were globular proteins. They were very similar to one another. You'll often see these PAM offered to you as a choice in phylogenetic programs. But just because of the point in time where they were made and the kinds of proteins they were based upon, they're really not the best choice for most analyses. So whenever you have a choice between the PAM matrices and the Blossom matrices, pick the Blossom matrices. And there'll be some more information on that in your readings later. Since I hear a phone going off, if you haven't already, could you please put your phones on stun, and that would be great. Thank you. OK. So how do you now select which one of these Blossom matrices to use? When we say Blossom, there's always followed by a number. The example that I showed you before was Blossom 62. What that N represents is based on how it was calculated. So the N just says, all right, it was calculated from sequences sharing no more than N% identities. So in case of the Blossom 62, the sequences that were used were no more than 62% similar to any of the other sequences in the set that were used to calculate the matrix. I'll show you an example in a minute. More importantly, what this does is it helps to remove bias from the set. So the contribution of sequences that are higher than the number are clustered and weighted to 1. When we do whole genome sequencing, the whole history of how proteins have been sequenced and genes were found in individual laboratories, these are not uniformly distributed amongst proteins. They're not uniformly distributed amongst taxa. We have overrepresentation of certain sequences in the public sequence databases. So we need to normalize all of that out, remove the bias from the data set, and take those closely related sequences out so that we don't miss something that might actually be biologically relevant if we hadn't done that. So to hopefully make this make a little bit more sense, here is a multiple sequence alignment. There's six sequences in the set. You'll notice that there are some absolutely conserved positions, although the ones that are marked with the star and that are in the lighter blue. And if I were to now say, OK, I want you to cluster this alignment into blocks where we're going to apply an 80% cutoff. So I want you to just group this together, all of the sequences that are 80% or more similar to another sequence in the set. And when you do that, all right, so this one is not similar to any of the others at this percent relationship. These three are 80% or more similar to one another. And finally, these two are 80% or more similar to one another. Once this is done, this is counted as one sequence. These are counted as one sequence. And these are counted as one sequence. So we started with a set of six. We've collapsed it to a set of three. That's now going to be used as the basis for calculating the matrix. So again, this is how the bias is removed from the set. Hopefully that made sense. All right, so implications. OK, again, when we do this clustering, reduces the contribution of closely related sequences. We have less bias towards substitutions that are in really closely related members of a family. The end of the day, the take-home message, reducing N yields more distantly related sequences. So your first cheat sheet of the day, put a big star next to it, circle the slide. Which one of these do you actually choose after all of that? You know how they're constructed now? You know what the underlying implications are. Which ones do you pick? The one in yellow is the default again. And this was determined in studies where the answers were known in advance, going backwards to try to figure out which one should be the default on the NCBI website. That, again, is Blossom 62. It's the most effective one for finding all potential similarities between two sequences. And generally works best when you're in this kind of percent similarity range, in that 30% to 40% range. As you go further up, you'll do better finding more closely related sequences, things that are almost exactly the same. So by the time you get up to Blossom 90, you're going to get shorter alignments, but they're going to be highly similar in the 70% to 90% similarity range. Going the other way, when you go down, you start to get longer and weaker local alignments, things that might be less than 30% similar to one another. But this is really where a lot of the biological action is. You can look by eye and see two sequences are similar to one another once you have them aligned. Once you get down to the bottom is where you start making these interesting discoveries, finding out that two proteins are related to one another, even though there may be very little sequence similarity between the two of them. A great example of this, homeodomain proteins. I know everybody's familiar with the homeobox genes. When you look at the protein complement, you're looking at a 60 amino acid stretch. Six or seven of those positions are absolutely conserved over evolutionary time, and the rest of it, it's kind of willy-nilly. But if you were to just do a routine search and possibly pick the wrong matrix, you'd never find all the members of the family. So this is why the choice of these things is important. So the strategy that I advocate here is one that was originally put forth by Steve Altschul at NCBI, the guy who was behind developing BLAST back in 1990. And instead of doing one BLAST search, I want you to do three BLAST searches, okay? Use the default, pick one of the more higher number ones to give you the look for closer similarities, but also pick one of the lower ones to look for the more distant similarities. So when you go back to lab, take your sequence of interest, run it using three different matrices. The vast majority of the results will be the same admittedly, but you will start to see subtle differences that you wouldn't have found if you just did one search. So I think it's a good approach to start to find those relationships that you wouldn't normally otherwise find. Of course, if you know what you're doing in advance, if you're looking for things that are really close or really distant, you can just pick one and go from there. But if you're just really trying to cover the field, blanket all your possibilities, take the default, take a higher one, take a lower one as well, okay? So really what I'm trying to get across here is no single matrix is the complete answer for all sequence comparisons. You need to make this intelligent choice and not again use that NCBI website as a black box. It's a great resource, but there's a lot of things you can change along the way to help refine what you get, okay? Any questions? Okay, great. All right, so for those of you who would like a little bit more background, there is a unit in Current Protocols and Bioinformatics, unit 3.5 that talks more about the PAM matrices that I really went through very quickly, a lot more on the Blossom matrices and some more information about specialized scoring matrices. So we're looking at, in this case, things like transmembrane proteins, places where the usual rules don't apply because of where they are found in the cell because they're embedded in membranes and so on. So the frequencies and substitution patterns are often very different because of those either compartmentalized or embedded proteins, okay? All right, so one last thing we need to talk about before we actually get into Blast and those are gaps. So again, I want you to imagine sequence A and sequence B, you've just aligned them and there might be some place in that alignment where it would be fortuitous to improve the alignment to insert a gap, okay? Now, I want you to remember that when you put those gaps in, it's not just for sake of making a nicer alignment, it's not a word processing exercise, okay? Each time you put a gap in, it actually represents a biological event. So every gap represents either an insertion or a deletion, okay? So because of that, you can't just put them in willy-nilly again, I do this willy-nilly twice now today, okay? You can't just put them any place you want. You have to keep in mind that they represent biological events and keep them to some sort of plausible number. A good rule of thumb is about one in every 20 residues but the fewer number of gaps, the better. Now if you think about your position by position alignment again and you think about your scoring matrices, so you've got an amino acid lined up with another amino acid, you can go to your blossom matrices, see where the values intersect and assess a number for that particular position. But when you have a gap, okay? You have a residue and a nothing, okay? We didn't see gaps in the scoring matrices before. So we can't simply score those or a mismatch as we did before. So instead, there are gap penalties that are assigned with apologies, an equation. And basically all this equation is saying is that anytime you insert a gap, you take a penalty and you also take an additional penalty for the length of the gap. So one for the gap and one for the length of the gap. So the equation is just G plus L times N. So G is the penalty for opening the gap. You see the values here for both nucleotide and protein sequence comparisons. L is the gap extension penalty, this value right here. So it's two for nucleotides, one for proteins. N is just the length of the gap. So let's say I have a protein alignment, it is of length two, okay? So to assess how many points I'm gonna subtract for that, okay? So G for a protein is 11, L times N. So here's L, the gap extension penalty is one, the length is two, so one times two is two, 11 plus two makes 13. So you subtract 13 points for the gap, okay? If you'll remember, the tryptophan was 11. So you're plus 11. So to give you some measure of how these compare to exact matches. So when you start opening big gaps, you start plummeting as far as the quality of your alignment goes and the scores of that alignment actually go, okay? All right, so with that, it's time to talk about BLAST. Okay, so how many of you have used it before? How many of you have used BLAST? The vast majority of people in the room, great. How many of you actually know how it works? Okay, a few. And from the people who were left, how many people have actually ever changed the default values on the NCBI website? One, two, a few, okay, good. All right, so maybe not good. So all will now be revealed. We're gonna go through this in gross detail and hopefully you'll all be your lab's experts at this at the end of the day and again, use this tool to much better advantage, okay? All right, so BLAST is the basic local alignment search tool, BLAST, by far the workhorse in Bioinformatics, the most important tool that we use in our field because the whole act of doing these sequence alignments, whether it's one by one or one to a database, really help us discover these previously unknown relationships, so we're gonna again hit this pretty hard today to make sure how you use it, you know how to use it properly. Part of the reason for that is not just making sure you know how to use it right, but there are many unfortunately cases in the literature where biological conclusions have been made that are incorrect based on people's either misinterpretation of BLAST results, they didn't either know how to use the program or they didn't know how to interpret the results, so, and that's a pitfall that I don't want you all to fall into. So one of the things to keep in the back of your heads as we go through this are any of the techniques in the course is that just because a result appears on a screen or just because it's printed out on a piece of paper does not automatically mean it's biologically significant, okay, and we're gonna talk about that a little bit more as we go on, but just keep that in mind that that does not automatically mean it's a great result and that you should run with it. Okay, so when we do this, what we're trying to do this is a local alignment technique, so we're looking for these paired subsequences, in this case they're called high scoring segment pairs, HSPs, so a pair of sequences that can be aligned well with one another. We are going to try to align them to make them as long as possible. They have to be above a certain score threshold, we'll talk about that in a minute, and they can either be gapped or un-gapped. Because, again, we're looking for local alignments, we can have more than one good alignment between sequence A and sequence B. All right, there are various flavors of blasts that are shown on this slide, the ones that are used much more frequently are blast N, the N at the end just stands for nucleotide, so we start with a nucleotide sequence and we search that against a database of nucleotide sequences. Blast P, P is for protein, so we start with a protein query, looking against a protein database. Blast X is a little bit funky, we start with a nucleotide sequence, we translate it in six frames, and then we take that translation and compare it to the protein databases, okay? The two at the bottom are shown for sake of completion, they're not used very often, by far you'll be using the three at the top most often, and we're gonna concentrate on blast P for most of our examples today. All right, so before we actually pretend to do a search, how does the alignment actually take place? So this is a query sequence, just an amino acid sequence, and we're gonna use this as the basis of comparison to a public database. And what I've done in the middle is I've highlighted three residues, the P, the Q, and the G. Okay, so this is what is called the query word. By default, Blast will nucleate a search with a three amino acid stretch. Now of course it starts at the end, so it'll start with the GSQ and then go to the dexary SQS and move across, but for the sake of symmetry on the slide, we're gonna just start with the P, the Q, and the G. So what Blast is now going to do is look in the public database for any other protein sequence that contains in that order the P, the Q, and the G. But we also want to find places where there are conservative substitutions, places where there might be a little wiggle in the sequence, but might produce an important biologically relevant match, places where there are conservative substitutions. So certainly we're gonna look for things for the P, Q, G. What I've done for this example is just changed the middle letter here. So looking for things like the PEG, the PRG, and so on. Each one of these you see has a score next to it. You now know how to calculate this score, so if you were to take the Blossom 62 matrices you have in your handouts and just looked at P for P, Q for Q, G for G, see where they intersect. You get seven and five and six for a total of 18. The best match because it's the exact match. The next one down, we have a conservative substitution in the middle position. We see that the score goes down a little bit and the scores keep going down as we go through this. At some point Blast is gonna say, well, we've gone kind of far enough. We're now starting to go into territory where the sequences are starting to diverge too much. That's something called the neighborhood score threshold. In this case it's 13 points. This changes on the fly from search to search and depending on where you are along the length of this sequence. So you don't have to worry about changing that value. You can, but this is something I don't think I've ever personally changed in a search, okay. All right, so we have now established these other three letter words that are in the neighborhood and we're gonna use this as the basis for continuing into the next step of the Blast search. So we're gonna look for everything that has a PQG, PEG, PRG, and so on. All right, so this is from the previous slide. Here's our PQG again. And in this particular case, we've aligned with a sequence that instead has a PMG. So it's not an exact match, but if we look up here, PMG is in the neighborhood, so that's okay. And then the next thing Blast will do is starting from this position, this query word, three across, it's gonna go out in both directions as far as it can, aligning as many of these amino acid positions as it can. Okay, so we're gonna go out to the N-terminus, we're gonna go out to the C-terminus, okay. And by looking at this, before we talk about the numbers, you can also do a qualitative assessment of how good the alignment is. In this middle row, anytime you see an exact match, you see that the letter has repeated itself. Any place you see a conservative substitution, you see a plus sign. So just by eye, you can see how good the alignment actually is, and you want as much of that middle filled in as possible, okay. So now, time to bring the numbers into play. I told you, we start with that nucleated PQG, go out in both directions. Well, how far is too far? Okay, so here's our alignment again, and here is a graph. So the extension on the X-axis is just how many positions have we aligned? How long is the alignment? The cumulative score, which comes from our blossom matrix, is on the Y-axis. Back here at the first position, this is this part of the alignment, the PQG with the PMG, the 13 points that we had from the previous slide, above our score threshold, okay. So as we start to go in both directions, we're going to assume that we're gonna have more matches than mismatches. So remember, when we match, we're adding points, when we are mismatching, we're subtracting points, but we're gonna assume that we're going in favor of the exact matches, so the score as we start to extend goes up. At some point, you're going to hit this score threshold S, okay. This is also set on the fly, search to search. If you break S, this alignment is returned to you in your last results. It does not again mean that it is biologically significant. It just means it's gonna return the result to you, all right. So one thing to remember. So we're gonna break this score threshold, it's going to now appear in our last results, and we're gonna keep aligning merrily away, but at some point, we're gonna go too far. The mismatches are gonna start to outweigh the matches, so we're gonna start subtracting points at a faster rate than we're adding them. We might start having some gap penalties against subtracting points, and at some point, we're gonna hit something called a significance delay. When we go off of a peak like this, a certain amount down, BLAST is gonna say, okay, you've gone a little bit too far, we're gonna go back up to the peak, so whatever that length alignment corresponds to, so the length of this alignment at the peak, that is now your high scoring segment pair. The maximum length alignment in that region between sequence A and sequence B. Got it? Okay. All right. Now, I've just made a big deal about the scores. Okay, we need the scores to assess how good a match is, position by position by position. But we don't actually use the scores to assess biological significance. What we're gonna use instead is the result of this equation, okay? Don't even worry about what this equation is, okay? This is something called the Carlin-Altschall equation. What the score allows us to do, there's the score, based on the length of the database we're looking at, the length of the sequences we're looking at, composition, all of that, it allows us to calculate a normalized probability, and that's the value you're going to look at, okay? So why am I making a big deal out of this? If you now have some understanding of what those scoring matrices look like, I want you to think of two alignments. So we have maybe sequence A compared to sequence B, and that might be of a certain length, but it's all composed of common residues. You may have another alignment of sequence C and sequence D that is of the same length as the first alignment, but it's all composed of rare residues, okay? You already know that rare residues score higher than common residues. So you could end up with a situation where you have two alignments of the same length with the same number of positions matching up, but because of taking this rare versus common thing into account, the score of one is gonna be vastly higher than the score of the other one. So you need some way to compare from alignment to alignment to alignment some normalized measure that allows you to say, okay, this one is good and this one is not so good. So that's why the score itself is not a good measure, but it is the key for us to get to the probability value, I hope that made sense. Okay, so at the end of the day, you're gonna forget about all of that. This is what's important. What that E value represents to you is the number of high-scoring segment pairs that come from comparing sequence one to sequence two that were found purely by chance, the number of false positives, okay? So because this represents the number of false positives, you want E to be as low as possible, okay? So lower values of E signify higher similarity. And again, because this is a normalized value, you can use this as a metric to compare any two sequences, excuse me, any two alignments to each other. You don't have to worry about lengths and other vagaries. Okay, so rule number one, okay? If you are doing a nucleotide-based search, you've started with a nucleotide sequence, you've searched against GenBank, the rule I want you to apply here is you're gonna look for values of E that are 10 to the minus sixth or less. If you are doing a protein-based search, so this is blast N and blast X, remember blast X, even though you're starting with a nucleotide sequence, the comparisons at the protein level, you're looking for values of 10 to the minus third, 0.001 or smaller, okay? So that's rule number one, and we're gonna come back to the rules in a moment. All right, questions? Good, all right. Okay, so let's go through an example finally after all of that background. And again, we're gonna consider a blast P example. We're gonna pretend we're at the computer. The screen captures are all in your handouts or everything I show you, you have, and I would really recommend that you go through these exercises when you get back to your labs or to your offices. It's one thing to watch me explain this to you, it's another thing for your hands to get onto the keyboards and actually do them yourselves, that's the only way you're gonna learn how to do this stuff. Okay, so here's the NCBI website at ncbi.nlm.nih.gov. Let me orient you to the page a little bit. There are some common features, some popular resources that are shown in the right sidebar here, and the very first one on the list is blast, so we're just gonna pretend that we've clicked on that, that will take us to the blast page, or you could just go directly to this URL. There are three sections on this page. The first one allows you to search assembled genomes directly, so if you know that you only wanna search against a particular organism, you can pick one of these off the list and there's actually a link that will take you to all of the ones that are available to you. At the bottom, there's some specialized blast stuff, we'll come back to that a little later, but what we're gonna concentrate on now is this middle block basic blast, and it says nucleotide blast, protein blast, and so on. So we're gonna do a blast P search, we're going to click on that protein blast link. Before we can actually do the search, we need a sequence, okay? So again, for you to all reinforce these concepts when you get back to your offices or to your labs, there is a website at this URL that has all of the sequences that I'm using as the examples for these lectures, so please go to this website, copy them, go through step by step exactly what we're doing here. All right, so this is now the blast query page. So at the very top, we have a sequence box, so from that preceding page, I just took the first sequence, pasted it into the box. If I just leave that the way it is, it will search the entire length of that sequence from position one to however long it is, but let's say I only wanna use part of the sequence, I can tell blast, all right, only use a subrange, started position 10 and only go to position 50. If you have a reason for picking a particular domain, you can do that rather than trimming the sequence manually. Okay, in the next block down, we have some choices to make regarding what we wanna search. We know now what our query is, what's going to be the subject, the target of our search. By default, it says non-redundant protein sequences or NR, so this is what people colloquially called GenBank, okay? All of the sequences that are publicly available through NCBI, so the non-redundant database. There are a couple of other interesting ones though that I wanna talk to you about. The first one is called RefSeq, and the second one we're gonna talk about in a couple of weeks called SwissPro. But these are specialized databases because they are highly curated. So let me tell you a little bit about RefSeq. So RefSeq stands for reference sequence. This is NCBI's reference sequence project, and the goal of this particular project is to represent each molecule in the central dogma, each DNA sequence, each protein sequence, or each mRNA sequence once and only once. So those of you who have done sequence-based searches before may have come across the situation where you put in a gene name, DCC, and you might get one, five, 10, 20 entries back, and you sort of quizzically look at the page, and you wonder, well, which one is the right one? Which is the canonically accepted sequence for that particular gene? So that's the whole point of this effort, is that you don't have to worry about trying to figure out which one is the right one. The curators at NCBI have already done this for you. What, and identified which sequence is the accepted correct sequence for a particular gene, mRNA, or protein. Because of this curatorial effort by definition, this is a non-redundant database. Each sequence is represented once and only once. And the cool thing is that there are curators, I believe somewhere below ground in building 38A, I'm not sure how many sub-basements down, who read the literature and make sure that all of the things in the entry are kept up to date. Okay, now, how do you know when you're looking at a RefSeq entry? So I wanna introduce the concept of the accession number, and you're gonna hear that term many times throughout the course. So the accession number is just a unique identifier that when you use that as the basis for your search, will return one and only one sequence. So in essence, it is that sequence's social security number, okay? They take various formats, they have codes, and the way you can tell which ones are in the RefSeq series is by looking at the first couple of letters. So the first set I have up here says from curation of GenBank entries, they all start with an N. So the ones that are NT and underscore and then six digits, these are genomic contigs. These are the sequences that come off of the sequencing machines in the kinds of whole, large scale, whole genome efforts that Eric Green spoke about last week, just cranking all of this genome data out. By the same token, the NMs are the mRNAs, and P, the P stands for proteins, and there's also entries for the non-coding transcripts, which are shown by NRs. So the thing that all of these have in common is that these were directly experimentally verified. Somebody had these molecules in their hands at some point and directly sequenced them. Now you can imagine a situation where because of these large scale genome sequencing efforts we do, the sequences are being cranked out at just amazing breakneck speeds. Sometimes what we do is take these genomic contigs, the data that comes off the sequencing machines, and we apply computational techniques to try to predict where we think the genes are, where the entrons are, where the exons are, genomic structure, all of that. So what we can start with one of these genomic contigs and then if one of these computational techniques has been applied, we would then have a series of model mRNAs, the X series here, so XM for model mRNAs, and then based on those predictions, in turn we could predict where the model proteins, what the model proteins are, and these are the XPs. So the important takeaway here is when you do a search against RefSeq, you always want to go in favor of the ends because again, these have all been experimentally verified. If these are not available, take a look at the Xs, but keep in mind again that these are computational predictions, they have not been verified. The computational predictions are pretty damn good, okay, but they are not foolproof. So you can look at those to get an idea of, well, I think I can make some sort of conclusion, but you're gonna need some other sort of verification to support that. So again, the ends are experimental, the X are predictions, clear? Okay, very important. Okay, so we can pick, we're going back to our last page, we can pick a page, excuse me, a target database that we want to use. Let's say we want to limit our search to particular organisms or particular types of organisms. We could just put terms in this box here so you could say human, or bilaterian, or whatever you'd like to limit the results that you get back, okay. In this section here it says program selection, blast P is already selected for us. You'll see that there are two other choices here, side blast and five blasts, we will talk about those in two weeks' time. Now, most people at this point will just click the blast button and go off on their merry way, but if you'll notice underneath, it says algorithm parameters. So we're gonna pretend that we've now clicked on that, it will expand the screen, there's more below. And this is where we now get to fine tune our searches. So all of the stuff that I talked to you about in the first 45 minutes comes into play here. Okay, first thing it asks you, how many sequences do you want returned to you? The default is 100, okay. I've, for purposes of the example, changed this to 250. I would rather get too many results back than too few and make the decision myself about what is relevant and what is not relevant, okay. And we're gonna see in this example that this is critical. Okay, expect threshold. What E value cutoff do you want to apply? So I've already given you a rule about the E values, 10 to the minus third or less when it comes to protein sequence comparisons. The default is 10. For purposes of the example, we're gonna leave it at 10, okay, but you can change that there if you want. Right below, you can see the word size is three, R, P, Q and G. Okay, going a little further down, here's now where you can pick the matrices. So remember, I'm advocating you using multiple matrices. Here's where you make that selection. Above that, before we get to the next one, you can see gap costs. Remember, I was telling you about the affine gap penalty. If you wanna make the gaps more permissive or less permissive, you can change the default values here. So by default, it's 11 for the opening the gap and one for the extension. If you wanna make it more permissive, you'd push those values down. If you wanna make it less permissive, you'd push the values up. Okay, filters and masking. So always you should check this particular box off to filter for things called low complexity regions. So what is a low complexity region? So these are regions of biased composition. What you'll sometimes see in sequences are just long homopolymeric runs, short period repeats, or just subtle overrepresentation of the same residues. So here's an example where we've got a homopolymeric just run of A's and Q's and alanine glutamine track here. And unfortunately, BLAST really relies on having somewhat of a uniform distribution of residues. So when it sees something like that, it sometimes doesn't know what to do with it. Because you've got all of this, in this case, all these A's and Q's, if it's got another sequence with an equal run like that, it doesn't quite know what register to put them in, okay? So it's best to actually just remove those, mask those from the sequence to not have the results get confounded. This usually just biologically, this usually comes from either DNA replication errors or you might have some unequal crossing over still a question mark on that one. But as far as these analyses go, it's important to mask them because again, often leads to false positives. So you just get in the habit of checking that box off when you do these searches, okay? Finally, we've gotten to the bottom of the page. It tells us we can go ahead and search. There's a little check box here that I always check off that says, show results in a new window, okay? So when we finally click that last BLAST button by checking that box, your results will, as it says, appear in a new window. The reason I like doing that is if you don't do that, everything's gonna be in the same window and if you wanna back up, it's harder to do that. So just from a practical standpoint, you've got your results in a different window. If you wanna now go back and make an adjustment, you still have your original window without having to put the sequence back in, pick all the same parameters again. It's just a nicer way of doing it. Okay, so finally, time to issue the query and this is what you get back. Okay, now you'll recall the default value for the number of results to return from the target database was 100. If you look at the screen, it says distribution of 204 BLAST hits on the query sequence. So we would have not seen the vast majority of our results, which in this case, most of them are actually biologically significant as you'll see in a moment. Above this box, you'll see that BLAST has identified some important protein domains and again, we'll come back to this in a couple of weeks. So let's concentrate on this box here. So this gives you sort of a general overview of your BLAST results. At the very top, you have a color key that shows you just sort of a heat map of as the scores get higher, we move from the blacks to the reds. Remember, the score is now what you're looking at. You were going to look at the probability values, but we'll just work with this for now. The alignments, each one of the hits that have been found, so anytime I use the word hit, it just means a sequence that was found in the target database that was deemed similar to the one that we started with. So the best ones by score are at the top then going down in descending score order. If you were to take your mouse and just mouse any place over any of these bars, there's a window at the very top that would give you the identity of what protein it is that was hit in this particular analysis. What's next? Okay, now you'll see some of these are connected by these very faint thin bars. Some of them are not. Whenever you see two of them connected like this, this means that we have a protein that was found in the public database where there are multiple high scoring segment pairs. So in the very case at the top, there were two separate alignments for sequence A and sequence B. The arrow is showing one, two, three, four. Some places you'll see this very light vertical bar. And what that is intended to convey to you is that in these cases, there's some space between this high scoring segment pair and this one. In this case, either the space is too small to show, but it wants to give you an indication that there is a boundary there. Or you have two high scoring segment pairs that overlap with one another. Okay, and that happens quite frequently. So in, let's find a good one. We'll come back to it later. Okay, okay. Anytime you see things on a line that are not connected, they're unrelated. It's just done to save space, okay? So that's sort of the overview. But we're gonna get, I like numbers, so we're gonna see much more useful information when we scroll down a little bit and look at the actual results. So this is the table that you get back nicely formatted. The accession numbers, the unique identifiers are in the first column. And if you were to click on any of those, would take you into GenBank and you could just see the GenBank entry for that particular protein. In the description column, you're given a brief one liner about what was found, usually just the name of the protein and the organism that it's in. The third column gives you the score. Remember, the score is not important. The E value here is, and thank you to NCBI for finally putting them in E value order instead of score order for many years. Those of you who may have done this before will remember that the results were always sorted by score instead of E value. You'll notice that there is a small up arrow here. So the sort is exactly the way you want it by default. But if you for some reason want to sort on one of the other values, you can just click on the column header and it will do that for you. You have some links in the final column. The cheat sheet is above the table here. So if you see U, U stands for unigine, which gives you some information from a gene-centric standpoint about gene clusters. E is for GEO, the gene expression omnibus. So if you see a E next to one of these, we don't have one here. That just means that there's gene expression information available to you as well. The G stands for entree gene, a nice gene-centric view of the world. So if you want to know more about the gene that encodes for that particular protein, you could go there. A couple of cases here we have Ss, Ss for structure. So that means that there is a solved three-dimensional structure either by X-ray or NMR for that particular protein on that line. Might not be the whole protein, might just be a domain, but there is some real structural information available. M would take you to the NCBI map. Your arterial will cover that very briefly next week. And finally, if there's some bioassay information available through PubChem, you'll get that as well. All right. So that's the orientation to the table. This is the column again we're paying attention to at the very top. You'll notice that the values are zero. So what zero just means is that it is less than 10 to the minus 1,000. So basically there's just not enough room to show the value. So they'll go out to three digits in the exponent, but not four. So if you see a zero, you're sitting pretty at that point. OK. As we go down, the values still look pretty good, given the criteria that I've given you before. And we're going to pretend to, oh, I'm sorry, just so you know how to read these. If you're not used to the notation, if you see something like 4E minus 97, that just means 4 times 10 to the minus 97. So the part after the E is the exponent. OK. All right. So now we're going to pretend to scroll down. And the values, if this is the E value column again, they look pretty good. We're going. We're going. We're going. But at some point, we now are going to apply that 0.001 criterion that I gave you and draw a cutoff line. OK. So just a couple of things to reinforce here. We changed how many hits to return. If we had stopped at 100, we would have only seen something maybe further up here and not seen any of the ones at the bottom. We left the E value at 10. And I think that's actually OK, as long as you know what the cutoff values are, you can draw the line yourself. So in this particular case, the line would go here because this one, 10 to the minus 7, that one's OK. But 0.006 is just slightly above our cutoff guide. So everything from there up, we're going to accept for now. And everything below, we're going to reject. We're going to come back to this point in a minute because you'll see in, if you look at these def lines, you see a protein called Prospero above the line. But you also see some below the line. So we're going to talk about that in a minute. Just keep that in mind. All right. We're now going to imagine we've scrolled further down. We're now looking at one of the alignments. And so you'll always see something like this. At the very top, it tells you what the hit is. So again, here's protein Prospero. In this case, the first high-scoring segment pair starts at position 17 of the query that's aligned with position 317 in the subject. And you can see how the numbers go across as you go down. The part to pay particular attention to is right here. So expect, that's your E value. Again, this one is 0. That one's good. You see two metrics here, identities and positives. So identities are the number of exact matches, percent identity. In this case, it's 100% identical. The positives are the exact matches plus the conservative substitutions. In this case, because the identity is 100, the positives have to be 100. But we'll see those numbers changing momentarily. So here's where guideline number two comes into play. This is a protein-based search. What I want you to look for is greater than 25% identity between the two sequences. If it was a nucleotide search, 70% or more. You'll see that some of these amino acids are in lower case and grayed out a bit. That represents the low complexity regions that we talked about before. So these have been masked out for purpose of the comparison. Now we're going to pretend again we're going to scroll down just a little bit. This is the bottom of the piece that we just looked at, ending at position 704. And you'll notice here now is another block of statistics. But unlike, let's just go back one real fast, unlike the previous one, we don't see any deaf line information. We don't see any protein information being given to us. So what that means is that it's the second high-scoring segment pair for that same protein. So our protein query against the same sequence, different alignment that was found. As before, we can see where the sequences align. We have our statistics. So here, 93% identical, and again, 93% positives. Here's our first case of a gap. So you can see what that looks like when you have a gap in the alignment. And in this particular case, there's actually a third one in this set as well. So if we look at this overview again, here is just the two little blocks of scores from the alignments in the two previous slides. So the very first one we had and the second one, our E values were 0, so that beats our 10 to the minus third cutoff. We look at the percent identities. We want this to be more than 25. Well, both of those values beat our cutoff. And if we just look at where these are, here's the first high-scoring segment pair from 17 to 704 in the query, this bar right here. Here is the second one going from this black mark, going all the way over. And the third one, which we didn't take a look at, is right here as well. OK. Questions? OK. I know it's starting to get a little bit overwhelming at this point. OK, so cheat sheet number two to try to bring some rhyme and reason to all of this. OK, so just by way of reminder, if we're looking at nucleotide sequences, we want an E value better than 10 to the minus 6, sequence identity 70% or better. If we're looking at protein sequences, we want an E value 10 to the minus 3rd or better, 25% sequence identity or more. OK, now these are guidelines. OK, and I told you to pay attention to what's above and below the line. I don't want you to use these cutoffs blindly. These are just guidelines as a starting point as you get more and more experience with using blast. OK, you have to pay attention to what's on either side of the line. So if you see things that are below the line, that you have biological evidence that supports that thing below the line, that protein below the line, is actually related to the one that you started with, the one that you used as your query, your biological evidence will always trump the computational prediction, and you will accept that as being related. So whenever you've got biological information, you always use that first, OK? And then you use these to either confirm the relationship or again to discover new relationships but don't just say, oh, Andy told me I have to draw a line here and so I'm not going to consider this. Always make sure to dig back into what you know from the literature, what you know from your own studies when you're looking at things on either side of the line. There might be things slightly above it that you want to reject. There might be things slightly below it that you want to accept, OK? So that will help you not fall into the pit again if you keep your biology into account, OK? Just very quickly, some things to keep in mind that can mess up a blast search. So again, we've already talked about low complexity regions. Repetitive elements also for the same reason because of the repeats tend to confound the analyses and I give you some ways of dealing with those here. Low quality sequence hits are sort of less of an issue now that we're cranking out sequence at the pace that we are but they still exist in the databases. So you may occasionally, if you're doing nucleotide searches, get hits to express sequence sags to ESTs or to single pass sequence reads. There's a Poisson distribution that governs how good a single pass read is as far as accuracy goes with respect to genome sequencing. So when you only sequence once, the accuracy is on the order of something like, what is it, 63%, OK? When you do the second pass, it goes up to 87 and then so on. So that's why we do genomes at 10 or 12x coverage to make sure we've got it right. So just keep that in the back of your mind if you hit one of these lower quality reads. You may want to discount the result. All right, so that is how you use BLAST. And hopefully that has helped to take it a little bit more out of the realm of the black box. There are some variations on BLAST. We're going to talk about one of them today called BLAST two sequences. So let's say you've got sequence A and sequence B. You don't want to search against the databases. You just want to do an alignment of A with B. So this is the tool that you would use for that, either at the protein level or on the nucleotide level. All of the BLAST programs are available to you. You can, as before, select any of the BLAST and matrices, any of the PAN matrices. The scoring works exactly the same. So we're back to our BLAST homepage at this URL. We're now going to look at this bottom block here, these specialized BLAST searches, and just click where it says Align two sequences using BLAST. Once you click on that, the screen looks very similar to what we saw before. At the very top, BLAST P has been highlighted, the BLAST P tab. But what's slightly different from before, before we had a query sequence for one sequence, because we're now comparing two pre-identified sequences, sequence A, sequence B. And as before, if we want to select a sub-range of that sequence, we can. BLAST P is selected by default. It's the only thing you can do here. Again, as before, we can take a look at the algorithm parameters. So this will take us further down the page. In the first block, how many sequences do we want returned? E-value, the E-value is somewhat irrelevant here as far as performing the search, because in advance, identify which two sequences you want to align. But you do want to look at the result later to get the measure of statistical significance. You can change the word size and so on. As before, matrices are selected the exact same way. You've got your low complexity regions here. You're going to make sure you always check that box. Check the box to get the results in the new window. And then finally, go ahead and BLAST. And the result format is exactly the same as what we've seen before with minor exception. We only were comparing one sequence to one other sequence. So there is only one line here in our overview. The results formatted exactly the same way. So this was the sequence in the second box, which was used as the target. It tells us the E-value is 4E to the minus 24. So we accept that using our criteria. And as before, we look at our percent identities. The cutoff is 25% or more. Here, these two sequences in this region are 46% identical to one another, 74% similarity. So just a very quick way to align two sequences. Here's the first high-scoring segment pair. Here's a very teeny one as well. This one, we would reject. Because here, if you look at the E-value 1.9, it's way above our threshold. Another way to look at these, here's just a graphical view of where those two alignments line up. So this is sequence one starting at position one, going to position 178. Here's sequence two going from 1 to 134. This was the statistically significant high-scoring segment pair that we identified going from 95 to 178 along here, going from 51 to 134 going up here. This is the one that we rejected, but you can see that it overlapped this one as well. So we're going to, again, by basis of our statistics, just reject this one outright and accept this as the best alignment of the two sequences. All right. Very quickly, we're going to talk about nucleotide blast. So two methods, one's called megablast, the other one's called blastn. I'm not going to walk you through an example. I'm just going to show you how to actually do it. So again, we're going back to our blast homepage, the center section, nucleotide blast, is the first option in the second section. It takes us to our blast page. And what I want to zero in on here is a section you have not seen before where it asks you which nucleotide comparison program do you want to use, megablast, discontinuous megablast, or blastn. And that's, of course, incredibly obvious. So here we go. So here's a table that just gives you what the word lengths are, the plus minus. So for matches and match, mass matches, what the scores are, how the gaps are scored. So affine is what I described to you before, where it's a penalty to open the gap plus the extension. When it's linear, you don't assess a penalty to open the gap. You just score the length of the gap. You subtract points for the length of the gap. If you have two sequences that are highly similar to one another, really long stuff, chromosome length, big contig length stuff, you want to use the first one which is called megablast. So this is what you use if you've got very long sequences or highly similar sequences. They mean really highly similar sequences, 95% or greater. You're going to get very long alignments back and it's incredibly fast. I'm going to jump to the lower block real fast. So the lower block is if you've got really short stuff that you want to make sure you've got almost exact matches for. These are the parameters that NCBI recommends for this but they don't recommend putting a cut off in so this is some place where manual inspection is very important. In the vast majority of cases you're in the middle where you've got things that are somewhat similar to one another, the results that you would get using either a discontinuous megablast or blast and are going to be not quite exact but they're going to be pretty much very similar to one another. The only difference is that megablast is going to run much faster in relative terms. It's like how fast can you word process, right? So you might want to just do both of those methods to see if you get slightly different results when you do your blast and searches. So that's all we're going to talk about on the nucleotide side of the house and we're going to keep coming back to this throughout the course so you'll see some more treatment of this later on. Okay, so finally in the last 11 minutes we're going to get to the last thing on the list which is called BLAT. So everything we've looked at so far, the scoring matrices, all of that stuff has revolved around NCBI's basic local alignment search tool but the same basic premises of how the scoring matrices are computed, how they're used to assess similarities, how searches are nucleated to find other sequences that are similar to one another also apply to BLAT. So BLAT is a tool called the blast-like alignment tool that's made available by UC Santa Cruz and what this is designed to do, like megablast, is to really rapidly align longer nucleotide sequences, so things that are longer than 40 in length that have really, really high sequence similarity. So again, we're talking things on more of a genomic scale at this point. It really is a method of choice when you're looking for exact matches in nucleotide databases. It's incredibly speedy, 500 times faster than BLAST for either mRNA or DNA-based searches. Because it is pushing in the direction of exact matches, you might miss divergent or shorter sequence alignments. That's when you're gonna go back and use BLAST N instead over at NCBI. It can be used on protein sequences. Truth be told, I've never actually done that. I've always used this on the nucleotide side, but just that you're aware that you can do that as well. All right, so when do you use this instead of one of the BLAST algorithms? All right, you might have a particular unknown gene, or you might have a sequence fragment, so it'll allow you to place it onto a chromosome to find its genomic coordinates. It allows you to help determine the gene structure by virtue of this comparison. Where are the introns, where are the exons? Or it might help you find some markers of interest in the vicinity of a sequence, so there might be upstream regulatory elements of other important genetic features of note. Laura Olnitsky will talk about this in great length when she talks about epigenetics and upstream regulatory regions. You, as already inferred, use it to find highly similar sequence, so other gene family members or putative homologues. You know what a homologue is now. And also just to display your sequence as a specific track. So track is a new concept. It's just a way to describe linearly a bunch of sequences that are related to one another, and we'll see an example of that in a moment. Okay, so we now leave Building 38A. We go to the West Coast, genome.ucsc.edu. This is UCSC's genome bioinformatics homepage. All of the tools that are available to you are on the left-hand side. The fourth one down is BLAST, so we're gonna pretend we're clicking on that for purposes of the example. Again, I just pasted a sequence into the box from the sample sequence page that I told you about earlier. This is a rat, cDNA clone, so I've changed the genome to rat. It's using the latest assembly, so by default it will give you the latest set of information available to you. We'll touch on that a little bit next week and why it's important to keep track of the assemblies. The query type, it's a DNA sequence. Here, we're just gonna submit it, okay? There are ways to change the parameters, but we're just gonna go ahead and click on the Submit button and see what we get back. And we end up getting back a very simple format on our results. And they are, in this case, shown in score, you don't see probability values here, okay? The best hit is the one at the top, so because it has the highest score, it is also the longest one, so the span in the last column tells you how long the alignment is. So in this case, we started with a sequence, it aligned from position one to position 733, 98% identity, it's on chromosome five on the plus strand, and it gives us the genomic coordinates of where it landed. One thing to look at it that way, we can now see it in the context of the genome browser by clicking on the browser link. So this is now what a standard UCSC genome browser page looks like, and again, Tira will make you experts in how to use this by the end of next week. Very simply, here's a red bar on this ideogram here, so we can see that we are in 5q31 is where our sequence landed. It's right here, your sequence from black search, and the accession number is at the front. You'll remember that this was 98.1% similar to the chromosomal sequence, so there are some gaps out here to show you where there have been gaps inserted. You'll also see very faintly there are some red bars that are in there as well, and those just indicate to you the positions of mismatches. On your handout, I've given you a color key, so red genome and query sequence have different bases at this position. You can read for yourselves when you see something in orange, in purple, in green, and what those all represent, so different characteristics of the alignment. Okay, there are a slew of other tracks of here that are of great interest, so there are some ratty STs, some conservation tracks that tell you how well the sequences line up between, in this case, mouse and human and dog and so on in this particular region. There's SNP information available to you, and a whole slew of really cool genomic information, and again, Tira will show you all of that next week, okay. There's another way to look at these results by clicking the second link over here, the details link, so we get a more text-based result here. The first sequence here is the one we started with. This was the query that I pasted into the black search page. This was the result. You'll see that some of these are capitalized or in different colors. It tells you at the very top what all of that means, matching bases in cDNA and genomic sequences are colored blue and capitalized. Light blue bases mark the boundaries of gaps in either sequence, and if we just pretend we scroll down a little bit more, here is the alignment in the more traditional way that you're now used to looking at these alignments, okay. So again, very powerful tool gives you chromosomal context, whether you're looking at human sequences, rat sequences, what have you, and allows you to now start thinking about things, not just between the two sequences lined up, but a whole just amazing collection of genomic information that may help you think about how that particular gene is implicated in certain biological processes and perhaps in certain disease processes as well. Okay, the last item, it's called Fast A. So Fast A actually predates BLAST as a sequence similarity method. It, like BLAST, identifies region of local alignment. It uses a very complex algorithm called a Smith-Waterman algorithm to do the comparison. So the method is significantly different than the one I described to you earlier in this lecture, but it's an incredibly robust method. While we don't have time to talk about it today, I wanted to point it out because again, if we think about where those cutoff lines are, if you want confirmatory evidence by some other method for some of those BLAST hits, you could go ahead and do a Fast A search or use some of the methods I will tell you about in week four to help you confirm or reject results that are around the cutoff line. So the same way that in the lab, you might use a second or a third method to confirm your results or your conclusions. You do the same thing in the computational world as well. Here's where you can find some online implementations of that and I know we covered a lot of ground in the last hour and a half. For those of you who would like some more information on how to use these methods, some more of the fine points of how these methods are utilized to answer specific biological questions, you can read chapter 11 in my own textbook which talks about both BLAST and Fast A in the Mount textbook by Informatics chapter six talks about sequence similarity as well. Both of these books are available in the NIH library so you can just check them out at your convenience. All right, so that's it for this week. What we're gonna do two weeks from now is build on all of this stuff to start thinking about things in more context, looking at profiles and patterns and motifs. We're gonna talk about structure a little bit, something that usually scares people but actually is pretty cool and it's not that hard to do and talk about how to start constructing multiple sequence alignments. So just by way of reminder next week Tiro Wolfsburg will pick up from here, talk to you about genome browsers, the stuff that I alluded to towards the end of today's lecture. So thank you for coming. I'm glad to take any questions at the podium. We'll see you again next week. All right.