 Okay, so today's lecture is devoted to biological sequence analysis and in the same way that Eric Green next week is going to lay the groundwork for a lot of the lectures that follow in this course by providing you a broad overview of the field of genomics. My job is to provide you a solid foundation for the upcoming lectures devoted to bioinformatics. As I think is obvious to most people at this point, genomics and bioinformatics go hand in hand with one another and this sentiment is best illustrated by this figure from NHGRI's current strategic plan showing the spectrum of genomic-based research going from understanding the structure of genomes, understanding the biology that is encoded by those genomes, moving on to disease, advancing the science of medicine and improving the effectiveness of healthcare. And our ability to work across this entire spectrum really depends in large part on our ability to use computational approaches to analyze the data that are being generated in the course of whole genome sequencing efforts, genome-wide association studies, clinical sequencing efforts, pharmacogenomics studies or any of the many other kinds of genomic science that you'll be hearing about throughout this series. So given this inextricable relationship between genomics and bioinformatics, my job again during the two weeks I'll be lecturing is to give you as gentle of an introduction as I can to some of the major tools that are used in bioinformatics. You're gonna see these tools being applied over and over again in many of the lectures in this series and the concepts that are underlying these tools are very important to understand. So having that basic understanding will take these techniques, many of them that you may have used before without quite understanding how they actually work and take them away from the realm of being the black box. So I hope you'll find this background useful and that you'll pick up a few strategies along the way that will hopefully be of practical use in your own studies. So we're gonna start off with a discussion of sequence analysis methods by reinforcing the importance of alignments as one of the most powerful approaches we have in the study of sequence data, specifically with respect to determining similarity and deducing homology. Those are words that carry special meaning and provide the underpinning for much of the fields of comparative and evolutionary genomics. So why construct sequence alignments? And you could ask this question, it applies to whether we're talking about a pairwise alignment comparing two sequences to one another or whether we're looking at a multiple sequence alignment comparing a collection of presumably related sequences. So these alignments allow us to come up with some measure of relatedness between nucleotide or amino acid sequences. It allows us to draw biological inferences regarding structural, functional, or evolutionary relationships. And because of that, it's really important to use the correct terminology when we talk about these phylogenetic relationships. So let's go back to those terms that I just mentioned a moment ago. I want you to imagine that you have two sequences, A and B, and that you align them. And the easiest way to convey how related these sequences are is to simply count the number of matches and mismatches you have between those two sequences. And when you do that, you have a quantitative measure called similarity. So similarity is always based on an observable, again in this case how many residues are lining up with each other. It's usually expressed as a percent identity and what it allows you to do is quantify changes that occur as the two sequences in question diverge. So looking for substitutions, insertions, or deletions. It also helps to very quickly narrow down which of the residues that have been aligned are important for maintaining a protein's structure or function. So whenever we see a high degree of similarity, again this observable metric, it will imply either a common evolutionary history or possible commonality in biological function. Okay, so regions that are important from a functional standpoint when thinking about things at the protein level, regions that might be important at the nucleotide level which are involved in things like expression. So the second term we need to define is the term that is more often than not misused and that term is homology. So homology implies an evolutionary relationship. And when we have two proteins that are called homologues, that means that their genes have arisen from a common ancestor. Two genes or two proteins either are or are not homologous, okay? This is not measured in degrees. This is the conclusion that one draws based on a similarity metric. And there's a very good quote that I always like to show that drives this point home. This comes from Walter Fitch, one of the grandfathers of the field of bioinformatics. It's worth repeating here that homology, like pregnancy, is indivisible. You are either homologous, pregnant, or you are not. And this quote I show because you will not forget this at this point that this is the conclusion. You never say 80% homology. You might say 80% identity when comparing two sequences to one another. But the conclusion based on some degree of identity is homology. Now, just to make things even more complicated, the term homologue might actually describe several different types of evolutionary relationships. And we're going to focus specifically on two of them here. So the first one is called orthologs. These are genes that have diverged as a result of a speciation event. So put more simply, you can just think of orthologs as the same gene in different species. Because of that, their direct descendants of a sequence in a common ancestor, they most likely have similar domain and three-dimensional structure, usually retain the same biological function over time and can be used to predict gene function in novel genomes. The other term, the other is called a paralog. And paralogs are genes that arose by the duplication of a single gene in a particular lineage. So here you can simply think of paralogs as similar genes in the same species that may have perhaps co-opted a pre-existing gene for new biological functions. And we see that all the time over evolutionary time, the whole concept of evolutionary innovation. Okay, so trying to put this all together in a diagram, because I realize that these terms are confusing. Let's start with here, gene A, that is in some common ancestor. And over time, there were a series of speciation events that gave rise to these three genes up here, one, two, and three. So since they're all direct descendants of this gene alpha, one, two, and three are orthologs of one another. All right, now if we roll the clock back a little bit and consider what happens when you have a gene duplication event, so we'll go back in time prior to alpha, some event happened here giving rise to both alpha and beta. Alpha, and perhaps it's an exact copy, perhaps it's slightly different. And in the same way that this lineage on the left has evolved over time, this lineage on the right has evolved over time, somewhat differently in this case. So the relationships here, again, one, two, and three are orthologous because they all arose from the common ancestor A. Four, five, and six are orthologous to one another because they all arose from ancestor B. But any pair of the genes that arose from alpha or from beta are paralogs of one another. So again, these are the genes that are related through a gene duplication event that you can trace back. So again, this admittedly can be confusing. I remember when I was starting out, I had a diagram like this over my desk. I would recommend you do the same. And of course, to make things even more complicated, there are sub-definitions of each of these. So I refer you to two papers. The first one is the Walter Fitch paper where the pregnancy quote came from. The second one is a very nice review by Eugene Kunin on orthologs, paralogs, and evolutionary genomics. So for those of you who want to delve a bit deeper into this subject. Okay, so that takes care of the terminology. How do we actually figure out these relationships? How do we identify these orthologs? Again, the same gene in different species. The methods it's most commonly used and straightforward is something called looking for reciprocal best blast hits. Okay, and you don't have to use blast or we're talking about blast, but we just want to use any method that deduces similarity between two sequences. Okay, so let's say we have a bunch of sequences from organism one and a bunch of sequences from organism two. Okay, and again, what we're trying to do is find the genes in this genome that are most similar to themselves in this genome. So we'll start with the first gene up here. We do a similarity search and it identifies this particular gene in organism two. You now do the reverse. You do the search using this particular sequence and if it hits the same sequence in organism one, you can make the deduction that those things are, those two genes are orthologous to one another because they both found each other in these reciprocal sequence searches. And this is a method that works really well. It has a very low false positive rate, but it doesn't always work. So let's take a second case where we'll take this particular gene, do the similarity search. It finds this gene in organism two, but then when we use this as the search query, searching against the genome, it hits one of the other genes. Okay, so you need to probe further here. Obviously these genes are all related to one another. They found each other in a sequence similarity search, but that reciprocal best relationship didn't happen and that is more than likely because these two genes are paralysis to one another, okay? So because these two have come from a relatively recent gene duplication event, it ends up confounding the analysis, breaks the reciprocity. So to figure this out a little bit more, you have to use phylogenetic or similar techniques to deduce exactly what's going on in this kind of situation. Okay, so that's it for the terminologies. Hopefully so far so good. So let's talk about the alignments themselves. There are two basic kinds of alignments. The first one is called global sequence alignments where you take two sequences and you align them across their entire length. You try to align as many of the amino acid residues or nucleotides as you can across the entire sequences. So this worked really well when you have highly similar sequences that are of relatively similar length, but as that metric declines, as they become less and less similar to one another, it really is hard to get a global alignment across two sequences. You end up forcing an alignment and you end up missing important structural or functional domains by doing that. So from a biological point of view, this is not going to be our method of choice. Instead, we're gonna look at local sequence alignments. And what this does is it tries to find as many regions of similarity between the two sequences as possible. So something that are called paired subsequences, a glorified term for finding two regions within those sequences in question that can align with one another. So any regions that are outside the area of the local alignment are excluded because of the way this method works, you could actually have multiple alignments for the same two sequences that you are trying to align. And that's actually a good thing because it focuses on regions of high similarity and it will pick up regions of evolutionary importance either from a structural or a functional standpoint. So this is really the method of choice for sequences that share some similarity or for sequences of different lengths. Okay, so this is what we're gonna focus on for most of this morning's lecture. Okay, before we talk about the methods, a little bit about these things called scoring matrices. So once we have these two sequences aligned, how do we actually measure how good the alignment actually is between sequence A and sequence B? Okay, so these scoring matrices come into play. They give us a metric to assess how good these alignments are. They, in the numbers that are found in these matrices, what they actually represent are actual biological characteristics and physical characteristics, so physical, chemical properties of nucleotides and amino acids. So these are dependent on the side chain structure and chemistry of the amino acids, their side, the function as well. And it takes into account things like when you look at cysteines and proliens, those are obviously important for structure and function and especially with proliens, when you find them at the end of an alpha helix, they more normally end up bit breaking the chain. Things like tryptophan, you can only put them in so many places because they have a bulky side chain, so as far as steric considerations go. Lycines and arginines, they both have a positive charge, overrepresented in proteins that bind to DNA, so they can bind the negatively charged backbone, but again, taking account in charge, shape, all of those kinds of things. So in putting these matrices together, there's two major things we're gonna think about. Conservation, what residues can substitute for another residue and not adversely affect the function of the protein, okay? So we know just empirically, things like isoleucine and envaline, they're both small, they're both hydrophilic, they can often substitute for one another. Same thing for serines and threonines, they're both polar. So the bottom line here, the game is to just make sure that you're conserving charge, size, hydrophobicity, and other physical chemical factors. The other thing that comes into play is frequency. When we look at the entire constellation of proteins and the amino acid composition of those proteins, how often do we see a particular amino acid occur across all of these proteins? So taking into account rare versus common. Now, okay, so you now say, all right, Andy, why should I care about any of this? Anytime you do any sort of sequence comparison, when you do a blast search, when you do a multiple sequence alignment, any kind of comparison like this, in the background, these scoring matrices are being used. And the choice of scoring matrix becomes very, very important because they implicitly represent particular evolutionary patterns. So depending on which one you choose or not choose because you've decided to use the default values, it can strongly influence the outcomes of your analyses. So the default options may not always be the best choice. So that's why we're talking about how these are constructed and I'm gonna give you some guidelines as we go through. So let's take the easy case first. Let's just look at a nucleotide scoring matrix. And here a simple match mismatch scheme is employed. So if you look at this matrix, the four nucleotides are across the top and down the side, you'll notice on the diagonal, okay, you see a bunch of twos. So let's say you had an alignment and you have two C's aligned with each other. Here's the C, here's the C. You see where they intersect and for that particular position in the alignment, you're gonna assess a positive score of two points. Every place else off the diagonal, you see a whole bunch of minus three. So any time there was a mismatch, you deduct three points. This assumes that each nucleotide occurs 25% of the time which is actually not true, but it's close enough and that's what's usually used. But this kind of scheme actually works okay because in general, the nucleotides are not all that chemically dissimilar. But when you compare them to amino acids, the situation, the landscape becomes much more complex. So this is an example of a protein scoring matrix. It's something called Blossom 62. This is the default matrix that is used for the blast searches at NCBI. You'll notice across the top, we have all of the amino acids across the top and going down the side. So let's take a couple of cases so you can see how this is different from what happens in the nucleotide world. Let's say again, you have sequence A and sequence B. You've aligned them and there's a particular position where you have two tryptophanes lining up with each other. So we're gonna look for the W up here, the W over here and see where they intersect. And in this case, for that particular position, because it matched, the value that we assess is 11 points. Now what I want you to do and you can probably see this better on the screen, look at the diagonal. And you'll see that unlike the nucleotide case where a match was always assessed the same value, here it's not. Okay, so here we're taking again into account how frequently these amino acids occur in the entire constellation of proteins. Okay, so with tryptophanes being the rarest of the residues, it has the highest score on the diagonal. Now let's say we, instead of that tryptophane lining up with another, tryptophane, the tryptophane lines up with a tyrosine, so a W with the Y. So now if we look at the intersection of those two, we see a value of plus two, okay? So this takes into account the fact that these two can and often are observed to substitute with one for one another in proteins, but it's not an exact match. So we can't give it as many points as we would for the exact match. Instead we give it a lower score, but it's still a positive score. We take the third case where it's just a total mismatch. So something like a cysteine for a glutamine. In this case, you see that there is a negative value here of minus three. Okay, so these are two residues that do not normally substitute for one another based on observation, so we subtract points. So the basic rules here are anytime you have an exact match, you get the most points. Rare residues earning more points than an exact match of common residues. You also earn points for conservative substitutions, just not as many as you would for an exact match, and finally you lose points for mismatches. So that's the way the rules work, no matter what kind of this type of matrix that we're looking at. So let's talk a little bit more about the blossom matrices. The good news about these matrices is you never have to construct them yourselves. They're all constructed for you, but again it's knowing which ones you have to pick. So how were that matrix just constructed, the one we just looked at? Again, it's looking for differences in conserved, ungapped regions of a protein family. So it's taking multiple sequence alignments of known protein families and looking for how often these residues can substitute for one another and how often they occur. So directly calculated based on local alignments coming up with these substitution probabilities and the overall frequency of amino acids. Because of how they've been constructed, they're really sensitive to detecting even subtle changes in structure or function. So because of that they are the best performing of the series of matrices available to you. Historically in the very early days of bioinformatics there was another series called the PAM matrices and these were derived at a time when very little sequence information was available. They were mostly based on globular proteins with a lot of extrapolations going on. The only reason I point that out to you is because especially if you're using phylogenetic programs you'll often see them as an option but I would instead recommend that you pick one of the blossom matrices for your studies. So bottom line, you can use these matrices to find both closely related and distantly related sequences. So here's where evolution now comes into play. You'll notice in that matrix we looked at with all of the numbers on it, it was called Blossom 62. So what does that number stand for? It always is blossom and a number. What the number represents to you is in this first bullet point. It's that particular matrix was built using sequences that share no more than n% identity in the Blossom 62 case, 62% identity. And so in order to remove bias, what ends up happening is we go through a process that looks like this. So let's say we have a multiple sequence alignment here. There are a couple of conserve positions which are noted by the asterisks and let's say we wanted to construct a Blossom 80 matrix. So we're gonna look for all of the pairwise combinations of these things and cluster together any of the sequences that are 80% or more similar to one another. In the next step, we actually cluster those together. So this cluster is represented by this sequence, these three by a single sequence and these two by a single sequence. And at that point, we can calculate the matrix. Why do we do this? The reason is really more of the history of sequencing than anything else. There are more and more whole genome sequences available to us, but they're not very perfectly spread across all of the taxa. As you undoubtedly know in doing your own searches over at NCBI, that you will see the same gene has been sequenced many, many times. So when you look at them, which one is the right one, if you took all of that information together, you would end up biasing the analysis. So it's really just a matter of trying to take this vast amount of sequence data that we have available to us, but narrow it down to them in a format that will end up being most biologically informative in order to do these comparisons. So implications. Again, when we do this clustering, we reduce the contribution of closely related sequences less bias. Bottom line, when we reduce N, you get more distantly related sequences. When we increase N, you get more closely related sequences. So this is your first cheat sheet of the morning. Put a star on this slide, please. And basically gives you some recommendations of where you should be looking for your, for selections. If you're looking for shorter alignments that are highly similar, you're gonna pick a blossom matrix with a higher number. And you can see here the range of similarity that it's best at detecting. At the bottom, if you use the blossom 30 matrix, you would get longer, weaker local alignments, but you would use these to detect more distantly related proteins, things that have much lower sequence similarity. The one in yellow, the blossom 62 is the default. And just again, from testing, this is the most effective in finding all potential similarities. This is the default value when you go to NCBI's BLAST website. Okay, so you can pick one of these, but I would actually defer to the guru in this field, Steve Altul, who is the driving force behind the development of BLAST many, many years ago. And what he normally recommends, when you don't really know what you're looking for, is to use a triple BLAST strategy. Pick the default, pick a higher number, pick a lower number, and run your query sequence three times. Again, using those three different matrices. And that's what I would recommend you do is you get more and more familiar with these methodologies, because you'll start to find that by tweaking these values, you're gonna start to deduce relationships that you didn't know that existed or didn't even have any inkling that they existed when you started. So start off with doing these triple strategies until you get more familiar again, or unless you really, really know what you're looking for. The bottom line is probably as apparent at this point is that no single matrix is the complete answer for all of these sequence comparisons. And so again, I've given you a strategy. There is more information in this current protocols and bioinformatics piece by David Wheeler that talks at much more length about the selection of the matrices, but you have enough to get yourself started now. Okay, so that's the matrices. Now, many of you have already done pairwise sequence alignments, multiple sequence alignments. And a lot of times you see gaps that are inserted. And the gaps are put there to improve the alignment. But the thing you have to keep in mind is that in putting those gaps in to improve the alignment, that you have basically implicitly inserted a biological event, okay? It could be a deletion, it could be an insertion, okay? So it's not just, oh, I'm gonna make it pretty by putting some gaps in. So there was actually something that has to happen in the background to make that actually true. So because of that, you have to keep the gaps to a reasonable number to not reflect the biologically implausible situation. And the rule of thumb that most people use is about one per 20, of course it depends on what case you're looking at. But going back to the scoring matrices, you can now ask yourself, well, how do you score these things? Because we've been talking about aligning residues coming up with a value for each particular position. How do you score a nucleotide or an amino acid to a gap, which is essentially nothing, okay? So it's not a match, it's not a mismatch. What do we do with the numbers? So there is something called the affine gap penalty. And this is because we are, again, this is an implicit biological event. We have to actually deduct points anytime we put a gap into an alignment. The formula is here. So G plus LN, G is a gap opening penalty. So just when you put that first gap position in, it takes a big penalty usually. So on the nucleotide side, you're deducting five points. On the protein side, you're deducting 11 points right off the bat. Every time you extend that gap by one more position, you assess this extension penalty, two in the case of nucleotides, one in the case of proteins, okay? So again, keep in mind, it's not just about making your alignment pretty. It's about really making it biologically plausible, okay? And you can change, these values can all be changed in the BLAST algorithm to make the gaps more or less permissive, okay? So with all of that by way of background, we can now can finally talk about BLAST, the Basic Local Alignment Search Tool. This is really the cornerstone, the most important technique we use in bioinformatics, because again, it allows us to make these deductions about orthology, paralogy, what have you, really trying to figure out how two proteins are available to other and what we can learn by aligning them to one another. So we're gonna hit this pretty hard today because it's important that you know how to use it properly. Unfortunately, there are many cases in the literature because of the improper use of BLAST that have led to really wrong conclusions. And at the end of the day, it's because people either didn't know how to use it or they didn't interpret the results properly. The most important thing to take away from this lecture or any of the bioinformatics lectures along the way is that just because you see something in a list of results does not automatically imply biological significance. We're gonna see probabilities, we're gonna see scores, we're gonna see all of these things, but a lot of those things that appear in the output list actually have no biological relevance and this is where we're gonna help you figure out which ones do and which ones don't. So it's an election year, so let's go ahead and take a poll. How many of you have used BLAST before? Okay, how many of you know how it actually works? Okay, let a little fewer, good, good for self-awareness and for those few of you who've done it before, I know you, John, how many of you have ever changed the default values? Oh, good, a few of you, thank you. Okay, now for the rest of you, nothing to be ashamed of, all will be revealed today and you'll be able to consider yourselves experts by the end of today's lecture. So, okay, back to BLAST, the Basic Local Alignment Search Tool. What BLAST seeks to do is find things called high-scoring segment pairs, HSPs, and this quite simply are pairs of sequences that can be aligned with one another. When they are aligned, based on our scoring matrices, they have the largest maximum score possible, that you can't make that score better by making the alignment longer or shorter. They have to break something called a score threshold and they can be either gapped or un-gapped so since this method is looking for these little paired sub-sequences, again, a local alignment method, the results you get back are not limited to just the best alignment between two sequences. It'll give you all of the plausible alignments between two sequences. Okay, there are five different programs in the BLAST family and you can just differentiate them by the last letter, so BLAST N starts with a nucleotide sequence as its query and it does its searches against a nucleotide database. BLAST P, the P stands for protein, so we're taking a protein sequence and searching it against a database of protein sequences. BLAST X does something a little bit funky. You start off with a nucleotide sequence, a six frame translation is performed and based on those translations, the search is done against a protein database so the comparison is actually proteins versus proteins even though you started with a nucleotide sequence. The others are on the slide for sake of completion but in your own travels these are the three that you will probably use more often than not. Okay, so how does this actually work? So let's say we have this query sequence of a particular length and what BLAST does is it doesn't try to take this whole thing at once and do an alignment. Instead, it starts with just a few, in this case, amino acids, the default being three so the length of what we call the query word is three. So what happens here is in essence, we're gonna work across different windows in this sequence. So the search is gonna start with the GSQ and then the SQS and the QSL and move across. Just for sake of symmetry on the slide, we're gonna start in the middle here with the P, the Q and the G. Okay, so we want to now look for the occurrence of that P, Q, G sequence in all of the other proteins in the database that we're searching. But think back to the matrices, right? We wanna find the exact matches but we also want to find places where there are conservative substitutions because those are also going to be biologically informative. So we're not just looking for the P, Q, G but other things that are close to the P, Q, G again using our scores as the metric. So we have here the neighborhood, the exact match and a bunch of conservative substitutions in this case just focusing on the middle letter. There's a score next to each one of these. So in this case, the P, Q, G if it matches exactly earns 18 points. And so if you go back to the slide with the matrix and just do a P for P, Q for Q, G for G that you'd end up with seven plus five plus six and that's 18 points. Again, every time we have not an exact match but a conservative substitution we still assess a positive value for that position but just not as many points. So you'll see that the scores start to go down. At some point, BLAST will automatically draw a line defining what this neighborhood score threshold is. So where is the fence that we're putting around the neighborhood? In this case, the value is 13. So anything from 13 and up will move to the next step. Everything from 12 down will be rejected. Okay, so again exact matches and conservative substitutions and that T value changes from search to search automatically. You can set it though. All right, so now we've got the alignment at the bottom here, here's the sequence that we started with. We have it now aligned with a different sequence here that it starting again with this P, Q, G it found here a PMG, the PMG is in our neighborhood. You'll notice that the middle position actually is not assessed as either an exact match or a conservative substitution. That's what the plus signs stand for. But again, it broke that neighborhood score threshold so it carries through. So just from this, you could do a qualitative assessment looking at this middle line to see how these look to you, how well they align with one another. But now here's the time to bring the numbers back into play. All right, so here's the alignment again. Here we have a graph that's showing the cumulative score on the y-axis and the length of the alignment on the x-axis. So here we have our alignment of three letters. That's this very first point on the graph here. It's above our neighborhood score threshold. And now we're just going to keep going out in both directions as far as we can go. So in this particular case, we keep going in both directions and let's just assume that we have more matches than we do mismatches. Or more conservative substitutions that we have non-conservative substitutions. So we keep going up as we extend. And here's this neighborhood score threshold, S. Now, herein lies pitfall number one. Everything above this score value will be reported to you in the BLAST results. Again, just because it breaks the score threshold, it does not imply biological significance. I'll show you what does in a minute. All right, we're going to keep going up and up and up, going as far as we can in both directions. But now at some point, I want you to imagine that we're going to start mismatching more than we're matching. So the curve is going to turn downwards because we keep subtracting points, all of those negative values in the Blossom matrices. Until we get, have gone a little bit too far. So BLAST will tell us, all right, you've now gone a little bit too far, too many mismatches, too many gap penalties. It's time to go back to this peak. And that peak is going to define the maximum length of the alignment between these two sequences, okay. So again, we're not going to pay attention to this. We're going to have to come up with a different value. And let's explore a little bit more why that is. Okay, scores and alignments don't tell the whole story. All right, so here's an alignment. This is now a contrived alignment. We're going to take a little side road here. Here's a contrived alignment between two sequences, 20 amino acids long. And what I did here is this is just a bunch of amino acids that more frequently occur amongst the entire constellation of proteins. In this alignment also just contrived same length, but using the most rare of the residues, okay. So they both are of length 20. So you might say, well, they're both of the same length. And so they're probably both, you know, equivalent, you know, just as good as the other one. But then if you look at the scores, okay. The top one, because it's made up of common residues, scores 91. The bottom one, because it's made up of rare residues, scores 138. So you now have a situation where you can't just look at the length of the alignment and you can't just look at the score. You need some metric that will normalize all of this. You can look at with confidence, regardless of what search you're doing every single time to let you know which ones of these are true positives and which ones of these are false positives. Okay, so how do we do that? There is a formula. I'm sorry to show you a formula, but this formula is called the Carlin-Altscholl equation. And it uses the score, a normalized version of the score to calculate a probability value called E, okay. And all you need to know here is that E represents the number of these high-scoring segment pairs, these local alignments that were found purely by chance. So this is a measure of the number of false positives. And because of that, you want this number to be as small as possible. So lower values of E signify higher similarity, okay. Rule of thumb that I'm gonna give you, one of the first rules of thumb as far as BLAST goes, a good starting point for determining what is statistically significant if you're doing a comparison between two nucleotide sequences, you wanna look for probability values 10 to the minus sixth or smaller. If you're doing a protein-based search, so this is BLAST P and BLAST X, because remember BLAST X, you're doing that translation before the comparison is done, 10 to the minus third or better, okay. So by using these values, you don't have to worry about what's being printed out on my screen. You can actually start to deduce which ones are actually true. Okay, so with all of that, as far as background goes, we're gonna pretend we're at the computer, the screen captures are all in your handout or on your tablets or on your laptops. I see a nice glow coming from the audience from down here. And what I'd like you to do, we're gonna go through these together, but I'd like you when you get back to your labs to actually start to practice using these examples as well, okay. So here we are at the NCBI website, lots of things going on on this page. There's a lot of good stuff here that takes you to PubMed, a bunch of the databases, but we're gonna focus today on BLAST. So the link is right there. And if we click on that, that takes us to the BLAST homepage. Let me orient you very quickly here. At the top of the page, you can go directly to a set of assembled genomes. At the bottom, there are some specialized BLAST methodologies available to you. We're gonna talk about a few of those a little bit later. And for right now, we're gonna focus on the stuff in the middle here, just basic BLAST. And the second link is the protein BLAST. So we're gonna pretend that we're gonna do a query again for your practice on this webpage. Here is the URL or all of the sequences we're gonna go through both this week. And when I see you again in three weeks time, and again, I would encourage you to just put fingers to keyboards and try all of this out. Okay. So this is what the query page looks like. So at the very top, of course you have to have a query sequence that you wanna start with. This is the sequence from the page I just showed you. So you just paste it into the box. If you don't wanna search the entire sequence, you have the ability to just say, all right, I wanna start this comparison just using residues 50 through 150. If you have a particular region, you know that you're interested in or you could just trim this sequence back. There are a whole bunch of databases that are available to you. So NR is NCBI's non-redundant database. The one I wanna focus on just for discussion this morning is something called RefSeq. And so again, let's take a little bit of an aside. What RefSeq attempts to do is to provide a single reference sequence for each molecule in the central dogma. So a single DNA sequence, a single mRNA sequence and a single protein sequence. And again, this goes back to something I already mentioned earlier in the construction of the blossom matrices. Let's say you're interested in the deleted for colorectal cancer gene in human. If you do that search, you're gonna get back somewhere between 20 and 50 different sequences and you kinda look at it all, sort of befall it all and say, well, which one is the right one? Or at least the canonical one. And this is what this is intended to help you to do is select what is considered to be the canonical sequence. Obviously, because each one of these molecules is represented only once, it is non-redundant. There are folks at NCBI who update these records to reflect the current knowledge of sequence data and biology. So that's very good when you're looking at the actual gen bank entries, gives you some information about the biological attributes of the gene, the gene transcript or the protein, wide taxonomic range. And again, the ongoing curation, which I think is the most valuable part of this database. Each entry in GenBank, as you undoubtedly know, carries something called an accession number, so that in essence is that sequence's social security number. You can just put that in and you know you're gonna get the same sequence back every time. They carry codes and the way to tell the RefSeq entries from the rest of the entries is by looking for these prefixes. So if you see the end series here, NT means it's a genomic contig. If you see the NM, it's an mRNA sequence. The P again stands for protein and an NR means that it's a non-coding transcript. What you'll also see are some entries that arise from genome annotation. So taking these genomic contigs, the NT entries, and using computational methods, you can predict the model mRNAs that come from this sequence and in turn the model proteins that come from these model mRNAs. So this is probably the most important thing to take away from this slide is when you're looking at these that start with an N, these have been directly sequenced or directly determined, experimentally determined. The ones with the Xs arise from predictions. So if you get back a set of results, if the only thing available to you are the Xs, that's fine, keep it in the back of your head though that they haven't necessarily been verified. Always go for the ends when you can. The other nice thing about using RefSeq is because of the non redundant nature of these databases, you get just a tighter, neater set of results back. Okay. But for purposes of the example, I'm just gonna leave this at NR. You can limit your searches by organism or by taxonomic groups. So you can say human, bilateria, whatever you wanna put in there, or you can just leave it blank and it will search everything. The program selection is Blast P right down here and this is where people usually stop. They'll just click the button and go. But down here tucked away are these, is a little pop up that says algorithm parameters. So this will expand the page now. And here's where a lot of the numbers we just talked about come into play. The first one just gives you the maximum number of target sequences. What this means is how many sequences will Blast return to you based on the search that you're issuing based on your search query. The default is a hundred. In this case, I'm changing it to 250. I like to set it on the high end because I don't wanna miss anything. So I would rather put it on the higher end and then determine which ones I wanna look at. There's an expect threshold. This is the E value threshold. So again, those probability values that we talked about today. Blast sets this at 10. Again, I'm gonna leave it at 10 for the example just for illustrative purposes. You could change it to the values that I gave you. But I'll caution you as to why you may not want to do that later. Again, going back to manual inspection. The word size is three. You can change the word size if you want. Selection of matrices. So you can see everything that's available to you here. Underneath, you can change the gap costs as well. So if you want to make it less gap prone, you can increase the penalties. If you wanna be more permissive, you can decrease the penalties. Filter, so we can filter out things called low complexity regions. So what are those things? These are regions of bias composition. Blast depends on the fact that the sequences that it is analyzing is of relatively uniform composition. But occasionally, you'll see proteins or even nucleotide sequences where there are runs of a single amino acid or runs of a particular nucleotide. And that ends up messing Blast up. It often leads to false positives. So things like homopolymeric run, short period repeats, and just the subtle overrepresentation of several residues. So I would advise that you filter them out, but it's not enabled by default. So we're gonna just put a checkbox over there. Finally, at the bottom here, there's a little checkbox that says show results in a new window, and I like to do that so I don't have to keep moving back and forth between the pages. It just makes navigation a little bit easier. So with all of those things, we're gonna finally now go ahead and click the Blast button and see what we get back. It's not gonna be quite that fast. You might have to get coffee certain mornings, but this morning we're good. Okay, so let me orient you to this screen as well. The first thing I wanna point out is right here above this very multicolored table that says distribution of 200 blast hits on the query sequence. You'll recall that the default to return is 100 sequences. So if we had not changed our value to 250, we would have missed 100 of the results that came out of this analysis. So again, it's good that we changed to a higher number. There is a section up here where your sequence is represented by this black bar, and what's right below it, it has now identified a protein superfamily that this belongs to, and we're gonna talk more about that domain analysis when I see you next and we talk about domain architectures. Let's pretend to scroll down a little bit and let me decipher what this graph actually means. There's a color key at the top showing you the alignment scores, not the probabilities, the scores. So each one of these lines, if let's say all of the red ones, it means that they scored more than 200. Again, not really relevant to our analysis in being able to make conclusions. They're all in descending score order. If you mouse over any of them, this box up here will show you what the found sequence actually is, the identity of that sequence. Anytime you see the little gaps here, it means, so here we have an insertion, here we have a deletion, just that little hash mark that goes in. It's just a gap in the alignment with what you started with, with the subject. I'm sorry, with the sequence it found, the subject. So it might be a case where you've got more than one high-scoring segment pair. Here we've got a sequence where there's one, two, three, and four different high-scoring segment pairs or it could be a masked region. Again, looking for those regions of bias composition and we'll see that in a little bit more detail in a couple of slides. All right, again, imagining we're scrolling down, and this is our hit list. This is what actually came back. You have the identities, the definition line for each one of these sequences. The score values are here, but again, we're gonna ignore those and instead we're gonna look at the column that says E value, okay? Now you'll see that there's a bunch of zeros here, okay? Now strictly speaking, in computer speak, that means that this is 10 to the minus 1000 or less. Basically it's just vanishingly small. It's just two, the values get so small at some point that they're not even worth showing on here. Plus there's only so much room in the field. Let's take a look further down. There's one down there that's 8E minus 179 and that just means eight times 10 to the minus 179, okay? So really, really small numbers at the top of our list and what we can, and there's percent identity values here as well as you can see, especially with the ones at the top, they're all very close to each other, 100%, 98% and so on. Let's scroll to the bottom of the list, okay? And here's where we're gonna start applying your some rules to help you deduce which ones of these are actually the ones that you wanna consider that might actually have some biological relevance. So we're gonna draw a line down here and we're gonna reject everything that is below the desired threshold. Again, from the rules I gave you before, 10 to the minus third or higher in this case. So now in this case, when you look at these, they all meet the criteria. So we're gonna go and accept them, but you'll see here I say for now because there's another rule coming that we're gonna use to let biology be your guide. We're gonna take a look at one of these alignments, these high scoring segment pairs in a little bit more detail here. At the very top it gives you the identity of what this found protein was and links back to its corresponding gen bank entry. In this case, the sequence we started with, the query, this alignment starts at position 17, the subject, the one that it found also starts at position 17. And each line shows you what the position numbers are. So this alignment goes from 17 to 704. Okay. At the top you have some information on identities and positives, okay? The identities are how many exact matches are there between these two sequences? And so 688 of the 688 positions lined up with each other, so this is 100%. The positives are the same as well. The difference between these two identities mean exactly that, the exact matches. The positives include the conservative substitutions, okay? So those numbers will not always match. And in this case, we didn't introduce any gaps. So here's where the next rule comes into play. So I've given you a probability rule. Now I'm gonna give you an identity rule. You're gonna be looking at this value if you're doing a protein-protein comparison, again, blast P or blast X. You want that value to be at least 25%. If you're doing a nucleotide search, you want that value to be at least 70% for starters. Okay. All right, you see a bunch of these that are in lower case here, and these are representative of the low complexity regions that I talked about before, these regions of biased composition. So it just points out to you where they actually are, okay? In the second example, this is still the same entry, this is still the same hit. Here's the 704 that you saw in the previous slide. Here, above this little statistics row, in the last one, we're gonna just go back for a minute, you notice that there was the identity of the protein was given to you. Here, you don't see that. So it's another high-scoring segment pair for the same two proteins, the one you started with aligned to the one that it found here. So this is just another part of the molecule that aligned. In this case, the alignment was from 777 to 1374. So here it says range two. That just tells you very quickly that the second high-scoring segment pair was identified. In this particular case, we have a gap that's introduced here as well. So you can see where that gap is, and you can see also that that's reflected in the numbers up here, that it's 4% of the total alignment. So let's bring the example all the way around. So here's our overview of our results again. The two lines that we saw that gave us our statistics. So for the first high-scoring segment pair and the second high-scoring segment pair, it beat our E value that we were looking at. Again, percent identity, these are proteins, so 25 or better, so we've got 195. We'll accept those as well. And just, you can see, if you were to do this on the screen and mouse over, it would show you that this was the particular hit that gave rise to these two high-scoring segment pairs. So it's pretty much as easy as that once you understand what the rules are. So the manipulations of cutting and pasting and clicking buttons, that's not the hard part. It's really understanding what you should be doing with respect to all of these default values. The reason they're the default values is in most cases they'll work, but now hopefully you understand a little bit better what can possibly go wrong if you don't pick the right values, especially based on the kinds of biological questions you're trying to answer. So the scoring matrices come into play if you're looking again for things that are more related or less related, so closely or distantly. Distantly related ones are gonna give you some great phylogenetic information about the origin of different genes and proteins. The closely related ones are gonna give you better information about the particular properties of a protein family and what makes, what is distinctive about that particular protein family. And these are the rules that are gonna help you find all of that out. So again, second cheat sheet of the day, put a star on this one as well. Here are the rules for the E-value and the sequence identity again for both the nucleotides and the proteins. Now, these are guidelines, okay? These are not holy rules, these are not, these are really just a starting point. So I don't want you to use these cutoffs blindly. This is just, again, some place to start until you get more comfortable with using BLAST. You should also pay attention to alignments on either side of the dividing line. So you'll remember, we drew a red line, right? But take a look if there are things below that line that might actually make biological sense to you and it'll happen more often than not, that it might just be under the cutoff, but you have biological knowledge that tells you that the things under that line are related. Biology always trumps these results, okay? Because remember, these are predictions, okay? So when you have biological knowledge, you use that biological knowledge. So never ignore the biology that you know, okay? All right, so that's it for BLAST proper. There were some other things on the BLAST page that I wanna point out to you. So let's say instead of just having a sequence of interest and trying to deduce new relationships by comparing it to a database of sequences, instead, let's say you have two sequences A and B that you just wanna align with each other. You're doing some study and you just wanna see how well they align. So we're gonna use something called BLAST two sequences to do that. We're gonna come back to our BLAST homepage and this particular program is found in the lower part here, BL to seek, align two or more sequences using BLAST. This page looks very similar to the one that we saw before. The only difference is that when you check the little box that says align two or more sequences, this part of the screen will then expand where you can put the second sequences, second sequence in. So we're gonna take a sequence here called SOX 10, another one called the sex determining region Y. These are related to one another. We're just gonna see how related they actually are and where. So again, sequence A, sequence B. As before, we could define query sub ranges on these things. We're gonna expand the algorithm parameter box one more time to see what's down here. So again, maximum target sequence is 100. In this case, we don't care because we know that we're only gonna get one back because we're only doing a pair-wise alignment. Changing the E value here is also not that important because the sequences are pre-selected. So you're already starting from some place of biological knowledge, but you'll still get back a measure of whether or not the alignments are statistically similar. So word sizes, all of those things remain the same. Below, again, you can change the matrices. You can filter for the low complexity regions again, but we're just gonna go ahead, check the show results in new window box off and go ahead and do the blast search. And what you end up getting back looks very similar to what we saw before. Here is the overview. Again, you're only comparing against one sequence, so you only see one line here. One is much shorter than the other. If we look at the results table here, it gives us an E value, one times 10 to the minus 26. So that meets our E value criterion, the percent identity of the matches 46. So that matches our identity, our criterion as well. And here are the two alignments. So the first high-scoring segment pair, the second high-scoring segment pair, and you can just see for yourselves how well these align to one another. This one is pretty short. I'm not sure I would buy that one, but this is the region of interest here that's biologically important between these two sequences. Okay, so pretty easy way to do it. Now, we focused on proteins for most of the morning because I think it's easier to understand how BLAST works because of how the matrices work as well, but of course, we need to be able to do nucleotide searches as well. We're not gonna spend much time on this morning other than for me to orient you to what's where. So we come back to this page, and again, the nucleotide BLAST are in this second section and within the basic BLAST section. And when you move down to the program selection, you can pick one of three different nucleotide algorithms to do this comparison. One is called MegaBLAST, one is called Discontiguous MegaBLAST, and the other one is called BLASTN. So what's the difference between the three? And they're summarized on this slide for you. So MegaBLAST, which is the default, is really optimized for very long or highly similar sequences. So when you want to do a search against complete genomes or against chromosome length contigs. So this really allows you to map transcripts that you might have, mRNA sequences, things like that, onto a sequence in seconds. You'll see the word length here is 28, so you get longer and highly similar alignments, and that's what you want in this particular case. You're looking for almost exact matches or just exact matches, and it runs a lot faster because that word length is longer. The middle block is probably what you're gonna be doing more often when it comes to searches at the nucleotide level. The results between these two methods will actually be very similar, but MegaBLAST is gonna run faster in relative terms. Pay attention here to the kinds of gap penalties here. Discontiguous MegaBLAST is using the affine gap penalty that we talked about before. In this case, MegaBLAST uses a linear BLAST penalty so there's no penalty for opening the gap. It's just a defined number of points for each position in the gap. Finally, the old stalwart BLAST N. It's best for finding short, nearly exact matches, and you can see the values that are being applied here as well. So that's really all we're gonna talk about on the nucleotide side today because it works exactly the same as what we talked about on the protein side. And again, you have different programs available to you, but it's gonna come down between BLAST N and Discontiguous MegaBLAST as far as your choices go on this side of the house. The last thing I wanna talk to you about this morning is something called BLAST. So we are now going to leave the Lister Hill Center on the other side of campus and move ourselves over to the sunny shores of California and a tool that's made available through the Genome Center at the University of California at Santa Cruz called BLAST. And what BLAST just stands for is the BLAST-like alignment tool. And the method is somewhat similar to MegaBLAST in that it really is optimized to align longer nucleotide sequences, things that are more than 40 in length, having 95% or more sequence similarity. And because of the way these settings are set, you can find exact matches reliably down to a length of 33. So really the method of choice when you're looking for exact matches in nucleotide databases, it's also much faster, 500 times faster than BLAST for doing either mRNA or DNA searches. It may, however, miss divergent or shorter sequence alignment. So again, here's a case where you wanna use BLAST and you wanna use BLAST depending what you're looking for. You can use it on protein sequences. I never have. I feel that BLAST-P is more efficient for doing those kinds of searches. So when do you use this? You're looking to characterize some unknown gene or sequence fragment. So where does it live on a particular chromosome? Finding its genomic coordinates, determining gene structure. So the presence or absence of exons, what other information might be available in intergenic regions. And looking for markers of interest in the vicinity of a sequence. Obviously it helps you find highly similar or identical sequences based on what I've already told you. So you can align mRNA sequence onto a genome assembly, find other gene family members or do some cross species alignment to identify putative homologs going back to the evolutionary question. It also allows you to display a specific sequence as a separate track within the browser. So we now start to introduce the concept of a track. We're gonna see an example of that. Shortly, Dr. Wolfsburg is gonna be spending quite a bit of time discussing the use of this browser when she presents her lecture in two weeks' time, but let me give you a first gentle introduction to it. So here we are at the UCSC Genome Bioreinformatics website. You can get to the BLAT program by just clicking on the BLAT link, which is in the left sidebar. Takes you to a very simple search page, much simpler than what we saw with BLAST. So I've gone ahead here and just pasted in a sequence of a CDNA clone that comes from the cancer genome anatomy project. This is from Makaka Mulata. So we're gonna change here the genome to Rhesus. The assembly automatically will change for you. It's as a DNA sequence. And you have two choices here. Submit or I'm feeling lucky. So if you're feeling lucky, what this is going to do is just return to you the highest scoring alignment between this sequence and the ones against the Rhesus genome. Instead, what we're gonna do is click on the submit button. Okay. All right, so here is what you get back. You'll notice that for each one of these, here is what we started with. You have a score for each one of them. It gives you a start and end position. Okay, and it then tells you which chromosome in the Rhesus genome the match is actually on and how long the alignment is. So since this is the longest alignment, that's the one we're gonna consider for this example. In order to visualize this within the UCSC Genome Browser, we're just gonna click on the browser link all the way over there on the left, and this is what you get. If we had clicked the I'm feeling lucky button, you were gonna come directly to this screen. So right away, we can see the chromosomal position of our particular sequence. So this is on chromosome six. This is the sequence we, where is, this is the sequence we started with, your sequence from Blatt Search. And it in turn now has found a whole bunch of non-Rhesus ref-seq gene. So again, that non-redundant database from a wide variety of organisms. So very quickly, you can now start to identify orthologous genes to the one that we started with and actually the identity of that orthologous gene. So if we started with an unknown, we now can actually assign it to a gene family. We're not gonna talk about this much more than that. Let me just show you one other view of this. If we click on the details button, we can look at a more traditional view of the alignment. So here's what we started with. You'll see a whole bunch of colors going on and the key is deciphered for you at the top. So matching bases in CDNA and genomic sequences are colored blue and capitalized light blue bases mark the boundaries of gaps in either sequences often spliced sites. And if we just keep scrolling down, we now just see what the pairwise alignment looks more traditionally more reminiscent of the blast view. Okay, so that is it for today. As far as what we're going to cover, we're gonna build upon these concepts when I'm back with you in week four in discussing how we can go beyond just looking at these letter by letter alignments to find the relationships between gene and protein sequences. Next week, Eric Green will be behind the podium. He is going to give his normal tour de force and give you a really good comprehensive view of the genomic landscape as of this year. It is always been one of the most wonderful lectures in this series because it really gives you a great overview of everything going on in genomics currently, but also giving you a lot of the history of how we've gotten to the position where we can do a lot of the great work that we currently do. So with that, I'm happy to take questions at the podium. Thanks for joining us today and we'll see you again next week.