Alright, good morning everyone, and thanks for coming today. Just a few housekeeping things this morning. First, the usual disclaimer about CME: for those of you who wish to earn CME credits for attendance, please make sure that you've signed in, either here today or at the teleconferencing sites that are joining us this morning. The past lectures are now being posted in two different places, so I want to make you aware of that first. On the NIH videocasting site you can go to this very long URL here, or, easier, just go to our own course website at genome.gov/course2010 and follow the links to lectures on the web. There is a direct link to this page, and you'll see the last two lectures have already been posted. All you have to do is click on play video; that will pop open the viewer and you'll be able to watch the lecture. So if you have missed a lecture, or want to catch up or just review some of the concepts, please feel free to take advantage of that. Another source for the lectures is Genome TV. This is a new channel that the Genome Institute has put up with a variety of lectures. Here you can see a sequencing group busily working, preparing a lot of the samples that we end up talking about in the course of this lecture series, in this case from the clinical sequencing effort, or ClinSeq. The lecture from the very first week is right here under the featured column. We're having a new playlist put together so you can find all of the lectures in one place there as well, and it works exactly the same way as if you had gone to YouTube.
Alright, so with that, let's go ahead and get into this week. We're going to pick up where we left off last week, with how to further analyze sequences, primarily at the protein level. As you'll notice, we've concentrated quite a bit on the protein level, and this may seem a little bit odd for a course that's called Current Topics in Genome Analysis. We do this because we want to reinforce to you the importance of thinking at both levels, the DNA level and the protein level, when doing your analyses. Obviously, advances in genome science make it much easier to find mutations and chromosomal aberrations, and to look at changes in expression patterns and similar events taking place at the nucleotide level. But we need to translate those events over to the protein level, and to keep in mind that the proteins are actually the workhorses of the cell. For those of you who are focused more on the basic research side, going through all of this material will hopefully help you think about your experimental design a little bit better and advance your understanding of how these mutations at the nucleotide level affect things like structure and function. For those of you thinking about clinical questions, you might have detected a mutation in a patient, perhaps through genetic susceptibility testing or some other targeted testing, and you need to understand the net effects of those mutations in your patients: to better understand why you are seeing the phenotypes that you're seeing, to start to get some insight into metabolic changes that might be taking place in those patients, and to help determine which mutations are potentially pathogenic. This will in turn, hopefully, help you start to think about therapeutic approaches. So yes, we are focusing primarily today on the protein side of the house, but I just wanted to drive home why that's important.
So we're going to talk today about things like profiles, patterns, motifs, and domains, and then about protein secondary structure. We're going to move to the three-dimensional level, going away from just looking at strings of letters and now looking at three-dimensional structures; I will tell you about an analysis tool that's very easy to use and about a three-dimensional viewer. Finally, we're going to end up talking about multiple sequence alignments, and this theme will come up over and over again in the rest of this course when we start to look at the genome browsers and other techniques that are available to us to do advanced genome analysis. Okay, so we'll start with sequence comparisons once again. The approach that we used last time around was homology searches, where we did one-against-one searches: taking a sequence of interest and comparing it against a set of sequences, probably in a public database, to find other sequences that are similar to the one that we started with. The method that we spent most of our time talking about last time was BLAST, but I also mentioned to you another tool called FASTA; again, these are one-against-one comparisons against a large collection of sequences. We can also take a slightly different approach and look at the collective characteristics of protein families to find similarities between protein sequences. These searches can be one-against-many, and I'll tell you about several approaches to do that, or many-against-one, where you've got a collection of sequences and you're trying to find a new sequence of interest, and I'll tell you about a BLAST-related method that can do that. So we're going to start with the one-against-many and talk about profiles. So, a couple of definitions again to start off today's lecture.
Whenever I talk about a profile, a profile quite simply is a numerical representation of a multiple sequence alignment. I can take any multiple sequence alignment and represent it as a matrix, the same way that we talked about our scoring matrices last week. If you'll recall, we talked about the BLOSUM series of matrices, which were derived from multiple sequence alignments; those matrices conveyed to us things like when conservative substitutions could take place and where important residues had to be conserved. So these profiles, like last week, depend on patterns or motifs that contain these conserved residues, and they represent the common characteristics of a protein family. Because of that, and because of the power of using a collection of sequences as your basis for comparison, we can start to find similarities between sequences that have little to no sequence identity. I offer you an example to make this point: the homeodomain family. The homeodomain is 60 amino acids across, but only a handful of residues, fewer than 10, are conserved amongst all of the sequences in the class. If you were to use BLAST, which we talked about quite a bit last week, you wouldn't actually find all of the homeodomain sequences that are in the public databases. So we need to add some tools to our arsenal to help us find additional similarities and make additional biological conclusions where BLAST won't quite get us all the way; more things to put into your toolkit. So, yes: as I mentioned last week, the definition of similarity is your identical residues plus conservative substitutions. So you have a percent identity and a percent similarity, and again, the percent similarity includes conservative substitutions. Okay, so how do we actually construct one of these profiles?
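To make the idea of a profile as a matrix concrete, here is a minimal Python sketch, not the method used to build the table on the slide. It assumes a tiny, made-up, ungapped alignment, a uniform background frequency for the 20 amino acids, and a simple pseudocount; the function names `build_profile` and `score_sequence` are illustrative only. Real resources use more careful weighting and also model gaps.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Turn an ungapped alignment into one dict of log-odds scores per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)

    profile = []
    for col in range(length):
        # How often is each residue observed at this position?
        counts = Counter(seq[col] for seq in alignment)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive score: seen more often than chance; negative: less often.
            column_scores[aa] = math.log2(freq / background)
        profile.append(column_scores)
    return profile


def score_sequence(profile, seq):
    """Score a sequence of the same length against the profile, column by column."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(column[aa] for column, aa in zip(profile, seq))


# Hypothetical four-residue alignment: the last column is always G,
# the second column is mostly P with an occasional T.
toy_alignment = ["APTG", "SPTG", "ATTG", "APTG"]
profile = build_profile(toy_alignment)

consensus = "".join(max(col, key=col.get) for col in profile)
print(consensus)
print(round(profile[3]["G"], 2), round(profile[1]["P"], 2), round(profile[1]["T"], 2))
```

In this toy case the absolutely conserved column gets the highest score, the mostly-conserved column a lower one, and residues never seen at a position score below zero, which is the same qualitative behavior as the scoring table discussed next.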
So again, the profiles are based on multiple sequence alignments. Here I have a multiple sequence alignment; there are a number of sequences here, each 10 residues across. You'll notice that I've highlighted some of these positions in red. In the last position, you'll see that there is always a glycine residue, in the 10th position. You always have a threonine in the 8th position. If you look at the 9th position, most of the time you have a proline, but sometimes you have a threonine thrown in as well. So we're going to take a look at this and ask four questions. What residues are seen at each one of the positions, so we get an idea of what's allowed? What is the frequency of those observed residues? Which positions are conserved, either absolutely conserved or just conservatively, where we see conservative substitutions? And finally, where can we introduce gaps? We don't have any gaps in this particular example, but gaps can certainly exist in these alignments. Based on those four questions, we can construct something called a position-specific scoring table, and all of the numbers in this table represent the answers to those four questions. So let me take you through this. Across the top you have each one of the amino acids. Going down the side is the consensus at each position in our multiple sequence alignment: position 1 is at the top of the table, position 10 is at the bottom, so we've basically just turned this alignment onto its side. Now, you'll remember that there's always a G in that final position, position 10. If we look at the G here, and at the G in the amino acids going across the top, and look at where those two intersect, we see a value of 150, and 150 is the largest number in the table. So anytime you have a position where that residue absolutely, positively must appear, you are going to assess a very, very high positive score. So, drawing the analogy
back to the BLOSUM tables last week, anytime we saw a conserved residue we always gave those conserved matches the highest possible score. Let's consider now the next-to-last position. Again, most of the time we have a proline, but sometimes we have a threonine. We'll take a look over here at the P and see where it intersects with the P across the top, and this time it says 89: not as high as the 150, to reflect the fact that this is a conservative substitution. These residues can substitute for one another. It would be better to have the exact match, but we want to at least allow for that wiggle room as well. Finally, let's take a look at the second position, where you've got just about everything going on. The consensus here is a proline; if we look across to where the proline is, we have a much lower positive score. Okay, so your best scores arise when you have an exact match of a residue at a position that is absolutely conserved, and then the scores start to go down as the positions become less and less conserved. Now, fortunately, you don't have to generate these; these are all generated for you, and you can use them to compare your single sequence of interest against hundreds and hundreds of these tables. So in essence, this takes the place of a sequence. The same way last week we used BLAST to compare sequence A to sequence B, now we're going to say: okay, I have sequence A, and sequence B is my collection of these position-specific scoring tables.
Okay, the second definition is something called a pattern. In the case of the profiles we have two types of information: we have positional information, what residues can appear at a given position, but we also have frequency information, which is captured in that table by all of those numbers. Here, we just have a shorthand to represent what residues can exist at a given position, so we have no frequency information, just what is allowed where. Now, it's not exactly intuitive how to interpret this, so let me
take you through this. At the very beginning we have a phenylalanine and a tyrosine in square brackets. What the square brackets convey to you is "one of", so in position one of that motif you have to have either a phenylalanine or a tyrosine. In the next position we see an x, and the x means anything: you can have any amino acid at that position. We then see a single amino acid, a cysteine, with no embellishments around it, so that just means you absolutely have to have a cysteine at that position. Again, here is the x, but now the x is followed by a two, so that just means two of them: any two amino acids in a row. Alright, now again two amino acids here, a valine and an alanine, but instead of the square brackets we have the curly brackets. That means the opposite of the square brackets: anything other than a valine or an alanine, so not valine, not alanine at that position. Again the x, so any amino acid. And finally, here is the same notation we've seen over here: we've got a histidine and the number three, so it's just three histidines in a row. Written out, the whole thing reads [FY]-x-C-x(2)-{VA}-x-H(3). So this defines a pattern that matches every single member of that particular class of proteins. Okay, so a little bit different approach, but both of them are quite useful.
So let's go ahead and put both of these into practice and talk about our first database of the morning, which is something called Pfam. Pfam quite simply stands for protein families, and it's just a collection of these profiles, these multiple alignments of protein domains and conserved regions. Hopefully these represent regions that have either some sort of structural significance or some sort of functional importance. When we look at these entries, and we'll see an example in a couple of minutes, we'll see the multiple sequence alignments of the family members, and we'll start to get a sense of domain architectures. When I use that term, all that means is: what is the series of domains that I see in a particular protein, the nature of the domains and the order of
the domains. I'll get an idea of which biological species have proteins that are members of that particular class, some information on known protein structures, and links to some other databases as well. Now, there are two types of Pfam entries. The first one is called Pfam-A, and these are the better of the two options. In prior versions of the documentation, the analogy had been drawn between Pfam-A and a handcrafted beer, where somebody lovingly put the multiple sequence alignments together to make sure that that particular profile was indeed accurate and biologically relevant. So these are based on curated multiple alignments, and a method is used to find all of the detectable protein sequences that match that particular family. Someone has gone ahead and made this multiple sequence alignment for you, generated the scoring matrix, and then searched the databases to find all of the other members of the family. Because of the way it's done, because of this handcrafted method, the hits that you find to Pfam-A are more than likely true positives. Pfam-B is a slightly different approach: it relies strictly on automated methods, and because of that it is deemed to be lower quality. But you should certainly look at those entries, because you might have a situation where you've done a query and you don't find a match to Pfam-A but you do see something in Pfam-B. So again, using the handcrafted beer analogy, Pfam-A is like the Stella Artois, while Pfam-B is more like Natty Bo.
Alright, so to remind you, we're going to go through all of these examples as if we are sitting at the computer. As with last week, if you'd like to go back to your labs or your offices and repeat what we're doing in class this morning, you can go to this page. All of the sequences are on this page; you can just cut them, paste them into the boxes, and follow the steps that we're going through. Alright, so let's go to the Pfam site. As always, I will give you the URL at the top of the page. So we've gone off
to the Sanger Center in the UK, and here is just a very simple front end. What we're going to do is a sequence search, but while we're here, let's take a look at what else we have. I love the name of this one, "view a clan", so you can see some groups of related families, with deference to the Scottish people that work on this database. You can jump directly to a particular protein family if you know its name, so it's just a simple keyword search. But again, we're going to go ahead and do our sequence search. If we click on the words sequence search, we get a box, and we could quite simply put our sequence in that box. But as you have come to learn, I like to look at the options that are available to me, to see what parameters we can change. There is a very, very unobtrusive link here that just says "perform a range of other searches here"; you click on the word "here", and that will expand this out so we can see what's available to us. Okay, so here is our sequence box. I've gone ahead and pasted in the appropriate sequence from our list of sequences, but we have a few options here that are worth talking about. The first one is called cutoff, and it is automatically set to use an E value of 1.0, so that is the default value. But keep in mind the guidelines that I gave you last week for protein sequence comparisons, because that's basically what we're doing here: even though we're using these matrices, it's still a protein-protein comparison. So in the back of your head, keep that 10 to the minus third guideline in mind as you look at the results. What gathering threshold means here is that, if you want, it will automatically adjust that E value to use the same E value that was used to construct each individual scoring matrix. If you have a reason to do that, that's fine, but I usually just leave that as is. Really, the most important thing here is to click on this "search for Pfam-Bs" box, which is unchecked by default, because we do want to see those
results, even though they are of lower quality, because there might be something interesting there. So we'll go ahead, click on submit, and see what we get. What we get is something that looks like this. At the very top of the page we have an overview of our results. It just says we found three Pfam-A matches to your search sequence, one significant and two insignificant, but we didn't find any Pfam-B matches, so we're only going to look at one hit here. We then have a representation of what was found. It's marked P450, so the sequence that we started with had a cytochrome P450 domain in it. You'll notice that the left side is rounded and the right side is jagged. The left side being rounded just means that our sequence lines up with the beginning of the motif for that particular protein domain, but the jagged part means we didn't get all the way to the other end. So this is a partial overlap with a particular domain, in this case P450. Alright, if we now look at the tabular results, again cytochrome P450, our sequence aligns with the domain starting at position 41 and ending at position 500. To see a little bit more detail, you'll see in the red box here that it says show or hide all alignments. If you click on the word show, this basically expands the page, so now what we have here is the alignment, and there are a number of lines in this alignment; this might be easier to see on your handout. Your sequence, the sequence that we started with for the query, is the one that is just labeled SEQ, for sequence. The HMM in the first row is the actual consensus from the profile that it matched, so that is the actual P450 profile represented as a consensus sequence. The next line down, where it says match, quite simply shows which positions match, so you have a qualitative, visual overview of how good those matches were. Same rules as last week: any place you see an exact match, you see the letter repeated on that line; any place you see a conservative substitution, you see a plus sign. All
right, and then that final line, PP, just stands for posterior probability. This is a quantitative measure of how good the match is, position by position, to this particular position-specific scoring matrix, and the rules for that are in the documentation right above it. Okay, now, we searched for Pfam-B and didn't find anything, so let's go ahead and go to the entry, where the more interesting stuff actually is. This is now a summary page for a motif, and again, in our case this domain is the P450 domain. It starts off with what I like to call an executive summary, and we're going to see a lot of these as we go through this morning. What these executive summaries represent is this: someone who knows this particular entity, in this case this protein domain, very, very well, who knows the literature inside and out and is more than likely an active researcher studying these proteins, has written for you what they think are the most important things you need to know about this particular protein domain. Anytime an expert is willing to take the time to do that and share it with you, you should absolutely take advantage of it. Right below it, you'll see that there are references going back to the primary literature. So again, the executive summaries are very important. They are not a substitute for reading the literature, but they will at least direct you to the more important papers to read, so there's a little bit of a judgment there as to which papers are important and which ones maybe not so much. Alright, on the right-hand side here we have a sample structure. When you do this, it will just randomly pick one of the structures that has this particular domain in it; if you want to see different structures, you have a pull-down menu right below where you can switch to other structures. Now, we're going to take advantage of two links at the very top of the page, and you'll see it says 152 architectures and 18,883 sequences. So let's start
with the 152 architectures. Remember, when we talk about domain architecture we're talking about the order of domains in a particular protein or in a set of proteins. The very first one in this list tells us that there are 16,000-plus sequences that have a P450 domain in them. Right below, you'll see a button that says show. If you click on that button, you can actually get all of those sequences, collect them, store them on your hard drive, and use them for some other sort of analysis. So it's a very quick way to get, all at once, a complete data set of, in this case, all of the proteins that have a P450 domain in them. If you want to go off and do a phylogenetic analysis or anything like that, it saves you the trouble of doing the BLAST searches, compiling them, editing, doing all of that. As we go down, we'll see that there are other architectures that are part of this family. Right below we've got a different architecture, P450 times two, where you've got two P450 domains next to each other; a little further down, P450 times three, so you've got three next to each other; and other domains are mixed in on some of the other lines as well. So again, a good tool to have in your arsenal, especially going back to how we started off this lecture, thinking about why protein domains are important and why it's important to think on the protein side of the house. Let's say you have a mutation; that mutation may fall in one of these domains. So have you now obliterated the domain by that simple point mutation or deletion or what have you at the nucleotide level? If you're thinking at the clinical level, do you have a mutation that knocks out some residue that's important for some metabolic process, something that is relevant in a structural or functional domain, that might explain why you see the phenotypes you do in your patient? So that's something else to think about as you consider things at the protein level. Now, the other link at the top: 18,000-plus sequences. So let's take
a look at what we see if we click on that. There we go. Alright, so here we can actually see pre-made alignments. You'll recall that in our handcrafted analogy, someone has put together a seed alignment: that initial set of sequences that were known to be members of the P450 class and that were used to find all of the other members of the class. So I could see just those 50, in this case, that were in that seed, or I could see all of the sequences that make up that particular protein family. In this case I'm just going to leave this set to the seed so I don't get anything unreasonably long. If I want to look at these, I can just click on the view button, and what I get is a multiple sequence alignment that looks like this. We're going to come back to this at the very end of the lecture, when we talk about the viewer that allows us to manipulate multiple sequence alignments, but the colors mean something and the histograms all mean something, and I'll tell you about all of that a little bit later.
Alright, so now let's say we're sitting at our machines again. We're going to pretend that we've gone back to the Pfam page, so we're on this page, and we're going to pretend that we're scrolling down. When we get to the bottom we see a number of external database links, and the one that I want to focus on is the one here where it says PROSITE, and then there is an accession number. If I click on that accession number, I leave the Sanger Institute and go to the Swiss Institute of Bioinformatics, where they have maintained for many, many years a database called PROSITE. This is a collection of protein profiles, those profiles that I told you about earlier that just tell you, position for position, what amino acids can exist at a given position, and that characterize a given set of proteins. In this particular case, for our P450 domain, we see the consensus pattern right there, and now you know how to read that, how to make heads or tails of it, so you
could use that as a basis for comparison. The more important thing I want to point out on this page is, once again, yet another executive summary and, more importantly, the name of the person who put it together. If you have questions about this particular protein domain, these people do make themselves freely available to members of the biological community. So you could just click on their name and send them an email, and if you have a question about these particular proteins, they will answer that question for you. It's always nice to have that person available to you when you've got questions that you just can't find the answers to in the literature or from your colleagues.
Alright, let's pretend we've gone backwards again, so we're on this page. We clicked on this link here at the bottom; I just want to focus your attention a little bit further up the page, where it says InterPro entry, with another accession number. If we click on that, it takes us away from Pfam and over to another database at the Sanger Institute called InterPro. InterPro is in essence what we call a secondary database: it is a collection of information that is amassed from a series of other primary databases. PROSITE is an example of one of those primary databases, and there are other databases that talk about protein domains and similar characteristics of proteins. What InterPro tries to do is collect those for you all in one place, so it's one-stop shopping. So again, we are looking at cytochrome P450. The first thing I want to draw your attention to here is something called the InterPro relationships. In this set here, you'll see it says children. The children are sub-members of the class: these are subfamilies that are part of the bigger cytochrome P450 family, so they are more specific members of the class. We've got a B class and an E class, various groups of cytochrome P450 proteins. Because the children are always
more specific than the parent, if you have a match to the child, you have a match to the parent. So they, by definition, have to overlap. Yet again, we've got another executive summary here. At the bottom we have some Gene Ontology term annotation, so we can see that this particular domain is involved in iron ion binding, heme binding, and so on. Again, let's pretend we're scrolling down the page; I will get used to this. Alright, and here we have a very funky representation, and what it is intended to convey to you is the taxonomic coverage of this particular domain. Put otherwise: what organisms have some protein in them that has a P450 domain? Do not read anything into the lengths of the branches or anything like that. This is not a phylogenetic tree; it is just a representation of which organisms have P450 proteins in them. The center of this tree is just the root. As we go further out, the inner nodes are the tree nodes, and by the time you get to the outside you'll actually see names of representative organisms, like mouse and human. The numbers represent how many proteins in that organism have a P450 domain, or at least how many characterized sequences. It may not be the absolute number of sequences, because you may have sequences that have been documented two, three, four times. But if you click on those links, again, you can download a sequence set and use it for some other purpose, possibly a phylogenetic analysis or something else. Okay, again, let's scroll down, and we now have another way of looking at domain architecture. In this case we've got a little bit different representation: what we have here, protein by protein, are the various entries that were found that have this P450 domain. In this case we've got a protein having this accession number. The protein is represented by the red bar, the domains are shown underneath, and the colors that correspond to each one of those domains are shown at the bottom. So, a little bit of a different way of looking at
what domains comprise a particular protein. Okay, alright, so putting all of that together, I want to point you to two additional things that I'd like you to read at some point. Again, these come from Current Protocols in Bioinformatics, which is the work that I'm editor-in-chief of, and what I want you to look at specifically are two units. The first one is a much deeper treatment of Pfam; the other one talks in great detail about InterPro. What's nice about these units, for those of you who have not looked at Current Protocols in Bioinformatics before, is that it's very similar to Current Protocols in Molecular Biology, Immunology, and the rest in that it is protocol-driven, hence the name. There are examples that you can work through, so just print out the pages, sit next to your computer, follow the instructions step by step, and it will teach you hands-on how to do this. You can only learn a certain amount by watching me walk you through these examples. The only way you're going to really, truly learn how to do these is to put hands on keyboards and actually start to bang away, so it's very important that you spend some time trying these techniques in practice. To remind you, this is available to all of you through the NIH Library free of charge. All you have to do is a search on online journals: type in the name Current Protocols in Bioinformatics, and that will take you directly to those listings.
Alright, before we leave domains altogether, I want to point out to you an NCBI tool called the Conserved Domain Database, which uses a slightly different method. Like some of the other things that we've seen, this is a secondary database. We can search everything that's in Pfam, both A and B; we can search something called SMART, the Simple Modular Architecture Research Tool; Clusters of Orthologous Groups, which is a collection of protein families put together by Eugene Koonin at NCBI; and some other resources as well. Now, in the interest of making
sure you don't use these things as a black box, I just need to point out that the searches here are performed using something called RPS-BLAST. This is a variant of BLAST, reverse position-specific BLAST, where your query sequence is used once again to search a series of those position-specific scoring matrices. It's the same general idea as what I showed you at the beginning of the lecture; however, the actual methodology, the actual algorithm that is used, is different from the one that is used by Pfam. Many times you will do searches where one of the two tools will give you one set of results and the other tool will give you an overlapping yet slightly different set of results. So the take-home message: when you want to do these kinds of analyses, to see what kinds of protein domains exist in your proteins of interest, do both. There's always comfort in consistency between the methods.
Alright, so let's again pretend we're at the keyboard, and I'll take you through an example. This is the CDD home page. There's a very long URL here; it's actually easier to just go to the NCBI home page and look for the structure link off of the home page, but if you want to type that in, there it is. Here we don't have any options to set. It's just a box, and we have a choice of which databases we can search. From the previous slide, we could search any one of those individual databases, or we could just search them all at once, so that's what we're actually going to do: just leave it at the default. The sequence we're using here is the deleted in colorectal carcinoma gene sequence, excuse me, protein sequence, from human. Alright, so let's say we've put that in the box and clicked on submit. This is what we get back. It's a little bit reminiscent of the BLAST results we saw last week, in that we have a representation at the top of our protein hits. Our protein is this one, going from 1 to 1447, and you'll see below that each one of these boxes represents one of the
domains that was found in this protein. I'm going to focus in on this very first one, which is called a neogenin domain, which is also the first one in our list of hits. In this hit list you'll see the definition line, just a brief description of what that particular found domain is; an identifier, which we'll come back to; and the probability value. Again, the same guidelines as last week apply: 10 to the minus 3, because we're doing a comparison at the protein level. If we want to see a little bit more about this first hit, all I have to do is click on that plus sign, and that will expand this out so I can now see an alignment. In this alignment I get a sense of how well my sequence of interest matches this particular domain consensus. My sequence, the human DCC sequence, is in the first line, so in my sequence positions 41 to 136 lined up with the domain. Every place you see a position marked in red, that's an exact match; everything else is either a conservative substitution or a mismatch. Just at a glance you can see most of the positions are in red, and that bodes well for this being a true positive. And again, here is the probability value, which gives us the quantitative measure of how good our hit in this case is. If I want to learn a little bit more about this particular neogenin domain that we found in our protein of interest, that's where this number now comes into play. This accession number is in a column that is labeled PSSM ID, position-specific scoring matrix ID, and if I click on that ID, that takes me to an expanded view in the Conserved Domain Database at NCBI: again, a quick summary of what is known about this particular domain and the references that support that particular description, so you can go back to what are deemed to be the more important papers on this particular domain. If I scroll down a little bit, we have a representation of how this domain relates to other domains, so there's a hierarchy here. We're just going to
bypass that; I don't find that to be particularly useful, but I just want to point it out to you. Most importantly, at the bottom is the sequence alignment, so very quickly you can see what this alignment looks like and who the other members are of this particular class. Once again, you can download these sequences to use for some other purpose in some third-party software, and there's a link off of this page that describes to you how to actually do that.
Okay, so with that, I want to now flip the analogy around. At the beginning we said we can do our searches one to many; in this case the one was our sequence of interest, and the many was either our profiles or our patterns. Let's flip that around, and now what we're actually going to do is construct a profile to enable us to find distantly related proteins, related to the one that we start with, the one that we're interested in. The way we're going to do that is by using a tool called PSI-BLAST, and the PSI just stands for position-specific iterated. This is incredibly easy to use, so let me tell you how this algorithm works. In step one, all we're going to do is a BLASTP search, the exact same way that we did last week: we take a sequence of interest, change the various parameters that we want to change, click on the go button, and it will just do that BLASTP search. Once we get back our list of hits, it's going to take that hit list and everything that is above our probability threshold that we're going to set; it will take all of those sequences in the hit list, construct a multiple sequence alignment, derive the position-specific scoring matrix, and now use that matrix as the input for the next round of searches. So we started with a single sequence and we ended up with a position-specific scoring matrix. We throw away that initial sequence, okay, and again just use these matrices, which will be recalculated round after round until we've identified all of the members of that particular class. So hopefully we will come to
convergence, where all related sequences are deemed found. If we keep going round after round and the numbers keep getting bigger and bigger, at some point you will have brought all of GenBank into your query, so your query at that point is deemed to be a little bit too broad. What you have to do then is just use a shorter region and make your cutoff a little bit more stringent. That rarely happens, but it's something to be mindful of.
All right, so how do we do this? Hopefully this now looks familiar to you. Again, here is the URL that takes you to the BLAST home page. Our protein-based searches are here under the protein BLAST link, so again we will click on that as we did last week, and this will take us now to the protein BLAST page. Okay, as we did last week, we're going to paste a sequence into the box; in this case this is a DNA-binding protein, a high mobility group protein, that we're looking for. Let's take a look at some of our options. The first thing we get to do is choose what database we want, and the same options are available as last week. We could pick RefSeq, which was the curated database I told you about last time around. I also alluded to Swiss-Prot, so let me tell you about that this time around. The same way that RefSeq at NCBI is intended to represent each molecule in the central dogma once and only once, so you only have one entry for each DNA sequence, mRNA sequence, and protein sequence, Swiss-Prot is intended to do the same thing but only on the protein side, so this is only a collection of protein sequences. This is a long-standing, 30-plus-year effort that has been going on at the Swiss Institute of Bioinformatics. What's nice about this is that, by definition of course, these are non-redundant, there's integration with other databases, and there's ongoing curation of these entries by external experts, so this really relies on active experimentalists in the field to keep these entries up to date. And more importantly, when you look at the feature tables in
these entries, you'll see a bunch of comment lines, so yet another executive summary by the active investigator in that field. So again, do take advantage of those resources whenever they're available to you. To discern, when you do a BLAST search, which ones of your hits actually correspond to a Swiss-Prot entry, look at the accession number (and again, the accession number is the unique identifier, the sequence's Social Security number): in the first position you will see either an O or a P or a Q, followed by five numbers. So when you see an O, a P, or a Q followed by five digits, that means that that is a Swiss-Prot entry.
All right, so let's go back to our search. I selected Swiss-Prot here, and the reason for doing that is quite simple: because this is a non-redundant database, we're going to just get a nice tidy list of results back, and we won't have multiple hits on the same sequence. All right, as before, hidden below the BLAST button are the algorithm parameters, so if we click on that, it'll expand the page open. Let's see what we have here. The very first thing we have is how many target sequences we want to have returned back to us. The default is 500; I just as a rule set that to the biggest number it'll let me, so I set it to a thousand. If you recall last week's example, if we left this at the default we actually would have missed things in our hit list, so just set that number as high as you can. The expectation threshold is the E value, our measure of whether something is a false positive. The default is 10; I'm going to go ahead and use my guideline of 10 to the minus 3, or 0.001. Again, remember those are guidelines; those are not absolutes. We're just going to use that as a starting point here. We should again filter the low-complexity regions. These are those homopolymeric runs that I told you about last week, where you've just got stretches of the same letter. Those tend to confound the BLAST searches, so we want to just mask those and not consider those as part of the sequence searches.
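To make the idea of masking homopolymeric runs a little more concrete, here is a minimal Python sketch. This is only an illustration of the concept, not the filter that BLAST itself applies (the real low-complexity filter is considerably more sophisticated), and the function name, the example sequence, and the run-length cutoff of 5 are all assumptions made up for this example.

```python
import re


def mask_homopolymer_runs(sequence: str, min_run: int = 5, mask_char: str = "X") -> str:
    """Replace every run of `min_run` or more identical letters with `mask_char`.

    Toy illustration only: the cutoff and the masking rule are assumptions,
    not the parameters of the filter used by BLAST.
    """
    # (.) captures one letter; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda match: mask_char * len(match.group(0)), sequence)


# Hypothetical query with a run of eight glutamines in the middle.
query = "MKTAY" + "Q" * 8 + "LSPED" + "AAA" + "K"
print(mask_homopolymer_runs(query))
```

The run of eight Q residues is replaced by X characters, so it would no longer contribute to any match, while the short AAA stretch falls below the cutoff and is left alone.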
Finally, at the bottom, something we didn't talk about last week: we now have a section of this page that is specifically geared towards these PSI-BLAST searches, and something else called a PHI-BLAST search, which we're not going to talk about today. Let's say you already had a scoring matrix and you want to use that as the input for your search; you could just upload that. If you found it maybe in some other database, you could use that as the input for your first round. We don't have that, so we're just going to go ahead and use our one sequence of interest and set the PSI-BLAST threshold at 0.001 as well. The default here is 0.005. There might be a virtue, actually, to making that a little bit higher, because as with last week, we want to take a look at what might be on either side of our cutoff line, using our biological knowledge to say which ones do I want to include and which ones don't I want to include. All right, finally, just my personal preference: make sure that you check this "show results in a new window" box to make things a little less unwieldy on your desktop. You go ahead and click on BLAST, and off we go.
All right, so now this looks like what we saw last week. Our sequence runs from 1 to 215 in this particular case, and it has found two matches to conserved domains, so there are two HMG boxes in this particular protein. You'll see once again our very busy scoring table that we talked about last week, showing a graphical representation of all of the hits that were found based on our initial query. Okay, if I now scroll down to where the descriptions are, just to remind you how this is oriented: you'll always see the accession number at the beginning, which is hyperlinked, and if you click on that it will take you back to GenBank. You have a short description of what that particular protein that was found actually is, then the score for that, and if you click on that, that will take you down to the alignments. Remember, the scores are less important here; the
probability values are what I want you to always look at, in the E value column. Okay, now, what you'll also see here, which you don't see in a regular BLASTP or BLASTN search, is at the very beginning the word "new", and you might see one of two things in that position. The key is here at the top: "new" means the alignment score was below the threshold on the previous iteration. Well, we didn't have a previous iteration, so everything is marked new, but as we go through the successive rounds here, anything that is carried over from the previous round will be marked with a green dot, and anything that was found that is new in the new rounds will then have the new label next to it. Let's move down to the bottom of our hit list. Here it says "run PSI-BLAST iteration" with a maximum of, once again, a thousand sequences returned. What will now happen, just to remind you, is we throw out our initial query and we have a position-specific scoring matrix calculated based on the things in this list. You can include all of the things in the list, hence all of the checkboxes, but again, here's where your biological knowledge comes into play. Maybe you see things towards the bottom of this list that do not belong; if they don't, because you have some other piece of information, uncheck the box, all right, and that way it will not be included in the next round. Go ahead and click on go, and this will go around as many times as it has to. You will just have to wait for it, and you just keep clicking on go, and you go two, three, four, five. In this case you will go around eleven times until you finally reach convergence, and you know you have reached convergence when you see the message at the top: no new sequences were found above the 0.001 threshold. So at this point we can have confidence that we have found all of the members of this, in this case, high mobility group class of proteins that we can find using this particular method. To drive home why this is a powerful technique that you should have in your
arsenal: if you recall, in round one (I don't think I pointed this out) we had 132 hits; by the time we got to round eleven we had 180 hits. Okay, so we found 48 additional sequences that we would not have found just by using our traditional BLAST search. So this is very important, especially if you're dealing with proteins that have not been highly conserved over evolutionary time. For the things where there's a lot of evolutionary pressure to not have mutations, to not have changes, you're not going to really pick up anything by using PSI-BLAST, but for most classes of proteins, where there is wiggle in these classes of proteins, it's worth taking the time to do this extra set of searches to see what else you can find. So this hopefully demonstrates now the power of using the collective characteristics of the protein family to find things that we wouldn't have otherwise found.
Okay, so with that, we're going to leave things at the sequence level behind for now and we're going to move into things at the three-dimensional structural level. I know most people tend to do this with a little bit of trepidation, because I think we all have in the back of our heads, you know, the image of the geek down the hall with the Coke-bottle glasses and the big machines with all the dials on them, and just structure as being something impenetrable, something that is just hard to understand because of the technology that's involved in generating those three-dimensional structures, trying to figure out what Fourier transforms are and all of that. What I want to show you this morning is that you actually don't have to concern yourselves with how we got these actual structures; there are some very easy-to-use tools that you can use to now answer questions about structural similarity. And the reason I want to make sure that you know about these tools is because of the very basic tenet that Chris Anfinsen put forward back in the 1950s and later won the Nobel Prize for:
sequence specifies conformation. This was battered into all of our heads in basic biochemistry. But conformation does not specify sequence; the converse of that statement is not true. So you might have multiple structures where you see similarity at the structural level, whether it is in a particular domain or across the entire protein, but if you look at the underlying sequences that make up those proteins, you might see very, very little sequence similarity, and there are cases in the literature where that percent identity goes down to the 10, 11, 12% range. Why is that important to you? When you do your BLAST searches, BLAST tends to start to fail below 25% sequence identity. This has been very well documented in the literature, and as you recall, last week I gave you a 25% rule as one of your criteria to use for determining the biological significance of your BLAST hits. So when you start to enter that territory where BLAST is no longer the tool of choice, this is the thing that you should have in your arsenal, ready to go, to find things that, as I put here on the slide, cannot necessarily be detected through traditional methods.
All right, so again a little bit of background on how this works, and this is actually pretty cool. What we're going to do now is compare every known protein structure to every other known protein structure, and to give you a sense of the size of that problem, as of this morning there were 62,000 entries in the Protein Data Bank, PDB, and 58,000 of them represent protein structures determined either by NMR or X-ray. Now, to do a comparison of one structure to another structure using the most robust methods that we have takes on the order of weeks to months, depending on how much computing power you have. If you multiply that out, 58K by 58K, you have now entered the realm of the computationally intractable, so we need to make some modifications and make some simplifications in order to make this a more approachable problem from a computational standpoint. So
the way we're going to do this is (well, this is done for you, but so you understand what's going on) we're going to take each one of these 58,000 protein structures. Here's just an example of a particular structure. We see a bunch of blue bits; the blue bits here are the alpha helices, and the green lines represent the beta strands. What we're going to do in the first step is just get rid of all of the loop regions, anything that does not exist either in an alpha helix or in a beta strand, so that's what we have in the second part of the picture here. What the method then does is, for every alpha helix that it finds, it passes a vector right through the center of that alpha helix to approximate the path of that helix, keeping in mind which end is the N-terminal end and which end is the C-terminal end. If it sees a beta strand, it will just do the same thing: draw a vector that approximates the path of that beta strand. Once that is done, all of that information is thrown away, so by the time we get to the last step, every single atomic coordinate has been thrown out. Okay, but based on those coordinates, we now have a series of vectors. For each one of those vectors we know which one is an alpha helix, which one is a beta strand, which end is the N-terminal end, which end is the C-terminal end, and which one connects to the next one in the series. Okay, that is now going to be the basis for our comparison, and this now turns into a glorified game of pick-up sticks. So we now have here, for the example, two protein structures, protein one and protein two. The first one has four secondary structural elements, the second one has five, and for argument's sake we're going to say they're all alpha helices, just to make this easy. We're now going to overlay these every way we can to find out whether or not these are structurally similar to each other. So in the first alignment we might take all four of these secondary structural elements and overlay them with the four secondary structural elements, the four
alpha helices from protein two, and what we would see is, well, they're all going pretty much on top of each other; they're all going in the same direction. Of course this is done much more robustly than this; it is not done by eye, there is actual mathematics behind this, but I think you get the idea. Basically, if we see something like this, we would deem that to be a good match of the two structures over that region to one another. Let's take another alignment, where you might take all four of these alpha helices from protein one and now combine those with one, two, three, and five from protein two. As before, one, two, and three are all going the same way, pretty much the same path; five is off doing its own thing, so we would not consider that to be a good alignment. And that is just done over and over again, every possible combination that the computer can come up with. It does its math in the background, and what you end up getting at the end of the day is something looking like this. In your handout, skip ahead one slide; the next two slides have been transposed in your handout. This is remarkably good. Okay, these are two protein structures that have been deemed similar to one another by this VAST method. Keep in mind what we just did: we threw away all of the atomic coordinates, we're comparing these series of vectors, and yet these two structures pretty much overlap each other almost perfectly. Now, I've sort of pushed it in this case, where I picked an example where there's only one mutation between the two, but what you see time and time again is just an incredibly good match between the structure that you started with and the other ones that are found in PDB using this VAST method. All right, and we're going to come back to this representation in a moment. So just to remind you of some of the caveats of the method: by definition, because we've thrown away all of those atomic coordinates, it is not the best method for determining structural similarity, because we've lost a lot of information along the way.
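The pick-up-sticks idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the actual VAST algorithm: each helix is reduced to a single vector from its N-terminal end to its C-terminal end, the two proteins are assumed to already sit in the same coordinate frame, only the direction of each vector is compared (position and connectivity are ignored), and all names and coordinates are made up for the example.

```python
import math


def element_vector(endpoints):
    """Vector from the N-terminal end to the C-terminal end of one element."""
    (x0, y0, z0), (x1, y1, z1) = endpoints
    return (x1 - x0, y1 - y0, z1 - z0)


def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairing_score(protein_a, protein_b, pairing):
    """Mean directional agreement over the paired elements (toy score)."""
    scores = [
        cosine(element_vector(protein_a[i]), element_vector(protein_b[j]))
        for i, j in pairing
    ]
    return sum(scores) / len(scores)


# Hypothetical endpoints (N-terminal end, C-terminal end) for each helix.
protein_one = [
    ((0, 0, 0), (0, 0, 10)),
    ((5, 0, 10), (5, 0, 0)),
    ((10, 0, 0), (10, 0, 10)),
    ((15, 0, 10), (15, 0, 0)),
]
# Protein two has the same four helices plus a fifth one pointing elsewhere.
protein_two = protein_one + [((20, 0, 0), (30, 0, 0))]

# First alignment: helices 1-4 against helices 1-4 (zero-based indices).
print(pairing_score(protein_one, protein_two, [(0, 0), (1, 1), (2, 2), (3, 3)]))
# Second alignment: helices 1-4 against helices 1, 2, 3 and 5.
print(pairing_score(protein_one, protein_two, [(0, 0), (1, 1), (2, 2), (3, 4)]))
```

The first pairing scores higher than the second, because the fifth helix is "off doing its own thing". A real comparison would also have to find the best superposition, respect how the elements connect, and attach statistics to the result.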
So we have less confidence in our prediction, but regardless of the simplicity of the method, it is a great first approximation, and as I will show you in a minute, you can do this yourselves right at your desktop. If you find something that has promise using this VAST method, you can then maybe seek out somebody who's a structural expert to say, all right, I want to delve into this a little bit more, and get some help in that direction. But this is something that you all can do yourselves.
All right, so how do you do that? We're going to go back to the NCBI website, and so this is the NCBI home page. We're going to use the Entrez search engine in the upper right-hand side to do our search. In the search pull-down I've selected Structure, and I'm putting in an accession number so that I get back one and only one entry. In this case, the accession numbers in PDB take the form of a number and three letters, so the one I've used here is 2LIV. While we're here, let me point out the Structure link here as well, so if you want to get to the Conserved Domain Database that I showed you a few minutes ago, instead of typing in that long URL you could just click on that as well. All right, so search Structure for 2LIV, click on the search button, and that now takes us to our results page. We get back one and only one result, which we would expect because we've used an accession number here, and so here is just a little pictogram of the structure that we found and the accession number. It tells us that it's a periplasmic binding protein that is called a leucine/isoleucine/valine-binding protein. If we want to learn a little bit more about this, all we have to do is just click on the accession number, and that will take us to this structure summary page: again the pictogram, the representation of our picture, and the primary reference that describes the solution of this particular structure. In this case this is an X-ray structure, and there's just a link back to that reference if you want to read the paper. This comes
from E. coli, and down below we have a representation of what else we found in this particular protein. Our sequence is sequence A, the structure that we started with, 2LIV, and it shows us below some domains that were found to exist in that particular protein. What I'm going to do is use this graphic as our jumping-off point to do the VAST search, so we're going to now let the machine do this little vector alignment method to find what other structures are similar to the one that I've started with. In order to do that, it's very simple: all you have to do is just click any place on the bar that is labeled sequence A. When you do that, you get this very, very busy slide. Your protein, the one that you started with, is right here, this bar at the top, and you'll notice below a bunch of sometimes continuous, sometimes discontinuous bars, each one labeled with a PDB identifier. Each one of these represents a protein deemed similar in structure to the one that you started with. What the discontinuities represent are places where you don't have structural overlap between the one you started with and the one that was found. So this is just a visual to convey to you: do I have a global alignment across the entire length of my protein, as we do in the first case, with the exception of one residue, as you'll see in a moment, or do I have something where I have domains in common? That shows you some of the power of VAST: we don't have to force a global alignment. The same arguments we used last week apply about why we want to do local alignments versus global alignments, both at the sequence and the structure level. All right, and in a Current Protocols reference I'm going to give you a little bit later, you can see how to change how this looks to render it as a table. There are some statistical guidelines in that unit that I think would be useful to you in helping to decipher how these tables, these lists of results, should actually be interpreted. I want instead, though, to focus on actually looking at this structure.
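One way to think about those continuous and discontinuous bars is as a list of aligned residue ranges along your query. The following Python sketch, with made-up numbers, shows how you might summarize such ranges: what fraction of the query is covered, and where the discontinuities fall. The function names, the query length of 100, and the segment coordinates are assumptions for illustration only; they are not taken from the actual VAST output for 2LIV.

```python
def covered_positions(aligned_segments):
    """Set of query positions covered by (start, end) ranges, 1-based inclusive."""
    covered = set()
    for start, end in aligned_segments:
        covered.update(range(start, end + 1))
    return covered


def coverage(query_length, aligned_segments):
    """Fraction of the query that overlaps structurally with the hit."""
    return len(covered_positions(aligned_segments)) / query_length


def discontinuities(query_length, aligned_segments):
    """Stretches of the query with no structural overlap, as (start, end) ranges."""
    covered = covered_positions(aligned_segments)
    gaps, gap_start = [], None
    for position in range(1, query_length + 1):
        if position not in covered:
            if gap_start is None:
                gap_start = position
        elif gap_start is not None:
            gaps.append((gap_start, position - 1))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, query_length))
    return gaps


# Hypothetical hits against a query of 100 residues.
global_like_hit = [(1, 100)]          # one continuous bar
domain_only_hit = [(10, 40), (60, 80)]  # two separate bars

print(coverage(100, global_like_hit), discontinuities(100, global_like_hit))
print(coverage(100, domain_only_hit), discontinuities(100, domain_only_hit))
```

A hit with coverage near 1.0 and no discontinuities corresponds to the global-looking bar at the top of the list, while a hit with lower coverage and several gaps is the "domains in common" situation.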
So at the very top you see a link that says "view 3D alignment", and if you click on that, this will launch a viewer called Cn3D, which stands for "see in 3D". I would have loved to have been in the planning meeting to come up with that name. All right, so you can download that software by following this link here; this will work on a Mac, a PC, or a Linux box. But let's go ahead and launch Cn3D. You've already seen this slide, and so here we can change how the representation of this protein is rendered. In this case the rendering is something called tubes, which is just the lines that you see. The coloring in this case I have set to identity, so in this case there is only one mismatch, of an A to a V, between the two sequences, and that is this blue bit all the way here at the top. The reds are the matches; the blues are the mismatches. I could take my cursor and just highlight any part of this sequence, and it would light up the corresponding part of the structure in yellow. By the same token, I could click on any part of the structure and it would light up the corresponding residue in the alignment below. So it's a really cool tool to find out where each particular residue actually lies, so you can start to think about things instead of just being a string of letters, not having any sense of how this thing folds. Now you kind of know, all right, and that's very important. There are two other views of this that I think would be important, and again the Current Protocols unit describes how to get these. For each one of these (I'm sorry, the alignment's a little off here) there is a rendering setting and a coloring setting under something called the Styles menu. The one on the left is what I like to call the seminar view, because this is a very nice way to orient people in your audience to what your protein structure is all about; the settings for those are shown below. In this structure you see a bunch of green Crayola crayons. What those crayons are are the alpha helices; the
flat end is the N-terminus and the pointed end is the C-terminus. The brown bars are the beta strands; again, the flat end is the N-terminus and the pointed end is the C-terminus, so you can see how they are oriented throughout the structure. Now, certainly we don't have proteins in the cell floating around with a bunch of Crayola crayons in them, so we have other ways to render this to get a more realistic representation. The one that is on the right is the space-filling representation, so the best approximation we can get using this tool of the true three-dimensional structure of this protein, the shape that is being presented in the cell. In this case I've also colored this using the charge setting, so any place you see blue is a positive charge, any place you see red is a negative charge, and everything else is neutral. I've already told you that this is a leucine/isoleucine/valine-binding protein, so you might be able to look at these charge distributions and see how they might point you towards a binding site or some other important part of the molecule that is actually involved in its function. So it's incredibly easy to use; I mean, as easy as it was for me to explain this to you, if you go back to your offices and actually sit down and do this, it is that easy to use. Okay, so I would encourage you to do this, because it's going to help you conceptualize better what your proteins of interest, the ones that you yourselves are working on in your laboratory, are all about. You now have a sense of what's on the surface and what's buried, and what might be in an active site, or in a binding pocket, or some catalytic site, or something else that's important to what this particular entity actually does. So if you're doing experimental design, you make more intelligent choices now, because you know where the binding sites and other functional features are, so you don't just do random site-directed mutagenesis to determine gain of function or loss of function; you can hone
down your experiments a little bit better. So please do take the time to avail yourselves of this tool. It's incredibly useful and, again, more importantly, will serve you very well in those instances where simple sequence comparisons just won't be up to the task.
So, some additional reading, one more time. I've alluded to this several times now: this is a unit that I've written in CPBI, Unit 1.3, which talks about Cn3D, which we talked about in this lecture. You'll have a little bit more information in there on how to make those views, how to label them, how to export them, how to get them into your PowerPoint presentations, and so on. But also, for those of you who are interested, that unit takes you through a very rigorous overview of how to use Entrez. I know most of you have probably used Entrez at some point, but you probably know how to find those papers and maybe find a sequence, and maybe don't exactly know how to use Entrez to its best advantage, the full power behind Entrez. So that one I think you really should consider as required reading, not just optional. Finally, the second one here is an introduction to modeling protein structure from sequence. We don't have time to talk about this today, but let's say you've done your VAST search and you come up with nothing, that there is no other structure similar to the one that you're starting with. And remember, those comparisons are all of solved structures; we're not doing any de novo structure prediction there. But let's say you actually do want to do that, that you want to start to say, right, I want to model my protein to see the effect of a mutation, so I can just interchange a residue for another residue to see what would actually happen. There are a number of more advanced methods that allow you to do that, and this unit gives you an overview of those methods, what to use where, and where to find them, so I would encourage you to do that as well if your research takes you in that direction. Okay, so now in the last 25 minutes
we're going to fly through multiple sequence alignment. It's important that we talk about this: one, it pulls together a lot of the concepts we've gone through in the last lecture and a half, but it also will set us up for things in future lectures, where you will see these alignments over and over again, whether it is in the context again of phylogenetic analyses, or even starting next week when Tyra starts to talk to you about the genome browsers and the various tracks on them; those are in essence alignments. Okay, so it's important to understand where they come from. I realize that most of you have probably done them before, but I want to give you some general guidelines to make sure that you're performing them properly and to basically just bring your game up a notch, to make sure you're using these methods in the most advantageous way.
All right, so why do we even bother doing these things? What do we stand to gain, what can we learn, by doing these multiple sequence alignments? Quite simply, they very often can allow us to identify conserved regions and patterns and domains, all the things that we spent the first half of this morning speaking about, and again, as I've harped on many times now, those things can be used to your advantage when you think about questions in experimental design, in predicting the structure and function of a possibly unknown, brand-new protein that you've discovered, or to identify new members of a protein family. You absolutely have to do a multiple sequence alignment if you want to do a phylogenetic analysis. It is actually impossible to do it any other way, because all of the phylogenetic methods depend on the construction of a multiple sequence alignment as their input. Okay, so it's not that you can just stick in a bunch of sequences and get back a tree; you have to actually start off with a multiple sequence alignment, and then that alignment will be used to generate the tree. We've already talked about the next point, about using this for the generation of
position-specific scoring matrices, and this also might bolster confidence if you've done secondary structure prediction. We haven't talked about predicting alpha helices and beta strands in either one of these two lectures, but let's say you have done that and the statistical support isn't very strong. If you do a multiple sequence alignment and see commonality in those secondary structural elements across a whole host of different proteins that you've aligned, then that gives you better confidence in those predictions as well. So again, it's the same spirit as when, sometimes in the laboratory, you might have a result that you're not exactly sure of and you use another technique to verify; the same game applies here.
All right, so what do we need to consider when we do one of these multiple sequence alignments? Well, of course we're looking for absolute sequence similarity. What we want to do in these alignments is, in each column, get as many absolutely conserved positions as possible, line up as many common characters as we can. Of course, as with our scoring matrices, we can't always have absolute conservation at each position; sometimes we have conservative substitutions, so we take those into account as well. Finally, you may be lucky enough to have one of the sequences in your set where there is a known structure, and that information can also be used to fine-tune the alignment, to add greater support to the ultimate multiple sequence alignment that you get.
All right, some general guidelines, things to keep in the back of your head as you do this yourselves. We tend once again to concentrate on the protein level rather than on the nucleotide level, and that's just because it is more informative. There is more information content in protein side chains than there is by looking at just the chemistry of each of the individual nucleotides; that's just because of the different structure of each of the 20 amino acid side chains. So because of that, it's less prone to inaccurate alignment, so we're
taking those physicochemical properties into account. You can certainly translate these back to nucleotide sequences after doing the alignments. This certainly depends on the context that you're doing this in, because you might be trying to align nucleotide sequences where there is no protein translation, and we'll see examples of this when Laura Elnitski gives her lecture in four weeks, when she starts to talk about regulatory elements and considerations and epigenetics. All right, some more guidelines. You need to use a reasonable number of sequences. The temptation is to throw everything you have at the method, but that actually will start to work against you, because this is a global alignment method. Again, we talked about global alignments last week. The more alignments you do, the longer it takes and the harder it gets, so most of the alignment algorithms start to fail when you try to line up too many sequences. And the truth is, you sort of reach a point of diminishing returns: if you've gone from 40 sequences to 50 sequences, are you really learning anything more by adding those last ten sequences? Chances are, if you've selected your sequences wisely at the beginning, you really don't have to have very huge input sets. Also, the phylogenetic studies that might arise from this are almost impossible to do. I remember once where I tried to do an alignment and phylogenetic tree on a set that had something like 130 sequences in it, and it took a month for the computer to finally get around to giving me an answer. So it becomes very, very computationally unreasonable. So, a good starting point: 10 to 15 sequences. That seems to be what folks like to use in the literature, and your ballpark upper limit is around 50 before you start to see some of the problems that I've mentioned to you. All right, so that takes care of the number of sequences. What about the nature of the sequences? Because it's a global sequence alignment, again, it works best when you've
got sequences of about the same length. You want to use closely related sequences; those will tell you what's, quote, required, what residues are absolutely conserved. If you use more divergent sequences, you can use those to study the evolutionary relationships in that group of sequences. So you want some of both. A usually good starting point is sequences that tend to be 30 to 70 percent similar, so you have a fairly wide berth to work in here. But the last point is really the most important one: the most informative alignments really come when you've got a combination of these, the closely related and the divergent ones, things that are not too similar. Well, if they're all too similar, you're not going to learn anything; you already know from the single sequence pretty much what you could find out if they're all almost exactly the same. But if they're too different, you then end up in a situation where you just computationally can't align them. So again, 10 to 15 as your starting point. This should also be an iterative process, where you start with that 10 to 15, do the alignment, and see how it looks. So examine the quality of the alignment. What I mean by that is, how many gaps are in the alignment? Remember, each one of those gaps represents a biological event, either an insertion or a deletion, so you have to keep those to a reasonable number. You just can't willy-nilly put them in. If the alignment looks good, add some more, do the alignment again, and just keep going in that fashion. If you see that the alignment is starting to break down, so, for example, you might have just inserted a sequence that now is putting an inordinate number of gaps in, just take that one out. Okay, so there's an element of fine-tuning that you will learn over time; it's actually rather intuitive. All right, now, that's how to make them. How do you interpret them? When you see a particular column in your multiple sequence alignments where you've got absolutely conserved positions, that indicates to you
that those are required for proper structure and function; they have been conserved for a reason. When you see relatively well conserved positions, those are the ones where you can say, all right, I can tolerate a certain amount of change and not adversely affect the structure or function of the protein. Now, most people tend to concentrate on these, looking at the commonalities, but I think it's actually quite interesting sometimes to look at the differences, the positions that are not conserved, because those are allowed to mutate freely, and this is really sort of the source of evolutionary innovation, where Mother Nature can actually come up with changes in those proteins that can be tolerated because the original function is supported, but maybe start to develop new proteins that have slightly different functions in the cell. All right, if you see gap-free blocks, those are usually regions of secondary structure, alpha helices or beta strands, because you just can't have a gap in one of those. If you see gap-rich blocks, those are usually unstructured regions or the loop regions. Okay, that is what it is. All right, so enough of that. How do we actually do this? The method I'm going to describe to you today is something called ClustalW, and so this also comes from the folks at the Sanger Institute, and this just very simply allows you to take a sequence set of interest and do your multiple sequence alignment. There's a standalone version that you can download. The web-based version, though, is very nice, and that's the one I'm going to show you, and then you also don't have to worry about whether you have the latest version. So how does this actually work? Again, to get us away from the black box, a little bit of background. What this is using is a method called a progressive alignment method. Regardless of the number of sequences that I start with, it's going to only align two sequences at a time; it's not going to attempt to align them all at the same time. And building on these pairs of alignments,
it's going to gradually build up the multiple sequence alignment, clustering them on the basis of similarity. It's going to use the same kind of matrices that we talked about last week, and the same kind of affine gap penalties, to calculate the alignments that have the best score. By doing it this way, there are two major advantages: it's incredibly fast, and the alignments are generally of very, very high quality. So that's why I like to use this kind of method, with some caveats. All right, so what does this actually mean when we say progressive alignment? Here's a sequence set that I've put together, four sequences, A, B, C, and D, and in the first step what I want to do is just calculate how identical each one of these is to all of the others, so A to B, B to C, C to D, and so on. And again, with apologies for the alignments here, in order to do that it will do all of these alignments, because of this equation here that dictates how many alignments have to be done based on the number of sequences. If we have four sequences, it results in six alignments, but as you can see, the numbers get pretty big as we go down, so if you've got a hundred sequences, you're up at about five thousand alignments, and this sort of drives home what I was saying a little bit earlier about the set being a little too big. All right, so here are the four sequences again. I've computed my scoring matrix, so A, B, C, and D going across the top, A, B, C, and D going down the side. Across the diagonal is the comparison to self, so of course they are a hundred percent identical to themselves. But now I'm just going to look for the largest numbers in the table to see which ones of these are most related to the others. What I see here is A is most related to B, 80%; C is most related to D, 92%. So I'm going to treat A and B together and C and D together, since A and B share greater similarity with each other than with C or D. And so what I'm going to do is, I'm going to take A and B and align them with each other, create an alignment called AB, and I'm going
to fix that. Same thing with C and D: align those two, fix the alignment. Now, in the next step, I'm going to take that fixed alignment of A and B and align that with this fixed alignment of C and D. All right, so we're now using these blocks of alignments and aligning the alignments to each other, and we just do that as many times as we have to until we get all of the sequences into the alignment. Okay, so we start with individual sequences, we build these little sub-alignments, and we align the sub-alignments with each other. What that allows you to do is, because we're starting where there is the greatest amount of identity, we do the easy ones first, okay, and we use the information that we build along the way to inform how to do the harder alignments. So it's not that different from the philosophy that we used in PSI-BLAST: we start by finding the things that are common and then use those collective characteristics to build out. The problem with that, though, is that if there is an error in the initial alignment and you've fixed it in place, you're going to propagate that error throughout the alignment. So what is nice in this new version, ClustalW2, and why I like this over some of the other progressive alignment methods, is that it allows something called a remove-first step, where you can actually backtrack and take things out and put things back in, and it will recalculate to improve the alignments. I'll show you how to throw those flags when we get to the screens. All right, so what do we get? Well, once we do that, the scores are used to inform how to do this build-up of the alignment. Luckily, we get a multiple sequence alignment out; that's the whole point. We also get two tree-based representations. One is called a cladogram, and this is just that built-up tree that I showed you, what sequences were aligned together to construct that multiple sequence alignment. So it's an estimate of phylogeny. The branches are all of equal length. It gives you some idea of common
ancestry, but you don't have branch lengths that give you an indication of evolutionary time. The phylogram is basically the same idea, but now the branch lengths do vary, to give you at least a visual idea, proportionally, of how much evolutionary change has taken place over time. You will get back conservation patterns, so when you see the colors on the various alignments, this is the scheme that it uses to determine what is a conservative substitution. So the aromatics are all grouped together, the positively charged residues, the negatively charged residues, and some other classes that are common to the ends of helices and catalytic sites, and of course cysteine is in a special class of its own because of its role in maintaining those cysteine cross-bridges that are very important to structure and function. The interpretation here is strictly empirical, so I don't have numbers to point you to, as I have with the other methods, to give you some general starting points for cutoffs and similar considerations. What you will get back for each column in your alignment: at the bottom of each column you'll see one of these three marks, or just a space. If you see a star, that is an entirely conserved column. You want to see those stars in at least 10% of the positions across your alignment to consider that to be a good alignment, and that's a generally accepted indication of a good alignment. Conservation is dictated by those groups that I just showed you, so if you have only residues from a single one of those groups at a particular position, you'll see the colon here, according to that color table. This one's a little interesting: if you just see a dot, this is what they have come up with calling semi-conserved. That just means you have residues from two of those conserved classes, but no more than two of those conserved classes. So really the ones you should focus in on are the first two here. This is just the coloring table; you can look at that on your
own. Here are the screens that I want to focus on. Starting off, the first thing I want to show you is where it says matrix. Right now it says default, but we have, as before, choices here. We spent most of our time last week talking about the BLOSUM matrices; I mentioned the PAM matrices. You'll notice, though, that there is no number here. So remember last week: BLOSUM62, BLOSUM80, BLOSUM30, and so on. What the method will do is pick the appropriate matrix depending on which sequences it's trying to align, so it will change them as it goes through your sequence set. So I usually pick BLOSUM there. The other thing I want to show you here, what's in the red box, is how you control that remove-first step. All right, so that you don't end up in a local minimum, where you've made a wrong alignment that propagates through your tree, you can use these settings to dictate when that procedure takes place. If you pick tree under iteration, it will happen at each and every step; more computationally intensive, but it probably is the safer bet. If you leave it on alignment, it'll only do it in the final step. The number of iterations defaults to three; again, just bump it up all the way to ten. Down, whoops, down below, I've just pasted five sequences in here. These are all FOS proteins, and these are also on that web page that you have for you to do your practice from. So at this point I would click run down at the bottom here, and this is what I get. At the top, just a series of links that will allow me to jump around my page. Right below, I just have all of my scores that were used to inform how this particular alignment was constructed. If I scroll down, here's my alignment. After all of that, finally, you have the alignment, colored according to the color table in your handout. Okay, I'll show you how to change this momentarily. Going further down the page, here is the cladogram, so you can get a sense of how the alignment was constructed, but also some sense of phylogenetic relationships. Again, the
cladogram doesn't give you any sense of evolutionary distance. If you click on phylogram, it will recast it into this format, where you now have a better sense of evolutionary distance. This is not a phylogenetic tree, okay? I will show you how to make one; that's not it, though, so remember that. Okay, how are we going to get there? How are we going to make that tree? We're going to use a tool called Jalview, and this is a Java applet that can be used to manually edit your alignments. So let's say the method has created the alignment, but you want to fiddle with it, you want to move things over a little bit. You might have a reason to say, well, this residue should really be aligned in this column rather than in that column. So you can actually have some fine control. You can change the colors, you can do consensus sequence calculations, and, more importantly, second from the bottom, that's where we're going to finally make our phylogenetic trees. To get to the Jalview applet, there is a button at the top of your results page that says Start Jalview, and if you click on that, you get a new window popping up that looks like this. On the next several slides, up at the top here, I'm going to give you a path. The menus are in this teeny tiny, barely visible, like one-point type here at the top, but I just want you to know where those are. All right, so here's your default view: our five FOS sequences, going from one to the end, and three histograms to indicate to you the quality of your alignment. The first line says conservation, so this is just an indication of percent identity, how identical is that position. That goes hand in hand with the alignment quality, so these usually parallel each other. And finally, at the bottom, you see a consensus sequence. In some positions you'll see a plus sign, and those are just positions where no consensus sequence could be reliably determined. All right, now, so that's my default view. Let's play with this. First one: if I go to these menus, go to the color menu, and then
pick percent identity, the color scheme changes to these shades of blue. Here's what the shades of blue correspond to. Why this is useful to you is that it very quickly allows you to find motifs in your own alignment that are putatively important. So when you do this, you should look for blocks of higher absolute sequence identity, and what that is going to tell you is what parts of these sequences have had on them some sort of evolutionary pressure to not change, okay, to keep those residues conserved. Let's change it one more time. Now we go to the calculate menu, in this teeny tiny menu here, and then just ask for a pairwise alignment. Before doing that, I've highlighted two of the sequences here, FOS_CHICK and FOS_MOUSE. I click on pairwise alignment, and there's my pairwise alignment of the two to each other, and it will also give me a sense of the percent identity as well. All right, let's do it again. This time, let's say I select all of my sequences, I go to the calculate menu, and I ask it to calculate a tree. There are four choices there. The one that I picked is something called neighbor joining using BLOSUM62, and here now is our first phylogenetic tree. Okay, so this shows you the relationship between the five species here, the five sequences, with mouse being most related to human, rat being most related to mouse, and so on. You can overlay evolutionary distance information, bootstrap values, and so on onto these to get a sense of the quality of your phylogenetic tree. I like this a lot because it's integrated in with the multiple sequence alignment generation; you don't have to take your data off to another program to make those trees. But this will only take you so far. Some of the more advanced tree-building methods have many, many more options available to you, where you can fine-tune what these alignments look like and dictate how many times it should recalculate the trees, and so on. So, in the last two minutes, once more, further reading: there's an entire unit in
CPBI on ClustalW, again with examples for you to work through. There's another method that I want to point your attention to, called T-Coffee. Everything we did in the example just now relied on sequence data to construct the alignment. What T-Coffee will do is, if there is a structure available for any of the sequences in your set, it will use that information to better construct the alignment, so again using that three-dimensional information to give you a better alignment at the end of the day. So with all of that, hopefully now, over the last three hours, you've started to gain some appreciation for why some basic understanding of what underlies these methods is important. I know many of you have used the methods before, but have probably just, as I mentioned at the very beginning of the first lecture, stuck the sequence in the box, clicked on BLAST or clicked on go, and gotten your results back. But there are things that you need to be mindful of to make sure you use these methods to their best advantage. Hopefully I've given you some hints you can take away with you to use them to your best advantage and, more importantly, to avoid some of the pitfalls that others have fallen into. It doesn't require a hardcore understanding of the techniques; as you see, we didn't go into any math. All right, I've shown you an equation here and there, but we haven't really discussed them. It's just that you should have a general understanding of how they work so you make the best choices, so that you don't, again, treat this as the black box where a sequence comes in, results come out, and you just trust them. Okay, and the thing is, what I'm telling you is, use them in an intelligent fashion. Inspect the results to make sure that, based on what I've told you over the last two lectures, they make sense from a computational standpoint, but that they also make sense in the context of the biology that you already know. When you put those two things together, that will always serve you incredibly well. Okay, so I leave you
with that, and a reminder about next week's lecture. Tyra Wolfsberg will pick up for me from here. We're going to move to a more decidedly genomic point of view next week, where Tyra will be giving you a very, very nice overview of how to use the various genome browsers to mine genomic data. All right, so I'll be happy to take any questions down at the podium. Thanks once again for