So, on behalf of Andy Baxevanis and Eric Green, I'd like to welcome you to week four of this lecture series, both those of you sitting in the room and those of you joining us at our remote sites. Before I get started, I need to disclose for those of you applying for CME credit that I have no relevant financial relationships with commercial interests to disclose today. So, what I'm going to be talking to you about today is how to get to publicly available sources of genome sequence data. As the Associate Director of the Bioinformatics Core, I get a lot of researchers in our institute coming to me wanting to know how to get genome data, genome annotation, gene sets, what have you. And normally what we use for these data analyses is one of the three genome browsers that I'll be talking about today. The three major ones are one at UC Santa Cruz, the University of California, Santa Cruz, one at Ensembl, which is located in England, and one at the NCBI. So, just out of curiosity before I start, how many of you are familiar with genome browsers? Most of you? Okay. Well, I will do my best to keep this at a level where those of you who haven't seen a genome browser can follow along, but hopefully those of you who have used a genome browser in the past can pick up some new tidbits of information. So before I get started, let's just go over what you can expect to see in a genome browser. Well, basically all three sites start with the same source of information, and that is genomic sequence from a variety of organisms. And then each site independently calculates annotations on that genomic sequence. What most people want are the positions of genes. Those are provided. You can also see things like SNPs, homologous sequences from other organisms, and a whole host of other types of data. In terms of genes, they're annotated in a couple of different ways, but the methods are all fairly similar.
So basically in order to annotate a gene on a genome, you would start with a set of messenger RNAs from that organism, align those messenger RNAs to the genome, and where you see the best alignment, you would assume that is the position of that particular gene. The mRNAs will align with the exons, and the sequence in between would be the introns. For mRNA sequences, the three genome browsers use RefSeq mRNAs. I'll talk about these a bit more later, but these are the messenger RNA reference sequences that Andy mentioned to you two weeks ago. They use other messenger RNAs from GenBank, protein sequences, ESTs, which are short cDNA sequence tags, and they also do some ab initio gene predictions. That is, they look at the structure of the DNA, look for open reading frames and splice junctions, and use that information to predict the locations of genes as well. Because the annotations are calculated independently by the three genome browsers, you may get different genes annotated. So I highly recommend that if you're looking in a complex region of the genome where you're not so sure that the gene annotation in the genome browser you're looking at is correct, you check out that region of the genome in an additional genome browser as well. And I'll show you how to do that. So just a brief overview of genome sequencing strategies. Eric covered this three weeks ago, so I don't want to belabor the point. But basically, there are two strategies for sequencing genomes, or I should say historically there have been two strategies for sequencing genomes: the clone-by-clone shotgun sequencing strategy, which was used by the NHGRI-funded Human Genome Project, and the whole genome shotgun sequencing strategy, which was used by Celera to sequence the Celera genome about seven or eight years ago now. In the clone-by-clone shotgun sequencing strategy, you first make a BAC map of the chromosomes.
That is, you take each chromosome, break it into smaller pieces, about a few hundred kilobases in size, clone those into BAC vectors, bacterial artificial chromosomes, and map those BACs along the chromosomes so you know exactly where on each chromosome each clone comes from. You then sequence the BACs by a process called shotgun sequencing, where you break up the BACs into smaller pieces, sequence those, and then reassemble those smaller sequences into the sequences of the BACs. Because you've already pre-mapped the BACs on the chromosome, you know where those lie, and you can end up with a complete sequence of a chromosome. In the other strategy, the whole genome shotgun sequencing strategy, you don't bother with this clone-by-clone mapping procedure. You just take the entire genome, shotgun it, or break it into smaller pieces as indicated here, and then write complicated computer algorithms to reassemble these pieces of sequence into the longer sequences of chromosomes. There was a lot of controversy when this method proposed by Celera was first used. How can you possibly take sequences that have come from all over the human genome and figure out what chromosomes they would go back into? But the procedure actually worked quite well. They ended up with decent human chromosome sequence with a couple of gaps, indicated down here at the bottom as the holes in the sequence, and the procedure worked so well that it has been used for other genomes since then. Most subsequent genomes have been done by this whole genome shotgun sequencing strategy. There are now newer strategies using next-gen sequencing technologies that I believe Elliott Margulies will touch on when he gives his next-gen sequencing talk next week. Regardless of how the genome sequence is assembled, the sequencing is usually done over the course of a couple of years. So there's a big push for sequencing in the beginning, but even over time, additional sequence keeps coming out.
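As a rough illustration of the reassembly step in shotgun sequencing described above, here is a toy sketch in Python. It assumes error-free reads from a single strand with no repeats, which real data never satisfies. The function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the example reads are all made up for illustration; this is not the algorithm Celera or any genome center actually used.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlaps remain.
    Toy model only: no sequencing errors, no repeats, one strand.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more; remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from the made-up sequence ATGGCGTACGTTAGC
contigs = greedy_assemble(["CGTACGTT", "ATGGCGTA", "CGTTAGC"])
print(contigs)
```

Reads that share no overlap stay as separate contigs, which loosely corresponds to the gaps mentioned in the talk.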
And the process of taking all these bits of sequence that are being produced and reassembling them back into chromosomes is a fairly laborious and time-consuming process, so it's only done every year or perhaps even less often. As a new genome becomes available, it would be assembled by either the NCBI, by the new consortium called the Genome Reference Consortium, or by the group that actually carried out the sequencing of that organism. As these new assemblies become available, they are distributed to the NCBI, Ensembl and Santa Cruz genome browsers, and those groups then work to get annotations available on those genomes and make the genome displays available on the genome browsers. These displays don't necessarily appear at the same time, so you may find that a new genome assembly comes up first at Santa Cruz, then perhaps at Ensembl, and finally third at NCBI. So when you're looking at data in different genome browsers, you need to make sure that you're actually looking at the same version of the genome assembly. Both Santa Cruz and Ensembl make available their pre-release genomes, that is, genomes that they're in the process of annotating, at the two URLs shown right here. Santa Cruz and Ensembl also make available older, archived versions of the genome assemblies, so if you want to go back and see either the assembly or the annotations that were available at a certain time, say a couple of years ago, you can always go back and see that. NCBI only has a limited archive available at this point, for the previous versions of the human and mouse genomes. Why would you want to go back to an older assembly? Two reasons. One is that although newer assemblies are based on additional sequence data, they're not always better.
Most of the time they're going to be better, but in the rare instance they could be worse in a very specific region, and you may prefer the assembly and the associated annotations that are available from an older version. The other reason is that if you are working on a particular region, say you're hunting for a gene between two markers, between two STS markers, which I'll talk about later, and you are working up the genes in that region, you'd like to be able to see the display of those genes and their associated annotations. If the assembly suddenly changes and the genes in the region change, you've lost your reference, and you can't go back and look at things graphically. And this was a big problem in the early days of the human genome assemblies. I know people working with us would be working on a human genome assembly from a certain date, and they'd gotten all the way through looking at their critical region, analyzing all the genes. All of a sudden you come to work one day, NCBI has updated their genome browser with the newer genome assembly, and you can't get back to what it used to look like. And at that point you've got to start your analysis all over again. I don't think that happens too much anymore, especially because the human genome is fairly stable. There's a new release that just came out about a year ago, but even so there are fairly minor changes at this point. But nevertheless it is nice to be able to go back, and I'll show you a very concrete example of that later in the talk when I go into Ensembl. So here's just a chart that I update periodically showing you five genomes I picked out, looking at their status in the different genome browsers. Human, mouse, dog, and rhesus macaque at this point are all displaying the same genome assembly version at the three genome browsers. Zebrafish is not. There's a version called Zv8, which is available at Santa Cruz and Ensembl. NCBI is still on Zv7.
So you might think that if you're looking at zebrafish data and you want the latest and greatest, you probably don't want to use NCBI for that. But I'm sure NCBI will convert to the newer version within the next few months. I will also point out that even if you're trying to do your homework and be good and make sure that you're looking at the same version of the genome assembly, it's not necessarily a trivial exercise, because the genome assemblies have different names at the different genome browsers. For example, the most recent human genome assembly can be called Build 37 at NCBI, and can be called GRCh37, Genome Reference Consortium Human 37, at Ensembl and Santa Cruz. Santa Cruz also dates their assemblies, and they also use a two-letter, two-number code. So you may also hear the most recent human genome assembly referred to as hg19, because that's Santa Cruz terminology. hg18 would be the previous version. So I had mentioned that a common way to annotate genes on a genome assembly is to use the NCBI mRNA reference sequences. Andy showed this slide two weeks ago, so again, I don't want to belabor the point. But basically a reference sequence is NCBI's attempt to have one accession number, one GenBank record, for each molecule of the central dogma. That is, they want one accession number for each mRNA, each protein and each genomic sequence from all the organisms for which such data are available. These reference sequences are for the most part copied out of existing GenBank records, re-annotated and put again into GenBank with a special type of accession number. You can recognize reference sequences because they always have an underscore in their accession number. So reference mRNAs that are curated have the prefix NM underscore and then a series of numbers. The corresponding protein translations would be NP and non-coding transcripts would be NR. There's a parallel accession series that starts with the letter X.
These are model sequences that come from the process of genome annotation. So these are not curated sequences. They are XM for mRNA, XP for protein and XR for RNA. This is only the tip of the iceberg for different types of NCBI reference sequences, and there's a much longer list at the URL shown at the bottom of the screen. I just wanted to briefly go through a GenBank flat file of a reference sequence. I presume most of you have seen GenBank flat files for other sequences. I just wanted to point out one or two things that are a bit different when you come to a reference sequence. So this is a reference sequence for the human beta-actin mRNA. I should point out that if you look in GenBank for human beta-actin mRNA sequences, I believe there are on the order of 15 or so different mRNA sequences. So what NCBI has done is they picked what they considered to be the best mRNA sequence, the one that the literature refers to the most frequently, the longest sequence, or the highest quality sequence, and that's the one they've deemed the reference sequence and resubmitted with this new accession number. Up here at the top we see the accession number, and accession numbers are always versioned, be that in a reference sequence or another type of GenBank record. The version tells you the version of the sequence. So the first time a GenBank entry comes out, it'll have an accession number like NM_001001.1. If the sequence gets updated, the version increases. You go to version 2, version 3, et cetera. Most sequences in GenBank are still on version 1. Most mRNA sequences are on version 1 because they don't get updated. In this particular case, this human beta-actin mRNA sequence has been updated and we're now on version 3. What NCBI provides on the reference sequences are a lot of different PubMed references. So these would be brought in by curators who actually sit and look at the RefSeq records.
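The accession conventions described above (a two-letter prefix, an underscore, a number, and a version after the dot) can be sketched in a few lines of Python. This is a minimal sketch covering only the six prefixes named in the talk; the function name `parse_refseq` and the returned field names are hypothetical, and the full list of RefSeq prefixes is longer, as the speaker notes.

```python
import re

# Only the prefixes mentioned in the talk; NCBI defines more.
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "NR": ("non-coding RNA", "curated"),
    "XM": ("mRNA", "model"),
    "XP": ("protein", "model"),
    "XR": ("non-coding RNA", "model"),
}

# Two capital letters, underscore, digits, optional ".version"
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_refseq(accession):
    """Split a RefSeq-style accession into its parts.

    Returns None if the string does not look like one of the six
    accession types covered here.
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, status = REFSEQ_PREFIXES[prefix]
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version is not None else None,
        "molecule": molecule,
        "status": status,
    }


# Illustrative accession in the format used in the talk
print(parse_refseq("NM_001001.1"))
```

Keeping the version as part of the identifier matters for the point made later in the talk: coordinates of annotated mutations are only stable against one specific version of a sequence.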
There's a whole long list of PubMed links on this accession number that I have not shown you here. Down below we have the comment block. And the comment block for the RefSeqs is useful because it tells you the status of the RefSeq. If a reference sequence has been analyzed by the NCBI staff and curated to some extent, it will be deemed reviewed. These are the highest quality reference sequences. Right here it says reviewed in capital letters. The other thing you might find useful is that it tells you what parent GenBank accession numbers this particular messenger RNA is derived from. So this particular RefSeq is coming from these two accession numbers shown here in blue. You get an executive summary about the mRNA, which is written by a member of the GenBank staff and can be quite useful because it tells you right up front what this particular mRNA is thought to be doing. Down below, like in a normal GenBank record, we have a link to the CDS or protein sequence. The protein sequence is right here. If you want to go to the GenBank protein record, there's a link here to the NP accession number. And then down at the very, very bottom of the record, we would have the nucleotide sequence itself. Sure. So the question is, do the mRNA sequences, or whatever sequences are in GenBank, improve as the version increases? I think the answer is yes, one would have to hope so. But again, this is a place where you would use your biological intuition. If there is some reason why you think that the older version is better, you might stick with the older version. So, just to give you a concrete example, if you were annotating mutations on a sequence and you have a list of mutations in a gene, you would probably always want to stay with the same version of the sequence, because otherwise the coordinates of your mutations might change over time and that would be very confusing. So each isoform of an mRNA would have its own RefSeq record.
So if there are 10 isoforms of a particular gene, you would get 10 separate reference sequences. Thank you for asking that. Yes. The GI number is another unique numerical identification number that NCBI provides on each sequence. It's a long string of numbers. As you can see right here, there's a GI on every nucleotide and every protein. The GI number also changes every time the sequence gets updated. But there's no pattern, as far as I can tell, although maybe there is some hidden code, as to how the GI numbers are assigned. So for example, the next version of the sequence will be version 4. But if the sequence gets updated, the next GI number will not end in 145, it'll end in something completely different. Anything else? Yes. On that particular one, hold that thought. I will try to answer it in a couple of different ways. If I haven't fully answered it by the end, I suggest you come up and ask me that question again. So the question was, how do you get all the exons for all the isoforms of a gene? OK, so I'm going to launch directly into the Santa Cruz genome browser. Now, I'll say I'm going to show you screenshots of what I think are the most important parts of what I'm talking about. If you're interested in following along, I highly suggest you take your handouts, go back to your computers, and try to reproduce these examples on your own to see the bits that I've left out. I'll also say somebody brought to my attention that while the black and white handouts look OK, the color handouts may not have printed completely. If that's the case, I apologize; I thought I had checked them. And I'll get newer versions of the color handouts posted later today. So what's shown here is the home page of the Santa Cruz genome browser. The URL was shown a couple of slides ago. Everything that you can do in the Santa Cruz genome browser is listed across the top and down the side. Don't worry. I'm not going to click on all these links today. I'm just showing you that they exist.
One thing I like about Santa Cruz is that right up front, they give you the news. They tell you what's going on. So if there's a new genome assembly released, that will be described here right up front. So you don't have to search to know whether there's a new human genome or new mouse genome. You can just read the news section. The most common thing you will do at Santa Cruz is up here at the top. You will click on the link where it says genome browser. And that takes you to a query page. This query page is where you obviously enter your query. There are a lot of different organisms and genomes available through Santa Cruz. You would select your clade, or the type of organism you're looking for. In this case, we're looking for human, which is a mammal. So we'll stick with mammal. Under the mammalian genomes available, I think I counted last night, and there are currently 14 mammalian genomes available through Santa Cruz. You would choose the one that you want. And you would also choose a genome assembly. So here for human, if you can see, there are four different genome assemblies available, the most recent one being this GRCh37. That is now the default human genome which is available through Santa Cruz. But I'm not going to be showing you that one today, just because it is new enough that there aren't that many annotations available for it right now. So I'll stick with the March 2006 assembly, which is what a lot of people have been using for years. And that has the richest set of annotations. This position or search term box is where you would type your query. There's a whole list of different queries down here. You can type a chromosomal region. You can type an accession number. You can even look for a word like kinase. I'm going to look for my favorite gene, called ADAM2. I'm just going to type its name into the position or search term box, press the submit button, and get back a list of results.
So the data at Santa Cruz are organized into different units, or different tracks. The first type of track that comes back is a track called UCSC genes. Santa Cruz has two major efforts to annotate genes on the human genome. The first one is this UCSC genes track. That is taking mRNA sequences out of RefSeq and out of GenBank. They're taking protein sequences out of a database called UniProt, and using all that information to predict the locations of genes in the human genome. The other track, which I actually prefer, is this RefSeq genes track. That is looking only at mRNA reference sequences from NCBI and using those to predict locations of genes. I actually prefer those because these reference sequences are at least partially curated, and I think the data set that comes out of here is a bit cleaner than the data you get out of the UCSC genes track. But that is my subjective opinion, and I'm sure other people would feel differently about this. If you look at the results that are returned, remember we typed in a query for ADAM2. You will get back not only the ADAM2 gene, but other members of the ADAM gene family that have the string ADAM2 in their names, so for example, ADAM20, ADAM21, ADAM22, they all come back. So you have to filter through these results to get the one that you want. I'm interested in following the ADAM2 gene in the RefSeq track, so that is the one that I will click on down here. And that will take me to the default browser view of the region surrounding the ADAM2 gene. Up at the top, we have an ideogram of human chromosome 8, which is where the ADAM2 gene has been mapped. This red box right here shows you the position that we're looking at down here in this graphical view at the bottom. The data are organized in tracks, or data sets, and each track is labeled at the top. It can be a bit confusing if you haven't looked at this before. So there is a scale that shows you what range of sequence you're looking at.
There is another coordinate system, so you know where on the genome you're looking. That information is also repeated up here if you can't read these numbers. And then we get into our data. The first set of blue lines, these three blue lines, those are the positions of the UCSC genes. So the UCSC genes track has three versions of the ADAM2 gene annotated on it, three different isoforms. I'll go back to those in just a second. The next track is the RefSeq genes track, so the name is here in the middle, and there's one blue line under that. So that one blue line represents the alignment of a single mRNA reference sequence to the genome. Under that we have other tracks, which I'm not gonna go into in great detail: human mRNAs from GenBank, human ESTs, comparative sequences down here that you may be interested in, and so on. If you are looking at one of the gene tracks, it's very important to know what you're actually looking at. So you see there's this horizontal line representing the gene, and on the line there are these tick marks. Those tick marks are actually the positions of the exons, and the horizontal line connecting them would be the location of the introns. If you've never looked at the structure of a gene before, you may be somewhat surprised that, at least in human, the introns are really big compared to the exons, which are actually very small. The other thing that's shown here is the direction that the gene is pointing. If you look closely, and I don't know if you can see it up on the screen, there are little arrowheads on the intron which are pointing from right to left. That's showing you the gene is actually pointing from right to left, and the five prime end of the gene is over here on the right side of your display. That may be a bit counterintuitive if you're used to looking at genes in a textbook. A typical textbook representation, of course, will have the gene going from left to right.
But in the genome, genes can be pointing in either direction; there's about a 50-50 mix. When you're looking at one of these browsers, you always need to be sure you pay attention to which direction the gene is pointing. If you click on any item in a track, you will get additional information about that item. So for example, if we click on one of the ADAM2 genes in the UCSC genes track, we will come to additional information about the ADAM2 gene. There's an index up here that tells you what information is provided on this page, but basically Santa Cruz is drawing in information from a lot of different internet sources and compiling it all in one place. So you get some text from UniProt telling you what this particular protein is doing. You get text up here from NCBI, from the RefSeq, telling you what the gene is doing. If you scroll down, you can, for example, get some microarray expression data showing this gene is very highly expressed in testes and not expressed in the other tissues which are listed here. You can get things like protein domains, like what Andy talked to you about last week. Further down, you can get structures that are available, links to sequence, all sorts of different things. So I recommend this as one place to go even if you're not interested in a genome browser view of a gene. It's a good compilation of information about that gene in one place. Are we okay? If you instead click on the RefSeq gene feature, you get a different page coming back, because this is an NCBI-derived sequence. You get links mainly back to NCBI resources, some of which I'll talk about later, up here. Down at the bottom, I wanna draw your attention to the links to sequence. In particular, I wanna get the genomic sequence of the ADAM2 gene. You would click on the genomic sequence link at the bottom and you come to a page where you can choose to get promoter sequence, exon sequence, intron sequence, what have you.
In this particular example, I'm showing you how to get promoter sequence, because I get people coming to me every so often who want to know more about how a particular gene is regulated, and to look at the regulation, they wanna download the promoter sequence of the gene and look for transcription factor binding sites. This is one way you can do it yourself. You click on the promoter box, select how much sequence you need, and you get back the sequence that you have asked for. In answer to the question I got earlier, if you wanted to get exonic sequence, this is one way that you can do it right here. You could click on the exons, although to preempt your question, this would just get you exons for one particular transcript, not exons for all transcripts of a given gene. So how do we navigate around the Santa Cruz genome browser? Well, up here at the top are a number of buttons that you can select. There are move buttons over here at the top. You can choose to move either to the left or to the right if you wanna see the genes or the other features that are near this gene. You can choose to zoom in so you can get a more detailed view of the ADAM2 gene, or you can zoom out to get a bigger perspective and see what's around it. Down here there are some additional buttons. I wanna call your attention to the one here called reverse, because what that's gonna do is flip the display around. I told you that this gene is pointing from right to left. If that bothers you, you can click the reverse button and that's gonna flip your gene around. So now it's pointing from left to right, if that's easier for you to look at. Now the five prime end of the gene is over here, like you might expect. I honestly find this view very confusing, because once you've reversed it, keep in mind you've reversed it for the whole chromosome, and I just find it really confusing to keep track of what you've done.
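The promoter download and the right-to-left orientation discussed above fit together: for a gene pointing from right to left, the upstream region sits at higher chromosome coordinates and has to be reverse-complemented to read five prime to three prime. Here is a minimal Python sketch of that idea. It assumes 0-based, half-open coordinates and a chromosome held as a plain string; the function names and the toy sequence are hypothetical, and this is not how the Santa Cruz browser itself is implemented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_interval(tx_start, tx_end, strand, upstream=1000):
    """Interval [start, end) covering `upstream` bases before the 5' end.

    Coordinates are 0-based, half-open, on the forward strand.
    """
    if strand == "+":
        return max(0, tx_start - upstream), tx_start
    if strand == "-":
        # 5' end of a minus-strand gene is its highest coordinate
        return tx_end, tx_end + upstream
    raise ValueError("strand must be '+' or '-'")


def promoter_sequence(chrom_seq, tx_start, tx_end, strand, upstream=1000):
    """Upstream sequence, oriented 5' to 3' relative to the gene."""
    start, end = promoter_interval(tx_start, tx_end, strand, upstream)
    seq = chrom_seq[start:end]
    return reverse_complement(seq) if strand == "-" else seq


# Toy 16-base "chromosome" with a made-up gene occupying positions 4..12
chrom = "ACGTCCCCGGGGTTAG"
print(promoter_sequence(chrom, 4, 12, "+", upstream=4))
print(promoter_sequence(chrom, 4, 12, "-", upstream=4))
```

The same gene coordinates give two different promoter regions depending on strand, which is why the talk stresses checking which way the gene is pointing before pulling sequence.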
The hint that you have reversed it comes over here: before, your gene symbols were over on the left side of the display. Now they are over on the right. So if you hit the reverse button, you need to pay attention to what the display looks like so that you know you're actually reversed. So I'm gonna get rid of that because I find it confusing. And I will show you what happens when you zoom out to see additional genes around ADAM2. So we click the zoom out 3X button and it'll look something like this. Your ADAM2 gene is now in the middle of the display and you can see the flanking genes. Downstream of ADAM2, we have the ADAM18 gene. Upstream of ADAM2, we have a gene called IDO1. If you want to zoom in on this display, as I said, you can click on these zoom in buttons up at the top, but I think a much better way is a new drag and zoom feature that Santa Cruz has implemented over the last, I'd say, year or year and a half or so. And what you would actually do there is like a Google Maps type view. You stick your mouse in the very top of the graphic, up here in the scale area, and you drag your mouse over the region that you want to zoom in on. And you get this purple box showing up. In this case, I'm zooming in on the five prime most exon of the IDO1 gene, and when you let go of your mouse, you've now zoomed in on that region as shown here. So here we've now zoomed in on the five prime end of the IDO1 gene. If you look at this exon carefully, you may think it looks a bit odd. It's not just one rectangle. There is a shorter part of the rectangle on the left and a taller part of the rectangle on the right. What that is telling you is the translation status of that particular exon. So in Santa Cruz terminology, exons are always rectangular boxes, but translated exons are shown as taller boxes. Untranslated exons are shown as shorter boxes.
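The tall-box and short-box drawing convention described above can be sketched as a small calculation: given an exon and the boundaries of the coding region, work out which part is untranslated ("thin") and which part is translated ("thick"). This is a minimal Python sketch under assumed 0-based, half-open coordinates; the function name `split_exon` and all the numbers are hypothetical, chosen only to illustrate the idea.

```python
def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into drawing segments.

    Returns a list of (start, end, kind) tuples, where kind is
    'thick' for translated sequence and 'thin' for untranslated.
    All intervals are 0-based and half-open.
    """
    thick_start = max(exon_start, cds_start)
    thick_end = min(exon_end, cds_end)
    if thick_start >= thick_end:
        # Exon lies entirely outside the coding region
        return [(exon_start, exon_end, "thin")]
    segments = []
    if exon_start < thick_start:
        segments.append((exon_start, thick_start, "thin"))
    segments.append((thick_start, thick_end, "thick"))
    if thick_end < exon_end:
        segments.append((thick_end, exon_end, "thin"))
    return segments


# Made-up transcript: coding region spans 150..1000
exons = [(100, 200), (300, 400), (900, 1100)]
for start, end in exons:
    print(split_exon(start, end, 150, 1000))
```

The first and last exons come back as a thin piece joined to a thick piece, while the internal exon is entirely thick.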
So you very often see what looks to me like a top hat flipped on its side for both the five prime most and the three prime most exons in a gene. And in the middle, you'll have only translated exons. So those exons will be just tall. So I've just shown you the default view of data at Santa Cruz. If you were to scroll down on the page, you'd see there are actually a lot of different sources of data available that you can view in the context of the genome browser. Santa Cruz itself doesn't actually generate most of these data. They have very active collaborations going with bioinformatics groups all over the world. Those bioinformatics groups submit their data to Santa Cruz, and Santa Cruz makes it available on the genome browser for everybody to see. The data are grouped into these organizational blocks. If you click on the plus button, you can expand a block and see what data are there. The tracks very often have rather confusing names. If you want to know more about what something is, you would click on the name of that track to get additional information. If you wanna add a track to the view, you would click on the pop-up menu here. You see by default a lot of these tracks are hidden, meaning they're not showing in that display up at the top. You can choose to see something in full mode, which means now you're gonna see the data and you're gonna see all of it. You could also choose to see it in some compressed mode like dense, squish, or pack, but normally you just wanna either hide the display or show it in full mode, and that will then bring in additional data. To refresh your display and see that new data, you click one of these refresh buttons. The reason you don't wanna have too many tracks open at the same time is that the graphic is gonna get really, really big. It's gonna be hard to see what's going on, and also the page is gonna take a long time to load.
So normally you just wanna turn on the tracks that you're actually interested in viewing. So I wanna look more at this Yale TFBS track, but I don't wanna just put it in full mode automatically. I wanna click on the name of it so we can see what's going on here. So this is actually data that's coming out of the ENCODE project that Eric Green talked to you about three weeks ago. And it's a group from Yale, UC Davis and Harvard that is looking for transcription factor binding sites that are being bound by transcription factors in a variety of cell types. They've done ChIP-seq experiments and they've submitted that data to Santa Cruz. For those of you who are not familiar with ChIP-seq data, I think that's something Elliott Margulies will talk about next week in his next-gen technology lecture, so I'm not gonna go into that in any detail today. But basically you're looking for transcription factor binding sites that are actually occupied in different cell types. Using the selection mechanisms down here, you can choose which of the many cell lines you want to display data for and also which of the many transcription factors you want to display data for. I'm just gonna leave the defaults checked, but you could certainly look at additional cell lines or add or subtract transcription factors. We change the display mode to full so we can actually see all the data, and refresh the display. So we're still on the five prime end of the IDO1 gene, and what you see here are the results of this ChIP-seq analysis. Each one of these groupings would be a particular transcription factor binding in a particular cell type. Where you actually get that transcription factor binding, you would see an increase in your histogram. This is showing your raw data right here. The significant peaks, or the significant binding regions, would be shown by this track up here called the peaks track.
You can see there's one peak right here shown in the gray bar that sort of overlaps with the five prime end of the IDO1 gene. So you could say, okay, it looks like there are NFKB binding sites which are occupied in the GM12878 cell line. Yes? [inaudible audience question] Right, so I should make it clear, this is actually experimental data. We're not just looking at a potential transcription factor binding site. We're actually looking at transcription factors which have been bound to the DNA using a ChIP-seq experimental approach. So these are actually binding sites that are occupied in a particular cell type. So there's only the one NFKB site here but this is looking at one particular cell line. If you were to look at NFKB in other cell lines, you might or might not see that binding site actually activated and that would reflect different transcription which is going on in different cell types. So something else that you can do on some of the Santa Cruz tracks is you could actually change the color of the displays to make the information a bit more useful to you. So for example, if we wanna get more information about SNPs, about variants that have mapped to the genome, we could click on the name of the SNP track and that will get us to the detailed view of the SNP track. Sort of down the page is where you can specify colors. So what I'm interested in highlighting are the positions of the synonymous and non-synonymous SNPs in a particular gene. Synonymous SNPs would be SNPs that map to a coding region of a gene which do not change the protein coding sequence. I want those guys colored green. A non-synonymous SNP would be a SNP that maps to a coding region and does change the sequence of the protein and I wanna color those red. Every other type of SNP I'm gonna color black as indicated down here. Change the display mode not to full but in this case to pack.
If you wanna see the color coding for the SNPs for some reason you gotta use pack mode. Full mode does not work. You click the submit button and you redraw your display. We're now back looking at the ADAM2 gene and down here I've expanded the SNP tracks. Each one of these little black tick marks with an accession number next to it would be the position of a SNP from NCBI's dbSNP. The synonymous SNPs would be colored green. There's one over here. There are three non-synonymous SNPs that would change the protein coding sequence of this gene and the remainder of the SNPs are other types of SNPs. So they would be intronic or they would be in untranslated exons, where they don't change the coding sequence. So something else that we can do at Santa Cruz is something called a BLAT search. Andy discussed BLAT two weeks ago when he talked about different types of sequence comparisons, and again I don't wanna go into it in great detail because he's already covered it, but basically BLAT is an algorithm developed by Santa Cruz and it's designed to do very fast sequence alignments against a genome. So you can start with either a nucleotide or a protein sequence and very, very quickly, in like a second, compare that sequence to a genome which is available at Santa Cruz. The downside to the searches being so fast is that they are not very sensitive, as we'll see in a minute. So this is to compare to BLAST. BLAST is a lot slower but it's a lot more sensitive. So let's try out a BLAT search at Santa Cruz and see what happens. I have a sort of difficult question which is I want to find the chicken homologue of a human protein. So I'm gonna go to NCBI. I'm going to get the reference protein sequence of my favorite protein, the ADAM2 protein. I would get that protein sequence, copy it, go to Santa Cruz. Up at the top of most Santa Cruz pages, but not the BLAT page, there would be a link called BLAT. You would click on that and get to this interface to BLAT.
You paste in your protein sequence. You choose your genome, in this case the chicken genome, and you choose the version of the assembly that you want. To get this very contrived example to work we can't use the most recent chicken genome assembly; we need to use an older one from February 2004. You would click on the submit button down here and very quickly get back your result. So basically you get four alignments between your human protein sequence and the chicken genome. Again, I don't want to belabor how BLAT does its scoring but suffice it to say a score of 44 for this top hit is better than a score of 12, which is what you see for the subsequent hits. If you want to see the hit in more detail you can click on the browser link over on the left and that brings up your BLAT search in the context of the chicken genome. So we have added a track here in blue which is called your sequence from BLAT search and you could see three blue boxes, that is three regions of alignment between the human protein and the chicken genome. Down below you see that there is a chicken EST or a chicken cDNA in this region so you can think, oh great, it looks like there's actually a chicken gene here and my human protein is aligning in three regions. When you think of how a protein aligns to a genomic sequence, it's not too surprising to see that protein split up because you would expect to see exons for the protein, so you might think that you have here three exons for the chicken protein. Not so fast. If you go back and look at your results a bit more carefully, you will see the start and the end, which give the position of the protein sequence that gets aligned to the genome. The starting part of the protein that gets aligned is residue 539, end is 600, it's only a 71 amino acid alignment out of a query size of 735 amino acids, so not a whole lot of your protein is actually getting aligned.
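The "not a whole lot of your protein" check can be made concrete with a quick calculation. The following is a minimal sketch, not part of the lecture; the function name `query_coverage` is hypothetical, and the two numbers are the alignment length and query size quoted in the talk.

```python
def query_coverage(aligned_length: int, query_length: int) -> float:
    """Return the fraction of the query sequence covered by an alignment."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return aligned_length / query_length


# Figures quoted in the talk: a 71-residue alignment, 735-residue query.
coverage = query_coverage(71, 735)
print(f"{coverage:.1%} of the query is aligned")
```

Under these figures less than a tenth of the human protein takes part in the alignment, which is one reason to be skeptical of the hit before looking at the alignment itself.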
And indeed if you click on the details button to get a pairwise sequence alignment of these proteins, you'll see the alignment is not very good at all. So up here is the sequence of your protein that you started with, this is human ADAM2. The blue sequences are regions that got aligned and the black sequences, the majority of the amino acids shown here, do not get aligned. Down here we have our chicken genomic sequence with the same thing, blue is aligned, black is not aligned, and down here we have a pairwise alignment. So in short, you are getting these short alignments between the human protein and the translated chicken genome but they really do not amount to much. I would certainly not call this a potential chicken homologue of the human protein. So I just wanna make two points here. One is just because you get a result in a genome browser or any sort of genomic website does not mean that that result is true. You always have to go back, use your biological intuition, analyze the results and decide whether or not you actually believe what you're getting. The second point is although BLAT is fast it is not very sensitive. It'll work really well if you took a human protein and BLATted that against the human genome, you'd get a great hit. But if you're trying to go so far afield as chicken to human, you're highly unlikely to get anything meaningful back. Yes? [audience question, partly inaudible] Yes, so the question is can you look for homologs across species and not find something? And yes, that certainly happens. Again, two reasons. One is that the genome sequence that you're searching in is not complete, so the sequence really exists but that region of the genome has not yet been sequenced. A more practical reason is not all genes are present in all organisms.
So there's certainly mammalian specific genes that you wouldn't find in a plant. So two more things we're gonna do at Santa Cruz. One is this add your own custom tracks. So what this is, say that you are doing a large scale analysis in your lab or even a small scale analysis and you wanna be able to see your genomic data in the context of other Santa Cruz data. So this is an example I'm showing you here. This is data from Francis Collins's lab from many years ago back when they were genotyping SNPs and they wanted to be able to see these SNPs displayed in the context of the human genome. You would format your data in such a way that each SNP or each thing you're annotating has a chromosome, a start and a stop position on the chromosome and a name. You would copy that information, go to Santa Cruz, and instead of issuing a query like we did before, click on the add custom tracks button. That brings you to a page where you would paste in your custom track, which has to be correctly formatted. Santa Cruz has a lot of information about how to format these tracks. There's a number of different formats which are all described with the hyperlinks above. Click the submit button. You get an overview of the tracks that we're uploading. Click on the genome browser button and we have our results. So from here down, from this UCSC genes track down, this is normal Santa Cruz data that you'd always see. Up here is your data. You have four tracks here, a black track, a green track and two blue tracks, which are showing you the data from your lab that you want to look at in the context of Santa Cruz. You can see where these SNPs are with respect to genes in the human genome.
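The four-field layout described above (chromosome, start, stop, name) corresponds to the simplest form of the BED format that Santa Cruz accepts for custom tracks. The following is a minimal sketch of producing such text; the function name, the track name, and every SNP name and coordinate are hypothetical placeholders, not data from the talk.

```python
def make_custom_track(name, description, features):
    """Build BED-style custom track text: a 'track' header line followed by
    one tab-separated line per feature (chrom, start, end, label).

    BED starts are 0-based and ends are exclusive, so a single-base SNP at
    1-based position p is written as start=p-1, end=p.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if start < 0 or end <= start:
            raise ValueError(f"bad interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines)


# Hypothetical SNPs, for illustration only.
example_snps = [
    ("chr8", 1000000, 1000001, "snp_A"),
    ("chr8", 1002500, 1002501, "snp_B"),
]
track_text = make_custom_track("labSNPs", "SNPs genotyped in our lab", example_snps)
print(track_text)
```

The printed text is what you would paste into the custom track box; the Santa Cruz documentation describes the additional optional columns and track-line settings.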
The way I've submitted this data, it would be available only to you, but there are ways to make your data available at Santa Cruz where you can share it with your collaborators or you can make it publicly available and have it as a companion to a manuscript if you want to be able to show your data in Santa Cruz that way. And that information is all available on the Santa Cruz website or you can always contact me and I can help you through that. Something else that you can do at Santa Cruz is called the Santa Cruz Table Browser. So the Table Browser allows you access into the genome data, not as a graphic as we've been looking at, but actually as text. And there are a lot of different queries you can do here. I'm not gonna go into anything in detail. I've listed a couple here, but it basically allows you as a bench biologist to become somewhat of a programmer if you can actually issue the right queries and get Santa Cruz to return the information that you want. So I'm gonna show a very simple example which is just to find a list of all RefSeq genes that have a single exon. So for example, I would go to the Table Browser. I would select the RefSeq genes track because that's where my genes are. Create a filter, which opens up a page that looks like this. There's a lot of things you can filter on which I'm not gonna go into. But in short, I wanna find genes where the exon count equals one. They have a single exon. If I press the submit button and nicely format the results in Excel so you can actually read them, here's what I get back. There would be a list of all the human RefSeq genes that contain one exon. There's a column here called exon count. That should be one in all cases. You get the accession number of the gene, the name of the gene, and the chromosomal coordinates. So again, if you're not a programmer, you can come in here and very quickly extract relevant information.
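For readers who do want to script it, the same exon-count filter can be applied to a tab-delimited table saved from the Table Browser. This is a minimal sketch under stated assumptions: the header row and the column name `exonCount` are assumed to match the downloaded file, and the sample rows (accessions, gene names, coordinates) are invented placeholders.

```python
import csv
import io

# Hypothetical sample table; real output would come from a saved download.
SAMPLE_TABLE = (
    "name\tchrom\ttxStart\ttxEnd\texonCount\tname2\n"
    "NM_EXAMPLE1\tchr1\t1000\t2000\t1\tGENE_A\n"
    "NM_EXAMPLE2\tchr2\t5000\t9000\t4\tGENE_B\n"
    "NM_EXAMPLE3\tchrX\t300\t800\t1\tGENE_C\n"
)


def single_exon_genes(table_text):
    """Return the rows of a tab-delimited gene table whose exonCount is 1."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader if int(row["exonCount"]) == 1]


for row in single_exon_genes(SAMPLE_TABLE):
    print(row["name"], row["name2"], row["chrom"], row["txStart"], row["txEnd"])
```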
And there's a lot of help on the Santa Cruz website to show you how to do this yourself. Any more questions on Santa Cruz before I keep going to Ensembl? So now that you're familiar with the genome browser, I'm not gonna go through in quite as much detail as I did with Santa Cruz. I'm gonna show you analogous features at Ensembl and then try to show you some of the more interesting displays that Ensembl makes available. I should take this point to say that I think what most people use, at least in my experience, is the Santa Cruz genome browser. It's the easiest one to use. It's very intuitive. You don't need to spend a lot of time figuring out how to use it. You can just go ahead, jump in and get the information that you want. I've always found Ensembl to be a bit more of a challenge, but now that I've been teaching it enough, I've gotten more used to it. It's still not the browser that I would go to as my first choice, but there is a lot of information there that's very useful to certain people. I will also say that Ensembl has a very different way of predicting genes than Santa Cruz does. As I showed you, most of Santa Cruz's genes are coming from the alignment of mRNA and protein sequences to the genome, so the UCSC genes track and the RefSeq genes track are all coming from mRNA alignments. Ensembl has a much more complex, robust gene annotation pipeline which uses sequence from that organism, also brings in homologous sequences from other organisms, and uses all that to predict genes. So for example, I was doing a project a few years ago where I wanted to get a good set of genes from the monkey, from rhesus macaque. If you went to Santa Cruz, there are actually very few mRNAs available from rhesus, so you don't get a whole lot of genes out of Santa Cruz if you're looking at the rhesus genome.
At Ensembl, where they're using human genes to predict the location of genes in the monkey, you get a much longer list of genes, which is probably much closer to the actual gene list that you would get in monkeys. So I would use Ensembl for organisms which don't have as much mRNA sequence available. So the question we're going to ask here is we're going to try to find genes that overlap with an oligo tag. So say for example you've done a Solexa experiment, you've ended up with these short sequence tags and you want to know where they map in the genome and you don't have a bioinformatics group to help you. What would you do? You could come to the Ensembl webpage, where you can do lots of queries. We're going to ignore that and we're going to click on the BLAST link up here at the top, which is going to give us the interface to Ensembl's BLAST or BLAT. Paste in your sequence in the query box. The sequence is available at the URL shown here in red if you want to reproduce this example on your own. We would select the human genome and then we select the type of search that we want to do. By default Ensembl will do a BLAT search just like at Santa Cruz. I had said a couple of minutes ago BLAT is not good for going cross-species. I will add to that BLAT is also not good for short sequence alignments. One of the ways they make the search faster is to use a longer word size, which gets you longer sequence alignments by default. BLAT probably is not going to find an alignment of this short 20 nucleotide sequence that we're pasting in here. I'm going to suggest BLASTN, nucleotide BLAST, for my query, and then for your search sensitivity you can actually allow Ensembl to set parameters for you. So I'm going to say I want a near exact match to an oligo. So this is going to optimize the search for a short input sequence.
So you can sort of cheat and not worry so much about the BLAST parameters Andy talked about two weeks ago but come here and let Ensembl choose those parameters for you. Your search results come back after a few minutes looking something like this. We have an ideogram of the human genome showing you two hits, one on chromosome eight, one on chromosome 15, indicated by the arrows, and the results are listed down at the bottom of the page. I have just highlighted the columns that I think are interesting. For some reason Ensembl likes to repeat information across the page. I could not for the life of me get rid of this information the other day. Sometimes you can, sometimes you can't. I couldn't get rid of it last time. So I've highlighted what I think is important. There are two alignments. In one case the alignment starts at nucleotide one of your short sequence and ends at nucleotide 20. That's a good sign. So the alignment spans the entire length of your query. The percent ID is 100. So that means that it's aligning at 100% identity over the entire query. The second alignment down here only goes from nucleotide one to nucleotide 17. So you're not getting the whole region aligned. So probably what you want is this first hit. There are a number of links that you can click on over here. I'm gonna choose the one called C, which goes to what used to be called the ContigView, which looks something like this. So this should now look to you a bit like Santa Cruz but with more color. So the data is divided into a couple of different sections. Up here at the top we have an overview of human chromosome 15. This red box is showing you the region of interest. The region in the red box is expanded in the next view down here where you see a couple of genes indicated by these red lines. The position of your BLAST hit is right around in here, which is at the five prime end of the TCF12 gene.
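The reasoning used to pick the first hit (the alignment covers the whole 20-nucleotide query at 100 percent identity) can be written down as a small filter. This is an illustrative sketch only: the `Hit` class and function name are hypothetical, and the percent identity of the chromosome 8 hit is an assumed value, since the talk gives only its coordinates.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chrom: str
    query_start: int         # 1-based start of the alignment on the query
    query_end: int           # 1-based end of the alignment on the query
    percent_identity: float


def full_length_exact(hits, query_length):
    """Keep hits that span the entire query with 100% identity."""
    return [
        h for h in hits
        if h.query_start == 1
        and h.query_end == query_length
        and h.percent_identity == 100.0
    ]


hits = [
    Hit("15", 1, 20, 100.0),  # spans the whole oligo, as described in the talk
    Hit("8", 1, 17, 100.0),   # identity assumed; alignment stops at nucleotide 17
]
best = full_length_exact(hits, query_length=20)
print([h.chrom for h in best])
```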
This region is expanded further down below and now we have an actual track called BLAT/BLAST hit. So right here is where your 20 nucleotide sequence aligned and that's at the very beginning of the TCF12 gene. Like at Santa Cruz, Ensembl shows you translated and untranslated exons, although in this case a translated exon is a solid box and an untranslated exon is an unfilled box. The other thing I find very confusing here but I will point out is how do you know which way the gene is going? Well, genes that are above this blue line right here are pointing from left to right. Genes which are below the blue line indicated here are pointing in the other direction. So TCF12 is pointing as you would expect from a textbook, from left to right, and our BLAST hit is near the five prime end of that gene. The data at Ensembl are organized in a series of tabs and these tabs are shown up here across the top. Right now since we haven't done very much the only tab that we see is the location tab. So I will walk you through just a small number of things you can do with the location tab. There are many different views you can get of the data over here. The one I want to focus on is this one that says configure this page. This would be how you'd add additional tracks to the Ensembl genome browser. So you click on that. This list over here shows the types of tracks that you can add. Just like at Santa Cruz, they are grouped together into categories. I am looking at the variation features category. There are 19 different variation features which I can add to this display. I honestly don't know what a lot of these things are but you can find out by reading the Ensembl documentation. If you want to add one of these variation tracks or variation features to the display you would click on the little box next to its name and change the toggle from off to normal. And you get a little block logo filling in the box.
I'm clicking on all variations, which is gonna allow me to see all the SNPs annotated in this region of the genome. Click save and close up at the top and that will redraw my display. And if you have really good eyes you may notice that there's a new track added down here called all variations and it has these blue tick marks in it. The blue tick marks are really hard to see in my opinion but those would represent the positions of the SNPs that have been annotated in this region of the genome, and they are color coded depending on what type of SNP they are. And I'll explore one of those in more detail in just a minute. If you want to navigate around this view the navigation buttons are over here. Like at Santa Cruz we can go to the left, we can go to the right. Sorry, the mouse doesn't work too well. You can go to the right, you can zoom in or you can zoom out. We are going to move over to the right to see a bit more into the TCF12 gene. And I've moved over here so I'm now looking at a different exon of the TCF12 gene and down here again we can see our variations. I apologize they're very hard to see but there's nothing I can change there. If you want more information about the variant you would click on its name. That brings up this pop-up box which gives you more information about the variant as well as a link to the variation properties. So let me select that so we can see this in a bit more detail. Here's an overview of the variant showing you the variant itself. Right here it's indicated as an R, and R is an ambiguity code which corresponds to either an A or a G. So there's this and then there's flanking sequence surrounding that SNP as available from dbSNP. Over on the left are a number of different things we can look at about this particular variant. So we can find out population genetics, individual genotypes.
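The R shown for this variant is one of the standard IUPAC nucleotide ambiguity codes. As a small illustration (the dictionary and function names are just for this sketch), the codes can be expanded into the bases they stand for:

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each represents.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_code(code: str) -> set:
    """Return the set of bases represented by a single IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


print(sorted(expand_code("R")))
```

So a SNP displayed as R inside its flanking sequence is an A/G variant, as described above.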
I just want to look briefly at the context of the variant, which gives you a nice view showing me all the variants in this region of the gene, and they're color coded depending on what type of SNP they are. These blue SNPs right here are, sorry, the gray SNPs are intronic, the light blue SNPs are upstream, and our single green SNP is right here. If you look at the figure legend down here you will see a green SNP is a synonymous coding SNP, the same color scheme that we had at Santa Cruz. If you want to explore other options we could look for example at the gene tab. So we've explored a bit the location tab, the variation tab. If you want to explore the gene tab you could go back to the location tab, click on one of these transcripts right here for the TCF12 gene and you pop up a menu that shows you all the places you can go from there. You can see more information about the transcript, about the gene, or about the protein. We're gonna go get more information about the TCF12 gene itself. So here we see there are actually five alternatively spliced forms of the TCF12 gene. Each one gets its own transcript identifier and its own protein identifier. Ensembl has a very strange naming system, well, I shouldn't say strange, I'll point out the naming system for the Ensembl items without giving you any opinions about it. They always start with the letters ENS, which stands for Ensembl. If you're a transcript you get the letter T for transcript and then a bunch of numbers. If you're a protein you get the letter P. If you're a gene you get the letter G. So we are looking at a gene which has an accession number that starts with ENSG. Here are the five transcripts in tabular form and here's a schematic showing you four of those transcripts. Each of these tick marks would be the position of an exon and the introns are shown as diagonal lines in between them.
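The naming scheme just described (ENS, then a letter for gene, transcript, or protein, then digits) is regular enough to parse. This is a minimal sketch, not an official Ensembl tool; it also allows an optional species code between ENS and the type letter, as seen in non-human identifiers, and the example identifiers below are made-up placeholders.

```python
import re

# ENS, an optional species code, one type letter, then digits.
_ENSEMBL_ID = re.compile(r"^ENS([A-Z]*)([GTP])(\d+)$")
_KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def classify_ensembl_id(identifier: str):
    """Return (kind, species_code) for an Ensembl-style ID, or None if the
    identifier does not match the pattern. species_code is '' for human."""
    match = _ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    species_code, type_letter, _digits = match.groups()
    return _KINDS[type_letter], species_code


# Placeholder identifiers, for illustration only.
for example in ("ENSG00000000001", "ENST00000000001", "ENSMMUG00000000001"):
    print(example, classify_ensembl_id(example))
```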
And you can see these first two transcripts look like they're fairly similar although they actually do have different splicing patterns. This fourth transcript right here is quite a bit shorter and has a different splicing pattern. If you want to get more information about the transcripts, there are a variety of things you can do along the side. Again, I'm just gonna highlight a couple of them. One is to go to the orthologs for this gene. So Ensembl makes available, as I said, genome browsers from a lot of different organisms. For each of those organisms they predict genes and they predict proteins. They then do these massive comparisons among all the proteins and all the genomes that they've annotated, looking for best hits and therefore potential orthologs among these different organisms. So, if we are starting from the TCF12 gene in human, clicking on the ortholog link gets us a long list of potential orthologs in a number of different species. For example, you can go to alpaca, you can go to lizard, you can go to armadillo, you can go to C. elegans. Orthologs are potentially available for all these different organisms. Again, these are predictions, these are calculations. They may or may not be accurate. Nobody sat down and actually looked at these things. So, before you go too far with any sort of ortholog prediction you would probably want to look at the sequence alignment and make sure you believe that this is actually a reasonable guess as to the ortholog of the gene. Another view that I sort of like is this one called variation image because this provides a lot of different information. As before, we have our five transcripts for TCF12 up here and they are shown in a schematic, each exon being shown as a maroon box, and the exons are then expanded down below. Right here, we are looking at the exon view for one particular transcript. Every place you see a maroon box would be a position of an exon.
If you see a line like this, just a dash, that means that is an exon in one of the TCF12 transcripts but that exon is not actually present in this particular transcript that we are looking at right here. This exon right over here indicated by a dash is also not present in this transcript. The SNPs in the region are also indicated, color coded like before: blue is intron, green is a synonymous SNP, and yellow is a new color that would be a non-synonymous SNP. So, you can see the positions of these SNPs compared to the exons, and further down you can get the protein domains available in this particular transcript, and you can see those on an exon level as well. If you were to scroll down the page, you would see similar information repeated for the other four TCF12 transcripts. Additional information at Ensembl is available through the transcript tab. So, we've gone through the location tab, we've gone through the gene tab, we've gone through the variation tab, now we're on the transcript tab looking at information about one particular transcript. Here's a summary of the transcript showing the exon and intron structure. You might decide you want to get supporting evidence, that is, what GenBank accession numbers went into supporting this particular transcript. That is shown down here. So, here again in maroon, we have the exons in this particular transcript and here in this larger figure we have a whole list of GenBank accession numbers, GenBank mRNA sequences, showing you what exons these different GenBank sequences contain. So, if you're interested in knowing, for example, for this particular exon, what other GenBank sequences contain that exon, you can look right down here and see them all listed with their accession numbers. This is useful if there's an exon that you're a bit suspicious about.
So, for example, there's not a good example right here, but say there were one exon that was only supported by one additional GenBank mRNA sequence. You might think that's not necessarily a great exon prediction. It might be a really rare exon that you only see once in GenBank or it might be a spurious exon that's some cloning or sequencing artifact that shouldn't be there. Something else you might want to do is to get the protein sequence of your transcript. You can click under sequence, where it says protein, and that will jump you to the protein sequence. Nothing too unusual here. Although one nice feature is the protein is actually color coded by exons. So alternating exons are in different colors. You've got the first exon in black, the second exon in blue, the third exon in black, and so on. So you can actually see where the exons map to your protein. The red letters would show you codons that span an exon boundary. So that's a particular amino acid that's coded partially by one exon, partially by another exon. Now, if you look at the figure legend down here, you will see that you're supposed to see some sequences colored in yellow and others colored in green. Those should be the positions of your yellow non-synonymous SNPs and your green synonymous SNPs. But you will notice in this view, they are not colored. I think that is a bug in Ensembl. But remember, I told you you can always get to a previous version of the display at Ensembl. So I thought this was a good opportunity to show you that. Down here, I can actually view this page in their archive site, right down here at the bottom. I click on that, and there's a whole list of different Ensembl versions in which you can see this particular view, going all the way back to October of 2004. I picked the Ensembl 52 version from December 2008, so about a year ago.
If we look at the protein sequence in that view, you can see here it is indeed color coded, and we have our SNPs colored green and yellow. Again, this is a somewhat contrived example, but I wanna make the point that Ensembl does make available all of its older annotations and assemblies for a long range of dates. You can always go back and see exactly what you were looking at, exactly what you would have seen at a particular time in history. So remember, about a half an hour ago or so, we did a BLAT search at Santa Cruz, looking for the chicken homologue of a human protein. And BLAT didn't work too well because BLAT wasn't very sensitive. Let's do this again at Ensembl, but instead of doing a BLAT search, we're gonna do a BLAST search this time. So we take our human protein sequence, paste it into the BLAST search box. You need to tell the Ensembl BLAST interface this is a peptide query, so it knows it's a protein. We're gonna go against the Gallus gallus genome, which is the chicken genome, with a TBLASTN search, which is a translating BLAST, run the program, and we get back a long list of hits. What you wanna concentrate on are the BLAST scores and the BLAST E-values in this column over here. And you'll notice the top three hits really stand out: the scores are all above a thousand, and the E-values are quite low. Those are probably some decent looking hits. And indeed, if you click on the A link, which takes you to the alignment, you can see the alignment of one of these hits. On top, we have our query human protein sequence. Down below, we have our chicken genome sequence translated in all six reading frames, and this is just a normal BLAST report like Andy showed you two weeks ago. The letters show you positions of sequence identity, the pluses show you sequence conservation.
And you can see there's a long stretch of sequence conservation, not at a very high level, but just about what you'd expect for a chicken to human protein alignment. So I would say BLAST is actually capable of finding the chicken homolog of this human protein. One other thing I want to show you, very briefly, at Ensembl is something called BioMart. So BioMart is sort of like the Santa Cruz Table Browser in that it allows you to get into the details of the data available at Ensembl without actually being a programmer. What I find it very useful for is to cross-reference identifiers between different annotation sources. So for example, you can correlate Ensembl annotations with NCBI annotations without having to download data from Ensembl and NCBI and write scripts to correlate them together. So how do you do that? You would start out by choosing a data set. In this case, I'm going to paste in some rhesus macaque Ensembl identifiers and get some information out about them. So I choose the database; it's not called rhesus macaque, it's called Macaca mulatta. The data that you're inputting is sort of confusingly called a filter. So you would click on this filters button and that opens up the place where you would type in your input data; you could also type in a region. In this case, I put in Ensembl gene identifiers from the rhesus macaque. And then I click on the attributes; the attributes are gonna be the output data that you want. I'm going to get a variety of annotations from Ensembl: I'm gonna get the gene ID, transcript ID, chromosome position, and most importantly the gene name, so I can actually have a real name rather than ENSMMUG and some number. And I'm also gonna try to cross-link to the NCBI RefSeqs where that data is available.
If I press the results button, I get back a nice table showing you, for a given Ensembl gene ID, all the available transcripts, so one gene can have multiple transcripts as we've already seen. Here are links, where available, to the corresponding NCBI RefSeq, the Ensembl position, and then a name of the gene. So again, a great way to correlate between an Ensembl identifier and an NCBI identifier without actually having to do any programming yourself. Something else you might wanna do, starting with that same list of monkey identifiers: you might wanna get orthologs. So say for example you wanna get the human orthologs of those monkey identifiers. You would click the homologs radio button up here, select human information under the human ortholog section, and you get back your human Ensembl gene IDs to go along with the macaque gene IDs. Yes? So the question is, does Ensembl allow a batch analysis? The answer is yes. I just pasted in my gene identifiers; you can also upload a file. So the final thing I'm gonna do is go to NCBI and very briefly, in the 15 or so minutes that we have remaining, show you how to do some queries at NCBI. The example I'm gonna use here is to find a genomic region between two SNPs. So say for example you've done a genome-wide association study, you've narrowed down your haplotype block to the region between two SNPs, and you're wondering what genes are in that region that could be causing the phenotype that you've been studying. So you would go to the NCBI homepage. At the bottom of the page is a link to the Map Viewer down here; we click on that and you get to the query page for the Map Viewer. There are a lot of different organisms available here. I should say that NCBI makes available map viewers for organisms that Santa Cruz and Ensembl do not currently have.
So for example, most of these fungi are available only through NCBI, and the protozoa and plants, I believe, are NCBI specific as well. You can either choose your organism and query directly on that organism's page, or you can come up here in the search box, select your organism, and type in your query terms. I have two SNPs indicated by their accession numbers and I'm separating them by the word OR in capital letters, and OR is a Boolean operator that will return the hits for both accession numbers in the result. It's a bit counterintuitive; you might expect an AND, but you actually need to use an OR. You do your query and you end up with a hit shown here on chromosome eight, which contains both of your SNP markers. NCBI for human actually makes available three different genomes by default. So there's the reference genome that we've seen at Ensembl and Santa Cruz, and that's the one that was produced by the NIH funded genome sequencing project. NCBI also makes available these two other genome assemblies. One is the Celera genome assembly. So this is a genome assembly that Celera published back in 2003 that they used to sell to individuals and companies for lots and lots of money. It's now available for free right here at NCBI. And the final one is this cryptically named HuRef, which, if you Google it, you will find out is actually the genome sequence of Craig Venter, who was the founder of Celera Genomics. He's had his diploid genome sequenced and made available in GenBank. You can actually explore his genome in all its glory, with all of its variations, right here. We are going to, however, stick with the reference genome assembly, which as I said is the same one that you would see at Ensembl or Santa Cruz. Click on the button that says all matches to see all the hits to both of these SNP markers, and you get this view coming back here.
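One way to see why the query needs OR rather than AND is to model each marker as its own record carrying a single accession. The toy Python model below is an assumption made for illustration, not a description of NCBI's search engine, and the accessions `rsAAA`, `rsBBB`, and `rsCCC` are placeholders.

```python
def search(index, terms, operator):
    """Return the names of records matching the query terms under OR or AND."""
    if operator == "OR":
        # A record matches if it carries at least one of the terms.
        return sorted(name for name, ids in index.items() if ids & terms)
    if operator == "AND":
        # A record matches only if it carries every term.
        return sorted(name for name, ids in index.items() if terms <= ids)
    raise ValueError(f"unsupported operator: {operator}")


if __name__ == "__main__":
    # Each SNP record holds exactly one accession (placeholder values).
    index = {
        "marker_1": {"rsAAA"},
        "marker_2": {"rsBBB"},
        "marker_3": {"rsCCC"},
    }
    query = {"rsAAA", "rsBBB"}
    print(search(index, query, "OR"))   # both markers are returned
    print(search(index, query, "AND"))  # no single record holds both accessions
```

Under this model, AND returns nothing because no single SNP record carries two accessions, while OR returns both markers, which can then be displayed together on the chromosome.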
So you'll notice at NCBI the data are organized in vertical tracks instead of in horizontal tracks, and we have three tracks showing up. The variation track will show you the positions of SNPs. The genes on sequence track shows you the positions of genes, and this Hs UniG track shows you the positions of UniGene clusters, or EST clusters, that have mapped to the genome. For some reason, when you do a query at NCBI and you tell it to look for two items, whether it be two SNPs, two STSs, two genes, whatever you have, it gives you the region not only containing those two markers but also some flanking region. So you can see our SNPs are highlighted here in pink and we have some additional sequence above and below that. If you wanna eliminate that, you can come over here where it says region shown. Type in your two SNP markers, one in each box, press the go button, and when the display is redrawn you see only the region between these two SNPs here in pink. And you can see if you look at the gene track, the gene name is cut off here, but there's one, two, three, four genes in this region. So if you were looking at a candidate region for the gene causing a particular phenotype, you might think it comes from one of these four genes. NCBI has an interesting way of doing their annotations, which is that the map over on the right has the most information associated with it. So right now the variation map is over on the right and it has additional information associated with it, which I will go into in a bit more detail later on. The genes on sequence track, which is the one I wanna focus on, doesn't have much information here besides the gene symbols. So how do we change that? If you click on this link here that says maps and options, that's how you control your NCBI maps. That opens up this maps and options box. Over on the left here are all the different maps you can add to the view.
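Finding the candidate genes between two SNPs comes down to an interval overlap test. Here is a minimal Python sketch under stated assumptions: all coordinates and gene names are invented toy values on a single chromosome, not the real chromosome eight annotation, and a gene counts as "in the region" if any part of it overlaps the span between the two markers.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_between(snp_a, snp_b, genes):
    """Return names of genes overlapping the region bounded by two SNP positions."""
    lo, hi = sorted((snp_a, snp_b))
    # Two closed intervals overlap when each starts before the other ends.
    return [g.name for g in genes if g.start <= hi and g.end >= lo]


if __name__ == "__main__":
    # Hypothetical annotation on one chromosome.
    annotation = [
        Gene("GENE_A", 1_000, 5_000),
        Gene("GENE_B", 12_000, 18_000),
        Gene("GENE_C", 19_500, 26_000),   # runs past the second SNP
        Gene("GENE_D", 40_000, 45_000),
    ]
    print(genes_between(25_000, 10_000, annotation))
```

Sorting the two positions first means the markers can be given in either order, just as the two boxes in the region shown control accept them.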
They are described over here in the human maps help link. You can choose to add any of these maps to the view by clicking on the map name and clicking the add button. If you don't like a particular map which is in the maps displayed column, you would click on that map name and remove it. For example, I've removed the UniGene map because I don't find that overly useful. The other thing I've done here is I've made the gene map the master map by clicking this button here. So the master map means the gene map is gonna be moved on over to the right side of the display so I can see more details, as I have here. So the variation and the gene maps have swapped positions, the variation map is now somewhat compressed, and I have all these extra hyperlinks coming off the gene map which allow you to explore different features about this gene at NCBI. There are a lot of links here. I'm only gonna go through a couple of them. The one that I find the most useful is this link to NCBI's Entrez Gene. So Entrez Gene is a good compilation of information about a particular gene. It gives you the executive summary about what this gene is doing. It gives you links to a lot of other NCBI sources of data about that particular gene. But what I come to it for the most often is down here, where you get links to sequence. So you get a link to the mRNA reference sequence and the corresponding protein translation. If there were more than one isoform of this gene, you would have links here to all the different isoforms, all the different NM accessions for that particular gene. If you didn't want to go with the RefSeq, if you just wanted to go with the standard GenBank mRNA, those are linked down here at the bottom. So I find Entrez Gene to be a great place to go to if you just want sequence information. Another link that comes off the Map Viewer is a link to OMIM. OMIM is NCBI's Online Mendelian Inheritance in Man. This is actually data which is compiled by a group at Johns Hopkins University.
They read the literature looking for relevant medical information to bring in here. If there's any phenotypic information about mutations in a particular gene, that will come into OMIM. If there are any new allelic variants that cause disease, that will be brought into OMIM as well. This is a somewhat boring entry in that there's information about the cloning of the gene and the function of the gene. There isn't actually any disease-causing information about this particular gene, but if available, that would be here. But nevertheless, there's a nice compilation of information here which would save you from going off and having to read all these references, which are linked here, on your own. Another thing NCBI makes available is a resource called HomoloGene. This is another pre-calculated homolog information source. So like we looked at the orthologs at Ensembl, here we're looking at homologs at NCBI. It's a much smaller list of potential organisms, but again you get homologs in different species. You also can see here their domain organization. So all of these proteins, these BRF2 proteins in different species, they all have a zinc binding domain, shown here as this blue box. So going back to the genome browser, how do you navigate around? This is probably fairly obvious to you at this point, since this is the third genome browser we're looking at. You can zoom in and out over here in the zooming triangle. You can also click on a position that you wanna zoom in on and it brings you this zoom control where you can more finely tune your zooming. You can choose to zoom in by a certain factor, two, four, or eight fold, or you can choose to display only a certain region of sequence. What I wanna do for this example is show 10 kilobases around the BRF2 gene, so I would click on where it says show 10K, and now we're zooming in on BRF2. So we've got a gene just like we had at Ensembl.
The solid boxes represent translated exons; unfilled boxes, like up here at the beginning and down here at the end, would represent the untranslated exons. There is a very small arrow right here to the right of the gene symbol which shows you the direction of transcription of the gene. This particular guy is going from the bottom of the screen up, but other genes, like the one up here, can go from the top of the screen down. Remember I talked to you just a minute or two ago about the master map. Right now the genes on sequence map is the master map and you have all this additional information available. Say you wanted to get additional information about the variation map. You need to make that the master. There are two ways of doing that. One is to go back to the maps and options box like we saw before and change the display there. A faster option is next to a map name up here: there's an arrow button. If you click that arrow, that particular map is gonna pop to the right and become the master, and there's gonna be a lot more information available about it. So you see the gene map has now moved over to the left. The variation map is now over here on the right and we can see each of our SNPs as a hyperlink to NCBI's dbSNP. If the SNP has been validated, it will get an asterisk next to it, and if it's been genotyped, it gets this T next to it. If you want any more information about any of these SNPs, you would click on the link, which will take you to dbSNP, as I'll show you in just a minute. The one other thing I wanted to call your attention to are these colored L, T and C letters adjacent to each SNP. L is for locus, T is transcript and C is coding, and what that means is if you have an L, it means the SNP is in a locus. That means it's in the region of a gene. If you have a T, it means the SNP is in a transcript.
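The L, T and C flags can be thought of as three nested position tests. The Python sketch below is one possible reading, under these assumptions: "in a locus" means inside the gene's overall span, "in a transcript" means inside an exon, and "coding" means inside an exon and within the translated range. The coordinates are invented, and this is not NCBI's actual annotation logic.

```python
def snp_flags(pos, locus, exons, cds):
    """Return which of the L, T, C flags apply to a SNP at position pos.

    locus: (start, end) of the gene region
    exons: list of (start, end) exon intervals
    cds:   (start, end) of the translated range within the transcript
    """
    in_locus = locus[0] <= pos <= locus[1]
    in_transcript = any(start <= pos <= end for start, end in exons)
    in_coding = in_transcript and cds[0] <= pos <= cds[1]
    flags = (("L", in_locus), ("T", in_transcript), ("C", in_coding))
    return "".join(letter for letter, on in flags if on)


if __name__ == "__main__":
    # Hypothetical two-exon gene with untranslated regions at both ends.
    locus = (100, 1000)
    exons = [(200, 300), (500, 700)]
    cds = (250, 600)
    for pos in (50, 150, 220, 400, 550, 650):
        print(pos, snp_flags(pos, locus, exons, cds) or "-")
```

With these toy values, a SNP in an untranslated exon gets L and T but not C, which matches the pattern described for the SNPs in the untranslated region.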
So for example, these two SNPs up here are in the, I can't quite read what this says, the GPR124 gene right up here; they're in the transcript for that gene. If the C is colored, that means the SNP falls into coding sequence, or into a translated exon. So all these SNPs down here where the L and T are colored also have the C colored because they fall into the coding sequence of the gene. For these two SNPs, the C is not colored because they fall into the untranslated region of the gene, which is not coding. Anyway, if you wanna see one of these in more detail, you would click on it. That would take you to dbSNP where again, like at Ensembl, we get the flanking sequence of that SNP and the polymorphism, the variant itself, shown here as S, which stands for a C or a G, and further down on the dbSNP page you see the actual mutation. This is a missense mutation, or a nonsynonymous change, changing a G to a C, which changes a codon from GGC to CGC, changing a glycine to an arginine. So I'm actually finished with five minutes to spare. I will point you to a couple of sources of information. All of the genome browsers have online help available. I will say Ensembl's online help is really, really good. It's far better than NCBI's or Santa Cruz's. They even have videos where they'll show people clicking around and show you how to navigate. If you don't like using these online sources of help, I point you to three chapters in Current Protocols in Bioinformatics, which are available through the NIH library at this URL shown here. They're all reasonably up to date and they will walk you through examples of how to access genome sequence data in the ways that I've shown you. So if you have any questions, I'd be happy to answer them up here afterwards. Thank you.
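As a footnote to the dbSNP example above, the GGC to CGC change can be checked against the standard genetic code with a few lines of Python. This is only a sketch: the function name and the three labels (synonymous, missense, nonsense) are choices made for illustration, and amino acids are reported as one-letter codes (G for glycine, R for arginine).

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC ambiguity code S stands for C or G, as in the flanking sequence display.
IUPAC_S = ("C", "G")


def classify_change(ref_codon, alt_codon):
    """Return (reference amino acid, variant amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


if __name__ == "__main__":
    print(classify_change("GGC", "CGC"))  # ('G', 'R', 'missense')
```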