Can you hear me? Okay. I'm Tyra Wolfsberg. I'm one of the course organizers and also the associate director of the Bioinformatics Core at NHGRI. So what I'm going to be talking to you about today is how to get data out of genome browsers. Oh, sorry, before I start, I need to say that I have no relevant financial relationships with commercial interests, for those of you doing CME credit. So what I'm going to talk to you about today is how to get data from the three publicly available genome sequence browsers, one at UC Santa Cruz, one at Ensembl, and one at NCBI. This is something that I do pretty much on a daily basis, and a lot of our work in the Bioinformatics Core is based on getting data out of these different online resources and either helping people to analyze it or to display it themselves. So before I get into the browsers themselves, let me just go over a couple of details about the types of data that you're going to see. All the browsers start with the same information, and that is genomic sequence. And I'll show you in a couple of slides that, for the most part, that's the same genomic sequence, but not always. Then each of the three browser teams independently annotates the genome with relevant information. Those annotations can be different because they are done independently by the three different teams. The types of things they annotate would, of course, be genes, which they can annotate using RefSeq mRNAs that Andy talked about last week, and I'll touch on those a bit more today, GenBank mRNAs, other sources of transcript sequences, and also ab initio gene predictions. They also annotate things like SNPs and, for example, non-coding functional elements, and we'll go through some of the different types of annotations that the three browsers make available to you. And before jumping in, I'd just like to go over a very quick overview of how genome sequences get generated in the first place.
This is taken from an older review that's 11 years old at this point, going over two modes of genome sequencing that were available at the time. This clone-by-clone shotgun sequencing is what was used for the publicly funded Human Genome Project, that is, the NIH-funded genome project. And what they did was to take the human genome, which you might imagine as an encyclopedia, where each chromosome is a single volume in that encyclopedia, and break each chromosome up into a series of about 350 kilobase inserts, which they cloned into vectors called BACs. Before they even started sequencing, they first made a map of these BACs along each chromosome. They had one BAC that started at the left end of chromosome one, then another BAC overlapping that a bit, another BAC, another BAC, until they had a whole tiling pattern of BACs across each chromosome. Once they had mapped these BACs, then they sequenced them. The BAC inserts of about 350 kilobases, as I said, are too big to stick in a sequencing machine, so they were broken up into smaller pieces by a process called shotgunning, which generates sequences of a couple of hundred nucleotides, and those are the pieces of sequence that were actually put in the sequencing machines. So you end up with the short sequences from each of these BACs, shown here in different shades of blue, and then using genome assembly programs, you can stitch these pieces back together, looking at the overlapping sequences of letters, until you end up with the sequence of each BAC. Because you have made a map of the BACs along the chromosome before you started, you then know the order of these different-colored blue pieces, and you can end up with a chromosome sequence. The opposite strategy, which was taken by Celera, which sequenced the privately funded human genome, was to dispense with this whole BAC mapping process.
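The stitching step described above, where assembly programs join reads by looking at overlapping letters, can be sketched in a few lines. This is a minimal illustration only, assuming error-free reads and a simple greedy merge; the function names and the toy reads are hypothetical, and real genome assemblers are far more sophisticated than this.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` letters and
    returns whatever contigs remain.
    """
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Toy reads (made up for illustration) that tile a short sequence.
reads = ["GATTACAGG", "ACAGGCTTA", "CTTAGCCAT"]
assembled = greedy_assemble(reads)
```

With a BAC map, this kind of merging only has to work within one clone at a time; without a map, all the reads from the whole genome go into the same pile, which is why the whole-genome approach leans so heavily on the assembly software.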
Rather than mapping first, they just took the entire human genome and shotgunned it into all these little pieces, and then wrote computer programs to assemble these pieces back into the sequences of individual chromosomes. There was a lot of controversy at the time about whether such a method was actually possible, whether you could have sequences that come from chromosome one and chromosome X all in one big pile and manage to figure out which comes from which. But it was a successful strategy, and they ended up with a relatively complete genome sequence. This process worked so well that it was used for most of the other genome projects until fairly recently. So, for example, mouse and rat were pretty much all done by this whole-genome shotgun sequencing. Newer strategies are based on next-generation sequencing methods that you'll hear about in a couple of weeks from Elaine Mardis. But just as an overview, this procedure is quite different. It's a lot faster. This is one example of a procedure that we're using to sequence a genome in our group: you would take the genome, break it up into smaller pieces, and sequence those pieces on a 454 GS machine, which generates sequences of a few hundred nucleotides in length. And those sequences are assembled into these contigs, longer pieces of sequence. Because you haven't done a mapping step, you don't know the order of these contigs along the chromosome. But you can fix that by generating what are called paired-end reads off a machine called an Illumina machine, which generates much shorter sequences. So basically you would have clones, and rather than sequencing the whole clone, you would just sequence in from the two ends of the clone, shown here as these blue lines.
Because you know that these two end sequences came from the same clone and so were close together in the genome, you can then pull these two contigs together with a blue sequence and end up making a longer scaffold, perhaps with some gaps that you would go back and do more sequencing on. So this is a strategy that is starting to be used over the last year or so for other genomes. For example, the panda genome was sequenced using this methodology. A bit more about the sequence assemblies. It's pretty complicated to generate a sequence assembly from the available data. Data are still being generated even for genomes which have been declared finished, like the human genome, so assemblies are still being calculated. The human genome gets updated every couple of years. The mouse genome gets updated every couple of years, even though they've been declared finished. The assemblies are not always displayed simultaneously on the three different genome browsers. I'll show you an example of that in a minute. If you don't see a version of an assembly that you're looking for, both Santa Cruz and Ensembl maintain these pre-release websites. I've shown you the URLs here, and they may have genome assemblies or gene annotations that are more recent than the ones that you see on the regular genome browser, but that are still sort of in a test phase, so you wouldn't want to trust them entirely. Both Santa Cruz and Ensembl provide online archives of older assemblies so you can always get back to your old data. So say, for example, you're looking at a region of a genome in one of the browsers, and one day you come in and they've updated the genome assembly. All of a sudden the coordinate system is going to change. Some of the gene predictions may change. It can really be difficult to figure out what you were doing the previous day. So at both Santa Cruz and Ensembl, you can always get back to your old data. NCBI provides only limited archives.
So bottom line, if you're trying to compare data from the different genome browsers, you need to make sure you're looking at the same assembly. And this can be a bit cryptic because all three browsers name their assemblies in slightly different fashions. So I'm just showing some examples of the different naming conventions here. And for example, for the dog genome at present, NCBI is displaying the most recent version, whereas Santa Cruz and Ensembl are still one assembly behind. So you're not going to be able to directly compare data between NCBI and the other two genome browsers. I'd like to encourage you to ask questions as I go through. So if you have anything that you want to ask at this point, wave your hand. Or as I go through later, please stop me and I'll stop. So jumping, sorry, one more thing. I want to just briefly touch on the reference sequences. Andy talked about these last week, but I'll go over them again. So basically the RefSeq project was initiated by NCBI to come up with one good copy of each mRNA, protein, and non-coding sequence in the genome. And this was just a way to eliminate confusion. So for example, there are at least 20 different cDNA sequences for human beta actin in GenBank. And you, as a researcher, if you want a good sequence for beta actin, really have no idea which one to pick. Well, NCBI has, using a combination of computational and manual methods, selected the one they think is the best version of the beta actin sequence, copied it over, and assigned it a new accession number, which looks like this. It has an NM, an underscore, and then a series of numbers. And they put that out for people to get, and also for the genome browsers to use as part of their gene prediction pipelines. You can recognize these RefSeqs because they always contain two letters, an underscore, and then a string of numbers.
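The accession pattern just described (two letters, an underscore, then digits) is easy to check in code. The following is a small sketch, not an NCBI tool: the function name is hypothetical, the optional ".version" suffix is handled because RefSeq records carry one, and the lookup table covers only the six transcript and protein prefixes (NM, NR, NP, XM, XR, XP), leaving other RefSeq prefixes labeled as "other".

```python
import re

# Two capital letters, an underscore, digits, and an optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

# Only the transcript/protein prefixes; other RefSeq prefixes exist.
PREFIX_MEANING = {
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}


def parse_refseq(accession):
    """Split a RefSeq-style accession into its parts, or return None."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version is not None else None,
        "type": PREFIX_MEANING.get(prefix, "other"),
    }


beta_actin = parse_refseq("NM_001101")   # human beta actin mRNA RefSeq
not_refseq = parse_refseq("X00351")      # GenBank-style accession, no underscore
```

A check like this is handy when a spreadsheet mixes RefSeq and original GenBank accessions and you only want to keep the curated reference records.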
The RefSeqs that start with the letter N are ones that are derived from GenBank submissions and have real mRNA sequences backing them up, or I should say real sequence data backing them up. Things that start with the letter X are predictions based on an annotation pipeline, which may or may not have real sequence data backing them up and may or may not be as good quality. So those should always be viewed with a bit of skepticism. And here's just an example of the beta actin reference sequence. It looks very much like a normal sequence record except you have this weird style of accession number. You get an overview summary, which was written by the NCBI staff, telling you what this gene does. And it also tells you which original GenBank accession numbers this particular sequence was derived from, which can be handy if you wanna go back to the original record for some reason. Okay, so jumping into Santa Cruz, I'll tell you how I'm gonna proceed. If you've downloaded the handouts, you already know this. But what I have is a series of screenshots that you would generate yourself if you were going through these examples. So I encourage you, as time permits after the class is over, to go back and try some of these examples. Cuz it's one thing to sit here and watch me go through this. It's quite a different thing to try to do this on your own, if you haven't tried that already. So we're gonna start out with a very common query at Santa Cruz, which is to view a region of the genome by querying with a gene symbol. This is the Santa Cruz homepage. They give you news right up front here, as well as a variety of tools across the top and down the sides. If you just wanted to do a simple query, you click on the genome browser link. And then in the search box, you would enter your search term. Before you do that, you'd need to choose what clade and what genome you're looking for, we're doing human, and also what assembly.
You'll see that they make available four different human genome assemblies going back to 2003, up until 2009. Most people at this point are using the 2009 assembly. But there are some reasons why you might want to use the 2006 assembly, because there's a lot more annotation data available for that. And I'll get into that a bit later. So you put in your query. There's a list of potential queries that you can do down here. I'm searching by a gene symbol. You can also search by an accession number, a chromosomal coordinate, a keyword, et cetera. You'd press the submit button and come back to a list of entries. So what Santa Cruz does when you do a search in the general search box is it's just looking for a text string. I typed in ADAM2. So it's matching any instance of ADAM2 that it can find, which includes not only the ADAM2 gene, but also ADAM20, 21, 22, 23, et cetera, because those all have the text string ADAM2 in them. The data are organized into what are called tracks, based on the source of annotation. Up here at the top, we have the UCSC Genes track. So these are genes which are predicted using a variety of online transcripts. So they use RefSeqs. They use UniProt. They use something called CCDS that I'll get into a bit later. These down here, called the RefSeq Genes, only contain genes predicted by the location of RefSeq transcripts. I tend to prefer these because the RefSeqs are manually curated and I trust what I'm getting out of them. Some of the transcripts they have up here are not as well curated and you get some sort of weird-looking splicing coming back. So for the purposes of this example, I would select the ADAM2 link in the RefSeq Genes and go on to the next slide, which is giving you a basic overview of what you're going to get from Santa Cruz. The data are organized into tracks. So up at the top here, we see where we are in the chromosome, which is also shown here in the schematic.
And then we have our genes. The first track is called UCSC Genes. So as I said, these are based on a variety of different sources. There's one gene, yet four different transcripts, four different spliced forms. The reason I know this is that these tick marks right here represent exons and the horizontal line connecting them represents the introns. So if you look carefully, you'll see four different spliced forms, which are either using or splicing out a variety of different exons. The direction of transcription of the gene is shown by the very small arrowheads that are present on the intron lines. It's a bit hard to see in this view, but in this case, all the arrowheads are pointing to the left, which means the gene starts over here on the right and points to the left. So that's something to keep in mind. When you're used to looking at a textbook representation of a gene, they always start on the left and go over to the right. But in the genome, the genes can point in either direction. The genome doesn't care which way the genes are going. Some of them point this way. Some of them point that way. And you need to make sure when you're looking at them that you know what the orientation is, or you may get very confused and think that this is the first exon rather than the last exon of the gene. Underneath the UCSC Genes track, we have the RefSeq Genes track. As we noticed before, this only has one transcript with one set of exons. And then down below are some other annotations that I'll be talking about a bit more later. If you click on a track item, so, for example, here, if you click on any of these ADAM2 genes, you're gonna get more information. So this is an example of what you might get if you go to the known genes track description. You get all types of information. The page is very, very long. I'm just showing here one particular example, which is some microarray data that you can get right off the bat.
If, on the other hand, rather than clicking on the UCSC Genes track, you click on the RefSeq Genes track, you get a different look. Again, these data are coming from NCBI, so you get a lot of links up here back to various NCBI resources, which I'll be talking about later. But what I wanna point out is down near the bottom where you can retrieve sequence, in particular, the genomic sequence. So what we wanna do here is get the promoter sequence of this particular transcript. That's a question I get quite a bit. People want the promoter sequence of a gene so they can copy and paste it into their favorite transcription factor binding site prediction program, but they need to have the sequence before they can do that. So we're gonna say we want the promoter sequence and some other defaults that I've checked down here. And when you click that, you very quickly get the 1000 nucleotides upstream of your gene of interest. And note that this is just doing it one gene at a time. I'll show you a bit later how you can do all the transcripts in the genome. So how do you move around the genome browser? What we're looking at right now is just a zoomed-in view of ADAM2. Actually, before I do that, let me mention one other thing. You can now move tracks around on the browser. So say, for example, you didn't like this UCSC Genes track up here. You wanted to move it further down so you could align it, say, next to the SNP track, which is down here. You could just highlight it so it turns green and drag it down and it would show up there. Another thing you can do is flip the orientation of a gene. So if it really disturbs you that this particular gene is going from right to left, you can hit the reverse button down here and that's gonna flip the display. So now the gene is going from left to right. The way you know this is that the gene annotations, the transcript names, are no longer on the left side of the display, they're now on the right side of the display.
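Because genes can point either way, "the 1000 nucleotides upstream" mentioned above means different genomic coordinates depending on the strand, and a gene on the minus strand also needs its sequence reverse-complemented to read 5' to 3'. Here is a rough sketch of that bookkeeping, assuming 0-based, half-open coordinates; the function names and example coordinates are hypothetical, and this is not how the browser itself is implemented.

```python
def promoter_interval(tx_start, tx_end, strand, upstream=1000):
    """Genomic interval covering `upstream` bases before a transcript.

    Coordinates are assumed to be 0-based and half-open. On the plus
    strand the promoter lies to the left of the transcript start; on the
    minus strand it lies to the right of the transcript end.
    """
    if strand == "+":
        return max(0, tx_start - upstream), tx_start
    if strand == "-":
        return tx_end, tx_end + upstream
    raise ValueError("strand must be '+' or '-'")


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (needed for minus-strand genes)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical transcript coordinates, for illustration only.
plus_promoter = promoter_interval(5000, 9000, "+")
minus_promoter = promoter_interval(5000, 9000, "-")
```

The same transcript coordinates give two different promoter windows, which is the practical reason to check a gene's orientation before pulling upstream sequence.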
So it's up to you whether you like it this way. Just keep in mind, once you have clicked that reverse button, everything that you look at is going to be reversed. I find this actually very confusing because all of a sudden you're in a different orientation from what you might be expecting. So I would say use that reverse button with a bit of caution. If you want to navigate around the genome, you can use these move buttons to move to the left or to the right to see more details. You can zoom in and you can also zoom out, which is what we're gonna do here. So if you zoom out by three-fold, you're now gonna be seeing a longer region of the genome. ADAM2 is in the middle now and you're seeing the two flanking genes on either side. If you wanna zoom in, as I said, you can use these buttons up here. Or for a more directed zoom, what you'd probably wanna do is what I've shown right here. I've put my mouse in the very top track up here where the numbers are and dragged it over to the right, and that highlights the section of the genome in purple. What's gonna happen on the next screen is I'm gonna focus in just on this region that I've highlighted in purple, which is shown now right here. So we're just looking at the very 5' end of the IDO1 transcript. Something I wanna point out here is that this particular exon looks sort of weird rather than just being a single vertical box like most of the other exons are. It looks sort of like a top hat flipped on its side. What Santa Cruz is trying to show you there is the difference between the translated and the untranslated regions of the transcript. So any place where you see a tall box like right here, that indicates that that part of the exon is translated, or made into protein. Any place you see a shorter box would be the untranslated region of that particular exon.
So the subsequent exons off the screen down here would all be tall, because those would all be translated exons, and you would see the opposite orientation at the 3' end of the transcript: the hat would be flipped the other way to show the 3' UTR way off on the right. So what you saw there was just the default tracks that Santa Cruz shows you. I wanna point out one thing down here. There's a track called Common SNPs. So this is basically showing you all the common SNPs in this region of the genome. Each one is indicated by a tick mark. So let's see how we change that display and also add tracks to this browser in general. If you scroll down to the bottom of the screen, so below where you had that nice graphic, you end up with a very long list of other data which you can add to this display. If you click on the names of any of these tracks, you get an information page, which explains what it is. I'm gonna be concentrating down here on these SNP tracks. There are now four different SNP tracks at Santa Cruz, which is a bit confusing, and they are described here. I'm not gonna read through those, but the one I'm gonna focus on here is this thing called Common SNPs. So these are SNPs that have a minor allele frequency of greater than 1%. If you wanna change the display mode of a particular track, you would click on the pull-down menu underneath it. And you'll notice that there are five different ways of showing that particular track. By default, most of the tracks are hidden, which means, I guess intuitively, that they are not displayed, which is a good thing because if you display a lot of tracks, your window's just gonna get longer and longer and longer. So you wanna hide most of the tracks that you're not interested in. The opposite of hide is full, and that displays all of your data. That's normally the mode that you wanna do your display in, with the exception of the SNP track, and I'll explain that to you in a minute.
Dense, squish, and pack are just different ways of condensing the data so it fits better in less space, so you see it all, but it's not shown in great detail. So let's change the SNP track to full and see what happens. You would change it on this pull-down, then click the refresh button and go to a new view of the browser. So if you remember, a minute or so ago, the SNP track was all on one line with lots of tick marks. Now you have the SNPs individually displayed. Each one of these is the accession number of a SNP from NCBI. Most of them are black. A couple of them are red and green. So you may wonder, what is this color coding of the SNPs? What does that mean and how do I change it? Well, if you click on the link here called Common SNPs, that's gonna bring you to a page that explains what the Common SNPs track is and also allows you to change the configuration of that track. Different tracks will have different types of configurations that you can change. In this case, we can change the color. So we can have all the SNPs shown in black except for synonymous SNPs, shown in green, and non-synonymous SNPs, shown here in red. I'm also gonna set the display mode to pack. I'll spare you the details, but in general, the SNP track displays best when it's in pack mode as opposed to the other modes. And when you come back, you see a display very much like what you had before, except that you now understand that the red SNPs are non-synonymous and the green SNPs are synonymous. So the next thing I wanna talk about briefly are the ENCODE tracks which are available at Santa Cruz. If you remember, two weeks ago Eric Green talked about the ENCODE project. This is an NHGRI-initiated project to annotate all the genes and all the regulatory regions in the human genome. Much of that data is submitted to Santa Cruz so it'll be viewable in the context of other genomic features.
You can recognize these ENCODE tracks because they have a double helix, the NHGRI logo, next to them. So all these things with the double helix are coming off the ENCODE project. There's now a growing number of ENCODE tracks available on the most recent genome assembly, which is from 2009, also called hg19. Some of the tracks were actually originally generated on the hg18 version of the genome sequence, which is from a couple of years ago, and then ported over to the current version. Those are highlighted here with this number 18 next to them, and I think there are some ENCODE tracks that came from the previous assembly, although I'm not seeing them right now. And there's a large variety of ENCODE tracks, which I'm not gonna go into in any detail. The one I wanna focus on is this one here called ENCODE Regulation, which is on by default. That's a super track, which is composed of data from these six different ENCODE experiments. Three of them are in hide mode, so we're not seeing them by default. The other three are on by default. And if you look at them, here's what you get. So remember, I said one of the goals of ENCODE is to figure out where the regulatory regions are in the genome. What I focused on here is this region between the ADAM2 gene and the IDO1 gene, where the two genes are pointing in opposite orientations. So ADAM2, the 5' end is here. IDO1, the 5' end is here. So they're pointing like this. You would imagine there should be a lot of regulatory data in this region because that's a promoter region, which would control the transcription of both of these genes. And indeed, that's what you see. In this ENCODE super track, we have histone acetylation marks that are often found near regulatory elements. Those are shown as these different-colored histograms. We have a lot of DNase hypersensitive regions. DNase hypersensitive regions mark regions which are accessible to protein binding.
And we also have a variety of transcription factors binding in this region. For more detail on the ENCODE data, I point you to this manuscript that Eric Green pointed out two weeks ago, the user's guide to ENCODE, published in PLoS Biology last year. They walk you through many different examples showing you different types of tracks and the type of data that you can get out of them. For example, in this particular example, we have a SNP which is upstream of the oncogene MYC. And it's associated with a number of different regulatory elements, which all have characteristics of enhancer sequence. The histone modifications, as well as transcription factor binding, all point to this being an enhancer region. So this particular SNP, which I forgot to mention is a cancer-associated SNP, might be involved in the regulation of MYC through enhancer activity. Or this region might be involved in enhancer activity. I would also point you to a seminar tomorrow, which is being given here in Lipsett at 11 o'clock by Ewan Birney, who's basically the head of Ensembl. And he's gonna be talking about the ENCODE project and the latest and greatest results that aren't yet published. So if you wanna know more about ENCODE, I highly recommend coming back tomorrow in 25 hours. So a couple more things to touch on at Santa Cruz. One of them is attempting to use the Santa Cruz BLAT engine to find the chicken homologue of a human protein. If you remember, last week, when Andy Baxevanis went through different methods of sequence alignment, he talked about both BLAT and BLAST. So BLAST is a traditional sequence alignment tool developed by NCBI. It's very sensitive. It's used for comparing both sequences that are close evolutionarily, as well as sequences that are more distant. Very sensitive, been a workhorse for many years. The downside is that it's slow.
So Santa Cruz developed a program called BLAT, which is meant to find sequences that are very highly similar to each other, either nucleotide sequences against the genome, or you can even take protein sequences and BLAT them against the translated genome. It does a great job of finding things that are within the same species, but it's not so good at finding hits between species. So we're gonna attempt to use it here to find the chicken homologue of a human protein. We're gonna get that human protein sequence from NCBI, from a RefSeq, and paste it into the BLAT search. If this were a normal Santa Cruz page, not a BLAT page, there'd be a link up here called BLAT. You'd paste in your sequence and select the genome, which is chicken, and the assembly. I should say that this is a rather contrived example, which I'm showing just to make a point about BLAT. This particular older chicken genome assembly is no longer available off the main Santa Cruz page. I had to go to the genome preview page that I told you about earlier in the talk to get this older assembly; the newer assembly is from 2006. So you submit this data, and very, very quickly, it goes off and takes that human protein and compares it to the entire translated chicken genome, and returns you four results. You get two links, one to the browser and one to the details. So let's take a look at what the browser link looks like. This opens up a screen on the chicken genome browser, and this line right here shows you your BLAT hit. So these solid rectangles indicate regions where the human protein sequence aligns to the chicken genome. The regions between them indicate regions where there is no alignment. So on first glance, you might think this looks pretty good. If you take a protein sequence and compare it to a genome sequence, what do you expect? You don't expect one big block of sequence; rather, you expect individual exons.
Because remember, the exons are stitched together to make the mRNA and later the protein. So you might think that these individual blue blocks that you're seeing here actually represent individual exons. However, if you were to look at the details link, which shows you the sequence alignment between the human protein and the translated genome, you'll find out that you've been sorely misled. The three boxes that you were getting are actually these very, very short regions of alignment. So anything that's in blue aligns, anything that's in black does not align. Here we have the human protein, and it's showing you in blue the bits that align to the chicken genome. This doesn't look very good. So here's the chicken genome showing you the three regions of alignment, and here's one example of the alignment. This would be something that would just be a spurious alignment. There happens to be a bit of sequence overlap between these two sequences, but nothing of any importance that you would really care about. So I just wanna make the point that just because you get a result back on a genome browser doesn't mean that you have some interesting biological result. You always need to go back and look at the underlying data, look at the sequence alignment, look at whatever you need to look at to convince yourself that what you are seeing is actually the correct answer. If you had looked at this in a bit more detail, you might have noticed the following. These columns here tell you the start and end positions of the sequence alignment, here on the protein and here in the genome. So for example, we have, sorry, a 735 amino acid protein that we started with. The alignment only goes from amino acid 539 to 600.
So there's only about 70 amino acids' worth of sequence alignment, which by most standards would not be enough to predict some sort of homology between a protein and the genome that you're comparing it to. Questions? So another thing that you can do at Santa Cruz, which is actually used a lot by the genomics community, is to add your own custom tracks. This is basically adding your own data to the genome browser such that you can see it in the context of other genome browser data. All you need to do is to put your data in the right format. There are a number of different acceptable formats, but in short, they are all some sort of tab-delimited text where you would have the chromosome number, a start position, and a stop position. If you wanted to try this yourself, this particular text file is available at this URL down below. And when you put it into the browser appropriately, you end up with something that looks like this. So from the red box down is all the normal Santa Cruz data that anybody would see at Santa Cruz. What's highlighted here in the red box are the data that you added yourself. So you've created four different tracks, you've color-coded them in different colors, and each of your data points is indicated as a little tick mark. There are basically four different ways of sharing your custom track data or getting your own data into Santa Cruz. One of them is that you can upload your own data from your computer. The upside to that is that you can see it but it's not available to anybody else. You can have your top-secret data. You can look at it, and it's not gonna stay around. Another option is that you can post your annotation data, that text file I just showed you, to your website. And if you create the right type of URL, or web link, that feeds that data into the Santa Cruz genome browser, you will have the view that I showed you before as well. And you can then send that link out to other people.
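The custom track file and the sharing link described above can both be generated with a short script. This is a sketch under stated assumptions: it writes the simplest BED-style layout (chromosome, start, stop) under a `track` header line, and it builds a link using the browser's `hgt.customText` URL parameter to point at a hosted file. The function names, track name, coordinates, and the example.org address are all hypothetical placeholders.

```python
from urllib.parse import urlencode


def bed_custom_track(name, description, color, features):
    """Format features as a minimal custom track in BED layout.

    `color` is an (R, G, B) tuple; `features` is a list of
    (chromosome, start, end) tuples, written as tab-delimited rows.
    """
    r, g, b = color
    header = f'track name="{name}" description="{description}" color={r},{g},{b}'
    rows = [f"{chrom}\t{start}\t{end}" for chrom, start, end in features]
    return "\n".join([header] + rows) + "\n"


def browser_link(db, position, track_url):
    """Build a genome browser URL that loads a custom track from `track_url`."""
    query = urlencode({"db": db, "position": position, "hgt.customText": track_url})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


# Hypothetical features and hosting location, for illustration only.
track_text = bed_custom_track(
    "my_peaks", "Example peaks", (255, 0, 0),
    [("chr8", 39600000, 39600500), ("chr8", 39650000, 39650200)],
)
link = browser_link("hg19", "chr8:39500000-39700000",
                    "https://example.org/my_track.bed")
```

You would save `track_text` to a file on your own web server and send `link` to collaborators; the assembly name in the link has to match the assembly your coordinates were generated on.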
You can make a companion page for a manuscript if you wanna display your data. If you wanna have Santa Cruz tracks so that you can say in your manuscript, look, here's where you can get to them, that's one way to do it. A third option is to create something called a session, which configures your browser with specific track combinations, including custom tracks. This is a great option if you're sort of in the process of analyzing your own data. You can create a session without making it publicly accessible and you can come back day after day and look at this data in the context of the genome browser. Once you decide it's more ready, you can then share that session with a limited group of people, if you so choose. Or if you have data which seems very appropriate, you can actually contribute it to the Santa Cruz genome browser team and they will make it available for the whole world to see on their website. And there's more information about how to deal with these custom tracks at the two URLs at the bottom of your screen. The final thing I wanna talk about at Santa Cruz is something called the Table Browser. So this is basically a way to get the underlying data out of the Santa Cruz databases. So you're seeing this nice web display that shows you your data in nice graphical format, but underlying that is a database that contains all your data in text format that can be extracted, and if you're doing any sort of computer programming you probably wanna be able to extract the data as text and then run your own programs on it. So the Table Browser's a very powerful way to do that. Types of things that you can do: you can get back DNA sequence for a track, and I'll show you how to do that. You can calculate the intersections between tracks. So for example, if you wanna find all the SNPs in a particular gene, you would intersect the SNP track with the gene track to pull out the data that's shared between those two tracks.
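To make the idea of intersecting two tracks concrete, here is a small Python sketch of the same logic run on text extracted from a browser. It is not how the Table Browser itself is implemented; the function name `snps_in_genes`, the gene names, the SNP identifiers, and all coordinates are hypothetical, and the sketch assumes inclusive start and end positions.

```python
def snps_in_genes(genes, snps):
    """Map each gene name to the IDs of the SNPs that fall inside it.

    genes: list of (name, chrom, start, end), start and end inclusive.
    snps:  list of (snp_id, chrom, position).
    """
    hits = {name: [] for name, _, _, _ in genes}
    for name, gene_chrom, start, end in genes:
        for snp_id, snp_chrom, position in snps:
            # A SNP is "in" the gene if it is on the same chromosome
            # and its position lies within the gene's interval.
            if snp_chrom == gene_chrom and start <= position <= end:
                hits[name].append(snp_id)
    return hits


# Hypothetical tracks, purely for illustration.
gene_track = [("geneA", "chr8", 100, 500), ("geneB", "chr15", 1000, 2000)]
snp_track = [
    ("snp1", "chr8", 150),
    ("snp2", "chr8", 900),
    ("snp3", "chr15", 1500),
    ("snp4", "chr15", 2000),
]
intersection = snps_in_genes(gene_track, snp_track)
print(intersection)
```

The nested loop is fine for a handful of genes; for whole-genome tracks you would want sorted intervals or an interval tree instead.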
And you can also filter the track data based on certain criteria. So for example, show RefSeq genes that only contain one exon. So in the example I'm showing, we wanna get all the promoter sequences for all the genes in RefSeq. So a couple of slides ago, I showed you how to get promoter sequence from one gene. Here's how you do it from all of them. To make a long story short, you would select the appropriate tracks and an output format of sequence. You go through a couple of selection screens, which look a lot like the screen that we saw before, where you tell it what type of sequence data you want, and out the bottom will come a list of promoter sequences. I've chosen to just show 200 nucleotides here because that way I can fit a couple on the screen, but you'd probably want something bigger. So that was all I had for Santa Cruz. Do I have any questions from anybody before I move on to the next browser? Okay. So the next browser is Ensembl, which is a collaboration between the EBI and the Wellcome Trust Sanger Institute, and it's based in Cambridge, England. I should say before I do this that my distinct preference for genome browsers is really the Santa Cruz genome browser. It's very easy to use, it's user friendly, it's pretty intuitive without having to read any documentation. That's my perspective; I'm sure other people feel very differently. Ensembl is another popular genome browser. I think it has a steeper learning curve, but once you get through that learning curve there's a lot of interesting data, and I'm gonna go through that now as well. So here we have the Ensembl homepage. We have links to a number of different genomes. There's a very long list of genomes available. And then, excuse me, all sorts of different things you can do with those genomes as links across the page. We're gonna do a BLAT/BLAST search by clicking on the link at the top of the page.
And in my example here what I'm doing is I'm taking a short sequence tag, a 20 nucleotide sequence tag, and I wanna find the location where it maps in the human genome. So I paste my sequence in the box, choose my organism, and I can select the search tool, so it allows you to do BLAT, which is gonna be quick. It also allows you to do BLASTN, which is gonna be more sensitive. BLAT is probably not gonna work for a 20 nucleotide sequence tag because it's too short for BLAT to handle, so you really do need to do BLASTN. And they have a number of pre-canned parameters. I'm gonna pick the one here called near exact matches to an oligo. So these are gonna set some of the configuration parameters that Andy told you about last week. It's gonna do that for you automatically. One could argue that may or may not be a good idea, that you may want more control over your BLAST searches than this gives you, but if you want more control, you can always click on the configure button and you'll see a lot of the same controls that Andy showed you last week with NCBI. So the search runs for a couple of minutes and you come back with results that look like this. You get a schematic karyotype of the human genome showing you the two locations where that particular sequence hit, one here on chromosome eight and one here on chromosome 15. Down below, you see details of those matches. Both of them hit at 100% identity, so that's good, but only one of them hits over the full length. So it says query start and query end. That tells you where in those 20 nucleotides the sequence alignment goes. Both of them start at nucleotide one. The first one extends to nucleotide 20. The second one only goes to nucleotide 17. So that's telling you that only the first 17 nucleotides in that query sequence align with the genome on chromosome eight. So we're obviously more interested in the hit on chromosome 15 because you get a full alignment.
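The check just described, comparing query start and query end against the length of the query, can be written down in a few lines of Python. This is only a sketch of the arithmetic; the function name `query_coverage` and the dictionary layout are invented for illustration and are not an Ensembl output format. The numbers are the ones from the example: a 20 nucleotide tag, one hit spanning positions 1 to 20 and one spanning 1 to 17.

```python
def query_coverage(query_start, query_end, query_length):
    """Fraction of the query covered by the alignment (1-based, inclusive)."""
    return (query_end - query_start + 1) / query_length


QUERY_LENGTH = 20  # the 20 nucleotide sequence tag

# The two hits from the example, both at 100% identity.
blast_hits = [
    {"chromosome": "15", "query_start": 1, "query_end": 20},
    {"chromosome": "8", "query_start": 1, "query_end": 17},
]

# Keep only hits whose alignment spans the whole query.
full_length_hits = [
    hit
    for hit in blast_hits
    if hit["query_end"] - hit["query_start"] + 1 == QUERY_LENGTH
]

for hit in blast_hits:
    coverage = query_coverage(hit["query_start"], hit["query_end"], QUERY_LENGTH)
    print(f"chr{hit['chromosome']}: {coverage:.0%} of the query aligned")
```

For a short oligo, percent identity alone is not enough, because a hit can be 100% identical over only part of the query.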
There are a number of different links that you can choose here. The one you probably want is this thing called C, for ContigView, because that is going to take you to a view that looks like this. So this looks a lot like Santa Cruz, but yet different. So let me walk you through it. The display is organized into a couple of sections. So up here at the top, we have an overview. So this is showing you the context of your BLAST hit, which I believe is this red line right here, and it's showing you that there's some stuff around it, but at a very 10,000 foot view. Down below is showing you more details. So again, here's your BLAST hit and here are some genes. We're hitting the TCF12 gene, which has lots and lots of different transcripts. So each of these transcripts is indicated by a separate line. The translated exons are the solid boxes. The untranslated exons are open boxes, so that's showing you where the UTRs are compared to the coding sequence. The reason there are so many different transcripts is that Ensembl uses a variety of ways to predict genes. The blue guys and the, sorry, the red guys and the blue guys are the normal Ensembl transcripts predicted by their annotation pipeline. The red ones are coding. The blue ones are non-coding. Then there's some shown here in yellow, which are called merged Ensembl/Havana. What that means is Havana is a project to manually curate genes in the human genome. So these yellow ones, or I think they're called gold, are ones that are both predicted by Ensembl and then manually curated by Havana. Then there's a third type of somewhat curated data represented up here by these green tracks. This is something called the CCDS set, or the consensus coding sequence set. The logo's right here. So the CCDS project is a joint collaboration between a couple of different genome centers to create a good set of protein coding genes. So you'll notice there's no untranslated sequence here, just the translated exons.
So that's another source of well annotated transcripts in addition to the RefSeqs and in addition to these gold Ensembl/Havana transcripts. It gets a bit confusing, I do admit. There's this blue line down here at the bottom and that's showing you the contig to which these things have been mapped. So the human chromosome is still in some sense divided into individual contigs, or individual long chunks of sequence that all make up a chromosome. Any transcripts that are shown above the blue line are pointing in this direction, the normal direction from left to right. Transcripts that are shown below the blue line, like this guy here, are pointing in the opposite direction. A bit confusing, but that's the way they show it. Yes? I'm not sure, to tell you the truth. It's not part of the CCDS and I can't read from here what it says. Yeah, I'm honestly not sure, but if you wanna come down afterwards, we can look into that one together. Any other questions? If you wanna add tracks to this view, you do that by clicking on configure this page, and that brings up a window where there's a long list of tracks, sort of like what you saw at Santa Cruz, but perhaps not as intuitive as to what's going on. If you click on one of these track categories, such as dbSNP, you get a list of types of dbSNP data that they have. We're gonna turn on the dbSNP variants by clicking on the box, and you get this sort of weird splotchy display that's supposed to show you that the display is now turned on. And when you go back to your viewer, you've added on the SNP track down here with a variety of SNPs, with different colors indicating different types of SNPs, which I'll show you in just a minute. To navigate in the browser, you would use these links up here. You can zoom in, you can zoom out, you can move right, you can move left. We're gonna move to the right a little bit and go into this exon right here, which is now a completely translated exon. And here are the SNPs down below again.
So I've brought up the SNP color coding legend. At least for these SNPs, we have blue ones that are in introns. Not surprising. This is a big intronic region right here. We have green that are synonymous and yellow that are non-synonymous, which you're probably having a hard time seeing. If you click on any of these SNPs, you will bring up a little pop-up menu that looks like this. It gives you a bit more detail about that particular SNP. And it also allows you to click on this link up here, which takes you to more information about that particular SNP. What I have clicked on is the yellow non-synonymous SNP down here. Sorry, I've clicked on the green synonymous SNP so you can actually see it. It tells you it's synonymous and gives you some more information here. So if you click on this link right here, you get to a page that gives you links to various different properties of this SNP. So there's eight possible different views that Ensembl will give you of this particular SNP. I'm not gonna go through all eight of them. I'm just gonna focus on the two that I thought might be the most interesting. Of course, the background information, what type of SNP this is, is always present on each page. So you know it's an A to G SNP. And this was a synonymous SNP, so there's no protein change in this particular view. Up here, we see, sorry, I clicked on this link here called genomic context. And what that brings up is a page that looks like this, showing you a very nice picture of this region of the genome with the two exons and all the SNPs down below, color coded. If you ever needed to make a figure of the SNPs in a particular gene, this might possibly be a place to come. Another link you can click on is this one here called population genetics. This is gonna show you the distribution of that SNP in different populations. So that's this link down here. So again, this is our A to G SNP.
And this is showing you its frequency distribution in different populations. CHB plus JPT, that's the Han Chinese and Japanese populations from the 1000 Genomes Project, which Eric Green touched briefly on two weeks ago. In this particular population, A's are represented at 70%. The G allele is seen about 30% of the time. That's to contrast with the YRI population. These are the Yoruba ethnic group from Nigeria. They have a much higher frequency of A's than do the Asian populations. And you can get this information for any of the SNPs. There are other types of things you might want to get out of Ensembl. If we go back to this particular view, where we're seeing all of our transcripts, if you click on any of these transcripts, you get a pop-up menu that gives you a variety of different choices. You can link to the transcript, you can link to the gene, you can link to the protein product. So we are going to explore more about the TCF12 gene by clicking on this gene link. And that opens a page that looks like this. To orient you, all of the Ensembl pages are organized into tabs. So we were earlier on the location tab because this is showing you the overall genomic context of what we're looking at. We're now in something called the gene tab, which shows you information about this particular gene. And there's also a transcript tab, which I'll touch on a bit more later. So this gene is TCF12. At Ensembl, it has this accession number. Let me just digress a minute about the Ensembl style accession numbers. They all start with the letters E, N, and S, and then they have a G for gene, a T for transcript, or a P for protein. So that's how you recognize whether that's a gene, a transcript, or a protein. They also give you the species information. So by default, human is the original Ensembl annotation. Human doesn't have any species orientation, any species information.
But if you were looking at mouse genes, they would be ENS for Ensembl, MUS for mouse, and then G for gene, T for transcript, P for protein. So if you're smart about it, you can sort of figure out what type of Ensembl product you're looking at. So this is just one gene, the TCF12 gene. And as I said earlier, this gene has a lot of transcripts, numbered from one down to 20, and I think there's even more going off the screen. Each of these transcripts gets its own identifier, because each one of them is gonna have a slightly different splice form. And a good number of these have corresponding protein products, although some of them are just processed transcripts that aren't actually translated. There's a variety of things that you can do from the Ensembl gene tab. One of them is look for orthologs of this gene. So these are homologs that are automatically calculated by Ensembl, and they would link to homologs in all the different organisms that Ensembl has genomes for and supports. So for example, just off the top of the list, we have links to alpaca, a lizard, armadillo, a bush baby, and the list goes on. So if you don't wanna do any BLAST searches yourself, but you wanna very quickly get homology information from the gene that you're looking at to a different species, this is one place to come. I'm sorry, what was that? So the question is, what's the difference between the target and the query? Is that what you're asking? And where are you seeing this? That's a very good question. So the question is, what are these two percentages? There's a target percent ID and a query percent ID, and why are they different from each other? I don't know the answer to that one either, but I'll offer you the same answer I offered the other gentleman, which is, if you wanna come down afterwards, we can discuss it after my talk and try to get the information. Sorry. Anybody else wanna stump me while I'm on a roll here? Okay.
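The accession-number convention described a moment ago (ENS, then an optional species code such as MUS, then G, T, or P) is regular enough to decode with a short Python sketch. The function name `parse_ensembl_id` and the example identifiers are invented for illustration, and the pattern assumes a three-letter species code followed by digits, which may not hold for every species or feature type in Ensembl.

```python
import re

# Assumed pattern: "ENS", optional three-letter species code,
# one feature letter (G, T, or P), then a run of digits.
ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3})?([GTP])(\d+)$")

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_ensembl_id(identifier):
    """Return (species_code, feature_type); species_code is None for human."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a recognized Ensembl-style ID: {identifier}")
    species_code, feature_letter, _digits = match.groups()
    return species_code, FEATURE_TYPES[feature_letter]


# Made-up identifiers that only illustrate the naming scheme.
for example_id in ["ENSG00000000001", "ENSMUST00000000001", "ENSDARP00000000001"]:
    print(example_id, parse_ensembl_id(example_id))
```

Human identifiers carry no species code, which is why the species group in the pattern is optional.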
So going back to the view that we had before, oops, wrong direction, I'm not gonna go through all the different possible things that we can do, but one other thing that, hello. One other display that I think is nice is this thing called the variation image. So what that gives you is, again, another pretty picture that you could potentially use in a manuscript, and that's showing you the locations of all the variants in the gene. So what we have here is a blown up view of all the potential exons in use by the TCF12 gene, and each of these exons is shown by a vertical brown bar. This display goes through transcript by transcript, showing you all the possible transcripts. So this is the first transcript here, and this is showing you the exons in use by that transcript, so you can see there it uses some of them, indicated by these solid rectangles. Other exons are skipped in this transcript, indicated just by a horizontal bar, and then in each one of these exons, it's showing you what variants have been documented. So again, potentially a nice display. You also get down here some PROSITE profiles and other protein domains that you will be learning about next week. Yes. Okay, so as I mentioned, we were on the gene tab. If we skip to the transcript tab for one particular transcript, it looks very similar to the gene tab, although if you have very good eyes, you will notice the list of things on the left that you can do is a bit different. One thing I want to show you is supporting evidence. So each of the transcripts in Ensembl was predicted in some way, using the Ensembl pipeline, which is using some computational methods and also using some transcript alignment data. So here's how you get that transcript data.
On the supporting evidence tab, again, the transcript is displayed, all the exons in that transcript are displayed up here, and down here is a long laundry list of all the transcripts that went into annotating this particular gene. Listed here are the accession numbers of each transcript, and the places where you see either the green or yellow boxes are the places where that particular transcript had an exon. So if you focus in on this third coding exon right here, you'll see it's actually quite rare in the transcript data. It's present in this yellow transcript right here. The accession number starts with the letter B, and I can't read beyond that. Repeated here, and then it's present in one, I'm in the right column here, one other transcript right here. So that particular exon in this predicted transcript doesn't have a whole lot of supporting evidence. So if you were looking at this transcript in detail, you might or might not actually believe this particular transcript, just because that exon doesn't have much biological data supporting it. Something else I wanna point out at Ensembl is the protein sequence for this particular transcript. So if you click on the protein link, that will take you to the translated protein sequence, which is nicely color coded. Each exon is a different color, and the splice sites, the junctions between the exons and the introns, are shown in red. While I'm in this view, I want to point out the link called view in archive site. So as I mentioned at the very beginning, you can always get to old Ensembl data. And at the bottom of every Ensembl page is this link called view in archive site, which brings up a page like this. So you can choose from a variety of different Ensembl versions going back to August of 2007. So I need to make a distinction between Ensembl versions and genome assemblies.
So remember I said way back in the beginning, the human genome, the mouse genome, whatever genome you name, are assembled every couple of years. So some of the older data from 2007 up to May of 2009 is annotated on NCBI 36, that is NCBI build 36, or the previous version of the human genome. The later data is all on the other assembly called GRCh37, which is the newer assembly. It's very cryptic; if you didn't know this, it probably wouldn't be at all obvious. Now on a given genome assembly, Ensembl updates all their annotations every couple of months, and these annotation sets get sequentially numbered. So Ensembl 46 was back in August of 2007, so that was an annotation set on NCBI build 36. They sequentially number these annotation sets up to the current one, which is 65, so the previous one is 64. So what I wanna point out is this group of Ensembl annotations from 46 up to 54. Those are all based on one genome assembly, so the underlying data are the same, but the annotations are different because they perhaps updated their computer algorithms. So even though the bottom line genome assembly hasn't changed, the annotations may be different, and there may be reasons why you wanna go back to an older annotation, because perhaps you think the gene prediction in that particular region of the genome was better a year ago than it is today. So bottom line, you can always get back to it through this archive page. Okay, so the next thing we wanna do at Ensembl is try to find a chicken homolog of our human protein. So remember, we tried this using BLAT at Santa Cruz and it did not work, so let's try this at Ensembl using BLAST. The interface is similar to what you saw before, but we're pasting in a protein sequence. We are gonna go against the chicken genome and we're running a version of BLAST called TBLASTN, which takes a protein sequence and compares it to a translated genome sequence. And you get back a lot more results than you did by BLAT.
Now that you've learned a little bit about what these results may look like, you can study this page before you actually look at the alignment data. So you'll notice the top three hits all have fairly decent length alignments. So for example, the first hit, the alignment starts at amino acid four of the protein and goes to amino acid 600 and some odd. The second one starts at amino acid, I think it's six, and goes to amino acid 500 and some odd. So they're decent length alignments, which is good. Remember, back in the BLAT view, it was only a 70 amino acid alignment. And if we look at one of these particular alignments in greater detail by clicking on the A link, you get a page that looks like this, which I will just say is actually a fairly decent alignment between human and chicken. There's not a lot of conserved sequence, but you wouldn't expect human and chicken to be all that similar to each other. In this particular example, they're about 30% identical. So we have our query sequence up on top. This is a human protein sequence. Our translated chicken genome sequence is down below, and any place that you see a capital letter in the space in between is a position at which those sequences are identical. If you see a plus sign, you have a conservative substitution. So bottom line, BLAST works a lot better than BLAT if you're trying to go between species. The final thing I want to touch on at Ensembl is using a database interface called BioMart, which is a wonderful resource for cross-referencing data from different sources. I find that people don't really know about this, but it's really a nice tool to add to your arsenal. So in this particular example, we're gonna use BioMart to start with a list of Ensembl gene identifiers shown here. And we wanna pull out their genomic coordinates, the gene symbol, as well as the RefSeq accession.
So we're gonna cross-link between Ensembl and NCBI's RefSeq. So you basically first need to choose your database. These are zebrafish genes, Danio rerio, so we're gonna choose zebrafish. We paste in our gene identifiers. So remember the key that I told you earlier: ENS is Ensembl, G is a gene, so these are all gene identifiers. The DAR stands for Danio rerio, so these are zebrafish gene identifiers. We paste them in the box. And then, so what's a bit confusing is you put your input into this section here called filters. And there's a long list of different types of filters that you can use. I'm filtering based on gene IDs. Then to select what you wanna get out of the process, you need to select these attributes, shown on the next screen. The attributes that are on by default are the Ensembl gene and transcript IDs. Those are always there, although you can take them off. And then you can add additional attributes. So I'm adding on the chromosome name, the gene start and stop, as well as the associated gene name. And those are all found under this section called features. You can also go to a section further down the page called external references. So this is how Ensembl has correlated itself with other databases. In our case, the database that we're interested in is the NCBI data. So we're telling it we want the RefSeq mRNA and the RefSeq mRNA predicted. So with all of those things together, you click on a link up at the top, which is sort of cut off, called results. And you get back a page that looks like this. So we have our starting Ensembl gene identifier from zebrafish. We get back the transcript identifier. Each gene has multiple transcripts, which is why you see the genes repeated with different, unique transcript identifiers. You get the chromosome to which each one maps. The gene start and end. And this would be the transcript start and end, for the different transcripts. Actually, I'll take that back.
These are the gene start and end, the gene name, as well as the RefSeq accessions where those are available. So for some of these, you're getting an NM RefSeq, which again are the nicely curated RefSeqs that are coming from GenBank sequences. In some cases, you're getting XM RefSeqs, which are the ones coming out of the gene prediction pipeline at NCBI. So not all of the Ensembl genes are going to map to NCBI. And conversely, if you were to perform a similar search at NCBI, I'm not quite sure how you'd do that, but if you were to do a similar mapping, not all of the NCBI RefSeqs would map to an Ensembl identifier. Making my point again that the different groups are doing their annotations independently and you're going to get different data depending where you go. Anecdotally, I would say that Ensembl is a much better source of gene annotations for zebrafish than is NCBI. Ensembl has a much more active zebrafish annotation program than NCBI does. A second quick BioMart example is we're going to start with that same list of zebrafish gene IDs, and we want to get back the IDs of the predicted human orthologs of those genes. We would put in the data the same way we did before, click on the homologs link, and select the organism for which we want homologs. The alphabetical list starts with Atlantic cod and goes all the way down to, I don't know what's at the bottom, but human is somewhere in the middle. You say you want the gene and protein IDs from human, the percent identity, and you get back the predicted orthologs right there. You'll notice that not all of the genes have human orthologs, so for this particular one, there's no calculated ortholog. So again, it's a very handy way of very quickly getting data from different resources all integrated together. Questions? Yes. So you want to start with a RefSeq number and get out an ENS number.
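For anyone who would rather script the first BioMart example (zebrafish gene IDs in; coordinates, gene name, and RefSeq accessions out), here is a hedged Python sketch that only builds the XML query text and does not contact any server. The dataset name `drerio_gene_ensembl`, the filter name `ensembl_gene_id`, and the attribute names are assumptions about BioMart's internal naming and should be checked against the BioMart interface before use; the gene identifiers are made-up placeholders, and `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, gene_ids, attributes):
    """Build a BioMart-style XML query: one filter on gene IDs (the input)
    and one Attribute element per requested output column."""
    query = ET.Element(
        "Query",
        {
            "virtualSchemaName": "default",
            "formatter": "TSV",
            "header": "1",
            "uniqueRows": "1",
            "datasetConfigVersion": "0.6",
        },
    )
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Filters are the input: here, a comma-separated list of gene IDs.
    ET.SubElement(
        dataset_node,
        "Filter",
        {"name": "ensembl_gene_id", "value": ",".join(gene_ids)},
    )
    # Attributes are the output columns.
    for attribute in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": attribute})
    return ET.tostring(query, encoding="unicode")


# Placeholder IDs and assumed attribute names, for illustration only.
query_xml = build_biomart_query(
    dataset="drerio_gene_ensembl",
    gene_ids=["ENSDARG00000000001", "ENSDARG00000000002"],
    attributes=[
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "chromosome_name",
        "start_position",
        "end_position",
        "external_gene_name",
        "refseq_mrna",
        "refseq_mrna_predicted",
    ],
)
print(query_xml)
```

The structure mirrors the web form: filters are what you paste in, attributes are what you get back.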
I'm pretty sure that, under the filter option, there's a way to choose external identifiers, and you would paste in your RefSeq accession numbers and then do the converse when you get to the attributes. You would export your Ensembl links. If you want to come talk to me afterwards again, we can work through this for sure, but I'm pretty sure I've done that. Any other questions? I'm glad I could actually answer one. Does anybody else have a nice question? Okay. So moving on to NCBI, I'm going to go through this fairly quickly. I will say, and I'm hoping there's not too many NCBI people in the audience, the NCBI genome browser is by far my least favorite. I'll show you in very brief detail how to use NCBI, but I'm also using it as a way to highlight some other NCBI resources that I think you should be familiar with. So this is the NCBI Map Viewer homepage, with a whole lot of organisms available. You choose the organism you're searching in, and the search that we're doing today is looking for a region between two SNPs. So say you've done some sort of association study, you've narrowed down your critical region to a region between these two SNPs, and you want to know what genes are in that region. So you do your search, and you get back that these two SNPs hit on the short arm of chromosome eight, and they hit two different genome assemblies. So this is something we haven't encountered at either Santa Cruz or at Ensembl. So I should say these are two different human genome assemblies. This one here called reference, that's the normal genome assembly, build 37 of the genome, or hg19, or whatever your naming system is. This other assembly down here is called HuRef primary assembly. And if you Google this, you will find out this is actually Craig Venter's genome.
So Craig Venter, for those of you too young to remember all the controversies, was the founder of Celera Genomics and did the privately funded human genome project. He's decided to make his personal genome available in the genome browser. This reference genome, I didn't say this earlier, is a composite of about 20 or so different people's genomes, so it doesn't represent one person; this one does. We're gonna stick with the conventional reference assembly and we're gonna look at all the matches on chromosome eight for these two SNPs. You come up with a display that looks like this, which is sort of similar to what you're seeing at Santa Cruz and Ensembl, hopefully becoming sort of familiar to you. Although at this point, the display is organized vertically instead of horizontally. Over here in our variation track, we have a list of SNPs. We queried for two SNPs. The first one is up here at the top. The second one would be off the screen at the bottom. Other tracks that we have: we have a UniGene track, which represents a synthesis of EST sequences. I'm not completely sure why that's shown here, because I don't find it to be an overly useful track. The track I think you're gonna be more interested in is this one here called genes on sequence, which are basically the NCBI gene predictions. So let's try to change the track display a bit. You would click on the maps and options button on the left, which brings up a screen that looks like this. For those of you who've used the NCBI Map Viewer in the past, you'll be aware that this is a newer version of the maps and options box. It used to look very different. But the function is basically the same. Over here on the right, you have a list of the displayed tracks. Over here on the left, you have a list of the tracks you can add to your display. I used to have the UniGene track up here because that was on by default.
There was a minus button next to it. I clicked on the minus, it went away. If I wanted to add tracks, I would click on their names over here. So for example, you could add the Ensembl genes or Ensembl transcripts to your view at NCBI if you wanted to. You can also reorder your tracks here. By default, because I searched for SNPs, the SNP track or the variation track is showing up on the right. That's sometimes called the master map, and that particular map has the most detail available for it. I'll go into that a bit later. I'm switching things around: rather than having the rightmost map be the variation track, I want it to be the gene map, and I just did that by dragging the maps with respect to each other. So my new view looks like this. I've gotten rid of the UniGene track and I've moved the gene track over to the right. So you'll notice when the gene track is on the right, as the master map, you get a lot of links to NCBI resources. So let's explore some of those in a bit more detail. We're gonna look at what happens when you click on the gene symbol, what happens when you click on the OMIM link, and what happens when you click on the HM link. So the gene symbol first. This takes you to an NCBI resource called Entrez Gene, which I hope that you're all familiar with, and if you're not, I suggest that you look into it. This is a great curated catalog of all information available for a particular gene. So we're looking now at the ERLIN2 gene. For the ERLIN2 gene, you have links to some resources. Usually, I think I've cut it off down here, you'll have a description of what this particular gene does, which is manually written by NCBI staff. And what I find the most useful is a link to the RefSeqs for this particular gene, so rather than searching Entrez Nucleotide to find the RefSeqs, you can come directly to Entrez Gene and they're right here.
You have the NM transcripts and the corresponding NP proteins, and if there are multiple isoforms, they all show up. And there's a long list of other features that are available in here as well, which I'm not gonna go through but I encourage you to look at on your own, just to see all the information that NCBI's brought together. Yes? The NC number, another one I can answer, down here. So that is the RefSeq accession for the chromosome. So this is NC underscore, a bunch of zeros, and an eight. That is human chromosome number eight. So the chromosomes are NCs. There are contigs or scaffolds, I can't remember which, that are called NTs: an NT underscore and a string of numbers. You'll sometimes see those as well. But it's nice that they have chromosomes, because the coordinates that you get are then chromosomal coordinates that start with nucleotide one and go to the very end of the chromosome. I don't know who asked the question, but did I answer it? Yes, another one? That was you, okay. Sorry, it's very bright down here, I can't actually see anybody. Okay, so another thing you can link to is the OMIM record for this particular entry. So this is the OMIM display for this gene. OMIM is Online Mendelian Inheritance in Man, which I believe is still handled by Hopkins? Yes. Yes. It is manually curated by a staff at Johns Hopkins University who read the literature and then bring in lots of information from the literature, specifically medically oriented information, about a particular gene. So you'll see right off the bat that this particular gene is associated with the phenotype spastic paraplegia 18, I think that says. And there are links to other things here as well. I hesitated about where this was based. It used to be that OMIM did the annotations and it was displayed at NCBI. OMIM has recently taken back the handling of the online display.
So it looks a bit different from what you might have seen before, but I believe the data are the same as what they were a couple of months ago. So this could be a great place to come if you just want a synthesis of the literature about a particular gene. Another link that was available off the Map Viewer page, the link called hm, goes to something called HomoloGene. So this is NCBI's tool to find predicted homologs between the protein that you started with and other proteins in the NCBI databases. So here we're getting predicted homologs between the human gene and mouse, dog, zebrafish, worm, et cetera, and further down the page you get more details. Up here you're getting links to the accession numbers. So this is sort of similar to what you saw at Ensembl for the ortholog prediction. NCBI doesn't have as extensive a list of species that they're doing their ortholog or homolog comparison between, so it's gonna be a shorter list. Okay, so how do we navigate around NCBI? You can zoom in and out using this zoom control over on the left, or you can put your mouse over a particular track and click on it, and it'll allow you to zoom in or zoom out by a specific amount. I wanna zoom in on one particular region of this ERLIN2 gene. So I'm gonna show 100K, which is gonna zoom the display in. I didn't show it here, but after I did that, I did another zoom in, 4x, just to really zoom in. And I ended up with a display that looked like this. We are focusing in on two exons of ERLIN2. Here's one of them. Here is the other one. The direction of transcription of this gene is rather cryptically noted by this little arrowhead next to the gene, which shows that the gene is pointing down or, if it were sideways, it would be pointing left to right, sort of the traditional direction. And then over here, you can see we have all of our variants, and you're now seeing individual variants rather than the smear that you were seeing at a more expanded resolution.
If you wanna change the order of the maps, if you wanna now bring this variation map over to the right to see more details, you have two options. One is to go back into Maps and Options and change the master map there. Another one is to click on the arrowhead next to the map name and it'll jump over to the right, which we shall do. And now we're seeing the variants listed in greater detail. So these are each of the SNPs. They have little lines going from the accession number to the actual position at which they appear on the genome, so you don't get confused. Just because the accession number is right here doesn't mean it corresponds to a SNP right here. There's actually an arrowhead that's pointing a little bit further down the chromosome. Next to each of these SNPs, you have an L, a T and a C. The L tells you that the SNP is within a locus, or within a gene. The T tells you that it is within a transcript, but what they mean by transcript is that it's within an exon. So you can be in an intron and you'll get the L button lighting up but not the T button. And the C tells you that it's inside of coding sequence. So most of the SNPs within transcripts are going to be within coding sequence unless you're in the five prime or three prime untranslated region. To get more detail about a particular SNP, you would click on its accession number, and that takes you to NCBI's dbSNP. I'm just showing some highlights from that here. Up at the top we get more information about the particular allele and the nomenclature that's used in different transcripts, and I've scrolled way down here to sort of the meat of what you want. It's a missense mutation. Missense means nonsynonymous, meaning your SNP is changing the sequence of the protein. You're changing an A to a G, showing you the two codons here and here, showing you the actual amino acid change, and then here showing a slightly different view of what that looks like in its genomic context. So I'm basically done.
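The L, T and C flags and the missense call described above can be written as a short Python sketch. This is an illustration, not NCBI's implementation: the function names, the gene, exon and coding-sequence coordinates, and the example codons are all made up for the example and are not taken from the slides. The only fixed piece is the standard genetic code.

```python
# Illustrative sketch only. Coordinates and codons below are hypothetical,
# not the ERLIN2 values shown in the lecture.

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def snp_flags(pos, gene, exons, cds):
    """Return the flags ("", "L", "LT" or "LTC") for a SNP position.

    gene and cds are inclusive (start, end) tuples; exons is a list of
    inclusive (start, end) tuples. L = inside the gene, T = inside an
    exon, C = inside an exon and within the coding range.
    """
    flags = ""
    if gene[0] <= pos <= gene[1]:
        flags += "L"
        if any(start <= pos <= end for start, end in exons):
            flags += "T"
            if cds[0] <= pos <= cds[1]:
                flags += "C"
    return flags


def classify_codon_change(ref_codon, alt_codon):
    """Compare the amino acids encoded by two codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"  # nonsynonymous: the protein sequence changes


# Hypothetical gene model: three exons, with untranslated ends.
GENE = (1000, 5000)
EXONS = [(1000, 1200), (3000, 3300), (4800, 5000)]
CDS = (1100, 4900)

if __name__ == "__main__":
    for position in (2000, 1050, 3100, 9000):
        print(position, repr(snp_flags(position, GENE, EXONS, CDS)))
    # An A-to-G change in the first codon position (hypothetical codons).
    print(classify_codon_change("AAT", "GAT"))
```

With this made-up gene model, a position in an intron gets only L, a position in the five prime untranslated part of the first exon gets L and T, and a position in a coding exon gets all three, which matches the behavior of the buttons described above.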
I'll point you to another couple of sources of online help. All of the genome browsers have various tutorials and help documents online, which are available here. And through Current Protocols in Bioinformatics there is also a chapter on each of these genome browsers. If you are on the NIH network, you can get to these units for free using the URL down here. You can also get to them through the NIH Library. That link is currently broken, so I'm using this link directly to the Current Protocols website. So I'm happy to take any questions. Before I do that, I should announce that the next lecture, given by Andy Baxevanis, is called Biological Sequence Analysis 2, and it'll be held here next week. Thank you.