Good morning. I'm going to go ahead and get started. I'm Tyra Wolfsberg. I'm one of the co-directors of the course, along with Andy Baxevanis and Eric Green, and thank you all for coming out this morning. What I'm going to be talking to you about today is genome-scale sequence analysis. Like the previous speakers, I have no relevant financial disclosures. I have four different subject areas I'd like to cover today. The first two are discussions of two different online genome browsers, one hosted at the University of California, Santa Cruz and one at Ensembl. I expect that some, if not all, of you have had some experience with at least one of these two genome browsers, but I'm hoping that I can give even those of you who have had experience some extra tricks that you weren't familiar with. Then I'm going to go to two different places where you can get data in different formats, not just graphical web displays, but also text and other formats that you can run computations on. One of those is the BioMart portal, which is part of Ensembl, and the other one is the Galaxy tool out of Penn State.
So before we actually launch into the details of the genome browsers, let's take a quick look at what sort of material you can expect to find there. Both genome browsers, that is, Santa Cruz and Ensembl, start with the same basic material, and that is the genomic sequence. Then each of the two browsers independently calculates a set of annotations. Those annotations include genes, sequence variation such as SNPs, non-coding functional elements, and many other things, which I will show you a bit of in the course of the lecture. The genes can be predicted based on where existing mRNA sequences align to the genome. So they may start with RefSeq mRNAs.
If you'll recall from Andy's talk two weeks ago, RefSeqs are these partially curated mRNA transcripts that come from NCBI. They come in two flavors. There are those that are well annotated, whose accession numbers start with NM, and there are others that are models predicted just from the genomic sequence itself, whose accession numbers start with XM. We'll talk about that a bit more later on. Genes are also predicted from the locations where other GenBank mRNAs or ESTs align to the genome, as well as by different types of ab initio gene prediction software.
The genome sequence assemblies themselves are pretty complicated to do. They're based on a variety of different types of genomic sequence, which are available at NCBI or other databases. These genome assemblies are updated periodically as new sequence becomes available, usually every couple of years for the larger genomes. The mouse, human, and zebrafish genomes are being assembled by the Genome Reference Consortium, and other genomes are assembled by other sequencing centers or consortia. The updated assemblies do take some time to percolate down to their display in the genome browsers, and we'll go into that a bit more in just a second. But sometimes both Santa Cruz and Ensembl make early genome releases available at these two URLs, which are listed here. So you can get an idea of what genome sequences they're currently working on annotating and what may be available on their main browsers in the future. One nice thing is that both Santa Cruz and Ensembl provide archives of older genome assemblies, so you can always get to an older assembly that you were working on earlier. Keep in mind that the coordinates of features can change from one genome assembly to another.
So even though it may be just a couple of regions in a genome that get updated from one assembly to the next, if those updates happen near the beginning of the chromosome and the annotations that you're working with are at the end of the chromosome, all of those locations are going to be completely different from what they were on the previous genome assembly. So if you're looking at two different genome browsers or two different sources of data, you always need to make sure you're looking at the same version of the genome sequence assembly. Otherwise you're going to get very confused and probably make some pretty bad mistakes.
Some of you may be aware that there's a new human genome assembly that just came out at the end of last year called GRCh38. Although that assembly was completed in December, it's just now being displayed in the Santa Cruz genome browser, and there aren't that many features annotated on it yet. I think there are genes and maybe some variations, but not many other data tracks, so we're actually not going to be looking at it today. It's not available at all at the Ensembl browser, although Ensembl expects to have it up sometime later in March on their Pre-Ensembl site. How is this genome assembly different from the one that came before, which was called hg19 or GRCh37? Well, there are some sequence fixes, some misassembled regions have been fixed, and some gaps have been filled. More importantly, there are over 260 alternate loci. What I mean by this is that there are regions of the genome that differ significantly in their sequence from one individual to the next, so these alternate loci are alternate representations of the sequence in those regions. Most of the alternate loci are in the MHC region of chromosome 6, but there are some other alternate loci on chromosome 19 as well. You can recognize these because their chromosomes have strange-looking names.
A regular chromosome will be called chr1 or chr2. These guys are called chr and a number, and then some additional text after that which represents the accession number from the GRC.
So let's start off by looking at what would happen if you queried the Santa Cruz genome browser with a gene symbol. This is the Santa Cruz genome browser homepage. What's nice about this is that they give you some news about what has recently happened, so in this case there's a new human genome assembly available. If there were new genome assemblies from other organisms, those would be featured here as well. There are a lot of different things you can do up here across the top as well as down the side. We're going to be sticking with this link here at the top called Genome Browser, and when you click on that, you get into the main query page. I don't think I mentioned this yet, but you can actually see a lot of different organisms, probably close to 100 different organism genome assemblies, through the Santa Cruz genome browser. You can get to those by selecting which group you want to look at as well as which genome you want. And importantly, you can also see older or newer versions of the genome sequence assembly. As I mentioned, this new hg38, as it's called at Santa Cruz, is now available. We're going to be sticking with the older hg19, which has been out for quite a while, and you can see you can get some older genome assemblies as well, dating back to July of 2003. In this search term box, you type whatever you want to look for. You can look for a chromosomal position, you can look for gene symbols, you can look for accession numbers of mRNAs, or, as we're going to do in this case, you can search for the name of a gene, here a gene called ADAM2. When you start typing, Santa Cruz will bring down a selection of gene names that it thinks you might be interested in.
So we're just going to go ahead and highlight the ADAM2 gene name right here. This will take you to the main part of the Santa Cruz genome browser, which is shown right here. Up on top, we have a karyotype showing you all of chromosome 8. The position that we're focused on is right here in red, on the short arm of chromosome 8, and here it is blown up down here. The data are organized in tracks. What I mean by that is that there are different layers of data here. There's information up here at the top about where you are on chromosome 8. Then we have our first data track. It's called UCSC Genes, or the known genes track, and it has a variety of transcripts collected from a couple of different sources. Then down here we have some other tracks that we'll come back to later.
If you look at one of these gene tracks, you'll see there are blips, vertical bars. Those vertical bars represent the positions of exons, and the line that connects them represents the positions of introns. If you look really carefully at this intron, you will see little arrowheads. In this case, the arrowheads are pointing from right to left. That shows you the direction in which the gene is oriented on the genome. Some genes point left to right if you're looking at a chromosome, and some genes point right to left. There's about a 50-50 mix in the genome, so you always need to make sure you know which orientation your gene is in. You'll also notice there are four separate blue lines here. Those represent four different transcripts of the ADAM2 gene, and they differ in whether or not they include certain exons. For example, the second transcript is missing this exon over here. The other two transcripts are missing this exon right here.
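The exon, intron, and orientation picture described above can be sketched in a few lines of Python. This is only an illustrative model, not how the browser stores its data: the `Transcript` class, the helper names, and all coordinates are made up for the example, and intervals are assumed to be 0-based and half-open.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open [start, end)


@dataclass
class Transcript:
    name: str
    chrom: str
    strand: str            # "+" points left to right, "-" points right to left
    exons: List[Interval]  # the vertical bars in a gene track

    def introns(self) -> List[Interval]:
        """Gaps between consecutive exons (the connecting lines in the display)."""
        ordered = sorted(self.exons)
        return [(left_end, right_start)
                for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]

    def exons_in_transcription_order(self) -> List[Interval]:
        """Exons in the order they are transcribed, which depends on the strand."""
        ordered = sorted(self.exons)
        return ordered if self.strand == "+" else ordered[::-1]


def exons_missing_from(reference: Transcript, other: Transcript) -> List[Interval]:
    """Exons present in `reference` but absent from `other`."""
    return sorted(set(reference.exons) - set(other.exons))


# Two hypothetical transcripts of one minus-strand gene (coordinates are invented).
full = Transcript("tx_full", "chr8", "-", [(100, 200), (300, 400), (500, 650)])
short = Transcript("tx_short", "chr8", "-", [(100, 200), (500, 650)])

print(full.introns())
print(full.exons_in_transcription_order()[0])
print(exons_missing_from(full, short))
```

For a gene pointing right to left, the first exon in transcription order is the one with the highest coordinates, which is why checking the orientation matters before reading off a "first" or "last" exon.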
If you want more information about what is in a particular track, you can just click on the identifier of that particular track over here on the left. In this case, it's the ADAM2 gene symbol. That'll take you to a details page provided by Santa Cruz, which has collected information on that transcript from all across the Internet and displays it in one place. There's a table of contents up here where you can get an idea of what's available. Up near the top here is some preliminary microarray expression data showing you where this particular transcript is expressed.
The track I actually really like is this one called RefSeq Genes. The reason I like it is that it shows the alignment of the NCBI reference sequences, the curated reference sequences, to the genome. I tend to think those are a good source of transcript information because they have been computationally and sometimes manually curated as well. You'll notice this RefSeq track just shows up as a single line, and there are no little arrowheads on it. That is Santa Cruz's sort of cryptic way of telling you that the data in this track is all compressed into one line. There's actually more information there, but you're not seeing it. If you come down here below this whole track display, to the bottom, you'll see a lot of different tracks listed. Most of them are hidden, and the RefSeq Genes track is in a status called dense. If you click on that, you'll see there are actually a couple of different ways to display this track, ranging from hide to full, with three steps in the middle. Those three steps in the middle are just ways of compacting the data so you don't see it all at once. If you want to get the most information out of a track, you normally want to have it shown in full mode. So if we click on full and then click the refresh button, that will now expand our RefSeq track.
And you can see there are indeed three separate RefSeq transcripts for the ADAM2 gene, and again, they differ in the positions of certain exons. There are four transcripts up here in the UCSC Genes track and three down here in the RefSeq track. They're just different sources of transcripts, and as I said, I tend to prefer the RefSeqs because I feel like there's a bit more curation that's gone into them. But there's obviously some data missing from the RefSeq track that is present up here, which may or may not be a good-quality transcript. You need to look at that on your own.
If you want more information about a transcript in the RefSeq track, you would click on its gene symbol. That will open up a different page. From here, we get links to a lot of different NCBI resources, which is not surprising since this is an NCBI track. We can go to OMIM, or Online Mendelian Inheritance in Man, which will give us some clinical information about that gene if it's available. We can also go to Entrez Gene at NCBI, which will give us a collection of all sorts of different annotation information about that gene. What I want to look at is at the bottom of the page down here, where you can get the sequence of this particular transcript. We're going to look for the genomic sequence. In particular, say that I want to get the promoter sequence of this gene because I want to do some sort of transcription factor binding site analysis. I could click on this genomic sequence link, go through some details as I've shown over here on the slide, and end up very quickly with the 1 kb upstream of this particular transcript. I could plug that into other websites and perhaps analyze which transcription factors are binding there.
So what we're looking at now is the ADAM2 gene, and all we're seeing is the gene itself. As I said, it's oriented so that the beginning of the gene is over here on the right, and the end of the gene is over here on the left.
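Because "upstream" depends on which way the gene points, it can help to spell the coordinate arithmetic out. The following is a minimal sketch, not the browser's own code: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, and the example positions are invented.

```python
from typing import Optional, Tuple


def upstream_region(tx_start: int, tx_end: int, strand: str,
                    length: int = 1000,
                    chrom_size: Optional[int] = None) -> Tuple[int, int]:
    """Return the interval `length` bases upstream of a transcript.

    tx_start and tx_end are the leftmost and rightmost genomic coordinates of
    the transcript (0-based, half-open), regardless of strand.
    """
    if strand == "+":
        # Gene points left to right: upstream lies at lower coordinates.
        return max(0, tx_start - length), tx_start
    if strand == "-":
        # Gene points right to left: upstream lies at higher coordinates.
        end = tx_end + length
        if chrom_size is not None:
            end = min(end, chrom_size)
        return tx_end, end
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string, as needed for minus-strand features."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


# Invented coordinates: the same span read as a plus- and a minus-strand gene.
print(upstream_region(5000, 9000, "+"))
print(upstream_region(5000, 9000, "-"))
print(reverse_complement("ATGC"))
```

For a gene on the minus strand, the sequence pulled from the forward strand of the assembly would also need to be reverse-complemented before it reads 5' to 3' relative to the gene.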
If you really don't like that view, there is a button down here underneath the tracks called Reverse. If you click on that, it is going to flip the display, so now the gene is pointing from left to right. If you can see the arrowheads, which I can't, they are pointing from here down this way, so now the beginning of our gene is over here. The only way you know that you've flipped the display is that the gene symbol ADAM2 is now over here on the right instead of over on the left where it used to be. If you like seeing things from left to right, that's certainly one thing that you could do.
If we want to zoom out so we can see additional information besides the ADAM2 gene, there are zoom controls up here at the top. Here I'm zooming out threefold and flipping the display back, so ADAM2 is going from right to left. We now see the ADAM2 gene, transcripts that are upstream of this gene, as well as a transcript that is downstream of the gene. If you want to zoom in on this display and see additional details, there are zoom-in controls up here at the top where it says zoom in. What I prefer to do is this: if you put your mouse in the very top of the display, where they have the sequence location, and you drag it, you can highlight a particular region of the genome browser. When you release, it'll turn purple, and at that point you have a choice where you can either highlight this region or zoom in. If you highlight it, it will come out looking like this. This is handy if you want to correlate what's going on in different tracks. You now have a highlighted region where you can see what's going on in some other tracks that we'll talk about later. This beats putting a piece of paper up to your monitor, which is what I used to do, to see how the different tracks align with each other. The other thing you can do after you've selected this region is zoom in.
And now we're focusing just on the very end of that IDO1 gene. If you look carefully, you'll notice that the exons of this gene come in two different heights. Most of the exons are tall, but the exons at the ends may be a mixture of tall and short. What this is telling you is that the tall exons are translated; these are coding exons that go on to make protein. The shorter parts of the exons are the untranslated regions, which do not go on to encode protein. In this case, your gene is pointing left to right, so your ATG, or the start of your translation, would be right here at the junction between the short exon and the tall exon.
So how do you get more information on the genome browser besides just the genes? Well, if you scroll down on the page, you'll see there's a wide variety of different tracks that we can add. If we want more information about a particular track, rather than changing how the track is displayed right here, we can just go ahead and click on the track name. For some tracks, that brings you to a page where you can see more information about that track and change what it looks like. This Common SNPs track is pulling SNPs, or variants, from NCBI's dbSNP and displaying them on the genome browser. I should point out that Santa Cruz is currently displaying data from dbSNP version 138. I believe NCBI itself, on their dbSNP site, is up to dbSNP version 139. So not all of the most recent SNP data is available here through Santa Cruz, and I believe the same is true for Ensembl as well. There's a lot of SNP data, and it does take some time to get integrated into the genome browsers, just like the genome assemblies do. So anyway, if we click on the Common SNPs track, that takes us to a page where we can configure that track. In particular, what I want to do is look at the color coding of the SNPs.
So there's a variety of different types of SNPs here, and I want to highlight the coding SNPs: the synonymous SNPs I want to turn green, and the non-synonymous SNPs I want to turn red. Synonymous SNPs are the ones that occur in the protein-coding sequence but do not change the protein sequence itself, whereas a non-synonymous SNP is one that does actually change the protein sequence, and you might expect that it would have a larger effect on the protein. After I've changed my colors, I want to change the track display from dense mode, which I think it was in before, to pack. I told you earlier that full mode was the best way to view most tracks at Santa Cruz. The one caveat to that that I know about is that the SNP track actually looks better if you put it in pack mode rather than in full mode. Once I do that, I see up here the transcripts that we were looking at before, and down below here are all the SNPs in this region of the genome. Each one of these little lines represents the position of a particular SNP. The rs numbers are their accession numbers from NCBI. Most of them are colored black, but there are a few that are red, which are your non-synonymous SNPs, and also some that are green, which are your synonymous SNPs. There are other types of SNPs as well: those that occur at intron-exon junctions, and those that occur in untranslated regions. You can color code those as you want so you can see them more clearly.
A source of data at Santa Cruz that some of you may find very interesting is the ENCODE tracks. Eric Green, in his lecture three weeks ago, talked about the ENCODE project. This is an NHGRI-funded project to try to do a full annotation of the human genome. They've been annotating genes, and most importantly, they've been annotating regulatory regions.
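Going back to the synonymous versus non-synonymous distinction for coding SNPs, here is a rough sketch of the underlying check: translate the codon before and after the substitution and compare the amino acids. This is not how dbSNP or the browser annotates variants; the function name is hypothetical, the standard genetic code is assumed, and the colors simply mirror the green and red choices made in the talk.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Display colors as chosen in the talk.
COLORS = {"synonymous": "green", "non-synonymous": "red"}


def classify_coding_snp(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Classify a single-base change inside a codon (pos_in_codon is 0, 1, or 2).

    The codon must be given in the coding (sense) orientation.
    """
    ref_codon = ref_codon.upper()
    alt_codon = (ref_codon[:pos_in_codon] + alt_base.upper()
                 + ref_codon[pos_in_codon + 1:])
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "non-synonymous"


# GAA (Glu) to GAG (Glu) leaves the protein unchanged; GAA to AAA (Lys) does not.
print(classify_coding_snp("GAA", 2, "G"), COLORS[classify_coding_snp("GAA", 2, "G")])
print(classify_coding_snp("GAA", 0, "A"), COLORS[classify_coding_snp("GAA", 0, "A")])
```

This sketch lumps changes that create or remove a stop codon in with the other non-synonymous changes; real annotation tools usually report those separately.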
So if you want to know what parts of the genome might be involved in regulating gene expression, the ENCODE tracks are a great place to look. When you come down here and look at the different tracks that are available, anything with the NHGRI logo, this black and white double helix, is a track coming out of the ENCODE project. The one that I personally find the most useful is this one down here called ENCODE Regulation. If I click on that track name, you can see this is what's called a super track, which means it is composed of data from a number of different individual tracks all pulled into one place. In this ENCODE Regulation super track, you can get RNA-seq data that will tell you the transcription levels, or gene expression levels, of whatever gene you're looking at. You can get a variety of different histone marks. The histone marks are tied to whether the region of the genome is involved in gene regulation. There are certain combinations of histone marks that suggest promoters, other combinations suggest enhancers, et cetera. So you can turn those on. You can get information about DNase hypersensitivity. DNase hypersensitive sites are ones that the enzyme DNase can very easily get to, which implies these regions of the chromatin are accessible to transcription factors and other regulatory proteins. There's some other information you can turn on as well, including this ChIP-seq data, where you can find the binding sites for various transcription factors. Once you turn all of this on, you get something that looks like this. Here we are looking at our ADAM2 gene and our IDO1 gene, and we're focusing in on the region between the two of them.
One thing to note is that the ADAM2 gene is pointing from right to left and the IDO1 gene is pointing from left to right. So this region in between could be considered to be within the promoters of both of those genes, and you might expect that it would have a lot of different regulatory information. Indeed, you can see there are a lot of DNase hypersensitive sites in here, suggesting that different regulatory proteins can bind in this region. You have a number of different transcription factors binding in here. You have various histone methylation sites suggesting regulatory elements. The final track that I've turned on here is showing you the transcription levels in a number of cell types. The color coding is here. The IDO1 gene seems to be expressed in this yellow cell line, which is embryonic stem cells. The ADAM2 gene is expressed at a much lower level in this green cell line, which is HepG2 liver cells. So again, you can see this information for all regions of the genome and get some interesting data on what might be happening to your particular gene. If you want more information about how to use the ENCODE data in the context of the Santa Cruz genome browser, I recommend this manuscript, "A User's Guide to the Encyclopedia of DNA Elements," that came out in PLOS Biology almost three years ago now. It is a bit out of date, but it still does explain how to use these ENCODE tracks. Just as an example, what they're highlighting here is one region of the genome upstream of the MYC gene that looks like it is an enhancer region for MYC, based on the combination of DNase hypersensitive sites and the transcription factor binding sites that are listed here.
A popular thing that you can do at Santa Cruz is use a tool called BLAT, which Andy talked about, I believe, two weeks ago now. BLAT is a very quick way of comparing your query sequence, which could be a nucleotide sequence or a protein sequence,
to the sequence of the genome. What I want to try to do in this example is to take the human ADAM2 protein sequence and see if I can find a putative ortholog of that sequence in the chicken genome. Given an accession number for the human reference sequence for the ADAM2 gene, we would go to NCBI, search Entrez for that accession number, and pull out the protein sequence of this particular gene. Then, following the BLAT link at Santa Cruz, you would paste that sequence in, choosing the chicken genome and the most recent chicken genome assembly, and you get back a single hit, which looks something like this. If you're looking at this very briefly, you'll see it has 71.6% identity to the human sequence. If you don't know anything else, maybe that sounds reasonable, maybe that sounds high. It's hard to say. Then there's some other information about where it aligns to the genome over here on the right. If you really want to see what's going on, you would click on the browser link, which brings up a page that looks like this. If you don't look at this too closely, it actually looks pretty good. Here is the protein sequence that you started with, and this new track that we've added, which is called Your Sequence from BLAT Search, is showing you the alignment between that protein sequence and the chicken genome. You'll see you have three different blocks of alignment. These rectangles indicate where the sequence is aligning, and these horizontal lines indicate places where the sequence is not aligning. So you might think, this isn't bad. We started with a protein sequence. We're BLATing against a genome. We would expect that protein sequence to align in separate exons. Perhaps these are three exons of the ADAM2 protein in chicken. Not so fast. If you come over here and click the details link, you see a schematic of the sequence alignment.
This is our human protein sequence that we started with. The black parts do not align. The blue parts are the regions that align with chicken. Here's the chicken genome showing you the same thing, and here's just a screenshot of one bit of that alignment. You can see that those three things that we were thinking were exons are really just short bits of alignment. The alignment looks vaguely okay, but it's really, really short, and you only have these three small regions of the protein that are aligning. So I would say this is just a completely random hit. It doesn't actually mean anything, and we have in no way discovered the chicken ortholog of the human ADAM2 protein. This is just to call your attention to the fact that just because you do some sort of sequence alignment search and you get a result does not mean that you've actually gotten anything biologically relevant. You always need to go back and look at your data and see what it looks like before you go on and draw any conclusions.
All right, so what else can we do at Santa Cruz? Well, one thing we can do is add our own custom tracks. If you have your own genome-wide data and you want to see it in the context of Santa Cruz, you can very easily do that. It just needs to be formatted correctly. By that, I mean you need to have your data in some sort of tab-delimited format where you have the chromosome, the start and stop position of each feature, and a name for the feature. If you paste that into the appropriate part of the genome browser, you end up with something that looks like this. Down here, we have the regular data that Santa Cruz supplies for you, in this case the gene track. Up here is our own data that we just pasted in.
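The tab-delimited layout described above (chromosome, start, stop, name) corresponds to the first four columns of the BED format that the browser accepts for custom tracks. As a minimal sketch, the text to paste in could be generated like this. The helper name, track name, and feature coordinates are all invented for illustration, and BED start positions are 0-based with an exclusive end.

```python
from typing import Iterable, Tuple

Feature = Tuple[str, int, int, str]  # chrom, start (0-based), end (exclusive), name


def build_custom_track(name: str, description: str,
                       features: Iterable[Feature]) -> str:
    """Build custom track text: one 'track' header line, then 4-column BED rows."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if start < 0 or end <= start:
            raise ValueError(f"bad interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical features on chromosome 8; the positions are not real SNP locations.
example_features = [
    ("chr8", 39600000, 39600001, "snp_genotyped_1"),
    ("chr8", 39650000, 39650001, "snp_planned_1"),
]
print(build_custom_track("mySNPs", "Example SNP positions", example_features), end="")
```

The coordinates only make sense on the assembly they were generated against, so the assembly selected in the browser has to match the one the data came from.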
This is actually some very old data coming out of Francis Collins' lab, back when they were genotyping SNPs, showing you the positions of SNPs that they were currently genotyping as well as SNPs that they were planning to genotype. The point here is that you can show your own data in the context of other information that is already present at Santa Cruz. This is a rather simplistic example. Nowadays, you might have, for example, RNA-seq data, where you have the alignment of individual RNA-seq reads and you want to display those in the context of the genome browser, and this is what that type of data looks like. Down here, we have a RefSeq gene. Again, here are the exons. These green and orange tracks are two different RNA-seq samples, showing us where those reads align to the genome. Not too surprisingly, those reads align over the positions of the exons, and we can see what the expression pattern of the individual exons, as well as of the gene, looks like across the genome.
Custom tracks can be done in a couple of different ways. The simplest thing, if you don't have much data, is to copy and paste your annotation files into a page on the Santa Cruz genome browser and display them yourself. That is something that only you can see on your own computer, and that data will only hang around the browser for about 48 hours. If you want to be able to share your data with your collaborators, you have two choices. One is that you can post it as a text file on your website and give your collaborators the URL to that file. They can paste that into the genome browser and see it in context there. You can also do something called creating a session at Santa Cruz, where you can save particular track combinations as well as custom data. You can email links to the saved session to your collaborators, and they can see that data as well.
This data, unless you update it, will persist for about four months at Santa Cruz before they delete it to make room for somebody else's.
Very briefly, another feature at Santa Cruz is called the Table Browser. This allows you to get data not so much in a graphical display but as text, which you can stick into Excel or your favorite text program and work with there. You can, for example, retrieve DNA sequence upstream of genes across the whole genome, so you can retrieve the promoter sequence for every gene. You can get intersections between tracks. For example, if you want to list all the SNPs that are in a particular gene, you can do that through the Table Browser. You can also filter track data on certain criteria. For example, if you want to find all genes in the human genome that are just a single exon long, you can do that type of analysis as well. This is just a brief overview of what it would look like as you are configuring the different pages of the Table Browser, in this example retrieving the sequence 200 nucleotides upstream of each transcript in the human genome. I'll come back to this later when we're talking about Galaxy.
Now moving on to Ensembl. Ensembl is hosted in Cambridge. The browser is somewhat similar to Santa Cruz. The types of data that you see are similar, but it has a different interface. What I'm going to be focusing on here are the different types of views that you can have for your data, some of which might be handy if you're trying to create figures for publications or you just don't like the way Santa Cruz displays its data. I'm going to start with a relatively new tool here called the Variant Effect Predictor, or VEP. This is a tool where you can input your own variants. Sorry, let me back up. First, the Ensembl genome browser itself. Here we have a screenshot of the home page. We can choose our genome.
We can choose a number of different data sources. For example, you can also view ENCODE data here at Ensembl. What we're going to be doing is looking at this VEP tool. As I was saying, you input your list of variants. In my case, I am inputting a list of four SNPs from dbSNP with their RefSNP accession numbers, but you can also just put in chromosomal positions. What you're going to get back is an interpretation of what function those variants might have in the genome. The first thing we're going to see is how these variants fall out by location. Are they synonymous or non-synonymous? Are they in introns? Are they upstream? Are they downstream? You'll get this nice graphical overview, which would look much more impressive with 100 variants or 1,000 variants rather than just the four or five that I put in for this example. Then you get a table which shows you, for each variant, where it hits in the genome. It's hard for me to see over here, but the variants are listed by their accession number. It tells you where they map, if you haven't provided that information. It tells you what gene they fall in, as well as which Ensembl transcript.
Here's a quick word about Ensembl gene identifiers. For human, they all start with the letters ENSG, and there's a whole bunch of numbers after that. So this is just the accession number of one particular Ensembl gene. You'll notice this gene is repeated a number of times. That's because there are, I believe, four or five transcripts for this individual gene, which are listed here with their separate transcript identifiers. Those start with ENST. Next, for each combination of variant and transcript, the VEP will tell you what sort of effect that particular variant has. Is it a missense variant? A synonymous variant? Does it occur in an intron? And that's going to vary.
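The identifier conventions mentioned in this talk (NM and XM prefixes for RefSeq mRNAs, ENSG and ENST for human Ensembl genes and transcripts, rs numbers for dbSNP) are regular enough to check with simple patterns. The following is a small sketch with a hypothetical helper name; the example identifiers are invented placeholders, and the patterns cover only the prefixes named in the talk, not every accession type these databases issue.

```python
import re

# Only the prefixes mentioned in the talk; an optional ".N" version suffix is allowed.
ACCESSION_PATTERNS = [
    (re.compile(r"^NM_\d+(\.\d+)?$"), "RefSeq curated mRNA"),
    (re.compile(r"^XM_\d+(\.\d+)?$"), "RefSeq predicted (model) mRNA"),
    (re.compile(r"^ENSG\d+(\.\d+)?$"), "Ensembl human gene"),
    (re.compile(r"^ENST\d+(\.\d+)?$"), "Ensembl human transcript"),
    (re.compile(r"^rs\d+$"), "dbSNP RefSNP"),
]


def classify_accession(identifier: str) -> str:
    """Return a description of what kind of record an identifier refers to."""
    cleaned = identifier.strip()
    for pattern, description in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return description
    return "unknown"


# Placeholder identifiers, made up for the example.
for example in ["NM_000000.1", "XM_000000", "ENSG00000000000", "ENST00000000000", "rs0", "chr8"]:
    print(example, "->", classify_accession(example))
```

A check like this is mainly useful for catching mixed-up columns, for example a list of transcript identifiers pasted where gene identifiers were expected.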
Even if you're hitting variants that are in different exons of a gene, in some cases they may be missense, in some cases they may be synonymous, or intronic, depending on where they hit. We get information about the position of the variant in the cDNA as well as in the protein. You can see where the protein sequence has changed. We can get things like predictions about the effect that this variant is going to have on the protein sequence. So for example, SIFT and PolyPhen tell you the effect of the substitution on the protein sequence. Blue means it's not known. Red means it's likely to be a deleterious effect. Green means it's likely to be a benign effect. You can see that SIFT and PolyPhen do not always agree. This particular variant is flagged in one case as causing a problem with the protein sequence, in the other case not. And there's some other information as well. One thing I glossed over here: these two columns, the AA MAF and the EA MAF, are telling you the minor allele frequency of this SNP in either an African American population or a European American population, coming out of the NHLBI ESP project. So if you're interested in how different variants occur in different populations, you would get that information here. Focusing in on this one particular variant down here, you'll see it hits two different Ensembl genes, one here and another one with a different accession number right here. Depending on where it hits in the main gene, it may occur in the intron of one transcript or it may cause a missense mutation in a different transcript. If you want to look at that in more detail, you would click on, for example, the accession number of the SNP itself, which will take you to a page where you can navigate to more details about that SNP. So this is now the SNP viewer for Ensembl. We can get to information about the SNP itself by following any of these links. So for example, here is the genomic context of the SNP we're looking at.
The one we are focused on is here in this yellow box. It's yellow because it's a missense mutation. Up here are three different, four different, actually five different transcripts. The SNP occurs in the exons of these two transcripts, in the intron of this third transcript, as well as in the intron of this fourth transcript down here. And up here is another transcript which is not part of this particular gene. It's actually an mRNA, and it's showing you that this SNP occurs downstream of that mRNA. Neighboring the SNP, you can see other SNPs that have been annotated in the genome as well, all color coded based on their predicted function. Here is the data in a slightly different way. Here is the nucleotide sequence around the SNP. The one we're concentrating on is this guy right here in red that's underlined, and you can see the positions of the other SNPs, color coded based on their function as well. What you may be more interested in is seeing how the SNP occurs in a larger genomic context, similar to what we saw at Santa Cruz. So for that, I'm going to click on this location tab, and that's going to take us to a view that looks like this, which is very similar to what we were looking at at Santa Cruz. Up here at the top, we can see the overview of where we are. These yellow bars represent the positions of genes. The position that we're looking at right here is in red, and it's highlighted down here. We are zoomed in far enough right now that you can actually see the genomic sequence. One strand of the genomic sequence goes across up here. The reverse strand of the genomic sequence goes across the bottom. One sort of confusing thing about the Ensembl display is the presence of this blue line. Any features that are above the blue line are going from left to right. So genes above the blue line start over here on the left of this display and end over here on the right. Anything below the blue line is going from right to left.
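As an aside on the two strands in that display: the sequence across the bottom is the reverse complement of the sequence across the top, which is why a feature on the bottom strand reads from right to left. A minimal sketch in Python follows. The function name reverse_complement is just an illustrative choice, and the sketch assumes plain A, C, G, T, and N characters, leaving anything else unchanged.

```python
# Translation table for complementing DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    top = "ATGGCC"
    print(top, reverse_complement(top))
```

Applying the function twice gives back the original sequence, which is a quick sanity check when you are unsure which strand a downloaded sequence is on.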
So this particular gene, again the ADAM2 gene that we were looking at at Santa Cruz, is going from right to left, and you know that because it is below the blue line. At least you're supposed to know that because it's below the blue line. Down here, we see our SNP track that we were looking at earlier, and these are just different sources of the ADAM2 gene. The gold ones are the transcripts that Ensembl most believes in, so they're the ones you should probably be focusing on. They're the best annotated. As at Santa Cruz, we can zoom around the display, we can zoom in and zoom out, and also, most importantly, we can configure the page and add additional tracks to our view. If you do that, you will see there is a whole giant load of different tracks that we can add to this display. The one I want to concentrate on is another RNA-seq track, because that's going to give us some more expression information. This is RNA-seq data coming not from ENCODE but from something called the Body Map project. It has data from a number of different tissues. By default, it is off; it's in light blue. If you want to see the RNA-seq data, you would click on these little individual boxes, turn them dark blue, and then you see a screen that looks something like this. So I've zoomed out a bit. I'm now seeing a lot more of my ADAM2 gene. My exons are these blips. Up above here is my RNA-seq data. So here is gene expression from testes. And then we have muscle, and then we have also added brain. These are histograms showing you the frequency at which your reads are aligning. So again, this gene is very highly expressed in testes. You can see that there are large blips corresponding to the individual exons of this gene, showing you high expression in testes. Not so much expression in either muscle or brain. And if you look closely, the blips that you do see don't really correspond with the ADAM2 exons.
I would guess these are probably just random blips and don't actually represent any real gene expression at all. They may represent something, but they probably don't represent expression of this particular gene. The Ensembl data is organized in a series of tabs. So if we want to go to a different tab and see different types of data, we go down here and click on one of the transcripts, one of these gold transcripts. That pulls up a number of different resources that we could link to. I want to link to the Ensembl gene link with this ENSG accession number. That will take us to the gene tab. So this is a gene tab which has information about ADAM2 and all of its transcripts. There are a total of six transcripts; four of them code for proteins, two of them do not. And most importantly, over here on the left, we have a number of different things that we can do now that we are on our gene. So for example, I can click on the ortholog link over here, which is going to show us orthologs, 54 different potential orthologs of the ADAM2 gene in the organisms for which Ensembl displays genome sequence. The organisms are grouped up here by type. I'm looking at, I believe, the bird and reptile orthologs, and those are shown now down here. So if I wanted to find the ortholog of ADAM2 in the anole lizard, I have a link right here. Again, these are computationally predicted orthologs. So before you do anything really drastic with this information, you should probably verify, at least by sequence alignment, that you believe that this is a potential ortholog. At least this will lead you in the right direction, and you're not having to do the BLAST searches and analyze those yourself. There's also a view called the variation image, which is, I think, an interesting one.
So this is now showing you one of the ADAM2 transcripts up here, and they've blown up each of the individual exons, showing you the positions of all the SNPs. You can see the yellow missense SNPs, the green synonymous SNPs, as well as, down here, the different protein domains. Do you remember Andy talked to you last week about how to get data from Pfam and other protein domain databases? That information has been incorporated here into Ensembl, and you can see where the protein domains are as they cover individual exons in this particular transcript. Now let's look at the transcript tab rather than the gene tab. Navigating up here at the top of the display, I was in the gene tab; I click on the transcript tab. It looks a lot like what the gene tab did, with information about the six transcripts. We have a schematic of one of them showing you the exons across the genome. The main difference is there's a bunch of different links over here on the left. In this case I want to follow one link, which is called supporting evidence, and that's an important one because it tells you what data in the sequence databases went into predicting this particular transcript as a valid transcript at Ensembl. So in gold we are looking at the exons of the Ensembl ADAM2 transcript, and down here in different colors are transcripts from a number of different databases. We have here the NCBI reference sequence, which you can tell because the accession number starts with the letters NM. We also have something coming out of the CCDS project, the consensus coding sequence project, which is an integrated source of data showing you support for all of the different exons. I guess one thing I didn't point out at Santa Cruz: the coding exons, or the translated exons, are colored in.
The untranslated exons are these boxes that are not colored in, and you can see the NM transcript has support for all of the exons, whereas the transcript from the consensus coding sequence project is of course lacking the first untranslated exon. And here there's data from a number of other GenBank sources as well, showing you what other data sources support this particular transcript. If you like looking at protein sequences, there are a number of different protein links that we can click on over here. The one that I'm highlighting is just showing you the protein sequence of ADAM2 with the exons highlighted in different colors. There have been occasions in my life where I wanted to know which sequence comes from which exon. This is a really handy way to do that, because you can see they're alternating black and blue. The other thing I'd like to point out is that on each Ensembl page there's a link down here that says view in archive site. What that lets you do is see any page that you're on, going back many years, to see what it looked like in the past. So this is showing you not only different versions of the genome assembly, but also different versions of the genome annotation. For example, all of these archive sites except for the last one are on GRCh37, which is the most recent genome assembly. There's one down here that's going to NCBI build 36, which is the previous version of the genome assembly, from 2006 I believe. These guys that are all on build 37 are just different sources of gene annotation, as Ensembl updates their gene predictions periodically. You can always get that information back. So that is a handy thing to have. The final thing we're going to do in Ensembl is try to really find the chicken homolog of our ADAM2 protein.
So remember when we tried this just a few minutes ago at Santa Cruz, the search did not work, because while BLAT is really fast, it's also not all that good at finding things that are cross-species. It's fast but it's not all that sensitive. So at Ensembl we are going to use the BLAST tool. We're going to paste in our ADAM2 protein sequence. We are going to choose the chicken genome. Ensembl is smart enough to realize that there is a BLAST program called TBLASTN, which takes protein sequences and compares them to a nucleotide sequence that's been translated in all six reading frames, and gives you the results, which look something like this. So here we have a number of different hits to different regions in the genome. I'm highlighting just these first three that occur on the page, and I'm highlighting those because, if you come over here and look at the E value, remember when Andy talked to you about the E value as being a mark of how significant a particular hit was. These guys all have really, really low E values, less than 10 to the minus 130th or so, and there's a big difference between this E value and the next highest E value. You'll also see the alignments are long. Our query sequence alignment starts at the beginning of our query sequence and goes to amino acid 660-some-odd, so a long bit of our protein sequence aligns, and it's hitting a large chunk of the genome. If you want to see one of those in more detail, you would click on the A link, and that will show you your actual sequence alignment. So here on top we have our human protein sequence. Down below we have our chicken genome sequence translated in all six frames. This particular frame has the best hits, and you can see regions of good alignment. The sequences are about 35% or so identical, which is about what you would expect between a chicken and a human sequence.
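For anyone who wants to check a figure like "about 35% identical" by hand, here is a minimal sketch in Python of one way to compute percent identity from two aligned sequences. The function name percent_identity and the convention of counting only columns where neither sequence has a gap are assumptions made for this illustration. Different programs use different denominators, so the number may not match a given report exactly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy alignment, not real ADAM2 sequence.
    print(percent_identity("MKT-LV", "MRTALV"))
```

In the toy alignment, five columns have residues in both sequences and four of them match.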
So again, making the point: this BLAST search at Ensembl took longer than our BLAT search at Santa Cruz, but it is far more sensitive. So if you ever want to do anything cross-species, you're much better off using BLAST. BLAT is great if you're looking in the same organism. For example, if you have a human transcript and you want to see where that aligns in the human genome, BLAT is great for that. It's really, really fast. You hit the submit button, it will be done right away, and it will give you a great alignment. You can do the same thing in BLAST; it's just going to take you longer. But you need to take the time to run BLAST if you're trying to do anything cross-species, or trying to look for paralogs, for example, where the sequences are not going to be identical, where they're just going to be some amount similar.
Okay, so leaving the genome browsers and moving on to the two web resources where we can get more text-type data. We're going to look at the Ensembl BioMart portal, and we're going to try to cross-reference data from different sources. I should point out that BioMart is actually a database and web interface to a number of different data sources. If, rather than going to Ensembl as I've listed here, you go to biomart.org, you will see there are, I believe, 46 different BioMarts available for different types of data, ranging from protein sequences to pathways to gene data from organisms that you've never heard of, and so on. But we're going to concentrate here on the data from Ensembl, although the interface should be similar if you're looking at other data sources as well. The first thing you need to do with the Ensembl BioMart portal is to choose which data source you're looking at. We're going to start with zebrafish, so I've chosen the zebrafish genome in this pull-down menu, where all the other genomes that Ensembl supports are listed as well.
The question I'm going to try to ask here is this: I'm going to give BioMart a list of zebrafish gene IDs from Ensembl, and I want to pull out the location to which these things map in the zebrafish genome. I also want to get the gene name and, more importantly, the RefSeq accession numbers from NCBI. A lot of times people come wanting to convert from an Ensembl gene identifier to a RefSeq accession number, and BioMart is certainly the place to do that. The first thing we need to do is to input our Ensembl gene identifiers. These are from zebrafish; the prefix for a zebrafish gene identifier is ENS for Ensembl and DARG for Danio rerio gene. Our queries go into something called the filter, which is a bit confusing. Click on the filter link, click on the gene region to expand that, and paste in our gene identifiers. To get out our data, we want to select various attributes. If you click on the attribute link, it takes us to a number of pages. We have six different choices of what types of attributes we want. We want to stick with features. We want to get back out the Ensembl gene and transcript identifiers of the data that we started with, so that's going to be from zebrafish. And if we scroll down the page, you get to a section called external references, which allows you to link to non-Ensembl data sources. We want to get the RefSeq mRNAs, both the NM accession numbers, which are the somewhat curated data, as well as the XM accessions, which are the mRNAs that are coming off gene prediction pipelines. Select both of those, hit the results button, and you end up with your data. So here we have our Ensembl gene identifiers. Again, we have one gene mapping to multiple Ensembl transcripts. We get the position at which each of those transcripts maps in the zebrafish genome. We get an associated gene name, and then we get links, where available, to the NCBI mRNA accession numbers. A different type of thing that you can use BioMart for is to pull out the data
that we've seen in the Ensembl genome browser, and pull that out in textual format. So say we want to pull out a list of all the human orthologs of the same set of zebrafish identifiers that we just started with. I've kept my zebrafish identifiers pasted into that filters box, and now I'm going to get a different set of attributes out. Rather than getting features, I'm going to ask for homologs. I'm going to scroll down on the page until I get to the human orthologs section, ask for the human gene identifier and protein identifier as well as a percent identity, and I get data back. So each of these Ensembl zebrafish gene transcripts is mapping to a human Ensembl gene and protein identifier, as well as a percent identity. And again, you can do this across all of the tens of organisms for which Ensembl provides genome-scale data. Like I said earlier, whenever you're looking at orthologs, you need to make sure you go back and verify that you believe at least some of these before you go too far with them. These are all computationally predicted, so while most of them are probably going to be correct, there's going to be some fraction of them that you would not agree with if you were doing a manual annotation. Nevertheless, I will say, if somebody gives me a list of 500 zebrafish gene identifiers, I'm not going to manually BLAST all of those. I would probably come here to BioMart and use BioMart to pull out the corresponding human protein sequence identifiers.
So what I'd like to end with is the Galaxy tool, the Galaxy web resource, which comes out of Penn State. I like to think of Galaxy as basically being a web-based resource that will allow a basic bench biologist, a bench scientist, to become a computational biologist without actually having to download any software or really know how to run programs on the command line. I hesitate a bit to show this to you all, because I'm afraid that if you really get your hands on Galaxy you're going to put us bioinformaticians out of business. So please use Galaxy,
but also come back to us if you have additional questions or want more in-depth help. What Galaxy lets you do is this: using your computer, or I guess even your iPad, that has a web browser, you can download data into the Galaxy portal, not onto your own computer, but download data from Santa Cruz or from other data sources, and then you can manipulate that data using Unix-like command line tools that you might not know how to use on Unix, but Galaxy gives you a web-based interface to those commands. Galaxy will also bring in third-party tools that you can use, for example, to analyze next-gen sequencing data. So let's take a look at how this would work. We go to the Galaxy web interface, again just using our web browser, and tell it that we want to get data out of Santa Cruz. This is going to bring up the same Table Browser that you would see at Santa Cruz, except this Table Browser is now in the context of the Galaxy website. What I'm going to do for this example, which is fairly simplistic, is find the lengths of all of the transcripts in the UCSC Genes track, or the known genes track. So basically I want to get a list of the lengths of all the transcripts in the Santa Cruz human gene database. And let's say I don't want the lengths of the whole transcripts; I just want the lengths of the coding part of the transcripts, so I'm going to exclude the 5 prime and 3 prime UTRs for each of those transcripts. The first thing I would do is use the Table Browser to download the text-based data into Galaxy. What you're going to end up with is this: transcripts are shown here by their transcript identifier, this confusing-looking thing in the fourth column. For each transcript I'm going to find what chromosome it's on, the start and stop position of the transcript itself, and also the start and stop position of the coding sequence, that is, the region without the untranslated regions. But these are just the positions in the genome, so that's going to include the start and the stop, including all the exons
and all the introns. We just want the coding sequence itself. Well, over here on the right is another field, which shows you the length of all of the exons for each of the transcripts. So Galaxy is going to take that information, and, as I'll show you in a bit, it's going to calculate the length of each coding exon. You might think that you could just take this data out of the Table Browser, stick it into Excel, and do the calculations yourself, because of course you can use Excel to calculate sizes. The problem is that because each transcript has a different number of exons, you're going to end up with a different number of columns for each of your transcripts. At least in my simple-minded way of using Excel, this would make things very difficult to do in Excel. So this is a great place to come to Galaxy to take care of that for you, because Galaxy has a tool called Gene BED To Exon/Intron/Codon BED. Basically, my data before was in something called a BED file, and Galaxy can extract the data from that gene BED file and figure out where all the exons and introns are. So I'm going to tell Galaxy, please extract just the coding exons from that previous data, using this data that I imported in step one, and you end up with something that looks like this. Now for each of our transcripts we're seeing each exon on a separate line. There's this first transcript that occurs on three separate lines; that's because that transcript is made up of three different exons. The next transcript just has a single exon, so it occurs on a single line. The third transcript has two exons, so it occurs on two lines. And in each case this is the start and end point of the exons, excluding any non-coding information. These are just coding exons; they're going to go on to make protein. Then there is a function in Galaxy that says compute an expression on each row of your data. So what you're going to do now is you're going to say, I want to
take the result of the number in column C3 minus the number in column C2. Column C3 is the exon end coordinate; column C2 is the exon start coordinate. I'm going to calculate that for each of my exons and display the results right here in the Galaxy browser. I've highlighted here these first three lines because they're all coming from the same transcript. So now we have the lengths of the three exons of this particular transcript, the length of the one exon of this next transcript, the lengths of the two exons of the third transcript, and so on. The next thing I want to do is sum up the lengths of those exons for each transcript. So for each transcript identifier I want to add up the exon lengths, and that's going to give me the coding length of that particular transcript. So I come to another function, which is called group data by a column and perform aggregate operation on other columns. What I'm doing now is I'm grouping by column C4, column C4 is my gene symbol, and I am performing an operation of type sum on column C7. So, sorry, grouping on column C4, which is the transcript identifier, I want to sum up column C7, which is the exon length, and that will then give me my final result. For each transcript, now I have the length of all of its exons all summed up. This is something that, if you know command line Unix and you have your data in the right format, is fairly easy to do, or alternatively, if you know Perl or Python, you can do it as well on the command line on your Mac or on your Linux box. But for those of you who aren't familiar with programming, who don't want to learn that, you can do all of this work right here inside the Galaxy browser. And there are a lot of different tools over here on the left; I've really only scratched the surface of what you can do in terms of tools. Once you have a pipeline that you like, you can actually save that as something called a workflow, and once you have a workflow saved you can then apply that
to other sources of data. Run that on a different source of data and you apply the same pipeline to get the same type of results. So now if I want to, say, get the lengths of the coding sequence of all the zebrafish genes in Ensembl, I can do the same thing. As I said, there are a lot of different tools that you can choose from over here. One powerful thing that you can do, which I'm not going to show because I think it really goes beyond standing up here and lecturing at 300 people all at the same time, is intersect tracks. One example they go through at the Galaxy browser is, say you want to pull out the 10 exons in the human genome that have the highest density of SNPs. You're dealing on the one hand with all the exons in the human genome, on the other hand with all the SNPs in the human genome, and you want to intersect that data and find the exons that have the highest density of SNPs. You can do that using the Galaxy browser by intersecting the gene data with the SNP data, and then using other computational tools to count the number of SNPs in each exon, divide that number of SNPs by the length of the exon to find the density, sort that data, and pull out the top 10. You can do all of that just in the context of this web browser without actually knowing any command line Unix skills. The other thing that you can do in Galaxy is a lot of different manipulation of next-gen sequencing data. So if you've gotten raw reads back from a ChIP-seq or an RNA-seq experiment, you can actually load those into Galaxy and perform a lot of the analysis right here in this web browser. For example, you can do basic quality control using a program called FastQC, which is what bioinformaticians often use to monitor the quality of their sequence reads. You can map those reads to the genome of your choice, be that human, mouse, zebrafish, whatever Galaxy has stored, using a number of different
programs, including Bowtie, which is a popular mapping tool. If you have ChIP-seq data, you can call peaks in that ChIP-seq data using MACS, another popular tool, as well as some other tools that Galaxy makes available. You can also perform RNA-seq analysis with TopHat, Cufflinks, and other RNA-seq analysis tools that you may have heard of, and there's a list of the different types of tools that are available through Galaxy. So again, I stress that you can do all of this without actually having to load any of these programs on your own computer. They give you the opportunity to change the parameters, so again, without understanding Unix, you have a nice web-based interface where you can change the parameters of these tools, try out different types of data manipulation, and hopefully get some very nice results. For those of you who are in institutes where you do have bioinformatic support, I certainly recommend that you use the Galaxy tool and try to do as much analysis as you can on your own. If you have opportunities to go talk to collaborators or go talk to bioinformatics core facilities, you can certainly follow up with them, as they can do perhaps more detailed analysis than you can do in Galaxy, but Galaxy certainly is a really, really nice first step in the analysis path.
I realize I've talked about a lot of different things today, some of it very new to many of you. There is online help available for all these different resources. The sites shown here and Galaxy both provide video tutorials for a lot of the things I've talked about today, so you can listen to somebody from their help desk walking through a number of these different data sources. And finally, if you would prefer to read a book rather than looking at stuff online, I recommend the Current Protocols in Bioinformatics series, which has units on the three types of data sources that I've talked about today as well as a number of other bioinformatics tools. For example, some of those next-gen analysis tools that I talked about that
are available through Galaxy have individual chapters in the Current Protocols series, and if you're here on the NIH campus you can get to the series for free by following the link down here in red.
So just a reminder: next week's talk is going to be given by Laura Elnitski of NHGRI, and she's going to be talking about regulatory and epigenetic landscapes of mammalian genomes. I'm going to go ahead and stop the formal part of my presentation now, but I certainly welcome any of you who would like to come down the stairs here and ask me questions at the podium. Thank you.
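A follow-up note on the Galaxy example from the talk, for those who do want to try the same calculation on the command line. What follows is a minimal sketch in Python, not the Galaxy tools themselves. It assumes the Table Browser data was saved as tab-separated 12-column BED, where columns 7 and 8 hold the coding start and end and columns 11 and 12 hold the exon sizes and exon starts. The function names coding_exons and coding_lengths and the two example records are made up for illustration.

```python
from collections import defaultdict


def coding_exons(bed12_line):
    """Yield (chrom, start, end, name) for each coding exon of one BED12 record."""
    f = bed12_line.rstrip("\n").split("\t")
    chrom, tx_start, name = f[0], int(f[1]), f[3]
    cds_start, cds_end = int(f[6]), int(f[7])
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    for size, rel_start in zip(sizes, starts):
        # Clip each exon to the coding region, which drops the UTRs.
        start = max(tx_start + rel_start, cds_start)
        end = min(tx_start + rel_start + size, cds_end)
        if start < end:
            yield chrom, start, end, name


def coding_lengths(lines):
    """Sum coding exon lengths (end minus start) per transcript name."""
    totals = defaultdict(int)
    for line in lines:
        for _chrom, start, end, name in coding_exons(line):
            totals[name] += end - start
    return dict(totals)


# Two made-up records: one with three exons and UTRs, one with a single exon.
example = [
    "chr1\t100\t1000\ttx1\t0\t+\t150\t950\t0\t3\t100,200,100,\t0,400,800,",
    "chr1\t2000\t2300\ttx2\t0\t-\t2050\t2250\t0\t1\t300,\t0,",
]

if __name__ == "__main__":
    for name, length in sorted(coding_lengths(example).items()):
        print(name, length)
```

The two functions mirror the three Galaxy steps: extracting the coding exons one per line, computing end minus start for each exon (column C3 minus column C2), and summing the exon lengths grouped by transcript identifier.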