This is the third lecture in our series, Current Topics in Genome Analysis. I'm Tyra Wolfsberg, one of the co-organizers of the class, along with Andy Baxevanis and Eric Green. Just a bit about my background. I have an undergrad degree in molecular biology, a Ph.D. in biochemistry, and I came to the NIH a very, excuse me, a very long time ago. I don't want to admit how long. I came to do a postdoc at NCBI in computational biology, so that's how I got into the field of bioinformatics. Since 2000, I have been at NHGRI as the Associate Director of the Bioinformatics Core, and I work with a staff of programmers and data analysts, and we provide bioinformatics analysis for members of the NHGRI intramural community. So what I'm going to be talking about today is genome-scale sequence analysis. Before I get started, I need to say that I, like everybody else, am very boring: no relevant financial relationships with commercial interests. So in short, I'm going to be discussing, for the most part, how to use various genome browsers. I'm going to spend most of my time going over the use of the UC Santa Cruz genome browser, and a good portion of time on the Ensembl genome browser, telling you not only how to browse genomes with those graphical viewers, but also how to download sequence to get it onto your own computer using either the Santa Cruz table browser or Ensembl's BioMart. I will then touch on two other genome browsers, IGV and JBrowse, talk about a new variant browser that's come out, allowing you to view 60,000 exome sequences and their variants, and finally give you a brief overview of how to do some data analysis yourself using online tools at the Galaxy server. So before I get started, let me just go over what types of data you might see integrated into a genome browser. The starting material is all going to be the same, and that is going to be the genomic sequence of a wide variety of organisms.
And then you're going to have annotations on that genomic sequence, including the positions of genes, ESTs, gene predictions, and now next-gen sequencing data, including ChIP-seq data and RNA sequencing data from a variety of sources. One thing to keep in mind when you're looking at genome browsers is the underlying genome sequence that is displayed on those browsers. Assembling a genome sequence is actually a fairly complex undertaking. Back in the Human Genome Project, you would have had BAC sequences that were sequenced with Sanger sequencing. Now newer technologies allow genomes to be made with next-gen sequences, short Illumina reads, and so on. There are complex algorithms needed to incorporate all that data into a comprehensive assembly, and assemblies can be updated as more data becomes available. So for example, even though the human genome sequence was declared finished a couple of years ago, there's still more data coming out, still more data being integrated, and periodically that human genome sequence assembly does get updated. The human, mouse, and zebrafish genomes get updated by a group called the Genome Reference Consortium. Other genomes get assembled by various sequencing groups. If you want to get a preview of work in progress, assemblies that aren't quite ready for prime time, I have two URLs here, one for the Santa Cruz and one for the Ensembl pre-genome browsers, that give you data that's not yet quite available to the public. And one big caveat is that when you're comparing data from different genome browsers, you need to make sure that you're looking at the same version of the genome assembly, because as genome assemblies change, they incorporate some new data, they delete some data, and the coordinates of that genomic sequence will change.
So if your gene used to be on chromosome 1, position 1,000, and you look at it in a new version of the genome assembly, it may have moved; it may now be at chromosome 1, position 2,000, depending on how much sequence has been incorporated into or deleted from the new assembly. This isn't so much a problem now when the human genome sequence is fairly stable, but it really was an issue in years past as the human genome was getting updated on a more frequent basis. So we're going to jump right in to the Santa Cruz browser and do a couple of different types of queries. I'm assuming that most people in the audience are familiar with the Santa Cruz genome browser and have played around with it to some extent, but I think it's still a worthwhile exercise to go through, not only to bring everybody onto the same page, but also to provide some of you with tricks that you may not have known about before. So this is the home page of the Santa Cruz genome browser. There's a lot of information here, a lot of different tools that you can use, both down the side and across the top. One nice thing is that they have a news feature telling you what new genomes they have available and what new annotations they have available, so you can find out exactly what's going on. Your basic query would normally start out by clicking this link in the upper left to the genome browser, and that takes you to a page where you would put in your query of interest. So there are a number of different genomes available at Santa Cruz. You can select from the group that you're looking for. We're going to focus on the human genome sequence, so we're going to be looking at the mammal group. Within mammals, there is a long list, which continues down past here, of all the different mammalian genome sequences that are currently available in this browser. And then as I mentioned, there are also different assembly versions.
The current human genome assembly came out in December of 2013, and that is called GRCh38: GRC for Genome Reference Consortium, h for human, 38. The older naming system is one that was based on Santa Cruz conventions, so it's also called hg38. Previous assemblies would be GRCh37 and hg19. So at least they've tried to standardize the numbering system. The GRC numbering system and the hg Santa Cruz numbering system are now at least both in sync. They're both on 38. Your query, you type into the right side of the page. I'm looking at my favorite gene, called ADAM2. When you start typing that in, you'll get a list of all the gene symbols that start with the text string ADAM2. You could also type in, for example, a GenBank accession number, a chromosome position, a SNP accession number, whatever it was that you first wanted to look at in the genome browser. You then get to a page that looks like this. So what we are looking at here is an overview of the ADAM2 gene. We are on chromosome 8 in this position here, indicated by the red bar. And this is the ADAM2 gene. These vertical tick marks indicate the positions of exons. The lines connecting them indicate the positions of introns. And if you look really closely, you may see that there are arrowheads on these introns that show you the direction that the gene is pointing, the direction in which the gene is transcribed. This particular gene starts at the right end and goes to the left. So in a traditional textbook, genes are always drawn from left to right. But in the genome, there's a 50-50 split as to whether genes go from left to right or right to left. And you need to make sure that you're looking at them in the correct orientation, or that you understand the orientation, I should say. One thing that may jump out at you if you haven't looked at genome sequences before is that exons are actually really small. The majority of the space taken up by this gene in the genome is these introns.
The introns are these long lines with the arrowheads on them. The vertical exons are actually just a very, very small proportion of the genomic sequence. The data here are organized in tracks. So we're in a track right now called, I can't quite read it, the GENCODE track. GENCODE is a consensus human gene annotation, which is being used by the ENCODE project. This GENCODE track has replaced the old UC Santa Cruz genes that used to be their default gene track. This is now the same human gene track that's being displayed by Ensembl, which is nice; at least they're now displaying the same set of genes. They weren't in the past. Another gene track here is called RefSeq genes. And then we have a number of other tracks, which all have a name and then some features on them, and we'll be talking about some of those a bit later. Figuring out where the genes are is fairly simple. You take mRNA sequences, you use an algorithm to compare them to the genome, and where those mRNAs align, that's where they show up in the genome browser, for the most part. So there are, I think, five different GENCODE transcripts in this region, and three different RefSeq genes. They just represent alternative transcripts, alternative splice forms. So if you look, you can see the second isoform right here is missing a bunch of exons, and there are other differences between the different isoforms as well, which is why you see the five different horizontal lines. If you want more information about a particular item on a track, you can just click its name. And here we're getting to more details about the GENCODE track. There's a whole list of data that Santa Cruz has brought in from a number of different sources, including some old microarray expression data. If you want more information about what's in a RefSeq track, you would click on that RefSeq sequence's name.
This gets us to a differently formatted page, which has links back to a number of different NCBI resources. One that I will highlight is down at the bottom. It's not a link to an NCBI resource, but rather a link to a way that you can download genomic sequence for this gene. You click on that link, and you get to a page which lets you get promoter sequence, exons, or introns. You can check off what you want. I want to get the promoter sequence, 1,000 nucleotides upstream of the ADAM2 gene, and that will get me the sequence right here, which I could then put into my favorite promoter analysis program if I were so inclined. So how do we see things other than the gene that we queried on? Coming back to this view, there are a number of controls across the top and down here on the bottom. Let me just highlight one down here called the reverse button. If it really bothers you that genes are shown in the opposite orientation and you want them going left to right, like in the traditional textbook, you can click on this reverse link, and that then flips the orientation of the gene around. The only way that you know this is that the gene names, instead of being on the left side like they were before, are now on the right side. And also your arrowheads should now be pointing left to right. So if this helps you, that is an option. I personally find it really confusing to have to keep track of which way I'm looking at my genes being oriented. If you want to move around, there are links up here at the top with arrows where you can move to the left and right. But an easier option, which they've recently added, is more of a Google Maps style of navigation, where you can actually just grab the screen with your mouse and drag it, just like in a Google Map, to move the display either left or right. If you want to zoom in and out, there are controls up here at the top.
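As an aside on the promoter download described above: "1,000 nucleotides upstream" means something different for a gene like ADAM2, which runs right to left, than for a gene that runs left to right. Here is a minimal sketch of that logic. It is not how the Santa Cruz download page is implemented; the function names and the toy sequence are made up for illustration, and coordinates are assumed to be 0-based and half-open.

```python
# Hypothetical helpers for illustration only; not part of any genome browser API.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(tx_start, tx_end, strand, length=1000):
    """Return (start, end) of the region upstream of a transcript.

    Coordinates are assumed 0-based, half-open, on the forward strand.
    For a "+" gene, upstream lies to the left of tx_start.
    For a "-" gene, upstream lies to the right of tx_end.
    """
    if strand == "+":
        return max(0, tx_start - length), tx_start
    if strand == "-":
        return tx_end, tx_end + length
    raise ValueError("strand must be '+' or '-'")


def promoter_sequence(chrom_seq, tx_start, tx_end, strand, length=1000):
    """Extract the upstream sequence, written in the gene's own orientation."""
    start, end = upstream_region(tx_start, tx_end, strand, length)
    seq = chrom_seq[start:end]
    # A minus-strand gene is read from the opposite strand, so flip it.
    return reverse_complement(seq) if strand == "-" else seq


# Toy 16-base "chromosome" with a gene occupying positions 4..8.
toy = "AAAACCCCGGGGTTTT"
plus_promoter = promoter_sequence(toy, 4, 8, "+", length=4)
minus_promoter = promoter_sequence(toy, 4, 8, "-", length=4)
```

In the toy example, the same gene coordinates give a different promoter depending on strand, which is why the orientation of the gene has to be known before pulling upstream sequence.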
So for example, if we want to zoom out threefold to see other genes around the ADAM2 gene, we can click on that link. And that takes us to a view that looks like this, where we have the ADAM2 gene in the middle of the page. On its left flank, we see, I believe it's the ADAM18 gene. And down here we see IDO1, its neighboring downstream gene. If you want to zoom in in a very targeted way, they provide that functionality as well. So the zoom button up here will just zoom right into the middle of the display. But say I want to explore what's going on right here at the very end of the IDO1 gene; this little exon looks a bit strange from this view. If I put my mouse in the very, very top of the Santa Cruz viewer and I drag it left or right, it will highlight this purple bar. And that gives me an option to do one of two things. I can either choose to highlight that region in the genome, which draws a bluish box around that region so I could look down and see all the different features that are aligned to that particular region. Or I can choose to zoom in, which will make this region bigger. And now I can see in great detail that first exon of the IDO1 gene. You may notice that this is not drawn as a simple box. It looks sort of like a top hat flipped over on its side. And what that is showing you is the translated and untranslated portions of that particular exon. So the tall parts of the exon are translated. The short parts are untranslated. So this part goes on and is made into protein. This part just makes up the 5' UTR of the messenger RNA. So say I want to add additional tracks to the browser. There's a whole host of different data integrated here at Santa Cruz that we can choose to add. Most of the tracks are by default shown in hide mode, which means that they don't show up on the browser, which is good, because otherwise that window would be really enormous.
If you want to change the display and add something to this browser, you can click on the pull-down menu and see it in a couple of different formats. I want to bring this All SNPs track into the browser in a format called pack. Hide, as I said, means it's completely hidden. Full means you see it in a really long format with lots of detail. Dense, squish, and pack are different forms of condensing the data. For the SNP track, pack works best. For other tracks, full actually looks best. I'm going to go ahead and hide the Common SNPs track that's on by default. This one shows you only those SNPs that have an allele frequency of greater than 1%. I instead want to see all the SNPs. And if I want to get more details about them, I can click on this All SNPs track, which takes me to a window where I can configure this track. So Santa Cruz allows you to color code the features that you show in the genome browser. In this particular case, I want to color code my SNPs by function. I'm going to make all of my SNPs black, except the synonymous SNPs, which I'm going to turn green, and the non-synonymous SNPs, which I'm going to turn red, so I can easily see what's going on in the genome browser. And when I come over here, I now see all of my SNPs displayed. Each one of these bars represents an individual SNP from NCBI's dbSNP. The synonymous ones are green. The non-synonymous ones are red. One thing I would like to mention while I'm here is that Santa Cruz is currently displaying data from dbSNP build 144. So just like genomes are assembled periodically, the SNP database is also redone periodically as new SNP data come into NCBI. A SNP doesn't show up at NCBI the day it's submitted; rather, it has to be released as part of a build. And this happens every couple of months: NCBI releases a new build of their SNP database. NCBI is currently on dbSNP build 146. So Santa Cruz is about six months behind in terms of the SNPs that they're showing on the genome browser.
So another set of data that Santa Cruz makes available is data from the ENCODE consortium. The ENCODE consortium is a group of investigators who are trying to experimentally determine all of the functional elements in the human genome. So they've done experimental RNA-seq analysis, transcription factor binding site analysis, DNase hypersensitivity, histone modification. All that experimental detail is available to the public for you to integrate along with your own analysis. And much of that is available through Santa Cruz in the form of these ENCODE tracks. The majority of the ENCODE tracks are actually displayed on the previous version of the human genome assembly, the GRCh37 assembly. You can't see a lot of them on the newer GRCh38 assembly that we were looking at earlier. I apologize for the formatting of the slides. We seem to be having some Mac-to-PC issues here. So any of these tracks listed down here in this regulation section that have this double helix logo next to them are all coming from the ENCODE consortium. If you want to get more details about what is shown, you would once again click on the name of the track. I like this track called ENCODE Regulation. This is actually a super track composed of data from a number of different sources. We have transcript levels from RNA-seq on nine different cell lines. We have a number of different histone modification tracks showing you histone acetylation and histone methylation. We have a DNase hypersensitivity track, and we also have transcription factor binding sites as determined by ChIP-seq. So again, this is all experimental data that is displayed in the context of the genome browser. If I configure these tracks as shown here, I end up with a result that looks like this. So again, I'm focused on the region between the ADAM2 gene and the IDO1 and IDO2 genes, like I was looking at before. You would expect that this region should be full of regulatory information.
The ADAM2 gene starts here and points to the left. The IDO1 gene is divergently transcribed. So you're looking at a region which is between two genes and upstream of both of them, and you would expect that there's a lot of regulatory information in there regulating both of these genes. And indeed, if we look at some of these regulatory tracks, we see there's some histone acetylation right here in this histone acetylation track. In this transcription factor binding site track, we have a lot of different transcription factors binding here. Again, these are experimentally determined transcription factor binding sites, not computationally predicted ones. We have some DNase hypersensitivity in here, and we also have some RNA expression, which is color-coded. This orange is showing expression of the IDO1 gene in this orange cell line, which is the H1 human embryonic stem cell line. If you want more information about how to use this ENCODE data which is integrated into the Santa Cruz genome browser, I highly recommend this paper that came out in PLOS Biology a couple of years ago, called a user's guide to the ENCODE elements. They go through a number of different examples showing you how you would actually look at the data and make sense of it. In this particular example, they're looking at a number of common variants which are upstream of the MYC oncogene and showing how those variants line up with a number of different regulatory tracks, including histone modifications, transcription factor binding sites, and so on. So another thing that you can do at the genome browsers is a sequence comparison search to figure out where a sequence of interest aligns in the genome. Santa Cruz uses a program called BLAT, which Andy Baxevanis talked about two years ago.
Sorry, two weeks ago; two years ago as well, but more specifically two weeks ago. BLAT allows you to quickly compare your nucleotide or protein sequence to a genome and figure out where it lies in the genome. So we're going to try an experiment here. We're going to see if BLAT is capable of finding a chicken homolog of a human protein. So we're going to go to NCBI. We are going to get the protein sequence of the human ADAM2 gene. We're going to copy that and paste it into the Santa Cruz BLAT interface, which you can find by pulling down the Tools menu and clicking on the BLAT link. We are going to BLAT against the most recent chicken genome assembly, shown up here, and we're going to get back results that look like this. So our human sequence aligns to one region in the chicken genome. It is, if you just look at this very superficially, 71.6% identical, which sounds maybe somewhat reasonable for a chicken-to-human comparison. If you were to click on the browser link in your BLAT output, you get a graphical view of your sequence alignment compared to other features in the chicken genome. So we see these three blue blocks, which represent three blocks of alignment in the chicken genome. One might imagine that, since we're BLATing a protein against a genome sequence, these three blue blocks represent the three exons of this gene. Sound reasonable? Well, when you go back and you actually look at the details of the sequence alignment, you'll see something very different. So here we have our human protein sequence. Here we have our chicken genome sequence. Only these blue letters got aligned to each other. All the black sequences are not aligned. And actually, those three blocks of homology that we were looking at, that we thought were exons, are just three very short, sort of spurious, random alignments between the human protein and the chicken genome.
I do not think these actually mean anything. And I don't think that BLAT was capable of finding an alignment between the human protein and the chicken genome. So BLAT's really fast. I forgot to point that out. I'm sorry. This comparison took all of about maybe half a second and gave you back a result. But the result does not mean anything. So just as a caveat: when you're using any of these online tools, you need to actually go back, look at your results, and make sure that things make sense and that you're getting the data that you think you're getting, and not make assumptions based on pretty pictures. So one thing that Santa Cruz does that is very helpful is to allow you to add your own custom data tracks. It's great to be able to see what other people have provided in terms of genome annotation, but say you have your own genome annotation data and you want to be able to see that in the context of other data that's available. You need to format your data correctly. There's a nice page at Santa Cruz that explains how to do this, but it's fairly straightforward. You have a chromosome, start and stop positions, a name for your sequence, and in this case a score. What we're looking at here are potential CRISPR target sequences on the zebrafish genome. So those targets are up here. They have scores. And when you view that in the context of the genome browser, you now see this new track called CRISPR-Cas9 targets, with each of those sequences shown with its score. You can see which ones are better or worse targets. And you can do that with any type of data. You can add ChIP-seq data, you can add RNA-seq data, or you can add something very simple, like just the start and stop coordinates of a CRISPR target. You can add custom tracks to the browser in a number of different ways. In the simplest way, if you don't have too much data, you can just copy and paste it into a form on the Santa Cruz genome browser. That's available for just you to see.
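To make the custom track format described above more concrete, here is a small sketch that writes out the chromosome, start, stop, name, and score columns in BED style, with a track line on top. The helper name, the track name, and the zebrafish coordinates and scores are all made up for illustration; the assumptions about BED (0-based start, end not included, score between 0 and 1000, and a `useScore=1` setting on the track line so features are shaded by score) should be checked against the Santa Cruz custom track help page before relying on them.

```python
def format_bed_track(name, description, features):
    """Build the text of a simple five-column BED custom track.

    `features` is an iterable of (chrom, start, end, feature_name, score).
    Assumes BED conventions: start is 0-based, end is exclusive,
    and score is an integer from 0 to 1000.
    """
    lines = [f'track name="{name}" description="{description}" useScore=1']
    for chrom, start, end, feature_name, score in features:
        if start < 0 or start >= end:
            raise ValueError(f"bad interval for {feature_name}: {start}-{end}")
        if not 0 <= score <= 1000:
            raise ValueError(f"score out of range for {feature_name}: {score}")
        # BED columns are whitespace-separated; tabs are the usual choice.
        lines.append(f"{chrom}\t{start}\t{end}\t{feature_name}\t{score}")
    return "\n".join(lines) + "\n"


# Made-up CRISPR target positions and scores, for illustration only.
example_targets = [
    ("chr1", 1000, 1023, "target_1", 850),
    ("chr1", 1500, 1523, "target_2", 400),
]
bed_text = format_bed_track("CRISPR_targets", "Example CRISPR-Cas9 targets",
                            example_targets)
```

The resulting text is the kind of thing you could paste into the custom track form, or save to a file and host on a website.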
If you want to make the data available to your collaborators or as a companion piece to a manuscript, you could host the data on a website and read the URL of the website into the genome browser and see it that way. If you have a lot of data or a lot of complicated data that you want to integrate, you can create something called a session or even a hub to share lots of data. And I refer you to this website at the bottom of the screen if you need more information about how to set up a custom track. The last thing I want to talk about at Santa Cruz is how to use the table browser. So the table browser is a graphical interface that allows you to download data out of the genome browser into, normally, an Excel file that you can then manipulate on your own computer. And here's a list of the different types of things that you could do with the table browser. You can retrieve DNA sequence, which is the example I'll be showing you. You can calculate intersections between tracks, meaning if you have a SNP track and you have a gene track, you can get a list of all the SNPs that are within a particular gene. You can also filter data based on particular criteria. So say, for example, you want a list of all RefSeq genes that only have a single exon. You can do that as well. It's fairly straightforward. This is a screenshot of the table browser. We choose the appropriate assembly. We choose which data tracks we're interested in looking at. We choose the output format. In this case, we want sequence. We tell it what type of sequence we want. We're trying to get the upstream sequence of each RefSeq gene, 200 nucleotides upstream. And when we click the submit button, we get back all of our thirty-some thousand RefSeq genes, showing 200 nucleotides upstream of each one. So moving now on to the Ensembl genome browser.
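As an aside on the two table browser operations just described, intersecting a SNP track with a gene track and filtering genes by exon count, here is a toy sketch of what those operations amount to once the data are on your own computer. This is not how the table browser works internally; the record layouts, names, and coordinates below are invented for illustration, with positions assumed to be 0-based and gene intervals half-open.

```python
def snps_in_gene(snps, gene):
    """Return names of SNPs that fall inside one gene interval.

    `snps` is a list of (chrom, position, name);
    `gene` is a (chrom, start, end) tuple, end not included.
    """
    gene_chrom, gene_start, gene_end = gene
    return [name for chrom, pos, name in snps
            if chrom == gene_chrom and gene_start <= pos < gene_end]


def single_exon_genes(genes):
    """Return names of genes whose annotation has exactly one exon.

    `genes` is a list of dicts with "name" and "exon_count" keys.
    """
    return [g["name"] for g in genes if g["exon_count"] == 1]


# Invented data, for illustration only.
toy_snps = [
    ("chr8", 150, "snp_a"),
    ("chr8", 950, "snp_b"),
    ("chr9", 150, "snp_c"),
]
toy_gene = ("chr8", 100, 500)
toy_genes = [
    {"name": "geneA", "exon_count": 1},
    {"name": "geneB", "exon_count": 14},
    {"name": "geneC", "exon_count": 1},
]

hits = snps_in_gene(toy_snps, toy_gene)
intronless = single_exon_genes(toy_genes)
```

For genome-scale data you would want an indexed or sorted approach rather than a linear scan, but the idea of the query is the same.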
I'm not going to spend quite as much time on this, because I think once you have a general feel for how one genome browser works, you can probably figure out the rest of them fairly easily. Instead, I want to highlight some of the things that you can do at Ensembl that are not available at Santa Cruz. So this is the Ensembl home page, linking to a number of different tools here. I'm going to look at something called the Variant Effect Predictor. This is a tool that allows you to input SNPs, and it will predict the effect that these SNPs have on the genes or the transcripts that they overlap. So I'm going to give it a list of RefSeq, sorry, of RefSNP, dbSNP accession numbers, but you could also supply it with a list of genomic coordinates and the nucleotide changes that happen at those coordinates, and it would annotate those for you as well. The results look something like this. I put in four different SNPs. It gives you an overview at the top of whether those SNPs are synonymous or non-synonymous. And then for each SNP, it gives you this long row of annotations, which continues on way off to the side of the screen. It tells you what the coordinates of the SNP would be in the genome. It tells you which Ensembl gene that SNP overlaps with, as well as which transcript. So I should backtrack and say this is the format of an Ensembl gene identifier. It's very long. The format always starts with ENS for Ensembl. For humans, you then get a G for gene and this long number. This is an Ensembl transcript: ENS for Ensembl, T for transcript, and then a long number. So these SNPs are overlapping, for the most part, the ADAM2 gene. The gene symbols are all the same, with a variety of different transcripts in that gene. And then they're also downstream of this gene, which doesn't have a gene symbol but just has a GenBank-style accession number.
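Since the Ensembl identifier format comes up repeatedly, here is a short sketch of how such an identifier could be pulled apart. It goes slightly beyond what was said above: it assumes the number is 11 digits, that non-human species carry a three- or four-letter code after ENS (for example MUS for mouse), that P and E mark proteins and exons, and that an optional version suffix such as `.4` may follow. The function name and the example identifiers are made up for illustration and do not refer to real genes.

```python
import re

# Assumed layout: ENS + optional species code + feature letter + 11 digits
# + optional ".version". Human identifiers have no species code.
_ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3,4})?([GTPE])(\d{11})(?:\.(\d+))?$")

_FEATURES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_ensembl_id(identifier):
    """Split an Ensembl-style stable ID into its parts.

    Returns a dict, or raises ValueError if the string does not match
    the assumed layout.
    """
    match = _ENSEMBL_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an Ensembl-style identifier: {identifier!r}")
    species, feature, number, version = match.groups()
    return {
        "species_prefix": species or "",   # empty string means human
        "feature": _FEATURES[feature],
        "number": int(number),
        "version": int(version) if version is not None else None,
    }


# Made-up identifiers, for illustration only.
human_gene = parse_ensembl_id("ENSG00000000001")
human_transcript = parse_ensembl_id("ENST00000000002.4")
mouse_gene = parse_ensembl_id("ENSMUSG00000000003")
```

A check like this is handy when a results table mixes gene and transcript identifiers and you need to separate them.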
Depending on where that variant falls in a particular transcript, it can either fall in an intron or it can fall within an exon and be a missense mutation, a non-synonymous SNP, and the VEP will tell you that. The VEP will also give you information about where in the gene that particular SNP falls. To get more information about the SNP, you would click on this link that you see down here. And that'll bring you to an overview page for the SNP. It gives you some pretty pictures that you can use. I'm going to look at the genomic context as well as at the flanking sequence. So this is now showing you your one particular SNP in the context of other SNPs that fall around it. So up here, we're in the ADAM2 gene. This should look sort of like the Santa Cruz genome browser. These things are positions of exons. This dotted line would be positions of introns. We have five different transcripts. Our SNP of interest is this one with the black box around it, which in these two transcripts falls near the end of an exon. In this third transcript, actually in all the rest of the transcripts, it's falling within an intron. It's color-coded yellow because it's a missense variant. You can see it's surrounded by all these blue variants, which are intronic variants, as well as some other variants that are falling in exons and are color-coded according to the table over on the left. You can get a slightly different view of this SNP if you look at the ADAM2 protein sequence, or sorry, the ADAM2 genomic sequence around this particular SNP, and you get your SNPs color-coded by whether they fall in introns or within the exon, with similar color-coding to what you see up here. So these are pretty pictures that might be useful if you ever needed to show the context of your SNP in a slide.
The most common thing that you'd be doing at Ensembl is probably looking at a region in a view that looks like this. This is very similar to what you see at Santa Cruz. The page is divided into a couple of different sections. Up here at the top, you see the context of what you're looking at; all of these different colored bars represent the positions of individual genes. This orange guy right here, whose label is down here, is the ADAM2 gene. Further down the page, we see a zoomed-in version of where that red bar is, and again, we are looking at a region within exons of three different transcripts of the ADAM2 gene, and we were looking at this SNP, which is right here. So Santa Cruz has a, sorry, Ensembl has a different way of showing you which direction the gene is pointing. If you'll remember, at Santa Cruz there are those arrowheads on the introns, which were hard to see, showing which direction the gene is pointing. At Ensembl, it's far more subtle. This blue line right here represents the genomic sequence. Anything that is above the blue line is going in the normal left-to-right orientation, from the left end of the screen to the right end of the screen. There are no genes going in the left-to-right orientation in this region. Anything below this blue line is going in the opposite direction. So if you remember from the Santa Cruz site, ADAM2 was pointing from right to left. That's why the ADAM2 gene is drawn here below the blue line. And this view is something called the Location tab. We'll see a little bit later where you can find other tabs in the Ensembl viewer. So just like at Santa Cruz, you can add different tracks to this page. In this case, you do it using this Configure this page button. That will take you to a page where you can control which tracks are shown and hidden in this particular view. I am highlighting some RNA-seq data that I want to bring in.
There's RNA-seq data available from a number of different tissues. This is not ENCODE data. This is coming from Illumina's Body Map project, and there's RNA-seq expression data from a number of different human tissues. We're going to bring blood, skeletal muscle, and testis RNA-seq data into our view. And I get something that looks like this. So I have zoomed out, I forgot to say, using this zoom control right here in the view. I have zoomed out, and I've added my RNA-seq data. I'm now seeing the full length of the ADAM2 gene, with all the different transcripts you can see here. And up here is the RNA-seq data that I have added. On the top is the testis track. So this is showing RNA-seq expression data from the testis. And you can see there are pileups of reads that correspond to the individual exons of the ADAM2 gene, showing that the gene is expressed in testis. There are no reads in either skeletal muscle or blood. This particular gene is not expressed in either of those two tissues. If you want to get more information about a feature on a track, you would click on any of these lines. You pull up a menu that looks something like this. I have clicked on the first ADAM2 transcript, and I want to link to more information about the ADAM2 gene. And I do that by following the link there. And the gene overview is just going to show me all the different transcripts that Ensembl has available for the ADAM2 gene. They are all listed here. Three of them correspond to NCBI RefSeq transcripts, and the rest of them are unique to the Ensembl annotation pipeline. In this case, their pipeline, as I said earlier, is showing the GENCODE genes. This is the human ENCODE reference gene set that is now being adopted as a human gene reference set by a number of different projects. I had alluded earlier to the fact that Ensembl organizes their view in these different tabs.
So earlier we were looking at the Location tab, which was showing us a graphical overview of this whole region. Right now we're in the Gene tab, and we can also look at the Transcript tab, the Variation tab, or, if you wanna go back to your SNP prediction results, they would be here under the VEP results. If you wanted to add your own data to Ensembl, you would click on this add your own data link, and that would allow you to add your own data in ways similar to those that I discussed at Santa Cruz, formatting your data in similar fashions. I'm not gonna show you how to do that here. Instead, we are going to click on one of the links over on the side. Say that I wanna find pre-computed orthologs of my gene of interest. I could click on the ortholog link. That brings me to a new page where I can select what group of organisms I'd like to look at orthologs from. It gives me an overview of how many orthologs Ensembl is predicting. I'm looking at the Sauropsida, the birds and reptiles, ortholog list, and that list is shown down here. So again, these are pre-computed orthologs that Ensembl has calculated by comparing the protein sequence of the gene that we're looking at, the ADAM2 gene, to protein sequences predicted from other genomes. Before you go too far with this analysis, I would advise you to look at the alignments that they're providing and make your own educated decision about whether these are reasonable orthologs, but if nothing else, it gives you a start of where to look. Going back to the gene page, if I wanna look at a different tab, I would just click on its name at the top of the screen. If I wanna look at the Transcript tab, that view is shown right here. So this is just one particular transcript. You see the exon-intron structure of this transcript, and a number of different things that you can do over on the side. I want to highlight this thing that says supporting evidence.
So what that does is show you how Ensembl came up with this particular transcript. What other data is there in publicly available databases that supports the presence of this particular transcript? And here we see a number of different accession numbers. You may recognize some from NCBI, the NCBI reference sequences that Andy Baxevanis talked about two weeks ago. These are sequences that are curated to some extent to represent good references for particular sequences. We also see a number of other GenBank-style accession numbers, and you can see that the exons in the ADAM2 gene from the Ensembl transcript, up here in gold, are supported by most but not all of the other associated data available in public databases. Another nice thing about Ensembl is that they maintain data going back many, many years, and you can see any of their pages in a previous version of the genome assembly and genome annotation where available. You do that by clicking this view in archive site link, and that will actually allow you to see what this page looked like, not only in older versions of the GRCh37 assembly, but even going back to a very old NCBI36 genome assembly, which was done in 2008. So if you ever wanted to go back and see what the data looked like in a previous genome assembly, you could certainly do that. What I forgot to say is that in addition to updating the genome assemblies periodically, Ensembl also updates their annotations on a periodic basis. You will see those annotations are dated here off to the side, and they also receive Ensembl version numbers. There's Ensembl build 78, build 79, et cetera, and the annotations may be slightly different from version to version. One would like to think that they would always get better with time, but that's not necessarily the case.
There may be times where, for your particular region of interest, the annotation has gotten worse, and you may wanna go back and look at the graphics for an older version of that annotation. So the final thing I want to do at Ensembl is to repeat the question that we asked at Santa Cruz and see whether we can find a chicken homolog of a human protein. Ensembl, rather than using the BLAT search engine for comparing proteins against genomes, uses the BLAST search that Andy Baxevanis talked about in great detail two weeks ago. So we have an interface that looks very similar to what Andy showed you at NCBI, or even the Santa Cruz interface to BLAT. You would paste in, once again, your human ADAM2 protein sequence here. I'm gonna choose the chicken genomic sequence. This is a BLAST search, so it takes a bit longer because it's gonna be more accurate and more involved than the BLAT search from Santa Cruz. You get back many, many more hits than you saw before. And if you look at the output with a more educated eye, looking at the scores and the E values, you will see that there are some fairly high-scoring, low E value sequence alignments. If you look at one of those, you will get an alignment that looks like this. And I will tell you, just from knowing a lot about this protein, this is actually a very nice alignment between the human protein sequence on top and the translated chicken genome sequence on the bottom. There are regions of identity throughout the protein sequences, which show you that these sequences are, if not orthologs, at least related to each other on some level. So my moral of this particular story is that if you're doing cross-species comparisons, you're much better off using BLAST than BLAT. It may take a bit longer, but in the end you're gonna end up with data that's actually usable.
So moving on to a tool called BioMart, which is available for a number of different databases, but I'm gonna focus on it in the context of the Ensembl data. Like the Santa Cruz Table Browser, this lets you grab data out of Ensembl and bring it onto your own computer. So in this particular case, I'm gonna feed BioMart a list of Ensembl zebrafish gene identifiers, and I wanna get back the positions of those genes and the names of the genes in some recognizable format that doesn't just have some long string of numbers, as well as the NCBI RefSeq accession numbers. I think that is all I'm trying to get. So there are three steps to doing a BioMart query. The first thing you need to do is to select a data set. That's fairly straightforward. We're looking at the Danio rerio, or zebrafish, gene data set. The next thing you wanna do is select your input, which confusingly is called a filter, over here on the left. Our filter is a list of Ensembl zebrafish gene identifiers. They start with ENS for Ensembl, DAR for Danio rerio, then G for gene, and then a long string of numbers. You paste those in, and then you want to select your attributes, which is their confusing name for the output. For the attributes, I wanna get back out the Ensembl gene identifier, because that's what I started with. Might as well get the transcript identifiers while I'm at it, plus chromosome, start, stop, and the associated gene name, which should hopefully be some easier to understand name than what I started with. I also want to get external references. I can get up to three of those. These are data that are not provided by Ensembl; these are data coming from other sources. I'm gonna get NCBI's RefSeq mRNAs, both the curated mRNAs that start with an NM accession number as well as the genome-predicted mRNAs that start with an XM accession number. I click on the results button and I get back my output.
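The three steps just described (data set, filter, attributes) map onto the XML query format that BioMart can also accept in place of the point-and-click interface. What follows is only a minimal sketch, not the query used in the lecture. The two gene identifiers are made-up placeholders, and the dataset, filter, and attribute names (`drerio_gene_ensembl`, `ensembl_gene_id`, `refseq_mrna`, and so on) are assumptions about how Ensembl names these fields, so check them against the BioMart interface before relying on them. The code only builds the query text and does not contact any server.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder identifiers, NOT genes from the lecture.
gene_ids = ["ENSDARG00000000001", "ENSDARG00000000002"]


def build_biomart_query(dataset, filter_name, filter_values, attributes):
    """Build a BioMart XML query: one data set, one filter, several attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "1",           # include a header row
        "uniqueRows": "1",       # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    # Step 1: select the data set.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: the "filter" is the input, here a comma-separated ID list.
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": ",".join(filter_values)})
    # Step 3: the "attributes" are the output columns.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


# Assumed Ensembl BioMart field names for the columns mentioned in the talk.
query_xml = build_biomart_query(
    dataset="drerio_gene_ensembl",
    filter_name="ensembl_gene_id",
    filter_values=gene_ids,
    attributes=[
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "chromosome_name",
        "start_position",
        "end_position",
        "external_gene_name",
        "refseq_mrna",            # curated mRNAs (NM_ accessions)
        "refseq_mrna_predicted",  # predicted mRNAs (XM_ accessions)
    ],
)
print(query_xml)
```

Keeping the query as text like this makes it easy to save alongside your results, so the same data set, filter, and attribute choices can be repeated later against a newer Ensembl release.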
Here I have my gene identifiers. Some of them are repeated because they correspond to more than one Ensembl transcript. You can see chromosome positions, gene symbols, as well as NCBI accession numbers. Here is a different type of thing that you can do at Ensembl, starting with the same input, the same filters. I again have my list of Ensembl zebrafish gene identifiers. This time, say I wanna get back the human orthologs and I don't wanna do any BLASTs. I just wanna get a precomputed list of human orthologs. I can, as an attribute, select human orthologs and get back, in this case, the Ensembl gene ID, protein ID, and percent identity between the zebrafish and the human proteins. I'm done talking about the large genome browsers that people are probably most familiar with, and I'm gonna touch briefly on two other genome browsers that you may not know about. The first one is called the Integrative Genomics Viewer, or IGV. This is actually a desktop application that you can very easily download yourself. It's meant for both bench biologists and bioinformatics people. It's very easy to install, I installed it myself, but it has enough bells and whistles to make bioinformaticians happy. It's a genome browser that allows you to download data from publicly available sources, but more importantly, it allows you to load your own personal data and view it by yourself in this desktop-based genome browser. You might wanna do this because it's a lot faster. You just have this nice desktop app, you load your own sequence, and you can see it right there without having to upload it to Santa Cruz or Ensembl. Another reason you might wanna do this is, say that you have patient identifiers in your data. You really do not wanna upload anything with patient identifiers to Ensembl or Santa Cruz. It's sort of private, but not really.
They will both admit that, in theory, anybody can see the data that you upload there, even if you're just sharing it with yourself. So you really don't wanna put anything at Santa Cruz or Ensembl that you would be very concerned about other people seeing. But with IGV, because the application lives on your desktop, you can do whatever you want. So just as a quick example, this is what you get back. I actually didn't have any of my own human genome data to display here, so I loaded data from the Body Map project. That was the RNA-seq data set that we were looking at earlier with Ensembl. And I have here RNA-seq data from heart and liver. We're focusing in on the SLC25A3 gene. We can see the positions of our individual exons. The arrows are showing you the direction of transcription of that gene. And any place that you're seeing gray is a place where the RNA-seq data is aligning with individual exons. So you can see beautiful evidence for transcription of this gene in both the heart and the liver. If you focus in on this region right here, however, you will see a difference in exon usage. In the heart, you have this left exon being expressed, whereas in the liver, you have this right exon being expressed. These are very clear expression differences, which sometimes it's nice to actually be able to see graphically. Another nice feature of IGV, if you have RNA-seq data, or more importantly, if you have exome data, is the ability to zoom in and see the sequences of the individual reads that are aligned to the genome. So just showing you an example from here, I've zoomed in to one of my exons. My reference genome sequence is down here. Any place that the reads are colored gray indicates an alignment with the reference sequence, but you can see color-coded nucleotides where that alignment does not hold up.
So this is looking at RNA-seq data, but this would be especially useful if you were looking at exome data and wanted to see variants in your individual sequence reads compared with each other and compared with the reference. JBrowse is a different type of genome browser. Its claim to fame is that it allows you to upload your own genome sequence. We started using this in NHGRI's Bioinformatics Core about eight years ago. We were working on a project with Andy Baxevanis, and he was sequencing the genome of a novel organism that hadn't been sequenced before, the ctenophore Mnemiopsis leidyi. As we were generating genome sequence data and annotation information, we wanted to see that in a graphical viewer, but we couldn't just send it to NCBI, Ensembl, or Santa Cruz, because Ensembl and Santa Cruz didn't have the genome available. So we needed a browser where we could import our own genome sequence as well as our own annotations, and for that, we use JBrowse. I will say JBrowse is a web-based application. If you want to use this, you need a computer that can actually host web-based applications, and you will very likely need a friendly system administrator who can help you get this browser set up. And especially if you want to be able to share that data with outside colleagues, you need it to be put onto a computer that is available to the outside world. Here at NIH, that's a bit of a challenge, and you definitely would need a system administrator to help you with that. But once you've gotten through those hurdles, you end up with a view that looks something like this. Our Mnemiopsis genome actually existed in 5,100 contigs, so we had 5,100 separate views of the genome, each one looking something like this. Our contig is in here. We are zoomed in on one particular region, and these are some of the annotations that we brought in. In yellow, you can see the positions of RNA-seq data.
We can see the alignments of the RNA-seq data predicting exons in this uncharacterized genome. We have run an ab initio gene prediction program, shown here in green, again predicting genes in this genome. And in purple are the results of running a domain annotation program called Pfam, which you will be hearing about next week, showing you predicted protein domains on this genome, and the exons of those predicted proteins aligned very well with the other data that we had in here. So we've been using this JBrowse tool to visually analyze and assess the quality of our genome assembly and gene annotation. Another type of browser that you might be interested in is a browser that lets you look at variant data. Obviously the big thing in bioinformatics these days is to do exome sequencing of cases and controls to identify disease-causing variants. So you do an exome sequencing project and you identify variants which are different between your cases and controls. How do you know if those variants might actually be disease-causing? This is a lecture unto itself, but one of the very basic things you might wanna do is to see whether that variant exists in other populations. If you suspect that you have a disease-causing variant, you're probably not gonna see that variant at high frequencies in populations of otherwise normal individuals. The ExAC consortium is a large group of people who have gathered exome data from 60,000 individuals and are displaying it all in one place, so you can very easily come and see variant data collected from 60,000 individuals. I am going to focus on one particular gene here, and I apologize, I have to cheat because I can never really remember the name of the disease we're looking at. Okay, so this is a gene called KMT2A. This gene is associated with a syndrome called Wiedemann-Steiner syndrome, which is caused by a loss of function mutation in the KMT2A gene.
So a loss of function mutation would mean that you completely knock out one copy of this gene. You have one functional copy and one non-functional copy. Wiedemann-Steiner is first noticeable clinically in pediatric patients. One thing I forgot to tell you about the ExAC data is that while they have both healthy people and some people with disease, they do not have any patients in here with pediatric disease. So because Wiedemann-Steiner is a pediatric onset disease, you would not expect to see any individuals in ExAC who have the mutation that causes this particular disease. What we're looking at here is an overview of variants in this gene. Here is the gene itself. It has lots and lots of exons, shown down here in light blue. All of these individual green and orange blips represent the positions of variants. And this histogram up here shows us the exome sequencing coverage of this entire gene. The newest version of the ExAC browser shows you some statistics about the mutations that they have found. If you look up in here, in terms of synonymous mutations, the number of observed mutations and the number of expected mutations are fairly similar. In terms of missense mutations, you would expect to see, based on statistics alone and the amount of data here, about 1,200 variants, and you're only seeing 764. So missense mutations in this gene are evidently selected against. There are not as many as you would expect. Loss of function mutations, such as nonsense mutations or splice donor or acceptor mutations, are especially highly selected against. You only see four loss of function mutations, where you would expect to see 115. Again, this is a pediatric onset disease, and we're looking at individual exomes from people who did not have pediatric disease. So you would not expect to see any of those loss of function mutations.
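To make the observed-versus-expected comparison concrete, here is a small sketch using only the counts quoted above (764 of about 1,200 missense variants, 4 of 115 loss of function variants). The simple ratio is an illustrative simplification chosen for this example; it is not claimed to be the statistic the ExAC browser itself reports.

```python
# Counts quoted in the lecture for KMT2A: (observed, expected).
# The expected missense count is approximate ("about 1,200").
constraint_counts = {
    "missense": (764, 1200),
    "loss of function": (4, 115),
}


def observed_to_expected(observed, expected):
    """Fraction of the expected variants that were actually observed."""
    return observed / expected


ratios = {
    kind: observed_to_expected(obs, exp)
    for kind, (obs, exp) in constraint_counts.items()
}

for kind, ratio in ratios.items():
    obs, exp = constraint_counts[kind]
    print(f"{kind}: {obs} observed / {exp} expected = {ratio:.3f}")
```

A ratio near 1 means variants turn up about as often as chance predicts. Roughly 0.64 for missense and roughly 0.035 for loss of function show how much more strongly the loss of function class is depleted in this gene.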
If we scroll down the page, we get a list of all the different variants that have been cataloged in these 60,000 individuals. I have sorted them by annotation up here, and in red are the four loss of function mutations that you do see here. Whether those are people with the syndrome who were previously uncharacterized, or whether these are sequencing artifacts, remains to be discovered. But for each variant, you can see what the variant was, what the consequence of that variant is in terms of protein sequence, as well as what the frequency of that variant is in the ExAC population of 60,000 individuals. If you click on an individual variant, you get more information, the population distribution up here, for example, as well as a graphic that even shows you the individual sequencing reads that went into this particular variant, with the red line showing the position of the mutation compared to the reference genome. So the final thing I'd like to talk about is Galaxy. This is now completely switching gears. This is a product put out by Penn State, and what Galaxy allows you to do is some of your own genome-scale analysis. You've sat here, you've listened to me talk for an hour about all these really cool analyzed data sets that you can get through Ensembl, through Santa Cruz, even through the IGV browser. What if you have some data and you wanna do an analysis on your own? You're not a bioinformatics person, and you don't have the benefit of having a programmer that you can work with. So you could come to Galaxy. There are a number of different tools integrated here.
I've just given you an overview down the side and highlighted a screenshot of some of their different categories of tools, but you can do a lot of next-gen sequence quality control and trimming, you can map sequences, you can call peaks, for example, you can do ChIP-seq analysis, and you can even do some steps of RNA-seq analysis. As pieces of publicly available software become compatible, or as the Galaxy team finds publicly available software that could be incorporated into this web-based format, they will integrate it into the Galaxy server. The URL that I've shown here is the main Galaxy server that anybody can get an account on. I was able to get 250 gigs of storage, which honestly may not be enough for a really large-scale project. You can download Galaxy and run it on your own computer. There are instances of Galaxy in the cloud that you can pay for if you have a lot of data that you want to integrate, and there are some other public Galaxy servers as well. Unfortunately, last summer NIH shut down their Galaxy server, so you can no longer use the CIT services to view your data in Galaxy. I obviously don't have time in the context of this lecture to go over how to do any sort of complex Galaxy manipulation. I instead just want to concentrate on a very simplistic example showing you how you could get the coding sequence length of each NCBI reference sequence. So I don't want the length of the entire mRNA sequence. I just want the length of the coding sequence. I realize this is probably not overly relevant, but it is something that I can cover in five minutes and not lose everybody's attention, which I would lose, I'm sure, if I went through some of these other tools. So you go to the Galaxy instance. All the tools are down the side. The first thing we need to do is to get data into our view. We're gonna do that by using this UCSC Main link over on the left.
That actually takes us directly to the UCSC Table Browser that we saw earlier. Just like before, I'm gonna get the RefSeq track, although instead of getting sequences, I'm gonna get something called BED format, which is basically a tab-delimited format showing you the position in the genome of each RefSeq sequence and its exons. And by default, because I came from Galaxy, it's going to send my output to Galaxy rather than sending it to my personal computer. So you end up, once this is loaded, with something that looks like this. In the Galaxy browser, we now have our first data set, indicated here as number one. This is the position of each of our RefSeq transcripts, showing you the start and stop of the transcript itself. This would be the start and stop of the coding sequence. The part that's translated into protein should be a bit shorter than the transcript sequence. And over to the right, where you can't see it, are the positions of the individual exons. Now I wanna get the lengths of the coding sequence. I don't wanna get the whole genomic sequence including exons and introns. I need to get the lengths of just the exon sequence that is coding. So the first thing I'm gonna do is use a tool that is in the Operate on Genomic Intervals section of the Galaxy server. It's called a gene BED to exon/intron manipulation. And what I'm gonna tell it here is that I want to extract the coding exons from that BED file, from that text file. I'm just going to extract the coding exons. I'm not gonna output it anywhere; that's the next step. And once I do that, I end up with this output shown down here. So for an individual transcript, if you look at things that have a single shared transcript ID, I have output each of the exons from start to stop. This is slightly more complicated than just getting the exons themselves, because I want the coding exons.
So if a coding sequence starts in the middle of an exon, like we saw way back when I first started talking about Santa Cruz, it's only gonna get the coding portion of the exon, not the entire exon including the untranslated sequence. And I will say this would be a difficult exercise to do in Excel if you wanted to download the data yourself, just because of the way the exons are formatted in a BED file. Take my word for it, Excel would make this rather complicated to do. It's nice to have a built-in program that can do calculations on this BED file for you. Then the next step will be to group these coding exons together. No, I'm sorry, I got ahead of myself. The next step is to figure out the length of each one of these coding exons. I'm going to do that by going to Text Manipulation and saying Compute. I'm gonna subtract column c2 from c3. So I'm gonna take the stop minus the start coordinate to get the length of each coding exon. And I'm going to do that subtraction. I'm sorry, it's hard to see sideways here. What I end up with is something that looks like this. I have the coding sequence exons that I had before, with the length of each of them added as a column off to the right. So now the next exercise is fairly simple. For each transcript, and I've highlighted one transcript here in blue, I just need to sum up the lengths of all these exons and I will get the length of the coding sequence. That is one more manipulation, called group by a column and perform an operation on that column. So I'm going to group by column four, which is my accession number, and I am going to do a sum on column seven. Sorry, wrong direction. Column seven holds my exon lengths. I'm going to add all of those up. And I'm going to get back results that look like this. So now, for each of my RefSeq transcripts, I have the length of its coding sequence.
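For readers who would like to see the same three steps outside of Galaxy (extract the coding exons, compute stop minus start, then group by accession and sum), here is a minimal sketch in Python. It assumes the standard 12-column BED layout, where columns 7 and 8 hold the coding start and stop and the last two columns hold exon sizes and offsets. The two records and their `NM_HYPOTHETICAL` names are invented for illustration and are not real RefSeq entries. This is not the code Galaxy itself runs.

```python
from collections import defaultdict

# Two made-up BED12 records (tab-delimited), for illustration only.
BED_LINES = [
    "chr1\t1000\t2000\tNM_HYPOTHETICAL1\t0\t+\t1100\t1900\t0\t3\t200,300,250,\t0,400,750,",
    "chr2\t500\t900\tNM_HYPOTHETICAL2\t0\t-\t550\t850\t0\t2\t150,100,\t0,300,",
]


def coding_exons(bed_line):
    """Yield (chrom, start, end, name) for the coding part of each exon."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, tx_start, name = fields[0], int(fields[1]), fields[3]
    cds_start, cds_end = int(fields[6]), int(fields[7])
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    for size, offset in zip(sizes, offsets):
        exon_start = tx_start + offset
        exon_end = exon_start + size
        # Clip the exon to the coding region; this drops UTR sequence.
        start = max(exon_start, cds_start)
        end = min(exon_end, cds_end)
        if start < end:
            yield chrom, start, end, name


def cds_lengths(bed_lines):
    """Sum coding-exon lengths (stop minus start) per transcript name."""
    totals = defaultdict(int)
    for line in bed_lines:
        for _chrom, start, end, name in coding_exons(line):
            totals[name] += end - start  # the "c3 - c2" compute step
    return dict(totals)


for name, length in cds_lengths(BED_LINES).items():
    print(name, length)
```

The clipping with `max` and `min` is the part that handles a coding sequence starting in the middle of an exon. One caveat of grouping by name alone, as in the Galaxy example: if the same accession aligns to more than one place in the genome, its lengths would be added together, so a real analysis might group by accession and chromosome position instead.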
So again, not a really dramatic example, but just showing you that you can do calculations within this web-based application that would be a bit difficult for you to do on your own computer, and you didn't need to get a bioinformatician; you were able to do this yourself. So just imagine what you could do if you were to actually look at some of the more complicated tools for RNA-seq or ChIP-seq analysis, and you would have the power to do that by yourself. All the tools I've talked about have online help documentation. If you want more structured documentation, I refer you to Current Protocols in Bioinformatics, which is available from the NIH Library at the URL shown here. There are chapters on the Santa Cruz and Ensembl genome browsers as well as on the Galaxy server. And that's all I have to say. I just have a couple of programming notes. I noticed a lot of you taking notes, scribbling down URLs as I'm going along. Just to reiterate, the PDFs are all available on the course website at genome.gov/ctga2016. I'm hoping that you're all getting the mailings that Andy Baxevanis is sending out each week telling you what classes are being held; those also have the URL of the class. If you don't have those, you can certainly come by afterwards and I will give you the URL directly. Also, for those of you who are signing up for CME credit, there are signup sheets in the back of the room available each day. I've already printed out the list of people who have come to previous sessions of the class. If you're showing up for a second time, you just need to sign next to your name. You don't need to actually enter your information again. Thank you for your attention, and we'll see you next week when Andy comes back to talk about sequence analysis. Thank you.