Welcome back. So this afternoon's session, the main focus is going to be functional annotation and analysis of transcripts. The lecture portion here is actually going to be kind of on the short side; we're going to have more time for the lab. The lab is a little bit more extensive, there's a lot of information to cover there, and hopefully it'll bring together parts of what we've learned over the last couple of days in addition to some new material.

So for this last module, a couple of learning objectives: we're going to explore the different methods that are available to glean biological function from transcript sequences, and we want to be able to differentiate between homology-based and sequence-composition-based functional inference. Those are the two major ways of going about predicting functions for your transcripts, or any proteins that they might encode.

When we do a transcript assembly, we end up with a FASTA file containing our transcript sequences, essentially strings of nucleotides. We need to figure out: why is this transcript important? Functionally, what does it represent? What is it doing? Can we gather hints of biological function from this raw sequence data?

There are a couple of ways of going about doing that. The most straightforward way is to look for sequence homology. We take that transcript sequence and we can search it against a database of all known biological sequences and ask: are there other sequences that are similar to this, and are they similar enough that we might infer they're related in some way? And if the protein it matches has a known function, perhaps this transcript that we're looking at encodes a protein that has a similar function.

Alternatively, we might look at the sequence composition itself. Are there signals within the sequence that might suggest a certain type of function? There are neural networks, hidden Markov models, and various other types of machine learning methods we can use, such as pattern recognition algorithms, for finding sequence features that are conserved and that could be indicative of a function.

Starting with database searches: the most common way of doing a database search is to use the BLAST program. If we're starting with a transcript sequence, we can use the BLASTX utility to effectively search a protein database with a six-frame translation of that transcript sequence. If there's a protein encoded within our transcript, we'll detect homology that way. Or, if we already have the protein sequence, we can perform BLASTP instead of BLASTX to do a more direct search. We're sticking with the nucleotide BLAST for now, and we'll search our protein database. There's a nice collection of protein sequences available in SwissProt. SwissProt is part of the UniProt knowledgebase, which also includes TrEMBL, essentially a non-redundant database of all known sequences, translated; it has 114 million protein sequences as of fairly recently (I updated this within the last week, I believe). SwissProt is a manually reviewed subset of this, with about 557,000 sequences as of May 2018. SwissProt is really one of the best resources for doing this kind of search, because there's a lot of information tied to each one of these SwissProt records, as they've been manually curated.
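Just to sketch what this step often looks like in practice, here's a minimal Python example that keeps the single best SwissProt hit per transcript from a BLAST+ tabular report. The file name is hypothetical, and this is an illustration rather than the exact workshop commands:

```python
# Parse a BLASTX tabular report (-outfmt 6) and keep the single best
# SwissProt hit per transcript. A typical search producing it:
#   blastx -query transcripts.fasta -db swissprot \
#          -evalue 1e-5 -outfmt 6 -out blastx.outfmt6
import csv

# Standard -outfmt 6 column order
COLS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
        "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

best_hit = {}  # transcript id -> row for its highest-scoring hit
with open("blastx.outfmt6") as fh:
    for row in csv.DictReader(fh, fieldnames=COLS, delimiter="\t"):
        row["bitscore"] = float(row["bitscore"])
        q = row["qseqid"]
        if q not in best_hit or row["bitscore"] > best_hit[q]["bitscore"]:
            best_hit[q] = row

for q, row in best_hit.items():
    print(q, row["sseqid"], row["evalue"], row["pident"], sep="\t")
```

In a real run you'd also pick an E-value threshold appropriate for your data; the 1e-5 shown in the comment is just a common starting point.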
Just to give you an example, here's one record that we find in the SwissProt database. It tells us information about the known function, and it gives us Gene Ontology terms. Is everyone familiar with the Gene Ontology? It's a structured vocabulary for describing molecular functions, broken down into a few different sections: molecular function, biological process, and cellular component. It's structured as a graph, basically a directed graph. We talked about directed graphs earlier this morning; instead of a directed graph of k-mers or sequence data, this is a directed graph of terms. Each node is a description of a molecular function, there are parent terms and child terms, and the edges are of different types, such as "is a" and "part of". Our protein of interest here has fructosamine-3-kinase activity; it's also a kinase, and it's part of phosphorylation. Basically, all of these parent terms apply to this protein, and the terms get broader and broader as you go up the hierarchy.

One of the things that's useful about the Gene Ontology is that once you have term assignments for different gene products, you can look for enrichment. If you're doing something like expression analysis and you have a set of transcripts that are found to be differentially expressed, you might be interested in whether certain pathways or certain functions are enriched among that set; that gives you more information about the biology. You basically run a statistical enrichment test on the GO terms.

Typically what that looks like is a contingency table, with differentially-expressed versus not on one axis and has-the-GO-term versus not on the other, and you partition your transcripts into those cells. In this case there are 50 transcripts that are both differentially expressed and carry the term, which indicates enrichment. It's very similar to the classical problem where you have a jar filled with two kinds of marbles, green marbles and red marbles, and each marble is either drawn or not drawn. In the context of differential expression, a drawn marble is a transcript that we find to be differentially expressed, and instead of being green or red, it either has a given GO term associated with it or it doesn't. You can run those counts through the hypergeometric formula to calculate the probability of seeing such an event, which gives you a probability for whether you're enriched for certain types of functions. There are tools such as GOseq that include all the statistical analysis, so you can combine your differential expression analysis with GO category assignments and look for terms that are statistically enriched; that can be indicative of the underlying biology.
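To make the marble-drawing test concrete, here's a minimal sketch using SciPy's hypergeometric distribution. The counts are invented for illustration, not taken from a real dataset:

```python
# The jar-of-marbles enrichment test, with made-up counts.
from scipy.stats import hypergeom

M = 10000  # all annotated transcripts (marbles in the jar)
n = 200    # transcripts carrying this GO term (green marbles)
N = 500    # differentially expressed transcripts (marbles drawn)
k = 50     # DE transcripts that carry the term (green marbles drawn)

# P(X >= k): probability of drawing at least k term-carrying transcripts
# if DE status were independent of the GO term.
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p_value:.3e}")
```

In a real analysis you'd run this for many terms and correct for multiple testing; GOseq also corrects for the transcript length bias inherent in RNA-Seq counts.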
In the case of SwissProt, if you have a match to a SwissProt record, and that record indicates certain Gene Ontology terms, you can apply those GO terms transitively to your own transcript or protein. It's a nice, easy way to capture GO terms for your transcripts if you have homology.

If you don't have significant sequence homology from your BLAST searches, what else can you do? We can ask questions like: is there an open reading frame, a potential coding region? We just scan through the sequence and see whether we find start codons and stop codons, with a nice long open reading frame that's uninterrupted. For example, here we find an ATG, and we can keep going until we eventually find a stop codon that's in frame. Does this look like a reasonable coding region?

You can use a tool like NCBI's ORFfinder, where you input your transcript sequence and it looks for all the different possible ORFs. Just as an example: put in a long transcript sequence, hit submit, and it returns a whole bunch of different open reading frames that it finds within that sequence. Usually, a nice long open reading frame is unlikely to happen by chance, but this is very much tied to the nucleotide composition: if the sequence is GC-rich, you get a paucity of stop codons, and you can have really long open reading frames purely by chance. So whether you find long open reading frames purely by chance or not is tied to the GC content.

There are tools that can take the sequence composition and determine whether it really has codon statistics similar to other coding sequences. TransDecoder is a tool that we developed for this: it finds all the open reading frames within your transcripts, then applies a Markov model to score each of the ORFs and determine whether it looks like an actual coding sequence or not.

You can also recognize functional domains in coding regions by doing Pfam searches. Pfam is a database of profile hidden Markov models built from conserved regions of domains, and you can search your open reading frame against that collection of Pfam profiles. You might see evidence for a domain that has something to do with calcium binding or DNA binding, or some specific function like a glycosylase or a methylase. This is one of the best tools available for characterizing the functions of your sequences. Take your open reading frame; in this case there's a web-based interface, so you can just drop your sequence into the text box, press go, and it returns all the domains it finds from searching this database of profile HMMs. Individual domains found within the sequence can be interpreted functionally, or you can consider the whole combination of domains: in this case you get a bunch of different domains, and based on that combination you might be able to figure out what the protein's function might be.

There are other types of tools, like TMHMM, which is used for finding transmembrane domains. You just input your protein sequence and it tries to find hydrophobic regions in your protein that might cross a lipid bilayer, possibly multiple times. You might have a case with just one membrane-spanning domain, or one where it goes back and forth a bunch of times. Most of these tools have websites where you can input a protein sequence, but when you're doing this for many thousands of sequences you're not going to do it that way; there's a command-line version that will run lots of sequences through and capture the information. With TMHMM, we run the sequence through and get a nice picture showing the individual regions within the sequence that are predicted to be transmembrane domains. If we know nothing else about this protein sequence, we at least know it has an architecture that's consistent with being a transmembrane protein. It's better than nothing.
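Going back to the ORF-scanning idea for a moment, here's a naive Python sketch of a six-frame ORF scan, plus the back-of-the-envelope chance calculation. The 100-codon cutoff is arbitrary, and real tools like TransDecoder go further by scoring each ORF with a Markov model; this is just to illustrate the mechanics:

```python
# A naive six-frame ORF scan: find ATG...in-frame-stop stretches on both
# strands. Coordinates are reported on the strand being scanned.
STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=100):  # 100 codons is an arbitrary cutoff
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None

# Why GC content matters: the three stop codons are AT-rich, so in GC-rich
# sequence the per-codon stop probability drops below 3/64 and long ORFs
# appear by chance more often. P(no stop in n random codons):
def prob_orf_by_chance(n_codons, p_stop=3 / 64):
    return (1 - p_stop) ** n_codons

print(round(prob_orf_by_chance(100), 4))  # ~0.0082 at uniform composition
```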
Does the protein have a secretion signal? Does it look like a secreted protein? If we know nothing else, maybe we can at least detect that it has a secretion signal at its N-terminus. There are various tools to do this, because there's a recognizable sequence context for secretion signals at the N-terminus of a protein: a hydrophilic region with positive charge at the very N-terminus, a hydrophobic center, and a hydrophilic, polar sequence at the end. These are really short sequences, 30 to 70 amino acids at the very N-terminus.

Do you need a secretion signal for transmembrane proteins? No, this is different. For transmembrane proteins, you run TMHMM and look for that kind of signal; this is just the secretion signal. That said, some of these might actually be transmembrane anchors, and it's hard to tell the difference between a transmembrane anchor and a secretion signal. A lot of the time, signal peptide predictors will predict signal peptides in cases where the protein is actually anchored in the membrane, not secreted. The latest version of the software, SignalP, is much better about that; it takes transmembrane domains into account as well as the secretion signal, so it does a better job of predicting just the secreted proteins.

Again, you just pop in your sequence. This is based on a neural network, I believe, and it's trained separately for eukaryotes versus gram-negative or gram-positive bacteria, so you have to indicate what organism you're working with. It gives you a few different scores: a signal peptide score, a cleavage site score (the signal peptide is cleaved off as part of the secretion machinery), and a combined score, and you basically end up with a list of your predicted secreted proteins. Again, it's one of those things where, if you have no sequence homology and you don't have a whole lot of other information to go by, no domains showing up, knowing it might be a secreted peptide could be useful information, particularly if the transcript for this peptide turns out to be really highly differentially expressed. Say it's one of your best candidates from your differential expression screen: you know nothing else about it, but it's got a nice open reading frame and a strong secretion signal at the N-terminus. That's good, useful information to have.

So we have a tool that we developed to try to organize all this information, the homology information and all the prediction results you get from running tools for secretion signals or transmembrane domains, and it's called Trinotate. It also includes our TransDecoder software for predicting coding regions within transcripts, and it integrates the Gene Ontology as well. Essentially, the way this works (I don't think I have a slide in here that walks through all the details) is this: say you have your transcript set. We'll run it through TransDecoder to predict all the coding regions within the transcripts. We'll search the transcript sequences directly against SwissProt to identify homologs, and we'll also search the predicted protein sequences against SwissProt using BLASTP. We'll take the predicted proteins and run them through TMHMM to get transmembrane domains, and SignalP for secretion signals. We then organize everything into a relational database that's structured to capture all the information from SwissProt records. All of SwissProt has been previously parsed and stored into this relational database, so if we have a good protein match, or even a transcript-level match, to a SwissProt entry, we can capture all of that entry's attributes, like Gene Ontology assignments. If that entry has a molecular function or a biological process assigned, we can capture it and apply those terms to our sequence of interest.
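Conceptually, that transfer step is tiny. Here's a sketch in Python, where both mappings are hypothetical stand-ins for a parsed BLAST report and parsed SwissProt records (the accession and transcript id are made up; GO:0016301 is kinase activity and GO:0016310 is phosphorylation):

```python
# Transfer GO terms from each transcript's best SwissProt hit.
from collections import defaultdict

# transcript -> best SwissProt hit, e.g. parsed from a BLASTX report
best_hit = {"TRINITY_DN1000_c0_g1_i1": "sp|Q00001|EXMPL_HUMAN"}

# SwissProt accession -> GO terms, parsed from the SwissProt records
sprot_go = {"Q00001": {"GO:0016301", "GO:0016310"}}

transcript_go = defaultdict(set)
for transcript, subject in best_hit.items():
    # SwissProt BLAST subject ids often look like sp|ACCESSION|NAME
    acc = subject.split("|")[1] if "|" in subject else subject
    transcript_go[transcript] |= sprot_go.get(acc, set())

for t, terms in sorted(transcript_go.items()):
    print(t, ",".join(sorted(terms)), sep="\t")
```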
We can sort of tie this all together with the expression information, too. If we have differential expression results, we can tie those together with our Gene Ontology assignments and use GOseq for the statistical enrichment analysis. So all this is tied together and wrapped up in this Trinotate software. And of course there's no substitute for experimentally validating protein functions; that goes without saying, maybe.

The practical that we're going to do for the workshop revolves around our Trinity protocol. The way this works is we start with a bunch of samples. These could be different experimental conditions, or they could be different tissue samples, and for each one of your conditions you're going to want to have multiple replicates. You probably talked about this yesterday: for differential expression, you really need biological replicates. So within each sample you've got at least a few biological replicates.

What we're going to do is a Trinity genome-free de novo assembly, and this requires that we combine all the reads into just one big sample; that's what we generate our Trinity assembly from. There is a normalization process that's baked into Trinity right now. If you end up starting out with, say, a billion reads, you really don't need a billion reads to generate the transcript assembly. Because of the dynamic range of gene expression, you get lowly expressed transcripts and highly expressed transcripts, and for the highly expressed transcripts you really just need a moderate number of reads to reconstruct the transcript. You don't need a million-x coverage; 100x or 50x coverage might be just fine. So we have a process in here that looks at the amount of read support for each read, using k-mer counts, and it basically tosses the reads into two bins. One bin is the bin we're going to assemble; the other bin is the excess reads that we really don't need for assembly, because we already have enough for the highly expressed transcripts.

Just to give you an example of what you sometimes see here: with our salamander transcriptome assembly that we published last year, we started off with about 1.5 billion reads, but after normalization we ended up with less than 100 million reads. That's a huge reduction; less than 10% of the reads were needed for assembly. There's just a lot of excess. We still use all the reads when we're doing quantification, but for the assembly process we don't need all of that excess. So we put the normalized reads through Trinity for our assembly, and once we have our assembly, we measure expression levels for each one of our samples, and more specifically for each replicate within each of our samples.

Did you already say what it is normalized to when you combine the reads?

The normalization is based on k-mer composition. What it does is first take all the reads and count up all the k-mers; once it has this database of k-mers, it knows, for every k-mer, how many times it has been seen across all the reads. It then goes through the reads again and, based on the k-mer composition of each read, can stochastically either keep the read or throw it out, given that we want, say, 50x coverage instead of what could be 10,000x coverage. For each read, you look at how often each of its k-mers is found and take the median k-mer abundance; you can use that as a proxy for how many times you've seen that kind of read within your dataset. If you only want 50x, you basically sample the reads so that on average you get 50x coverage instead of 10,000x.

So it's going to compress the data?

Yeah, it's like data compression, sure. And it's different from just doing sub-sampling. If you start out with a billion reads and you just randomly take 100 million reads, that's very different from doing normalization, because with normalization you're only removing the excess reads from the abundant transcripts. If you're downsampling, everything gets downsampled: the lowly expressed transcripts get downsampled at the same rate as the highly expressed ones. So it's a different process.
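Here's a conceptual sketch of that two-pass scheme in Python. This shows the idea, not Trinity's actual implementation; the k-mer size, target coverage, and in-memory Counter are simplifications:

```python
# Two-pass in silico normalization sketch: count k-mers, then keep each
# read with probability target / (median k-mer coverage of the read).
import random
from collections import Counter

K = 25        # k-mer size (illustrative)
TARGET = 50   # desired maximum coverage (illustrative)

def kmers(read, k=K):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def count_kmers(reads):
    counts = Counter()
    for r in reads:             # pass 1: tally every k-mer in every read
        counts.update(kmers(r))
    return counts

def keep_read(read, counts, target=TARGET):
    # Median k-mer count is a proxy for how well-covered this read is.
    covs = sorted(counts[km] for km in kmers(read))  # assumes len(read) > K
    median = covs[len(covs) // 2]
    if median <= target:
        return True                           # rare read: always keep
    return random.random() < target / median  # abundant: keep a fraction
```

Note how a read from a lowly expressed transcript (median coverage below the target) is always kept, while reads from abundant transcripts are kept at a rate that brings them down to roughly the target coverage; that's exactly the difference from uniform sub-sampling.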
Okay, so for abundance estimation we're going to use a tool called Salmon. We're basically going to quantitate each replicate for each sample, and that's going to give us an expression matrix: all of our transcripts as rows and all of our sample replicates as columns. Once we have that, we can run a differential expression analysis. For that we'll use Bioconductor tools. Yesterday you used Ballgown as part of the Tuxedo package; here we'll use either edgeR or DESeq2. The Bioconductor packages are all very flexible, and edgeR and DESeq2 are two of the most popular tools for doing differential expression. And once we have that, we can cluster transcripts, and I think we'll also do a Gene Ontology enrichment analysis.

Then we're going to do the functional annotation. We're going to run TransDecoder to extract coding regions, we're going to do BLAST searches, and we're going to run TMHMM and SignalP. We'll take all those results, including all of our expression analysis and our differential expression results, and put them into our Trinotate relational database, and that's going to generate an annotation report. Then the fun part: this actually comes with a little web-based navigation tool, so we can fire up TrinotateWeb, which provides us with a nice interactive way of exploring our functional annotations and also exploring our expression data. That'll be the fun part that we finish with.

And I told you this part was going to be short, and I didn't lie. It's annotation lab time.