All right, so we have a couple of objectives for this module. We're going to explore methods to glean biological function from transcript sequences, and learn to differentiate between two approaches to functional inference: homology-based and sequence composition-based. The general challenge is that we've done our transcript assembly, we've identified our different transcripts, we've got these sequences, but we don't necessarily know what these sequences represent functionally, biologically. So can we gather hints of the biological function by looking at the sequence data?
There are a couple of key approaches that we might take. One is using sequence homology, probably one of the most reliable ways to infer the function of a given sequence. In this case, we search a database of proteins of known function and see if we can detect significant sequence similarity, or sequence homology, to proteins that have known functions. In addition, we can analyze the composition of the sequence and see if the sequence composition alone can provide us with hints about what that protein might be doing. And there are different machine learning methods that can be applied to gather these insights.
Starting with sequence homology, the most popular way of doing this is to perform a BLAST search. We have our transcript sequence from our assembled transcripts, or we might have run StringTie, given a genome. We can take that transcript sequence and do a BLASTX of that transcript against a protein database and see if we can identify homologous proteins whose functions are known. And there's a great database that we can use for doing this called Swiss-Prot. Swiss-Prot is part of the UniProt Knowledgebase, which also includes TrEMBL. The UniProt Knowledgebase is a very large database.
It's UniProt; you basically go to uniprot.org. It has many millions of sequences, just over 88 million sequences in the database. Swiss-Prot is a small subset. There's only, what's that? I hope so. Amos is working on neXtProt. neXtProt? Okay, human Swiss-Prot, neXtProt? Okay, I don't know. I mean, Swiss-Prot has been around forever, so I kind of just, yeah. Is it still getting new data? Oh, I have no idea. I mean, I'm kind of assuming that it's still growing. Maybe it's not growing at the same pace or by the same mechanism, but they have to have some way of separating out the highest quality annotations, right? Yeah, that's all right, that's all right. Hopefully it's still being maintained. Given how important it is to the community in general, it'd be a crime if it wasn't. Yeah, hopefully someone is still maintaining this. I'm just wondering about that.
But I mean, the databases just continue to grow at a rapid pace, right? I mean, basically everything is growing exponentially. Like I said, there's over 88 million sequences now. This page is constantly updated. Swiss-Prot right now, and this is actually as of July this year, is just over half a million entries. But these are arguably the highest quality records in the dataset. They've been manually annotated and reviewed. Most of them have known functions. And there's a lot of rich information tied to these records about these proteins. So one of the best things we can do is just search Swiss-Prot directly and see whether we have any proteins that are highly similar in sequence and likely to be homologous, that share some evolutionary history and might have conserved functions. This is just an example of a page within UniProt for one of the entries that's in Swiss-Prot, for fructosamine-3-kinase.
And you can see it has some nice functional information there. It tells you about what the protein's doing. You have Gene Ontology information. Gene Ontology is a structured vocabulary for defining molecular functions, biological processes, and cellular components. Here we have GO molecular function, with the different activities that are assigned to it, and likewise for biological process. If you click on the Gene Ontology link, it'll also show you the full structure.
Gene Ontology is represented as a graph structure. We have nodes. We like our graphs, right? Like our de Bruijn graphs and all sorts of ways of using graphs in bioinformatics. In this case, it's a directed acyclic graph where we have parent terms and child terms. The information stored at a given child node is more specific than the information stored at a parent node, which is more general. So in this case, you'll see fructosamine-3-kinase activity, which is down here; that's the terminal node in this hierarchy, and its parent term is kinase activity, which is more generic. If you have a given assignment at a given node within this hierarchy, it automatically means that you have all of the parent assignments as well, all of these parent functionalities. So you'll find that different proteins have assignments at different levels within this hierarchy, and those are different levels of specificity for whatever that function might be.
So this is hugely useful. And this is going to allow us to computationally explore different functional properties of gene sets. So for example, when we did differential expression before, we ended up with collections of genes that had similar expression patterns, that were co-regulated.
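As an illustration of the parent/child propagation described above, here is a minimal Python sketch. The `PARENTS` table is a toy, simplified fragment with intermediate terms omitted, not the real Gene Ontology edge list, and the function names are made up for this example.

```python
# Toy fragment of a GO-style directed acyclic graph: child term -> parent terms.
# ASSUMPTION: simplified for illustration; real GO has more intermediate terms
# and uses identifiers rather than bare names.
PARENTS = {
    "fructosamine-3-kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def implied_terms(term, parents):
    """An assignment at `term` implies the term itself plus all of its ancestors."""
    return {term} | ancestors(term, parents)


# A protein annotated at the most specific node also carries the general terms.
for name in sorted(implied_terms("fructosamine-3-kinase activity", PARENTS)):
    print(name)
```

Because a term in a directed acyclic graph can have more than one parent, the traversal keeps a `seen` set so that shared ancestors are only visited once.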
And we might ask a question: given the annotations that we have assigned to those genes, do we see any particular enrichment in function, or are there certain pathways that are statistically enriched given that set of genes? The statistical tests are much like doing Fisher's exact test. In this case, we have Gene Ontology categories and we have differential expression characteristics. We can build a two-by-two table, a contingency table, where each of the four cells holds the number of genes or transcripts with that joint combination of characteristics. And we can ask, for these 50 genes that are both differentially expressed and have this particular Gene Ontology category, is that statistically enriched? Is it significantly enriched if we run something like Fisher's exact test?
All right, and this is very similar to the more classical descriptions of these tests where you have an urn. In this case, you have an urn that's filled with green marbles and red marbles, and you draw some number of marbles out of that urn. Of those that are drawn, some might be green and some might be red. And we might ask the question, what's the probability of drawing exactly k green marbles? And that's the formula for making that kind of calculation. So it's very, very similar to what we would do with Gene Ontology. Instead of green marbles and red marbles and drawing them or not drawing them, our urn is filled with the transcripts. These transcripts have Gene Ontology assignments and they have a characteristic of being differentially expressed or not differentially expressed. And you're doing the same thing. You're pulling these transcripts out of the urn. In other words, you're identifying them as being significantly differentially expressed.
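The urn calculation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the procedure of any particular enrichment tool. The count of 50 comes from the example above; the other three numbers (10,000 transcripts, 200 with the GO term, 500 differentially expressed) are made-up values for the sake of the example.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k green marbles when drawing n marbles without
    replacement from an urn of N marbles, K of which are green:
    C(K, k) * C(N - K, n - k) / C(N, n)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_pvalue(k, N, K, n):
    """One-sided tail probability of drawing at least k green marbles,
    which matches the one-sided Fisher's exact test on the 2x2 table."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


# Urn = all transcripts; "green" = annotated with the GO term;
# "drawn" = called differentially expressed.
N = 10_000  # hypothetical: transcripts in total
K = 200     # hypothetical: transcripts carrying the GO term
n = 500     # hypothetical: differentially expressed transcripts
k = 50      # from the example: both differentially expressed and in the category

print(f"expected overlap by chance: {n * K / N:.1f}")
print(f"one-sided enrichment p-value: {enrichment_pvalue(k, N, K, n):.3g}")
```

In practice this test is run once per GO category, so the resulting p-values would normally need a multiple-testing correction.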
And then you're just testing to see whether there is a relationship between these two characteristics, being differentially expressed and having a certain Gene Ontology category assignment.
All right, but what happens if you run your BLAST searches and you don't have any good hits? Is there anything else that we can do? Well, there are a number of questions we might ask. We might ask, does it look like it might be a coding transcript? We can look through our sequence and see whether we find any open reading frames: any stretches of nucleotides where we have a candidate start codon, a stop codon, and no intervening stop codons. That has the potential to be translated into a protein sequence. And given that protein sequence, are there any characteristics that might lead us to believe that it could be a functioning protein?
In order to find open reading frames, there's a tool called ORFfinder. There are dozens of tools to do this, but one simple, popular one, if you're doing this on a case-by-case basis, is ORFfinder. You can just pop your nucleotide sequence into a text box, click Go, and it will give you this nice little graphic showing you the positions of all the different open reading frames that it can find along the sequence. One of the problems here, though, is that just because you find an ORF doesn't necessarily mean that it's a real coding region, because even in a random sequence you can find open reading frames. It's generally the case that the longer the open reading frame, the less likely it is to occur at random and the more likely it is to be something real, but that also depends on your GC content. If you have a GC-rich genome, you find open reading frames all over the place, just by chance, even long ones.
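To make the definition of an open reading frame concrete, here is a minimal Python sketch that scans all six frames for an ATG followed in frame by a stop codon with no stop in between. It is an illustration only, not how ORFfinder or any other specific tool is implemented; the function names and the `min_codons` default are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Find ORFs in the three forward frames of `seq`.

    Returns (start, end) pairs, 0-based, end exclusive, stop codon included.
    Each ORF begins at the first ATG seen after the previous in-frame stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3  # coding codons, stop excluded
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end) for ORFs on both strands.

    Coordinates for "-" hits refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = [("+", s, e) for s, e in scan_strand(seq, min_codons)]
    hits += [("-", s, e) for s, e in scan_strand(reverse_complement(seq), min_codons)]
    return hits


# Tiny example: ATG AAA TTT GGG TAA sits in the third forward frame.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

The `min_codons` cutoff reflects the point made above: short ORFs turn up by chance in any sequence, so a length threshold is only a crude first filter and says nothing on its own about whether an ORF is really coding.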
So we have a tool that we developed, and there are other tools that do this as well, that will not only identify all the different open reading frames in the sequence, but then test those reading frames using statistical models to determine whether each one looks like coding sequence versus non-coding sequence. Our tool is called TransDecoder, and you'll be using it this afternoon during the workshop.
Once we have that coding region, even if we don't find good BLAST matches to other proteins, we can still mine those coding regions to see if there's any evidence that they might have domains of conserved function. You might find that there's a domain that could be consistent with being a DNA-binding domain or an RNA-binding domain, calcium binding, any kind of metal-binding domain, or you might find other domains that could indicate enzymatic function or other kinds of regulatory activities. Oftentimes, if you find a domain match to a kinase or a protease, chances are you're gonna find a good BLAST match to that protease or kinase too. But if you have a small region, like a calcium-binding region or some kind of metal-binding region, that might not be sufficient to show up in a BLAST search. But you can search a database of domain profiles, called Pfam, which are hidden Markov models, and based on those hidden Markov model searches you can often detect signatures of these functional regions.
So Pfam is incredibly popular. There's a website, and again, if you have a protein sequence, predicted or just captured as an open reading frame, you can input your protein sequence into the text box, press go, and it will search it against an entire collection of hidden Markov model profiles, an HMM library, and it will return all the domains that it finds that appear to be significant or high scoring.
Here's just an example of a protein sequence that has lots of different protein domains, and you can see there's a nice report here that shows you the name of the domain, where it's found, and the significance of that match to the domain profile. So this is incredibly useful. Even if you don't know exactly what this protein might be doing, you can still get some useful hints, like it binds DNA or binds RNA or has some other function related to the activity of these domains.
Another thing you can do is look for evidence of transmembrane domains. You have many different types of transmembrane proteins with different domain structures, but they tend to have certain signatures, like a stretch of hydrophobic residues flanked by hydrophilic regions, where the hydrophobic stretch extends through the membrane. And there's a tool that we use called TMHMM, and this will predict the transmembrane domains within your protein sequences. Again, you just input your protein sequence of interest, press go, and it will give you a very pretty, very informative report that shows you the predicted transmembrane domains, if it finds any, and the orientation, inside versus outside the cell, according to these predictions. So in this case here, the blue indicates it's on the inside, the red part is the transmembrane region, and you have a little purple part here that's on the outside. And it'll show you the scores as well across the sequence.
Another thing we can do is look to see if we can predict secreted proteins. Secreted proteins have a signature. Again, this is a signature where you have a hydrophobic region at the very N-terminus of the protein that's surrounded by small hydrophilic regions, and there's a cleavage site, a signal peptide cleavage site, which also has a signature.
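The "hydrophobic stretch flanked by hydrophilic regions" signature can be illustrated with a simple sliding-window hydropathy scan. To be clear, this is a swapped-in, much cruder technique than the hidden Markov models behind TMHMM or the models behind SignalP; it only shows the kind of compositional signal those tools build on. The scale is the Kyte-Doolittle hydropathy index, and the window of 19 residues and cutoff of 1.6 are commonly quoted settings for that scale, used here as assumptions rather than tuned values.

```python
# Kyte-Doolittle hydropathy index (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of `window` residues.

    Unknown residues (for example X) are scored as 0.0.
    """
    scores = [KD.get(aa, 0.0) for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_segments(protein, window=19, cutoff=1.6):
    """Merge consecutive windows whose mean hydropathy is >= cutoff.

    Returns (start, end) pairs, 0-based, end exclusive.
    """
    segments = []
    run_start = None
    profile = hydropathy_profile(protein, window)
    for i, value in enumerate(profile):
        if value >= cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            segments.append((run_start, i - 1 + window))
            run_start = None
    if run_start is not None:
        segments.append((run_start, len(profile) - 1 + window))
    return segments


# Made-up sequence: a 20-residue leucine stretch flanked by charged residues.
toy_protein = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
print(hydrophobic_segments(toy_protein))
```

A scan like this cannot tell which side of the membrane each loop is on, and it cannot distinguish an N-terminal signal peptide from a transmembrane helix, which is part of why dedicated predictors are used instead.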
And there's a tool that you can apply to the N-terminal regions of your sequences to see if there's evidence of a secretion signal. It's called SignalP, and it's incredibly popular. Again, put your protein sequence in, press go, and it will give you a report showing you the evidence for any secretion signal. In this case, we've got a few different scores: it has a signal peptide score, it has a separate score for just the cleavage site itself, and then it has a combined score that integrates these two, and based on this combined score it'll make a prediction as to whether or not you have a signal peptide. So at the very least, if you know nothing else about this protein, if it has a predicted signal peptide, then you know that it might have something to do with being secreted and interacting with the environment or other cells.
Going through and doing this one by one is something you're not gonna wanna do if you have a list of thousands of protein sequences. Each one of these tools, in addition to the web interface, has a command-line version, so you can run an entire set of proteins and get the results. But it's a challenge to try and integrate all that information. Yes?
All of these tools have, I guess, databases to query. Is there a way to have local databases, so you're not dependent on their server, or do you find that generally you need to have access in order to run these kinds of tools?
Well, for the web-based ones, it'll use their server. But for the command-line tools, you don't need to have access to the server. You can run it yourself. Okay, so you can download the database, okay. Yeah, you basically download the tool and any resources it requires to your computer, and you just run it locally. It doesn't require any sort of external connectivity.
So we have a tool that we developed called Trinotate, which is both a tool and a protocol. It's a protocol in that you have steps to run; from the command line, you're executing these different tools. But then, as a tool itself, it organizes all that information. It takes the results from your BLAST searches, your Pfam searches, and TMHMM, and incorporates all that information into a single database, from which you can then generate a single tab-delimited report that describes all of these characteristics of your input sequences. And there's another tool called TrinotateWeb, which we'll play with later, that provides you with a nice web interface for interacting with that data. So in addition to having a large report that you can put into Excel and use Excel for querying different things, or whatever your favorite way is, using grep and other command-line tools for extracting information from that file, the website allows you to interact with it and perform searches, and gives you lots of pretty graphs. It also incorporates your expression data, which is one of the key features. Not only do you have your annotation data, but TrinotateWeb will incorporate the expression data so you can go between the two. You'll find genes that look interesting or are differentially expressed, zoom in and click on them within an interactive plot, and automatically go to the annotation report page to get the sequence information, the domains, and whatever functional information you might have.
Of course, there's no substitute for experimentally validating protein functions, and I don't think there ever will be. So this is key. If you find something that looks interesting, and you think you know what it does, and it's super important to you, this is not going to keep you from doing an experiment. Okay, any questions?
Now, before we go into the workshop activities, the practicals, okay. Yep. I guess we'll work the other way. Right, right. So this would be, right. So there would be a lot of transcripts that are going to have, and finding lincRNAs is sort of another challenge in itself. But yeah, if it has an obvious coding region with a good BLAST match, there's a way to put a filter on it to remove all the things that are obviously coding genes. But of the ones that remain, trying to discern which ones are truly non-coding versus being parts of other coding transcripts is hard, because maybe the coding region was too small, or maybe you're looking at mostly a five-prime or three-prime UTR of a coding transcript. That kind of study is definitely challenging. So I guess it's still there through any form that's what we're looking at. We have a tool that was developed called slncky, that's S-L-N-C-K-Y, that will identify conserved non-coding transcripts in human. Now you've discovered that most people don't have the software. That's right. So Jenny Chen and Aviv Regev's group built this out, published it a couple of years ago, and it's freely available software. So again, if you're looking for conserved non-coding RNAs, that would be one way to go. But lincRNAs in general, again, it's one of those things where it's part art and part science, and different people will have different definitions for what they're going to call a long non-coding RNA. You get very different counts too: some people will say there are many, many thousands of them; others will say, no, there's really a small number. So it depends on what your length cutoff is for being a long non-coding RNA and what criteria you have. But it's a tricky thing to do.