Okay, now we're going to talk about network and pathway analysis. So you've had an introduction already to gene set enrichment analysis, where you are looking for enrichment of groups of genes in a gene set of interest, so we're going to go a little bit beyond that now. First, a little review. Why does one want to do pathway analysis on biological data sets at all? The main reason is statistical. When you have thousands of genes, you have thousands of hypotheses, and that causes multiple testing correction problems. When you reduce thousands of genes into a smaller number of pathways (often you can get down to dozens of pathways), you're increasing the statistical power. So now, rather than looking for overrepresentation of a disease process in individual genes, you're clustering and combining those genes together and looking for overrepresentation in a smaller number of pathways. It allows you to do things like find the meaning in the long tails that we often see in disease processes. For example, in cancer genomes, when we're looking at somatic mutations, we usually see five or six genes that have a large number of mutations and then a very long tail of genes that are recurrently mutated at a low level, and we don't know how to deal with them. But often, by clustering them into pathways, you start to see statistically unlikely clustering, and you can find the meaning in that long tail, and that in turn enables you to tell biological stories and get your papers published. So I'm going to be talking broadly about pathway slash network analysis. This is a very heterogeneous field of many different approaches, but the way I define it, it's any analytic technique that makes use of biological pathway or molecular interaction information to gain insights into a biological system.
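To make the statistical point concrete, here is a small Python sketch (not from the lecture; the gene and pathway counts are purely illustrative) of how collapsing thousands of genes into a few dozen pathways relaxes a Bonferroni-corrected significance threshold:

```python
# Why collapsing genes into pathways helps with multiple testing:
# with a Bonferroni correction, the per-test significance threshold
# shrinks with the number of hypotheses being tested.
alpha = 0.05
n_genes = 20000      # illustrative: one test per gene
n_pathways = 50      # illustrative: one test per pathway

gene_threshold = alpha / n_genes
pathway_threshold = alpha / n_pathways

print(f"per-gene threshold:    {gene_threshold:.2e}")     # 2.50e-06
print(f"per-pathway threshold: {pathway_threshold:.2e}")  # 1.00e-03
```

A real pathway-level signal that would be drowned out at the per-gene threshold can clear the much looser per-pathway threshold.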
Because we're in a cancer institute here, a lot of my examples will be drawn from cancer, but by no means do I mean to limit this just to cancer. Pathway and network analysis is a hot and very rapidly evolving field, and there are many different approaches that people use. So let's look at what the difference between a pathway and a network is. Again, this is my definition, not the broad definition by any stretch of the imagination, but the way I like to look at it, a pathway is the detailed biochemical view of a biological process: the thing that we like to draw on the blackboard when we're explaining the causation of a particular biological phenomenon. So for example, here is a traditional pathway view of the initial part of the epidermal growth factor receptor (EGFR) signaling pathway, where the EGFR receptor and the EGF ligand bind to each other, forming a complex. This is inhibited by a negative regulator; then there's a dimerization step, ATP is hydrolyzed to ADP, creating a phosphorylated form of the dimer, which then has further steps below, and there's a positive regulator, SRC1, here. This is generally the way we create our hypotheses and interpret them. Unfortunately, this type of representation is very difficult to scale, and it's very difficult to do large quantitative computation over. So often, if not usually, this very detailed mechanistic model gets turned into a logic model represented as a network. And in this network, we're throwing away all the detailed information, that there's a dimerization, that ATP hydrolysis drives a phosphorylation step, and instead we're just capturing the logical associations: EGF activates EGFR, LRG1 inhibits this interaction.
The nice thing about the network representation is you can start adding other information, other types of interactions, where you don't know the exact mechanism, but there's, say, genetic information, or there's proteomic information that shows that these additional genes, KRT17 for example, are interacting somehow; we don't know exactly how. But we can do some network analysis even with that limited amount of information. And you can convert, by the way, from the pathway representation to the network representation, but you can't go back the other way, because there's information loss. So I'm now going to talk about three different types of pathway and network analysis. They start simple and become increasingly complex, but they all start out with the same set of ingredients; there are basically two. One is a list of altered genes, proteins, or RNAs that you will be analyzing, which comes out of your own experiment. You've done a CRISPR knockout and then done an RNA-seq, and you found that a bunch of genes went up in expression and a bunch of genes went down. That's your list. The second is a source of pathways or networks to analyze your data set with. So let's talk about where those ingredients come from. Pathway databases, and there are quite a few of them, are databases which collect the mechanistic, biochemical representation of biological processes. They are usually curated. They provide an intuitive, human-readable view of a biological process. They capture causation, unlike networks, and they have a human-interpretable visualization, because we have been educated to understand these diagrams. The disadvantages are that because they're curated, they don't usually cover the whole genome. It's human labor, and it's also limited by what gets into the literature. In addition, it's very subjective: different databases will disagree on the boundaries of pathways.
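The lossy pathway-to-network conversion just described can be sketched in a few lines of Python (the edges follow the lecture's EGF example; the data structures are my own illustration, not any database's format):

```python
# Sketch of the lossy pathway-to-network conversion described above.
# The detailed biochemistry (dimerization, ATP hydrolysis) has already
# been discarded; only signed logical edges survive.
pathway_edges = [
    ("EGF",  "EGFR", "activates"),
    ("LRG1", "EGFR", "inhibits"),
    ("SRC1", "EGFR", "activates"),
]

# Going further, to an unsigned interaction network, loses even the sign,
# which is why the reverse conversion is impossible: many different
# mechanisms collapse onto the same bare edge.
network = {(a, b) for a, b, sign in pathway_edges}
print(sorted(network))
```

Once everything is a bare edge, you can freely mix in interactions of unknown mechanism (genetic, proteomic), which is exactly what makes networks easy to grow and pathways hard to recover.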
So going back to my EGF example, the EGF signaling pathway actually leads into the RAS signaling pathway, which is then fed into by a bunch of other signaling pathways, including TGF and others, which all interact together. So when you carve out something called the EGF pathway, where are you drawing those boundaries? Different researchers make different reasonable choices. So one pathway database's EGF pathway may not be the same as another's. That's a problem for communication and reproducibility. Here's an example of a classic pathway. This is the pentose and glucuronate interconversion pathway as represented in KEGG, the Kyoto Encyclopedia of Genes and Genomes. I think most of you have seen these diagrams. KEGG is the oldest and most storied of the pathway databases. It curates biochemical reactions in multiple organisms and maps them onto multiple species, from prokaryotes up to human, and annotates everything using EC numbers, the Enzyme Commission identifiers for enzymatic activities. Its advantage is that it's quite comprehensive and covers multiple species. The disadvantage is that because it's based on enzyme activity numbers, and there is not a one-to-one mapping between enzyme activities and genes and proteins, there can be a lot of confusion in it. In addition, KEGG has changed its business model over the last few years; it's no longer completely open and you have to license it. The Reactome database, which I'll talk about a lot because it's a product of my lab in collaboration with NYU, EBI, and OHSU in Oregon, is another pathway database. It's probably the largest open pathway database for human, and it's created by a team of curators who read the literature and create a data model from which we can create these pathway diagrams.
And it focuses largely on disease-related pathways, signaling pathways, developmental pathways, response to infectious disease, and other pathways that people are interested in. We also take a somewhat opportunistic approach: if we're approached by a researcher who says, well, you really don't cover the field that I'm working in, such as retinal development, we say, oh, that's great, we'll work with you to curate retinal development. Can you spend some time reviewing articles for us? And so Reactome is created by a collaboration between curators and expert authors, and it's peer reviewed: each of the reactions that we represent has been reviewed by at least one peer. So we have rigorous curation standards. Every reaction is traceable to a reference in the primary literature. We do cover non-human species, but a lot of that is automated; we do computation through orthology to create non-human pathways. At the current time, we cover a little bit more than half of the human genome, 10,651 genes in the last release, across 2,132 pathways. It provides a Google-style zoomable map that you can overlay your own and other people's information on, and you can do some pathway and network analysis using Reactome tools; we'll go into that in more detail. And it's open access. Anyone can use it. There's no fee. You can republish it. No restrictions whatsoever. So those are a couple of pathway databases. There are many more; I'll give you a resource where you can get the comprehensive list in a bit. Networks, by contrast, capture more of the genome, including the less well-understood portion. And networks can span a large variety of interactions. Any two molecular entities, whether they're lipids or proteins or DNA, or DNA and proteins, you can make a network out of. So you can create networks that span genetic interactions; for example, you can capture suppressors and enhancers.
Physical interactions like proteomic pulldowns or yeast two-hybrid experiments, co-expression data, sharing of GO terms or literature citations or authors who publish on them, or adjacency in pathways. When you're working with networks, it's really important to understand this key fact: there are many, many different kinds of networks. So you have to understand what the most appropriate network is for the problem you're trying to solve. Pathways are much more homogeneous in that respect. So network databases, again, can be built via curation, with a group of people sitting in a cinder block room curating literature and building these up, or you can build them automatically from high-throughput experiments. And most network databases are built in both ways. They have much more extensive coverage of biological systems; some of them have all the genes in them in one way or another. However, they have a higher error rate. The underlying evidence is generally weaker and the relationships are less well understood, and so the interactions are less reliable on the whole, not that they're not useful. Popular sources of curated networks include BioGRID, IntAct, and MINT. Each of them has tens to hundreds of thousands of interactions, some in one species, some in multiple species. A comprehensive list of pathway and network databases can be found at this resource called Pathway Commons, which is hosted by Memorial Sloan Kettering in New York together with multiple collaborators. Here you can get a comprehensive list of all the network and pathway databases with a brief description of each one, and you can download information from a large subset of them. MSKCC and the Pathway Commons team have created a common interchange language called BioPAX, which allows the various participating pathway and network databases to submit their data in this common language. The data then get integrated and merged together behind a web-based interface.
So you can actually get pathway and network data here as well. Okay, so those are the two ingredients: where you get pathway and network data. What do you do when you've got them? There are basically three different classes of pathway and network analysis that I'll take you through here. The first one you've already seen: gene set enrichment. In gene set enrichment, the basic strategy is to take a pathway or network and break it into clusters of related genes, proteins, or other molecular entities. A classic way to break it up is by using the Gene Ontology, where you divide the biological network into 500 or so categories based on what biological process or what subcellular compartment each gene participates in. Another way of doing it is to take a pathway database like Reactome or KEGG and use the pathway name, whatever the curators decided was the correct boundary for that pathway. And then, using your gene list, you look for overrepresentation of the genes or proteins in your list in one or more of the bins that you created from that pathway or network database. Okay, that should be familiar to everybody. The advantage of this is that the statistics are very well understood. It's straightforward. You have plentiful tools, both online tools and command-line, locally run tools, to do this. And people basically understand what the results mean. The disadvantage is that there are many, many different ways of slicing and dicing the genome. It's subjective, and there's a tendency for people to keep picking different gene sets until they find the answers that they want; that bit of subjectivity is not good. Second is the problem that most gene set enrichment algorithms don't do well with genes that belong to multiple sets. Some of them will correct for it; most do not.
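The over-representation calculation at the core of this approach is simple enough to sketch directly. Here is a minimal, stdlib-only Python version using the hypergeometric distribution; all the counts are invented for illustration:

```python
# A minimal over-representation test: given a hit list of genes, ask whether
# a pathway's members appear in the list more often than expected by chance,
# using the hypergeometric distribution. Computed in log space so that the
# genome-scale binomial coefficients don't overflow.
from math import lgamma, exp

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_sf(k, N, K, n):
    # P(X >= k) when drawing n genes from a genome of N containing K pathway members
    lt = log_comb(N, n)
    return sum(exp(log_comb(K, i) + log_comb(N - K, n - i) - lt)
               for i in range(k, min(K, n) + 1))

# 12 of a pathway's 100 genes show up in a 500-gene hit list drawn from a
# 20,000-gene genome; the expectation by chance is only 100 * 500 / 20000 = 2.5
p = hypergeom_sf(12, N=20000, K=100, n=500)
print(f"enrichment p-value: {p:.2g}")
```

In practice you would run this once per bin and then apply a multiple testing correction across all the bins, which is exactly where the pathway-level reduction in hypothesis count pays off.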
And you can end up emphasizing one bin at the expense of another based on an arbitrary decision of which bin to put a gene into. That being said, GSEA, gene set enrichment analysis, is probably the first thing one should do, and it gives you a lot of information. The second class of pathway and network analysis is subnetwork construction and clustering. In this technique, you do not start with a preconceived notion of what sets the genes belong to. Instead, you discover them from the network through your data. You start out with a network of all possible interactions, whether they be physical interactions or genetic interactions or sharing of GO terms or whatever. And then you superimpose your gene list on this network in order to identify statistically unlikely clustering of your gene list inside that network. So you find, for example, that instead of being scattered throughout the entire network, all the genes in your list are clustered in one little section of the network. And then by extracting that cluster and looking at what genes your list is interacting with, you can start to tell stories and understand what is going on there. So it's an unsupervised discovery of the potential role of the genes that are on your gene list. And there are quite a few published and heavily used programs that let you do that. The final method, the most complex one, is pathway modeling. Here, instead of losing the information that happens when you go from a pathway to a network, the algorithms attempt to build a computational model of the pathway itself, so that the sequence of events and the cause and effect relationships remain. This shades into systems biology, but it allows you to put your gene list on and see what the integrated effects of alterations in those genes are.
So a typical example is you've sequenced a cancer genome and you found two different genes in the TGF pathway that have been altered. You don't know for sure that that's a significant finding, but you put it onto a model of the TGF pathway and you find that, yes indeed, those two genes will interact with each other in the pathway to constitutively activate the pathway. Or perhaps they will do nothing, or perhaps they will decrease the pathway activity, but it gives you a prediction that you can then test. And there are a number of pieces of software for this, all of which are harder to use and less commonly used in the literature, but are gaining in importance. So you use these three methods to answer three different types of questions. Gene set enrichment helps you answer the question of what biological processes are altered in this disease; I'm using cancer as my example. With de novo subnetwork construction, you can discover new pathways, or identify differences among tumors or among patients that relate to which subnetworks are affected by the alterations in that patient's genome. And the final method allows you to ask how the pathway activities are altered in a particular patient, and to ask and answer thought experiments like: can I identify members of the pathway which are targetable by drugs or by genetic manipulations? So let's go through these in more detail. Enrichment of gene sets: I think I said everything on this already. But the main advantages are it's easy to perform and there are lots of tools to do it. The disadvantages are the arbitrariness of the gene sets, and that at the end of the day you get a bag of genes, but you don't have the regulatory relationships among them, and you still have to do a lot of work to understand what the enrichment is telling you. For de novo subnetwork construction and clustering, the process is very straightforward.
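That straightforward process, projecting the altered-gene list onto a network and pulling out the clusters it forms, can be sketched with a toy network (the edges and gene list here are invented; real tools use far larger networks and proper null models rather than simple connected components):

```python
# Sketch of de novo subnetwork extraction: project an altered-gene list onto
# an interaction network, then pull out the connected clusters it forms.
from collections import defaultdict

edges = [("KRAS", "BRAF"), ("BRAF", "MEK1"), ("TP53", "MDM2"),
         ("RB1", "E2F1"), ("MDM2", "TP53BP1")]
altered = {"KRAS", "BRAF", "TP53", "MDM2", "MEK1"}

# Induced subgraph on the altered genes only
adj = defaultdict(set)
for a, b in edges:
    if a in altered and b in altered:
        adj[a].add(b); adj[b].add(a)

# Connected components via depth-first search
def components(adj):
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v); stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(sorted(components(adj), key=len, reverse=True))
```

Each extracted component is a candidate module, which you would then annotate, for example with gene set enrichment, to find out what biological process it represents.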
Using the appropriate algorithms, you take your list of altered macromolecules, genes, proteins and RNAs, and you apply them to the biological network of your choice, and the algorithm will identify topologically unlikely configurations. You can then extract those clusters of unlikely configurations, and usually there will be a small number of them, you know, half a dozen to a dozen. And then you can annotate the clusters to understand what it is that you've discovered. In fact, in an iterative way, you can annotate the clusters that you discovered from the network analysis using gene set enrichment techniques. So you found a cluster; it has some of the genes that you applied, plus some others that are close to each other. You can now annotate the whole thing and say, oh, these are all part of the ribosome, for example. So here's an example of doing this with Reactome. This is from a publication in Genome Biology, about seven years ago now. What we did here, and this was done by my colleague Guanming Wu, who is now at OHSU, was to take the curated pathways that were present in Reactome, turn those into a network, and then add uncurated high-throughput interaction data. At that point it was mostly yeast two-hybrid data; since then we've added modENCODE transcription data and other types of networks, to create a big hairball. And then we used machine learning techniques to weed out false connections from the yeast two-hybrid data in order to reduce the number of false positive interactions. And that gave us 11,000 proteins and 270,000 interactions. The network has grown since then.
And then the next thing one does is to take a gene list, and our usual use case was somatic mutations in a cancer, apply that to the network, and then use the same algorithms that are used to identify social groups, social cliques, in Facebook or on the web or other web-based social networks, to find groups of our genes which are talking to each other, interacting with each other more frequently than you would expect by chance. This pulls out a series of clusters, which we can then annotate, and in cancer these usually end up being signaling clusters, or clusters having to do with cell cycle, or cell motility; usually things that you would expect. If you do it on autism spectrum disorder, you get lots of clusters having to do with neural development, which is reassuring. And then what do you do after that? I'll give you an example in a second of why this is useful. So, as of the current date, the Reactome FI network has grown to 12,280 genes and 230,000 functional interactions, with 60% coverage of the genome. So we still can't say anything about the 8,000 genes which are not well expressed, or not well annotated in the literature, or don't come down in yeast two-hybrid screens. We believe the false positive rate to be less than 1%. However, the false negative rate is quite high, because there's a lot about the genome we don't know, so that's a big caveat. And it's a very big, complex network; I'm just showing you 5% here. And you can kind of see a clustering that has to do with complexes and pathways. Okay, so here's an example of using this. The labels didn't come out very well, but you'll get the general idea. This is exome or whole genome sequencing of 52 pancreatic cancers. There are more than 200 recurrently mutated genes in this set. And here's a typical long tail, with a few genes, KRAS, P53, SMAD4, being very frequently recurrently mutated.
We can immediately say these are driver genes, but then there's a long tail that goes way, way out. Out here, five out of the 52 patients have recurrent mutations, roughly a 10% mutation frequency. We don't know whether these are real or not. They don't meet statistical cutoffs for calling a driver gene by recurrence alone. So what do we do with this? Well, if we do that clustering and extraction, we end up with a total of 11 clusters of highly interacting genes. And by and large, they are what we expect. We have a large module which has KRAS right in the middle, which is very frequently mutated in pancreatic cancer, and involves ERB signaling, FGF signaling, EGF signaling, and axon guidance, which is one of the processes known to be altered in pancreatic cancer. We have another one for P53 signaling, another one for Hedgehog, TGF beta. We have all the extracellular matrix interactions that have been associated with metastasis. And then we have things that are coming from other cell types in the tumor, such as lymphocytes, so we have MHC class 2 in here, and we have the spliceosome. You can then do fun stuff with this. For example, one of the things we were able to do with this pancreatic cancer set was to identify four major genomic subtypes of pancreatic cancer based on which modules are mutated in any individual patient. So one subtype is characterized by absence of the first three modules and mutations in module 7, I forget which one that is. There's another one characterized by mutations in modules 2, 1, and 10, one that's characterized by 2 and 10, and so forth and so on. If you attempted to do this just on the genes themselves, you get very little clustering at all. So this gave us some subtype information. In this case, however, the four different subtypes didn't have any difference in prognosis, so it was of academic interest.
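The module-based subtyping idea can be sketched very simply: represent each patient as the set of modules that carry a mutation, then group patients with the same module pattern. The patients and module names below are invented for illustration (real analyses use statistical clustering rather than exact pattern matching):

```python
# Toy sketch of module-based subtyping: patients grouped by which
# mutated-module pattern they carry, rather than by individual genes.
patients = {
    "P1": {"module1", "module2"},
    "P2": {"module1", "module2"},
    "P3": {"module7"},
    "P4": {"module2", "module10"},
    "P5": {"module7"},
}

# Group patients by their exact module pattern
subtypes = {}
for patient, modules in patients.items():
    subtypes.setdefault(frozenset(modules), []).append(patient)

for pattern, members in subtypes.items():
    print(sorted(pattern), "->", members)
```

Because many different rare gene mutations map onto the same module, patients who share no mutated genes at all can still land in the same subtype, which is exactly why the gene-level clustering failed where the module-level clustering succeeded.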
However, when we did this in breast cancer, we very readily found a strong biomarker of survival. Here is an example in which one of the modules that we found in early stage estrogen receptor positive breast cancer involved genes from M-phase of the cell cycle and Aurora B signaling; it clearly has to do with mitotic activity. And when you apply this to breast cancer survival, high expression of the genes in this module is associated with a much worse disease-free survival. Those patients do much worse than patients with low expression, shown in the red line. In fact, the effect is so strong that patients with high activity in these modules have as bad a survival as patients with triple negative breast cancer, which is generally considered highly aggressive. Patients with estrogen receptor positive breast cancer would usually get a relatively non-aggressive therapy: surgery, a little neoadjuvant therapy, and then watch and wait. Patients with triple negative disease would get much more aggressive treatment. And this argues that there is a subset of patients in the estrogen receptor positive group that should be treated more aggressively. So, there are multiple algorithms that will do this type of clustering. GeneMANIA is a fantastic web service that was developed by Quaid Morris and Gary Bader at the University of Toronto, right here. It uses a birds-of-a-feather principle to identify genes that are related to an experimentally defined set. You upload your list of genes to the GeneMANIA website and it extracts and creates modules for you. And it has several hundred different networks that you can apply, so you can choose the network that is best suited to your experimental questions. One of the disadvantages of GeneMANIA is ascertainment bias: genes which are very highly connected because they have been very well studied, like P53, tend to create clusters no matter what you do, just because they have lots of connections, and they end up coming down.
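Going back to the breast cancer biomarker for a moment, the first step of that analysis, turning a module into a per-patient activity score and splitting patients into high and low groups, can be sketched as follows. All gene names, expression values, and the median cutoff are my own invented illustration; one would then compare the two groups with a log-rank survival test:

```python
# Toy sketch of a module-activity biomarker: score each patient by the mean
# expression of the module's genes, then split into high/low groups for a
# downstream survival comparison.
from statistics import mean, median

module_genes = ["AURKB", "CDK1", "PLK1"]   # hypothetical M-phase module
expression = {                              # patient -> gene -> expression value
    "P1": {"AURKB": 8.1, "CDK1": 7.9, "PLK1": 8.4},
    "P2": {"AURKB": 3.2, "CDK1": 2.8, "PLK1": 3.5},
    "P3": {"AURKB": 7.5, "CDK1": 8.2, "PLK1": 7.8},
    "P4": {"AURKB": 2.9, "CDK1": 3.1, "PLK1": 2.6},
}

scores = {p: mean(v[g] for g in module_genes) for p, v in expression.items()}
cutoff = median(scores.values())
groups = {p: ("high" if s > cutoff else "low") for p, s in scores.items()}
print(groups)
```

The point of the module score is that it aggregates the whole cluster, so the biomarker does not depend on any single gene being measured reliably.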
The HotNet program, written by Ben Raphael at Brown University in Rhode Island, avoids this ascertainment bias by modeling the network as a metallic lattice. You put a series of up-regulated or down-regulated genes into it, creating hot or cold spots, and then it uses thermal diffusion to identify hot regions of the network. When you have a highly connected gene like P53, it has lots of connections, so the heat disperses more quickly. I've never understood why this way of modeling networks actually makes any sense, to be honest. But it does produce very useful results, and it has been used to identify drug targets and tumor subtypes. So it works, even if you can't explain exactly why it works. Then there are a couple of Cytoscape applications, and you've all used Cytoscape at this point, right? Which I recommend. One is called HyperModules, and this is a very good general-purpose tool for finding and extracting network clusters. It allows you to co-cluster with clinical characteristics. So if you have a gene list of up-regulated genes and you're looking at a clinical characteristic such as a high rate of migraine headache, you can put those two together and it will find clusters which are correlated with patients who have the migraine phenotype. And then the Reactome Functional Interaction network has its own bespoke Cytoscape app. You'll actually be doing an exercise using it after I finish this lecture. This is more limited than the others, because it uses the Reactome FI network and you can't change it. But within it, it offers multiple clustering and correlation algorithms, including HotNet and PARADIGM, which I'll talk about, and it allows you to do survival correlation right within the application, designed specifically for cancer genome analysis. Okay, so the last type of network and pathway analysis I'll talk about is pathway-based modeling.
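Before moving on, the heat-diffusion intuition behind HotNet can be sketched with a toy iterative diffusion (this is my own simplified illustration, not HotNet's actual insulated-diffusion kernel; the four-gene network is invented). Heat placed on mutated genes spreads along edges, and a hub like TP53 sheds its heat into many neighbors, which damps its influence:

```python
# Toy heat diffusion on a small undirected network. Each step, every node
# keeps a fraction beta of its heat and distributes the rest equally among
# its neighbors; hubs therefore disperse heat quickly.
genes = ["TP53", "MDM2", "KRAS", "BRAF"]
neighbors = {"TP53": ["MDM2", "KRAS", "BRAF"],   # TP53 is the hub
             "MDM2": ["TP53"],
             "KRAS": ["TP53", "BRAF"],
             "BRAF": ["TP53", "KRAS"]}
heat = {"TP53": 1.0, "MDM2": 0.0, "KRAS": 1.0, "BRAF": 0.0}  # mutated genes start hot

beta = 0.5   # fraction of heat a node keeps each step
for _ in range(50):
    new = {}
    for g in genes:
        inflow = sum((1 - beta) * heat[n] / len(neighbors[n]) for n in neighbors[g])
        new[g] = beta * heat[g] + inflow
    heat = new

print({g: round(h, 3) for g, h in heat.items()})
```

Total heat is conserved, and at equilibrium it settles in proportion to connectivity; hot regions that persist despite this dispersal are the candidate mutated subnetworks.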
You or the algorithm creators have created a computational model of one or more pathways in which the relationships between upstream and downstream events, the positive and negative regulatory relationships and their weights, have been preserved. This then allows you to apply a list of altered genes, with the direction in which they're altered, to the model, compute over it, and see what the effect is on downstream effectors of the pathway. So you can do very specific things like looking for increases or decreases in a particular active phosphorylated product, or looking for increases or decreases in ubiquitination of a product, or the presence or activation of a transcription factor. So it attempts to preserve and integrate the multiple molecular alterations, and it's a kind of baby step into systems biology. This is where there's an incredible diversity in the community; I'm just giving you a very superficial list of the most popular tools, and there are many different formalisms that people use. The oldest is differential equations based on reaction kinetics, using software like CellNetAnalyzer. These are mostly suitable for metabolic and biochemical systems, such as yeast undergoing fermentation, where you're predicting from the reaction kinetics what happens when you add maltose to the mix. They're good for up to a couple dozen genes, after which the computation becomes increasingly intractable. Unless you're doing metabolomics, you probably won't use these. Then there are information flow models that are designed for signaling cascades, mostly kinase cascades. The two that I'm most familiar with are NetPhorest and NetworKIN. They model specific signaling pathways, and they have pre-built models of most of the ones that you would be interested in.
And you can perform experiments on these models to see what the effect on signaling would be from mutations or other alterations in that pathway. For transcriptional regulatory networks, the best software that I know of, and the oldest, is a tool called ARACNE from Andrea Califano's group at Columbia University. What it allows you to do is give it RNA-seq or microarray data across a series of perturbations, typically knockdown experiments or transfection experiments. It will build a cell-specific regulatory network for you and allow you to identify the key master regulators of the process, the transcriptional switches. It does this using an information theory paradigm. And then there's a large family of methods based on probabilistic graph models. These build a logic model of the pathway with arrows of influence between each pair of genes. There can be positive influences and negative influences, and from a data set you can learn the weights of these influences and then integrate across them, so that if there is an increase in the activity of a gene up here at the top, it'll propagate that change down through the various positive and negative arcs and tell you what happens at the bottom. The most popular of these algorithms is called PARADIGM, from Josh Stuart's lab at UCSC, and it's been used very extensively for cancer analysis. So I'm going to give you an example of how PARADIGM works. This is a very simple example. We have a very reduced view of P53, where the MDM2 gene inhibits P53 activity and P53 activity promotes apoptosis. So if you increase P53, apoptosis will go up. If you increase MDM2 activity, P53 activity will go down and apoptosis will go down. Very simple model. Now even this model turns out to be much more complicated at the genomic level, because you can have alterations at the gene level or at the RNA level or at the protein level.
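The MDM2, P53, apoptosis example can be captured as a tiny signed logic model. This is a toy sketch in the spirit of, but far simpler than, PARADIGM: it just multiplies influence signs along a single chain, with no learned weights or probabilistic inference:

```python
# Toy signed-influence model of the lecture's example:
# MDM2 inhibits P53, and P53 activates apoptosis.
influences = {          # (source, target): +1 activates, -1 inhibits
    ("MDM2", "P53"): -1,
    ("P53", "apoptosis"): +1,
}

def predict(perturbation):
    """perturbation: node -> +1 (up) or -1 (down). Returns predicted
    effect of each perturbed node on apoptosis (assumes a single chain)."""
    effect = {}
    for node, change in perturbation.items():
        current, sign = node, change
        while current != "apoptosis":
            # follow the chain one step downstream, multiplying signs
            nxt = next(t for (s, t) in influences if s == current)
            sign *= influences[(current, nxt)]
            current = nxt
        effect[node] = sign
    return effect

print(predict({"MDM2": +1}))   # MDM2 up  -> apoptosis down: {'MDM2': -1}
print(predict({"P53": +1}))    # P53 up   -> apoptosis up:   {'P53': 1}
```

The full method generalizes this to branching graphs with learned edge weights, and, as the lecture describes next, lets gene-level, RNA-level, and protein-level alterations all feed into the same node.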
So you can, let's say, decrease MDM2 activity by knocking out the gene, or you can hypermethylate the gene and reduce the amount of RNA, or you can degrade the protein. And I have a question here. You can absolutely put in the non-coding genes as well, as long as you know what they're doing. So if you have a lincRNA and you know for sure that it is activating a gene in apoptosis or some other pathway, then you can absolutely put that in. Same thing for P53. Oh yeah, I'm using the wrong mouse. And so alterations at any of these levels can have a different effect on the activity of the protein or gene itself, and then the proteins interact with each other to cause the effect. So here are some examples of the various things that PARADIGM was developed to handle. You do a genomics experiment on a cancer; you can find mutations, you can find changes in RNA expression, sorry about that, you can have copy number changes, you can assay by mass spec, and PARADIGM is set up so that you can put one or more or all of these types of omics analysis into the model and get a sensible result out of it. And it actually works quite well if you apply it to a complex cancer type such as glioblastoma multiforme. You readily find subtypes of the cancer, each of which has a different set of pathway activities that are either up-regulated or down-regulated by the particular combination of mutations in that cancer. So what we're looking at here, whoops, I'm clicking, are patients going across the columns, so each column is a different tumor, and the rows are the integrated and inferred pathway activities of a series of pathways they built models for. So for example, this interleukin pathway here is very strongly down-regulated in subtype 3 and not significantly down-regulated in the other three subtypes. The EGF receptor pathway, subtypes 3 and 2 both tend to up-regulate.
So this has been used to discover multiple cancer subtypes, and in some cases they've had clinical significance. Now, the good and bad news about PARADIGM itself is that it's actually rather hard to run, because it's distributed in source code form and is somewhat hard to get installed on your machine. Making it even more difficult, there are no pre-formatted pathway models. You have to go in, curate the literature yourself, and create the models, which can be quite difficult. It also takes a while to run. The good news is that it's one of the algorithms that we put into the Reactome FI Cytoscape app. It runs PARADIGM, and we have pre-populated it with pathway models based on Reactome. We also improved performance so that it runs in a reasonable time, minutes to hours rather than days. So you will get a taste of ReactomeFIViz when Robin Hall gives his workshop in a few minutes. And that is the end of what I wanted to cover; it's just an overview of the field. I'm ending with some useful URLs for databases and for algorithms that you can use. They're all in your materials. And I want to thank everybody for your attention. I'm happy to answer questions for a few minutes, after which I need to run off to a meeting, which explains why I'm dressed this way, in case you're wondering. So, any questions? Comments? Yeah. Well, in this case we're clustering individual tumors by mutations in the genes, so protein coding changes. Okay. And there are six frequently mutated genes and another 200 which are not so frequently mutated. We tried to find patterns in just the mutations of the genes and didn't really get anything. When we turned it into network clusters, we got very strong patterns. No, that's exactly the point. Our biggest module was KRAS, and KRAS is mutated in 95% of pancreatic cancers.
But it turns out there are many other genes which are more rarely mutated, and sometimes you would get no KRAS mutation but a lot of mutations in genes that are in the pathway, so they end up getting clustered together. So the question is whether the tools can be used to analyze data from model organisms or non-model organisms. The answer is that in many cases, yes, you can, but there has to be a source of the pathway or network information, and so it needs to be a well-annotated genome in order for any of this to work. The best scenarios are those in which the animal you're looking at is related to some model organism that has been annotated. Say it's a nematode: then you can use the C. elegans annotations for it, and use orthology information to transfer over GO annotation terms or networks and pathways that have been built. It will require some upfront work for you to do it. And if it's a species that has not been well studied, then you might be out of luck.