Okay, so I'm Lincoln Stein, and as you know from the introductions yesterday, I work upstairs. My main area of interest is databases, and I work on a pathway database called Reactome. I'm going to try to give you an overview of both pathway-based and network-based analysis of genomics data, and in honor of this being a network talk, I've tried to set up the talk as a network, with various levels of success, as you'll see. So we're going to start with an introduction, and then I'm going to talk about various types of pathways and networks and how to get the data, and then we'll talk about applications of these data in generating hypotheses from gene lists. We'll spend most of our time on this because I think that's your major interest. We'll touch briefly on using networks and pathways to predict the function of new genes, and Quaid will talk much more about that later, and we'll also talk about how this type of analysis can be used to classify patients, samples, and diseases and discover new structure underneath them. Yes, Gary? I'll just mention that the notes are coming, because I think some of them aren't here. Oh, of course. Because of the way this is set up, it's not a standard PowerPoint presentation. We had some technical issues in getting it printed, but we've resolved them, and the detailed copies of this will be coming in a few minutes.
Okay, so let's start with the introduction here. So why would you do pathway analysis at all? What are pathways and networks? How are they different? Where can you get the information, and how can you use them to do interesting science? So the idea of pathways and networks is to further reduce the amount of information from gene lists. With gene lists, as you learned yesterday, you can take a large number of genes and shrink them down into a smaller number of gene functions using gene enrichment analysis and the other tools that you learned yesterday.
But that still gives you, unfortunately, a very, very large list of gene functions. A lot of the problem is the fragmentation of knowledge and differing historical perspectives. You have the issue of the GO hierarchy, you have the issue that different groups have annotated genes in different ways, and you still have a lot of data reduction to do. So pathway and network clustering seeks to take that long and complicated list of functions and turn it into a more discrete set of pathways or interacting networks, such that instead of having hundreds of functional classifications, you have maybe dozens or fewer pathways or networks. It can also help you find, perhaps, your false positives, things that really are not part of the process that you're studying, the obvious outliers. A second thing that makes pathway and network analysis useful is that it gives you a certain amount of intuition as to the biological context. So instead of a whole series of disconnected concepts such as mitochondrial inner leaflet, you get out of a pathway or network analysis more of a picture and a concept of how the genes in your list are interacting with each other and what the regulatory relationships are. You can also create nice diagrams to explain how your genes are working with each other. You can also do various types of computation on the pathway. You can link up processes that you did not know, or people did not know, were connected. You can identify pathways or network modules which are up or down regulated in your system. You can identify potential regulators of the process you're looking at. The basic idea is to take your microarray with up and down regulated genes and turn it into a diagram of how those genes are related to each other and what mechanisms they contribute to.
Okay, so I'm going to talk basically about three different types of resource. One is pathway databases.
One is networks of interactions, and then I'll touch a little bit on relationship maps, which may or may not be the correct name for this concept, which integrates phenotype and genotype. Sure, an interactome is an interaction network, so it's a series of bimolecular interactions of various types, each between two molecules. We'll talk more about this. In distinction, a pathway is what you intuitively conceive of as a pathway, where you have multiple things going into a process and coming out. So if you think of glycolysis, with a step-by-step, kind of goal-directed process, that's a pathway.
So let's talk about pathways first. If you see something which is a step-by-step process that looks kind of like a chemical reaction or biochemistry, that's very likely to be a pathway database. Most pathway databases are hand curated from the literature. In fact, you can think of the literature as being a big community pathway database in a very disorganized fashion. A pathway database is very, very useful for extracting actionable biological knowledge. So if, for example, you realize that a chemokine pathway is involved in driving your tumor, then you can immediately look at the pathway, identify druggable targets, and come up with a plan for inhibiting that pathway and treating the disease. Unfortunately, the whole concept of a pathway is very, very fuzzy. A lot of what people call pathways are really the result of a particular historical trajectory which might just as well have been another. In particular, the signaling pathways are incredibly interdigitated with each other. They have lots of crosstalk. And the fact that one is called the PDGF pathway, another is called the EGFR pathway, and a third is called FGF or TORC is really a historical accident of what growth factors people were studying at a particular time. And that makes working with pathway databases quite difficult.
So the goal of pathway databases is to recreate on a very large scale the Boehringer Mannheim metabolic pathways chart which adorned our cinder-block-walled 1960s and 70s laboratories and restrooms. This is a wonderfully hand-curated chart of all of intermediary metabolism in E. coli. You can zoom into this and see really incredibly detailed information on how small molecules are being interconverted, what enzymes drive this, what cofactors are needed, what small molecules go in and out. And the idea of modern pathway databases is to recreate this, but across all biological processes, fundamental processes such as DNA repair, signaling pathways, neuronal signaling, blood coagulation, you name it, and in an online and computable way. As we'll see later, there are hundreds of things that are pathway databases. I can't review them all, but I'm going to touch on some of the high points, some of the most popular ones and the ones that are most useful.
So the granddaddy, or grandmother, of pathway databases is KEGG, the Kyoto Encyclopedia of Genes and Genomes. KEGG is both a genome database that has information on genes, their proteins, and their structure, and a curated collection of pathways. Their basic user interface device is the biochemical pathway map. They have several thousand of these maps that have been hand drawn, and each map represents each process as an input, an output, and an EC number, which is its Enzyme Commission number. You can immediately see one of the limitations of KEGG, which is that it's very EC number associated, and many processes that we'd like to talk about, such as transport reactions, don't map well onto EC numbers. So it makes using KEGG data in an automated fashion a little bit difficult. It is an incredible project. They cover over a thousand species, 5.2 million genes across those species, and more than 100,000 pathways that have been laid out by hand.
The pathway diagrams are manually curated, so somebody has looked at each of these arcs, each of these reactions, and has dredged it out of the literature. But there's been a lot of computerized assistance, so they've done a lot of projection of pathways from one species to another. All of intermediary metabolism, for example, started out with a curated E. coli set of maps, and then they were projected onto human and other species via their EC numbers. You should be aware that that has introduced some artifacts. The projected maps are only as good as the EC mapping, and because of issues in orthology and gene evolution, it's not a given that a gene that has one enzyme activity in E. coli has exactly the same activity in human, so errors can accrue there. KEGG is free for academic use, but because it's not completely open access, there have been some issues in incorporating KEGG pathways into other resources. Its main service, which I'll show you later, is a colorizing service. You take your gene list, you upload it to KEGG, it finds the pathways that your genes belong to, and then it'll show you a series of maps with those boxes colorized to show you where your genes are.
So here's another service. This is similar in some respects to KEGG. It's called BioCarta, and again I'm having this weird scrolling problem, which is driving me crazy. BioCarta is a commercial endeavor in which biologists and artists have collaborated to create very detailed scalable vector graphic images of many processes. Sorry about that. So you have a beautiful series of hand-drawn pathway diagrams, and they make great PowerPoint substrates. It's a little hard to get the statistics from them, but they have 120,000 genes in 154 pathways. Most of the pathways are oriented towards pathways that drug companies would be interested in, so they're largely kinase pathways and signaling pathways. There is a community annotation service that they offer where people can put their own annotations on top of these pathways. The service is largely used by drug companies to put up little ads for their drugs, so if you're looking at a pathway with a kinase, and some drug company has a kinase inhibitor, you'll see a little ad for it in there. It doesn't have an underlying database, so you can't automate it for use in interpreting gene lists, but what you can do is, again, give it a list of genes, get a set of diagrams out of it, and then look at those diagrams.
This is GenMAPP, a desktop application that you can run on your PC. It uses Visual Basic. It consists of a series of, again, hand-drawn biochemical pathways. It has a different focus than KEGG. This one is primarily signaling pathways, and then high-level biological things such as inflammatory response, hemostasis, and response to disease. It provides the same pathway search and colorization services. The interesting thing about GenMAPP is that it is integrated with WikiPathways, which is an online, community-curated set of pathways that currently contains 19 species, 4,300 genes, and 178 human pathways, and many more pathways in other species.
The nice thing about WikiPathways is that you can go to the WikiPathways URL, look at other people's diagrams, click on them, and then start adding to them and save them, and they'll then be accessible to other users. You can download them into GenMAPP and work with them for your analysis. So it's a nice system, but it comes with the caveats that you have for any wiki. How reliable is the data? It's community annotated, so how many people are looking at it? But it does give you the history of how many people have edited it, and probably, like Wikipedia, it's more accurate than not most of the time.
Okay, now, I work on Reactome, and I'm hoping I'm not going to be too biased in this. Reactome is a hand-curated database that my group has put together over the past 10 years in collaboration with Ewan Birney at the European Bioinformatics Institute and Peter D'Eustachio at New York University. Reactome is created by inviting authors from the community to come in and write modules in a database way. We try to be very, very rigorous about what goes into Reactome. It has a kind of Boehringer Mannheim chart up here, each element of which is a reaction, and they're connected to each other to indicate the sequence of events. Each of those reactions has a series of inputs, a series of outputs, and a series of regulators. Those inputs and outputs can be small molecules or proteins or macromolecules, and everything is absolutely explicitly identified. So when we talk about a protein, we are talking about a specific UniProt accession number that you can look up in UniProt. Every assertion has a primary literature reference behind it. So it really is run like a journal. You can see an author, you can see peer reviewers, you can see what they wrote, you can usually see a picture, and then hiding underneath here is a computable database. My zooming is not working at all.
So Reactome is primarily human curation of human pathways, but we also project the pathways across orthologies to other species. We cover 22 species, including the popular model organisms. Currently in human we have 4,600 unique proteins, about 5,000 if you count splice forms, and over 1,000 pathways. It's exportable in multiple informatics formats and completely open to the community to use, so you can reuse it as you like. And the features are the same sorts of things. You can find pathways containing genes of interest, and there's a colorization service, which I'll show you later. We actually do offer a gene list over-representation service, which I'll show you, where you upload lists of genes, or genes with their expression vectors, and it'll give you a p-value and an FDR for participation in a pathway. There's also an orthology mapping service, so given a pathway in one species, you can find pathways in another. And in our beta version, which is online but still needs a lot of work, and which we're hoping to have out during the summer, we have a nice Google Maps style scrollable pathway display, so you can zoom in and out and you can grab with the mouse and move around, plus an interaction overlay service.
So here, this is a little bit hard to see, and I'm not sure the zoom is going to work. Okay, this is okay. Here, I wanted to find small molecules that interact with the SHC1 gene in this EGFR response pathway, and it's going to one of the chemical databases, ChEMBL, and popping up a list of small molecules that interact with these genes of interest. And in this view, I've done a similar thing, but popped up various physical interactions with SHC1. These are genes that are known to physically touch SHC1, either by proteomics or by yeast two-hybrid studies, but we don't know exactly what their mechanistic relationship is.
So then, if you are interested in cancer pathways, I strongly recommend the NCI-Nature Pathway Interaction Database, PID.
Okay, this is curated by several curators who work for Nature, and they've put their data into PID using a data model which originally was derived from Reactome and has since diverged. They also import pathways from Reactome and from BioCarta. The nice thing about this is that it has probably one of the best gene list interpretation functions. Do I have pictures of it? No, unfortunately, I didn't put in a screenshot of that, but their interface is very polished. Again, you give it a gene list. It will find pathways that involve that gene list and show you an interactive, colorized version of the pathway diagram. I think my problem here was that it was too big for me to paste in.
Okay, then if you have money, there is a commercial system called Ingenuity. It has been around for about eight years, and Ingenuity is an online service. It costs over $10,000 a year to subscribe to. The algorithms they use and their content are unpublished, so there's no way to find out exactly what's in there. There's no way to download it and count. But according to their marketing literature and the one publication they made, the content is a combination of curation and integration of multiple data sources from literature mining, from high-throughput experiments, and from microarray experiments. They've put together a searchable system called IPA, which is designed, again, to gobble up gene lists, find processes and pathways which are overrepresented in that set, and then let you explore them. Its greatest advantage is that it has good integration with pharmaceutical information. So you can, for example, filter the pathway list down to only pathways which are druggable by some known patented or non-patented drug. It also provides tools to extract and build your own custom pathways and networks, share them with other people, and save them for later use. And it has a very, very well designed user interface. It's nice to interact with, and it's been used for a lot of publications.
The only caveat is that it costs money and the methods that they use to construct it are unpublished. So it's a bit of a black box.
And then finally, Pathway Commons is not in and of itself a pathway database, but it is a unified collection of information from pathway databases. This is a Sloan Kettering project. Gary Bader contributed to it. The PI is Chris Sander. What Pathway Commons does is collect pathway and interaction information from nine different databases, and that number will grow, in a standardized format called BioPAX, which is an XML format for representing pathways and interactions. And then they bring them together in a common user interface which can be browsed, searched, and manipulated in various ways. Its main advantage is that it provides a uniform interface for tools like Cytoscape to bring pathways down. So Cytoscape no longer has to go to KEGG via one mechanism, Reactome via a second mechanism, and GenMAPP via a third mechanism, and deal with the idiosyncrasies of each of them. It simply goes to Pathway Commons, gets the pathway in its standard format, and loads it. It's more a resource for analysis tools to use. It does have an online presence which provides you with basic summary and searching tools. The biggest promise of Pathway Commons is that it's going to allow for unification and integration among pathways, so that we have a single view of what's known about pathways in computable form, but that hasn't happened yet. It's a very difficult problem, and it will likely be more than a year before that is in place. The issue is that KEGG and Reactome and WikiPathways and so on have a largely overlapping set of pathways, but they name the pathways differently, they have different curational standards for declaring what a pathway is, and they're using different nomenclature even for what's in the pathway.
With UniProt versus Entrez Gene IDs versus EC numbers, it's hard to ask even the simple question: does a pathway that's in KEGG correspond to a pathway in Reactome? How are they similar? How are they different? So here are some screenshots from the Pathway Commons website. It's showing a... this zooming is really bothering me. It's zooming randomly here. This is the last time I'm going to ask for a Mac. So we're seeing, oh, actually, for some reason we're only looking at pathways from Reactome here. I did a search on it, but usually you'll see pathways from a large number of resources. And then you can select a pathway and get a picture of it, click on the nodes and arcs, and explore it further.
Okay, so any questions before I move away from pathway databases and start telling you about networks? One place at one time, so if it's specific to one cell compartment at a certain time, you're better off looking at the hand-curated one? That's absolutely correct. A typical curation standard requires you to ensure that two molecules which are in the same reaction are known to be in the same cell, in the same compartment, at the same time. They also need to be from the same species, with a few exceptions, such as when you're talking about pathogen-host interactions. And typically, if it's a reaction that involves the membrane, one of them has to be known to be bound to the membrane or have a transmembrane domain. So there's a series of human checks that goes into these. With an interaction database, you can impose that type of information, but it will not necessarily be the case that that's been done. Okay, any other pathway questions?
All right, let's talk about networks. One of the problems with pathways is that it's very hard to compute over them. The data model is complex.
You have inputs, you have outputs, you have rate constants, you have binding constants, you have various regulators, you have movements between compartments, and really, to compute with it correctly, you need to go into biochemistry. You need to start talking about partial differential equations and really create models. And often there's just not sufficient information to make models that relate to reality. So for rapid computation, networks have to make a big simplification of biology. They say that everything is a bimolecular interaction. So that's the big simplification. I have protein A and protein B, or complex A and protein B, or small molecule A and small molecule B, and they interact with each other. Sometimes that interaction is a very abstract thing, such as a genetic interaction. Say you do a screen for synthetic lethals or an enhancer-suppressor screen in your favorite model organism and you find that two genes are influencing each other's behavior. You can call that an interaction. Or they can be physical interactions. You do a yeast two-hybrid study and you find that two genes A and B in a bait-prey relationship are physically interacting with each other, and you can create a physical interaction on that basis. Or two genes are co-expressed across many different microarray experiments under different experimental conditions, different life stages, and different mutant backgrounds. That suggests that either one is regulating the other or they have a common gene that's regulating both of them. So that would be a co-expression interaction. Or they can share GO terms more often than not. That's kind of weak evidence that they're interacting in some way. Or in the literature, whenever gene A is mentioned, gene B is mentioned as well, more often than you would expect. That's text mining.
Or in fact you can go to a pathway database and throw out all that detailed information that the curators have labored to put in and instead say, oh yeah, two genes are close together in the same pathway, maybe they're interacting.
So a network ends up looking like this, where you have a series of vertices, and these can be proteins or complexes or something even more abstract, like a process. These are called vertices or they're called nodes. And they're connected by a relationship which in graph theory is called an edge. And then there's some more lingo that comes along with graph theory that you'll hear more about today. If nodes are connected in a circle, or if there's a path which can take you from a starting node back to itself, that's called a cycle. And then there are different kinds of edges. There are undirected edges, like this one, where the relationship is symmetrical. So that's not saying that this node is doing something to that node. They're kind of mutually affecting each other in some way. That's as opposed to a directed edge, in which there's an arrow. And so this is saying that node A is doing something to node B. In a regulatory network, you'd say this is up regulating node B. You could also have an inhibitory arrow where it's down regulating B. And then there are weighted edges, where edges are not created equal. Some are more significant than others. This is typically used in predicted interaction graphs where you have various types of evidence for interaction, and edges which have more evidence behind them will have higher weights.
Here are some more network concepts. I've got to cut off a little bit of my crazy zooming here. There we go. So degree is a metric applied to nodes. It's simply the number of edges going into or coming out of the node. So this node has degree four because of its four edges going into it. This one only has degree one because it only has one.
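The vocabulary above (nodes, undirected, directed and weighted edges, degree) can be sketched in a few lines of Python. This is only a toy illustration, not how any particular tool stores its networks; the class name `Network`, its methods, the node names, and the weight 0.8 are all made up for the example.

```python
from collections import defaultdict


class Network:
    """Toy network of nodes joined by weighted, optionally directed edges."""

    def __init__(self):
        # node -> {neighbor: weight} for edges leaving / entering the node
        self.outgoing = defaultdict(dict)
        self.incoming = defaultdict(dict)

    def add_edge(self, a, b, weight=1.0, directed=False):
        """Add an edge a -> b; an undirected edge is stored in both directions."""
        self.outgoing[a][b] = weight
        self.incoming[b][a] = weight
        if not directed:
            self.outgoing[b][a] = weight
            self.incoming[a][b] = weight

    def degree(self, node):
        """Number of distinct neighbors joined to the node in either direction."""
        return len(set(self.outgoing[node]) | set(self.incoming[node]))


# Hypothetical example: one node with four undirected partners,
# plus one directed, weighted edge (a kinase acting on P4).
net = Network()
for partner in ["P1", "P2", "P3", "P4"]:
    net.add_edge("hub", partner)
net.add_edge("kinase", "P4", weight=0.8, directed=True)

print(net.degree("hub"))  # 4
print(net.degree("P1"))   # 1
print(net.degree("P4"))   # 2
```

Here degree counts distinct neighbors, which matches "edges going into or coming out of the node" as long as there is at most one edge between any pair of nodes.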
The shortest path, which you'll hear a lot about, is simply, between any two nodes, how many hops connect them. So between these two nodes, the shortest path is two. If you've heard of the six degrees of separation, the Kevin Bacon game, this is just a shortest path type of exercise in which you choose any two people in the world and you can find relationships that connect them with no more than six hops. It's actually been established that it's a smaller number. It's something like five now. In the original experiments, you were given a name and an address. Can you, by handing the letter off through person-to-person connections, get it to that person? And the number was six. And now with the web and Facebook, it's five or something. It's smaller in certain communities. It's even smaller in the math community, because everybody knows everybody else.
And then this is related to the idea of betweenness, or centrality, in which every node is labeled by the number of shortest paths that traverse it. So you find all the pairs in the network and calculate their shortest paths. There are good graph-theoretical algorithms that'll let you do this really, really quickly. And then you label each node by the number of shortest paths that go through it. There will be one or several nodes which are most central to that network, where everything passes through. These are basically bottlenecks in information flow, if you want to think of it that way. And these can be very interesting because, well, if you want to knock out a process with a drug, maybe it's a central one you want to go for, or maybe it's the one you want to avoid because it'll be lethal. Before I go on, any questions about that? By the way, I did this by hand late last night. The betweenness number might not be correct here. So any questions? Okay, great.
So let's look at a few ways that you map biology to a network. Basically, any set of pairings can be turned into a network.
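To make that concrete, here is a minimal sketch, assuming an unweighted, undirected network, of turning a list of pairings into a network and then computing shortest paths and betweenness on it. The function names and the five-node example are invented for illustration, and the brute-force approach is only sensible for small networks; real tools use faster methods such as Brandes' algorithm. When a pair of nodes has several equally short paths, each path is given an equal fractional share of the credit, which is the usual convention.

```python
from collections import deque
from itertools import combinations


def build_network(pairs):
    """Turn a list of (a, b) pairings into an undirected adjacency map."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_paths(adj, source, target):
    """Return every shortest path from source to target (breadth-first search)."""
    dist = {source: 0}
    preds = {source: []}          # predecessors on shortest paths
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:                    # first time we reach nb
                dist[nb] = dist[node] + 1
                preds[nb] = [node]
                queue.append(nb)
            elif dist[nb] == dist[node] + 1:      # another equally short route
                preds[nb].append(node)
    if target not in dist:
        return []

    def walk(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in walk(p)]

    return walk(target)


def betweenness(adj):
    """Label each node with the shortest paths (between other pairs) that cross it."""
    score = {node: 0.0 for node in adj}
    for s, t in combinations(sorted(adj), 2):
        paths = shortest_paths(adj, s, t)
        for path in paths:
            for node in path[1:-1]:               # interior nodes only
                score[node] += 1.0 / len(paths)
    return score


# Hypothetical pairings: A-B, B-C, C-D, B-E
adj = build_network([("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")])
hops = len(shortest_paths(adj, "A", "D")[0]) - 1
score = betweenness(adj)
print(hops)                      # 3
print(score["B"], score["C"])    # 5.0 3.0
```

In this toy network B is the bottleneck: five of the ten node pairs can only reach each other through it.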
So for example, the protein-protein interaction, as I talked about before, is kind of a canonical example. But you can use other relationships. For example, a regulatory relationship between a kinase and its phosphorylated target, a genetic epistasis, or even protein sequence similarity, where two proteins have a high BLAST score. So you could give them a weighted edge indicating their BLAST score. People will talk about network analysis in a very broad way, but there are as many different types of networks as there are relationships. And it's critical to understand what type of network they're talking about. I borrowed many of these slides from Gary's talk last year. So thank you, Gary. Here is a beautiful visualization of all the proteins in UniProt BLASTed against each other to create a network of similarity. And you get this gorgeous, but probably not very useful, visualization of protein relationships. Here's another one. It actually looks kind of similar. It's the result of a proteomics experiment in Saccharomyces cerevisiae in which protein complexes were spun down, then fractionated and put into a mass spec machine to identify which proteins are co-complexing with each other. And so you can build a map of which proteins are in complexes with each other. It's a very similar representation, actually, but with quite different underlying meaning, and you would use it in a very different way.
So you can build networks completely automatically, or you can build them via curation, or you can do both. You can build ones that have both curated and uncurated sides. I'm going to talk about a few of the ways that people make automated interaction maps. One of the most popular ones is to try to do what curators do, but do it automatically.
Go into the literature, find genes or proteins which are related to each other, and extract that from the text without having to go through a curator's brain, because computers are faster and cheaper than curators are. Usually it's from PubMed abstracts, because you can get PubMed abstracts for free. Some groups have gotten access to the entire PubMed corpus and so can go into the body of the paper. Google could probably do this really, really well. There are a large number of increasingly sophisticated algorithms for scoring when two proteins are likely to be related, and for using semantic information to avoid false positives such as "A is not involved in this process with B." We don't want to take "A is not related to B" and score that as a positive relationship. But it's certainly not a perfect system. First of all, there are just problems recognizing gene names. Even if you recognize the gene name, distinguishing Oct3 from OCT3 is a nontrivial problem. Even if you recognize that it's a gene name, how do you know what species it's in? Even a trained curator often has problems figuring out from a paper what species they're talking about, because frequently they're going to be moving from human to mouse to zebrafish, and the papers are not always clear about what experiment was performed in which organism. And then you have gene name problems if you have a paper that talks about hedgehog. Is that a Drosophila gene, or is this a process in the hedgehog animal? And then there are genes that have terrible names, like this gene called A in Arabidopsis. It's a transcriptional regulator. How do you pick out A? So it's a difficult problem.
There are a number of tools, though, that do a reasonably good job at this. One is a literature search service provided by Agilent. They have made a free Cytoscape plugin.
And you can use it really as a way of pulling out papers which have something to do with your gene list. So you upload your gene list to the Agilent tool, you build a network inside Cytoscape, and then you can annotate your network with papers that have something to do with that list. And then there's an internet-based service called iHOP, which is really quite fun to play with. I recommend that you play with it. You start out with a paper or with a gene, and from that you can find other genes or papers that are related to it via literature links. You can hop from one to the other, and you can quickly explore what's known about your gene.
Then there's a whole class of interactions that are derived from high-throughput laboratory experiments, omics-style experiments. Yeast two-hybrid we've talked about, protein complex pull-downs with mass spec we've talked about, and genetic screens. Again, these are not perfect. In particular, yeast two-hybrid interactions involve taking proteins out of their natural context in the human cell or the Drosophila cell and putting them into yeast. So they're no longer sequestered in the proper compartment, and they're no longer being expressed in the correct context. And so there are a high number of interactions that are true in the sense that the proteins really are physically interacting with each other, but in the real world they're probably never even being co-expressed. They're never in the same compartment. So from the point of view of biological significance, there are a lot of false positives in Y2H. For the same reason, there are actually a lot of false negatives as well, because you've taken the proteins out of their context and they're no longer membrane-bound, for example, and so they're not interacting when they should be. There are also artifacts that you need to be aware of.
There are sticky proteins, actin and some of the ribosomal proteins, which just kind of stick to everything. And so in both Y2H and in mass spec experiments you'll find some really huge hubs, proteins that seem to be connected to everything else, and those need to be removed. And then genetic screens, unfortunately, have their own artifacts. An epistasis experiment, an enhancer-suppressor screen, is actually highly sensitive, ironically, to network effects, to the genetic background. So two genes will interact genetically in one genetic background but not in another, because of the influence of other genes. So it's a snapshot that has a lot of false negatives in it. So there are lots and lots and lots of pathway and network databases: 325 of them at last count, according to this very good, well-curated resource called Pathguide at Sloan Kettering. And no, I'm not trying to zoom in. Yeah, I'm getting seasick with this. There's no scroll wheel on the back, is there? Pinch? No, none of that stuff works. Okay. So you can go to Pathguide and read about each of these. In particular, the details panel will tell you when the resource was last updated, what its focus is, what its curational model (or lack of one) is, and its holdings when those are known. And it will sometimes give you a link to download the dataset. Okay. So here are some of the popular sources of curated interaction networks. One of the biggest ones is BioGRID, which is located here in Toronto as a collaboration among a number of institutions. This is built out of the older BIND database, which you may have heard of. These were created by curators reading the literature and deriving interactions from it. It covers multiple species, 529,000 genes, and it has 167,000 interactions among them.
Now, on the other hand, there is the IntAct database at EBI, which also has a curational model. They're covering 60,000 genes with 203,000 interactions. Now, you'll immediately notice that there's a crazy imbalance here in terms of the number of interactions per gene. This is coming from the different curational models, what their standards are for calling an interaction. So IntAct has a very low bar for what an interaction is. BioGRID has a very stringent one, and they also have different ways of grouping the genes together. And so you have to be very, very cautious about interpreting the data that comes out of these databases. Another one, sort of intermediate between them, is MINT. It's an Italian resource. Again, curated interactions from the literature, primarily in yeast. It's 31,000 genes and 83,000 interactions. So I would say that it is very difficult to use these primary interaction network databases as is. You'll get very different answers depending on what database you go to. They have different standards. They have different definitions of what interactions are. It's better, in my opinion, to adopt integrative approaches in which multiple sources of interaction information, multiple networks with different types of experimental and literature evidence, are combined together to create an integrative network which tries to capture the interactions which are well supported and separate them from the others. The idea is that if multiple sources say that two genes are interacting with each other, it's more likely to be real than if only one out of the many sources does so. And here's a very simple example from Marc Vidal's lab.
So Vidal's group has done many Y2H screens on yeast and on other species, and observed the false positive problem very early on. But they found that by taking interactions and filtering them by gene co-expression and subcellular compartment data, so that they throw out any interactions between genes which are not co-expressed and/or not in the same compartment, they're able to remove most of the false positives. And they created what they call date hubs and party hubs to describe them. A more complex example would be to take yeast two-hybrid data, combine that with proteomics and curated data, with literature-mined data, co-expression data, and Gene Ontology data, and to build a kind of machine learning system or statistical system to classify things based on the weight of evidence for them. And I'll talk about one such effort. So here is an example integrated network, which I'll talk about because it was done in my group by Guanming Wu, and so I know the example best. But I'm not saying that this is the best approach. It's a very active area of research and there are many groups working on it. The motivation for this was that we were concerned that Reactome was being underutilized. And when we talked to biologists about what they thought about it, they said, well, you know, I do my screens and when I ask for pathways, 80% of my genes are not even in Reactome. How can I use it when you don't have enough coverage of the genome? And even at our current curation rate, where we have basically eight curators working on this, we're going to hit 5,000 proteins sometime this year, but that's over five years, about 1,000 proteins a year.
We're still only at a quarter or less of what's in the genome. And so we wanted to increase our coverage. So the concept here was to create a corona around the Reactome core pathways, such that we would have a core of curated genes with their correct regulatory relationships and all the evidence supporting them, and then we would have a corona of less well characterized interactions derived from literature mining and high-throughput data. But because we consider ourselves a high-quality curated database, we didn't just want to start adding all the interactions from all the databases. We wanted to add these cautiously so that we had a high probability of having a correct corona. So we derived our corona information from many other pathway databases, some that I've talked about and some that I didn't have time for, and from biomolecular interaction databases such as BioGRID. We also mined interactions from other species, yeast, worm, and fly, and projected them into possible interactions via their orthologs in human. We used shared GO terms, because two things are somewhat more likely to interact if they're in the same biological process, or in the same subcellular compartment, than otherwise. We downloaded basically all of GEO, made a co-expression matrix, and took things which are highly co-expressed. We mined a database of transcription factors and their targets, and we used the GeneWays literature mining effort from Andrey Rzhetsky at the University of Chicago to find genes which are co-mentioned in the literature. This was published in the article that we gave you in your materials. And this created a huge network in which everything was connected to everything else, and which had a very, very high rate of false positives.
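The ortholog projection and evidence-gathering steps just described can be sketched roughly as follows. This is a minimal illustration, not the Reactome group's pipeline: the gene identifiers, the ortholog table, the evidence sets, and the function names are all hypothetical.

```python
from itertools import product

def project_interactions(model_pairs, orthologs):
    """Map model-organism interactions onto human gene pairs via orthologs.

    orthologs maps a model-organism gene to a list of human orthologs.
    Pairs where either partner has no human ortholog are dropped.
    """
    projected = set()
    for a, b in model_pairs:
        for ha, hb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if ha != hb:
                projected.add(frozenset((ha, hb)))
    return projected

def evidence_vector(pair, evidence_sources):
    """One 0/1 feature per evidence source: does this source support the pair?"""
    return {name: int(pair in pairs) for name, pairs in evidence_sources.items()}

# Hypothetical toy data: two yeast interactions and a small ortholog table.
yeast_pairs = [("yA", "yB"), ("yB", "yC")]
orthologs = {"yA": ["HUMAN1"], "yB": ["HUMAN2", "HUMAN3"]}  # yC has no ortholog

human_pairs = project_interactions(yeast_pairs, orthologs)

evidence_sources = {
    "yeast_projection": human_pairs,
    "coexpression": {frozenset(("HUMAN1", "HUMAN2"))},  # made-up co-expressed pair
}
features = evidence_vector(frozenset(("HUMAN1", "HUMAN2")), evidence_sources)
```

Each candidate pair ends up with one feature per evidence source, which is the kind of input a classifier can weigh; taking the plain union of all sources instead is what gives the everything-connected-to-everything network.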
And then to remove false positives from this network, we took the Reactome dataset, which is curated and which we feel confident in, and used a series of algorithms to reduce everything in Reactome to a set of high-confidence bimolecular protein interactions. So everything that's in a complex got turned into a series of bimolecular interactions. Inputs and their catalysts interact, catalysts and the outputs interact, and so on. We had a whole series of rules for determining what was an interaction and what wasn't. And so this gave us a training set of functional interactions that we felt were significant. Okay, then we used a machine learning system. The very first one we tried was a naive Bayesian classifier, which is a simple one, and it actually seemed pretty stable and worked pretty well for us. We trained that classifier using the protein interaction pairs derived from curated Reactome pathways. You have to train these systems both with a positive set of things that are true and a negative set of things that are false, and as a negative training set we just chose random pairs of proteins from the genome, under the assumption that two random proteins in the genome are highly unlikely to interact. That is debatable, but it's probably a better negative training set than others that you can come up with. And then after the training, we did a test of this using a 10-fold cross-validation. Yes, go ahead. So just one question on negative training sets. Is there any utility in restricting those interactions, so rather than going random, deliberately taking protein pairs from different subcellular compartments, say a nuclear one and a cytoplasmic one, where you know they don't interact? Yeah, that was the very first thing we tried.
And the problem with that is that it was such a strong signal that it trained the classifier to use the GO compartment as its major piece of evidence. So anything that was in the same compartment was then scored as so much more likely to interact that it just wasn't performing well. I actually should back up a little bit and give you the background of what a classifier does. The idea is that the classifier acts as a black box. You give it two proteins and it gives you a probability score: high probability that they interact, or low probability that they interact. And then you have to choose a cutoff to say yes, they do, or no, they don't. Now, ideally you want to test this empirically. You want to make predictions and then test them in the lab. Unfortunately, with the obvious ways of doing it, such as yeast two-hybrid with targeted baits, or going into the literature, we'd already brought all those pieces of evidence in. So you actually have to put it out and then wait for new science. But you can test it internally by doing something called a tenfold cross-validation, where you train with 90% of your data, your positive and negative training sets, and then you validate it by testing it with the part of your training set that you've withheld, to see how well it can predict the whole from the part. And you do that multiple times. From that, you can calculate the accuracy. And what this can then give you is something called a receiver operator curve, an infinitely zooming one here. All right, I'm just going to leave it here.
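To make the black box a little less opaque, here is a minimal sketch of a Bernoulli naive Bayes classifier scored by tenfold cross-validation, using only the standard library. It is an illustration under stated assumptions, not the code behind the Reactome network: the three 0/1 evidence features, the simulated positive and negative pairs, and all function names are hypothetical.

```python
import math
import random

def train_naive_bayes(examples):
    """Fit a Bernoulli naive Bayes model on (features, label) pairs."""
    n_features = len(examples[0][0])
    counts = {0: [0] * n_features, 1: [0] * n_features}
    totals = {0: 0, 1: 0}
    for features, label in examples:
        totals[label] += 1
        for i, value in enumerate(features):
            counts[label][i] += value
    # Laplace smoothing keeps every probability strictly between 0 and 1.
    likelihood = {
        c: [(counts[c][i] + 1) / (totals[c] + 2) for i in range(n_features)]
        for c in (0, 1)
    }
    prior = {c: (totals[c] + 1) / (len(examples) + 2) for c in (0, 1)}
    return prior, likelihood

def interaction_probability(model, features):
    """Posterior probability that a pair with these features interacts."""
    prior, likelihood = model
    log_score = {}
    for c in (0, 1):
        score = math.log(prior[c])
        for p, value in zip(likelihood[c], features):
            score += math.log(p if value else 1.0 - p)
        log_score[c] = score
    top = max(log_score.values())
    weight = {c: math.exp(log_score[c] - top) for c in (0, 1)}
    return weight[1] / (weight[0] + weight[1])

def cross_validate(examples, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, repeat for every fold."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scored = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_naive_bayes(training)
        scored.extend((interaction_probability(model, f), y) for f, y in held_out)
    return scored

# Hypothetical toy data: positives carry each evidence type 70% of the time,
# random negative pairs only 10% of the time.
toy_rng = random.Random(42)

def fake_pair(label):
    rate = 0.7 if label else 0.1
    return tuple(int(toy_rng.random() < rate) for _ in range(3)), label

examples = [fake_pair(1) for _ in range(100)] + [fake_pair(0) for _ in range(100)]
scored = cross_validate(examples)  # list of (probability, true_label)
```

The list of held-out scores paired with their true labels is exactly what is needed to draw the curve and choose a cutoff.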
So what the ROC curve is doing is graphing the false positive rate, from a 0% false positive rate to a 100% false positive rate, against the true positive rate, going from 0% true positives to 100% true positives, at different thresholds for the cutoff. So you give it a pair of proteins. It gives you a probability from 0 to 1. And then you can choose different levels of that cutoff. You can say 0.5, 0.4, 0.3. If you declare that everything that has a score greater than 0.4 is an interaction, you can then measure it against the part of the dataset that you left out to see whether it called it right or called it wrong. Now, a really, really good ROC curve will look like this. It'll have a right angle, so that at a false positive rate of 0 it'll immediately zoom up to a true positive rate of 1. Random will look something like this, a diagonal line, where it's just randomly calling things and it's about 50% right each time. And so this is actually a fairly nice one, where it has a very steep slope. You increase the false positive rate just a little bit, accept a little bit of false positives, and the true positives go up quite rapidly. And so we can get close to 90% accuracy if we're willing to accept a false positive rate of 20%. But we weren't willing to do that, actually. We took a very, very conservative threshold for this that gave us a false positive rate of under 1% for our network, and then accepted a lower true positive rate of 20%. Okay. And so that's the foundation of the Reactome Functional Interaction Network. You can argue that we might have wanted to choose a different threshold for that.
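The threshold trade-off described above can be sketched in a few lines. This is a minimal illustration with made-up scores; the function names and the rule "call an interaction when the score is at or above the cutoff" are assumptions for the example.

```python
def roc_points(scored):
    """Return (threshold, false_positive_rate, true_positive_rate) per cutoff.

    scored is a list of (probability, true_label) with labels 0 or 1.
    A pair is called an interaction when its probability >= threshold.
    """
    n_pos = sum(1 for _, y in scored if y == 1)
    n_neg = sum(1 for _, y in scored if y == 0)
    points = []
    for threshold in sorted({p for p, _ in scored}, reverse=True):
        tp = sum(1 for p, y in scored if p >= threshold and y == 1)
        fp = sum(1 for p, y in scored if p >= threshold and y == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points

def pick_threshold(scored, max_fpr):
    """Best true positive rate among cutoffs whose false positive rate <= max_fpr."""
    allowed = [pt for pt in roc_points(scored) if pt[1] <= max_fpr]
    if not allowed:
        return None
    # Highest TPR wins; among ties prefer the lower (more inclusive) cutoff.
    return max(allowed, key=lambda pt: (pt[2], -pt[0]))

# Hypothetical held-out scores: five true interactions, five non-interactions.
held_out = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1), (0.6, 1),
            (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]

strict = pick_threshold(held_out, max_fpr=0.0)   # conservative choice
relaxed = pick_threshold(held_out, max_fpr=0.2)  # accept more false positives
```

With these toy numbers the strict cutoff recovers fewer of the true interactions than the relaxed one, which is the same trade the talk describes: a lower false positive rate bought at the cost of a lower true positive rate.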
Because network analysis is so sensitive to false positives, though, we wanted to keep them as low as possible. Also, I believe that this analysis is underestimating our error rate. Okay. So what does that give us? That gives us a big, fuzzy hairball, and we're looking at only 15% of the network in Cytoscape here. But it has greatly increased our coverage, from less than 5,000 proteins to almost 11,000 proteins. And that includes 209,988 interactions, which is a fairly good number. What we now have is a mixed network that consists of curated interactions from Reactome, because of course we didn't throw those out, plus the predicted ones from the Bayesian classifier. Okay. And in the paper I gave you, we used it to do a very simple analysis of glioblastoma multiforme mutations from the TCGA set and showed that you can cluster this and pick out apparent functional modules in it. So before I go on, are there any questions about how we built that network and what it means? Okay. And I just want to emphasize again that this is just one effort of many, many. There are hundreds of papers published in this field every year, and some are using much more sophisticated techniques than we used. Okay. And then I'm going to touch very quickly on a different type of network map. I don't know if this is the official name for them, but I call them relationship maps, where we're creating maps of concepts rather than maps of genes. Relationship maps were developed for the problem of doing semantic integration of knowledge. Different pathway databases cover the same processes, but they have quite different names for them. Are these processes the same ones? Are they overlapping? Are they partially the same? Are they completely distinct?
And so, you know, heart disease, cardiovascular disease, vascular development: are these the same? Are they different? Sometimes you can't tell. And so the relationship maps take gene information, typically, and integrate it with phenotype information to discover how these concepts are related to each other. A very nice paper that appeared a few years ago was on the human disease network. Essentially what they did here is they took OMIM, Online Mendelian Inheritance in Man, which is a curated database of diseases. For each disease, some curator has read all the literature about that disease and has assembled a description of the disease, its clinical phenotype, its treatments and so on, plus a list of genes which are thought to be involved in that disease. So you can extract a relationship between a disease and the genes that are involved in it. And they built up a very, very large network like this, in which they have diseases, and each disease has one or more edges going towards the genes that have been described in it. I believe they used a weighted graph, with some confidence that they belong together. And then from that they're able to do a network-based clustering. So two genes are likely to be related to each other if they're connected by a common disease, and a whole group of genes are going to be connected together if they're related by a common disease. On the other hand, a series of diseases may be related to each other if they share more genes in common than you would expect by chance. And so they created two networks. I'm scared to scroll.
One was a disease network, where they are connecting each disease to every other one via a weighted graph, where the edge weight corresponds to the number of genes that connect them. And the size of the node is the number of genes involved in that disease, so it's related to its degree. Okay? To recap, the size of the node is the number of genes in the disease, and the width of the edge is the number of genes that connect the two. Okay? And so this is what they came up with. They came up with two maps, one of the diseases and one of the genes. And it made some very nice connections that are intuitive. So diabetes and obesity are connected by a lot of genes, so they cluster together. And then there are some which are less obvious. Well, all right. So we have colon cancer and breast cancer together with lymphoma. Oh, where was it? Yes. Fanconi anemia, which of course is a DNA repair disease, is clustering with the cancers. Okay? But then there are some less obvious ones here, which I'm not going to be able to find because I'm in infinite zoom mode. I don't believe they did, but it's been a while since I read the paper, so I may be wrong about that. In the same way, they are able to cluster genes by the diseases that they share. This is the same thing turned inside out. And sure enough, it does some obvious things. It puts p53 and many of the signaling genes together. And there are a few non-obvious things too, which again I'm not going to try to find. But it's now a great resource for exploring. Okay, so any questions about that? Okay. Yes, Charter? Yes, it is. Yeah, there's a URL in the paper. You can browse it. You can download it. Okay. How am I doing for time? I have 25 minutes? Okay. All right.
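The two maps described above are the two projections of a disease-to-gene table. A minimal sketch, assuming a toy table with placeholder disease and gene names (none of them taken from OMIM) and hypothetical function names:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease -> genes table standing in for the curated source.
disease_genes = {
    "disease1": {"g1", "g2", "g3"},
    "disease2": {"g2", "g3"},
    "disease3": {"g4"},
}

def disease_network(disease_genes):
    """Connect two diseases when they share genes; weight = number shared."""
    edges = {}
    for d1, d2 in combinations(sorted(disease_genes), 2):
        shared = disease_genes[d1] & disease_genes[d2]
        if shared:
            edges[(d1, d2)] = len(shared)
    return edges

def gene_network(disease_genes):
    """Connect two genes when they share diseases; weight = number shared."""
    edges = defaultdict(int)
    for genes in disease_genes.values():
        for g1, g2 in combinations(sorted(genes), 2):
            edges[(g1, g2)] += 1
    return dict(edges)

# Node size in the disease map: number of genes involved in each disease.
node_size = {disease: len(genes) for disease, genes in disease_genes.items()}

disease_edges = disease_network(disease_genes)
gene_edges = gene_network(disease_genes)
```

This sketch only counts shared genes; testing whether two diseases share more genes than expected by chance would need an additional statistical step that is not shown here.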
So I've talked about different kinds of networks and pathways. I've talked about where you can get them. I've talked about building integrated networks. Now I'm going to talk about using them, applying them to interpreting gene lists. I'm going to spend most of my time talking about hypothesis generation. Okay. So before I start, it should go without saying that the quality of what comes out in terms of hypotheses is directly related to the quality of what goes in. So before you start, you need to correctly normalize your arrays. You need to subtract out background. You need to do quality control to remove things which are obviously wrong. Run all the statistical tests that are appropriate to reduce the amount of noise in your experiment; false positives in particular are going to give you a lot of false associations. And be aware of the ID requirements for the particular software that you're intending to use, using the tools that we discussed yesterday. So make sure that if it wants UniProt IDs, you're using UniProt IDs. If it wants Ensembl IDs, you're using those. Okay. We'll briefly touch on pathway colorizing services, which are very straightforward to use. I'm going to give you two examples. One is from KEGG. That screenshot didn't come out too well. All right. So KEGG has a pathway mapping service in which you can upload a list of genes in a variety of formats. You can use HUGO IDs, which is what I used to make this screenshot, or you can use KEGG IDs, or you can use UniProt/Swiss-Prot IDs. It will then attempt to map those to its internal IDs and give you a report on how many mapped. Okay. And you're seeing the result. It's telling us which ones were not found. There were a lot of non-hits here.
And then it'll give you a list of the pathways that it knows about and the number of IDs that were in those pathways. And then you can click on the links, and for each one you'll get a nice little chart. It will show you the gene on your list in the context of where it is in the pathway diagram. And then you can scroll through that, click on the nodes to get more information about them, and make up your own mind about how it is used. Okay. Here's Reactome's version of this. It is really extremely simple. You upload a series of IDs. We try to be very, very proactive about what IDs we accept. So you can give it Entrez IDs. You can give it Ensembl IDs. You can give it UniProt IDs. You can give it Affymetrix probe numbers from a variety of popular microarrays. And I think that's it. And then it'll try to guess. It'll then give you this kind of big page that has multiple sections. At the bottom of it is a list of how it did its mapping. But at the top, where it's most useful, is a ranked list of over-represented pathways and their p-values. This is using a hypergeometric test, the one-tailed Fisher's test that you heard about yesterday. You can then zoom in on these and look at pathway diagrams or look at the database records. One of the features that we offer is a zoomable version of our pathway map where it (okay, I have to stop doing that) highlights the genes that were in your list, and one of the options allows you to turn on a color scale so that it'll show you the significance values. Here I'm just showing them all. You can also upload a time course and it'll give you a little movie. It'll give you an animated GIF so you can see how the pathway map changes with time.
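The one-tailed hypergeometric test behind a ranked list of over-represented pathways can be sketched with the standard library alone. This is an illustration, not Reactome's implementation: the pathway names, gene identifiers, background size, and function names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(population, in_pathway, drawn, hits):
    """P(X >= hits) when drawing `drawn` genes from `population` genes,
    `in_pathway` of which belong to the pathway (one-tailed test)."""
    upper = min(in_pathway, drawn)
    favourable = sum(
        comb(in_pathway, i) * comb(population - in_pathway, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, drawn)

def rank_pathways(gene_list, pathways, population):
    """Return (pathway, hits, p_value) sorted from most to least significant."""
    genes = set(gene_list)
    results = []
    for name, members in pathways.items():
        hits = len(genes & members)
        p = hypergeom_pvalue(population, len(members), len(genes), hits)
        results.append((name, hits, p))
    return sorted(results, key=lambda row: row[2])

# Hypothetical toy example: 10 genes in the background, two small pathways.
pathways = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g7", "g8", "g9"},
}
ranked = rank_pathways(["g1", "g2", "g3"], pathways, population=10)
```

In this toy case all three submitted genes fall in pathway_A, so it gets a p-value of 1/120, while pathway_B with no hits gets a p-value of 1. Genes that are not in any pathway still count toward the size of the draw, which is one reason low coverage weakens this kind of analysis.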
You can also give it expression values and have up-regulated pathways turn red and down-regulated ones turn blue. All right. Okay. And you can zoom in to see details on a particular pathway. Yeah. Okay. And the new beta version will also support this, but I highly recommend that you don't try this yet, because I tried it yesterday and it was not working quite right. Okay. And that's all I'm going to say about pathway colorization. It's basically a way to orient you, but you have to do all the hypothesis generation yourself. Okay. So any questions about that? Pretty straightforward. Okay. And another big limitation is that it's a pathway database, so it's only covering a little piece of your gene list. It's going to have a high rate of not being able to say anything about most of your genes. No, I'm sorry. So we're not doing a Reactome tutorial or anything, but if you go to Reactome, you will see up at the top a link that says "try our new beta." And you can try our new beta, but the expression analysis, as of yesterday when I tried to do the screenshot of it, was not working quite right. It may work next week, but I just wanted to warn you. Okay. But this afternoon we're going to do a demo and a lab involving a beta release version of a Cytoscape plugin, which fetches data from the Reactome functional interaction network and does various things with it. That's beta, but it actually works pretty well and will likely be released in the next couple of weeks, when the user interface is polished up just a little bit. Okay. So now I'm going to talk about working with interaction networks and trying to derive hypotheses from them.
And the nomenclature here is active subnetwork extraction. All that means is that you start out with a big network that encompasses 50% or more of the genes of the genome, and you extract from that the subset that you're interested in. So, for example, the subset of genes which are up-regulated in sample B versus A in your microarray or RNA-seq experiment. And then you build from that a subnetwork of how these interesting genes are interacting with each other, and then you build hypotheses out of that. All right. And so the paradigm here, and this was stolen from a poster so it's Reactome-oriented, is that you start with the big interaction network. You extract the genes of interest: the ones that are mutated in your cancer cell line, the ones that are overexpressed or underexpressed in your treated cell lines, the ones which are involved in CNVs. And then optionally you can add some linker genes. These are genes which were not actually on your list but are central to the process and will tie them together. So you might actually have picked up a little bit of the same process here and a little bit there, and the linker gene will tie them together into the same process. Sometimes you want to do that, sometimes not. Typically you then get a big fuzzy hairball, and then you can use various clustering algorithms, I'll tell you about one and you'll hear about others this afternoon, to break it into functional modules. And then from that you can go on to examine those modules, which are genes that are heavily interacting with each other, to do hypothesis generation, sample classification, disease gene prediction and other good things. Okay. I'm going to give you an example of this from a project that's going on in my lab now.
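As a rough sketch of active subnetwork extraction, the following keeps the edges between genes of interest and optionally pulls in linker genes. The linker rule used here (any gene outside the list that touches at least two listed genes) is a simplifying assumption for illustration, and the network, gene names, and function name are hypothetical.

```python
def extract_subnetwork(edges, genes_of_interest, add_linkers=False, min_links=2):
    """Return (nodes, edges) of the subnetwork induced by the genes of interest.

    edges is a list of (gene_a, gene_b) pairs from the full network.
    Genes of interest that are absent from the network simply drop out.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    nodes = set(genes_of_interest) & set(adjacency)
    if add_linkers:
        # Assumed rule: a linker is not on the list but touches >= min_links hits.
        linkers = {
            gene for gene, neighbours in adjacency.items()
            if gene not in nodes and len(neighbours & nodes) >= min_links
        }
        nodes |= linkers

    sub_edges = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub_edges

# Hypothetical toy network: L sits between A and B but is not on the gene list.
network = [("A", "L"), ("L", "B"), ("B", "C"), ("C", "X"), ("D", "Y")]
hits = {"A", "B", "C", "Z"}  # Z is not in the network at all

plain_nodes, plain_edges = extract_subnetwork(network, hits)
linked_nodes, linked_edges = extract_subnetwork(network, hits, add_linkers=True)
```

Without linkers, A is stranded because its only route to B runs through L; adding L ties A into the same module as B and C.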
It's a collaboration. The person responsible for this is Irina Kalatskaya. She's a postdoc in my lab, and she's collaborating with Peter Dirks, who's at the University Health Network, UHN, here in Toronto, and who's looking at brain cancer stem cells. And so the model is this. They have flow-sorted stem cells out of tumors from patients with glioblastoma multiforme to create a population of self-perpetuating tumor stem cells. And in parallel with this, they've isolated normal neural stem cells from fetuses. They have many of the same markers, and they have many stem cell properties. But the one big difference is that if you grow the neural stem cells in a rich medium containing bovine serum albumin and other growth factors, and then remove the medium, they will differentiate into a variety of mature neural cells: neurons, astrocytes and so on. The brain tumor stem cells, if you remove the rich medium, will continue to grow in an undifferentiated fashion. They'll sometimes throw off some things which look a little bit like astrocytes but are not normal astrocytes. And so the question is, what's different between these two populations of stem cells? It's a straightforward microarray experiment where they compared several tumor stem cell lines to several normal fetal neural stem cell samples. They found over a thousand, about 1,300, differentially expressed genes, ones which are significantly up- or down-regulated between the two sets. They then published this data set a year ago with some observations, such as finding that beta-catenin is very highly expressed in the cancer stem cells. They then asked us to have a look at it in our network.
So what Irina did is she extracted the active subnetwork from the functional interaction network. 716 out of the 1,308 were in the network at all. So a 55% hit rate is pretty much what we expect from our coverage of the genome. Then we did some basic statistics on them. One of the most interesting is that the genes that are up- or down-regulated are more connected with each other than random genes. In the functional interaction network as a whole, the average shortest path is 3.82. So the distance between any two genes is 3.82 hops. And it's 3.58 in this set, which is not that big a difference, but it's statistically significant. She then added a series of highly connected linker genes until the shortest path length started to increase. So we stopped when they started getting less well connected. And then we used a community clustering algorithm called Girvan-Newman edge betweenness, which was originally developed for analysis of the web and then Facebook sites, to find communities that are highly interconnected, and that created about 20 clusters that consisted of 10 or more genes. This is what the thing looks like in Cytoscape. You'll be able to generate those using the plugin that we'll talk about this afternoon. And there are maybe some obvious things in here. Here we've circled the modules that are highly interconnected. Some of them were expected: beta-catenin, which had previously been published, came up; p53 always comes up; there's a HOX cluster here. And then some of them were more surprising. Each of these modules tells a story and generates a hypothesis. I'll take you through just a couple of them to give you a sense of them. So one of the big clusters was a big GPCR cluster, which initially looked very uninteresting.
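The average shortest path statistic mentioned above can be computed with a breadth-first search. A minimal sketch on a hypothetical five-gene chain (the network, gene names, and function names are made up, and unreachable pairs are simply skipped, which is an assumption of this example):

```python
from collections import deque
from itertools import combinations

def shortest_path_length(adjacency, source, target):
    """Number of hops between two genes, or None if they are not connected."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

def average_shortest_path(adjacency, genes):
    """Mean hop count over all connected pairs drawn from `genes`.

    Paths may run through genes outside the set; only the endpoints are restricted.
    """
    lengths = []
    for a, b in combinations(sorted(genes), 2):
        distance = shortest_path_length(adjacency, a, b)
        if distance is not None:
            lengths.append(distance)
    return sum(lengths) / len(lengths) if lengths else float("nan")

# Hypothetical toy network: a chain A - B - C - D - E.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
whole_network = average_shortest_path(adjacency, adjacency.keys())
gene_list = average_shortest_path(adjacency, {"A", "B", "C"})
```

To judge whether a difference like 3.58 versus 3.82 is significant, one option is to repeat the calculation on many random gene sets of the same size and see how often they come out as tightly connected; that resampling step is not shown here.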
When we zoomed into the GPCR cluster and superimposed the expression data, it started to get very interesting. So what we're seeing here is the network. And notice that there are actually two kinds of edge. There are these directed edges. These are curated interactions from Reactome, where we know what the regulatory relationship is. Sometimes it's an arrow, and sometimes it's one of the T-bar inhibitions. For various reasons, there are many more positive interactions than negative, inhibitory ones in our set. And then the dotted ones are predicted interactions from the Bayes classifier. Triangles are linker genes. Red nodes are up-regulated in cancer versus normal neural stem cells, and blue is down-regulated. And what you immediately see is that there's this little cluster of angiogenic chemokines which are up-regulated. There's also up-regulation of angiotensin and the endothelin receptor. Then there's up-regulation of downstream things, which are connected via linker genes with all these positive regulations. So there are two interesting things about this. One is that glioblastoma is known to recruit vasculature, and so it's interesting that it's making angiogenic chemokines. It could be using that to recruit blood vessels. But even more interesting than that, it's expressing both the chemokine and the endothelin and VEGF receptors for that chemokine, which is suggesting an autocrine loop that hadn't previously been described. It is known in the literature that tumor stem cells induce angiogenesis more efficiently than normal neural stem cells do. So we could have predicted that. That's a nice thing. Then we can look at another cluster that's highly connected to this one, the CREB cluster. So that's a transcription factor; it binds the cAMP response element.
We find that CREB was brought in as a linker gene, but everything downstream and upstream of it is up-regulated in the tumor stem cells versus the normal ones. Okay, I'm not going to go into what CREB is. It is known to promote cell proliferation and to be up-regulated in some cancers, which is good. But the cute thing about this is the extensive crosstalk between the chemokine cluster and its receptor and the CREB cluster. So it suggests that what's happening is that the angiogenic chemokines are coming in here via the receptor. That's then activating phosphorylation of CREB. And it's then inducing this very large family of genes which promote growth. And so we have another autocrine loop. Yeah. So that does suggest an up-regulation of CREB activity, just not at the transcription level? Exactly, exactly. So for CREB, in this case, its mechanism of up-regulation is a phosphorylation event, which you don't pick up in the microarray. But it came in as a linker gene. And so we picked it up as having a possible role in this hypothetical autocrine loop. Okay. So that's kind of cool. Okay. Here's another one, which is very, very straightforward. IL-6 is up-regulated and its receptor is up-regulated. It's actually a very nice diagram that came out of this. IL-6 autocrine loops have been identified in lung and breast cancer, but not previously in glioblastoma. That's interesting. But then there's this correlated observation: one of the things that differentiates these brain tumor stem cells from normal neural stem cells is that the normal stem cells are very sensitive to a variety of chemotherapeutic agents, and the brain tumor stem cells are resistant. They seem to have the phenotype of multiple drug resistance. Well, IL-6 is associated with multi-drug resistance.
And the mechanism for this, from the literature, is thought to be that IL-6 is up-regulating the expression of a series of ABC transporters, which pump the chemotherapeutic agent out. The ABC transporters are actually not well represented in our functional interaction network because they are membrane bound, and a lot of the transmembrane channels just don't come up well in the screens. But we were able to go back to the microarray data set and look at the ABC transporters. And sure enough, one of them, ABCC4, was highly up-regulated. In fact, when we graphed the fold change in IL-6 against the fold change in the ABCC4 transporter for each of the cell lines that went in, we found this beautiful linear correlation between the two. And so now we're going back to these cell lines to see if the ones which have high ABCC4 are more chemo-resistant than others. So we got another hypothesis just by kind of staring at the network and thinking about the literature. Okay, so that's all I wanted to say about it, but I wanted to give you a flavor of how this actually can get you to leads quite quickly. Yeah, questions about that? Yes, just that, I assume a lot of that is in the plug-in that we... Yes, and so we can do exactly that analysis. The blue and red, is that binary? Or is that something that's variable? Yeah. Are you just binning it as up or down, or are you actually putting down the raw data? Yeah, I'm not going to try to zoom back there, but yeah, so that's actually a function of Cytoscape: you can map expression levels or any quantitative number onto colors. And so that's actually something you do after you load the network. Yeah, and Gary's going to show you how to do that. So you can make it binary, or you can make it a heat map or whatever.
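To make the binary versus heat-map distinction concrete, here is a minimal sketch of mapping a fold change onto a node color. It is not how Cytoscape implements its visual mapping; the function name `fold_change_color`, the `max_abs` saturation point, and the choice of white for unchanged genes are all assumptions made for illustration.

```python
def fold_change_color(log2_fc, mode="gradient", max_abs=3.0):
    """Map a log2 fold change to a hex color (red = up, blue = down).

    Hypothetical helper for illustration only; `max_abs` is the assumed
    log2 fold change at which the gradient saturates.
    """
    if mode == "binary":
        # Binary mapping: only the direction of change matters.
        if log2_fc > 0:
            return "#ff0000"
        if log2_fc < 0:
            return "#0000ff"
        return "#ffffff"
    # Gradient (heat-map style): white at 0, full red or blue at +/- max_abs.
    t = max(-1.0, min(1.0, log2_fc / max_abs))
    fade = int(round(255 * (1 - abs(t))))
    if t >= 0:
        return "#ff{0:02x}{0:02x}".format(fade)
    return "#{0:02x}{0:02x}ff".format(fade)


# Hypothetical values: a strongly up-regulated and a mildly down-regulated gene.
strong_up = fold_change_color(3.0)
mild_down = fold_change_color(-0.5, mode="binary")
```

The binary mode throws away magnitude, which is often enough for spotting clusters of co-regulated genes; the gradient mode keeps it, at the cost of a busier picture.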
Yeah, oh no, I'm sorry about that. Irina made these by selecting those edges and changing them to red. So in Cytoscape, she kind of dragged over them to select them and she made them red, just to highlight them. Yeah, any other questions? Yeah. Yeah, well, so right now in the plug-in, you'll see, you can choose whether or not to bring in linker genes, but you don't have control over the number of linker genes to bring in. It's actually problematic, because if you bring in too many linker genes, you're just going to swamp out your signal, and everything will be connected to everything else. If you don't bring in linker genes, you may miss interesting biological connections which didn't happen to be in your list. So the way Irina does it is she actually brings in the linker genes in steps until she qualitatively feels that the modules are getting too interconnected, and stops. The way that the plug-in does it is an experimental procedure in which we bring in linker genes until a series of measurements, including the number and size of modules and the average shortest path within them, starts to increase towards randomness, and then we stop. Okay, but we actually should be able to give you much more control over that, and we don't currently. Yeah. It is on the wiki and in the email that Michelle sent you, where she gave you a list of software to install. It's something. Oh yeah, it's in Cytoscape. The plug-in is in Cytoscape, and then the part that does most of the math and database access is actually sitting in Reactome. So you need an internet connection for it to work. Okay.
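The stepwise linker idea described above can be sketched roughly as follows. This is not the plug-in's actual procedure: the ranking of linkers by how many list genes they touch, the `min_links` cutoff, and the gene names in the toy network are all assumptions. The sketch only reports the number of connected components and the average shortest path after each step, and leaves the decision of when to stop to whoever is looking at the numbers.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source, allowed):
    """Hop counts from `source`, walking only through nodes in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr in allowed and nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def subnetwork_stats(adj, nodes):
    """Return (connected components, mean hops over connected node pairs)."""
    nodes = set(nodes)
    seen, components, total, pairs = set(), 0, 0, 0
    for n in sorted(nodes):
        dist = bfs_distances(adj, n, nodes)
        if n not in seen:
            components += 1
            seen.update(dist)
        # Each connected pair is counted from both ends; the mean is unchanged.
        total += sum(dist.values())
        pairs += len(dist) - 1
    return components, (total / pairs if pairs else None)


def add_linkers_stepwise(adj, genes, step=1, min_links=2):
    """Yield (linkers added so far, stats) after each batch of linkers.

    Assumption: a linker is any gene outside the list that touches at
    least `min_links` list genes; the best-connected ones are added first.
    """
    genes = set(genes)
    links = {n: len(genes & adj[n]) for n in adj if n not in genes}
    candidates = sorted((n for n in links if links[n] >= min_links),
                        key=lambda n: (-links[n], n))
    current, added = set(genes), []
    yield [], subnetwork_stats(adj, current)
    for i in range(0, len(candidates), step):
        batch = candidates[i:i + step]
        added.extend(batch)
        current.update(batch)
        yield list(added), subnetwork_stats(adj, current)


# Toy network with hypothetical gene names.
toy = build_adjacency([
    ("G1", "G2"), ("G3", "G4"),
    ("G1", "LINKA"), ("G3", "LINKA"),
    ("G2", "LINKB"), ("G4", "LINKB"),
    ("G4", "LINKC"), ("LINKC", "X"),
])
steps = list(add_linkers_stepwise(toy, ["G1", "G2", "G3", "G4"]))
for linkers, (n_components, mean_hops) in steps:
    print(linkers, n_components, mean_hops)
```

In the toy network the first linker joins two separate pairs of genes into one module, and the second shortens the paths within it. On a real network you would watch for the point where adding more linkers stops improving these numbers and starts connecting everything to everything else.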
Now, if you have IP issues, you should be aware that what the plug-in is doing is taking your gene list, sending it to Reactome, and then using that to download the active subnetwork. So your list is actually leaving your domain and entering Reactome's domain. We don't keep it or log it or anything, but if you have IP issues, you should be aware of that. Okay. Any other questions before I move on? I have a feeling I'm running out of time. How much time? Zero. Minus two. Minus two? Okay. So we'll cover everything that I wanted to talk about very, very quickly. You heard a little bit about this yesterday at the end: the gene set enrichment maps. This is a different approach for clustering gene lists by their relationships on networks. This is largely Gary's work. And it's designed to approach the problem of redundancy in gene sets. You do one of the over-representation analyses you learned about yesterday, and you get 10 different processes, and some of them have similar names and some of them have different names. Are they really talking about the same process or different ones? And so what enrichment mapping does, and I'm afraid I'm not going to be able to do this justice in the time that we have available, is to compute the relationship between the gene sets that are over-represented in A versus B and B versus A, and to use that relationship to cluster those concepts together to form a network of lists, where two lists will be close together if they're related by having similar enrichment or depletion profiles. Okay, so this is a type of relationship map.
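One simple way to quantify how related two gene lists are is the fraction of genes they share. The sketch below uses the Jaccard coefficient for that and keeps only pairs above a threshold as weighted edges. The function names, the `threshold` value, and the gene identifiers are assumptions for illustration; this is not a description of what the Cytoscape plug-in computes internally.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genes divided by all genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_set_edges(gene_sets, threshold=0.25):
    """Weighted edges between gene sets whose overlap meets `threshold`.

    `gene_sets` maps a set name to its member genes. The threshold is an
    assumed cutoff, chosen here only to drop unrelated pairs.
    """
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        weight = jaccard(gene_sets[name_a], gene_sets[name_b])
        if weight >= threshold:
            edges.append((name_a, name_b, weight))
    return edges


# Hypothetical enriched gene sets with made-up gene identifiers.
enriched = {
    "oxidative phosphorylation": {"g1", "g2", "g3", "g4"},
    "electron transport chain": {"g2", "g3", "g4", "g5"},
    "translation": {"g6", "g7"},
}
edges = gene_set_edges(enriched)
```

Here the two oxidative metabolism sets end up joined by a heavy edge while the translation set stays on its own, which is the kind of grouping that lets differently named but overlapping lists collapse into one concept.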
And what you get out of this, and I believe this is implemented as a Cytoscape plug-in, which you have. Yep, that's right. And which we will do this afternoon. That's great. So you'll learn more about that. Okay, you get these nice network clusters in which each node is one of your lists, and it's connected to other nodes, other lists, via a weighted edge where the width of the edge is proportional to the amount of sharing of over-represented or depleted genes in your set. Okay, and so this is a set comparing smokers to non-smokers. I don't know, is this lung tissue or something, Gary? Okay, lung tissue. And so oxidative metabolism is reduced in the non-smokers. So they've gone kind of anaerobic there. And all these lists have to do with oxidative metabolism. There's a decrease in protein translation. And an increase, fortunately, in detoxification pathways. So there are three lists here that have to do with detoxification. And you can get a very nice conceptual mapping of the processes that are changing in your two sets from this methodology. Okay. Okay, I'm just going to have to skip through this very quickly, but you can use network modules to classify diseases and to find biomarkers. One of the goals of molecular profiling of disease states is to classify patients or samples by disease mechanism, response to drugs, or clinical prognosis. So you'd like, for example, to find a molecular marker in breast cancer which will predict patients who have long-term survival and could be treated less aggressively, versus those that are predicted to have very aggressive disease, which should probably be treated with all guns going. Or to distinguish patients who will respond to chemotherapy from those who won't.
And the problem with creating biomarkers is that when you're trying to combine the results on 20,000 genes, you have many, many ways to combine them together. You have too many hypotheses to test, so you get very little statistical power. And the general idea here is to create an active subnetwork where you have maybe 10 or 20 modules of related genes, and then to create the biomarkers from there. So now, instead of having 20,000 different things to test, you're testing a few dozen. And some very nice work done in Trey Ideker's lab at UCSD essentially does this. I won't go into the details, but here's the reference so you can read about it. They created a 50,000-interaction network from a number of sources, including yeast two-hybrid data and Reactome. They then took two breast cancer cohorts and tried to create a classifier that would distinguish between patients with metastases and those without metastases. They extracted the subnetwork, they clustered it, they created a metric which related the probability of a patient's metastasis to the modules, and then they combined them using a greedy algorithm. And then they got a series of modules whose presence was highly predictive of metastasis. Nicely, when they took two independent cohorts of breast cancer patients, they got a similar network, the classification accuracy was better than traditional biomarkers based on individual genes, and the modules had better coverage of known cancer risk genes. One of the problems with biomarkers is that often you can come up with genes which seem to have nothing to do with the process at hand. Okay, that's all I have time for there. I'm going to touch very quickly on something that Quaid is going to talk about. You can use network analysis to identify the probable function of genes that have no known function.
If many of the genes in your list have no known function, you can actually look at the genes in their neighborhood and, by the guilt-by-association mechanism, you can make some predictions. I wanted to show you an example of predicting oncogenes on the Reactome functional interaction network, but I'm not going to do that. You can do that in your own free time. So in conclusion, what have we learned here? Pathway databases provide excellent qualitative information, but they vary considerably in their content, their curation policies, and their underlying data models. They're also restricted by their lack of coverage, and they're really not good for quantitative hypothesis testing. Networks, and particularly interaction networks constructed by integrating multiple data sources, provide better coverage of the genome and do not necessarily sacrifice accuracy relative to curation. When combined with pathway information, networks are a rapid way to provide clues to mechanism, and you can use them to generate hypotheses, to predict function, and to subclassify patients and samples. And that is it. Questions?