Hello everybody. Okay, so this lecture and its materials are covered under a Creative Commons license that allows them to be freely shared and redistributed, so do with them whatever you like. I'm going to give you a follow-up to your lectures from yesterday. We're going to talk in a little more depth about pathway and network analysis. Most of the examples I'll show are drawn from cancer, but everything I talk about is applicable to any model system or disease process. The objectives for this module are to understand the landscape of pathway and network analysis, but I'm only going to touch very briefly on individual tools and try instead to talk about the principles underlying them. We're going to talk about where pathway and network data comes from, how it's collected, how reliable the various sources are, the analytic approaches to the data, how you visualize and integrate it, and some applications of pathway enrichment analysis. You probably know why we're interested in pathway analysis, and that's why you're here. It's because high-throughput data is very multidimensional and hard to grasp. When you're sequencing thousands of genomes and tens of thousands of genes, it can be very difficult to interpret the results of experiments such as perturbation experiments. By projecting this multidimensional data onto a smaller number of biological pathways, you increase the statistical power of your analysis by reducing the number of hypotheses from the thousands to the dozens. And you're able to tell biological stories, such as identifying hidden patterns in gene lists or creating predictive models to explain your observations. And you can do such things as predict the function of unannotated genes based on where they associate in the pathway.
So, this is a very broad category. Pathway and network analysis is broadly defined as any analytic technique that makes use of biological pathway or molecular network information to gain insights into a biological system. It's a very large and rapidly evolving field with many different competing approaches. But basically, to do any of these tests you need two ingredients: you need your data, high-throughput biological data such as altered genes, proteins, or RNAs derived from an experiment, and you need an information source of known pathways and networks. So, what I mean when I talk about pathways versus networks. I think you probably covered this yesterday, but a pathway is an ordered collection of genes, or other macromolecular entities, with their relationships laid out. Here we're looking at a simplified version of the first steps of the EGFR pathway with its ligands, the receptor, and the first dimerization step, and it's laid out in a biologically intuitive manner which reflects the way we think about biological pathways. In contrast, a network consists of less well understood relationships among the various actors, and the main feature of this is that it's not semantically ordered. We can't as easily look at this and try to figure out where the start and end of it is. The network is actually probably more like what the cell sees. The cell doesn't recognize sequences of events, but the pathway is an easier way to understand it for our limited minds. So there are pathway databases and network databases, and some which have both. The pathway databases have the advantage of usually being curated. Typically they're created by a team of curators and authors who scour the literature and reduce the experimental findings into a textbook-style view of biological processes.
They have the advantage of offering a human-interpretable visualization of the pathway, and tools that make use of the cause and effect along the pathway. The disadvantage of these is that they don't cover the genome very well. The best pathway databases cover maybe half of the genome. They are always a little bit out of date, because they're waiting for the review article that comes in with a big summary of the pathway, and so they're going to be missing the latest knowledge. And a bigger problem is that each database has a different idea of where a pathway starts and ends. So the pathways will have different names and different constituents, and it can be hard to relate the results from one pathway database to another. Typical pathway databases are Reactome and KEGG. I'll be talking about Reactome a lot because I'm one of the PIs for Reactome and initially conceived of it many years ago. Both of these databases explicitly describe biological processes as a series of ordered biochemical reactions. Here is the basic Reactome model (KEGG uses a similar model), where the core of every step is a reaction. It has a series of inputs, and the reaction transforms them into a series of outputs. And along the way there can be regulatory molecules, or enzymes and catalytic activities, which affect the rate of the reaction. This basic model can cover a whole variety of things: proteins, small molecules, complexes, and even such things as changes in the topological location of a macromolecule. If there's a transport step in which a small molecule is moved from the cytosol into the interstitial space, there's going to be an input, which is the small molecule inside the cytosol, and an output, which is the small molecule in the interstitial space, and each of them is treated as a separate, related object.
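As a rough illustration of the reaction-centered model just described, here is a minimal Python sketch. The class and field names (`Entity`, `Reaction`, `catalysts`, `regulators`) and the transporter example are assumptions made up for illustration; they are not Reactome's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entity:
    """A physical entity in a specific cellular location (hypothetical model)."""
    name: str
    compartment: str


@dataclass
class Reaction:
    """One pathway step: inputs are transformed into outputs."""
    name: str
    inputs: List[Entity]
    outputs: List[Entity]
    catalysts: List[Entity] = field(default_factory=list)
    regulators: List[Entity] = field(default_factory=list)


# The same small molecule in two compartments is modeled as two
# separate but related objects, so transport is just another reaction.
molecule_inside = Entity("small molecule X", "cytosol")
molecule_outside = Entity("small molecule X", "interstitial space")
transporter = Entity("hypothetical transporter", "plasma membrane")

export_step = Reaction(
    name="export of small molecule X",
    inputs=[molecule_inside],
    outputs=[molecule_outside],
    catalysts=[transporter],
)

print(export_step.inputs[0] == export_step.outputs[0])
```

The point of the design is that location is part of an entity's identity, so a change of compartment can be expressed with the same input/output machinery as a chemical transformation.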
So, a very popular example is KEGG, the Kyoto Encyclopedia of Genes and Genomes. This is a huge library of information on hundreds of species. It contains the fully sequenced genomes of many species, plus proteins, pathways, chemical compounds, and diseases. Part of KEGG is its pathway collection, a collection of manually drawn pathway maps representing knowledge spanning metabolism, cellular processes, human diseases, and drug interactions. KEGG is partially open: you can browse it for free, but if you want to do data analysis on it or download it, you need to pay an annual subscription fee. And this should be kind of a familiar picture; you'll see a lot of these. This is the KEGG version of the cell cycle. One of the key features of KEGG is that each of its genes is identified by its enzymatic activity using its EC number. Reactome is completely open source and completely open access. It is free to use and free to download, you can set up mirrors of it, you can create your own Reactome. It is a 20-year-old project that I initiated with Ewan Birney from the European Bioinformatics Institute. It is a pathway database encompassing metabolism, signaling, and many other biological processes. It has a team of about a half dozen curators who work with authors in the field to curate literature reports into a machine-readable format. Every single statement is supported by the primary literature and cross-references many other informatics databases. It has a series of data visualization and analysis tools built into it, as well as being available as R and Python libraries and a series of desktop tools. It provides a Google Maps-style interactive reaction diagram.
It has textbook-style illustrations with clickable overlays, and it lets you do things like give it a gene list and find the pathways which your genes are participating in. You can calculate the over-representation of genes or proteins in a pathway, and you can do mapping from one species to another. This is a typical Reactome page. I've searched here for cell cycle. I get a little picture that was drawn by one of the curators or an author, and a human-readable summary of that pathway. Underneath that I can dig into it and start getting a detailed pathway diagram. Again, this is interactive and allows you to perform analysis directly on the diagram, and there are a lot of database fields. We'll go into Reactome a little bit more later, but now I'm going to switch to networks. In contrast to pathways, networks are much less attractive but very, very information dense. Whereas pathways capture only the well understood portion of biology, things that could go into a review paper, networks cover high-throughput data and less well understood relationships. That includes things like genetic interactions: you knock down a gene in yeast and the expression level of another gene changes. They're related in some way and we have no idea what the mechanism is, but we know there's an interaction. It also includes physical interactions, such as experiments in which you do a co-immunoprecipitation to get lists of complexes. We know that all the proteins in that complex interact with each other; we don't know exactly what their role is. Co-expression data, gene ontology term sharing, adjacency in pathways: anything that describes a bimolecular relationship can be put into a network. So network databases can be built via curation, and many are, but they can also be built automatically from high-throughput experiments.
They have extensive coverage of biological systems; they may cover the vast majority of human genes. But the evidence underneath these interactions is much more tentative, and one of the problems with networks is that they tend to have a lot of false positives. There are a series of curated networks in which some of the false positives have been removed by applying strict standards for evidence gathering. Among the ones that I tend to use are BioGRID, which has interactions from 80 different species and has 2.6 million interactions among 87,000 genes, and IntAct, which is a European Bioinformatics Institute project that Reactome collaborates with quite closely. That has 9,000 species and 1.2 million interactions. And locally, GeneMANIA, which is one of Gary Bader's projects, is a compendium of 2,800 gene association networks representing 660 million interactions across nine species. I'll give you a little look at that. So here's a typical search on IntAct. I searched for everything that interacts with TP53, and it found something like 9,000 different interactions, which I then filtered down to a smaller number of physical interactions that I could visualize. It's showing the interaction graph here. I don't know how well that shows up; you can't see the lines very well on that screenshot, unfortunately. And then with this you can do things like, given a gene list that you've come up with, identify pathways or networks that contain them and see how far away they are from each other. GeneMANIA has a much better visualization than either IntAct or BioGRID, so it's the one I tend to use when I can. It has 2,800 networks that have been brought into a framework in which you can do a variety of things, but the most interesting thing is what it does when you have a gene list of, say, a dozen genes.
It will create a network for you that contains the genes in your list, as well as other genes that are necessary to link them together. So it allows you very rapidly to search through it and find potential pathways that relate your genes. And you can select the type of network to interrogate: you can look at only genetic networks, or genetic plus physical networks, or things that are related by gene ontology terms or co-expression. So it's a very flexible and very useful tool for doing data exploration. Now I'm going to talk a bit about what the visualization and analysis tool landscape is for biological networks. There are a series of desktop tools that are very useful. VisANT, which is shown on the left, is designed particularly for metabolic interaction networks, and it builds up networks, color-coded by function, that contain a gene set of interest. NAViGaTOR is similar, but it gives you a 3D view: you can rotate a 3D representation of the network that you've built up. And then Cytoscape, which I think you're all familiar with, does pretty much everything that you might want and has a very rich ecosystem of plugins that provide both simple and very advanced network analysis. I'm going to talk a little bit about the different types of pathway and network analysis that you may see. There are basically three types of modeling tools. The most basic one is gene set enrichment analysis. In this type of analysis you have a set of genes that came from a high-throughput experiment, such as an RNA expression experiment, where you've got some genes which were upregulated and other genes which were downregulated, for example. What these tools do is divide the biology into a series of pathway bags: all the genes that are involved in cell cycle, all the genes that are involved in apoptosis, all the genes that are involved in immune signaling.
And there are a number of algorithms which are able to identify pathways which are over-represented in your gene set, to give you a sense of what biological processes have been changed in your experiment. So these tools are essential for taking a large data set and identifying the biological processes which may be altered in that experiment. The second type is used more for discovery. These are subnetwork construction and clustering tools, where you have a gene set of interest and you want to construct a network which relates them. This is useful for finding new pathways that are altered in cancer or other diseases, or for identifying clinically relevant tumor subtypes which differ by which pathways or which subnetworks are activated in that disease process. And finally, the most sophisticated and hardest to use of these tools are ones which do pathway-based modeling. These actually use the information underlying a pathway or network database to construct a model of your biological process, such that you can make predictions from it. For example, you analyze a single-cell perturbation experiment in which a series of genes in a pathway were knocked up or knocked down, and you see the downstream effects of that perturbation. These systems will build a model which then allows you to predict new biology: what happens when you knock up or knock down a gene which was not in your experimental data set. So these are used for things like precision medicine, prediction in patients. If I give this patient a drug, given their mutation profile, do I expect that drug to be effective? Okay, so we'll talk a little bit about these techniques. So, enrichment of fixed gene sets. I believe you have covered that.
Yes, yesterday. But just as a reminder, this is the single most popular form of pathway and network analysis. There are two branches of it. One is simple over-representation analysis, in which you have to choose a threshold in your data set. So, for example, in an expression data set you have to set a threshold to sort genes into two bins, upregulated versus downregulated. And then it'll find the pathway or pathways which are over-represented in the upregulated set. The problem with over-representation analysis is that you have to choose a threshold, and it's not always obvious how to do that. The other is functional class scoring, the exemplar of which is GSEA, which I think you used yesterday, is that correct? Okay. It ranks the pathways by the degree of over-representation in the data set and provides the ability to form a continuous function across them, so it doesn't require you to find a threshold, and it gives you a series of p-values and false discovery rates for the pathways of interest. The advantages of both these techniques are that they're easy to perform, they have lots of good end-user tools, and the statistical models are very well worked out, so you get pretty accurate false discovery rates and p-values. The disadvantage of any of these systems is that there's a fundamental assumption that you can take human biology and partition it into non-overlapping sets of pathways, a series of bags. And we know that isn't true: pathways are heavily overlapping. In fact, when you use any of these tools you usually get a series of overlapping enriched pathways which contain some of the same genes, and you have to apply additional tools to try to figure out what is an independent observation and what is not. Nevertheless, this is where most people start, and it's adequate for the vast majority of uses.
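To make the over-representation idea concrete, here is a small sketch of the one-sided hypergeometric test that is commonly used for this kind of analysis. The function name `ora_p_value` and the example numbers are made up for illustration, and real tools add multiple-testing correction on top of this.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least `overlap` pathway genes when drawing
    `list_size` genes at random, without replacement, from a universe of
    `universe_size` genes that contains `pathway_size` pathway genes.
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes in the universe, a 100-gene pathway,
# 200 upregulated genes, 10 of which fall in the pathway (about 1 expected).
p = ora_p_value(universe_size=20000, pathway_size=100, list_size=200, overlap=10)
print(f"p = {p:.3g}")
```

Note that `list_size` depends on the threshold you chose for "upregulated", which is exactly the weakness of over-representation analysis mentioned above.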
So, here is an example of pathway enrichment done with Reactome's online tools. Go to the Reactome website and click on Analysis, and you get a page like this which offers a variety of online tools. The simplest one is the gene set enrichment analysis, in which you cut and paste a list of genes and, optionally, a list of values associated with them, such as expression levels. Here I've cut and pasted a list of cancer-associated genes taken out of COSMIC, which is a cancer database, to see where all the COSMIC genes fall on a pathway map. And this is one of the visualizations that you get from this analysis. This is a Voronoi plot, also known as a bubble plot or foam plot, of all the pathways known to Reactome. We've highlighted the ones which are over-represented in that gene set, and we're more or less getting what we expect. We're seeing gene expression, signal transduction, developmental biology, DNA repair, cell cycle, and the immune system, among other things, over-represented in the cancer data set, which is exactly what we expect. Then we can zoom in on this, go down to the pathway level, and actually see which genes were in our gene set and how they're related to each other. And there's a large table underneath this with the p-values and false discovery rates, and a lot of underlying information you can browse through. Okay, so that's gene set enrichment analysis. Now we'll talk about de novo subnetwork construction and clustering tools. These are tools in which you start with a network, typically, and you apply a list of all your genes and proteins to it to identify topologically unlikely configurations. That is, in your gene, protein, or RNA set, is there a subset whose members are closer to each other on the network than you would expect by chance? Those are likely to be interacting with each other and shed light on your biological question.
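One simple way to ask whether a gene set is "closer than you would expect by chance" is a permutation test on network distance. The sketch below is only an illustration of that idea, not the method of any particular tool; the toy ring network, the function names, and the choice of mean pairwise shortest-path distance as the statistic are all assumptions, and the code assumes a connected network.

```python
import random
from collections import deque
from itertools import combinations


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_pairwise_distance(graph, genes):
    """Average shortest-path distance over all pairs in `genes`."""
    dists = {g: shortest_path_lengths(graph, g) for g in genes}
    pairs = list(combinations(genes, 2))
    return sum(dists[a][b] for a, b in pairs) / len(pairs)


def proximity_p_value(graph, genes, n_permutations=1000, seed=0):
    """Fraction of random gene sets that are at least as compact as `genes`."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(graph, genes)
    nodes = sorted(graph)
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(nodes, len(genes))
        if mean_pairwise_distance(graph, sample) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy network: 20 genes arranged in a ring, each linked to its two neighbors.
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
observed, p_value = proximity_p_value(ring, [0, 1, 2])
print(observed, p_value)
```

Real interaction networks have hub genes, so practical tools usually pick random sets that match the degree distribution of the query genes rather than sampling uniformly as this sketch does.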
Then these algorithms will extract the clusters of those topologically unlikely configurations and annotate them with a variety of biological annotation sources, such as gene ontology terms. So here's an example of doing this. I'm afraid this is a really old example, from 2010, but still a good one. One of the things that Reactome has done is to take its curated pathway data, turn it into a network, add high-throughput data from other interaction network database sources, and then do a little bit of machine learning in order to remove false positives. This gave us what we call a functional interaction network that has 11,000 proteins and 270,000 interactions in it. It's relatively small by interaction network standards, but because we filtered it, it has a relatively low rate of false positives. Okay, and then to this we applied a set of genes which were upregulated in triple-negative breast cancer and identified the topologically unlikely configurations of those genes. That split the big network into a smaller network consisting of a series of genes which are interacting with each other frequently and have cross connections to other genes, so we clustered this into a breast cancer disease module of 10 to 30 genes. All right, so this is just some information on the functional interaction network. Our calculated false positive rate is less than 1%. Our false negative rate is very high, 80%, because we're only covering a fraction of the genes. Okay, so here's something that you can do with it. We looked at 52 pancreatic cancers and identified their somatic mutations. If you just look at the distribution of the mutations, the KRAS gene, which is the very first one (those are a little crowded), is there in over 90% of the cancers, and then there's a very, very long tail of less frequently mutated genes.
Can we use the information from this tail to understand more about the biology of pancreatic cancer? So we apply that mutated gene list to the Reactome functional interaction network, and sure enough, we get a series of clusters out of this. There's a big one that has KRAS right in the middle. We annotated it with gene ontology terms, and we get ErbB signaling, FGFR signaling, and axon guidance; it's a big signaling module. And then we get smaller modules with hedgehog signaling, calcium binding, the spliceosome, and axon guidance. Most of these are known from the literature; some of them were new processes. Then we can take the modules and cluster the patient samples by which modules are active in that particular patient. And sure enough, that clusters pretty well. The patients are here on the y-axis and the modules are there on the x-axis. We identified four different types of pancreatic cancer, which are distinguished by which modules are active. Then we looked to see if any of these tumor types were predictive of survival or response to drugs, and they weren't. But when we applied the same method to breast cancer, which is what I was referring to before, and did the same analysis, we found a module whose expression is correlated with patient survival. This is a Kaplan-Meier curve. These are the months in which the patient was disease free. Patients that have a low expression of genes in this module do better than those that have a high expression. And when you look at what's actually in here, it's cell cycle and Aurora B signaling, so it's just associated with mitotic rate. So this is actually a pretty good prognostic marker for breast cancer, discovered through de novo subnetwork clustering. So here are some of the clustering algorithms that you can use.
GeneMANIA is the easiest to use and really, really useful. Then there are two Cytoscape apps which I can recommend. One is called HyperModules, and the other is the ReactomeFI network app, which was used to generate those examples before. And then there is another very popular algorithm called HotNet, from Ben Raphael's lab at Princeton, which uses an interesting model in which the network is represented as a metallic lattice. It takes your mutated genes, or your genes of interest, and makes them hot, and then it uses a heat diffusion model to find neighbors. I don't know why it works, but it works quite well. Okay. All right, now we're going to talk about pathway-based modeling. You have a list of your altered genes, and you apply them directly to biological pathways, not to networks, in an attempt to discern the biological relationships. This basically shades into systems biology: we're trying to make mathematical models, based on the prior information from the pathway, to explain the empirical observations. These start to get into heavy math. For metabolomics and biochemical systems, there are tools that represent the pathway as a series of partial differential equations. One of the more popular tools is CellNetAnalyzer, and it's used for metabolic relationships. For kinase cascades, people use the tools NetPhorest and NetworKIN. These are network flow models which model the flow of information through a kinase cascade and can be used to construct models of kinase signaling. And there are some tools which attempt to reconstruct the transcriptional regulatory network, based on high-throughput RNA-seq or microarray data, to identify key nodes in the network such as transcriptional master regulators. The oldest of these is ARACNe, from the Califano lab at Columbia University.
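To give some intuition for the heat diffusion idea mentioned above for HotNet, here is a toy sketch. It is not HotNet's actual algorithm: it uses a simple diffusion-with-restart update on a made-up five-gene chain, and the function name `diffuse_heat`, the `restart` parameter, and the gene labels are all hypothetical.

```python
def diffuse_heat(graph, hot_genes, restart=0.3, n_steps=50):
    """Spread heat from `hot_genes` over an undirected network.

    At every step each node passes its heat equally to its neighbors,
    and a fraction `restart` of the original heat is re-injected at the
    source genes. After enough steps the values settle down.
    """
    initial = {node: (1.0 if node in hot_genes else 0.0) for node in graph}
    heat = dict(initial)
    for _ in range(n_steps):
        updated = {}
        for node in graph:
            incoming = sum(heat[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - restart) * incoming + restart * initial[node]
        heat = updated
    return heat


# Toy network: a chain of five genes, with the mutated gene "a" at one end.
chain = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
heat = diffuse_heat(chain, hot_genes={"a"})
for gene in sorted(heat):
    print(gene, round(heat[gene], 3))
```

Genes that end up warm without being mutated themselves are the "neighbors" the heat model finds; with several hot genes, regions where heat from different sources overlaps are candidate subnetworks.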
And then there are probabilistic graphical models, such as the PARADIGM tool from David Haussler's lab at UCSC, which represents the pathway as a series of influence nodes, where information comes into one reaction and goes out to several others, and which uses a Bayesian probabilistic network to infer how each node will influence the next one. So it propagates activation through the network. Here is a little look at how PARADIGM works. Here's an example of a very simple pathway with just two genes and an output, apoptosis. One gene is an inhibitor and the other is an activator. PARADIGM then expands this into all the steps of the central dogma. So MDM2 is a gene, and it makes an RNA, and that makes a protein, and the protein becomes an active protein. And TP53 is a gene that makes an RNA that makes a protein, which becomes the active TP53 protein, which is inhibited by the active MDM2 protein. So it expands this graph into all the macromolecular entities and says, okay, if I have introduced a mutation into MDM2, then this decreases the activity of the gene, which decreases the RNA, which decreases the protein, so there's less active protein, and therefore it's going to disinhibit apoptosis. It's a very complicated way of modeling something that you can see very clearly, but when you get to large pathways it becomes useful. And this allows you to model multiple changes at different levels, including mutations, copy number changes, changes detected by proteomics, and changes that arise from post-transcriptional and post-translational processes. PARADIGM has been very, very successful.
And here's an example of it interpreting biological data. This is an old publication from TCGA in which PARADIGM was applied to glioblastoma multiforme, and it very rapidly identified four different GBM subtypes which had clinical correlates. I'll end with an experiment designed to get at a core question which had not actually been addressed. Pathway modeling sounds great, and people use it, but can you actually use curated pathways to predict biology? A couple of years ago I went to the literature to see if anybody had actually done empirical experiments in which they took a pathway from a curated pathway database, stared at it, and used it in a systematic way to make predictions about what happens when you knock out or knock up different genes in the pathway. I didn't find any systematic studies, so with the Reactome staff we did this experiment. We asked two things. If you have a curated pathway, can an expert gazing at the pathway diagram predict what happens when you knock out some of the genes that are at the top of the pathway? And if that's successful, can an algorithm do the same thing? I know it seems obvious and you think it's going to work, but it had never been tested. Step one of this experiment is that we selected 10 cancer-related pathways that were well annotated, and from each of them we gathered key input and output pairs. The curators decided, using a series of rules that we put together, which were the key inputs that we wanted to test and which were the key outputs of pathway activation. Overall we collected 4,968 pairs from those 10 pathways.
Next, the curators went out to the literature to identify functional genomics experiments in which one of the key inputs was perturbed and the effect on key outputs was measured. That was by CRISPR or RNAi experiments, and we were careful not to use papers that had already been curated into Reactome, because that would be cheating. For those nearly 5,000 test pairs, we found 531 papers which reported empirical results on 847 of them. Next, for those 847 cases, the curators stared at the pathway diagrams and tried to make predictions. We had a rulebook for how they would do this, and we had three curators doing it. We first ran an experiment in which they made predictions independently, and we found high correlation among the three curators, so on that basis we divvied the 847 up among the three people to make their predictions. At the same time, a talented computer scientist in the laboratory created a simple graph-based analytic tool called MP-BioPath, which basically does Boolean modeling of the pathway to make predictions automatically. It has a very simple algorithm at its core. All it does is count, between the input and the output, the number of positive arcs and the number of negative arcs, and figure out from that whether the overall effect is going to be up, down, or no change. And then it has some rules for what happens when you have things combining. Okay. So the results were actually, I think, pretty good. The curators' accuracy was 81%. So overall, 81% of the time when a gene was knocked down and it was predicted that the activity of the target would go down, they were correct. The algorithm, MP-BioPath, was also not bad. It wasn't as good as the curators, but it was correct 75% of the time. If you just randomly guessed, it would be 33%. So these results were significant.
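The arc-counting idea can be sketched in a few lines. This is a deliberately simplified illustration and not MP-BioPath itself: the signed-graph representation, the function names, and the rule that conflicting or missing paths yield "no change" are all assumptions. The toy pathway reuses the MDM2, TP53, and apoptosis example from the PARADIGM discussion.

```python
def path_signs(arcs, source, target, visited=None):
    """Return the overall sign (+1 or -1) of every simple path source -> target.

    `arcs` maps a node to a list of (next_node, sign) pairs, where sign is
    +1 for an activating arc and -1 for an inhibiting arc. Multiplying the
    signs along a path is the same as checking whether the number of
    negative arcs on it is even (+1) or odd (-1).
    """
    visited = (visited or set()) | {source}
    if source == target:
        return [1]
    signs = []
    for next_node, sign in arcs.get(source, []):
        if next_node not in visited:
            for downstream in path_signs(arcs, next_node, target, visited):
                signs.append(sign * downstream)
    return signs


def predict_effect(arcs, source, target, perturbation):
    """Predict 'up', 'down', or 'no change' at `target`.

    `perturbation` is -1 for a knockdown of `source` and +1 for
    overexpression. Hypothetical rule: if there is no path, or paths
    disagree in sign, report 'no change'.
    """
    signs = set(path_signs(arcs, source, target))
    if len(signs) != 1:
        return "no change"
    effect = perturbation * signs.pop()
    return "up" if effect > 0 else "down"


# Toy pathway: MDM2 inhibits TP53, and TP53 activates apoptosis.
toy_pathway = {
    "MDM2": [("TP53", -1)],
    "TP53": [("apoptosis", +1)],
}
print(predict_effect(toy_pathway, "MDM2", "apoptosis", perturbation=-1))
print(predict_effect(toy_pathway, "TP53", "apoptosis", perturbation=-1))
```

A sketch like this also shows why a missing feedback loop in the curated pathway can flip a prediction: one unrecorded inhibitory arc changes the sign of the whole path.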
So what we're looking at here is a summary of that. The greens are true positives, the oranges are false positives, and the cyans are false negatives. The outer ring is MP-BioPath, the algorithm, and the inner ring is the curators' predictions, showing that the curators are doing a little bit better than the algorithm. When we looked at where the errors were, the largest source of errors was here, the false negatives. When we looked at those in detail, it turned out that those are cases in which a key step was missing from the curated pathway. We hadn't kept up with the literature, so there's biology missing. The largest source of false positives was one in which the curators or the algorithm predicted that there would be a change and that it would be up, but in fact the empirical result was that it changed but it was down. These also usually turned out to be related to missing genes in the pathway, where there was a feedback loop of some sort which was changing the direction of the relationship. There was excellent concordance between the algorithm's predictions and the humans' predictions. This is an ROC curve. How many people here are not familiar with ROC curves? This is the rate of false positives, and that's the rate of true positives, where we're treating the humans' predictions as ground truth. In this case, random guessing would give you this curve. Perfect guessing would give you a curve that goes straight up: you have zero false positives and 100% accuracy. So it would look like this. We're getting a fairly steep curve here, which can be interpreted as being 91% concordant. So the conclusion from this was that, in fact, pathway databases are useful for modeling biology and can be used, even very simply, for making predictions, but we need more funding to do more pathway curation. That's my pitch when I write grants.
So, conclusions and takeaways. Pathway analysis allows you to discover biological processes hidden in large-scale data. You have a plethora of databases and tools to choose from. The curated pathway databases, after many years, are now reaching levels of completeness that allow for accurate prediction of perturbations. And the field is really ripe for machine learning approaches, because we've got all these great curated data sets and interesting relationships. I've left you with links to the various tools that I mentioned. I'm happy to answer any questions.