Okay, we'll get started, since we've had a good break. I was reminding you about pathway enrichment analysis. You have a gene list, we have gene attributes like pathways, and we want to find which attributes or pathways are enriched in the gene list. There are a couple of tools available for this, like DAVID and GSEA. We'll be talking mostly about DAVID, with a brief mention of GSEA.

Gene set enrichment analysis is the more specific term that you might find for this pathway enrichment analysis. The idea, as I've mentioned, is that you have a bunch of gene sets, which are related to pathways, for instance, or parts of the cell, and then you take your gene list and test it against them. So here's some gene expression data, and we have some identifiers that come from the experiment. They might be Illumina IDs or GenBank IDs, I'm not sure. These might be ranked by some value, or they might be just a flat list, and the genes are categorized into these different categories. Then we ask: is the list of known cell cycle genes significantly enriched in this gene list? If we know that this part of the list is the set of upregulated genes, we can also say cell cycle is enriched in the upregulated section. p53 might fall in the part of the list that we know is downregulated, so we might be able to say p53 is downregulated. And these others, nuclear pore and ribosome, are not significant.

So that's the idea. We have our gene list and our gene set database, and we use an enrichment test to find pathways, for instance, that are enriched.

Just a couple of notes on terminology. It's a little bit confusing because there are a lot of different types of gene lists. I try to always use the term gene list for the gene list that comes from your
experiment. It's the list of genes that you're interested in. A set of genes that are in a pathway I usually call a gene set: a known set of genes that are related to each other.

So we search the gene list from our experiment against the database of previous knowledge that is present in gene set databases, like all the known pathways. Then we do our enrichment test, we find pathways that are enriched, and we interpret this list. Importantly, this list of enriched pathways might suggest additional hypotheses that we would then follow up. If you're just interested in summarizing a list of genes, you might stop here and just report the results: we found these pathways enriched in my gene list. But if you're interested in following up, you might say, oh, apoptosis pathways are enriched, what can I do with that knowledge? Maybe I can design an experiment that uses an apoptosis inhibitor, and that might have a major effect on my cells, if I have cancer cell lines available.

I'm going to go into a little more detail about how the enrichment test works statistically, and I'm also going to tell you some of the important additional things that you're going to need to know. This next part is very conceptual. It's applicable to any enrichment test. Then we'll focus on one particular piece of software that is web-based.

So here's the gene list, this rectangle here, and the yellow square indicates the list of genes that I've defined as significant somehow. I'll typically use examples from microarray gene expression data, because that's sort of the standard type of data.
That's the type that has most often been analyzed with this kind of analysis. So these yellow genes might be the upregulated genes, and all the rest are genes that are not significant, maybe not significantly differentially expressed. That's just the general background.

Okay, so I take the list of genes that are interesting, that are significant in the experiment, and I compare them to a gene set. The gene set is the circle: the circle represents genes that are in a pathway. I look at the overlap, and I say, oh, there's a certain fraction of the significant genes in my experiment that overlap this pathway, that are part of this pathway. And there's a certain number of background genes that overlap this pathway too. Then I want to know: is this overlap larger than I'd expect by random sampling of the array genes?

If I just roll the dice and select genes randomly, this square might be a list of genes that I select randomly from the array, and here's another one selected randomly from the array. I want to know how often a set of this many genes, maybe it's 30 genes or something, gets this much overlap with the gene set. One way to answer that question is to roll the dice and just keep on trying: each time you pick 30 genes and you see how much that list overlaps a particular pathway. But that's time-consuming, because you have to do it thousands of times. So there's a statistical test called Fisher's exact test, which is also known as the hypergeometric test, which computes this information for you, and you get a p-value from that test.
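The rolling-the-dice idea can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the array size (1,000 genes), the pathway size (100 genes), the list size (30) and the observed overlap (8) are all hypothetical numbers, and `simulate_overlap_pvalue` is a name invented for this sketch, not something from any tool mentioned here.

```python
import random

def simulate_overlap_pvalue(background, pathway, list_size, observed_overlap,
                            n_draws=10_000, seed=0):
    """Estimate P(overlap >= observed) by drawing random gene lists."""
    rng = random.Random(seed)
    pathway = set(pathway)
    hits = 0
    for _ in range(n_draws):
        # One roll of the dice: a random gene list of the same size.
        draw = rng.sample(background, list_size)
        overlap = sum(1 for gene in draw if gene in pathway)
        if overlap >= observed_overlap:
            hits += 1
    return hits / n_draws

# Hypothetical toy setup: 1,000 genes on the array, 100 of them in the pathway.
background = [f"gene{i}" for i in range(1000)]
pathway = background[:100]

# A random list of 30 genes is expected to hit the pathway about 3 times,
# so an observed overlap of 8 should turn out to be rare.
p_est = simulate_overlap_pvalue(background, pathway,
                                list_size=30, observed_overlap=8)
print(f"estimated p-value: {p_est:.4f}")
```

The estimate gets better with more draws, which is exactly why this approach is slow compared with computing the probability directly.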
This is a list of genes from my experiment, and here's my background population. I have a certain number of blue genes here and a certain number of red genes. If I randomly pick some genes from this bin, I expect to see more red genes, because there are 4,500 red genes and 500 blue genes. But in my gene list I have four blue genes and one red gene, so it looks like my gene list is highly enriched for blue genes. If I were picking randomly, about one gene in ten would be blue, but here I have four out of five.

The null hypothesis of this test is basically that the list is a random sample from the population, so blue is no more common in it than in the background. The alternative hypothesis is that there are more blue genes than we expect.

This is based on the hypergeometric distribution, which the statisticians in the audience may know quite a bit about. This is the expected distribution of the number of blue genes you'll pick from this bin, given a gene list of this size. So here's the probability that I'll select zero blue genes, the probability that I'll select one blue gene, two blue genes, et cetera. The p-value for picking four blue genes is actually the sum of the probabilities of picking four blue genes or five blue genes, basically all the numbers that are at least as high as the number observed here.
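To make that tail sum concrete, here is a small sketch using only the Python standard library. It plugs in the numbers from the slide (5,000 background genes, 500 of them blue, a list of 5 with 4 blue); the function name `hypergeom_tail` is just a label chosen for this example.

```python
from math import comb

def hypergeom_tail(n_background, n_in_set, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    total = comb(n_background, list_size)
    upper = min(list_size, n_in_set)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Numbers from the example: 500 blue + 4,500 red genes, 4 blue in a list of 5.
p = hypergeom_tail(n_background=5000, n_in_set=500, list_size=5, overlap=4)
print(f"p = {p:.2e}")
```

The result should land close to the 4.6 × 10^-4 quoted for this example, the sum of the "four blue" and "five blue" bars.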
We don't have five blue genes, but the case of five is included in that sum along with four. So the sum is the p-value, basically the probability that we'll get four or more blue genes. It's 4.6 × 10^-4, which is very significant. It's very rare that I would pick four blue genes out of a list of five here.

We typically like to think of this as over-enrichment: blue genes are enriched. (It says black on the slide, but because it comes up blue on this screen, I'm just converting it to blue as I go.) Normally we're talking about over-enrichment, but you can also test for under-enrichment. The under-enrichment of black is the over-enrichment of red, so that's just the opposite.

The other important thing is that you need to choose the background population appropriately. Say your experiment doesn't sample the entire genome: for instance, it's a microarray that only has half the genes in the genome on it. You can never find genes that are not on the microarray, so you shouldn't include those in your statistical test. They're not part of the background that you use to calculate the expected number of genes that you'd get. So the background population is the set of genes that you could find in your experiment. For next-generation sequencing analysis it's any gene, right, because you could find DNA from any gene.
It's not really restricted, unless you do something like exome sequencing, where you've selected for specific exons and those might not cover every known gene in the genome. So that might be a more restricted background.

So far, Fisher's exact test as I've explained it is useful for testing the enrichment of one pathway. If I want to test more pathways, I repeat Fisher's exact test, and I'll say a little more about that later.

One of the problems that happens with statistical tests is that if you keep on doing the same test over and over again, you might find that one of your random draws looks significant. Say I want to know the probability of picking four black genes out of a list of five, and I just keep on picking genes randomly from this bin. I get a whole bunch of different draws, and then eventually, a few thousand draws later, I find one that has four black genes out of five. It's sort of a way of winning the lottery by playing many, many times. If someone tells you there's a particular probability of winning the lottery, say one in a million, that's the probability of winning when you play once. But if you play the lottery one million times, you're probably going to win, right? You expect a random draw with the observed enrichment about once every 1/p-value draws. So you have to correct for this if you're doing multiple tests.
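The lottery argument can be checked with a couple of lines of arithmetic. This sketch assumes the tries are independent, which is a simplification, and `prob_at_least_one_hit` is a name made up for the example.

```python
def prob_at_least_one_hit(p_single, n_tries):
    """Chance of at least one 'win' in n_tries independent tries."""
    return 1.0 - (1.0 - p_single) ** n_tries

p_single = 1e-6  # one-in-a-million chance per try
for n_tries in (1, 1_000, 1_000_000):
    chance = prob_at_least_one_hit(p_single, n_tries)
    expected_wins = p_single * n_tries
    print(f"{n_tries:>9} tries: P(at least one win) = {chance:.6f}, "
          f"expected wins = {expected_wins:g}")
```

With one million tries the expected number of wins is one, which is the "once every 1/p-value draws" rule of thumb, and the chance of at least one win comes out at roughly 63%.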
This is what's called the multiple testing problem, and the fix is what's called multiple testing correction. If you keep on running the same test over and over again, you will, by chance, find something interesting, and you need to correct for that. The correction depends on the number of tests that you're doing.

Here's an example that's a little bit closer to pathways. Here's a list of genes, and say black is one particular pathway and squares are another particular pathway. I find a mix of genes: some are circles, some are squares, some are red, some are black. So those are two different annotations, red versus black and circle versus square. If I draw one random sample of genes but then evaluate different annotations on it, using Fisher's exact test each time, that's another way of doing the same test over and over again.

That's what we do in gene set enrichment analysis. We take the gene list, we take a gene set from the database, like a pathway, and we use Fisher's exact test to evaluate the p-value of enrichment. Then we take the next gene set in the database, another pathway, and evaluate it, then the next pathway, and we keep on doing that for all the different pathways. Each time we ask: is this particular pathway enriched in this gene list? That's multiple tests. So at a certain frequency you might expect one of those many pathways to be enriched just by chance. It's the same multiple testing problem, but a slightly different way of thinking about it.
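The loop over the gene set database might look something like the sketch below. Everything in it is a toy assumption: the background of 100 genes, the two gene sets and their names, and the helper names `hypergeom_tail` and `enrich` are invented for illustration and do not reflect how any particular tool is implemented.

```python
from math import comb

def hypergeom_tail(n_background, n_in_set, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap)."""
    total = comb(n_background, list_size)
    upper = min(list_size, n_in_set)
    return sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, list_size - k)
        for k in range(overlap, upper + 1)
    ) / total

def enrich(gene_list, gene_sets, background):
    """Run one enrichment test per gene set; return rows sorted by p-value."""
    background = set(background)
    genes = set(gene_list) & background   # ignore genes we could never observe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(genes & members)
        p = hypergeom_tail(len(background), len(members), len(genes), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda row: row[2])

# Hypothetical toy data: 100 background genes, two made-up gene sets.
background = [f"g{i}" for i in range(100)]
gene_sets = {
    "cell cycle": [f"g{i}" for i in range(10)],
    "ribosome": [f"g{i}" for i in range(50, 70)],
}
gene_list = ["g0", "g1", "g2", "g3", "g60"]

results = enrich(gene_list, gene_sets, background)
for name, overlap, p in results:
    print(f"{name:<12} overlap={overlap}  p={p:.3g}")
```

Each row is one test, so the number of rows is the number of tests that a multiple testing correction has to account for.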
There are a number of standard corrections for this problem. People may have heard of the Bonferroni correction, which is named after the person behind it. Basically, you take the original p-value and multiply it by the number of tests, and you get a corrected p-value. Here the number of tests is the number of annotations, the number of pathways, that we tested. The technical definition is that the corrected p-value is greater than or equal to the probability that any one of the observed enrichments could be due to random draws. So the Bonferroni correction tries to give you a p-value that you can trust, such that none of your significant hits are expected to be significant by random chance. The jargon for this is controlling the family-wise error rate. That's fairly common, and you might see the Bonferroni correction in a number of tools.

One of the problems with Bonferroni is that it can be very stringent and can wash away real enrichments that are interesting. So often people are willing to accept a less stringent condition, which is the false discovery rate, and that leads to a gentler correction. The false discovery rate is probably one of the most commonly used multiple testing corrections because it's a little bit gentler. The definition is that it's the expected proportion of the observed enrichments that are due to random chance. So if you've done an enrichment analysis with a thousand pathways and you're satisfied with a false discovery rate of five percent, then it means that about five percent of the enrichments you report might be due to random chance. But maybe that's okay for you.
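Here is a minimal sketch of the two corrections on a made-up list of p-values. The Bonferroni part is exactly the multiply-by-the-number-of-tests rule described above. For the false discovery rate, the sketch uses the Benjamini-Hochberg step-up procedure, which is one standard way to compute FDR-adjusted p-values; it is an assumption that a given tool uses precisely this variant.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five pathways.
pvals = [0.001, 0.008, 0.02, 0.03, 0.6]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

for p, b, q in zip(pvals, bonf, bh):
    print(f"raw={p:<6} bonferroni={b:<6.3g} bh={q:.3g}")
```

On these toy numbers, fewer pathways stay below 0.05 after Bonferroni than after Benjamini-Hochberg, which is the sense in which the FDR correction is gentler.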
Contrast the false discovery rate with the Bonferroni correction, which theoretically prevents any one of the observed enrichments from being due to random chance. FDR is a little bit more gentle because it allows some false discoveries in there, but you know what the rate is. You can set it: I don't want more than five percent false pieces of information. And you could potentially get more signal that way.

Yeah? So the question is: is the false discovery rate generally accepted? It depends, I guess, on your question and how important it is to avoid false positives. In general, for enrichment analysis, many people use the false discovery rate because people found that Bonferroni is too stringent. What you could do is try Bonferroni, and then if you don't get any enrichments, you could set a false discovery rate of one percent, five percent, or ten percent and see if you get more pathways that are enriched at that stringency. For something like pathway enrichment, where it's more of a discovery, exploration type of analysis, it's probably okay to accept some false positives as a trade-off for getting more signal, because you're going to manually look at the results and evaluate them. But for something else you might want the more stringent test.

Does that answer the question? Oh, yeah, that's what I was going to say. Depending on the software that's being used, there might be particular ways that the p-value is computed that are based on random sampling. Instead of using a Fisher's test, the GSEA software, which I'll talk about briefly at the very end, tries to simulate the whole enrichment system. It randomly creates lots of gene sets and sees how often you get a given result, so it calculates a p-value based
on a kind of random simulation. It actually tries thousands of different possibilities and computes the p-value just like that rolling-the-dice picture that I had before. So with some software you might want to use an FDR that's higher or lower, because the random model isn't exactly right for your situation. I'm not explaining this very well, but the take-home message is that different software with very different types of enrichment analysis techniques might have different acceptable false discovery rates. 10% or 20% might be okay for some analyses, so you can kind of tune it.

What I like to do is try different false discovery rates and see what the results are. If I'm worried that setting different parameters, like choosing Bonferroni or a false discovery rate at different levels, affects my results, I'll try different ones and see whether I always get the same answer. If I do, then it's not an issue. If I get very different answers by using these different parameters, then I know that I need to be a little more careful and think about the results more carefully in terms of false positives.

Okay, so you might see FDR in tools. Typically these corrections are calculated using the Benjamini-Hochberg procedure, which I won't go into. And I think I forgot to put this on the wiki.
There's a great document available online, which I'll add to the wiki as a link, that explains multiple testing correction very clearly. There are other corrections than just these two, but in tools you might see Benjamini-Hochberg, FDR, or just Benjamini. Sometimes the FDR threshold is called a q-value, so you might see that in tools as well.

Okay, so I talked about Fisher's exact test, which is also called the hypergeometric test, so when you go to these software tools you'll see those names, along with Bonferroni and Benjamini-Hochberg false discovery rate as multiple testing corrections.

That's for gene lists that are well defined. It's perfect for experiments that just generate a gene list. An example is assessing all the genes that are mutated in a particular sample: there's only a certain limited number of genes that are mutated, and that's my gene list. Now, as I mentioned earlier, if you have some value that allows you to rank genes, you might know that there are genes that are more mutated and genes that are less mutated, or, given gene expression data, genes that are upregulated or downregulated compared to control. In that case, if you want to create a gene list, you have to choose a threshold. Say, for instance, I care about genes that are more than two-fold overexpressed compared to control, and also genes that are two-fold underexpressed compared to control. So I've chosen that threshold, I've drawn a line here, and, imagining this is a ranked list, all these genes in gray I just don't include in my analysis. The problem with that is that I'm potentially throwing away information here, and I've chosen a
threshold, two times overexpressed and two times underexpressed, that might not really be the best threshold. It's sometimes very difficult to choose a threshold, because you just sort of choose one that people typically use, but you don't really have very good evidence for choosing that one and not another one.

So there's a second set of enrichment tests that use a ranked list and don't throw away any genes. There are no gray genes here that are thrown away. Instead you use the full list of genes, ranked, in this case for gene expression data, from upregulated to downregulated. The most popular software that implements a test like that is GSEA. How many people have actually used GSEA, the gene set enrichment analysis software? So, a few people have used it. We don't have time in this course to really go through it in detail, but if you have a ranked gene list, you can try out GSEA. It's a pretty easy tool to use and there are lots of tutorials. There are other statistical tests available for ranked lists as well.

Just briefly, because I don't have any slides going into the details: what these statistical tests do is assign a p-value to a particular pathway by looking to see if the genes in that pathway are clustered at the top of the list or at the bottom of the list more than you'd expect. If you took a pathway like apoptosis and you had any random ranked set of genes, you might expect apoptosis genes to be all over the place: equally at the top, at the bottom, and in the middle. If you just ticked off every gene that was an apoptosis gene going down the list, you'd expect to have an even distribution.
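The clustered-versus-spread intuition can be sketched with a simple permutation test on rank positions. To be clear about the assumptions: this is a rank-sum style illustration, not the statistic that GSEA computes, it only looks for clustering at the top of the list, and the ranked list of 200 genes and both pathways are made up.

```python
import random

def rank_clustering_pvalue(ranked_genes, pathway, n_perm=10_000, seed=0):
    """Permutation p-value that pathway genes sit nearer the top of the
    ranked list than randomly placed genes would."""
    rng = random.Random(seed)
    pathway = set(pathway)
    positions = [i for i, gene in enumerate(ranked_genes) if gene in pathway]
    observed = sum(positions)          # small sum = genes near the top
    n, k = len(ranked_genes), len(positions)
    hits = 0
    for _ in range(n_perm):
        # Place the same number of genes at random positions in the list.
        if sum(rng.sample(range(n), k)) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical ranked list, most upregulated first.
ranked_genes = [f"g{i}" for i in range(200)]
clustered_pathway = [f"g{i}" for i in range(10)]        # all at the top
spread_pathway = [f"g{i}" for i in range(0, 200, 20)]   # evenly spread

p_clustered = rank_clustering_pvalue(ranked_genes, clustered_pathway)
p_spread = rank_clustering_pvalue(ranked_genes, spread_pathway)
print(f"clustered at top: p = {p_clustered:.4f}")
print(f"evenly spread:    p = {p_spread:.4f}")
```

No threshold is chosen anywhere in this sketch: every gene's position in the ranking contributes, which is the point of the ranked-list tests.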
So there are statistical tests that tell you the probability of having a particular distribution. Maybe your apoptosis genes are all the way at the top, all the top genes are apoptosis. That's very different from a case where apoptosis genes are spread all over the list, and the p-value that you get from these tests tells you the difference between the two.

The main take-home message here is that if you have a plain gene list that's not ranked, you can definitely use Fisher's exact test, no problem, with the DAVID tool that we'll talk about later. If you have a ranked gene list and you can choose a threshold that you're interested in, you can also use DAVID. But if you want to use all the genes in your ranked gene list, then you can use one of these other tools; GSEA is the most popular. If you're interested in the statistical tests, you can look up the Wilcoxon or Mann-Whitney test, or the Kolmogorov-Smirnov test, the KS test. You might have heard of some of these, but we're not going to go into detail on them.

Okay, any questions so far? I guess the key point is that at the end of doing an enrichment test, searching all of the different pathways and pieces of information about gene function that you have, you get a list of enriched GO terms or enriched pathways and a p-value associated with each one, and that p-value can be corrected by multiple testing correction.

So, on to the tool that we've chosen to use in the lab.
It's called DAVID. It's from the National Institutes of Health, and they've made a website that allows you to upload a gene list, and then it will run an enrichment analysis for you. There are probably a hundred tools out there that do this kind of enrichment analysis, so you'll see it all over the place. It's actually fairly easy to construct tools that do this. But we chose DAVID because it has a nice website that's fairly easy to use, and it has a lot of gene sets that you can choose among.

I'm just going to go over this very quickly in PowerPoint slides, and then we're going to start the lab. For the lab we have a protocol, developed and put together for this course, that you can just follow.

As a quick summary: when you go to DAVID, there are some links up here, and one of them is Start Analysis. You click that one. This will all be in the protocol, so you can follow it there, and it might be pretty obvious when you just look at the site. Start Analysis takes you to a page that looks like this. You can click on the Upload tab, and it allows you to upload a gene list. If you don't have a gene list, you can use the demonstration list that they have here, but otherwise you paste a gene list in here. In this case I pasted in a list of genes that are frequently mutated in glioblastoma, the ones I mentioned earlier that are on the wiki.

Then you have to choose the list type, which is gene list or background. Remember I talked about background: you need to know the background of your experiment. For next-generation sequencing it's probably every gene in the genome, but you might have a microarray that only has a certain set of genes on it.
That's your background. You can define your own background in DAVID, but DAVID also has a whole set of backgrounds that are already known, so a lot of common microarray platforms are already stored in DAVID. It's important to select the right list type: here it's a gene list, not a background list. If you wanted to upload your own background list, which is the list of all genes on your microarray or all the genes that you could see in your experiment, you could upload it as a background and save it in your set of gene lists. Or you could upload your actual gene list, which is the one that you want to calculate the enrichment for. And you submit the list at the end.

DAVID also asks you to select an identifier type. In this case these gene names are official gene symbols, HUGO gene symbols that are officially recognized, so I can choose official gene symbol. If you don't know what identifier you have, you can select "don't know". Or if you choose the wrong one and DAVID doesn't recognize your genes, it will try to detect the correct identifiers to use and it will give you a summary.
It says, you know, I didn't recognize your identifier type, do you want to let DAVID try to guess it? And you can click this button to guess the type. It seems to work better, for us and for the labs that we're doing tomorrow, if you actually know the identifier type and choose the right one. The reason is that if you let DAVID do it, DAVID converts everything to its own internal identifier, and when you download the results you have the DAVID internal identifier, which is yet another identifier that you don't want. It doesn't give you your original identifiers back.

Okay, so that was uploading gene lists. If I click on the List tab, then I can choose the species that I'm looking at. Gene symbols are similar between human and mouse, so DAVID might complain that it doesn't know what organism you're talking about, so choose one. The list that I uploaded is here, and I make sure that's selected.

The second step is to analyze the list, and there are a few different types of analyses. The basic analysis is the functional annotation chart, and the tutorial that we've given you has all this information. When I click on that link I get this result. It's a little hard to see here, but the most enriched term, or type of gene, is kinase, from Swiss-Prot keywords. Then there's transmembrane receptor protein tyrosine kinase activity, which is a Gene Ontology term from molecular function, MF_FAT, which means it's not the slim version. Tyrosine protein kinase is from InterPro; InterPro is a database of protein domains. Here's some more Swiss-Prot keywords.
Here's an InterPro domain. Here's another GO term: BP means biological process, and all the rest of these terms are biological process: phosphorylation, phosphorus metabolic process, transmembrane receptor. In fact, if you know the biology of these terms, you'll recognize that they're all related to kinases and phosphorylation, which is actually very important in cancer. And here's the number of genes in the list that are associated with each term, the p-value, and the Benjamini-Hochberg corrected p-value.

When you use DAVID, mostly they call this the p-value, but sometimes they use something called the EASE score, E-A-S-E. EASE was the name of their software before it was called DAVID, and the EASE score is basically a Fisher's exact test that they modified slightly. If you look in the documentation, you can see that it's basically the same test, but they change the numbers by a very small amount, for a specific reason. I don't know why they did it that way, but it slightly changes the p-values for small gene sets. So this p-value is actually not an exact Fisher's exact test, which you might notice if you're comparing between tools. It's a modified Fisher's exact test, but otherwise it's the same idea. You can also download this file, and you get all the results as a text file.

Next I'm going to give a lecture about network visualization and analysis. Before going further on this,
I wanted to mention a few more points that came up as questions during the lab about enrichment analysis.

One question was about gene set size. I forgot to mention that if you have a gene set that's really big, like one of the top-level Gene Ontology terms, like biological process, which is basically every gene, or just a gene set that has 300 or 500 genes in it, those gene sets, if they come up, are less likely to be meaningful because they're so general. A set like that doesn't really tell you something very specific about what you're looking at, even though it might actually be enriched. So typically a lot of enrichment analysis tools throw those large gene sets away. I think the GO FAT list already has some filtering, so DAVID has some basic filtering.

And, do you know if you can change that filtering in DAVID, change the filtering of gene sets by size, like choose GO terms that are more than 250 or so? I don't think so. Yeah, I haven't been able to see anything like that. In GSEA it's possible to do that: you can choose the size of gene sets you want to work with.

The other problem with gene sets is very small gene sets. If you have a gene set of size one or two or three, it's very easy to get a hit: if you just get one gene out of three, you'll probably find that's enriched. So there's not a lot of ability to distinguish statistically between enrichments in a very small gene set, and those usually get filtered away as well.

The other question that a couple of people asked was: how is this different from Ingenuity?
Ingenuity is a commercial package that does enrichment analysis. They have their own curated database, so they probably use Gene Ontology, but they also curate a lot of additional things, so you probably get different gene sets from them. There was also a question about what it means if you use different tools and get different results. Probably that means the different tools are using different versions of Gene Ontology: one downloaded it in 2010 and one downloaded it just yesterday, and Gene Ontology is growing all the time, so one tool might have a newer version. Or they might use different gene sets; some of them might use additional gene sets that are not present in the other one. The other reason they might be different is that they use slightly different statistics, so you just have to look at how the p-values are calculated; they might use different multiple testing corrections. Okay, any other questions about the lab? I think you guys had a good chance to go through that. Okay, so I talked about gene lists and gene sets. We know more information about genes than just the fact that a set of genes is associated with a pathway. We might know that genes are connected to each other: for instance, maybe genes interact, or genes encode proteins that interact. This is an additional piece of information that can be used, and I differentiate that by calling it network analysis. One of the advantages of network analysis over gene set analysis, or typical pathway enrichment analysis, is that pathway enrichment analysis depends on someone else's definition of a pathway. You might not always agree that all the Gene Ontology terms are annotated properly or the way that you want. You could always annotate them yourselves, but that is very
time-consuming, although some people do it for specific projects. So if you have a large network of connected proteins or genes, like a protein interaction network, connected based on functional relationships, and you have your gene list, which may be a set of mutated genes like the GBM genes in the example that we gave you, you can see how they're connected in that network. If there are regions in that network that are highly enriched in genes from your list, then that is an interesting part of the network that might represent a pathway that is more specific to your gene list. So instead of having someone's predefined notion of a particular pathway, you discover a new part of the network that is potentially a pathway. The issue with that is that you usually have to interpret a little bit more, because nobody has said that that part of the network is a particular pathway, so you have to look at what that part of the network means. But that's just an example of using network information and how it can potentially provide more information than gene sets. So I'm going to give you an overview of a few different topics, including talking about networks in general and network visualization. After that we can talk about Cytoscape, which you guys have tried out, as a network visualization and analysis tool. And then tomorrow morning we'll talk more about Cytoscape, we'll use Cytoscape to create enrichment maps, and we'll also talk about gene function prediction and some other things that are more network based. So this is more of an introduction about networks.
Okay, so the general idea with network analysis is similar to pathway enrichment analysis, where you had your gene list and your attributes, your gene sets or pathways, and you looked for enrichment. For network analysis, you have network information like protein interaction data, and you have attributes like gene expression data. You may also have a gene list, and you want to combine that information together and analyze it and visualize it. Often the result is a network of genes and their relationships that you can visualize and interact with, and often you'll see pictures of gene-gene interaction networks in papers. Networks are very visual: the result is like a diagram, instead of something like a p-value or a table like we were looking at in DAVID. So people are also often interested in how to prepare a network visualization for publication, and I'll talk about that. This is the standard workflow, so I'll talk about networks and also how to interpret networks. If you're interested in a specific example of this workflow, there's a paper we published a few years ago in Nature Protocols about how to use Cytoscape to analyze gene expression data in the context of networks. Okay, so networks represent relationships, which can be any type of relationship. Typically in biology there are four different types of relationships. Physical relationships are things like protein-protein interactions: protein A binds to protein B and they're actually touching each other, so there's a direct connection in the cell. Regulatory relationships are where you have a regulator and a target, and the regulator regulates the target, like transcription factors or microRNAs or kinases. Often people draw these types of networks with an arrow.
So a transcription factor activates gene expression of the target or represses gene expression of the target. You might have a T-shaped arrow instead, like a bar, when the regulator inactivates the target. Genetic interactions are sometimes seen as well. It's a very specific type of relationship; you might have heard of epistasis. These are relationships that you observe when you mutate a pair of genes, for instance, and the phenotype of the double mutation is unexpected given the phenotypes of the single mutations. This is more often found in model organisms, but there are a fair number of genetic interactions available for human. An extreme example of that is synthetic lethality, where you knock out one gene and nothing happens to the cell, you knock out another gene and nothing happens to the cell, but you knock out both genes and the cell dies. Those two genes are probably in parallel pathways, buffering each other, and so that's a very different type of relationship than physical interactions. And then finally, functional interactions are a more general category of interactions that relate genes because they're similar in function: they might be co-expressed or they might have similar sequence. All of these other types of interactions are also functional interactions, but functional interactions can also be very general. If two genes are co-expressed, it doesn't tell you if they're interacting or if they're in the same pathway, but you know that they are somehow related, potentially functionally. So networks are useful for discovering relationships in large data sets.
If you have a lot of connections between your data, like similarity relationships, you can visualize them as a network and you'll see the relationships very clearly; if you put that into a spreadsheet, you won't be able to see them very clearly. Another good thing about networks is that they're very useful for visualizing multiple different data types together, and that allows you to see interesting patterns in your data. So here are three different examples of how networks can be represented, just as a bit of background. Here's a list of relationships as you might put in a spreadsheet. If you're going to store network information in a spreadsheet, you can say A1 is connected to A2, A1 is connected to A3, etc. So you have a spreadsheet with column one and column two, and then you can have additional columns that are optional. In this case there's another column here with the weight. This might represent the strength of the interaction, but it could represent other things. Here's the actual network representation of the same set of relationships. Here we've plotted it as a network, and it's much easier to see that nodes, or circles, six, eight and nine are forming a little triangle here; it's kind of hard to see that from this view. You can also make the thickness of these lines proportional to the weight, so A1-A3 is one of the thickest lines. And so you can visualize different types of attributes here.
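The spreadsheet-style list of relationships described above can be sketched in a few lines of Python. This is only an illustration: the node names follow the A1, A2 labels from the slide, but the specific edges and weights are made up, and `build_adjacency` and `find_triangles` are hypothetical helper names.

```python
from collections import defaultdict

# Each row is (column one, column two, optional weight column).
# The edges and weights are invented for illustration.
EDGE_ROWS = [
    ("A1", "A2", 0.4),
    ("A1", "A3", 0.9),
    ("A3", "A5", 0.7),
    ("A5", "A4", 0.3),
    ("A6", "A8", 0.5),
    ("A8", "A9", 0.5),
    ("A6", "A9", 0.5),
]


def build_adjacency(rows):
    """Turn the edge list into {node: {neighbor: weight}}, treating edges as undirected."""
    adjacency = defaultdict(dict)
    for source, target, weight in rows:
        adjacency[source][target] = weight
        adjacency[target][source] = weight
    return adjacency


def find_triangles(adjacency):
    """Return every set of three nodes that are all connected to each other."""
    triangles = set()
    for a in adjacency:
        for b in adjacency[a]:
            for c in adjacency[b]:
                if c != a and c in adjacency[a]:
                    triangles.add(frozenset((a, b, c)))
    return triangles


network = build_adjacency(EDGE_ROWS)
print(find_triangles(network))
```

The triangle is hard to spot in the rows of the table but falls out immediately once the rows are treated as a network, which is the point being made about the drawn version.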
So this is the A3-A5 connection, which is here in blue. Here's the A5 to A4 connection; there's an arrow here to signify that the directionality of the connection is only from five to four. Some terminology around networks: these circles are generally called nodes, and the lines are often called edges. Sometimes these nodes are called vertices. This terminology comes from computer science and math, which have studied networks for a long time in the field of graph theory. So sometimes you might also see networks referred to as graphs. In biology we tend not to use the term graph, because if you just ask most people what a graph is, they'll probably point to a plot or tell you about a plot, not a network. So network seems to be more easily recognized by most people. You can also represent networks as a heat map, in which case each square here represents one of these edges or connections, and all of the nodes in the network are represented on both axes. If you have a connection between two nodes, the corresponding square gets colored in. In general this is symmetric around this axis here: most of the squares on the lower triangle of the matrix are the same as the ones on the upper triangle, except for this green one here, which is only present in one direction. The weights are represented here as colors instead of thicknesses of lines. This heat map or matrix was clustered to put similar nodes together. So these nodes have a similar pattern of connectivity, and this is six, eight and nine, which is this little guy here. These nodes have a similar connectivity pattern, and that's this little group here. So this is useful for seeing groups or clusters in the network.
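Here's a small sketch of the heat map, or matrix, view described above, in Python. The node names follow the slide, but the weights are made up and the helper names `adjacency_matrix` and `one_way_cells` are hypothetical. An undirected edge fills two mirrored squares, while the directed A5 to A4 edge fills only one, which is why that one square breaks the symmetry.

```python
def adjacency_matrix(nodes, undirected_edges, directed_edges):
    """Build a square matrix where cell [i][j] holds the weight of the edge i -> j."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for source, target, weight in undirected_edges:
        # An undirected edge colors both mirrored squares.
        matrix[index[source]][index[target]] = weight
        matrix[index[target]][index[source]] = weight
    for source, target, weight in directed_edges:
        # A directed edge colors only the source-row, target-column square.
        matrix[index[source]][index[target]] = weight
    return matrix


def one_way_cells(nodes, matrix):
    """List the (source, target) squares that have no matching mirrored square."""
    size = len(nodes)
    return [
        (nodes[i], nodes[j])
        for i in range(size)
        for j in range(size)
        if matrix[i][j] != 0 and matrix[i][j] != matrix[j][i]
    ]


nodes = ["A3", "A4", "A5"]
matrix = adjacency_matrix(nodes, [("A3", "A5", 0.7)], [("A5", "A4", 1.0)])
for name, row in zip(nodes, matrix):
    print(name, row)
print(one_way_cells(nodes, matrix))
```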
This is a less common representation, but you definitely see it in the literature, and that's why I've included it here. But the take-home message is that all of these are equivalent, and this one is the common representation of the network. So this is an actual example of a biological network where all the nodes are proteins and all the lines are protein interactions. These are yeast proteins, and we've overlaid gene expression data and function data on this. Proteins that are known to be part of the kinetochore are colored blue, nucleosome proteins are green, replication fork proteins are pink, and the yellow ones are others. Some of the connections here are thicker than others. This means that the two genes that encode the proteins are highly correlated in their expression in a particular expression experiment that we overlaid. The expression experiment in this case was from yeast, looking at how genes were changing in expression over the cell cycle. So the genes in the cell cycle are going up and down, and if you have genes going up and down at the same time, they're correlated in their expression and they get a thick line here. You can see all these guys here have very thick lines. It means that all these genes, which are also in the nucleosome, are going up and down at the same time. Yep? So in this case the length of the line is not mapped to any kind of information, and I'll talk about that a little bit more. This is just laid out so that the nodes are not highly overlapping each other, so you can see things, and the line length is drawn however it needs to be so that that happens. That's a good question. The other type of information is the size of the nodes. There we map the transcription amplitude, that is, the highest level of transcription that a gene reaches in the cell cycle. So genes
are going up and down; for some of the genes the highest level is really high, and for some of them the highest level isn't that high. So the bigger the circle, the higher the expression is at some part of the cell cycle. So you can see right away, here's a whole set of genes that have really high expression. If you were just to look at this naively, you might think, well, I guess all of this is important in the cell cycle, but the nucleosome is really highly expressed and all expressed together. All these genes have thick connections between them, so they're highly correlated in their expression. And we know that this is a complex that probably gets expressed all at the same time, and it's expressed highly at a particular moment. In that way you can see a relationship between protein interactions, which are the edges, and gene expression correlation, which is the thickness, so you can get some idea about the dynamics of this complex. The other things that you can see are just the general relationships. I'll talk more about this figure later in terms of how to interpret networks: if you see a network like this, what patterns do you look for, and what information do you take home from this network? So I'll talk a little bit more about that later. As it stands, this is just an example of putting a lot of information together on the network. Question? Yeah. Yes, I can't remember what it was here; it was probably like time zero or something in the cell cycle. Okay, so what's the difference between pathways and networks?
We've talked about pathways. Most biologists understand the idea of a pathway; if you're told the name of a pathway as a biologist, you usually recognize it. If I tell you apoptosis, you'll say, oh, it's a pathway. Whereas a network, as I've shown you, is just a set of relationships. Pathways are also sets of relationships, right? You have relationships between aspects of the pathway, but pathways generally have a lot more information associated with them. Here's a metabolic pathway, glycolysis, that has a series of chemical reactions. Usually in pathways there's a series of steps that occur, so often it's more of a process, not just a static diagram. Pathways are trying to capture some information about a series of steps. It's hard to see here, but these are other types of networks that some people use, like gene regulatory networks, a signaling pathway, a metabolic pathway, and here's a really big network of protein-protein interactions. So the main difference between pathways and networks is that pathways tend to be more of a diagram of processes; they're trying to capture information about the biological process, like a series of steps, and there's usually a lot more detail in the pathway. You can take pathways and convert them to networks, but it's harder to go back from networks to pathways, because when you convert pathways to networks you lose information. You can also convert pathways to gene lists, right, or gene sets: just take the set of genes that's in a pathway. You lose a lot of information at that point. You don't even know how genes are connected.
You just know that they're all part of the same pathway. So there's a hierarchy of information in pathway analysis: gene set, network, and pathway, with pathways at the highest level of detail. Okay, so I'm not covering a lot of content about pathway databases as part of this particular workshop. The reason is that currently most people are doing pathway analysis using gene sets and networks, and we haven't really gotten to the point where software makes it very easy to use all the information that's in pathways, even though we'd like to. Eventually more tools will be available to make better use of the rich information that's in pathway databases. So right now people are mapping them to gene sets and networks and using them as that, okay? So, mapping biology to a network. A network is a very abstract concept: any set of relationships can be represented as a network, like relationships between people. But in biology you have to decide what the nodes and edges mean. You could say that each node is a protein and each edge is a protein interaction, like we did in the big network that I showed you. But they can represent anything. The network can carry more information and be a more realistic mapping: maybe you're only interested in showing the proteins that are present at the same time and place. Edges or interactions don't necessarily have to be physical interactions; I mentioned that there are different types of interactions: physical, regulatory, genetic, and functional. It's very important when you look at a network to find out what the nodes and edges mean. So here's an example of a really big network. A lot of the networks you see are protein interaction networks, so it wouldn't be very good if you thought that this was a protein interaction
network, because this is actually a protein sequence similarity network. All of the nodes here are proteins, but the edges represent sequence similarity, and the brighter the color, the more similar in sequence the proteins are. I think there are a million proteins represented here, so it's impossible to see any individual protein, but these regions here are protein families: clusters of proteins that have similar sequence to each other. From this plot you can see there are little dense regions, and maybe you can even draw boundaries around these and start finding protein families. That would be a useful thing to do with this network. So this is just the idea that relationships don't have to be the typical types of relationships that you might think about. Okay, so I told you that networks are useful to represent relationships, and they're also useful for visualizing lots of different types of information at the same time. But they're also useful for analysis. One of the interesting things that has happened in network research and its application to biology is that there are a lot of algorithms out there from network theory, or graph theory, in computer science that can be applied to biology. As I mentioned before, computer scientists and mathematicians have studied graph theory for a very long time, and there are all sorts of algorithms available because of that. There are algorithms for lots of different things, and some of them can be used to answer biological questions. So here's an example. How many people have heard of this idea of six degrees of separation? So, most people. This is the idea that everyone in the world is connected to everyone else by at most six links, thinking in terms of friendships.
It's probably less now because of Facebook, but this idea came out of research from Stanley Milgram in the 60s, who gave people postcards with a name and said, try and get this postcard to the person without knowing their address. Many of the postcards made it to the person at the other end, and he counted the number of hops the postcards took, and on average it was about six links. So if you had a big network of all of the friendship relationships on the planet, like Facebook, you could answer this question without doing the postcard experiment, which is time-consuming and costs a lot of money in stamps. The question would be: how are two people connected, and which path should we take? In computer science, there's an algorithm called shortest path by breadth-first search, which searches a network going outward breadth-first: starting from a node, it tries all of its neighbors, and then goes to the next level, and the next level. Computer scientists have proven that by the end of the algorithm it is guaranteed to find the shortest path if one exists. So if a path doesn't exist it won't find one, and if it does exist it will find one, and it's guaranteed to be a shortest path. There could actually be more than one shortest path, but it will find at least one of them; in fact, if you ask, it will find all of the shortest paths, all of the paths that are equally short. So that might be interesting in biology if you're interested in how proteins are connected. Say I have two proteins of interest.
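The breadth-first search described above can be sketched in Python as follows. The toy network and the protein names P1 to P6 are hypothetical, and `shortest_path` is just an illustrative function name; it returns one shortest path, not all of them.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search: return one shortest path as a list of nodes, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Try all neighbors of the current node before going one level deeper.
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                if neighbor == goal:
                    path = [neighbor]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None  # no path exists between the two nodes


# Hypothetical protein interaction network; P6 is not connected to anything.
TOY_NETWORK = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P2", "P3", "P5"],
    "P5": ["P4"],
    "P6": [],
}

print(shortest_path(TOY_NETWORK, "P1", "P5"))
print(shortest_path(TOY_NETWORK, "P1", "P6"))
```

In this toy network there are two equally short routes from P1 to P5, one through P2 and one through P3, and the sketch reports whichever it reaches first. Whether either route means anything biologically is a separate question from what the algorithm guarantees.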
I want to see how they're connected in the network, so you could answer that question using this algorithm. Is this biologically relevant? That's another story, because maybe the shortest path isn't the biologically best path, but it is a potentially interesting path. So this is another question on top of that; computer science doesn't tell you if it's biologically relevant. For about the past ten years people have been studying networks a lot in biology and have developed lots of interesting applications. People have used networks for gene function prediction, which I'll mention again a couple of times in a bit and tomorrow morning. People have used networks to find protein complexes or other modular structures in large networks, to study network evolution, and to predict new interactions between genes, and there are also quite a lot of interesting applications in disease research. Instead of looking just at individual genes, people have found networks that are correlated with disease or disease subtype. So if you have disease versus normal, and you have gene expression data available, for instance, for a bunch of samples, you could look for regions of the network that are highly differentially expressed in disease and not in normal, and that might be an interesting region of the network. A couple of years ago there was a good paper from Trey Ideker's lab about breast cancer, finding networks that were associated very specifically with breast cancer using gene expression data. This ties into subnetwork-based diagnosis. This is the idea that biomarkers, or markers of disease, don't necessarily have to be genes.
They could be a set of genes. You could use gene sets for this, but you can also automatically discover the best sets of genes that are connected in a network and might represent a particular little pathway or system that is related to the outcome of the disease, which you could use for diagnosis or prognosis. And this is the same type of thing in a GWAS, genome-wide association study, type of view. For each of these examples there's a name under here, which is actually the name of a Cytoscape plug-in that does these different analyses. So if you're interested in some of these analyses, you could look up some of these plug-ins. Okay, so what's missing in the typical network? Typical networks, and also pathways, are static. They generally represent all of the information that we know about that could happen at any given time in a cell. They generally don't represent what's happening in a particular cell at a particular time, and so what's missing is context, like cell type and developmental stage. That's related to dynamics; dynamics can also be the changing of concentrations of components over time.
It's difficult to represent some of these things, like a calcium wave or a wave of electrical conduction in a neuron. Usually people represent that with a more detailed mathematical representation, and usually that's a simulation model that can work at a very high level of detail and simulate a process over time, with all the concentrations and how they change. We are not going to cover this, but there are some pointers if you're interested. Typically this requires a lot more information than we usually have about pathways, so for genomics studies you never get to this level of detail, because you're working with the whole genome and we don't have the information needed for this about all the genes. Also, we tend to represent proteins or genes as circles, but obviously there's a lot more detail, like atomic structures and domains in proteins, and often that's not in the network either. Okay, so in general, networks are useful for seeing relationships in large data sets, much better than tables. The important thing when you see a network is to understand what the nodes and edges mean. There are many methods available for network analysis, and there are two ways of approaching this. I guess we need to develop more textbooks about network analysis, but currently it's a bit of a challenge to figure out the right network analysis for your problem. You can either become an expert in all the different analyses and then figure out which one is best for you, but that's usually too time-consuming, or you can define a very specific question and then look for answers. If you know someone knowledgeable, you can ask them: given this question, what network analysis method should I use to answer this
question? So that's probably a good way of doing it: determine a question, like, I want to see differentially expressed networks. I tried to give you some examples a couple of slides ago, and later on when we talk about Cytoscape I'll tell you a lot about different plug-ins and the different types of questions that they can answer, so you can get a flavor for it. Okay, so the second part of this network introduction is network visualization. Did anyone have any more questions about the first part? Okay, so network visualization is pretty quick, just a few notes about it. I showed you a nice network that looks like this. This is the result of network layout, and one of the most important aspects of network visualization is network layout. Luckily, there are automatic algorithms that can help you get a layout like this; if you didn't have that, you would have a network like this. If you were looking at a protein interaction network and you wanted to draw it out manually, if you just drew circles for every protein and then started connecting them up, you'd probably get something that looks like this. When people see networks like this, they usually call it a hairball, because it looks like a tangled mess and it's very difficult to interpret. So luckily, as I said, there are automatic layout algorithms that lay out the network in a much more intelligible way.
So the main goal of network layout algorithms is to arrange nodes so that they don't overlap and to reduce the overlap of edges, because the more crossing of edges or interactions you have, the more of a mess you make. One of the ways of doing this is called a force-directed layout, which you might see in Cytoscape or other network analysis tools. The way this layout works is that it simulates the network as a physical system where nodes are repelling each other and edges are pulling nodes together. Edges are pulling like springs and nodes are repelling like charged particles, and it simulates throwing the network up in the air and having all those forces resolve themselves. When it lands, all of the nodes that are highly connected are pulled together by lots of edges, even though they're repelling each other, so they don't overlap very much, and nodes that are distantly connected end up farther away in the network. That works really quite well for most networks. So if you see a force-directed layout, or sometimes it's called a spring-embedded or organic layout, those are good layouts to try in general for any network. They're generally good for up to 500 nodes, before you start getting this hairball effect where it's difficult for the layout algorithm to reduce the number of edge crossings because there are just too many edges.
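To give a feel for how a force-directed layout works, here's a minimal sketch in Python. The lecture doesn't say which force model Cytoscape uses, so this borrows the force formulas from the Fruchterman-Reingold algorithm as one common choice; the function name, the starting temperature of 0.1, and the toy network are all assumptions made for illustration.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} after a simple spring/repulsion simulation.

    `nodes` is a list of unique names; `edges` is a list of (node, node) pairs.
    """
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}  # "throw it in the air"
    k = math.sqrt(1.0 / len(nodes))  # preferred distance between nodes

    for step in range(iterations):
        # The largest allowed move shrinks each step so the layout settles.
        temperature = 0.1 * (1.0 - step / iterations)
        disp = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes repels, like charged particles.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push

        # Every edge pulls its two nodes together, like a spring.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull

        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            move = min(length, temperature)
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move

    return {n: (p[0], p[1]) for n, p in pos.items()}


layout = force_directed_layout(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")])
for node, (x, y) in layout.items():
    print(node, round(x, 3), round(y, 3))
```

The repulsion loop visits every pair of nodes, so the work per iteration grows with the square of the number of nodes, which is one practical reason these layouts get slow and messy on very large networks.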
So if you have lots of edges and nodes, and you're having problems visualizing large networks, you can try to reduce the amount by filtering to a more specific subset. Also, just a couple of quick points about making networks look nice. If you want to make a publication-quality network that you're going to publish in a journal, it's a good idea not to rely just on an automatic layout, but to manually adjust the layout after you've done the automatic layout. You can also load the network into a drawing program like Illustrator to adjust labels, and that's what we did for the big network that I showed you before. I mentioned that really big networks are difficult to visualize: if you have a very big network, like all of the protein interactions in human, you probably won't be able to visualize it very effectively. It will look like this. There are two ways of solving that problem. One is zooming into a specific area of interest, like a particular pathway, and just focusing on that pathway. Another way is reducing the number of edges, if you have some way of doing that; if you have protein interaction confidence values, you can just visualize the most confident interactions. I also mentioned that networks are useful for visualizing a lot of different information at the same time, and I showed you that in the network there were circles of different colors and different thicknesses of lines. There are a lot of different visual features that you can use to map information. Back to this network here: we mapped node color to gene function, node size to transcription amplitude, and edge thickness to gene expression correlation.
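The two ideas above, filtering edges by confidence and mapping data onto visual features, can be sketched in Python like this. The colors follow the yeast figure (kinetochore blue, nucleosome green, replication fork pink, others yellow), but the function names, the size and width ranges, and the example values are made up, and this is not how Cytoscape implements its visual mappings.

```python
FUNCTION_COLORS = {
    "kinetochore": "blue",
    "nucleosome": "green",
    "replication fork": "pink",
}


def filter_by_confidence(edges, threshold):
    """Keep only (source, target, confidence) edges at or above the threshold."""
    return [edge for edge in edges if edge[2] >= threshold]


def node_style(function, amplitude, max_amplitude, min_size=10.0, max_size=50.0):
    """Map gene function to node color and transcription amplitude to node size."""
    color = FUNCTION_COLORS.get(function, "yellow")  # yellow means "other"
    size = min_size + (max_size - min_size) * amplitude / max_amplitude
    return {"color": color, "size": size}


def edge_width(correlation, max_width=8.0):
    """Map expression correlation to line thickness; negative values get the thinnest line."""
    return 1.0 + (max_width - 1.0) * max(0.0, correlation)


# Hypothetical interactions with confidence values between 0 and 1.
interactions = [("P1", "P2", 0.95), ("P1", "P3", 0.40), ("P2", "P3", 0.80)]
print(filter_by_confidence(interactions, 0.75))
print(node_style("nucleosome", amplitude=5.0, max_amplitude=10.0))
print(edge_width(0.9))
```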
This is up to your imagination how you want to do this, but you can layer on a lot of different types of information using different visual attributes.

Okay, so back to this network that I showed you. I promised that I'd come back to looking for patterns in this network that are useful biologically. So, how to visually interpret a network? There are generally three types of patterns that are recognized in biology as being quite useful for understanding biological networks. So if I gave you this network and asked you, you know, what is special about this network? What does this network tell me?

I already mentioned looking for general patterns between visual attributes. All these circles are highly connected, they're big, they're all the same color, and the edges that connect them are thicker. So you see a lot of types of data coming together to sort of stand out. That's a general method to look for interesting relationships between types of data.

There's also this idea of guilt by association. What was noticed very early on, when people started visualizing biological networks, is that genes that were known to be of similar function usually hang out in the same parts of the network. You can see that here: all the kinetochore genes are blue and they're all very highly connected. Maybe there's a blue one over here, but in general they're connected to each other more than they're connected to other things. That has been used for gene function prediction. Suppose you don't know that one of these genes in the middle here is involved in the kinetochore. Say this gene here, which is yellow.
Say I didn't know the function of this gene, but I saw that it was connected only to kinetochore genes. The more connections the better; in this case it only has one. Then I would predict that it's probably involved in the kinetochore. So I would make a gene function prediction that this gene should be blue, because it's only connected to blue genes. Here's an example of a gene that's connected to a bunch of blue genes and a bunch of green genes. So you might think that it's participating in both functions, and maybe it's related to the physical connection or signaling connection between these two gene functions.

So that's guilt by association. That's the basis of most gene function prediction. Again, the idea is that genes that have similar function are more likely to interact with each other. So if you have a gene without a known function and you see that it's interacting with a lot of genes of known function, you can predict, with some level of accuracy, that it's part of that function as well. We'll talk more about that tomorrow.

Another pattern is dense clusters.
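Coming back to guilt by association for a moment, the prediction rule described above can be sketched as a simple vote among a gene's neighbors. This is only an illustration of the principle, not how any particular tool implements it; the gene names and annotations are made up, and the "support" score is an assumed stand-in for a real confidence measure.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Turn a list of undirected (gene, gene) pairs into neighbor sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def predict_function(gene, adjacency, known_functions):
    """Predict a function by majority vote among annotated neighbors.

    Returns (function, support), where support is the fraction of annotated
    neighbors that carry the winning function, or (None, 0.0) when no
    neighbor has a known function.
    """
    votes = Counter(
        known_functions[n]
        for n in adjacency.get(gene, ())
        if n in known_functions
    )
    if not votes:
        return None, 0.0
    function, count = votes.most_common(1)[0]
    return function, count / sum(votes.values())


# Hypothetical network: geneX has no annotation yet.
edges = [("geneX", "geneA"), ("geneX", "geneB"), ("geneX", "geneC"),
         ("geneA", "geneB")]
known = {"geneA": "kinetochore", "geneB": "kinetochore",
         "geneC": "replication fork"}
prediction = predict_function("geneX", build_adjacency(edges), known)
```

A gene that splits its votes between two functions, like the blue-and-green example, shows up here as low support, which is a hint that it may participate in both.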
So you notice that there are a few densely interconnected regions. It was also realized early on that, in a protein interaction network, these often represent protein complexes, and people developed methods to find these clusters and then predict protein complexes or new members of protein complexes. In other types of networks, like protein sequence similarity networks, dense clusters might mean something else, but generally they're interesting. So that's something that people look for.

There are also global relationships, which are general relationships between different parts of the network. So the kinetochore is one part of the network, and it seems to be more highly connected to the nucleosome, and then that's more highly connected to the replication fork. There are some direct replication fork to kinetochore connections, but there seem to be fewer of them than going through here. So that might tell you something about the overall structure of the network that you're looking at. People have used this type of relationship to define maps of the cell from very large protein interaction or genetic interaction maps, and so you can get a sense of which processes are more closely related to each other.

Any questions? So, to recap, just a quick introduction. The first step of network visualization is automatic layout.
You have to do that first, just to visualize a network. And there are quite a number of patterns that you can visualize together; I've repeated that a few times now. If you ever find yourself looking at a network that's very complicated, try to focus your analysis by removing nodes and edges that might be less important.

So, everybody... yeah, question?

As for the source of the data, there are many network databases, many databases that store protein-protein interactions or other types of relationships. For instance, the IntAct database is a protein-protein interaction database that curates protein interactions from the literature. So if you're just interested in networks, and not pathways, there are databases that just focus on protein interactions or genetic interactions, and they're curating these from the literature. A lot of protein interactions come from large-scale screens, and they're pulling that data in as well. For pathways, there are pathway databases and pathway curators. A couple of people are here who are doing pathway curation for the Reactome database. In general it's curated, and sometimes it can be predicted, so you just have to know whether the database has curated information or predicted information. On the wiki there's a link to pathway databases and network databases that are pretty good, including a link to a website that we maintain that lists all the hundreds of pathway databases that we found, and interaction databases as well.
It's called Pathguide.

The same? So, a protein interaction database holds a very specific type of interaction; it's only protein interactions. A pathway might include many more types of relationships, like protein A binds to B and phosphorylates it, and that phosphorylation changes the conformation of B, and then B becomes active and can phosphorylate something, or do something else: it can bind to this other guy, and then that guy gets cleaved, and one product gets degraded and the other product becomes a signaling peptide or something like that. So that's much more detailed information. Whereas in the protein interaction database, they'll just say A binds to B. They won't capture all of that. For detail, it might have the strength of binding, but generally the post-translational modifications... they might be annotated, but I'm not sure how often they're recorded as really required for binding. Yeah, it could do that.

Any other questions? Okay, so I'm just going to introduce Cytoscape. Cytoscape is a network visualization and analysis tool that is freely available. One of the assignments that we gave you, hopefully before the class, was to try out the Cytoscape program by following the introductory tutorials online. The reason we did that is because it can take a little bit of time to go through those tutorials, but they're fairly straightforward, so you can go through them, and then when you get to the class we can focus on analysis methods instead of spending most of our time figuring out how to use the software. That makes better use of the class time. In the time just before the end of this class, and I'm almost finished here, we can try to use Cytoscape a little bit. And I don't know how many people are typically staying afterwards.
You can continue using Cytoscape. I can give a quick demo of Cytoscape as well, but one important thing that we should do is make sure that Cytoscape is working on everybody's computers, because tomorrow we'll be using Cytoscape in the lab and we'll just assume that you have Cytoscape running normally and you'll just use it. So if, after this lecture, you haven't got Cytoscape running, if you have some problem, put up your hand and the TAs will help you with it.

Okay, so Cytoscape is a freely accessible tool. Originally it was developed at the Institute for Systems Biology in 2001, 2002, and since then it became an open-source project that involves many labs, including my own. There are nine different academic and industry groups that are part of the team of people that build Cytoscape, and a lot of different types of network analysis are available in Cytoscape.

The general idea with Cytoscape is that you collect network information from different places, databases and whatever other places you can collect it from. You have to collect that data, and then you load it into Cytoscape. Then you can overlay gene expression data or other things, and you can run network analyses that are available as Cytoscape plug-ins. I'll talk more about where you get the network data, because that's something important.

There are lots of network analysis tools. I guess one of the different things about Cytoscape is that it's one of the few network analysis tools that's really open source, and it has a big, active community around it. So there are a lot of tutorials, like the ones that you used, there are annual conferences, and there are tens of thousands of users. There's a mailing list for discussion, so if you end up really using Cytoscape a lot, you can sign up to those mailing lists and get questions answered. Also,
many people have written plug-ins to extend the functionality of Cytoscape, and there are now over a hundred plug-ins. If you know Java, you can build your own plug-ins to extend the functionality in new ways, or if you know someone who knows Java, you can get them to build plug-ins for you.

I'm not going to go through Cytoscape itself, because you've already looked at it. I just wanted to present this workflow that we've put together, and then I'll explain how gene lists fit in here. So typically you load up gene lists and gene attributes, and you get network information from somewhere. The gene lists are coming from your experiments. Gene attributes we talked about. Network information needs to come from somewhere as well. But you can load that all up into Cytoscape, and there might be different types of networks, like protein interactions or functional interactions or regulatory networks. Then you can do network visualization. Cytoscape itself is all you need for network visualization. And then you can do analysis.
There are different types of analysis that you can do. So the gene list comes from your experiment. For network information... so, I've put little words next to each of these boxes. These are plug-ins in Cytoscape that can help you with each box.

If you're interested in getting network information, you can use iRefWeb, which is my favorite database of protein interactions. It collects the protein interaction data from many other databases and puts it together. GeneMANIA, which we'll talk about tomorrow, has a large database of functional interactions and will probably satisfy most people's demands here for getting network information. Agilent Literature Search is a tool from Agilent that allows you to type in a set of genes, and then it tries to find relationships among those genes from the literature. It reads PubMed abstracts and looks for relationships based on keywords: for "A binds to B" it will draw a link between gene A and gene B. And then there's STRING, which is a tool like GeneMANIA.

There are different types of analysis. For gene set enrichment analysis, which we talked about this morning, there's a BiNGO plug-in that does gene set enrichment analysis, similar to DAVID, within Cytoscape. If you're working within Cytoscape and you want to do enrichment analysis, then you might want to use BiNGO. Tomorrow we'll talk about Enrichment Map, which is a plug-in that allows you to visualize the results of enrichment analysis, like the results of DAVID, in a much more intuitive way. I'll talk more about that tomorrow morning, and we'll have a lab about that. For regulatory network analysis, there's a plug-in called NetMatch. For gene function prediction, which we'll talk about tomorrow, there's GeneMANIA. For module detection, like finding these dense regions, there are the plug-ins MCODE and clusterMaker. ReactomeFI is a plug-in.
We usually teach that one as well, as part of our pathway course. So I've just made this workflow to give you a kind of map. If you're interested in any of these boxes, you can look up the plug-in.

Then there are the next slides, which I'm not going to go over. These next few slides are little references for different Cytoscape plug-ins. If you're interested in particular things, like visualization for gene expression data, you can look at this VistaClara plug-in. This one is a little more detail about the BiNGO plug-in, which is on that first workflow. A number of the plug-ins that are listed on that workflow have a couple of extra slides about them, so that if you're interested you can learn more about them, but I'm not going to go through them. Here's the Agilent Literature Search plug-in. So it's just a set of interesting plug-ins. A number of them have labs, so if you're interested in working with this information you can follow the lab. Here's a regulatory network motif finder. This will find things like feedback loops in the network. And there are quite a few other things. At the very end, if you end up being a big user of Cytoscape, there are some tips and tricks for how to use Cytoscape better. You definitely don't need to read that if you're a beginning user, but once you start using it a lot, some of these might be helpful. So those I just put in there for your reference, and that's it.

How much time do we have? I guess we're almost done. Any questions? I was thinking that if we had extra time I could give you a quick demo of Cytoscape, but I don't know if that's fully necessary. Are there any questions?
Yeah? Yes, so nobody has integrated all of these. So the question is: has anyone tried to integrate all the pathway information in the world together in one place? That would be the ultimate goal. We really want to do that. So yes, people are working on that. No, it's not finished.

So there are two... really, iRefWeb is a really good resource for protein interaction data. Most of the protein interaction data is available through iRefWeb. And then there's a project that I work on called Pathway Commons, with Chris Sander at Sloan Kettering Cancer Center in New York. The goal of that project is to merge and collect all pathway data from lots of different pathway databases so that it's in one accessible place, so that everybody can get access to it very easily, and that will improve the ability of users to use pathway information. So that's called Pathway Commons, and there is a Pathway Commons plug-in for Cytoscape, so you can go into Cytoscape, pull down pathways from the list, and visualize them inside Cytoscape. But I wouldn't say that it's finished yet, and I think there's quite a lot of work to do in that area.

Tomorrow we'll talk about GeneMANIA. We've done a lot of work with GeneMANIA to collect data, and we have more than a hundred million interactions for human, or something like that. A lot of those are co-expression links, but we have collected tons of information.

So ideally, what you would want to do if you have a gene list... well, one thing I almost forgot to mention is that there are two important, sort of different ways of using Cytoscape.
So if you have a network already, maybe you did an experiment, like a protein interaction experiment, and actually generated a network. If you have that network, you probably want to visualize it and analyze it, and Cytoscape is perfect for that. If you have a gene list, you have to convert your gene list into a network. The way to do that is to put the gene list into one of these tools that finds all the connections between your genes, and may also include other genes that are connected to your genes. This GeneMANIA tool that we'll talk about tomorrow is very good at doing that. You just give GeneMANIA a list of genes and it will find all their connections, and then you can start doing network analysis. There's also a tool like that from Reactome, called the Reactome Functional Interaction plug-in, which is also a Cytoscape plug-in that's listed on one of those slides, which is... right here. Oh, sorry, ReactomeFI. Actually, I should probably put a separate box here for converting gene lists to networks. GeneMANIA and ReactomeFI are two good plug-ins for that, but we'll focus on GeneMANIA tomorrow.
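The gene-list-to-network step can be sketched in a few lines, assuming you already have a table of known interactions as gene pairs. This is not how GeneMANIA or ReactomeFI work internally; the function name, the `min_links` rule for pulling in extra "linker" genes, and the gene names are all hypothetical.

```python
from collections import defaultdict


def gene_list_to_network(gene_list, interactions, min_links=2):
    """Build a network around a gene list from known (gene, gene) pairs.

    Keeps every gene in the list, plus any other gene that interacts with at
    least `min_links` genes from the list. Returns (nodes, edges).
    """
    query = set(gene_list)
    # For each gene outside the list, collect the query genes it touches.
    partners = defaultdict(set)
    for a, b in interactions:
        if a in query and b not in query:
            partners[b].add(a)
        elif b in query and a not in query:
            partners[a].add(b)
    linkers = {g for g, hits in partners.items() if len(hits) >= min_links}
    keep = query | linkers
    # Keep only interactions whose two ends both survived.
    edges = [(a, b) for a, b in interactions if a in keep and b in keep]
    return sorted(keep), edges


# Made-up interaction table and gene list.
known_interactions = [("A", "B"), ("B", "C"), ("A", "X"),
                      ("C", "X"), ("C", "Y"), ("Y", "Z")]
nodes, edges = gene_list_to_network(["A", "B", "C"], known_interactions)
```

Here X is pulled in because it connects two genes from the list, while Y touches only one and is left out. Raising `min_links` keeps the resulting network smaller and easier to lay out.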