OK, welcome to the lecture for day three. This morning I will be telling you about networks and how to use them to help interpret, or visually interpret, data, including the results of the pathway enrichment analysis that you did. Lincoln's slides yesterday afternoon about Reactome covered pathways and a little bit on networks. The network part was mostly to introduce the networks used with ReactomeFI, where you download networks. I'm going to cover networks as well, so there will be a little bit of overlap, but the concepts I'll be talking about are more focused on network visualization and how to interpret how these networks are visualized. I'll also mention a little bit of network analysis that wasn't covered yesterday, so hopefully there's not too much overlap. And I'll be talking about Cytoscape in general today.
So a typical network analysis workflow includes first getting some network information from somewhere. You got network information yesterday from ReactomeFI. For instance, you can load in your list of genes or mutations and pull down a network. That's one type of network information, and there are lots of sources for it. The second thing you do is load some information about the network. The genes and interactions have information associated with them, and that could be expression data or mutation data. And then you analyze and visualize the network. I'll be talking mostly about visualization this morning, and some analysis. And then, ideally, once you've finished your network analysis and you have a nice network figure that shows something interesting, like the genes that are more often mutated in the sample that I'm studying and how they're connected into modules, you can create an image for publication.
It's easy enough to save the image from the tool that you're working with, especially Cytoscape. You can save it as a PDF or take a screenshot. But actually, there's a little bit more that you can do to make even better publication-quality images. This workflow was covered in a paper that we wrote a number of years ago in Nature Protocols, which discusses how biological networks and gene expression data can be integrated and analyzed.
OK, so I'm going to give you a quick introduction to networks again, hopefully covering some different topics from those covered yesterday, then focus on network visualization and discuss Cytoscape as a software tool. Everybody's tried it already, but I'm going to go over it again from a more general point of view and see if anybody has questions. And then I'll mention network analysis, just a little bit more broadly than what was mentioned yesterday.
So networks. The main reason why we care about networks is up there: they're very good at representing relationships in data. In biology, people are very interested in different types of relationships: physical interactions between genes, regulatory networks like Wyeth discussed, and genetic interaction networks. We haven't discussed those so much here, but if you're working with certain model organisms, synthetic lethal interactions are very important, and also more and more in cancer. Functional interactions are interactions between genes where it might not be a specific experiment that defined the interaction, but the relationship means that the genes are somehow functionally related. It's a more general concept: it could be that the genes are co-expressed, that they're physically interacting, or that they have similar sequence. So functional interactions are fairly general.
And as I mentioned, networks are useful for discovering interesting relationships in large data sets. It's much better than looking at all the information in a table or spreadsheet. Imagine you had a bunch of interactions between genes and you loaded them into a spreadsheet. You'd have, say, two columns: column one is gene A, column two is gene B, and those two columns define relationships. So gene A is connected to gene B, and you have one row for every interaction. It would be difficult to see the structure of the network if you didn't load it into a network visualization system. Another way that networks are useful is that they're very good for visualizing multiple different data types together. You could look at protein interactions, gene expression, mutations, methylation, and a whole bunch of other things that you layer onto the network and visualize. And then finally, over the past decade or so, almost 15 years, a lot of people have developed some very interesting network analysis algorithms. You looked at ReactomeFI yesterday.
So here's an example of a network. I think many of you probably read the Nature Biotech paper that we sent out beforehand about how to visually interpret biological networks, and some of the figures here are taken from that paper. Just to show this briefly, this is a network of protein interactions between yeast proteins, and a number of different types of information are layered on it. Some information about gene function is layered on this network: replication genes are colored red, nucleosome genes are colored green. And the width of the lines connecting the genes is proportional to gene expression correlation, from a gene expression data set that measured gene expression at different points of the cell cycle.
So genes that have a thick edge, or line, between them are highly correlated across the cell cycle experiment. And the size of the circle is proportional to the maximum transcription level at any given stage of the cell cycle. So some of these genes are highly expressed, and some of them aren't that highly expressed. Once this was visualized, a few modifications were made to the figure in another program. The figure was generated in Cytoscape, then saved as a PDF and loaded into Adobe Illustrator, though you could use any kind of image editing software. And then we added these labels here, we circled certain things that we wanted to emphasize, and we added an arrow here. This extra work in another image editing program just helps emphasize certain regions in ways that are not easy to display graphically in a tool like Cytoscape. I'll mention this a little bit more later, but this is just a quick example. So you can see that there's some structure in this network. The kinetochore genes are more connected to each other than you would expect, and all these functions act like that. And you can sort of see general distances between things: certain sections of this network are closer than others. So it's somewhat instructive.
OK, so Lincoln also mentioned pathways yesterday. Pathways and networks are related, in that pathways have network-type relationships. They have connections between proteins or genes or molecules, small molecules in the case of metabolism. So this is an example of a metabolic pathway, a signaling pathway, and a gene regulatory pathway. And this is just a network of protein interactions, thousands of them, from a mass spectrometry experiment. Pathways usually have more detail on their connections. That's the major difference.
Also, networks are currently really limited to pairwise connections between genes, whereas pathways can have more than two things involved in a relationship. So that's one difference, and because of that, pathways might sometimes be difficult to visualize in Cytoscape, or as a network.
OK, so I mentioned network analysis. People are interested in using networks, or the concept of networks, because there are all these analysis algorithms out there. You already covered some, but just as a very simple example of how these network analyses are often derived: people have actually studied networks for a very long time, more than 100 years, in math and computer science. And before computer science, people studied them in math. In that field, it's called graph theory, and a network is called a graph. We don't use that term in biology because most people, when you ask what a graph is, think of a plot. So network is a more intuitive word for most people. And the interesting thing about graph theory is that there is a huge number of algorithms available, where people have developed some algorithm to answer some question about these networks.
So just as an example, how many people have heard of the six degrees of separation? OK, almost everybody. This is the idea that everybody in the world is connected to everyone else through acquaintances or friendships by at most six links, or on average six links. This was an interesting concept that was discovered in the 60s. Stanley Milgram, who's a famous social psychologist, was interested in finding out how people are connected. So he sent a bunch of postcards to people in Boston, and he said, I want you to send this postcard to somebody in New York. He gave the person's name and what they did. But you have to send it through friends. You can't mail it to the guy, and I'm not giving you his address.
So people tried to forward the postcard to a friend who they thought was somehow closest to a banker in New York. And surprisingly, many of the postcards actually made it to the guy without going through the post or without using the person's address. And each step along the way, he also asked that people send a postcard back to him so he could trace where they were going. And he found that on average it took six jumps to get to this person for the ones that made it through. So that was an interesting experiment that was kind of fun. And I'm sure it's much smaller than six degrees now with Facebook. But the question that he was asking is that was a time consuming experiment. But it served its purpose to map social relationships. But the question that he was asking is how are people connected in Boston to New York? And what path can you follow to get from one person to another? So if you know the full network, there's a computer science algorithm called breadth-first-search that is guaranteed to find a connection if it exists. And if it finds a connection, it's guaranteed that it's the shortest connection, or one of the shortest. There could be lots of different possible routes that are all equally short. So people in this field of graph theory have mathematically proven that this algorithm will work. If a path is there, we'll find it, guaranteed. And it will be the shortest path, guaranteed. So this is a very standard algorithm in graph theory. And you can see if this type of algorithm is useful for answering questions in biology. So for instance, if you had a protein interaction network and you're interested to see how two proteins were connected, you could just run this algorithm and it would tell you, OK, these two proteins are connected, or they're not connected. And if they're connected, this is the path that you can follow. 
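The breadth-first search idea described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the network is given as a two-column list of undirected interactions; the function name `shortest_path` and the protein names `P1` to `P7` are made up for the example and do not come from any real data set or tool.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    # Build an undirected adjacency map from the two-column edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if start not in neighbors or goal not in neighbors:
        return None

    previous = {start: None}   # remembers how each node was first reached
    queue = deque([start])     # nodes are explored in order of distance from start
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through 'previous' to recover the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None                # the two nodes are not connected


# Hypothetical protein interaction list: one row per interaction.
interactions = [
    ("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
    ("P1", "P5"), ("P5", "P4"),
    ("P6", "P7"),
]

print(shortest_path(interactions, "P1", "P4"))  # a two-step route via P5
print(shortest_path(interactions, "P1", "P6"))  # None: no connection exists
```

Because nodes are visited in order of their distance from the start, the first time the goal is reached the route is as short as any other. As noted in the lecture, there can be several equally short routes; this sketch just returns one of them.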
So that's an example of people taking interesting algorithms from computer science and applying them to biology, or asking a biological question and then going to computer science and asking, is there an algorithm already that can help me answer this question? Another example that you saw yesterday is the graph clustering that Lincoln mentioned, where you find these connected modules. The algorithms that find those modules are standard algorithms in computer science. There are thousands of them, actually. And we can just use them, and we know that they're going to work. So a lot of the network analysis methods that people use have some basis in computer science, and usually there's some history behind the algorithms. People have taken those algorithms and applied them to biology, which is very powerful, because we don't have to invent something new. We just take something that's already working and apply it. The question that you always need to ask, whenever you come across some new type of network analysis, is how biologically relevant it is. For instance, the shortest path question: OK, I can find a path between two proteins in a protein interaction network, but is that an actual signal transduction path? Maybe not. Maybe it doesn't consider context. Maybe it doesn't consider a lot of things that you would need to consider to know that that path was somehow useful in a cell.
OK, so many applications of network biology have been developed, and this slide and the next one just list a bunch of different ones. We'll be talking about gene function prediction this afternoon. Detection of modular structures is what you talked about yesterday with one of the parts of ReactomeFI. People have used these algorithms to study network evolution. So if you have networks from different species, there are algorithms to align networks.
It's similar to sequence alignment. People have also used different methods to predict new interactions. Protein interactions, functional interactions, all sorts of different interactions can be predicted based on existing data. So sometimes you'll come across networks that have a lot of predicted information. There's also a lot of recent interest, especially, in developing methods to help study disease. I think Lincoln mentioned this yesterday. There are methods to find subnetworks, or little pieces of networks, that are correlated with disease somehow. People call them network biomarkers. And there are also ways of doing genome-wide association studies with networks, which we can talk about. I'm not going to go through these in a lot of detail, but each one of these has a Cytoscape plugin available, actually, that you can go and check out. We'll talk a lot more about that later.
OK, so I mentioned that it's always important to consider how biologically relevant these networks are. Networks are a model of how we think the cell is working. They only capture a certain type of information: relationships, often between genes. I think it was mentioned yesterday how you can have different network mappings, and I'll just mention that briefly to remind you that the nodes and edges can represent different things. But what's almost always missing is information about dynamics. Networks are often representing some static image, so they're not showing you how things are changing over time. There are actually a lot of ways of representing this information using more detailed mathematical models. We're not covering that in this course, but there's quite a lot of software, tools, and resources out there for doing mathematical simulations of pathways and biological systems. Typically, these are not very applicable to genomics data.
The reason is that they don't cover a lot of genes. So usually mathematical models of a particular system are very focused and only include a specific system of interest. And so when you have your genomics experiment and you want to do some mathematical modeling, you have a big problem because only a tiny fraction of the genes that you've found are interesting in your experiment. You don't have any information at this level of detail. So that's why we're not really considering that here. There's a lot of detail missing often. For instance, we represent often proteins as these nodes, which are often visualized as circles. Well, proteins have a lot more structure. They have domains and 3D structure. Often that structure is known somehow. And the context is often missing. So this is related to dynamics, but it's sort of which parts of the network are active at which times. In which tissues or which cells, which developmental stages. So it's always good to just note that that kind of stuff is not usually represented in a network diagram. Okay, so just to summarize, networks are primarily useful for helping you identify relationships in large data that are otherwise could be hidden. It's very important to understand how the network, as was mentioned yesterday, how the network is structured. And if you define your, so one of the issues that, so one of the things that I, in the slides that I showed you with all the different analysis methods, one of the issues with all these analysis methods is that sometimes it's difficult to find the analysis method that's useful for your question. So in this course, we've tried to select a few that are generally useful, but there are many dozens of other ones. And so there are lots of methods available for network analysis of gene lists. It's important to define your biological question, so what you wanna do, and then you can try and find a method that is available already to answer your question. So that's probably the best way. 
The other way of doing it, which is more difficult, is to become an expert in lots of different methods, and then you'll just know how they can be applied to your data. So if you're ever stuck and you have a question, obviously we can address it in this class, but you can always email a mailing list, like the Cytoscape mailing list, and say, I have this data, I'm interested in answering this question, is there any network analysis tool available that helps me answer it?
Okay, so I'm gonna switch topics to network visualization and go over a few concepts that are important to understand. They're pretty basic concepts, but it's useful to go over them in a little more detail. There are different ways of representing a network. This is the common way, where we have nodes and edges and the connections are visualized nicely, but you may also see other types of network representations that are equivalent. They're just visualized differently. The basic way of representing a network as a table is what I explained earlier. If you just have a list of relationships and you load them into a spreadsheet, you can visualize them as a table like this. You might have some additional information associated with these connections, like the strength of the connection or what experiment was used to derive it. So this is a list of relationships. And then sometimes, and actually it's been quite popular with yeast genetic interactions, people visualize their interaction network as a matrix and color the squares based on, for instance, the strength of the interaction. So this is then a heat map. All these representations are equivalent. This blue interaction here is shown here. It's between A3 and A5. And this blue one is here and it's also here, because the matrix is symmetric for relationships that are symmetric.
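To make the equivalence between the table and the matrix concrete, here is a small Python sketch that turns a list of relationships into a symmetric matrix. It assumes each row is (node, node, strength). The node names follow the A1 to A5 labels on the slide, but the strengths and the helper names `to_matrix` and `density` are invented for illustration.

```python
def to_matrix(edge_rows):
    """Convert (node_a, node_b, strength) rows into a symmetric matrix."""
    nodes = sorted({name for a, b, _ in edge_rows for name in (a, b)})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for a, b, strength in edge_rows:
        # A symmetric relationship fills two squares: (a, b) and (b, a).
        matrix[index[a]][index[b]] = strength
        matrix[index[b]][index[a]] = strength
    return nodes, matrix


def density(node_count, edge_count):
    """Fraction of all possible node pairs that are actually connected."""
    possible = node_count * (node_count - 1) / 2
    return edge_count / possible if possible else 0.0


# Hypothetical table: one row per interaction, with a made-up strength.
rows = [("A1", "A2", 0.9), ("A2", "A4", 0.7), ("A3", "A5", 0.4)]
nodes, matrix = to_matrix(rows)

print(nodes)
for name, row in zip(nodes, matrix):
    print(name, row)
print("density:", density(len(nodes), len(rows)))
```

The density value is one rough way to think about which picture will read better: when most squares of the matrix are empty, drawing only the connections you have uses the space better, and when most squares are filled, the matrix view avoids overlapping lines.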
You have A3 connected to A5 and A5 connected to A3. So this matrix basically has all the nodes, or proteins or genes, on one side and the same set of proteins and genes on the other side, and you color the square wherever there's an interaction. Usually heat maps also apply some clustering to the rows and columns, to put similar rows together and similar columns together. You have probably seen that before. So why would one of these be more useful than the other? Does anyone know? So it might be easier to see modules in the heat map. It may be. Any other ideas? Usually the main reason, from a visualization point of view, why one might be better than the other is how sparse the data is. If you only have a few connections, it doesn't really make sense to draw the whole heat map, because most of it is blank. In the network, you only draw what you have. However, if you had everything connected to everything else, the network would not be very useful, because the connections would all be overlapping. People typically call that a hairball, which you've probably heard. The heat map is actually really good for that. So that's the main reason from a visualization point of view, where you are really concerned with how well the visualization is communicating information and how well it's using space on the screen, et cetera.
Okay, so network visualization is actually fairly straightforward these days, but it depends on a very important concept, which is automatic network layout. Everyone who has used Cytoscape, I think you tried the automatic network layout. You might not have thought too much about it. It's just something that you do by default, but this is actually a very important part of visualization.
If we didn't have it, then all of our networks would look like this. So this is a network that you would just draw if you were just loading up nodes or say these nodes of genes and connecting them all. Automatic network visualization gives us a nice, pretty picture. In general, these, no, and then lots of different network layout algorithms have been developed, again, from computer science and we're just copying them over to biology and using them. So one of the most common types of network layout algorithms is called the force directed. Sometimes it's called a spring embedded network or layout algorithm. And it tries to, in general, most network layout algorithms are developed to try to optimize the layout in a couple of ways. One, it tries to move nodes away from each other so they're not overlapping. And the second thing that they generally try to do is reduce crossings of edges. So if you have connections between genes that cross each other, the more times you have a crossing, the more complicated the network looks and it's difficult to trace the connections because often, if there's lots of crossings, you won't actually be able to kind of see how things are connected. So you wanna reduce the crossings. The ultimate reduction is that you have no crossings. That's not easy to achieve. In fact, it's impossible for a lot of networks. But that's the goal of most network layout algorithms. So these force directed layout algorithms, the way they achieve this is they use a physics-based idea where the nodes are represented as positive charges that repel each other. And the edges are somehow pulling each other. So maybe they're represented as gravitational forces where the nodes wanna come together, but the nodes are like charges so they're kind of pushing each other apart. 
Sometimes the edges are thought of springs, so they have, you may remember from high school physics, people have a spring constant and a resting length of the spring and the spring can kind of bounce and compress and bounce. So those concepts are used by these algorithms and the algorithms simulate a system like this and you can think of it as kind of taking a network that has nodes and edges that are kind of springy and you throw it up in the air and it's jumbling around and then when it lands, it kind of rests, comes to rest and it's figured out that nodes are, all the forces are figured out so that the nodes are far away from each other and the edges are pulling things together such that nodes that are actually highly connected are more likely to be close to each other. So this is sort of generally how these force directed algorithms work. They're excellent in general, but they're not great for very large networks so I'd say on average if you have up to 500 nodes maybe if you have a nice big monitor and a good computer you can visualize bigger networks but the more nodes that you have and also the more edges that you have in your network the harder it's gonna be for these network layout algorithms to find a good solution. So bigger networks tend to give hairballs and the only way to reduce that is to remove some nodes or some edges. So general advice whenever you're trying out the network layout is to use a force directed layout algorithm first. If you have, but it's not the only type of network layout algorithm that exists there are others that are specialized for certain types of networks so if you have a network that is more like a tree for instance it's a pedigree or it's a protein phylogenetic, protein sequence phylogenetic tree or the gene ontology hierarchy there are hierarchical layout algorithms which kind of create the, lay the trees out very nicely and there's other types of network layout algorithms as well. 
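The physics picture above (nodes repel like charges, edges pull like springs, and the system settles down) can be sketched as a toy simulation in Python. This is a simplified spring-embedder in the style of the Fruchterman and Reingold algorithm, chosen here only for illustration. It is not claimed to be the layout code that Cytoscape runs, and the force formulas, the cooling schedule, and all names are assumptions of this sketch.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} from a simple spring-embedder simulation."""
    rng = random.Random(seed)          # fixed seed so the layout is repeatable
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    k = math.sqrt(1.0 / len(nodes))    # preferred spacing between nodes

    for step in range(iterations):
        push = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: every pair of nodes pushes apart, like equal charges.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                push[u][0] += dx / dist * force
                push[u][1] += dy / dist * force
                push[v][0] -= dx / dist * force
                push[v][1] -= dy / dist * force

        # Attraction: each edge pulls its two endpoints together, like a spring.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            push[u][0] -= dx / dist * force
            push[u][1] -= dy / dist * force
            push[v][0] += dx / dist * force
            push[v][1] += dy / dist * force

        # Move each node along its net force, but never further than a
        # step limit that shrinks to zero so the system comes to rest.
        max_step = 0.1 * (1.0 - step / iterations)
        for n in nodes:
            length = max(math.hypot(push[n][0], push[n][1]), 1e-9)
            scale = min(length, max_step) / length
            pos[n][0] += push[n][0] * scale
            pos[n][1] += push[n][1] * scale

    return {n: (x, y) for n, (x, y) in pos.items()}


def mean_distance(layout, pairs):
    """Average straight-line distance between the given node pairs."""
    return sum(math.dist(layout[a], layout[b]) for a, b in pairs) / len(pairs)


# Hypothetical network: two triangles (a, b, c) and (d, e, f) joined by c-d.
genes = ["a", "b", "c", "d", "e", "f"]
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
layout = force_directed_layout(genes, links)

linked = set(links)
unlinked = [(u, v) for i, u in enumerate(genes) for v in genes[i + 1:]
            if (u, v) not in linked]
print("mean distance, connected pairs:  ", mean_distance(layout, links))
print("mean distance, unconnected pairs:", mean_distance(layout, unlinked))
```

In a layout like this, connected nodes should end up closer together than unconnected ones, which is why dense clusters become visible. Note that every iteration compares every pair of nodes, which is one reason this kind of layout gets slow and messy as networks grow.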
So general advice is to try force directed first and also try different layout algorithms and see which ones work best. You don't always have to rely on one and what works best is what's visually clear to you in a couple of ways that I'll tell you about later. Okay, so tips for better looking network. So if you're ever going to, I briefly mentioned this earlier if you're ever going to publish a network figure that you make then you really should adjust the layout manually. So these network automatic network layout algorithms work quite well but generally they don't do as good a job as you can do if you're moving things around yourself. So they give you a lot of help but then for publication quality images you usually can move things around. So one of the things that often happens when you do these layouts is that for instance you might have node labels like gene names on your network and some gene names are longer than others. The network layout algorithms typically don't consider the label. So the labels might be overlapping each other and so you might want to move the nodes around so that you can reduce that overlap. Whenever you have any kind of overlap you want to kind of reduce it. And another useful tip is to load the network into a drawing program like Illustrator, move the labels around, you might want to emphasize certain things, color things differently as I mentioned. Okay, so one of the big problems that people face with networks when they're, especially when they're really big as I've mentioned a couple of times is that they get a hairball affectionately called, sometimes people call these things something else. Somebody called this once the Death Star. But this is a really big network. I think there's over 3,000 connections in this network with hundreds, almost 1,000 proteins. And it's kind of difficult to see what's going on. So you definitely see certain things. Like there's a big connection here, there's a big section here that's highly connected. 
This section isn't as highly connected. Here's another sort of module maybe. But in general it's kind of hard to see. So if you are faced with something like this you really need to zoom into something or filter it in some way to get a better picture. So often you can, there's a couple of ways of doing this. So in this case I took a few proteins from this network and I zoomed in just to see the connections among those proteins. So these are proteins involved in cell wall integrity and in yeast. So if you have particular processes or sets of, your gene list for instance defines a set of genes that might be interesting and maybe your gene list is really big so you might have a way of focusing your gene list to a smaller list of genes. And you can just look at that list to see how the connections among the genes and that list to see how they're organized. Another way is to reduce the number of connections. Sometimes the connections have confidence associated with them and you can remove connections that are less confident and focus on the strongest signal and then you'll relay out the network, you'll see some structure and you can keep on doing that until you see some structure. Okay, it's an interactive process that really requires often a lot of tuning. It's not often that you can just press a button and get a nice network unless your networks are small. Okay, I mentioned as well that you can layer on a lot of different types of information on your network. Often this is done visually using different types of visual features so you can imagine that you have all sorts of different ways of representing nodes and edges. So we have different shapes for nodes, we can have different types of lines that connect to show connections and have different types of arrows. So here you might wanna say this is an inhibiting arrow, this is an activating arrow and there's color, size, shape and in this network that we've seen before we've used a bunch of these things. 
We've used color, size, the width of the edges are proportional to some data so there are a few different types of data here. One, transcriptional amplitude for size, this is a second type of data that we had, the color is a third type of data we had and then the actual data that kind of creates the structure of this network is protein interaction data. So there are four types of data overlaid kind of combined in this network and it's really up to your imagination how you want to visualize your data. You can choose to visualize data in any way. I mean, there's certain natural ways to visualize certain types of data. Like for instance, if you have gene expression data that's on a continuous scale, you might wanna visualize that as a color gradient but you could also visualize it as a size gradient so bigger nodes are more differentially expressed. It's really up to you. Okay, so as I mentioned briefly before, interpreting these types of networks, there's sort of four major ways that people, four major concepts that are important for that to look for in networks when you have a network. So one is the relationships between these different types of data. So here we can see that there's a highly connected region that's all green and the nodes are all big and the edges are all really thick connection. So this means that the nucleosome, we know that this is a nucleosome, it is highly expressed across some stage of the cell cycle. All of the genes seem to be co-expressed so they're tracking each other and they all have protein interactions. So we can quite easily see that pattern by looking at this network. It would be much more difficult to see that if all that data was present in tables. So those are relationships between data. 
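The continuous mappings mentioned above, such as expression to a color gradient or to node size, amount to rescaling a data value into a visual range. Here is a small Python sketch of that idea. The blue to white to red gradient, the size range, and the function names are all choices made for this example, not settings taken from Cytoscape.

```python
def rescale(value, lo, hi):
    """Map value from [lo, hi] onto [0, 1], clamping anything outside."""
    if hi == lo:
        return 0.5
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def expression_to_color(value, lo, hi):
    """Blue for low values, white in the middle, red for high values."""
    t = rescale(value, lo, hi)
    if t < 0.5:
        f = t / 0.5                      # 0 at blue, 1 at white
        r, g, b = f, f, 1.0
    else:
        f = (t - 0.5) / 0.5              # 0 at white, 1 at red
        r, g, b = 1.0, 1.0 - f, 1.0 - f
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255))


def expression_to_size(value, lo, hi, min_size=20, max_size=60):
    """Bigger nodes for higher values, between min_size and max_size."""
    return min_size + rescale(value, lo, hi) * (max_size - min_size)


# Hypothetical log fold changes for three genes.
for gene, change in [("geneA", -2.0), ("geneB", 0.0), ("geneC", 2.0)]:
    print(gene,
          expression_to_color(change, -2.0, 2.0),
          expression_to_size(change, -2.0, 2.0))
```

The same data column can drive either function, which is the point made in the lecture: which visual feature carries which data type is up to you, as long as the result is readable.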
Another central idea is guilt by association, which we'll talk a lot more about this afternoon with gene function prediction. Early on, people recognized that if you have a gene that's connected to another gene, those genes have an increased chance of being functionally related somehow, maybe as part of the same pathway or the same complex. You can see that here: blue genes are usually hanging out together. That's not always the case. Here's a gene that's blue that's not hanging out with these guys, and actually, when I looked this up, it was misannotated in the data that we downloaded to make this figure. So you might see some examples like that. Here's another one. You also find these dense clusters, so that's another type of thing to look for in networks. Dense clusters in protein interaction networks often represent complexes or pathways, and in other types of networks they may represent other things. The modules that were found in ReactomeFI often represent pathways. And then there are global relationships between sections of the network, so the nucleosome is more connected to the kinetochore. So these four concepts are useful for interpreting networks. They are also aspects of networks that you should look for when you're laying out or visualizing your network: if you can visualize your network in a way that makes these things more apparent, then that's useful. What the automatic network layout algorithms, like the force-directed ones, often do is pull genes or nodes together if they're highly connected, like these guys are pulled together because they're highly connected. You can imagine that if all of these things had springs that were pulling them together, they would end up like this in the algorithm.
So the force-directed layout algorithms are very good for helping people to see dense clusters. If you didn't have that, these nodes would be spread out everywhere. Okay, so when you are choosing different ways of visualizing your networks, you can think about these concepts and see if the choice that you've made for visualization is making them more obvious. Any questions? Okay, so that is pretty straightforward, I think, but important to cover. So just to summarize: automatic layout is very important for visualizing networks. There are lots of different network layout algorithms, so you can always try different ones to see which give a better result for your network. Networks help visualize interesting relationships in large data. You can avoid complicated networks by focusing your analysis, and visual attributes can help you overlay different types of data together and see how they're related. Okay, so yes? Yes, will we be going over how to assign, like, the IDs and things to the genes, in order to get the data that you're showing? You're showing a complicated network, right, with all the different stuff, you know, the edges and so on. Will we cover all that? So I'll show you that inside Cytoscape. The challenge there is that there are lots of different types of data that you could load up, and each person often has their own types of data. So there are general ways of loading that data up into Cytoscape, and then you can visualize it. So yeah, I'll show you how to do that. Okay, the only issue is getting the data. Some data is easily accessible, and some data you might have to collect from different places and put together. Yeah, sure, I guess just generally, are those basic features of Cytoscape or are those plugins that help you?
The visualization is a basic feature of Cytoscape, and so is pulling in the data: loading in data from generic tables is part of Cytoscape by default. Okay, so I'm going to switch to discussing Cytoscape. Cytoscape, as you know, because I think probably everybody's tried it out by now, as we assigned that before the course, is free software for network visualization and analysis. It was originally developed at the Institute for Systems Biology in Seattle in 2001, and since then it became an open source project, which meant that they gave away the source code that was used to develop the application, and lots of additional groups joined to help develop the software. My group in Toronto is one of around 10 groups around the world working on this. Trey Ideker, in San Diego, and Benno Schwikowski, at the Pasteur Institute, were originally at the Institute for Systems Biology, so they were the originators of this project, and lots of other people have joined in. The idea, as you guys probably know, is that most of the labs involved in tool development aren't really interested in being tool developers. They're interested in scientific questions, but they have to develop some tools to help them answer their questions. So if there are a number of labs that are interested in the same questions, they can share the workload by sharing the development of the tools, and each person spends less time developing tools and more time on their science. Through that effort, Cytoscape has grown to be software that is quite useful and fairly standard for network analysis and visualization.
There are lots of extensions, or plugins, and in Cytoscape 3, which we're not covering in this course right now, they're called apps, and I'll tell you more about that. The idea with Cytoscape, as I've already mentioned with networks in general, is that you load up network information and also experimental data into Cytoscape, and then you visualize it and analyze it and use it to answer some question that you're thinking about. So Cytoscape by default helps you manipulate networks: you can select a set of nodes and copy that to a new network. You can filter and query nodes and edges, it has lots of features for automatic layout, and it has some features for pulling in data from standard repositories, like searching interaction databases to get network information. But one of the most powerful parts of Cytoscape, the reason why it's popular other than the fact that it's free, is that there's a large community of developers and people using Cytoscape. It has 5,000 downloads per month and there are tens of thousands of users, and all these users have helped develop tutorials and case studies and publications, so there are lots of examples of how to use a lot of the tools in Cytoscape. There's a good mailing list for discussion; questions on that mailing list are pretty much guaranteed to be answered within a week, and often they're answered faster. There's a lot of data and documentation, and now there are over 160 plugins that extend the functionality of Cytoscape. These are all hosted now at the App Store, apps.cytoscape.org. This is very useful if there's a plugin that does something that you want; it's maybe not useful if you can't find a plugin. So if there is no plugin that does what you want, then you can always build your own.
That requires a lot of knowledge of Java programming right now, so either you have that knowledge yourself, or you have a friend who has that knowledge, or you hire a software developer. In the future this will probably be easier: future versions of Cytoscape will support programming apps in different languages like Perl or Python, so you'll be able to script things. Here's just a fun picture from the Cytoscape retreat a few years ago in Toronto, where people spelled out Cytoscape. And the next conference is in Paris in October, for people that are really interested. Okay, so Cytoscape is a useful free software tool for network visualization and analysis. It provides basic network manipulation and visualization features out of the box, and then you usually have to download plugins to extend the functionality, especially for analysis. Okay, so I've loaded up Cytoscape 2.8 here and I'm going to load up a couple of files. By default Cytoscape works with Cytoscape session files. A session file is a file format that Cytoscape uses to store all the information in a session. That information can include lots of different networks that you've loaded up, attributes on the nodes and edges, and all your settings for how you want things visualized, and all that gets saved into one file. You can only load up one session at a time. Each session is kind of like a project: you have everything set up for yourself for that project. If you want to load up another session, it will say you can only load up one session and it will switch from one session to the other. The session files are .cys, and a .cys file is actually a zip file, so if anyone's interested in looking inside, you just rename it to .zip and unzip it, and there are a bunch of files in there that you can look at.
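Since a session file is a zip archive, you can also peek inside one without renaming it. A minimal sketch using Python's standard `zipfile` module; the function name `list_session_contents` is made up for this illustration, and what the archive actually contains depends on the Cytoscape version that wrote it.

```python
import zipfile

def list_session_contents(path):
    """Return the names of the files stored inside a .cys session archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For example, `print(list_session_contents("my_session.cys"))` would list the entries of a session file called `my_session.cys` (a placeholder name).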
One thing that's not intuitive with Cytoscape initially is how you create your session file and how you get data into the system. You don't use this Open command to get data in. That's only for loading data once you've created a session file. To create a session file, you have to go one step back, which is importing data. So this is actually the first thing that you usually want to do when you run Cytoscape: import data from various places. The easiest place to import data from is a spreadsheet. This is the standard way that most people would import data. There are also ways of importing data from different web services and from files in different formats. The important thing to understand with importing data into Cytoscape is that there are two major types of data to import. One is the network and one is the attributes. The network is the connections between genes of interest, and the attributes are attributes of the genes, like we covered a couple of days ago. It could be gene expression, or it could be gene function, like gene ontology terms. Any type of information can be associated with nodes or edges. And actually one of the big questions from people is: okay, I have all my gene expression data, but how do I get a network? Where do I get the network information from? So ReactomeFI, which you looked at yesterday, is one place that you can get networks from. Another place is GeneMANIA, which we'll talk about this afternoon. GeneMANIA has hundreds of different networks, and there's a Cytoscape plugin that allows you to load them all in. So I'm going to just start with network information that we already have, and I'm going to select it from the set of sample files that come with Cytoscape.
So if you go to the Cytoscape directory, there is a sample data directory, which you probably have seen, that has a lot of different sample data. Mine's a little bit messy, so I always forget which one it is; let's see if this is the right one. So I've loaded up an Excel spreadsheet whose contents are previewed here, so there are different columns. These two columns represent the protein interactions that I have in this case, so I'm going to select those to import, and then I happen to know that there's a type of interaction here. This type can be anything you want. Any of the information associated with these interactions can be anything that you want, any kind of arbitrary information that you choose to load up. In this case these connections are either protein-protein, or sometimes it says PD, which is protein-DNA, so I'm going to load that up, and I'm also going to load up this column, which has a bunch of numbers associated with it. I have to select how the interactions are defined in this file, so which columns contain the source and the target. Column one is the source, column two is the target, and column three is the interaction type, so these are colored here now, and then this is just additional information here. A couple of tips on importing data: you can right click on a column heading and give a name to that column. You might also need to choose the type of data, so whether it's an integer or a Boolean value. I'm going to cancel that, because usually Cytoscape is good at guessing this; if it ever guesses wrong, you can always select the type there.
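The column choices just described (column one is the source, column two is the target, column three is the interaction type, and anything else is extra edge information) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cytoscape's importer is implemented; the function name `read_network_table` and its parameters are made up, and it assumes a delimited text table rather than an Excel file.

```python
import csv
import io

def read_network_table(text, source_col=0, target_col=1, interaction_col=2,
                       delimiter="\t", has_header=False):
    """Turn a delimited table into a node set and a list of typed edges."""
    nodes = set()
    edges = []
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    for i, row in enumerate(rows):
        if has_header and i == 0:
            continue  # the first line holds column names, not an interaction
        if not row:
            continue  # skip blank lines
        source, target = row[source_col], row[target_col]
        interaction = row[interaction_col]
        # Every remaining column is kept as extra information on the edge.
        used = (source_col, target_col, interaction_col)
        extra = [value for j, value in enumerate(row) if j not in used]
        nodes.update((source, target))
        edges.append((source, interaction, target, extra))
    return nodes, edges
```

With a two-line table such as `"A\tB\tpp\t0.5\nB\tC\tpd\t0.5\n"` (placeholder names), this gives three nodes and two edges, one protein-protein and one protein-DNA.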
Sometimes the columns have a heading, like you would often have in a spreadsheet. If you have a heading in your file, you can click this "show text file import options" and say "transfer the first line as attribute names". I don't have headings here, but if I did, they would jump up into these headings. You can see how that would work; it doesn't make sense in this case, so I'm going to turn it off. If you have other data at the top of your file, you can start the import at a row further down, and then there are certain other options here that you can select. So I'm going to just import this and we'll see what happens. Okay, so it worked: it successfully loaded 331 nodes and 362 edges. Now that I've loaded this data in, you can see that it's a hairball, right? I'm using these buttons here to zoom in and out. If you have a three-button mouse, or a trackpad on the Mac, you can use the normal zoom feature to zoom in and out, and with a three-button mouse you can use the middle mouse button to pan around. I think there's a control key for that as well, but it's not working on this laptop; typically, with a three-button mouse, you press the middle mouse button to move around. If you don't have that, you can always use this window to move around the network. Okay, so this network is what you would see if you just loaded in a network of interest that you already had in a table format. I'll talk a little bit more about why you would use this table format versus something else. And then the first thing that you would do is lay out the network in some way, so I talked about force-directed layouts. My favorite force-directed layout is called yFiles Organic, so I'm going to click that and it lays out the network.
So I'm going to zoom in to the network now, and as I zoom in you might notice that the labels aren't showing at first, but as I zoom in more the labels show. This is an optimization that Cytoscape uses that makes visualizing large networks faster. Basically it doesn't show all the details when you've zoomed out. But some of the default options here might not be your preference. If you're interested in always showing the detail, you can go to View, Show Graphics Details. This is improved in Cytoscape 3, but it's a similar idea, so you can show the graphics details, and now whenever you zoom in and out the labels are always there. All the graphical details are going to be there. That's sometimes confusing for people: they don't see the labels because they haven't zoomed in far enough. It's a force-directed layout, so it works the same way that I mentioned. Yeah. Yes, it's prettier. Yeah. So the reason why I like Organic is exactly that: it's a particularly good force-directed layout. It's not just tweaked parameters; the people that made it have different heuristics, rules that they use to figure out how to lay things out better. The only issue with it is that it's commercial, so it's not an open source product. We've purchased a license to include it in Cytoscape, whereas all the other layouts are reusable in other software, so it probably doesn't impact most users, but for a developer that's one issue. So yeah, yFiles is a company and they've done a pretty good job of making layouts, so their hierarchical layout, for instance, is quite good. If this were more of a tree structure, it would look better. Let me just go back to the Organic layout. Okay, but you can try the other layouts.
The default layout is the Cytoscape force-directed layout, and it's linked to this button here, so if you click it, it's pretty good as well. So here's the Cytoscape default one. You can see that some of the structure that I talked about in the network is visible after you lay this network out. Okay, so. I have one question. Yes? What about column six that you imported? How is that used here? Okay, that's a good question. I can show you how those columns were imported and where they end up. The attributes that you load up with the network are not shown here by default. They're present in here, and actually one of the annoyances of the Cytoscape 2 series is this data panel. Sorry, it's cut off here, let me change this. These panels can be turned on and off from this View menu, so I'm going to hide this results panel because it's taking up a lot of screen real estate. So this data panel here is where attributes are shown. One of the confusing parts when using Cytoscape initially is that there's nothing shown here, and really we should show the attributes that you've loaded. That's fixed in 3.0. But you can select the different attributes that you've loaded up, and usually I just click this button and select all attributes. You can only see what's in this attributes panel when you've selected nodes here. Another slightly confusing thing when you start using it is that there are multiple panels here, so you have to actually click these tabs to see the different ones. I don't know if everyone can see this, but this is the node attribute browser. I didn't load up any node attributes yet; I only loaded up edge attributes. So if I click this and I click this button to select all the attributes, now the columns that I had in my spreadsheet are loaded up here.
So this is the interaction type column and this is column six. Unfortunately this column six is all one number, so it's not really going to change things. I'll load up a better example with a lot more data to show you how visualization works with this data. So basically column six is like expression? It could be expression values, yeah. Exactly. But because it's associated with the edge in this case, a value like this might be the confidence or strength of the connection. If it was associated with nodes, then it could be gene expression data, for instance. So I just wanted to quickly mention: you guys probably saw that you can select nodes and move them around. I don't know if you saw this way of aligning and distributing nodes. You can rotate nodes, which is fun, but it's also sometimes useful. There are certain things that you can do with align and distribute that help you with publication quality images. If you're doing manual layout and you want to align a whole bunch of nodes in a line, you can just go here and say, okay, I want to have these all lined up like this and distributed like that. This didn't work out that well because there are too many nodes, but now they're all evenly distributed and lined up. So that's sometimes useful for manual layout. I'm going to turn those off. Oops, okay. You can also click these buttons here to move a panel somewhere else. If you have multiple computer monitors, that's sometimes useful. Let's see what else. Quickly: after selecting a set of nodes, you can create a new network based on them, from selected nodes, all edges. So here's a network that I cut out of the other network, and now if I lay this network out, I can see just the nodes that I've selected. So this is a useful way of focusing your analysis on just the nodes of interest.
And then this network that I created is a child of this network here, and I can go back and forth using this panel. There are also these windows here to move around. Okay, so I wanted to cover a few more things. One is filtering and the other one is visualization. I'm going to load up a predefined Cytoscape session file that already has a lot of information loaded. What's the schedule, Michelle? 10:15, okay. Yeah, okay, that sounds good. Before I do, I just wanted to mention that while I imported network information here from a spreadsheet, that's most useful for people that are generating their own network information. Not many people here are doing that. Well, some people are, I guess. If you're generating any data that creates interactions, like protein interaction mapping, or ChIP-seq data where you've run multiple ChIP-seq experiments and you have a transcription factor and all of the genes that it binds close to, those represent relationships and you might have them in a table, so you would definitely load them up by importing a network from a table. If you are starting with a gene list and your experiment doesn't generate interactions in any way, then usually you wouldn't load the network in from a table, because it's difficult to get network information from lots of databases and combine it in tables. It's much easier to use a plugin like ReactomeFI or GeneMANIA, as you'll see this afternoon, to provide a gene list and convert it to a network by just querying the database. So that's another confusing thing: which part of Cytoscape you use depends on where you're coming from. Hopefully that's clear now, yeah? Yeah, this is sort of a simple one. If you start with a gene list and use ReactomeFI to build the network, can you export that network as a table so that you can add values that are associated with your genes from your original list?
You don't need to export it. You can just import a new table. Sorry, you want to add values. Do you mean gene expression? Yeah. So my gene list is gene expression. Yeah. And I want to tie the fold change to all those nodes after I've gone and added the attributes through ReactomeFI, right? Yeah, so you don't need to export the table, ever. You can do everything you want within Cytoscape, and that's a good question because it shows how loading attributes works. So I've loaded a network from a table. If you have attribute data like gene expression data, and one of the columns is the gene name, for instance, and the gene name is also used for the nodes here, then you can load the expression data in with "import attributes from table". Let me do that right now. This is also sometimes confusing for people: "import network from table" and "import attributes from table" are very similar screens, so sometimes people get them confused. I'm going to import node attributes. I'm going to quickly go through here and find a node attribute table. I'm opening it up. This table has a whole bunch of information about these nodes, although here it's mostly gene ontology information, so let me see if I can find an example where it's gene expression data. I think this is it. No. So if these files that I'm loading up are not tab-delimited, they won't get parsed properly here. One of the things I can do is go to "show text file import options" and say, instead of a tab, use a space, and now these things get parsed correctly. Parsing is just a computer science term that means pulling data out of a regularly defined file. So now I have my gene name and some gene expression data.
This file has headings, so I'm going to import the first line as attribute names, so now these jump up here, and now I can pretty much import. Oops. Okay, by default it defines the first column to be the identifier that you're using, but you can also choose another column as your identifier column, so that if you have a spreadsheet with different columns, like Entrez Gene IDs and HUGO gene names, and your network uses Entrez Gene IDs or HUGO gene names, then you can match it up. Okay, so let's see what I did wrong here. Oh, duplicate attribute name. There are two attributes here, one is the gene expression value and the other one is the P value, and they just happen to be named the same, so I'm going to turn these guys off and import. Okay, so it loaded up all the data, and now when I select some nodes, go to the node attribute browser and click "select all attributes", I see all these attributes, and here's my gene expression data. So that's how you would do it, to answer your question. I think I'll stop soon, but I'll just show you a couple more things. Is it okay if I take a few more minutes? Okay. So there are two things I want to show you. One is filtering and the other one is visualization. This button here, and also the Filters tab in this panel, helps you define sections of the network that you want to select; it's basically querying the network.
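Going back to the attribute import a moment ago: the idea is that one column of the table is the identifier, and each row is matched to the node with that name. Here is a minimal Python sketch of that matching, including the duplicate-attribute-name problem that came up. The function names and the sample gene names are made up for this illustration; it assumes a whitespace-delimited table with a header line and keeps all values as text.

```python
def read_node_attributes(text, key="name", delimiter=None):
    """Parse a delimited table with a header line into {node_id: {attribute: value}}.

    delimiter=None splits on any run of whitespace (spaces or tabs).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split(delimiter)
    if len(set(header)) != len(header):
        # Two columns with the same name cannot both become attributes.
        raise ValueError("duplicate attribute name in header")
    key_index = header.index(key)
    table = {}
    for line in lines[1:]:
        fields = line.split(delimiter)
        table[fields[key_index]] = {
            name: value
            for j, (name, value) in enumerate(zip(header, fields))
            if j != key_index
        }
    return table

def attach_attributes(node_names, table):
    """Match attribute rows to network nodes by identifier; unmatched nodes get {}."""
    return {node: table.get(node, {}) for node in node_names}
```

If the network used a different identifier than the first column of the table, you would pass that column's heading as `key`, which is the same choice as picking another identifier column in the import screen.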
I won't go over this too much because it's in the tutorial, but you can create all sorts of filters that say, I want to select nodes that have gene expression higher than this, or gene ontology terms equal to that, or the nodes that are named this, and you can create Boolean combinations of filters. One thing I want to point out, which is an interesting tip, is this search field here. By default it searches one column in the attributes, the node name. I type in something and press enter, and it zooms into the region that I've selected. But if you click this "configure search options" box, you can set it to search any kind of data that you've loaded up. So I'm going to select the gene expression data that I've loaded up to search, and I'm going to apply, and now this changes into a range slider. I'm going to zoom out here so you can see it. Now I'm selecting nodes based on gene expression. I want to select all the nodes that are highly expressed. I can just select them here, and there are only a few that are lighting up. I can click this button to zoom into the set that's lit up. Because they're all over the place it doesn't zoom in that much, but you can see that as I move this it selects more. So this would be a way for me to select all the overexpressed genes and then move them to a new network and focus in on them. You can do similar things with this filters box. It's just more complicated, and there's a tutorial that you can follow. Okay, so the last thing I want to show you is the visualization. We go to this VizMapper tab, and this is also mostly covered in the tutorial, but just quickly: you can take data from the attributes that you've loaded up here and map it to colors or shapes or anything. So I'm going to take this column of gene expression data and map it to color.
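The two steps just shown, selecting nodes whose expression falls in a slider range and then moving them to a new network with all the edges among them, amount to a range filter followed by taking an induced subnetwork. A minimal Python sketch, with made-up function names and assuming the expression values are already numbers:

```python
def select_by_range(attributes, low, high):
    """Return the nodes whose attribute value lies within [low, high]."""
    return {node for node, value in attributes.items() if low <= value <= high}

def induced_subnetwork(edges, selected):
    """Keep only the edges whose two ends are both in the selected node set."""
    return [(a, b) for a, b in edges if a in selected and b in selected]
```

For instance, with expression values `{"A": 2.4, "B": 1.9, "C": -0.3}` and edges `[("A", "B"), ("B", "C")]` (placeholder data), selecting the range 1.5 to 3.0 keeps nodes A and B and the single edge between them.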
So I have to find the visual attribute that I want here, which is a little bit annoying because there's a whole bunch of them. My lab is currently working on redesigning this, so I'm sensitive to these issues. But I found node color and I double clicked on it, where it says "double click to create a mapping", and then it gives you a few options. By default it says ID is the column to map. I actually want to map gene expression data to node color, so I'm going to click on this guy here, and then there are different types of mapping: continuous, passthrough or discrete. Again, these are covered in detail in the tutorial. The one that we want to use here is continuous. That's where you have some continuous-valued data that you want to map to a continuous visual attribute. As soon as I do that, it creates a default color gradient that goes from the smallest number to the biggest number. So in this case, the genes are automatically mapped to a grayscale based on their gene expression data, where white is high, overexpression, and black is low, underexpression, compared to control. But I can click on this and change these colors. So I can change this color to red, and now it's a different type of scale. I can also add additional points here, so I can create something that's white at zero and then, at high values, green or something like that, and I can double click on these things to change the color. As I do that, it usually updates itself automatically. So I'm going to press OK. Occasionally this system gets into a state where it isn't updating automatically. I find that it's just because some button hasn't been pressed in the right order. So sometimes you just have to reset this to get it working again.
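A continuous mapping like the one above is linear interpolation between a few control points, each pairing a data value with a color. Here is a minimal Python sketch; the function name `continuous_color` and the example control points are made up for this illustration, and it assumes the control points are sorted and have distinct values. It is not Cytoscape's own implementation.

```python
def continuous_color(value, stops):
    """Map a number to an (r, g, b) color by interpolating between control points.

    stops is a list of (data_value, (r, g, b)) pairs sorted by data_value,
    with no two pairs sharing the same data_value.
    """
    # Values beyond the outermost control points take the end colors.
    if value <= stops[0][0]:
        return stops[0][1]
    if value >= stops[-1][0]:
        return stops[-1][1]
    # Otherwise find the surrounding pair and blend their colors linearly.
    for (v0, c0), (v1, c1) in zip(stops, stops[1:]):
        if v0 <= value <= v1:
            t = (value - v0) / (v1 - v0)
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))
```

With control points such as `[(-2, (255, 0, 0)), (0, (255, 255, 255)), (2, (0, 255, 0))]`, values near zero come out close to white and strongly positive values come out green, in the spirit of the three-point gradient built by hand above.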
And sometimes a workaround is to toggle the graphics details, hide and show, and then everything gets reset in the visualization system. So that's the basics of visualization. There are lots of different options here that you can try out. The Plugins menu allows you to load in plugins, which hopefully you've already seen before the workshop when you loaded up your ReactomeFI plugin. And I think that's it. Okay, any questions? Yep. Okay, so during the break a couple of people asked questions, so I'll just review some of those. One question is, can you undo in Cytoscape? And yes, there is an Edit, Undo function. It doesn't undo every action, but most actions, like moving nodes and laying things out, are undoable. Some of the plugins that you run aren't: if the people who programmed the plugin didn't implement an undo function, it's not undoable. So you can just check whether what you're doing is undoable. Otherwise you can always save your session. People also asked about Cytoscape Web. If you are programming a website and you want to visualize a network on the website, there's a Cytoscape Web system that helps you do that. It's not a tool like Cytoscape that runs on the web. It's just a programming library, but it's useful for some people. Okay, the other questions are more specific, so if other people have them, we can repeat them. So I just gave you a quick summary of Cytoscape and some tips about features that I think are useful. This is the typical workflow that you would follow for network analysis. As I mentioned, if you already have a network, you can start with that network and load it up into Cytoscape. If you don't have a network and you have a gene list and gene attributes like gene expression data, you need to load that into Cytoscape and convert it to a network.
And so these are tools here that help you convert your gene list into a network. One that's missing here is ReactomeFI; it's actually mentioned down here. But then once you have your data in Cytoscape, yeah? So if you have non-human data: ReactomeFI is only human. GeneMANIA supports seven model organisms, and a couple of additional ones are coming online. Quaid will probably mention them this afternoon. So it has human, mouse, rat, yeast, C. elegans, Drosophila and Arabidopsis, and it's going to have E. coli soon, and zebrafish, and possibly this year Tetrahymena. So those are the ones that are supported. And I'll mention a couple of other ones. Another one is STRING. It's not as user friendly as GeneMANIA, and I'm sure Quaid will mention it this afternoon. STRING is a website that has a lot of similarities to GeneMANIA, and it supports basically all sequenced organisms, so all organisms that have a genome sequence. Quaid will probably tell you the differences between them. The GeneMANIA plugin for Cytoscape is a lot better than the STRING plugin for Cytoscape, so that's one of the reasons we're focusing on GeneMANIA. Any other questions? Okay, so these systems help convert a gene list into a network. Then you might have different types of networks and you want to visualize them. Cytoscape itself handles all of the visualization that you need. You don't need to download any plugins. And then you can do different types of analysis, like the pathway enrichment analysis we did. In Cytoscape, there's a plugin called BiNGO that helps you do that. Otherwise you do it on the web, like we talked about yesterday. Gene function prediction we'll talk about this afternoon with GeneMANIA and STRING. For module detection, ReactomeFI has some built in. There's also active modules and clusterMaker.
We're not really discussing these, but what I've done is put some slides for these other plugins in the presentation, and during the lab you can go and look at them if they're interesting to you. Also, if you have a regulatory network, one of the things that people like to do with regulatory networks is look for motifs like feedback loops or feed-forward loops. So if you take a network that's generated from the tools that Wyeth mentioned, there's a tool in Cytoscape called NetMatch which helps you identify these little motifs. Okay, so I'm not really going to go over the next few slides in detail. As I mentioned, they're there to give you a quick intro to some of the other plugins that we've found useful, but not generally useful enough for everybody to cover in a course with the time that we have. So VistaClara, for instance, helps you visualize lots of different gene expression data at the same time, and you can play a movie. So in Cytoscape you can follow this. Each one of these slides basically has an intro to the tool and then a lab that you can follow if you're interested. So during the lab time, if one of these tools is interesting to you, you should be able to find a little lab here and follow it. BiNGO does enrichment analysis. You get a visualization of the gene ontology that looks like this, and the gene ontology terms are colored by how enriched they are. So that's sometimes an interesting visualization. Cerebral is a little bit outdated, but it's an interesting tool for visualizing different conditions and time points of gene expression data on a network, so it shows you different plots. Active subnetworks is a tool for finding regions in the network that are connected and significantly differentially expressed across multiple conditions.
So it's trying to find, given a network that you have plus some gene expression data across multiple conditions, a region that's active across all those conditions or across some subset of conditions. For network clustering, there's a tool called MCODE that can find clusters in a network, and there's also clusterMaker, which is mentioned on the flowchart that I showed you. Say you don't have any network information available from GeneMANIA or Reactome, because you're working on an organism, for instance, that isn't even completely sequenced, but you know that there might be some literature information. One way you can access that information is by converting network information available for a nearby organism via orthology. Another way is to automatically try to extract information from the literature. There's a plugin called Agilent Literature Search, which allows you to type in a set of genes, and you can also type in an additional keyword, like here it's atherosclerosis, or a specific type of cancer or context. Then it does a PubMed search to find abstracts that mention those genes, and it looks for relationships described in the abstracts. So you might find a sentence that says gene A binds to gene B or gene A regulates gene B. It will extract that information and draw a network for you, and then you can actually look at the sentences that it used by right-clicking on an interaction and seeing the sentences that were used to create that interaction, and you can curate it by saying, oh, I don't trust that sentence, I'm gonna delete it. So, okay, network motifs I mentioned, so I'm just gonna skip over that. This is just a little lab that shows you how that works. And then the last slides in this section are Cytoscape 2.8 tips and tricks, which again I won't go over, but you can read through.
If you're using Cytoscape 2.8 a lot, there are a few different problems that people sometimes have. I've tried to mention a couple of things already, but you might find some interesting knowledge here. It's probably more useful if you're using Cytoscape often. Okay, so I'm gonna switch to another presentation to introduce the lab. Okay, so we have like an hour and a half for lab time, and I'm gonna go through this lab fairly quickly and then you guys can try it out. So this is focusing on a particular plugin in Cytoscape that is useful for helping to visualize and interpret the results of the pathway enrichment analysis that you did yesterday. So we learned yesterday that an enrichment test is very useful, and we learned how it works, and this is an excellent idea. More than 10,000 papers have used this method to help interpret their data. And as you saw yesterday, you get this big table of pathways and how enriched they are, and one of the issues with looking at the data like this is that there are actually relationships between these pathways. So for instance, a bunch of these pathways, B cell mediated immunity and myeloid cell differentiation, are related to the immune system. They don't always say immune system, but if you know enough about biology, you can recognize a lot of different relationships here. And so presenting data in this table is actually not a great way of presenting it, because there are a lot of overlapping pathways and a lot of pathway crosstalk, a lot of genes that are part of more than one pathway, and the information about a specific theme like immune system is just spread out all over the place here. So if we have a table where there are relationships between parts of the table, what's a good way of visualizing it? What's a good way of visualizing relationships in a table? A network, exactly. Okay, so this is what this Enrichment Map plugin does.
It visualizes that table as a network, and this is software, a method, that my lab has developed. There are actually a couple of methods out there like this. Another one is called ClueGO, which has a Cytoscape plugin available, which is pretty good as well. So the idea here is that instead of visualizing all the gene sets or pathways as a big table, you can visualize them as nodes, and you can see how they relate to each other, because they might be related by sharing genes, like two pathways that have a lot of genes in common. Okay, so in this presentation, the enrichment analysis technique that I'm using to show the examples is Gene Set Enrichment Analysis, or GSEA. As was mentioned yesterday, this is where you can input a ranked gene list and there's no threshold that you need to set. And the important thing with GSEA is that it finds pathways that are enriched in the up-regulated set and pathways that are enriched in the down-regulated set. So these are pathways that are going up and pathways that are going down. And the enrichment map takes the table of your pathways and their enrichment p-values and visualizes it as a network like this. So each node is not representing a protein or gene in this case; it's representing a whole set of genes, or a pathway. And then the edges connect pathways that have a certain number of genes in common, typically using this overlap score. So if these are the genes in pathway A and these are the genes in pathway B, and they have a certain number of genes in common according to this simple score, then that gets translated to an edge width. And so the thicker the edge, the more genes are in common between these pathways. And then the enrichment color: because GSEA gives you up and down, that's mapped to red and blue. So red is up and blue is down. And then the intensity of the color here is proportional to the significance.
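As a rough illustration of the edge and color rules just described, here is a minimal Python sketch. It is not the plugin's code. The transcript does not reproduce the exact formula on the slide, so the sketch assumes the overlap coefficient (shared genes divided by the size of the smaller set); the function names, the 0.5 edge cutoff, the 0.05 p-value cutoff, and the toy gene sets are all made up for the example.

```python
from itertools import combinations


def overlap_coefficient(genes_a, genes_b):
    """Shared genes divided by the size of the smaller set (assumed score)."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def build_edges(gene_sets, cutoff=0.5):
    """Connect every pair of pathways whose overlap score reaches the cutoff.

    Returns (pathway_a, pathway_b, score) tuples; a viewer could draw the
    edge width in proportion to the score.
    """
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        score = overlap_coefficient(gene_sets[name_a], gene_sets[name_b])
        if score >= cutoff:
            edges.append((name_a, name_b, score))
    return edges


def node_color(direction, p_value, p_cutoff=0.05):
    """RGB color: red for 'up', blue for 'down', darker when more significant.

    A p-value at or above the cutoff gives white (not enriched).
    """
    intensity = max(0.0, 1.0 - p_value / p_cutoff)
    fade = round(255 * (1.0 - intensity))
    if direction == "up":
        return (255, fade, fade)
    return (fade, fade, 255)


if __name__ == "__main__":
    # Toy gene sets, invented for illustration only.
    toy_sets = {
        "pathway_A": {"g1", "g2", "g3", "g4"},
        "pathway_B": {"g3", "g4", "g5"},
        "pathway_C": {"g9"},
    }
    for edge in build_edges(toy_sets):
        print(edge)
    print(node_color("up", 0.001))
```

With the toy data, only pathway_A and pathway_B share enough genes to be connected; pathway_C stays an isolated node.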
So more significant pathways are colored darker colors. Okay, so I'm just gonna give an example. There are three different ways that Enrichment Map is useful. One is to visualize the results of a single enrichment, like you did yesterday. So this is an example where we took some gene expression data from this paper where they were looking at breast cancer cells and their response to estrogen. So they treated the cells with estrogen, they collected gene expression data on three samples, and did the same thing for controls that were untreated. And then we compared these to find differentially expressed genes and ran an enrichment analysis, GSEA. Then we got this big table and visualized it as an enrichment map. So the Enrichment Map Cytoscape plugin was used to draw this picture automatically. Actually, what Enrichment Map does is draw all the nodes and the edges and do all the coloring. It doesn't do these bubbles here. The bubbles and the labels are currently added manually afterwards for publication quality images. We're working on a future version that can try to draw these bubbles automatically. It's not always easy. But what you can see immediately here is that instead of seeing hundreds of pathways, we can see many fewer themes, because each node is a pathway and a lot of nodes are related to each other. So all of these nodes, all these pathways, are somehow related to RNA transport. And that is now more visible as a general functional theme, rather than having these pathways spread out all over the table. So what this really does is give you a very quick visual summary of your gene expression data, in this case, in terms of pathways. Again, we're always using gene expression data as an example, but it doesn't have to be gene expression data.
One advantage of Enrichment Map is that it can be used for any type of enrichment test that you have. So if you have GWAS data or methylation data and you've done your pathway analysis, you can load the results into Enrichment Map. Okay, so zooming in on one of these clusters here, you can see the actual Gene Ontology terms that are associated with each node. Okay, so that's fairly straightforward. The second use of Enrichment Map is comparison of two enrichments. So this is something that's actually not possible in any other tool that I know of. In this case, the paper that we looked at actually had multiple time points, and they were interested in seeing the things that were differentially expressed between an early time point and a late time point in this estrogen treatment experiment. We used the Gene Ontology as our gene set database, and we did pathway enrichment analysis on this time point and this time point. So we had two pathway enrichment analyses, and then we loaded them up as an enrichment map where the early time point enrichment is mapped as the center of the node and the late time point is mapped as the border of the node. Here's a node that has a bright red border and a white center. So white means that there was no enrichment at the early time point in this pathway, but the red border means that at the late time point this pathway was really enriched and up-regulated. So if you're interested in looking at pathways that were differentially enriched between two different time points, you might notice that a lot of the nodes are red throughout, so that means that there's not really any difference between the two time points. Here's a section where there's a bunch of nodes that have bright red centers and white borders, so the bright red center means enriched at the early time point and the white border means not enriched at the late time point.
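The center-versus-border reading described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how the plugin is implemented: the result format (a dict from pathway name to a q-value), the 0.05 cutoff, the function names, and the pathway names are all assumptions made for the example.

```python
def is_enriched(q_value, q_cutoff=0.05):
    """A pathway counts as enriched when it has a q-value at or below the cutoff."""
    return q_value is not None and q_value <= q_cutoff


def compare_enrichments(early, late, q_cutoff=0.05):
    """Label each pathway by where it is enriched.

    `early` drives the node center and `late` drives the node border, so
    'late only' corresponds to a white center with a colored border.
    """
    labels = {}
    for pathway in sorted(set(early) | set(late)):
        in_early = is_enriched(early.get(pathway), q_cutoff)
        in_late = is_enriched(late.get(pathway), q_cutoff)
        if in_early and in_late:
            labels[pathway] = "both"
        elif in_early:
            labels[pathway] = "early only"
        elif in_late:
            labels[pathway] = "late only"
        else:
            labels[pathway] = "neither"
    return labels


if __name__ == "__main__":
    # Invented q-values for illustration only.
    early = {"pathway_1": 0.001, "pathway_2": 0.40, "pathway_3": 0.01}
    late = {"pathway_1": 0.002, "pathway_2": 0.003, "pathway_3": 0.60}
    for name, label in compare_enrichments(early, late).items():
        print(name, label)
```

Pathways labelled "early only" or "late only" are the ones that would stand out on the map as nodes whose center and border colors disagree.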
And here's a section, ubiquitin-dependent protein degradation, where the reverse picture is seen. So there seems to be a lot of change happening here, but in the rest of the map there's not that much change. So this visualization method makes it really easy to see this pattern. If you were looking at tables, you'd have to match up all the tables together, and it would take quite a long time, as I'm sure you can imagine. So this is again just an example of how network visualization can be useful to quickly see patterns in data. So zooming in on this little section here, if you have gene expression data, and other types of genomics data could be loaded up in similar ways just with the appropriate formatting, the Enrichment Map tool allows you to click on a pathway and see the genes in that pathway and a specific heat map for that pathway. So I clicked on this node here and I got this gene expression heat map, and you can see that, wow, there's a really big difference between treated and untreated at 24 hours, and at the early time point treated and untreated don't have a very big difference. So that's why this node has a bright red border and a white center, because this pathway is not really enriched at the early time point and it's enriched in differentially expressed genes at the late time point. This is the reverse picture here. You might notice that these patterns are a little bit... the big thing here is that you're looking for a difference between experiment and control. The direction of that difference, so green is up and purple is down, whether the genes are all up or all down, is not really visible from this enrichment analysis. You just see that there's no change here, no change here. Any questions? Okay, so the third and last use of Enrichment Map is what we call query set analysis.
So there's a number of different biological questions that this can help answer, but the idea is that you've done your pathway enrichment analysis and you've visualized your results as an enrichment map. In this case we took gene expression data from mouse heart tissue that was published in this paper, and these investigators had knocked out a microRNA, which, as most of you probably know, is a negative regulator of gene expression. So if you knock out a microRNA, you expect the targets of that microRNA to now be generally up-regulated, because the negative regulator has been removed. So we found that there were a lot of pathways that were up-regulated and some pathways that were down-regulated. So now we wanted to know, how do the pathways that were up- and down-regulated relate to the microRNA targets that are predicted in a microRNA target prediction database? We used TargetScan, I think, for this one. So we had a set of predicted microRNA targets, and that's another gene set. So just like all these pathways are gene sets, the predicted targets of a microRNA are a gene set, and we represented that gene set as another node, this query set. We queried this enrichment map with this additional set. We said, okay, how much overlap is there between this set and all these pathways? And these additional lines that are drawn in the enrichment map show the overlap between the targets and the genes in the pathways. So certain pathways, like this vesicle trafficking pathway, have a lot of microRNA targets, so they get thick lines, and other pathways that are going up, like translation, don't have any microRNA targets in common, and the pathways that are going down don't have any microRNA targets in common. So that makes sense. So the targets are kind of focused in pathways that are going up, which makes biological sense, but not all the pathways are linked to targets.
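The query set step described above amounts to intersecting one extra gene set with every pathway already on the map. Here is a minimal Python sketch of that idea, with made-up gene and pathway names; the function name and the `min_shared` threshold are hypothetical and are not taken from the plugin.

```python
def query_set_edges(query_genes, gene_sets, min_shared=1):
    """Find pathways that share genes with a query set.

    Returns a dict mapping pathway name to the sorted list of shared genes.
    The number of shared genes could drive the thickness of the extra edge
    drawn from the query-set node to that pathway.
    """
    query = set(query_genes)
    edges = {}
    for name, genes in gene_sets.items():
        shared = query & set(genes)
        if len(shared) >= min_shared:
            edges[name] = sorted(shared)
    return edges


if __name__ == "__main__":
    # Invented data: predicted targets of a microRNA and three pathways.
    predicted_targets = {"t1", "t2", "t3", "t4"}
    pathways = {
        "vesicle_trafficking": {"t1", "t2", "t3", "x1"},
        "translation": {"x2", "x3"},
        "down_regulated_pathway": {"x4", "x5"},
    }
    for name, shared in query_set_edges(predicted_targets, pathways).items():
        print(name, len(shared), shared)
```

In this toy case only the vesicle trafficking pathway gets a query-set edge, mirroring the pattern in the talk where the targets concentrate in some up-regulated pathways and are absent from others.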
So we might infer from this that certain pathways are directly regulated by this microRNA and other ones are not, because they don't have the targets. So the biological question that we asked here is, given the pattern of pathways that are going up and down, and the fact that we've perturbed this microRNA in our experiment, how can we explain the effect of this microRNA in terms of physical connections between the microRNA and its targets by using this additional information? Yes? Yes. Yes. So you can use this for transcription factors as well. So taking what you learned about transcription factors, you could do a similar analysis. One of the things that we want to implement as an automatic search system, though we haven't done it yet, is to search a whole bunch of transcription factors and see which ones best explain the results. And that would help tell you which transcription factor might be regulating your gene of interest. So right now we don't have that automatic search system, so you could manually take some transcription factors that you might know are interesting. Maybe you know something about your system that gives you a hint about a transcription factor, or you use the tools that Wyeth mentioned on Monday to do that search automatically, and then you can take those transcription factor targets and put them in here and see how they explain the pathways. So we've done that a few times. Actually, it's worked out quite well with certain projects that we have. Okay, so the Autism Spectrum Disorder map that I showed you in the introduction on Monday morning actually used this Enrichment Map idea and used all these features. So the circles here are pathways that were enriched in the copy number mutations, if you remember the study. We also had additional sets of genes that we knew were important.
So there were genes that were known to be associated with Autism Spectrum Disorder, and genes that were known to be involved in intellectual disability. So we added those as query sets. We also did pathway enrichment analysis on those gene sets, because some of them were quite large. I think there were 200 genes in here or so. And so all of these triangles represent pathways that were enriched in these genes. These parallelograms were pathways that were enriched in autism genes, and then we can see how they overlap each other. So this is a bit of a complicated enrichment map, but it uses all the features that I mentioned. Okay, so the gene set sources that were used in the Autism Spectrum Disorder case were Gene Ontology, pathway databases like Reactome, and Pfam domains, for your information. So this is the Enrichment Map plugin in Cytoscape. This is where you enter your data. Once you generate your enrichment map, it's displayed here, and you can click on nodes to see the heat maps. If you're using Gene Set Enrichment Analysis, you might know that there's a feature called the leading edge, which basically identifies the genes that are providing the strongest signal for the enrichment. And those are highlighted here if you load data from GSEA. And then you can interactively change the p-value and q-value cutoffs, and this will update automatically. So this is also a good way of exploring your pathway enrichment analysis, because you can change those things interactively. So I mentioned that Enrichment Map gives you a very nice visual summary of your pathway enrichment results. And that's great. Usually the way that you would use that map is to quickly identify functional themes that look interesting to you based on your knowledge of the system. You might see things that are well-known, so they're not interesting.
You might see things that you didn't know about but that look like they're linked to the phenotype that you're studying, and so those might be interesting. And you might see a bunch of stuff where you have no idea how it's linked to the phenotype, and so maybe that's potentially really interesting, but it doesn't have a link, so you're not gonna follow up on it. So that's normally the way people think when they are looking at these enrichment results. So once you've identified something interesting, you'd like to zoom in on it and look at the genes, look at the gene expression. So for instance, we find that in this map there's a region that looks like it's changing a lot, and it incorporates a lot of gene sets that are pathway information that comes from Reactome, and one of the pathways is Reactome apoptosis. So then what you'd like to do is download the apoptosis pathway from Reactome and overlay your gene expression data on the pathway. And so we've changed from a network where the nodes represent pathways to a network that is a pathway, where the nodes represent proteins, and we have overlaid gene expression data on here. And then you can zoom in, and you might find that all the differential expression signal is actually coming from one region of this pathway. So zooming in to this level to get a more detailed mechanistic understanding of the genes in your gene list is kind of a path that most people would wanna take, and we're currently working on making this easier. You can do all of this in Cytoscape manually, but we'd like to have it more point and click in the future. I mentioned that one thing that Enrichment Map doesn't do right now is automatically circle the regions here, what I call functional themes, and label them.
So we've developed another plugin called WordCloud, where you can select a set of nodes, and if these are all Gene Ontology terms, for instance, the WordCloud will show you the most frequent words using this word cloud visualization, which you might have seen used on the web a lot of times as tag clouds. So the more frequently a word appears in the Gene Ontology terms, the bigger it's shown here, and so this is a signaling-related cluster and there are different signaling pathways in here. So this is sometimes useful for navigating an enrichment map. Okay, so that's basically it. Just to acknowledge Daniele Merico, who came up with this original idea, and Ruth Isserlin, a resource assistant in my lab who developed this plugin. She really liked developing this plugin because she's using it a lot for her own analysis, and she was excited enough about it that when she was presenting at a lab meeting she baked an enrichment map cookie. I can tell you that this was really good, so I can tell you enrichment maps are useful and also tasty, if you ever eat one. Okay, so we're moving to the lab now. So the next hour and a bit is, I guess we're finishing at 12:15 or 12:30? 12:15. Okay, so the next hour or so is the lab, and I'm not gonna do a demo, I'm just gonna let you guys try and follow the lab. In this lab you can do a few things. So the main activity in the lab I'm proposing is to try out Enrichment Map. So you need to follow these steps. If you haven't already done it, you need to load the Enrichment Map plugin from the plugin manager in Cytoscape. So this is a Cytoscape plugin, so you load Cytoscape, install the Enrichment Map plugin, and then you can load in your results from DAVID or g:Profiler, the results that you had yesterday. But there's also a bunch of tutorials that we've made available on the Enrichment Map website, and Michelle has printed them out for you so you can follow them.
So there's a number of different paths that you can take. The default path is to take the data that you created in yesterday's lab and load it into Enrichment Map. So you can take the DAVID results. If you don't have them, you can recreate them and save them, and then load them up into an enrichment map. Or you could follow one of the tutorials. The good thing about the tutorials is that they have data available for download where all the data you need is right there, including gene expression data, and you can just follow the tutorial to load the data up. We also very recently added a tutorial that shows you how to take the liver data that you had in the integrated assignment and load that data up as an enrichment map. And then finally you can try your own data. And then the other thing you can do in the lab is just try out Cytoscape, try out some of the different plugins from the slides that I presented earlier, or ask questions about the data that you have and what kind of analyses I could recommend for it. Okay, so for the next hour just try out those things, and if you have any questions, put up your hand.