Good morning everyone, is everyone hearing me alright? Excellent. Good. Looks like it's working both on Zoom and on YouTube. Let's hope it keeps working this way. And how's the video quality? I know it can be a bit choppy on Zoom. Okay, it looks to me like it's pretty decent on YouTube at least. Where am I? I'm at home. Being the geek that I am, I've pretty much been making a streaming setup that might make Twitch streamers envious. So we are talking multiple camera angles here, slide sharing, screen sharing. Let's hope it all works. This camera is out of focus. Hello. That's more like it. I'm trying. I'm trying. I have to say this has been a great opportunity to learn OBS Studio, YouTube live streaming, lots of different things, and be able to call it work. For those on YouTube asking: the Zoom session is only for the official course participants at KU, both to give priority to the people who've actually signed up for the course and to avoid risking running into the limit of having at most 100 participants. Maybe I could do a tutorial on this, streaming live on YouTube and simultaneously teaching over Zoom. Well, I can do a tutorial on that if it turns out to work. Let's hope it does. It's not too complicated really. It's just a matter of recording everything in OBS Studio, which is open-source software used by lots of streamers. In OBS you set up a virtual camera, and then you set that camera to be your input camera for Zoom. And inside OBS you can record locally and you can stream to YouTube and other streaming services. There's a bit of software to set up other than that, but it's not really too hard. Yeah, thanks Klaus. Hope things have been running smoothly the past couple of days. Also, one small thing: because I'm at home and I don't have a dual-screen setup, once I start streaming the slides, I cannot see the Zoom chat. So if you're asking questions there, I unfortunately can't see them while I'm showing the slides.
Yes, speaking of slides, maybe I should get them up. Yeah, thanks Nadja. So Nadja is a postdoc from my group. She's one of the several people who will be helping out today, especially when it comes to the exercises. Good, I see Katarina here as well, also a postdoc in my group. So we have a number of people here who can help you with the exercises later, and possibly also try to fill in in the chat while I'm giving the presentation, because this is quite a bit of multitasking. Let me just fire up the slide deck here. Okay, I seem to have a somewhat unstable internet connection, which is scary. I'm seeing OBS disconnecting and reconnecting to YouTube Live. Nothing I can do about that, unfortunately. All right, I guess it's 8:15 and we should get started. So first of all, welcome everyone to this day on network biology. The overall plan for the day is that there will be lectures, a few exercises, and some software demos in the morning; in the afternoon we will really be diving into the Cytoscape exercises, which are much more hands-on and interactive. So the morning session, I'll be streaming all of that not just to Zoom but also to YouTube as a live stream. I hope to make everything available to people also afterwards, since there are people in other time zones who are interested. Obviously, I doubt people from the US have dialed in at this hour. So the topic of today is network biology, and really, why do I care about network biology? What is network biology? Very quick introduction here. I'm a group leader at the Novo Nordisk Foundation Center for Protein Research. That's a center at the University of Copenhagen with funding from the Novo Nordisk Foundation, and for that reason we're very interested in proteins from all kinds of angles, including looking at protein interaction networks. I should also mention that I'm a co-founder of a company called Intomics. I don't say that to plug the company.
I say that because some funding agencies think that this might be somehow a conflict of interest. So now you know it: I'm one of the founders, owners, and advisors of Intomics. Both in the academic setting and in industry, we're dealing with a lot of omics data. And generally what happens is that when people do an omics study, what comes out of it is a lot of molecular players. What you typically want to understand afterwards is how these molecular players interact with each other, to understand their interplay. And that's really the core of network biology to me: using biological networks to understand how things work together inside the cell. For that, networks are a really useful abstraction, and it's an abstraction that really lends itself to visualization. Now, I'll start out today with a bit of core concepts just to make sure everyone is on the same page. Those are the core concepts of working with networks: some terminology and a little bit of background that I want to get out of the way before we dive into the more biologically relevant parts. So when you're talking about networks, there are two things you need to know. One is: what are nodes? The nodes in the network, also sometimes called vertices, are the things that are to be connected to each other in the network. That could be proteins, of course, when we're interested in protein interaction networks. It could be diseases, if you want to know how genes and diseases work together or which diseases show comorbidity. Then you have the other half of the networks, so to speak, which are the edges. The edges are what connect the nodes. So those are the connections between the things in your network, be they proteins or diseases or something else. These edges can be of several types. There are undirected edges; an undirected edge means that there's no difference between having the edge AB and the edge BA.
So for example, A and B binding to each other, that's an undirected edge. The opposite of course are directed edges, where the direction matters: AB is not the same as BA. That's what you would often have in things like signal transduction pathways, where obviously A doing something to B, like A activating B, is not the same as B activating A. Something we will work a lot with today is so-called weighted graphs or weighted networks. That means that not all edges are considered equal. By that I mean that instead of there just being an edge or not being an edge, we have a probability or weight attached to the edge that somehow quantifies how sure we are that there actually is a link between these two things. When people talk about networks, you'll find that a lot of the literature is talking about things like network topology, sort of the structure of the networks: things like robustness of networks, where the networks fall apart, how you can use the structure of the network to infer which nodes are the most important nodes, etc. And there are a lot of terms there. People talk a lot about node degree, and the degree is simply how many connections a node has. So if something interacts with five other things, it has a degree of five. You have measures like centrality; there are a number of different centrality measures. I'm not going to bore you with all of them, but a centrality measure again somehow quantifies either how connected a node is or how important the node is to keeping the whole network together. That can be degree centrality; you can also have closeness centrality, many different measures. Another thing is the clustering coefficient, which says: if you're looking at a certain node and you look at its interaction partners, how connected are those to each other? So that gives you an idea of whether this node is part of a closely connected part of the network, which is relevant later, as the name implies, when you're doing clustering.
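To make the terminology concrete, here is a minimal sketch in plain Python of the degree and clustering coefficient of a node; the tiny toy network is invented purely for illustration.

```python
# Toy undirected network as an adjacency dict (node names are invented).
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def degree(graph, node):
    """Number of edges attached to a node."""
    return len(graph[node])

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return links / (k * (k - 1) / 2)

print(degree(network, "A"))                  # 3
print(clustering_coefficient(network, "A"))  # B-C connected, B-D and C-D not: 1/3
```

So node A interacts with five things in a real network if it has degree five; here it has degree three, and only one of the three possible links among its neighbours exists, giving a clustering coefficient of one third.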
So we will be using clustering, but we won't really be talking about clustering coefficients today. We will talk a lot about robustness, which links to the whole centrality idea: the idea being that if you remove certain nodes from the network, does the network fall apart? I think one important thing to think about here is that whether it makes any sense to talk about robustness depends very much on which kind of network you're dealing with. If you're dealing with a physical protein-protein interaction network, it's not clear at all what it means to the cell that things are connected, that there is a path connecting A to B. It is not clear that it is in any way relevant to the cell if the network falls apart. So talking about robustness of such networks is not a terribly meaningful thing to do, in my opinion. However, for some other networks, like a signal transduction pathway, it obviously makes sense: if you break the signal transduction chain, then you've broken the signaling. What we'll focus on today is mostly protein networks. And when it comes to protein networks, what people mostly talk about are physical interactions. We do have those in our networks, as I'll get to, but we are not limiting ourselves to physical interactions consisting of A binding to B, forming a complex. Instead, we're also looking at functional associations more broadly, capturing which things work together. So we're trying to link proteins to other proteins if they somehow work together. Now, the name of the game here is obviously guilt by association. How can you find out whether two things work together? For that, I find it illustrative to look at my favorite guilt-by-association network ever. And here, as you can clearly read on this slide, the nodes are not proteins, they're people. This is a network that is based on emails. It's an email network where we're linking people to other people based on who's sending email to whom.
And what makes this particular network funny, to me at least, is that it's not just some random email network. It's an email network that was built based on the emails sent during the last couple of weeks in the company Enron, before the company went bankrupt in enormous scandals a number of years ago. And you can read many fun things out of it. If you look at the lower left side of the network, you'll see a bunch of people who are not mentioned. Some of those people were people outside the company who magically managed to sell their stocks in time. And I believe I have read that some of those people actually went to jail for insider trading. Just above them, you see the whole board of directors, who, compared to many other people, were actually sending remarkably few emails while the entire company was collapsing around them. And if you followed the news back when Enron went bankrupt, you will know that those people were in fact more busy playing golf than they were managing the company. So they were out on the golf course, improving their golf handicap, while at the same time the company was crashing around them. But enough about Enron, enough about email networks; let's get to the core of this, the STRING database. So the STRING database is a big database that I and my group are heavily involved in developing, together with the group of Christian von Mering at the University of Zurich and Peer Bork at EMBL in Heidelberg. The STRING database is a protein interaction database, a database of functional associations. And at the heart of it, you can go to the website string-db.org, and there you can look up any protein of interest and find interactions for that protein. That can of course give you a hint of what a protein might be doing.
So if you look up some protein where you have no idea what it's doing, you can see which other proteins it's likely to work together with and, based on that, try to make some qualified guesses as to what your protein might be doing, and then hopefully go out and validate them in the lab. What you can also use it for is to come not with one protein of interest, but with a whole long list of proteins from an omics study, query for a network, and that way find out how the proteins that showed up as significantly regulated in your study behave in terms of being functionally associated. Now, the starting point for making the STRING database is a collection of some 5,090 genomes in the latest version. These encode a total of 24.6 million proteins, and the goal is of course to link all of those to each other with functional associations. STRING is a heavily used resource, and this is sort of the bragging slide from Google Analytics. We're looking at more than 30,000 users in a typical week, and as you can see, it's growing over time. The other thing you can definitely see from this slide is that, if nothing else, STRING is very good at detecting when it's Christmas. So at least scientists do tend to take a break over Christmas. We'd like to think, of course, that the reason why people use STRING so much is that STRING works well. The problem with making that statement is that when it comes to bioinformatics tools, and I guess many other aspects of science, everything works well according to the authors. It's so easy to skew your benchmarks, intentionally or unintentionally, and make yourself look good. So why would you trust my benchmarks of STRING? You shouldn't. Thankfully, you don't have to, because there are independent benchmarks out there. That's one of the advantages of being a heavily used resource: other people actually compare things to your resource.
So this is a graph from a fairly recent paper from the group of Trey Ideker, one of the main groups behind the whole Cytoscape tool that we'll look at later. They were working on building a protein interaction resource, putting together lots of different existing resources, including STRING. And as part of that, they benchmarked how good these different networks were for identifying disease genes, both looking at genome-wide association studies and looking at literature-based gene sets. And according to both benchmarks, STRING came out as the best-performing network of the many, many networks in this graph. Part of that, of course, is that STRING is great, at least I think so. Another big part of it is that many of these databases are very limited compared to STRING, because they're focused on physical interaction networks. And if what you're interested in is finding disease genes, limiting yourself to only using physical protein interactions is really going to harm you. So if you believe me and Trey Ideker that the STRING network seems to work really well, the next obvious question is: how did we pull this off? How did we manage to make a network that is this good? And the name of the game here is really data integration. If you want a good network that performs well in these kinds of benchmarks, you need as good coverage as possible of proteins, and you need as good coverage as possible in terms of which ones work together. The first kind of data we integrate is what is called genomic context. Genomic context is a whole class of methods that can be used to infer functional associations based on just having a set of genomes. The easiest of these to understand is the gene fusion method. The idea here is: imagine you have two genes; look at the top row, the red and the yellow gene. As indicated by the broken line, these are different genes sitting in different places in the genome.
If it's a eukaryotic genome with multiple chromosomes, they could even sit on different chromosomes. However, if you go look at the orthologs of these genes, if you do sequence similarity searches and try to identify the likely orthologs in other organisms, you find that in some organisms, the second and the fourth, these two protein-coding genes have been fused into a single large protein-coding gene that encodes a fusion protein. Now, if you think about this for a second from the point of view of the cell, would it make any sense to take two proteins that have nothing to do with each other whatsoever and covalently link them together, making them one big protein? The answer to that is, of course, no, it wouldn't. Why would you link two unrelated proteins? That obviously means that since some organisms actually took these proteins and covalently linked them to make one big protein, that's a pretty strong hint that these proteins are doing something related, also in the organisms where they haven't been fused. So that way, by looking at evolution, looking at how these genes are organized, in this case fused, in other genomes, we can make inferences about functional associations in our genome of interest. Another example is gene neighborhood, so looking at which genes sit next to each other. The simplest case of this, of course, is operons. If you're looking at bacterial genomes in particular, you have operons where several genes sit together as a cluster, which is transcribed as a polycistronic transcript that encodes multiple different proteins. And generally, those genes that are transcribed together in a single operon are of course functionally associated, because they're always expressed at the same time and thereby needed at the same time. So typically, they encode, for example, different enzymes involved in the same metabolic pathway.
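The operon idea above, that genes transcribed together sit directly next to each other on the same strand, can be sketched as a toy check across a few genomes. The gene orders here are invented; real inference relies on conservation across thousands of genomes, not three.

```python
# Each genome is a list of (gene, strand) in chromosomal order (invented data).
genomes = {
    "genome1": [("geneA", "+"), ("geneB", "+"), ("geneX", "-")],
    "genome2": [("geneY", "-"), ("geneA", "+"), ("geneB", "+")],
    "genome3": [("geneB", "-"), ("geneX", "+"), ("geneA", "+")],
}

def adjacent_same_strand(order, a, b):
    """True if a and b are direct neighbours transcribed in the same direction."""
    for (g1, s1), (g2, s2) in zip(order, order[1:]):
        if {g1, g2} == {a, b} and s1 == s2:
            return True
    return False

# Count in how many genomes the pair looks like an operon candidate.
conserved = sum(
    adjacent_same_strand(order, "geneA", "geneB") for order in genomes.values()
)
print(conserved)  # adjacent and co-directional in genome1 and genome2: 2
```

The point is that a single co-directional adjacency means little, so what you would actually score is how often this arrangement is conserved across the whole genome collection.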
Now, the problem is that if you're looking at just one genome, it's not that easy to infer operons, because every gene has to sit next to something. In fact, it has to sit next to something on both sides. And when two genes sit next to each other, they have a 50-50 chance of pointing in the same direction. So just having two genes sitting next to each other, being transcribed in the same direction, is not much of a hint that these two are likely to be in an operon together. However, again, if you use the power of evolution and you look across 5,000 different genomes, you can look at this and ask: is it evolutionarily conserved that these genes sit together in what looks like a somewhat conserved operon? The reason why this works is that on the time scale of about 100 million years, genes get shuffled around in genomes. If you have two genes sitting randomly next to each other, not being functionally associated, not being transcribed as a single operon, chances are that if you look in another genome that is 100 million years away, they're not sitting next to each other. For that reason, when you have a big span of organisms over a very long time scale, you can make inferences about functional associations. The last genomic context method we have in STRING is so-called phylogenetic profiles. This is the hardest one to understand, but it is also by far the most powerful of these methods. The idea here is that you're looking at presence/absence patterns of genes. So here you have three genes, the red, the yellow, and the green, in a toy example. On the left side, you have a species tree, so you see how these different species are related to each other. And you see that when you look at these genes, their presence/absence patterns in this toy example are identical. You also see that if you look at the top two genomes, one has all three genes, the other has none of them. So it doesn't follow the tree.
It's not like one of these cases where, say, Gammaproteobacteria have these genes and nobody else; you have close neighbors where one species has them and another doesn't, and in another part of the tree, one has them and its neighbor doesn't. To explain that kind of pattern, you need a lot of joint gain and loss events of these genes, which of course is exceptionally unlikely to happen by random chance. So what you do is you say: well, since it's unlikely to happen by random chance, it presumably didn't happen by random chance. It happened for a reason. And the reason is that these genes are somehow involved in carrying out a common function. If you have all of these genes, you're able to do that function, whatever it is. If you were to lose one of these genes, you're no longer able to carry out that function, at which point you have no evolutionary pressure to retain the other genes, and you're therefore likely to just lose them pretty quickly. In the real world, the tree of course is not this small; you're looking at 5,000 genomes. That's what gives you the statistical power. On the other hand, the pattern matching is of course not this perfect; it's not a perfect match in terms of presence and absence. But this is the idea: if genes are systematically present in the same subset of genomes, and those genomes are not closely related to each other, then you likely have some set of genes that are needed together for some function that those organisms are able to do. Of course, you only get so far by looking at genomes alone. So if you want a good network, in particular a good network for looking at something like higher eukaryotes, in particular human, you need to pull in other experimental data beyond just the genomes. That could be things like gene co-expression. I'm not going to spend much time talking about that, but gene co-expression, you could argue, works a little bit like the phylogenetic profiles.
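As a minimal illustration of the phylogenetic-profile idea: represent each gene as a presence/absence vector across genomes and compare vectors. A simple Jaccard similarity is used here for illustration; the profiles are invented, and the actual STRING scoring is considerably more sophisticated, since it must correct for how related the genomes are.

```python
# Presence (1) / absence (0) of each gene across eight genomes (invented).
profiles = {
    "red":    [1, 0, 1, 1, 0, 0, 1, 0],
    "yellow": [1, 0, 1, 1, 0, 0, 1, 0],
    "green":  [1, 0, 1, 0, 0, 0, 1, 0],
    "blue":   [0, 1, 0, 0, 1, 1, 0, 1],
}

def jaccard(p, q):
    """Of the genomes that have either gene, what fraction have both?"""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

print(jaccard(profiles["red"], profiles["yellow"]))  # identical profiles: 1.0
print(jaccard(profiles["red"], profiles["blue"]))    # disjoint profiles: 0.0
```

Red and yellow co-occur perfectly, so they are candidates for a shared function; blue shows the complementary pattern and is not linked to them.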
In the phylogenetic profiles, we're looking at presence and absence across different genomes. In gene expression, we can look at presence and absence of genes across many, many different conditions. So based on RNA-seq experiments, or older microarray experiments, we can look at which genes go up and down together across thousands or tens of thousands of different experimental conditions. And if genes systematically go up and down together like that, it's a pretty strong hint that they might be doing something together, because they seem to be needed under the same conditions. Another thing that we focus a lot on is physical protein interactions. As I mentioned earlier, lots of networks are based purely on these; for us, it's just one of many evidence channels. The physical interactions can come from a wide variety of different screening technologies. I'm not going to go through all of them; there are many out there, but that's not really the topic for a bioinformatics course. I'll just illustrate one of them, because we need an example for later in the presentation. And this is the example of a tandem affinity purification followed by mass spectrometry experiment. The idea here is pretty simple. Imagine you have a protein of interest. You put a tag on it, which effectively is a molecular handle. You're now able to do an experiment in which you grab that handle and pull down that protein. And when you pull down that protein, it comes down together with whatever else is stuck to it. Once you have that, you can then use mass spectrometry to identify what's in the pull-down. And you can run around and put handles on lots of different proteins, do lots of pull-downs, and then see which things systematically come down together, and based on that, try to infer which proteins are likely in a complex together. I'll get back to that. Lastly, we integrate what we call curated knowledge.
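Before turning to curated knowledge, a quick sketch of the co-expression signal described above: genes that go up and down together across conditions can be scored with something like a Pearson correlation coefficient. The expression values here are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene1 = [2.0, 4.0, 3.0, 8.0, 1.0]   # expression across five conditions
gene2 = [2.1, 4.2, 2.8, 7.9, 1.2]   # goes up and down with gene1
gene3 = [5.0, 1.0, 4.0, 0.5, 6.0]   # roughly the opposite pattern

print(pearson(gene1, gene2))  # close to +1: a hint of functional association
print(pearson(gene1, gene3))  # strongly negative
```

In real data you would of course compute this over thousands of conditions, not five, and a high correlation is only a hint, not proof, that two genes work together.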
Curated knowledge is different from the experimental data in the sense that this is really more of the textbook knowledge. It's not that somebody did an experiment and deposited it in a database showing that these proteins supposedly interact with each other; we're talking about established knowledge. We know that these things exist. This includes things like protein complexes: we know there is something called, say, the cohesin complex, and we know what the subunits are. This is well established; it's not based on one experiment. It could also be things like pathways. There are lots of different pathway databases; they were also in Trey Ideker's graph: things like Reactome, KEGG, and so on. These pathways, of course, tell you how different enzymes work together in different pathways, making reactions with different metabolites. You have other things like signal transduction pathways, knowing which kinases regulate other kinases, and so on and so forth. And those of us older than, I think, most of the participants here had to learn a lot of these metabolic charts by heart for a biochemistry exam and promptly forget them again afterwards. That was already pointless back then. I hope we are not doing that anymore, because all of this is, of course, available in computer-readable databases. There is no reason to know it by heart. So we take all of that and we put it together, and you have the STRING database. Except that it's a little bit harder than that. There are a few problems. Firstly, there are many databases. We don't get 5,000 genomes from one place; when we integrate 5,000 genomes, we have to collect those genomes from multiple different places. There are multiple different repositories for physical protein interactions. There are dozens of pathway databases. If you want a good database with as good coverage as possible, you need to integrate many, many databases. These databases tend to come in different file formats.
Of course, people try to standardize things, but there are still many different formats to deal with. And even if people use the same format, they likely don't use the same identifiers within that format. One database is going to use UniProt identifiers for the proteins. Another database is going to use NCBI identifiers. Another database is going to use yet something else. So we need to deal with that. Then the data are of what I very politely refer to as varying quality, which is the nice way of saying that some of the data are really bad. If you just treat everything as being equal, you're not going to get a good network; you're just going to get a network where you get flooded by false positives. And then, the data are not comparable. You have the problem of how to compare a pathway to a physical protein interaction screen to co-expression data to inferred operons to phylogenetic profiles. These are just fundamentally different things, so how can you even compare them? And last but not least, all the data are not in the same species. There's a reason why we have something called model organisms. We do experiments on those model organisms to learn something, primarily about human. So if you're interested in some human proteins, you don't want to just look at the human data. You want to know: what have we learned from mouse? What have we learned from rat? What's available on the yeast orthologs, the Drosophila orthologs? You need to somehow integrate data from all these different species and put it into one big network. Some of this is just hard work; there's not a whole lot to say about it. You have a lot of databases; somebody has to download them. They are in different formats; somebody has to write a lot of parsers. And of course, when databases decide to change formats, somebody's going to have to update their parser. Then they use different identifiers, so somebody has to make mapping files.
We need files that tell us which UniProt identifiers correspond to which Ensembl identifiers correspond to which NCBI identifiers. We need mapping files for that, and again, making those is just hard work. Where things get a bit more interesting is when it comes to dealing with quality. There we build what we call raw quality scores, and that gets me back to the physical interaction screens. The idea here is that you develop a raw quality score for each type of data individually that allows you to take that kind of data and rank the interactions coming from it by which ones are most likely to be correct, so you get a sorting of your interactions based on how likely they are to be right. And if we look at tandem affinity purification followed by mass spec, how could you rank them? Well, let's imagine the evidence landscape here. We're looking at the interaction between the blue and the green protein, and we've done a number of pull-downs. In one pull-down, we tagged the blue protein and got the green and a couple of others. In the second, we tagged another protein and got both the blue and the green protein in the pull-down. We tagged a third protein and got the blue in the pull-down but not the green. And we tagged the green protein and got a couple of proteins, but the blue was not among them. The real problem here is: how do you turn this into a number? That's not at all clear, but I hope it's clear that the first two pull-downs are positive evidence and the last two pull-downs are negative evidence. The more often we see these proteins together in pull-downs, the more we will tend to believe that they interact. The less often we see them together, the more often we see one but not the other, the less we're going to believe that they're in a complex together. So if you think about it from a statistical standpoint, what you have from a tandem affinity purification experiment is basically a 2x2 contingency table.
What we have in this 2x2 contingency table is: we know how many pull-downs we did in total, we know how many contained the blue protein, we know how many contained the green protein, and we know how many contained both. When you have a 2x2 contingency table, you can of course do a lot of different things. The first thing that should come to mind if you do statistics is Fisher's exact test. You could calculate a p-value: are the blue and the green proteins together in pull-downs more often than you would expect by random chance? That was my first idea. I tried that; it turned out to not be a very good scoring scheme. You could do other things. You could calculate an observed-over-expected ratio: we know which fraction of pull-downs contain the blue protein, and we know which fraction contains the green, which means we can calculate which fraction we would expect by random chance to contain both, and we can compare that to the actual fraction that contains both. That turns out to be a better scoring scheme. But the point here is that you have to come up with a lot of different scoring schemes. You can do even better than what I mentioned here, but you have to come up with some scoring scheme that allows you to rank the interactions from best to worst, and you have to do that separately for each type of data. For co-expression data, it might be a Pearson correlation coefficient; for the other types of data, it's going to be yet something else.
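The observed-over-expected idea just described can be sketched in a few lines; the pull-down counts below are invented for illustration.

```python
def observed_over_expected(n_total, n_blue, n_green, n_both):
    """How much more often do two proteins co-occur in pull-downs than
    expected if they appeared in pull-downs independently of each other?"""
    observed = n_both / n_total
    expected = (n_blue / n_total) * (n_green / n_total)
    return observed / expected

# 1000 pull-downs; blue appears in 20, green in 25, both together in 15.
score = observed_over_expected(1000, 20, 25, 15)
print(score)  # observed 0.015 vs expected 0.0005: a ratio of about 30
```

A ratio far above 1 means the two proteins come down together far more often than chance would predict, so this ratio gives the ranking of interactions that the calibration step below then turns into probabilities.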
Now, once you have your raw quality scores for each type of data, or each big screen, you need to do score calibration. The idea here is that we compare everything to a common standard, in our case pathways from the KEGG database. What we do is ask: how well do these interactions, depending on their score, agree with gold-standard pathways that represent sort of what we know about which proteins are in a pathway together? For doing this calibration, we first ignore all the proteins that we cannot map to pathways, so we look only at the subset of proteins that actually have been assigned to pathways by manual annotation. We can then go in and look at, say, all the interactions that score between 1 and 1.1, and count: for how many of those are the two proteins in the same pathway, and for how many are they in different pathways? We might figure out that something like 14% of them are in the same pathway and the rest are in different pathways, which tells you that a score between 1 and 1.1 is pretty bad. You could do the same for the ones scoring between 2 and 2.1, and you would see that in that case something like 70-80% of them fall in the same pathway, which tells you that a score above 2 is a pretty good score. You do that for lots of different score bins, you get a cloud of dots like I'm showing here, and you fit some simple mathematical function, typically a sigmoid function, through it, and you now have your calibration curve. The trick is that we can now go back to all the proteins that we might not know what they are doing, the ones that don't fall in any known pathways, and we can take a given interaction and calculate a raw quality score. And now that we have a raw quality score, we can go in and say: the raw quality score was 1.7, so what does that mean?
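The calibration procedure just described, binning benchmark interactions by raw score and then fitting a sigmoid through the per-bin fractions, might be sketched like this. The benchmark data and the curve parameters are invented, and the actual curve-fitting step is omitted.

```python
from collections import defaultdict
from math import exp

# (raw_score, shares_a_pathway) for benchmark protein pairs (invented data).
benchmark = [(1.05, False), (1.08, True), (1.02, False), (1.07, False),
             (2.01, True), (2.05, True), (2.08, True), (2.03, False)]

# Bin by raw score and count the fraction of same-pathway pairs per bin.
bins = defaultdict(list)
for score, same_pathway in benchmark:
    bins[int(score * 10) / 10].append(same_pathway)

for b in sorted(bins):
    flags = bins[b]
    print(f"scores near {b}: {sum(flags)}/{len(flags)} in the same pathway")

def calibrated_probability(raw_score, midpoint=1.7, steepness=4.0):
    """Sigmoid mapping a raw score to P(same pathway); the two parameters
    would normally come from fitting the binned fractions above."""
    return 1.0 / (1.0 + exp(-steepness * (raw_score - midpoint)))

print(calibrated_probability(1.7))  # at the fitted midpoint: 0.5
```

The key property is that whatever the raw score was (a ratio, a correlation, a count), the output is now a probability between 0 and 1, which is what makes the different evidence channels comparable.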
Well, 1.7, if you look at this curve, means about a 50-50 chance of these two proteins being in the same pathway. We can do that for all the different types of data, and the real trick is that, even though you of course have different calibration curves for different types of data and the scores on the X axis are completely different things, we've now managed to map everything into the same score space: we've turned everything into posterior probabilities of two proteins interacting or being associated. The next thing we need to do is deal with the species problem, the evidence being spread across model organisms in particular, and for that we have to transfer evidence by orthology. Orthology has to do with the evolution of genes: homologs are genes that share common ancestry, and orthologs are the subset of homologs separated by a speciation event, as opposed to a gene duplication event. So the orthologs are the ones that, in the last common ancestor of two species, we believe were a single gene back then. This is a very complicated scheme that I'm not going to try to explain in detail, but the idea is that you do it as a two-step process. First we build up what we call orthologous groups at different levels of evolution: we have groups that collect the orthologs within mammals, broader groups covering everything in vertebrates, even broader groups covering everything in metazoans or in eukaryotes, or even going all the way back to the last universal common ancestor, normally called LUCA. We can then map all of our evidence back to all those levels and link the orthologous groups, at the eukaryotic level, at the mammalian level, at the vertebrate level, based on all the evidence from the organisms within them. Then, in step two, we transfer evidence from those groups back to individual proteins through a very, very complicated scoring scheme. The idea is that when we are looking at two human proteins of interest, we take the evidence that we have for those directly, of course, but then we take the additional evidence from the other mammals via the mammalian orthologous groups, the evidence from vertebrates that are not mammals via the vertebrate groups, and the evidence from eukaryotes that are not vertebrates via the eukaryotic groups. So we can go back in different steps and transfer more and more evidence in. The reason why the scoring scheme is so incredibly complicated is really that we need to deal with gene duplication events, with different genes evolving at different speeds, and with the fact that transferring from mouse to human is inherently going to be a lot more reliable than transferring from a distantly related species. So that's what we do in STRING, and the bad news is that everything I've talked about so far adds up to maybe 10-20% of the evidence in STRING. We're still missing about 80-90% of the evidence, which is what I'm going to be talking about in the next presentation. So we'll have a short 5-minute break, and then we'll get back and talk about text mining. You're welcome to ask questions during the break, though, and now I can see the chat. Yeah, thanks, I'm now reading up on the questions and the answers. So indeed, the genomes mostly represent different species; in a few cases we do have different strains of the same species, but that's rare. The main goal of selecting the genomes is that we do not want as many as possible, because that just makes the problem blow up computationally and makes the database unnecessarily big. We want a set of genomes that keeps things at a reasonable level and has as good a spread phylogenetically as possible, so that we have good coverage of the whole species space of what has been sequenced, and we also want to make sure that the genomes we include are of at least a reasonable quality. So we first collect a very, very large number of genomes, and we then do a lot of assessments of genome
quality: the assembly quality, that the genes are not fragmented, that there's actually good coverage of the clone libraries, so the genes that should be in the genomes are there. Then, when we have the subset that passes the quality filter, we start looking at which ones are too closely related, and again, looking back at the quality, when two are too close we keep the better one. We of course have a list of the main model organisms, knowing that these particular strains of these organisms must be there, because those are the reference genomes in the field. That's what goes into the genome selection. As for the whole point of these networks: there are many ways you can use them. You can use them to guide experiments. One way people use them is that you have some gene where you don't know what it's doing, you try to use something like STRING to get an idea of what it might be doing, and then you go in and do experiments. Another way people use them for single-gene studies is that, say, they know some genes that are involved in a function, but based on experiments they know that that set of genes is not sufficient to reconstitute the function, which means they know they're missing a gene, or at least one gene. They then go into STRING with the genes they already know are important for the function they're interested in and try to find additional genes, possibly genes of unknown function, that are functionally associated with the ones they know are important, and that way try to find the missing gene in their pathway, and then go show that experimentally. So that's all about guiding small-scale experiments. Then you have the people coming from the omics side, which is more what we do, where you have a long list of genes. The way you use that long list is that you want to somehow structure it, because looking at a long list and finding a biological story is difficult. So you want to put all of those genes into STRING, or rather into Cytoscape using STRING, and that way make visualizations, using the network for data visualization of your omics data to help you identify the interesting parts of your results. Alright, let's get back to the slides. So, as I mentioned, we're missing most of the evidence, and where most of that missing evidence is hiding is the text mining, because this is where the evidence sits: the biomedical literature. Here is a back-of-the-envelope estimate of its size. If we naively assume that everything is indexed in PubMed and that the average article is 5 pages long, then if I were to print it all out on standard 80-gram A4 paper and pile it up, I'd get a pile that is over 10 kilometers tall. Now, of course, this is an incredibly naive estimate: the average paper is longer than 5 pages, and not everything is indexed in PubMed, so for sure the real pile is more than 20 kilometers. But it doesn't matter whether it's more than 10 kilometers or more than 20 kilometers; in either case, the reality is that there is simply too much to read.
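The back-of-the-envelope arithmetic behind the pile is short enough to write out; every number here is a rough assumption, as in the talk.

```python
# Height of the biomedical literature, printed out and stacked.
# All numbers are deliberately naive assumptions from the talk.
articles_in_pubmed = 30_000_000   # order-of-magnitude size of PubMed
pages_per_article = 5             # naive average from the talk
sheet_thickness_mm = 0.1          # roughly 0.1 mm per sheet of 80 g/m2 A4

total_pages = articles_in_pubmed * pages_per_article
pile_height_km = total_pages * sheet_thickness_mm / 1_000_000

# total_pages = 150 million, so the pile comes out around 15 km:
# comfortably "over 10 kilometers tall".
```

Whether the true figure is 15 km or 25 km, the conclusion is the same: no human can read it all.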
We cannot read the literature, which sounds ridiculous, right? We're spending all this time writing papers, they're clearly written in a way that is intended for humans to consume, and we're now faced with the problem that we have too much of it and can't read it. Which means that, whether we like it or not, we have to get a computer to read it. That's the reality we live in: we need to somehow get a computer to read the literature, because we can't possibly keep up with all of it ourselves. And whenever I need to get a computer to do something that even halfway approaches being smart, I get worried, and I find it useful to think of the analogy that a computer is about as smart as a dog, by which I mean that if I put sufficient effort into it, I can teach it to do specific tricks. So, borrowing a cartoon from Gary Larson, what we say to dogs is: "Okay, Ginger, I've had it! You stay out of the garbage! Understand, Ginger? Stay out of the garbage, or else!" And the only thing the dog hears is "blah blah blah Ginger blah blah blah Ginger blah blah blah"; it understood its own name. When I do text mining, most of the time this is our level of ambition: we're trying to just get the computer to recognize names, and most of the text in between is going to be complete blah blah blah to the computer. We're working on getting beyond that, but this is the core of the text mining in STRING at the moment. This is what people in the text mining community call named entity recognition, and text mining is a scary field to get into because people like to use overly complicated terms for very simple things. Named entity recognition sounded fancy to me until I realized that it literally means recognizing things with names. Unsurprisingly, when you want to recognize things with names, you need a good dictionary of the names you want to recognize. That means we need to know about synonyms: we need to know that the protein called cyclin-dependent kinase 1 and cdc2 are the same thing. We need to handle what people call orthographic variation, which is the fancy way of saying that people can write the same thing in slightly different ways. For example, your dictionary, UniProt or wherever you get the names from, is going to tell you that there is a protein called cyclin-dependent kinase 1, and wouldn't it be nice if you were still able to recognize it in the literature when people write it with a hyphen? If you're doing plain, simple string matching, those two strings are not the same: when you have it without a hyphen in your dictionary and it's written with a hyphen in the text, you won't find it. So you need some flexibility in the matching. You also need to know some rules about how people mangle names. You have cdc2, but you have cdc2 in human, cdc2 in mouse, cdc2 in rat, and quite likely you're studying all three of them in a single paper, so it gets very confusing. So you put an h in front of it to say it's human cdc2, and you do that with all human gene symbols: an m in front of mouse gene symbols, an r in front of rat gene symbols, and so on. But again, dictionaries like UniProt are not going to tell you that, so you have to teach the computer these little tricks of how biology is written. It's not difficult, but it's important. Then you need a blacklist, and a blacklist is a list of names that are not wrong per se in your dictionary, but that are a really bad idea from a text mining perspective, because while they can mean what your dictionary says, most of the time they mean something entirely different. My favorite example of that is the human gene nomenclature committee, which in their infinite wisdom decided it would be a good idea for the recommended symbol of a certain gene, the one you're supposed to use in the literature, to be SDS. Now, anyone who has done wet-lab molecular biology of any kind will know that SDS is a detergent that you use in the lab
quite often for several things, one of which is to denature proteins. So of course, if you text mine the literature and you think that whenever people write SDS they mean the SDS gene, you are going to be very, very wrong a lot of the time. And this is going to be particularly disastrous because, since you use SDS on proteins, pretty much every protein that has ever been studied will have been mentioned together with SDS in the literature, for which reason it would be a complete disaster when we go to the next step and try to do relation extraction using co-mentioning. The idea in co-mentioning is very simple: if people mention two things together in the literature, they're probably related, and that is of course useful to us when we're trying to build a functional association network. Now you could argue that two things can very easily be mentioned together by random chance and it wouldn't mean a thing, but that's where counting comes in. The idea is simply that if people keep mentioning the two things together, many times, it's not by random chance; it's because these things are actually somehow related. The question is of course how you count. At which level should we count? Should we count how many documents mention A and B together? Should we count only when it's within the same paragraph? Only when it's within the same sentence? The answer is that doing any one of these three alone is wrong: things being mentioned together in the same sentence is obviously the strongest evidence, but it's still a good hint when things are mentioned together in the same paragraph, and it's still a hint when things are mentioned in the same paper, even if not in the same paragraph. So you need to count at all the different levels and somehow combine that into a weighted count, where the count is not actually an integer but a weighted count taking into account how much, and how close to each other, these entities are mentioned together. The way the formula looks is like this: we have a weight for being mentioned in the same document, a weight for being in the same paragraph, and a weight for being in the same sentence, and for entities i and j we calculate this weighted sum, where the sum runs over all documents k, asking whether i and j are mentioned in the same document, the same paragraph, or the same sentence in document k. Once we have that, we convert it into our raw quality score, where we basically normalize, very much like for the physical protein interactions: we need to take into account how much is written about i and how much is written about j. Once we have that quality score, we do score calibration against KEGG pathways exactly the same way as before, and that way we get our interactions from text mining scored in a way that is consistent with all the other evidence. We then take all of that text-mined evidence and, like all the other evidence, transfer it by orthology, and now we're done and we have STRING. Now, the neat thing about text mining is that we can use it for many things other than protein-protein interactions, and that gets us to the next topic, what people call knowledge graphs. In knowledge graphs you're trying to build not just protein networks; you're trying to link lots of other things that are relevant when you're looking at proteins into a protein knowledge graph as well. So far we've looked at intra-species networks, i.e. STRING. There are other networks, of course, where you look at proteins across different species, so inter-species networks; I just want to highlight here the work of a former PhD student of mine, Helen Cook, who made a version of STRING for viruses. There you have virus proteins associated with host proteins, using exactly the same kinds of strategies I've talked about for STRING. You can have protein-chemical
interactions, linking proteins to small-molecule compounds, which could be interesting if you care about drug targets and metabolism; for that we have a sister resource of STRING called STITCH, which is basically STRING with small-molecule compounds bolted onto it. You can do protein-compartment associations, so the nodes in your network could be subcellular localizations as opposed to protein entities, and that way you can link things and capture the subcellular localization information about the proteins. You can draw it as a network, or you can draw it as a pretty figure where you color a cell based on where a protein looks like it's located; in either case, behind the scenes it is a network. Same thing for protein-tissue associations: you can look at tissue expression and map it onto figures like this, or you can think of it as a network where you have a node for the liver and a node for the heart; in terms of the data model, as opposed to how you visualize it, behind the scenes it's a network. Protein-disease associations: again, we have a database called DISEASES where you can go in and look up, for a certain disease, the genes, or for a certain gene, the diseases. Again, it's a network, now with two kinds of nodes, proteins and diseases. And the nice thing here is that it's the same strategy we use for making all of these resources. We take curated knowledge from whatever reliable manually curated databases exist. We take experimental data; it could be tissue expression data, subcellular localization data, physical protein interactions, chemical screens, it doesn't matter what it is. We run text mining looking at co-occurrence evidence, so things like: if a protein is mentioned with a disease, that protein is likely involved in that disease; very simple logic. We map everything to common identifiers, so we use the STRING identifiers for proteins in all these different resources, but similarly we have standardized identifiers that we use for diseases, namely the Disease Ontology, for tissues the BRENDA Tissue Ontology, for subcellular localizations the Gene Ontology, and so on and so forth. And then we again develop scoring schemes. We don't consider everything equal; we score things using the same strategies and benchmark everything to get calibrated scores out, and that's how we get one big knowledge graph. Even though you can view it as a number of different resources online, behind the scenes it really is one big network that ties everything together. So that's the end of this first part. We now have, let me check the time, a bit more than 10 minutes for general questions and answers, and after that a 10-minute break. Have we experimented with more sophisticated text parsing, attempts to understand the nature of the text and the association of the terms mentioned? Good question. In STRING we have a very old natural language processing pipeline, which I didn't talk about here, that is a rule-based system using grammars to pull out associations, things like "A activates B" and so on. We do have that in there, but we're not really developing it anymore. Katarina, who at least earlier was in the chat and might still be here, has as her main project applying more modern methods to the text, so she's looking into all the technologies with deep learning, using BERT, transformers and so on; she can tell you much more about that in the chat if you're interested. And how do we calibrate the scores? In STRING, as I mentioned, we calibrate them by benchmarking everything against a common gold standard. In the other resources we do things in slightly different ways, but overall the idea is the same: you have some gold standard, like manually curated gene-disease associations, and based on that you figure out how to calibrate the different raw quality scores for different data types into a score that means the same across
evidence types, so that you know a score of X means you are so-and-so sure. Another question over on YouTube: is STRING an undirected network? STRING is mainly an undirected, weighted graph. However, in part from parsing the manually curated pathway databases and in part from the natural language processing pipeline, we do have some directed edges in there. But if you only consider the directed edges, you're going to be throwing away the vast majority of STRING, so if you're thinking about it from a graph analysis standpoint, it's probably best to treat it as an undirected network, because the vast majority of edges are undirected. On text mining: would there be different scores when two proteins are co-mentioned in the same paragraph versus in two different parts of a paper, and in the latter case does it still get a score? Yes, it will still get a score. It will be a very low score, but it will get one: things being mentioned together just in the same paper, even in different paragraphs, does carry some weight, but way less than being in the same paragraph. And actually, being in the same sentence is only a little bit better than being in the same paragraph, so I would say, if you didn't want to do this complicated weighting scheme, the best single level to choose is probably the same paragraph. Over on YouTube: what databases do you use for storing the STRING data? Technically we store all the data in a big Postgres database, so it's an SQL database; we're not using NoSQL databases for it. We did do some experiments with graph databases like Neo4j, but we ended up not really seeing any advantage, so we stayed with Postgres. How are the scores combined? Good point, as the evidence is collected from different sources, experiments, predictions, and, I could add, different organisms. In STRING, when everything has been calibrated and you have probabilistic scores, you effectively have posterior probabilities for each piece of evidence, and in most cases the evidence is then combined assuming independence, so you're combining the probabilities. I'm slightly exaggerating the simplicity here, the real formulas are in the papers, but the idea is: if you're 50% sure given one piece of evidence, and 50% sure given another piece of evidence, then there's only a 50% chance that the first one is wrong and only a 50% chance that the second one is wrong. For the edge to be wrong, both of them would have to be wrong, and there's only a 25% chance of that; in other words, combining two independent pieces of evidence that each made you 50% sure makes you 75% sure, since 1 - 0.5 x 0.5 = 0.75. It's slightly more complicated than that when you're combining evidence across species, for example, because you need to take into account how closely related the organisms are. If they're very distantly related organisms, you can sensibly assume independence, but of course if you're combining evidence from E.
coli with evidence from Salmonella, you would be a fool to assume independence. So in that case we use a max score rather than combining them: we either assume things are dependent and just take the better of the two scores, or we assume independence and combine them in the fashion I explained. There was also a question at some point about the contingency tables: what was the best-fitting scoring scheme? I think that was for the physical interactions. It has been changing over the years, and I won't spend time explaining it now, because right now Mark in my group is working on new scoring schemes for the physical interaction data, so whatever I told you would be outdated very soon. But the idea was basically an observed-over-expected ratio, combined with the absolute count having to be high as well; it's a compromise between needing a lot of observations to be sure and needing a good observed-over-expected ratio. Let me just see which paper it is that Katarina managed to dig up. Yeah, that's the right one, exactly. So the paper describing the NLP pipeline was a super productive collaboration with another group in Heidelberg. It was all done while I worked at the EMBL in Peer Bork's group, and I was working with Jasmin Šarić, a computational linguist, and I don't think I'll offend anyone if I say that his knowledge of proteins was about as great as my knowledge of linguistics. So we had very complementary skills, let's put it that way, but somehow it just worked amazingly well: I knew what we wanted to accomplish, he knew what one could actually do with NLP at the time, and in half a year, each of us working maybe a quarter of our time on it, we managed to put together a pipeline that was probably the best in the world at the time. It's still doing remarkably well; it's sort of embarrassingly outdated technology, but every time somebody has benchmarked it, they've told me that it is scary good considering how old it is. I'm not seeing more questions come up, so the plan now is that first we have a ten-minute break; I just need to grab some more water and have a short bio break here. Ah, there was a question about APIs: are there REST APIs for accessing STRING programmatically? There are REST APIs, yes. I don't think there's a Python package specifically, but you can access the REST APIs; it's all on the string-db.org website, and if you go to the help section, the API is documented there. That said, depending on what you want to do, you don't necessarily want to use the REST APIs. There are also bulk download files, and if you want to do something like a global analysis of the whole network of a certain organism, you are far better off downloading all the data locally rather than fetching it via the REST API; the API would just be much slower and unnecessarily complicated for that. Another question: can we use STRING to predict inter- or intra-species PPIs of an organism that is not in STRING? Yes and no. It's certainly not something you can do via the website; you can't just go plunk in a genome that isn't in STRING and get a network. But there's an orthology resource that STRING builds on, called eggNOG, and there's also a tool called eggNOG-mapper. If you download the orthologous groups from eggNOG, you can map all the genes from your genome either to a close relative in STRING or to the orthologous groups, then download the whole network from STRING and map the interactions over. So it can be done, but it's a lot of coding work; there's not an easy way to do it at the moment. It's
something I would like to see in the future, but it's certainly not a trivial task, and it would also be computationally quite expensive. That's really the tricky part: can we make it good enough and fast enough that we can offer it as a service people can just go use? Because we obviously don't want people crashing our server. Okay, so let's have a quick 10-minute break; then, starting at 9:30, we'll jump into some exercises. You should have the link in the course material, but let me just open it up here. The exercises we'll be doing in 10 minutes are these ones, and the idea is to have, what is it, 25 minutes or so for them, then a short 5-minute break. You're of course welcome to start working already if you want, I just can't promise that I'll be sitting at the computer; get as far as you get during that time. The point is to give you an idea of what's actually in STRING, how you work with STRING through the web interface, and where the data comes from, to understand the nature of this protein graph before we, at 10 o'clock, get back to talking about Cytoscape, stringApp, and how to work with these networks inside Cytoscape. You have a much better view of the underlying evidence in the web interface, so that's why we first look at the web interface and get an understanding of the data we're dealing with, and then start working with those networks on a much bigger scale in Cytoscape. And just for everyone on YouTube as well: if you want to play along with the Cytoscape exercises, it would be a good idea to get Cytoscape downloaded and installed. It's a bit of a big install, so depending on your internet speed you probably want to get started if you haven't already. And if you already have Cytoscape installed, please, please, please make sure you update to the latest version, 3.7.2, and make sure that you've updated the apps, including stringApp, to the latest versions. We usually run into some people who have an old version of Cytoscape installed, and then the exercises don't work because of the old version of Cytoscape, old version of stringApp, etc. So, especially now that we're teaching online and can't see your screens, please make sure you're on the latest version, to avoid a lot of questions. Thank you, and I'll be back in a bit less than 10 minutes. I'm back. Can people still hear me? Excellent. You hopefully see my screen now. The exercises we'll be doing are at this link; it should be on the course website as well. I've just opened up the web page here, and I hope you can see me sharing the screen. These first exercises are all web based. Let me just get myself out of the way; there, that's better. You can see we have a number of exercises. In the first one we go through the basic things with the STRING website: querying for a single protein, looking at how you can represent the networks in the interface, looking at the so-called evidence viewers, which are all about seeing the underlying data, seeing where a network, or rather a specific interaction, came from, and experimenting a bit with the query parameters to see how they affect the results you get. The key is to get an understanding of what's in STRING and how it all works, so do get through that one. Exercise 2 is the STITCH database, so if you're interested in how you can do similar things with small-molecule compounds, have a look at that; it's not something you need for the exercises later, so it's very much optional. Exercise 3 is the disease query: we will be doing a disease query from Cytoscape later on, so it would be great if you go through exercise 3 so that you have an understanding of what's in the DISEASES database before we use it from inside Cytoscape. Exercise 4 is all about STRING viruses, so if you're interested, and well, many people are at the moment, even though I'm sorry to say that a certain specific virus
that I'm not sure I'm allowed to mention on youtube is not in the database obviously we have not released a new version of the string viruses after the outbreak but you can look up other proteins from other viruses lots of echo no idea where the echo would come from is there still echo okay good I don't know there was a chat comment in youtube that there was lots of echo I have no idea where that would be coming from good to hear that there's no echo so yeah dig into the exercises just questions right there's a channel called database which can be a bit confusing personally I would vote for renaming that channel the database are the manually curated databases so they are really the curated knowledge but I would agree it's not the best name for it because all the data that goes into the experiments channel for example comes from the databases as comes from other databases and all the evidence in string overall string is a database so it's not the most accurate term I think it's called that for historical reasons back in the in the stone age the string database only had interactions inferred from the genomic context methods and interactions coming from the databases so it was called database since then we've added a lot more from databases not that descriptive anymore very good question so question about what do the scores mean for the database column well since the databases kind of the manually curated knowledge that we believe to be true how can you benchmark it it's the gold standard it's what we are benchmarking against effectively so for that reason the database channel is the only one that actually is not benchmarked everything that comes directly from one database gets a flat score of 0.9 we say we're 90% sure of stuff that is manually curated if things score higher than 0.9 in the database channel it's because we got the same manually curated information from several different places but I don't even think that happens I think higher than 0.9 it has to be 
from another evidence channel as well. The fact that an interaction is in more than one curated database wouldn't really make it more reliable; I would have to check that carefully, though.

I'm not doing the database channel myself. This is not meant as a disclaimer; it's more that we're a big consortium making this, different groups are responsible for different parts of STRING, and I just want to give credit that it is the group in Zurich doing all the hard work of integrating these way-too-many different pathway databases. So for really technical questions on how things are done in that channel, the people in Zurich would be more qualified to answer than me.

Is there a reference for how these scores are calculated? By reference, do you mean a publication? The problem is that there's not a single publication explaining how everything in STRING is done, which is sort of an artifact of how publishing works. We are making STRING, we are releasing new versions, and there's this thing called the Nucleic Acids Research database issue, in which we publish an update paper on STRING every two years. But since each is an update paper, each one describes what's new and what's changed, and for that reason this series of papers is effectively like patch files applied to source code: you almost have to go back to the original paper and then read all the subsequent papers to figure out which things have changed since then. There's no such thing as a "live" paper, so to speak, where you just have the current information in one place. It's quite bizarre.

Is there one equation for a score? No, because the point is that all the different evidence channels have different scoring schemes. That's also why, if you were to write one paper explaining how everything is scored, it would be a gigantic publication: it would have to explain the scoring schemes for how you score fusions, how you score neighborhoods, how you score phylogenetic profiles. That's a separate paper in its own right.
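Although no single equation covers all of STRING, the way the per-channel scores are combined into one final confidence is described across the STRING papers: each channel score is corrected for the prior probability of a random pair interacting, the corrected scores are combined as independent probabilities, and the prior is added back once. A minimal sketch of that combination step; the prior of roughly 0.041 is an assumption taken from the literature, not something stated in this lecture:

```python
def combined_score(channel_scores, prior=0.041):
    """Combine STRING evidence-channel scores into one confidence score.

    Sketch of the scheme described in the STRING papers: correct each
    channel score for the prior, combine the corrected scores as
    independent probabilities, then add the prior back at the end.
    """
    product = 1.0
    for s in channel_scores:
        # remove the prior from each individual channel score
        corrected = max(0.0, (s - prior) / (1.0 - prior))
        product *= 1.0 - corrected
    # combine and add the prior back exactly once
    return 1.0 - product * (1.0 - prior)
```

With a single channel this returns the channel score unchanged, and with no evidence at all it returns just the prior, which is why weak channels barely move the combined score while several strong channels push it toward 1.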
There is a paper out on how we score the phylogenetic profiles. Then there's the whole scoring scheme for experimental interactions, which has changed over the years and is due to change again; there's the scoring of the text mining; and on top of that you have the whole scheme for how you transfer evidence by orthology. So the best people to answer questions about how things are actually done, both for the genomic context channels and the database channel and also the orthology transfer, are in Zurich, where those parts are primarily done. I think I saw David Lyon over in the YouTube chat; he's a former member of my group who's now in Zurich, so he might be able to answer questions about their part over there. I'm just delegating things here without even asking people whether they're willing; he can complain later.

A question about whether two proteins being in FlyBase would give them a score of 0.9. No, the scores are about interactions, right? Of course the proteins themselves are going to be in FlyBase: a Drosophila protein is going to be in FlyBase, because we in fact get the genome annotation from FlyBase, so everything is in FlyBase. But for there to be an interaction scoring 0.9 in the database channel, there would have to be manually curated information in FlyBase saying that they interact. Does that make sense?

I think I'll just quickly go through exercise 1, to show you what the idea is and make sure everybody has understood that part before we dive into Cytoscape after another break. Here's the STRING welcome page when you first arrive. You go in and can query for a protein; I query for insulin receptor. I can either leave the organism on auto-detect or select human. If I'm lazy and leave it on auto-detect, it comes up and says: I know insulin receptor in a lot of different organisms, there are lots of proteins it could be, but I'm guessing you mean this one in human. Yes it is, and I get a
network, and it looks like this. The first thing you notice is that there are many different colors of lines, and if you look down here you will see that the different colors correspond to different types of evidence: the pink lines are experimentally determined interactions, the yellowish lines are text mining, and so on. This is what is called the evidence view, where the colors of the lines represent which types of evidence you have for a certain interaction. Another view is the confidence view, where you don't have multiple lines anymore; instead you have different strengths of lines. Here we are just showing the overall confidence of each interaction: we are more sure about this interaction between these two than about the interaction between those two. So in this mode we are only showing how sure we are, not where the evidence came from. And as in the evidence view, you can still go in and find out where a certain interaction comes from, even in confidence view. So if we are interested in insulin receptor and insulin receptor substrate 1, I'm just going to move these to make it easier to click on this edge and get the pop-up. Now that you see this pop-up, you see we have a little evidence from co-expression, a little from text mining (sorry, actually quite a lot from text mining), the interaction is annotated in manually curated databases, and there is experimental biochemical data, both from the same organism, human, and transferred from other organisms; that's really where the bulk of it comes from. So that shows where the evidence comes from, and you can go in and see more details. As I said, there is experimental data supporting this interaction. If I want to see where that experimental data came from, I click the show button, and it shows me that we have things imported from a number of different databases: IntAct, HPRD, GRID (which is BioGRID), and you see
which type of assay they did: pull-down assays, affinity chromatography experiments, two-hybrid experiments, enzymatic studies, etc. There are loads of them, and you see it's truncated to show just 10; there is that much data. And it's not surprising that we have a lot of data for this interaction; there is a reason why it's called IRS1, insulin receptor substrate 1. Given the name, it's obviously very well known that insulin receptor substrate 1 interacts with insulin receptor.

If we go to the query parameters, this is where there is a bit more to understand. There are a lot of different settings you can choose: there is the minimum interaction score, and there is the maximum number of interactors, and that's really what we are trying to make sure people understand here. What's the difference between these, and how do they play together? Right now we have a network; if you count the proteins, you see there is insulin receptor plus 10 other proteins. If I go over and set the minimum interaction score to 0.7, does it change the network? More specifically, does it change the list of proteins? Looking at the list of proteins before and after, I can tell you it did not change. There's a very simple reason for that: we are ranking the proteins based on their interaction score with insulin receptor, the query protein, and the top 10 proteins all score above 0.9. So when I choose to show the top 10, it doesn't matter whether I set the score cutoff at 0.4 or 0.7 or even 0.9; the top 10 is still going to be the same top 10. However, some of the interactions among the other proteins may score lower than 0.7, so we did lose something in terms of interactions when we raised the cutoff. The next thing: let's turn off all evidence types except experiments. If we only consider the experiments channel, we now get a network that quite clearly looks different, and if we look at the legend, you will see, if you
compare it with before, that the list has changed. The reason is that the score we are ranking on is the score coming from the evidence types we have turned on. When I turn off everything but one evidence channel, the score that the proteins are sorted on, before taking the top 10, is different: it's now the pure experiments score, as opposed to the combined score across all evidence channels. If I now say show me the top 20, at 0.7, experiments only, and update the network, there are clearly far fewer than 20 proteins in this network. What's happening? Well, this is the lowest-scoring protein that scores over 0.7 in the experiments channel alone. So when I ask for a top 20 with a confidence cutoff of 0.7 using the experiments channel only, the answer STRING comes back with is: sorry, I don't have 20 proteins for you; there are not 20 proteins scoring above 0.7 based on experiments alone. And I could lower this, and I still wouldn't get more. I think you see the point. Any questions on this? Otherwise we are ready to dig into Cytoscape.

All right, let's get to Cytoscape. Now you've seen STRING, you've seen some of the other resources around STRING, and I hope you have an understanding of the data that goes into it. We have confidence scores; it's a very complicated setup, but it basically lets you know which interactions are more reliable and which are less reliable. It's based on many different lines of evidence, and that's why it's complicated. Now Cytoscape, unlike STRING: Cytoscape is great for doing analysis of networks, and Cytoscape is great for doing visualization of networks; the latter in particular is what we'll mainly use it for here. But Cytoscape is not a database. This is one of the most common questions that comes up on the Cytoscape helpdesk, which, by the way, I can highly recommend going to if you have Cytoscape problems. It's not a database; Cytoscape does not give you a network.
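The interplay between the two query parameters, the confidence cutoff and the maximum number of interactors, can be sketched in a few lines. This is just an illustration of the ranking logic described above, not STRING's actual server code:

```python
def select_interactors(scored_partners, max_partners=10, min_score=0.4):
    """Rank candidate partners by score, apply the confidence cutoff,
    then keep at most max_partners of them.

    scored_partners: iterable of (name, score) pairs for one query protein.
    """
    # 1. drop everything below the minimum interaction score
    kept = [(name, s) for name, s in scored_partners if s >= min_score]
    # 2. rank by score, best first
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # 3. truncate to the requested number of interactors
    return kept[:max_partners]
```

If the top 10 partners all score above 0.9, raising the cutoff from 0.4 to 0.7 returns exactly the same top 10, which is the behavior shown in the insulin receptor example; only when the cutoff climbs past some partners' scores does the returned list shrink below the requested count.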
When people open Cytoscape and ask "where do I get my network?", the answer is: well, you're using a network tool; you have to load a network in from somewhere. The Cytoscape user interface has three parts. One is the networks, where you're showing a network; that's the visualization. One is the tables; that's where all the data resides, and those are the ones you need to populate with data to have something to show. And then there are the visual styles, which are the mapping of how the tables turn into the pretty visualization that is the network. So here's how it looks: up here is the network visualization, and you see in this case it looks like a STRING network. Down here are the tables, where you have a node table that contains all the nodes, and an edge table that contains all the interactions. Over on the left you have the style panel, where we can say how different properties map onto the network: for example, how the widths of the edges depend on the scores in the edge table, or, if we have some data, how we want that data mapped as colors onto the network.

You can do many things in Cytoscape. You can of course load and save sessions. This is the first thing where people get confused: they want to get a network into Cytoscape, and they think the obvious thing is to go via the File menu and open something. Well, you don't, because opening is for loading sessions, and if you don't already have a session you've saved, you don't have anything to open; it's more like loading and saving project files. To get data into Cytoscape in the first place, you need to use import. You can import a network either from a local file, so if you have your own tab-delimited file or Excel sheet with interactions from your own study, you can load it into Cytoscape using the local file importer, or, if you don't have your own, you can import a network from a public database, which can obviously be STRING. You can also import tables, and that one is something that is
named a little bit confusingly: import table is really all about importing node table information. When you have already imported a network, and you have, for example, your own omics data that you want to bring into Cytoscape as well in order to visualize it on the network, you use the import table functionality to load your data into the node table.

You then have property mappings, and there are a few of them. There's the pass-through mapping, which just takes the data as-is and puts it on the network; that's what you use, for example, to say: this field contains the name, and I want a pass-through mapping of that name to be the text label on the node. You have discrete mappings, which are for discrete data, data that consists of classes: say you have clusters A, B, C, and D, and you want things in cluster A to be red and things in cluster B to be blue; you would use a discrete mapping. And then you have continuous mappings, which are the ones you want when you use, for example, a color gradient to map something like a log fold change value onto a network, or when we take the STRING confidence scores and map them onto how strong an edge's line is.

Then there's the default and the bypass. I just talked about the mappings; these are the other two parts of how you map data in Cytoscape. The default color is what is shown if you don't have any data. So if I load in some log fold change values, give them a color gradient going from blue through white to red, and set the default color to gray, then nodes with missing values will be gray, because there is no log fold change value, whereas something that I have data for, but that doesn't change, will be white. The bypass lets you select nodes and set properties specifically for those nodes: I have this node, and I want it to be orange. It's very powerful, and you can do a lot of things with bypass, but it's also a bit dangerous.
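A continuous mapping is essentially linear interpolation within a color gradient. Here is a toy sketch of a blue-white-red gradient with clamping, the way you would configure the min/max handles in Cytoscape's continuous mapping editor; this is illustrative only, and Cytoscape's own interpolation may differ in detail:

```python
def lerp_color(value, lo=-3.0, hi=3.0,
               low_rgb=(0, 0, 255), mid_rgb=(255, 255, 255),
               high_rgb=(255, 0, 0)):
    """Map a log fold change onto a blue-white-red gradient.

    Values outside [lo, hi] are clamped, mirroring what happens when
    you manually set the min and max handles of a two-ended gradient.
    """
    v = max(lo, min(hi, value))
    if v < 0:
        t = (v - lo) / (0 - lo)   # 0 at lo, 1 at the midpoint
        a, b = low_rgb, mid_rgb
    else:
        t = v / hi                # 0 at the midpoint, 1 at hi
        a, b = mid_rgb, high_rgb
    return tuple(round(a[i] + t * (b[i] - a[i])) for i in range(3))
```

Tightening `lo` and `hi` from the data's extremes to, say, -3 and +3 is exactly the trick used later in the demo to stop a few outlier proteins from washing out the colors of everything else.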
You can't see what you've done in a session, so if you mess things up with bypass it can be quite messy; I would generally try to avoid bypass when you can. Another important piece of core Cytoscape functionality is selection filters. Instead of selecting things by dragging out a rectangle, or clicking a node, holding down a button, and clicking more nodes, you can select based on properties: select everything for which the log fold change is greater than some value, select all nodes connected to the currently selected node, all those kinds of things. So you can select by attributes. You can also use a lot of layout algorithms; Cytoscape has powerful algorithms for laying out networks, and you'll be playing a bit with those. I highly recommend installing the yFiles app, which has additional layout algorithms. For clustering, there's the clusterMaker2 app, which is really powerful for running all kinds of clustering algorithms on a network.

That gets me to the Cytoscape App Store. It's a bit of a bad name, I know; John "Scooter" Morris, our collaborator on the Cytoscape core team, prefers that it were called the toolshed instead, because it's not an app store in the Android or Apple sense; it's not a place where you have to pay for anything. The App Store is just a place from which you can install additional tools into Cytoscape, and it's all free. One of the apps you can find there is stringApp, along with a few related ones. The stringApp basically does this: it allows you to very easily take STRING and get it into Cytoscape. It has a lot of added functionality as well, but this is really its core function: getting STRING into Cytoscape. You can do several types of queries. You can do protein queries, where you typically query for a long list of proteins and fetch a network for them. You can do a disease query, where you use the DISEASES database and go
and say: I want a network for Alzheimer's disease. You just query for Alzheimer's disease, get the top N proteins for that disease, and retrieve a STRING network for them. Or you can do a PubMed query, where you start from some topic of interest, query the literature for that topic, find the proteins of interest by text mining of that literature, and then put a STRING network on top.

One of the most common use cases for getting STRING networks into Cytoscape is visualizing omics data. As a simple case, here we have a proteomics experiment, and what came out of it is, of course, your typical Excel table with a lot of columns. You see we have a column called "uniprot"; it's always good to use things like accession numbers instead of gene names when querying a database, so we want to use that. We have a lot of data coming out of the mass spec, including things like log fold change values after 5 minutes and after 10 minutes. What you do is take the whole list of UniProt accession numbers from the table and do a protein query, and that way you fetch a STRING network. You then use import table to load all the additional columns from the Excel table into Cytoscape as well, and you can then color the network by fold change, for example. So you take the fold change values from the Excel table and map them as colors onto the nodes, using the visual property mappings in Cytoscape, on a STRING network. Here you have a network where the coloring of the nodes comes from the user's own data and the edges come from STRING.

Other things you can do: enrichment analysis. We didn't use it in the web interface, but STRING has functionality for enrichment analysis, so you can do a GO term enrichment analysis, find terms that are significantly enriched in this network, and map them onto the network. You see there are now these halos, circles around the nodes with several segments in different colors, and these colors represent different terms.
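As an aside to the protein query workflow just described, the same kind of network retrieval can be scripted against STRING's REST API. Below is a sketch that builds the request URL and parses the TSV response; the endpoint and parameter names follow STRING's published API documentation, but you should check them against the current version before relying on this:

```python
from urllib.parse import urlencode

def string_network_request(identifiers, species=9606, required_score=700):
    """Build a request URL for the STRING REST API's network endpoint."""
    base = "https://string-db.org/api/tsv/network"
    params = {
        "identifiers": "\r".join(identifiers),  # one identifier per line
        "species": species,                     # NCBI taxon id, 9606 = human
        "required_score": required_score,       # 700 = 0.7 confidence cutoff
    }
    return base + "?" + urlencode(params)

def parse_network_tsv(tsv_text):
    """Turn a TSV network response into (protein_a, protein_b, score) triples."""
    lines = tsv_text.strip().split("\n")
    header = lines[0].split("\t")
    a, b, s = (header.index(col)
               for col in ("preferredName_A", "preferredName_B", "score"))
    return [(row[a], row[b], float(row[s]))
            for row in (line.split("\t") for line in lines[1:])]
```

In practice the stringApp does all of this for you from inside Cytoscape; a script like this is mainly useful if you want to fetch networks in a batch pipeline.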
The terms shown are the ones significantly overrepresented in the network. We can also handle things like site-specific data. This is particularly interesting to people who work with phosphoproteomics or other post-translational modifications, or cases where you have multiple comparisons, for example time-series data. For that you can use the Omics Visualizer app, which is also developed in my lab, in particular by Marc; and Nadja, who is in the chat, has been one of the main drivers of the stringApp. In the same way, with these colored circles with multiple segments, you can then map log fold change values for different phosphorylation sites on the same protein, across multiple different comparisons. You can of course make overly complicated figures this way if you want; you shouldn't, but it is a very powerful visualization tool, and it gives you a lot more power than core Cytoscape in terms of mapping complicated data onto a network. And since it comes from my group, it is of course designed to play very nicely with the stringApp; stringApp and Omics Visualizer work great together.

Now, the bad news is that what I showed you here were really easy examples, and they were easy because they were small. With typical omics data sets you don't have a small network like the one I just showed you; typically hundreds to a thousand, even a few thousand, proteins come out, and for those you will typically have a thousand to ten thousand interactions. When you show that, you end up with something that looks like this. At first maybe you think it looks nice; I mean, I see lots of networks like this published in the scientific literature. But try making any sense of this network. What can you see? The only thing you can see is that it's a big network and everything seems connected to everything. It is what I, and many people in the field, like to call a ridiculogram: everything is linked to everything, and it's ridiculous to make a plot like this because you
can't see anything from this figure. So what we typically want to do with a big, complicated network like that is to use network clustering to cut it up. That's where, in the exercises, we use the clusterMaker2 app, which comes from our collaborator John "Scooter" Morris at UCSF, together with stringApp in Cytoscape, and we use it to identify functional modules in the network. We then typically cut down the network and show only the interactions that are within the clusters, and that way we can take the network from before and turn it into something that looks much, much simpler. It's still a big network, and it still takes time to understand, but I hope you agree that it is a much nicer figure than the one we had before.

To summarize: I hope I've convinced you that networks are a very useful abstraction, and I hope you remember nodes, which are the things in your network, and edges, which connect your things: typically proteins and interactions. You heard a lot about the STRING database; it's a database of protein networks, and it has a whole suite of related resources around it: subcellular localization, tissue expression, disease associations. All these resources, STRING and its sister resources, are made by integrating heterogeneous data from many different places, and they all rely heavily on text mining, on mining information out of the literature, because no matter how many databases you import, you're not going to capture everything; there is so much written in the literature that is not in the databases. I've also told you a bit about Cytoscape. I think the best way of learning Cytoscape is to do hands-on exercises, so that's basically what we'll be doing for the rest of the day. It's a network tool with apps that you can plug in, like the stringApp, which is really useful for working with STRING networks inside Cytoscape, and Omics Visualizer; we can do a lot of data visualization on STRING networks, both using the core functionality of Cytoscape and using Omics Visualizer.
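The "show only the interactions within clusters" step mentioned above is conceptually just a filter over the edge list, given cluster assignments from an algorithm such as MCL. A sketch of the idea, not clusterMaker2's actual code:

```python
def intra_cluster_edges(edges, cluster_of):
    """Keep only edges whose endpoints are in the same cluster.

    edges: iterable of (node_a, node_b) pairs.
    cluster_of: dict mapping node name -> cluster id (nodes missing
    from the dict, i.e. unclustered nodes, lose all their edges).
    """
    return [(u, v) for u, v in edges
            if cluster_of.get(u) is not None
            and cluster_of.get(u) == cluster_of.get(v)]
```

Dropping the inter-cluster edges is what visually pulls the hairball apart: the layout algorithm no longer has long-range edges tugging every module toward every other module.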
With that, I just want to thank the many people behind this work. I've already mentioned the STRING database several times; it's a huge collaboration, and it has been running for a long, long time. It all started in Peer Bork's group at EMBL, where Christian von Mering and I were both members at the same time. Christian then started his own lab in Zurich, I started my lab in Copenhagen, and we're now working together as a consortium, all of us still involved in the STRING database. Lots of people from the groups have worked on this. I want to particularly highlight Damian, who was one of my first PhD students and has now been in Christian's lab for quite a few years, with STRING as his core activity; he's been the mastermind behind the orthology transfer and many other parts of STRING over the years. I mentioned Helen, who did STRING Viruses, certainly a timely thing now. Michael Kuhn was the mastermind behind STITCH, if you're interested in small-molecule compounds. David Lyon, who I think might still be online, was first a postdoc in my group and is now also in Christian's group in Zurich (you begin to see a pattern here); he has done lots of things, especially related to the enrichment analysis functionality. On the text mining front, Alexander Junge in my group has done a lot of work; Dhouha, a current postdoc in the group, has worked a lot on getting the full-text mining in; and Katerina, who's here helping today, is working very hard on what is going to be the next generation of text mining in STRING, using BERT, transformers, and all that fancy stuff. I've been collaborating for many years with Vangelis in Greece, who drove a lot of the early text mining work together with me, and Sampo Pyysalo is our collaborator, with Katerina, on the whole BERT project. On the knowledge graph side, Sune Frankild was the main reason this whole thing started, with the DISEASES database, and Oana has been doing a lot
of work, in particular on the TISSUES resource, and Alberto has worked both on the TISSUES resource and on making a knowledge graph of the whole thing. On the Cytoscape side, the stringApp and Omics Visualizer: Nadja has done a ton of work there, collaborating with Jan Gorodkin's group, where she was joint between my group and Jan's. John "Scooter" Morris I've mentioned several times; he's one of the core developers of Cytoscape and heavily involved in both stringApp and Omics Visualizer. And Marc is the main developer of Omics Visualizer. So thanks for your attention, and thanks for the funding as well, of course. I'm happy to take some questions, and after that I'll give you a quick demo of what you can do in Cytoscape before we break for lunch and then continue with exercises in the afternoon. Do we have any questions?

There's a question about whether you could do things with antibiotic resistance. Maybe the smartest thing is to just try it as a demo. How would I go about making a network for antibiotic resistance in some bacterium? The first thing is that I need to find literature: I want to dig out a list of antibiotic-resistance proteins from the literature. So I'll start by going to PubMed. Let's see: "antibiotic resistance" gives me 200,000 or so; I need to share the screen; a query for that gives me a list of 200,000 papers. What should we narrow it down with? Streptococcus, yeah, that's good; we now have 14,000. Let's see what that gives. No, bad idea. So this gives us a bit less than 14,000 papers about antibiotic resistance in Streptococcus. I'm just going to copy that query and jump into Cytoscape, because this is a perfect case for illustrating the PubMed query. I go to import network from public databases, and when you have the stringApp installed, you can choose, under data sources, instead of the universal database client, the STRING PubMed query. Obviously I don't want a human network now; I want Streptococcus.
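The PubMed part of this demo can also be reproduced programmatically through NCBI's E-utilities. Here is a sketch that only builds the esearch URL for counting hits; the database and parameter names come from the public E-utilities documentation, and the query string mirrors the demo:

```python
from urllib.parse import urlencode

def pubmed_search_url(term, retmax=0):
    """Build an NCBI E-utilities esearch URL for a PubMed query.

    With retmax=0 the response just reports the hit count, which is
    how you would reproduce the '14,000 papers' number from the demo.
    """
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed",
                                   "term": term,
                                   "retmax": retmax})
```

For example, `pubmed_search_url("antibiotic resistance AND streptococcus")` gives a URL whose XML response contains the total count; the stringApp's PubMed query uses the same underlying database, which is why it reports the same number of papers as the website.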
Let's see which one we should take; let's take that one. I paste in the query and ask for 50 proteins. Now, if everything works, we are going to get a network of proteins from this species that are mentioned a lot in abstracts about antibiotic resistance in Streptococcus. It queries PubMed, and it found the same number of papers as we found on PubMed ourselves; it should, since we are using the API for the same database, so it should obviously give the same count. It then goes to my database of precomputed text mining results, figures out which proteins are mentioned a lot together with this topic, and retrieves a STRING network for those proteins. So the list of proteins came out of text mining, but the network came from STRING. I hope that answered the question: can you use this to make a network for antibiotic resistance genes? Yes you can, you very much can.

What else could you do? Let's illustrate the features here. You see down here the node table, showing the list of proteins that came out of this query. We have the edge table, which holds the interactions coming from STRING, and a network table that you don't really need to worry about here. We have a lot of columns, and one thing you'll see over here is the text mining scores. Because we did a PubMed query, we have a score, which is actually what the proteins were ranked on, that combines how many papers in PubMed overall mention a gene with how many of my 14,000 papers from the PubMed query mention it. I can now use these and choose to do a coloring: let's say we want to color the proteins based on how strong the text mining evidence is for these genes being related to antibiotic resistance. For that you go to the style panel, to the node styles. The first thing is that we currently have these colors that are reminiscent of how things look in STRING; it's called
"STRING colors". I turn that off, and now I say I want the fill color to depend on the text mining score. I then choose the mapping; you see pass-through, discrete, continuous, and since this is a continuous-valued score, I want a continuous mapping. I now get a color gradient that turns the score values into colors on the network, and you see it's a very skewed distribution: there's really one protein with very strong text mining evidence, and most of the others score quite low. You can go in and customize these gradients; I could pull the middle color down and maybe get a bit more color into the network that way. So that's the kind of thing you can do with a PubMed query.

On the import page, what does "use smart delimiters" refer to? Maybe Nadja can explain; I think it's for when you have lists of things instead of one identifier per line, so it would automatically split them, for example a comma-separated list of identifiers. I think that's what it is. The reason you need to be able to turn it off is that, depending on what your identifiers look like, splitting could break your query; if you have a clean list where everything is on a separate line, it makes no difference. Am I right, Nadja? Yes! I got something right; I'll note that down as a win.

Let's look at something else. We have somewhere here a spreadsheet; in fact, it is the one I showed on the slides, with UniProt accessions, a lot of different log ratios, and so on. Let's try to get a network for it. I select all of these, edit, copy, jump into Cytoscape, and go: file, import network from public databases, STRING protein query, Homo sapiens, paste in my whole list of UniProt identifiers, and maybe crank up the confidence a little. It doesn't matter whether I have smart delimiters on or not, because things are nicely separated on different lines, and these are UniProt accession numbers
that don't have any spaces or commas or other funny characters that this functionality could trip over. I click import. There's one ID that could match several different proteins in STRING; I just use the first one as the representative (it's a whole set of histones that are nearly identical). I fetch the network; I get a network. The next thing is that I would now like to visualize data on this network. For that I want to import table, import table from file. I have the same Excel table as a tab-delimited file on the desktop of my computer, so I just import that. You now see basically the same spreadsheet shown inside Cytoscape, and the important thing is that here I have the UniProt accession, marked with a little key, and up there is the key column for the network. Since we queried with the UniProt identifiers, all of those landed in a column called "query term", so the accession numbers over here are the ones in the query term column, and I want to match this key column against that key column when I import. It may seem like nothing happened, but when we scroll all the way to the right, you will see that I now have a bunch of new columns: gene name, peptide sequence, 5-minute and 10-minute log ratios, etc. So I got all my data from the spreadsheet into Cytoscape, and that means I can now go and say: fill color, edit the mappings, color things based on the 10-minute log ratio, and make it a continuous mapping. You see it automatically detected that the column has negative values, so you want a two-ended gradient. This makes sense, but the colors are really faint; it's hard to see anything. You would typically first pick a different palette with stronger colors; let's go to, I don't know, this one. Boom, that's stronger. It still doesn't really fix it, right? The next thing we can do is change the min and max values, because it just auto-scales.
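Stepping back to the import table step just shown: the key-column matching is essentially a dictionary join between two tables. The column names here ("query term", "uniprot") mirror the demo, but the function itself is made up for illustration and is not Cytoscape code:

```python
def import_table(node_rows, data_rows, node_key="query term",
                 data_key="uniprot"):
    """Mimic Cytoscape's 'import table': match a data table onto the
    node table by key columns and append the data columns to the
    matching node rows. Rows are plain dicts in this sketch."""
    # index the data table by its key column for O(1) lookups
    index = {row[data_key]: row for row in data_rows}
    for node in node_rows:
        match = index.get(node.get(node_key))
        if match:
            # copy over every column except the key itself
            node.update({k: v for k, v in match.items() if k != data_key})
    return node_rows
```

Nodes whose key has no match in the data table simply keep missing values, which is exactly the situation the default color handles in a continuous mapping.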
The auto-scaling is based on the biggest values in those columns, but usually you have a few proteins that are outliers, very highly regulated compared to the rest. So if we just go from minus 3 to plus 3: okay, now we're getting somewhere, right? Now you have strongly colored proteins, and you can see what's going up and what's going down. The next thing: up here there are these little structure images inside the nodes. They may look very nice, but in a big network you can't really see them anyway, and when you're trying to map data onto your network they disturb the coloring, so let's turn them off so the colors stand out even more clearly. We could also try a layout; the one here is actually not too bad, but we could run a yFiles organic layout, which usually does a better job. There you have it; that gave you a nicer network. And I think that was it, very quickly, without even getting to Omics Visualizer.

There was a question on YouTube: can you combine two sessions? Well, you can't really combine sessions, but you can combine networks; the networks have to be in the same session for you to be able to merge them. So let's do a really quick demonstration. Of course I can't merge these two particular networks in a meaningful way, since one is a Streptococcus network and one is a human network, but let's make another couple of quick human networks: import network from public databases, and let's illustrate a disease query. A query for Alzheimer's disease: give me the top 100 proteins; actually, give me a big network, 200 proteins, something like that. That gave me a network for Alzheimer's disease, which is a ridiculogram, as you would expect. Then we can import another DISEASES network from public databases, Parkinson's disease, and do the same for that. So now I have a network for Alzheimer's disease and a network for Parkinson's disease, and I could go merge these networks. Now, one little
thing: if I want to be clever, I notice there's this score over here called disease score, and you have the same down in the Parkinson's network, also called disease score. Of course, in one network it refers to Alzheimer's disease and in the other it refers to Parkinson's, so that's a little bit confusing. You can rename them: I'm going to rename this one, let's just call it AD score, and then I go to the Parkinson's network, right-click that column, and rename it to PD score. Boom. Now I have these two networks, and I can go to tools, this is where we combine networks: tools, merge networks. I can take the union of the Parkinson's disease network and the Alzheimer's disease network and merge them, and now I have a big network. You see I had 200 nodes here and 200 nodes in this network; the merged network contains 321 nodes, because there's of course an overlap in terms of which proteins are involved in both diseases. By the way, you may notice now that it's a big network and, help, where did all my edges go? Why is it not showing any edges? It's because it tries to be fast: when the network is very big and I'm zoomed out like this, it doesn't show me everything. If I go to view and tell it to draw everything, you see now the labels are drawn, all the edges are drawn, life is good. Now, one of the tricky things here is that when you're merging networks like this, you're of course just taking the union of them. So I took the union of this network and this network, and if I have a protein that is only in the Alzheimer's disease network and another protein that is only in the Parkinson's disease network, the merged network will not have an edge between them, because that edge wouldn't be in either of the two input networks, so it's not going to be in the merged network. There are ways of fixing that; it's a bit of a workaround. I set this merged network as a STRING network, and then, you know how we set the confidence when we are importing a network? You can change the confidence afterwards, so
what I'm going to do first is crank it up all the way. We're never 100% sure about anything, so I just deleted all the edges. Now, if I lower the confidence, the app will go back to the server and retrieve all of the interactions, so it can retrieve interactions between those proteins too. I can also go all the way down and say give me the biggest possible ridiculogram: get me all the interactions that we have in STRING, no matter how weak they are. So that's how it looks when you do that, and as you can see, we can handle pretty big networks; you're looking at a network with more than 16,000 edges here. Of course I can do a layout on that; it's not really going to make things look pretty. I mean, that's your ridiculogram if you've ever seen one. This is where you want to use things like clusterMaker. There's a whole lot of clustering algorithms in clusterMaker, but MCL clustering, in everyone's experience, just works amazingly well for STRING networks. Well, it works amazingly well in general, but for STRING networks in particular I would always use that one. Granularity, that's something that says how much we want to cut up the network; I'm going to go with something like 4 here, because this is a really big network I want to cut up. And then it can take into account that this is a weighted graph. That's really important, that's one of the nice things about MCL: it can handle that the edges are weighted. Tell it that the attribute that matters is the score from the STRING database, and then I say create a new clustered network, and I don't want to restore the inter-cluster edges. So now it's going to run the clustering, figure out which proteins are in a cluster together, and make a new network that has only the interactions between nodes from the same cluster. And... that didn't really manage to break apart the hairball. That's the danger of live demos; I should have picked an even higher granularity parameter. But you see it did break it up a little bit. In the interest of
time, I'm not going to play around and find a better parameter for it. You could increase the parameter to break it up even more, and you could also have filtered on confidence first. In any case, this is it. You might think: now I have all these proteins, how can I select the ones with a given property, say, all the ones that are involved in Alzheimer's disease? This gets us to the selection filters. Select, column filter: I can now select proteins based on different properties, and what I want to select on is, where are you, AD score. Now I've selected all the ones that have an AD score, so that's all the ones that were involved in Alzheimer's disease, and this is where you could then set a bypass color, make them red for example. When I click outside to deselect them, you'll see all the ones that had an Alzheimer's score are now red. How are we on time, am I running a bit over? No, I still have ten minutes. Any questions, any other stuff you would want me to demo in Cytoscape? Maybe one thing I should have shown you is that when we're importing from the databases, we get a lot of different properties. You see all these columns called tissue; those come from the TISSUES database, and that means we automatically get tissue expression data in. So if you want to ask, this network of things that are supposedly involved in Alzheimer's disease, does it look like they're expressed in the nervous system? I could immediately go and say fill color, edit mapping, and make the fill color depend on the tissue nervous system column. There we have a confidence score that goes from zero to five in terms of how sure we are, and now I've colored the network based on the expression evidence. As you see, the vast majority of the proteins got a very dark color, meaning we have strong evidence for expression in the nervous system for pretty much all the proteins involved in Alzheimer's disease. That's what you would expect, both because the nervous
system is a very well-studied tissue where we have good expression data, but also because, obviously, you would expect the proteins involved in Alzheimer's disease to be expressed in the nervous system. Of course, one downside to doing this is that you can only show one color: when I set the fill color here, I have to choose one tissue. What if I wanted to show multiple tissues? That's where we have some brand new functionality in Omics Visualizer. Instead of importing a table from a file, I can actually import a table into Omics Visualizer from the node table. So I can do crazy things like, say, show all the tissues. I'm not saying that's a good idea, just to be clear; you could select a few of them, but just to show what can be done, I import all the tissues into an Omics Visualizer table. I now have them in this other table here, and I can say I want to show that as a pie visualization on the network: values, continuous mapping, yes, that sounds great, draw it. And now what I get is that every single node is a little pie-shaped heat map, if you will, where the different slices correspond to different tissues with their scores mapped onto them. You cannot read this with this many slices, but with 2, 3, or 4 slices it could work. No more questions on YouTube? No more questions on Zoom? Ah, a question on YouTube: can I show mapping of compounds to proteins, like drug-protein interactions? Yes, yes I can. There are actually a few things you can do, so let me just go... I think I've messed up the session so badly by now that it would be good to start from a clean slate. There are a few things we can do. One is, by the way, there's also this query panel here where you can query; I always forget it myself. So I fetch a network here... that's interesting, why did it decide to do something a bit funny here? That's a bug in Cytoscape: it seems like, despite
closing the session and starting a new one, the bypass was remembered. You can right-click when you've selected all of them and remove the bypass. That's a bug report I should file. Anyway, I have a clean Alzheimer's network now; let's just lay it out. By the way, when things are a bit too densely packed, the layout tools let you spread them out, a super useful feature. Now, in the node table I have a lot of information, and somewhere in there we have information from Illuminating the Druggable Genome. Illuminating the Druggable Genome is a big NIH project that I'm involved in, specifically in the so-called Knowledge Management Center, which is headed by Tudor Oprea at the University of New Mexico. There we have things called target development levels and target families, so we can color things based on the target development level, which is, think of it as: is this something where we have an FDA-approved drug, is it something where we have a small-molecule compound, is it something that is reasonably well characterized biologically but where we don't have a compound yet, or is it a so-called dark target, which is a target from a protein family where we are pretty sure we should be able to make drugs against it, but we know nothing about this particular protein. So let's take the target development level, and this is a good chance to use the discrete mapping. Now you see here you have Tclin and Tchem, so we can take the ones where we have FDA-approved drugs and mark them blue, and take the ones where we have small-molecule compounds but nothing FDA-approved and color them orange. And there you have it: drug target information mapped onto the network. So that's one thing you can do. The other thing you can do is use the STITCH database: go to import network from public databases, do a STITCH protein/compound query, and query for something like, I don't know, Gleevec, if I can spell it. Ask for some interaction partners of Gleevec,
let's say give me 20. And now I get imatinib, which is another name for Gleevec; I think that's the correct chemical name for it, and Gleevec is a brand name for the drug. Dasatinib is a very closely related compound; if you look at the structures over here, you will see that those two compounds are very similar. You then have information about the targets, both the ones that are, sort of, the approved targets and the ones where there's other information relating them to these compounds. I think we are almost done with questions. And this afternoon... I don't know, should I keep the YouTube stream running? It's going to be a long one, but I guess YouTube can handle it, or should we just run the afternoon purely as a Zoom thing? A question: will it consider off-targets as well? Yes, it will, though it depends on what you mean. When you're doing the STITCH query, absolutely: we're just looking at whether we have binding information or things being mentioned a lot together in the literature, so it's the same kind of functional association as you have in STRING. It's not just going to give you the approved, intended targets; it's definitely going to give you things like off-targets too, and also things like cytochrome P450 enzymes that are involved in metabolizing the drug. Then the question here: isn't the label option under the style tab meant to change the font size for the protein names?
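The score cutoff used throughout this demo, both with the stringApp confidence slider and with STITCH scores, boils down to dropping edges below a threshold. A minimal sketch in Python; the protein pairs and scores below are hypothetical illustrations, not values taken from the STRING database:

```python
# Hypothetical weighted edge list: (protein A, protein B, combined score).
edges = [
    ("APP", "PSEN1", 0.999),
    ("APP", "APOE", 0.95),
    ("APOE", "CLU", 0.62),
    ("CLU", "CR1", 0.18),   # a weak association
]

def filter_by_confidence(edge_list, cutoff):
    """Keep only edges whose score is at or above the cutoff,
    mimicking raising the confidence slider."""
    return [(a, b, s) for (a, b, s) in edge_list if s >= cutoff]

# Raising the cutoff to 0.9 leaves only the two strongest associations.
print(filter_by_confidence(edges, 0.9))
```

This is also why cranking the slider all the way up first deletes essentially every edge: almost no association has a perfect score.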
Yes and no. There's a label font size over here, and if I change the label font size to 18, they all become bigger. However, if you're not seeing that, it's probably because you're also not seeing the names being centered, which means you are using STRING-style labels, the ones with sort of a little shadow on them, offset relative to the center. If you're using plain Cytoscape labels, then these settings will affect them. It can also be really useful that, over here, you see it's the column called display name from STRING that is shown as the label. If you're importing some omics study and your Excel table has a column with the gene names you prefer to use, you can change the label from being the display name to being the gene names from the Excel table you brought in with import table, so that the names shown in the figure are consistent with what you show in all the other figures in your paper. Is there a way to map known PTMs from the literature, for comparison? Not purely from Cytoscape, no. If you have your own PTM data and PTM data from elsewhere, you would have to gather the latter outside Cytoscape; you can of course import both of them into Cytoscape to compare them afterwards on the network, but you can't do it purely within Cytoscape, I believe. Although, there is the App Store, and it's massive, there are so many apps in there, so I'm not going to rule out that one of the hundreds of apps is capable of doing what you want; I'm just not aware of one that does it. Then there was a question, let me see, a YouTube question about tips and tricks for dealing with huge networks. It depends on what you mean by huge. There are a couple of things to know about Cytoscape and huge networks. One is that, of course, you have networks and you have visualizations of them. If I import a massive network into Cytoscape, by default it's not going to draw the network, because it says it's too big and there's no point. If you really insist, I can make it do it,
but it's just going to be slower and gobble up a lot of memory, so you're better off analyzing it without a visualization first. However, if you have truly big networks, so big that it really doesn't make sense to try to draw them, and you know how to program, I would highly advise that you do as much of the analysis as you can outside Cytoscape before bringing them in. When I say huge networks, I'm talking tens of thousands of nodes and hundreds of thousands to millions of edges. Cytoscape can handle that if your computer is powerful enough; I just don't think it's a good way to work with massive networks. If we're talking merely big networks, a thousand proteins, lots of nodes, where the challenge is not so much whether your computer can handle it but how you draw it, then the trick is really things like clustering to cut it up. That's the key thing, because you have this big ridiculogram and you basically have to chop it into smaller modules for it to make sense. And absolutely, if you're loading massive networks into Cytoscape, it is going to take a lot of memory. I would not really want to run Cytoscape on a machine with less than 8 GB of RAM, and for working with kind of big-ish networks you would want at least 16. Everything I did here, which you saw was pretty snappy, was run on a PC with 16 GB of memory, and I was not using much of it. Another question: it would be neat to have a way of fetching data directly from PTM databases into Omics Visualizer. That's something we could consider; it would be a neat functionality to have. It depends on whether those databases have the necessary APIs to even allow us to do that; I haven't explored it. It would be super neat if one could automatically query some of the major databases, like PhosphoSitePlus and whatever they are called, and retrieve PTM data directly from them. What you can do is, if they offer something like an Excel table, you can
download that and then just import it into Omics Visualizer from the file. Good, I think we'll stick with the timetable and stop here. We have a one-hour lunch break, and then we'll be back on Zoom with the Cytoscape exercises. Since there won't be more lectures or anything like that in the afternoon, I will end the YouTube stream here. You're welcome to contact me if you have questions; if they're Cytoscape questions, I suggest you ask on the Cytoscape helpdesk. We are monitoring that as well, so you will very likely get an answer from us there, but it means the answer gets shared with other people. So instead of everybody asking the same questions in private emails, it's really better that you ask your questions where the answers get out in the open to help other people as well. So thanks everyone, I'm happy that so many people were here, and so many people on YouTube as well. See you back in an hour, those of you on the course on Zoom. Enjoy lunch. Bye.
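The union-merge pitfall discussed above, that merging two networks cannot create an edge between a protein found only in the first network and a protein found only in the second, can be sketched with plain edge sets; the gene pairs below are hypothetical illustrations, not STRING data:

```python
# Sketch of what Tools -> Merge Networks (union) does, and why it cannot
# invent edges that were in neither input network.
# NOTE: the edges below are made-up examples for illustration only.

alzheimers = {("APP", "PSEN1"), ("APP", "APOE")}      # network 1 edges
parkinsons = {("SNCA", "PRKN"), ("PRKN", "PINK1")}    # network 2 edges

merged = alzheimers | parkinsons   # union merge: every edge from either input

nodes = {n for edge in merged for n in edge}
print(len(nodes), "nodes,", len(merged), "edges")  # prints: 6 nodes, 4 edges

# APP occurs only in the first network and SNCA only in the second, so no
# APP-SNCA edge can appear in the merge, even if a database knows of one.
assert ("APP", "SNCA") not in merged and ("SNCA", "APP") not in merged
```

This is exactly why the workaround in the demo sets the merged network as a STRING network and then lowers the confidence cutoff: the app re-queries the server, which can fill in edges between proteins that came from different input networks.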