Welcome to this brief introduction to the STRING database. As of version 11.5, STRING contains more than 14,000 genomes, which encode a total of more than 67 million proteins. The goal of STRING is to tie these together both with physical interactions and with functional associations that tell you which proteins are likely to work together, even if they don't physically bind.
The evidence for this comes from many different sources. The first class is genomic context, which is what we can pull out of the genomes themselves. That includes gene fusion, where we look for events where two genes in one organism have been fused into a single protein-coding gene in one or more other organisms. We also look at gene neighborhood to identify evolutionarily conserved operons in these many genomes, and we look at phylogenetic profiles to identify genes that show a similar presence-absence pattern across the tree of life. All of these methods allow us to infer functional associations just from the genomes.
However, if you want the best possible networks, especially for higher eukaryotes, you also need to take other experimental evidence into account. That includes looking at gene co-expression to identify genes that show similar expression patterns across many different conditions, and looking at interaction experiments in which people identify physical interactions, for example by pull-down experiments. We also take into account curated knowledge from manually annotated databases of protein complexes as well as molecular pathways. This includes, of course, the big biochemical maps that you may have seen in a lecture some time before. Not everything is in databases, so we use automatic text mining to process the vast biomedical literature. Briefly, we identify gene and protein names, look for co-mentions to identify functional associations, and use deep learning to pull out physical interactions. This topic is covered in more depth in another presentation.
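
To make the phylogenetic profile idea mentioned above a bit more concrete, here is a minimal sketch in Python. It is only a toy illustration, not the method STRING actually uses: the gene names, the six-genome profiles, and the simple "fraction of genomes that agree" similarity are all assumptions made up for this example.

```python
from itertools import combinations

# Hypothetical toy data: one entry per genome, 1 = gene present, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 0, 1, 0),
    "geneB": (1, 1, 0, 0, 1, 0),
    "geneC": (0, 0, 1, 1, 0, 1),
    "geneD": (1, 1, 0, 1, 1, 0),
}


def profile_similarity(p, q):
    """Fraction of genomes in which two genes agree on presence or absence."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a == b for a, b in zip(p, q)) / len(p)


def rank_pairs(profiles):
    """Return (similarity, gene1, gene2) tuples, most similar pairs first."""
    pairs = [
        (profile_similarity(profiles[g1], profiles[g2]), g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
    ]
    # Sort by descending similarity, then by name so the order is stable.
    return sorted(pairs, key=lambda t: (-t[0], t[1], t[2]))


ranked = rank_pairs(profiles)
```

In this toy data, geneA and geneB have identical profiles and end up at the top of the ranking, while geneA and geneC never co-occur. A real analysis would also have to account for how the genomes are related to each other, which this sketch ignores.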
However, there are quite a few problems when it comes to doing all of this and building a resource like STRING. Firstly, there are many databases. You have to go to many sources to get all the evidence. Secondly, these databases tend to be in different formats, and even if they're in the same format, they tend to use different names for the same genes and proteins. The evidence is of highly varying quality, which is to say that some of it is really bad. And the data is fundamentally not comparable. How do you compare physical protein interactions to, for example, evolutionarily conserved operons? And not all the evidence is even in the same species. You may be interested in two human proteins, but the evidence could be in mouse.
Some of this is just hard work. There are many databases; we need to download them. There are different file formats; we need to write parsers. They use different identifiers, so we need to make mapping files. This is a lot of work, but there's not much to say about it.
Things get a bit more interesting when dealing with the varying quality. For that reason, we developed quality scores for each type of evidence that allow us to rank the interactions from the most reliable to the least reliable. If we're looking at physical interaction data and we have pull-down experiments, we might rank the blue and the green protein relative to other interactions based on how many times we've seen them together in pull-downs versus how many times we've seen them apart, that is, seeing one but not the other in a pull-down. We then do score calibration to make these different scoring schemes for different types of data comparable to each other. We do that by comparing everything to a common gold standard, which in our case is KEGG pathway maps. That allows us to take the raw quality score and convert it into a posterior probability of two proteins being in the same pathway, given any one piece of evidence.
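
The raw scoring and calibration steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not STRING's actual procedure: the together/apart ratio, the binning of raw scores, the protein names, the counts, and the toy gold standard are all invented for this example, and every scored pair is assumed to be covered by the gold standard.

```python
from bisect import bisect_right


def raw_pulldown_score(together, apart):
    """Share of pull-downs containing either protein that contain both."""
    total = together + apart
    return together / total if total else 0.0


def fit_calibration(scored_pairs, same_pathway, edges):
    """For each raw-score bin, the fraction of pairs sharing a pathway."""
    n_bins = len(edges) + 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, score in scored_pairs:
        b = bisect_right(edges, score)  # index of the bin this score falls in
        totals[b] += 1
        if frozenset(pair) in same_pathway:
            hits[b] += 1
    # Bins without any pairs get None instead of a made-up probability.
    return [h / t if t else None for h, t in zip(hits, totals)]


def calibrated_probability(score, edges, table):
    """Look up the calibrated probability for a new raw score."""
    return table[bisect_right(edges, score)]


# Hypothetical counts: (seen together, seen apart) in pull-downs.
pulldown_counts = {
    ("P1", "P2"): (8, 2),
    ("P3", "P4"): (9, 1),
    ("P9", "P10"): (7, 3),
    ("P5", "P6"): (1, 9),
    ("P7", "P8"): (2, 8),
}
# Hypothetical gold standard: pairs annotated to the same pathway.
same_pathway = {frozenset(("P1", "P2")), frozenset(("P3", "P4"))}
edges = [0.5]  # two bins: raw score below 0.5, and 0.5 or above

scored = [(pair, raw_pulldown_score(t, a)) for pair, (t, a) in pulldown_counts.items()]
table = fit_calibration(scored, same_pathway, edges)
```

With this toy data, two of the three high-scoring pairs share a pathway, so a new pair with a raw score of 0.85 would be assigned a probability of about 0.67. Because every type of evidence is calibrated against the same gold standard, the resulting probabilities can be compared across evidence types even though the raw scores cannot.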
We can then take all that evidence, add it up, and transfer it by orthology, for example taking the data from mouse and transferring it to human. That way, we get the whole database, and now it's just a matter of providing access to it.
The most common way of accessing STRING is via the web interface. In the web interface, you can query for one or more proteins of interest and that way find out either how they are related to each other or which other proteins they're likely to be working together with, thereby shedding light on what the protein of interest might be doing. Importantly, the web interface has evidence viewers that allow you to dig into the underlying evidence for any interaction you see. If you're interested in working with bigger networks, I highly recommend taking a look at Cytoscape, in particular the STRING app, which ties together the Cytoscape framework with the STRING database. That allows you to do much more advanced network visualization, including visualizing your own data, for example from a proteomics experiment, onto the protein network from STRING. There are also web services if you want to make your own tools that interact with STRING, and there are bulk download files that allow you to just go and download all the data and take it from there. Everything is available under open licenses, so you're free to use it any way you want.
That's all I want to say about STRING today. I will make more in-depth presentations about individual topics later, and some of the material on networks and text mining has been covered in other presentations before. Take a look here for another related presentation.
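
As a footnote to the step of adding up the evidence for a protein pair, here is one simple way such a combination could look. It assumes that each piece of evidence has already been converted to a probability and that the pieces are independent of each other. The talk does not spell out the formula STRING uses, so this is a sketch of the general idea, and the function name is made up for this example.

```python
from math import prod


def combine_evidence(probabilities):
    """Probability that at least one independent piece of evidence is right."""
    probabilities = list(probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p!r}")
    # The association is missed only if every piece of evidence is wrong.
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical calibrated probabilities for one protein pair,
# for example from co-expression, pull-downs, and text mining.
combined = combine_evidence([0.4, 0.3, 0.0])
```

The combined value is never lower than the strongest single piece of evidence, and evidence with probability zero leaves it unchanged, so several weak but independent observations can together support an association that none of them supports well alone.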