Thank you very much. So thanks for the invitation, thanks for the kind introduction, and I'm happy to see that so many people survived last night; I'm hoping that my voice also survived. What I'll be talking about today is network biology: how we use networks to combine modern-day omics data with the literature, which is really important for interpretation.

So firstly, why networks? I mean, this whole meeting is about networks, but why is it we actually want to use networks? Very generally, broadly speaking, whenever you have many things, you care about their interplay. Networks are a really powerful abstraction, and one that allows you to easily integrate large data sets of very different types. It allows you to represent knowledge in a structured way, and it lends itself really nicely to visualization of that data. And of course, data visualization is really important for being able to explore data and discover new things.

So why network biology in molecular biology? Well, nowadays we have lots of different omics technologies, and these technologies really allow us to measure everything, so to speak, in one go: measure all the genes, all the transcripts, all the proteins, all the metabolites. That's of course lovely, it gives us vast amounts of data, but there's a real challenge when it comes to taking these data and somehow interpreting them and making sense of them. How do you actually discover something when you have your gigantic matrix of numbers? The key is that you have to somehow connect what you're seeing in your samples to what we already know, which generally speaking is the scientific literature. So we need to somehow connect things to the literature to help us understand biology better based on the omics data, to discover new biomarkers, to prioritize drug targets. All of these are things that I've been involved in working on with networks. So that's why we do network biology, which is really the core topic of my research group.

Now, where do we get networks from? One place is the STRING database. Just to have an idea, how many people in the room have already heard about STRING? Okay, good, then I'll keep this part short. As you know, it's a database where we start, in version 12 of STRING that we just released recently, from a collection of 12,535 genomes, which is fewer than in the previous version because we now do things in a smarter way and don't need as many genomes, encoding 59.3 million different proteins. And our very modest goal is to take all of these and connect them, both in terms of physical protein interactions, meaning which proteins actually bind to each other and form protein complexes, and also in terms of broader functional associations, which in a hand-wavy kind of way means proteins that somehow work together, maybe by being in the same pathway or something like that. And we are one of the so-called core biodata resources, meaning it's a database like UniProt where the idea is basically that if STRING were to go away, a lot of people would be in trouble. That's been recognized by ELIXIR and by the new Global Biodata Coalition. Now, the reason why they acknowledged us as being such a resource is that it's a heavily used resource: we currently have approximately 30,000 users per week from around the world.
So it's something where, you know, you don't need a monitoring service to tell you if your website goes down; you're going to find out in your inbox real fast. And we would of course like to think that the reason why we have so many users is that STRING is, as the slogan goes, probably the best network database in the world. I had to make the Carlsberg reference work here; I don't really agree, I like German beer better.

The real reason why STRING is good is that we integrate many different types of evidence. If you want the most comprehensive network, you kind of have to put everything together. That includes looking at things like genomic context, which basically asks: what can you do having just a collection of 12,000 genomes? Well, one thing you can do is look at so-called phylogenetic profiles. The idea, very simplistically, is that you look at the presence/absence patterns of genes, that is, of the so-called orthologs, the equivalent genes in different organisms, and you ask: do genes come and go together? If you have a situation like here, a toy example where three different genes, signified by different colors, are all either there or not there, and especially where the pattern doesn't trivially follow the species tree that you see on the left side, then it would take a lot of joint gain and loss events for this pattern to emerge by random chance. And the way we interpret that is, of course, that it didn't happen by random chance. It happened for a reason: these genes work together. If you have all of them, you're able to do something, X. If you were missing one of them, you couldn't do X anymore, meaning that the two remaining ones would serve no purpose and would be lost pretty quickly. (A toy version of this in code follows below, after this overview of the evidence types.)

That's all great for prokaryotes. But if you want to make good networks for eukaryotes, and for some weird reason most of us care mostly about higher eukaryotes, you have to bring in something more than just genomes: experimental data. That could be things like protein interaction screens. Are people familiar with AP-MS, affinity purification followed by mass spectrometry? A few people. Okay. Generally speaking, the idea is very simple. You want to measure which proteins are in complexes together. The way you do it is that you put a handle on a protein, grab the handle, and pull it down. That, of course, means you're pulling down that protein together with whatever is stuck to it. Then you throw it into a mass spectrometer and figure out what's in the mix. You generally go around and put handles on lots of different proteins, do lots of pull-downs, and based on that you can infer which things are in complexes together.

We also have what we call curated knowledge. That's not experimental data; that's more like your textbook knowledge, the things you would find in your standard molecular biology textbook when you study at university. These will be things like, you know, those complex pathway diagrams of metabolism, signal transduction, and so on. I am actually old enough to have had to learn this by heart for a biochemistry exam, and I promptly forgot it again afterwards. It exists in computer-readable databases, so as I since found out, it was unnecessary to learn it by heart. Unfortunately, even though a lot of work is put into building databases like this, and we are very thankful to the people doing that, most of what we know is not in such databases. And that's why we do text mining, because basically most of what we know about biology today is here.
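Here is the promised toy sketch of the phylogenetic-profile idea. This is a minimal illustration of my own, with an invented presence/absence matrix and a naive similarity measure; STRING's actual scoring works on orthologous groups and accounts for the species tree, which this sketch ignores.

```python
import numpy as np

# Rows = genes, columns = genomes; 1 means an ortholog of the gene is present.
profiles = np.array([
    [1, 1, 0, 1, 0, 1, 0, 0],  # gene A
    [1, 1, 0, 1, 0, 1, 0, 0],  # gene B: comes and goes together with A
    [1, 0, 1, 1, 1, 0, 1, 1],  # gene C: unrelated presence/absence pattern
])

def jaccard(p, q):
    """Fraction of genomes carrying either gene that carry both."""
    either = np.logical_or(p, q).sum()
    return np.logical_and(p, q).sum() / either if either else 0.0

print("A vs B:", jaccard(profiles[0], profiles[1]))  # 1.0 -> candidate functional link
print("A vs C:", jaccard(profiles[0], profiles[2]))  # 0.25 -> essentially no evidence
```

In the real data, a shared pattern counts for much more when it cannot be explained by the species tree alone, since that is when joint gains and losses are unlikely to be chance.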
This is a back-of-the-envelope estimate of the biomedical literature. If we naively assume that everything is indexed in the PubMed database, that the average paper is five pages long, and that we print it all out on standard 80-gram A4 paper and pile it up, you're going to get a pile over 10 kilometers tall. Now, not everything is in PubMed, and the average paper is for sure longer than five pages, so the pile is for sure more than 20 kilometers; it's probably more than 30 kilometers. It doesn't really matter, right? Because whether it's 10, 20, or 30 kilometers, the truth is we cannot read it. Plus, even if we somehow could read it, there's a new paper coming out every 30 seconds. So that's why we do text mining: out of desperation, basically.

So this, in a nutshell, is how we do it: we take all of these things and put them together. Now, there are a few problems with doing this. One is that there are many databases. There is not just one pathway database; there are dozens of them. There is not just one database of physical interaction experiments; there are multiple. Since these are different databases, they often come in different formats. And even if people have been nice and standardized the format, they are probably still using different names for the same proteins in the different databases, and even more so in the literature; there was a famous statement that biologists would rather share their toothbrush than their gene names. So we need to handle that. Another issue is that the data are of what I very politely refer to as varying quality, which is to say that some of it is complete garbage. We need to somehow handle that too, because if you just take everything, put it together, and ignore what is reliable and what is not, you're going to get a useless network.

So I'm sorry to disappoint you, but the key ingredient in solving these problems is called hard work. You know, there are a lot of file formats, and guess what, somebody has to write a lot of parsers. Things come in different formats and use different names, so you need to have mapping files. This is what I often refer to as the pipetting of bioinformatics; it's what we spend a disproportionate amount of time on.

Where things get slightly more interesting is the quality issue. How can we score quality and figure out which things are reliable and which are less reliable, even within an experiment? Let's say we have some big high-throughput physical interaction screen where we've done thousands of pull-downs, and let's say we're interested in the blue and the green protein. We have a pull-down where we tagged the blue and got the green. We have one where we tagged some other protein and got both the blue and the green in the pull-down. We have one where we tagged yet another protein and got the blue but not the green, and we tagged the green and didn't get the blue. And now what I want to do is take this evidence landscape, as we call it, and turn it into a number. There are a lot of smart people in this room; how would you turn this into a number? Any ideas? "42." 42! If we give all of them a score of 42, we have the baseline performance, where everything scores equal. Yes? "You need to give scores to all of them." Yeah, exactly. The question is what the score of this binary interaction should be, right? I mean, I could start with a very simple one: two, because we've seen them together twice. It's better than nothing. It's clear that the first two observations are positive evidence and the last two are negative evidence. The more often we see them together, the more we're going to believe it; the more often we see them apart, the less we're going to believe it. So you could take the difference; that's one option. You could take the ratio. You could make a two-by-two contingency table and say: we know how many pull-downs we did in total, we know how many contained the blue, we know how many contained the green, and we know how many contained both, so do a Fisher's exact test and get a p-value. Being a statistically inclined person, that was the first thing I tried, and it was a terrible idea. It turns out to be better to do things like the over-representation ratio: how much more often do you see them together than you would expect by random chance? You can do even better than that, but the point here is not the exact solution. The point is that you have to understand the data, be creative, and come up with a way of turning the data into some sort of number that lets us rank all the interactions coming from a certain data set, or type of data set, from what we trust the most to what we trust the least.
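To make those scoring options concrete, here is a small sketch with invented pull-down counts. The over-representation ratio and the contingency-table test are the ones mentioned above, but all the numbers are purely illustrative.

```python
from scipy.stats import fisher_exact

n_pulldowns = 5000  # total pull-downs in the screen
n_blue = 40         # pull-downs containing the blue protein
n_green = 50        # pull-downs containing the green protein
n_both = 30         # pull-downs containing both

# Over-representation ratio: observed co-occurrence vs. random expectation.
expected = n_blue * n_green / n_pulldowns   # 0.4 co-occurrences expected by chance
ratio = n_both / expected                   # 75x over-represented
print(f"over-representation ratio: {ratio:.0f}x")

# The two-by-two contingency table and Fisher's exact test (tried first,
# but it turned out to rank interactions worse than the simple ratio):
table = [[n_both, n_blue - n_both],
         [n_green - n_both, n_pulldowns - n_blue - n_green + n_both]]
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact test: p = {p_value:.2e}")
```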
The next problem is, of course, that we're going to need a completely different scoring scheme for each type of data. You can't score phylogenetic profile similarities the same way you score pull-down assays. So the next trick is calibration: making these scores comparable to each other. And the trick here is that you compare every data set to a gold standard, a common reference, which in our case is KEGG pathways. It's not important; we could use something else. In fact, for physical interaction networks we use complexes from the Complex Portal, but for functional associations we use KEGG. Now imagine that you're looking at two proteins, and we restrict ourselves to proteins that are actually on KEGG maps. Then, when you have two proteins, either they're in the same pathway or they're not. And that means that if I score things somehow, I can look at everything scoring between 1 and 1.1 and ask: how often are these pairs in the same pathway? The answer is about 14% of the time, which tells me that a score between 1 and 1.1 is not very good. I can look at the pairs scoring between 2 and 2.1 and see that it's something like 80% of the time that they are in the same pathway, which tells me that's pretty good. I do that for a lot of score intervals, I get a point cloud like this, I fit some calibration function through it, typically a sigmoid, and now I have a calibration curve. And I can go in and take two proteins that might not be on KEGG maps, calculate the raw quality score, say it's 1.7, and read off this chart what 1.7 means: about a 50% chance of these two proteins working together in a pathway. Of course, for different types of data I'm going to have completely different scores on the x-axis and completely different calibration curves. But at the end of the day, I have now turned everything into the probability of two proteins working together in a pathway given this one piece of evidence, whatever it is. And then you can start probabilistically putting all the evidence together and make it work.
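As a rough sketch of that calibration step, assuming a set of benchmark pairs with raw scores and a KEGG-style same-pathway label (both simulated here; the real curve is of course fit to actual benchmark data):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
raw = rng.uniform(0.5, 2.5, 20_000)          # raw quality scores of benchmark pairs
hidden = 1 / (1 + np.exp(-4 * (raw - 1.5)))  # simulated ground truth for this toy
same_pathway = rng.random(20_000) < hidden   # does the pair share a KEGG map?

# Bin the scores and compute the fraction of same-pathway pairs per bin.
edges = np.arange(0.5, 2.6, 0.1)
idx = np.digitize(raw, edges)
mids = np.array([raw[idx == i].mean() for i in range(1, len(edges))])
fracs = np.array([same_pathway[idx == i].mean() for i in range(1, len(edges))])

# Fit a sigmoid calibration curve through the point cloud.
def sigmoid(x, a, b):
    return 1 / (1 + np.exp(-a * (x - b)))

(a, b), _ = curve_fit(sigmoid, mids, fracs, p0=[1.0, 1.5])
print(f"raw score 1.7 -> P(same pathway) ~ {sigmoid(1.7, a, b):.2f}")

# Once every channel is calibrated to a probability, evidence can be combined
# probabilistically; in the simplest (noisy-OR) form:
print(f"0.5 and 0.7 combine to {1 - (1 - 0.5) * (1 - 0.7):.2f}")
```

The noisy-OR line is the simplest form of probabilistic combination; the actual STRING scoring also corrects each probability for the prior before combining, which this one-liner glosses over.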
So that is STRING in a nutshell. One of the big things we do is the text mining: how can we take the literature and turn it into a network? The goal here, very broadly, is to structure our knowledge by text mining the literature. That could be just linking proteins to each other, like in STRING, but it could also be linking genes to diseases, linking drugs to targets, anything you can think of. And I guess it hasn't escaped anybody's notice that things have happened in the past couple of years, and you might even recognize this. And yes, we are using transformer-based models to do text mining these days, like everybody. But we're not using ChatGPT; we're not using a GPT model. Those are generative models that are very good at spitting out and producing text. What we are using are models from the BERT family, more specifically a large biomedical RoBERTa model, which are better suited for doing information extraction from text; they're better at retrieving things.

The idea of these models, in case you're not familiar with them, is that you have pre-trained models: you take an enormous, huge, unlabeled corpus (corpus just meaning a body of text) and you train on that. And the way you train on it is that you simply leave out some of the words. You leave out, say, 15% of the words and train the model to guess the missing words. Based on that, it basically learns the language. It learns English; it learns biomedical English if you train it on biomedical literature; and now it has a pretty good idea of how things work in terms of the way we write. Then we take that and fine-tune the model for the specific task we want to do. In our case, we want to pull out various kinds of interactions between proteins. And the way you do that, getting back to the topic of hard work, is that you make a manually annotated corpus. So we took something like 2,000 abstracts (we're in the process of extending it to 3,000) and manually annotated all of the interactions in those. And then you fine-tune the model on that.

We've done that for physical protein interactions. The model can, of course, pick up patterns like "something binds to something", and it can handle those in much more complex sentences too; but it can also pick interactions up when they're written in completely different ways, like "the something-something complex". So it will learn all the different kinds of patterns, and it can deal with long, complicated sentences pretty well, because it doesn't have to learn to understand those sentences from our 2,000 annotated abstracts; it already knows English before we start. We're also doing regulatory interactions, and those have direction. That's going to be one of the big improvements in the future of STRING: not just having undirected interactions, but having directed interactions. It's not the same thing that A regulates B as it is that B regulates A. We're going to have sign, so positive versus negative regulation, and we're going to have mechanism, so we can see whether something is regulated by phosphorylation, regulated at the expression level, all of those kinds of things. And what we get is excellent performance. It is really remarkable how good a model you can get with just 2,000 abstracts, which, in the world of manual annotation, is actually not that bad.
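The masked-word pretraining objective is easy to see in action with the Hugging Face transformers library; here with a generic English BERT rather than the biomedical models actually used (the model is downloaded on first run):

```python
from transformers import pipeline

# Pretraining objective in miniature: the model guesses a masked-out word.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The kinase [MASK] the substrate protein."):
    print(f"{pred['token_str']:>12}  p = {pred['score']:.3f}")
```

Fine-tuning for relation extraction then replaces this word-guessing head with a classification head that, given a sentence with a pair of tagged entities, predicts the interaction type; that is the part trained on the manually annotated abstracts.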
However, that means that we're now left with some completely different challenges when it comes to mining the literature. I think the biggest problem, now that we have language models, is no longer getting the computer to correctly read the text and extract from the text what the text says. Instead, the main challenges are things like publisher paywalls: you can't mine what you don't have access to, and you need the text to mine the text. We also need to worry about literature quality; newsflash, not everything that has been published is correct. You might have heard of things like paper mills, which mass-produce fake papers. With this kind of literature, the model is very, very good at extracting from the text what the text says; but if you want to build a good knowledge graph representing human knowledge within a field, you don't want to pollute it with junk where the model correctly extracts what the text says, but what the text says just isn't true. And then there's the big issue of study bias. The literature is, of course, limited in that it contains what we know, and what we know is based on what we've studied. So if you make a network purely by text mining the literature, then no matter how good your access is, no matter how good your models are, and no matter how good you are at identifying fake papers, you will have no interactions for understudied proteins. So all the dark targets out there that you might be interested in will have no interactions, because if we haven't studied them, we haven't written about them.

And that leads me to the next topic, because if we want interactions for understudied proteins, there's no way around going to big omics data and asking how we can make networks from these data, which are more unbiased in the sense that they don't care whether we're interested in a certain protein: if it's there, we're measuring it. So we want to make an unbiased network. The idea is to start from systematic, unbiased data, especially things like single-cell RNA-seq data, where we have measured transcript levels across many hundreds of thousands, if not millions, of different cells, and based on that somehow infer a network. We also use things like mass spectrometry-based proteomics, where you have similar kinds of data for proteins, although not at the single-cell level yet. And the question is how we can take these kinds of data, where basically each gene has a very high-dimensional vector, say a vector with 600,000 dimensions representing 600,000 different cells, and make a network.

Can we do that? Of course, people have done this kind of thing for a long time; it's called a co-expression network. The problem is that if you actually took those co-expression networks and benchmarked them, they gave very poor results. Even the fairly highly correlated pairs often seemed to have nothing to do with each other. So what's the problem, and why doesn't it work? I believe there are two main problems here. One is redundancy. When you have single-cell data in particular, you are going to have many, many, many cells that are very similar. And if you think about it mathematically, it's even worse, because things are imbalanced: you might have a thousand cells of this type for every one cell of that type. If you just calculate something like a Pearson correlation coefficient across that, you're saying that this type of cell is a thousand times more important than that type of cell, which makes no sense from the standpoint of biology.
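A quick numerical illustration of that imbalance point, with made-up numbers: two genes that are both specific to a rare cell type, and hence perfectly co-regulated at the cell-type level, score far lower when one cell type swamps the other.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_marker_genes(n_common, n_rare):
    """Two genes, both expressed only in the rare cell type (plus noise)."""
    g1 = np.concatenate([rng.normal(0, 1, n_common), rng.normal(5, 1, n_rare)])
    g2 = np.concatenate([rng.normal(0, 1, n_common), rng.normal(5, 1, n_rare)])
    return np.corrcoef(g1, g2)[0, 1]

print(f"balanced   (1000 vs 1000 cells): r = {two_marker_genes(1000, 1000):.2f}")  # ~0.86
print(f"imbalanced (1000 vs   10 cells): r = {two_marker_genes(1000, 10):.2f}")    # ~0.2
```

The shared biology is identical in both cases; only the cell-type proportions differ, and plain Pearson correlation weights cells by their counts rather than by their biology.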
The other problem is sparseness. When you're doing single-cell sequencing, you are not going to see all the transcripts that exist in a given cell; you're not going to successfully measure all of them. So you're going to have a ton of missing values. And that's something we've recently been addressing with a method called FAVA: Functional Associations using Variational Autoencoders. The trick is very simple. As the name implies, we take variational autoencoders and apply them to these very high-dimensional data, to go from the hundreds of thousands of dimensions down to a low-dimensional latent space, as it's called. That is effectively data compression. And by doing data compression, you're able to address both problems. Because when you compress data, how do you manage to do it without losing a lot of information? The answer is: by eliminating redundancy. The other thing is that by compressing down to fewer dimensions, effectively averaging over several cells, you also get rid of the missing values. So this suddenly gives you a much better starting point. We basically just do that, and then we calculate correlations in this latent space, because it's presumably much better than what we started with. So this is the approach, very simply: you take one data set or another, you train a variational autoencoder, you sample the latent space, you calculate correlation coefficients, you use those as your ranking, and you benchmark it the same way I talked about earlier.

This gave the kind of leap in performance where, when you first see it, your immediate reaction is that you must have done something wrong. The dotted and dashed lines are what you get on some data sets when you just run Pearson correlation in the original space. The solid lines are what you get when you take the exact same data and just calculate Pearson correlation coefficients in the latent space. And when you look at the performance gap, especially between the two green lines, it's just ridiculous. This is a world of difference. And the nice thing is that these data don't care about whether something is understudied. So we get links for more than a thousand understudied human proteins, more than 4,000 links for those proteins, and produce this gigantic network that we've already made available. It's also included in STRING version 12 that we released recently; in fact, it is the new co-expression channel of STRING these days.
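Here is a minimal sketch of the compress-then-correlate idea in PyTorch, treating each gene's expression profile across cells as one input vector. This is my own toy reconstruction of the approach as described in the talk, with random stand-in data, not the actual FAVA implementation.

```python
import numpy as np
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Toy variational autoencoder over gene profiles (features = cells)."""
    def __init__(self, n_cells, n_hidden=128, n_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_cells, n_hidden), nn.ReLU())
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_cells))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

x = torch.rand(500, 2000)  # 500 genes x 2000 cells of random stand-in "expression"
model = VAE(n_cells=2000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    recon, mu, logvar = model(x)
    # Reconstruction loss plus KL divergence to the standard normal prior.
    loss = ((recon - x) ** 2).sum() \
           - 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    opt.zero_grad(); loss.backward(); opt.step()

# Rank gene pairs by correlation in the compressed latent space.
with torch.no_grad():
    _, mu, _ = model(x)
latent_corr = np.corrcoef(mu.numpy())  # (500, 500) gene-by-gene matrix
print(latent_corr.shape)
```

The key design point is that the correlation is computed between the genes' latent representations, not their raw profiles, so redundant cells no longer dominate and dropouts are smoothed over.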
The last thing I want to talk about is network visualization. The most common use case of the STRING database is definitely to make visualizations. Specifically, what happens is this: somebody has done an omics study, they have a long, boring list of significantly regulated genes or proteins, and they want a pretty figure. So they take the aforementioned long list, paste it into STRING, and get this. And I'd just like to point out that I am not showing this to make fun of the authors of this paper. I have no reason to do that, because a couple of years later somebody copied it. And I don't want to make fun of those authors either, because a couple of years later somebody copied that too. It's clearly not getting better. This is what is known in the field as a ridiculogram. And the worst part is that most of these are made with STRING, which means that I feel partially guilty for them. So of course, the only thing you can do in that situation is ask: how can we make people stop doing this? And that basically means: how can we make it easier to make good figures?

That's where we started working together with the Cytoscape team. Have people used Cytoscape? Yes, excellent. So Cytoscape is a powerful graphical tool that allows you to do network visualization and also network analysis. And together with one of the core Cytoscape developers in San Francisco, John "Scooter" Morris, we developed the Cytoscape stringApp. It basically does what it says on the tin: it takes STRING and puts it into Cytoscape. You can do all the kinds of queries you would expect to be able to do. You can query with your long list of proteins, retrieve the STRING network, and get it into Cytoscape, where you can now do things like network clustering to cut the network up into smaller, meaningful modules. You can use enrichment analysis to functionally annotate those clusters so that you know what they are. You can import the omics data you have, so that instead of just having a network where everything has arbitrary, meaningless colors, you can actually color things based on the data. So you map the data to visual properties, using things like a color gradient to color the nodes by what's up-regulated and what's down-regulated, and that way visualize your data on the network and produce these kinds of figures instead, which I hope you agree are more meaningful. So this hopefully leads to people making much better network figures instead of just dumping things into STRING and taking the default network.
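For completeness, here is roughly what that workflow looks like when driven from Python with the py4cytoscape automation bridge. It assumes Cytoscape is running with the stringApp installed (and clusterMaker2 for the clustering step); the gene list and fold-change values are invented, and the command parameters follow the stringApp automation docs but may vary between versions.

```python
import pandas as pd
import py4cytoscape as p4c

p4c.cytoscape_ping()  # make sure Cytoscape is reachable

# 1. Query STRING with the (invented) list of regulated genes.
genes = ["CDK1", "CCNB1", "PLK1", "AURKB", "BUB1"]
p4c.commands_run(
    f'string protein query query="{",".join(genes)}" species="Homo sapiens" cutoff=0.7'
)

# 2. Cluster the network into modules with clusterMaker2's MCL.
p4c.commands_run("cluster mcl")

# 3. Load omics data onto the nodes and map fold change to node color.
data = pd.DataFrame({"query term": genes,
                     "logFC": [1.8, 2.1, -0.5, 0.9, -1.2]})
p4c.load_table_data(data, data_key_column="query term",
                    table_key_column="query term")
p4c.set_node_color_mapping("logFC", [-2, 0, 2],
                           ["#2166AC", "#F7F7F7", "#B2182B"])
# (If the colors don't show, the mapping may need to target the STRING
# network's visual style explicitly via the style_name argument.)
```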
So with that, I want to acknowledge a lot of people. The STRING database is a long-running collaboration. It all started in Peer Bork's group at EMBL Heidelberg, where Christian von Mering was the staff scientist in the group when I arrived as a postdoc. I took over his job when he started his group in Zurich, and then I started my group in Copenhagen, and it's been a long-running collaboration between the three labs for close to 15 years. One of the absolute key people in that is Damian Szklarczyk, a former PhD student of mine from Poland who did an excellent job and then went from my lab, after his PhD, to Christian's lab in Zurich, where he continues to be one of the core developers. Rebecca Kirsch has been doing a lot of work on reworking how we interpret the experimental data. Mikaela Koutrouli is a Greek PhD student of mine who has been doing all the work on FAVA; it's really her brainchild. Katerina Nastou, another Greek in my group (I tend to attract Greeks at the moment, I don't quite know what happened), is the key person behind the text mining with the language models to extract information. Lots of other people have contributed over the years. On the text mining, besides Katerina, it's also a collaboration with Sampo Pyysalo's and Filip Ginter's groups in Finland. On FAVA, we've collaborated with Simon Rasmussen's group and Lennart Martens's group as well. And on the Cytoscape stringApp, a bunch of people contributed, but most crucially John "Scooter" Morris.

There's one more thing, even though I'm past the acknowledgments here. You might have seen a product like this and thought: wouldn't it be really neat if one could walk into the networks and look at them in 3D? Well, we've been coding that kind of stuff, with even hand tracking and everything, to be able to look at the networks. And in case people think this is so cool that they have to play with it, I actually brought the VR headset. And if you haven't heard enough of me, I have a YouTube channel. Thank you.