Can you briefly confirm whether you can see my screen?

Yes, you're all set. We see you and your slides.

Perfect. Thank you so much.

Take it away.

So of course, it's a really great pleasure to be here at yet another fantastic Bioconductor conference. I'll be speaking about protein-protein interactions from the BioPlex project and how we are working with them in R and Python. Before I begin, I'd like to especially acknowledge Roger and Tyrone, who really did a lot of work on the Python side.

All right, so I think many of us are familiar with protein-protein interactions and know that there are many good reasons for studying them. Just to mention a few: knowing a protein's interactors can provide insight into the protein's biological function, place the protein within biological pathways, and reveal mechanisms underlying biological processes, as protein complexes form and underlie many aspects of cell biology. And similar to NGS technologies, the throughput of experimentally detecting protein-protein interactions has also substantially increased over the last years, from individual experiments targeting only one protein and its interactors up to proteome-wide screens where you are able to detect the interactors of thousands of proteins.

There are many experimental techniques for detecting protein-protein interactions. The two main techniques are yeast two-hybrid screening, which I think many of us have heard of at one point or another, and affinity purification mass spectrometry (AP-MS), which is what the BioPlex project makes use of. The basic idea of how this works: you express a library of exogenous bait proteins in human cell lines, these bait proteins carry an affinity tag, and then you pull out the bait proteins via immunopurification and check which proteins actually bind to them.
We call the proteins that bind to the bait proteins you introduce prey proteins. You can then apply mass spectrometry to get the identity of these prey proteins, ending up with each bait protein and the prey proteins that interact with it. Although this technology has been established for around 25 years, the BioPlex project has really taken it to another level and applied it at proteome scale.

Looking a little bit at the project: it started back in 2015 with the first Cell publication, where they investigated a human embryonic kidney cell line, HEK 293T, and targeted around 2,500 bait proteins. Over the years, the project has scaled up to basically half the proteome, around 10,000 baits. And as you can see, BioPlex has also started to expand to other cell lines. In the latest publicly available version, it is now also available for a human colon cancer cell line, HCT116. Behind the scenes, BioPlex has also started investigating other cell lines, and the plan is to scale up to almost the full human proteome. Accordingly, you have here up to 40,000 individual experiments that have been conducted.

Comparing BioPlex to previous efforts and existing databases such as STRING and BioGRID, you see that there is a lot of new material you can learn from. When you look at the number of proteins and the number of interactions studied, the individual BioPlex networks for the two main cell lines, and also the combined network, are of a different order of magnitude than previous efforts. And when you compare to existing databases such as BioGRID, which report low-throughput or high-throughput PPIs, there are a lot of interactions that are not in current databases.
So that makes BioPlex, of course, a very valuable resource that also enables computational discovery. And this is where we started to think about how to make BioPlex programmatically accessible from within R and Python, how to represent BioPlex data and efficiently manipulate it, and how to connect it to downstream applications and domain-specific packages in R and Python.

This is what we did here, and I'm going to decompose this figure panel by panel. The main aspects are importing the networks in either R or Python and then enabling a number of analyses with regard to protein complexes and protein domains. With the recent interest in protein structures following the AlphaFold revolution, we also have a lot on protein structures, and of course integration with omics data. The packages that we ended up implementing for accessing BioPlex and for analyzing BioPlex data together with other data sources are available on Bioconductor and PyPI.

Let's start by looking into the import piece. One thing that we need to realize, and that I hope I introduced clearly enough, is that there are two types of proteins here: these full circles are the bait proteins, the exogenous proteins that we introduce into the cell, and then there are the prey proteins, which are endogenous proteins that bind to the bait proteins. If you take these little network motifs and decompose them, you end up with what you see, simplified, in your data. So you have an interaction from ALDOA to ALDOB and also from ALDOA to ALDOC. On the other hand, you might also have reciprocal interactions, here between E03 and E02, where one is the bait and the other is the prey, and vice versa.
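As an editorial illustration (not code from the talk or from the BioPlex packages), the bait-to-prey records described above can be held as a directed edge list. The sketch below uses only the Python standard library. The function names `build_directed_graph` and `reciprocal_pairs` are hypothetical, and the identifiers are the ones spoken in the example.

```python
from collections import defaultdict

# Hypothetical bait -> prey records, one per detected interaction.
# "E02" and "E03" are kept exactly as spoken in the talk.
interactions = [
    ("ALDOA", "ALDOB"),
    ("ALDOA", "ALDOC"),
    ("E03", "E02"),
    ("E02", "E03"),
]


def build_directed_graph(edges):
    """Return an adjacency mapping: bait -> set of preys."""
    graph = defaultdict(set)
    for bait, prey in edges:
        graph[bait].add(prey)
    return dict(graph)


def reciprocal_pairs(graph):
    """Return protein pairs that were detected in both directions."""
    pairs = set()
    for bait, preys in graph.items():
        for prey in preys:
            if prey != bait and bait in graph.get(prey, set()):
                pairs.add(frozenset((bait, prey)))
    return pairs


graph = build_directed_graph(interactions)
baits = set(graph)                     # proteins that were used as bait
preys = set().union(*graph.values())   # proteins that were pulled down
```

Keeping the direction makes it possible to tell later whether a protein appeared as bait, as prey, or as both, which matters for the bait and prey effects discussed further on.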
In addition to these interactions, where one protein is a bait and the other is a prey, you might have a confidence score, an interaction probability that reflects how confident you are that this is an interaction beyond background binding levels. This is the file that you start by importing into either R or Python. You typically start with some sort of data frame object, and then, in order to unlock graph algorithms, you turn it into a Bioconductor graph structure or a Python graph structure. Of course, there are other graph structures around that we didn't use here in the first place, igraph being a very popular one, but the good thing is that there are conversion functions. So it is a single function call to turn this graphNEL object or this NetworkX object into an igraph object.

As we are working with bigger data, we also started to experiment a little with an Apache Spark backend, which allows data chunking, so you have on-disk data storage, and implicit parallel computation. You can spin that up. That is the computational side of things. From the end user's side, the GraphFrames framework is quite nice and intuitive because it moves away from a somewhat bulky graph API to a data-frame-based API for graphs. You just have a data frame for the node data and a data frame for the edge data, and everything else is abstracted away from the user. And you have all kinds of graph algorithms that can be executed performantly, including something like Google's PageRank, which is quite helpful when you do something like network propagation of disease association scores.

So much for the import piece. A first thing that you typically want to do is check how the PPIs in your networks overlap with known protein complexes. For that, there are of course comprehensive databases.
Here we are importing protein complexes from the CORUM database. Then you can ask: how do you represent these complexes? You can represent them the way we have it down here, as a list where each complex is a fully connected graph instance. So you draw an edge between every pair of subunits of a complex and can then check: do you actually see all of these edges in your PPI network?

A certain amount of overlap is of course expected just by chance. And as we are increasing in versions and also comparing between cell lines, we typically also want to do some sort of statistical assessment. One thing that we implemented here is a random sampling test, where we repeatedly draw from the network the same number of nodes as in the complex and check whether the number of edges connecting this set of random nodes is bigger or smaller than the number we actually observe for that complex. And there are certain covariates that you likely want to account for. An obvious one is node degree. Others are inherent to the bait and prey issue that we have with AP-MS: you also want to account for the number of subunits and the bait-to-prey ratio in the complex, because the more baits you have, the more interactions you expect. This testing strategy can of course be generalized to other gene or protein sets of interest. For example, we might be interested in overlap with pathways or other functional gene sets.

We would also like to understand whether there are certain structural features that mediate these PPIs. For example, what about protein domains? Protein domains are parts of protein sequences that fold independently into three-dimensional structures, and those could of course be features that help explain which PPIs occur. Here we are also importing from existing databases.
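The random sampling test for complexes described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in the packages: edges are treated as undirected, random nodes are drawn uniformly, and the covariates mentioned in the talk (node degree, bait-to-prey ratio) are not accounted for. The function name `sampling_test` and the toy network are hypothetical.

```python
import random
from itertools import combinations


def sampling_test(complex_subunits, edges, n_draws=1000, seed=0):
    """Compare edges observed within a complex against random node sets.

    Returns (observed_edge_count, empirical_p_value). The p-value is the
    fraction of random node sets of the same size with at least as many
    edges, using a +1 correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    universe = sorted({node for e in edge_set for node in e})
    # Only subunits that are present in the network can be assessed.
    subunits = sorted(set(complex_subunits) & set(universe))

    def n_edges(nodes):
        return sum(1 for pair in combinations(nodes, 2)
                   if frozenset(pair) in edge_set)

    observed = n_edges(subunits)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(universe, len(subunits))
        if n_edges(sample) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)


# Toy network (assumption): one fully connected 4-subunit complex
# plus a sparse ring of 36 unrelated proteins.
complex_subunits = ["P0", "P1", "P2", "P3"]
edges = list(combinations(complex_subunits, 2))
ring = [f"P{i}" for i in range(4, 40)]
edges += [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]

observed, p_value = sampling_test(complex_subunits, edges)
```

A version that respects the covariates would draw the random nodes so that they match the complex in degree distribution and in the number of baits, rather than uniformly.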
The main knowledge on protein domains has been collected since the early days of bioinformatics in Pfam. We don't have much protein and protein structure information in Bioconductor, but I found that the good old PFAM.db annotation package was quite helpful to annotate each protein in the network with domain information. And then again, you can do some sort of enrichment test. Here we're doing a basic two-by-two contingency table Fisher's exact test, with all its benefits and disadvantages. You ask: how many PPIs connect the two domains that you're studying? How many PPIs involve either one of the two domains? And how many PPIs involve neither of the two domains? Then you test the number of PPIs connecting both domains based on the hypergeometric distribution. We are visualizing that with alluvial plots, and you see here that there are around 300 interactions connecting the HSP90 domain with the Pkinase domain.

Now we have talked about complexes and domains and have checked for interactions, but another way to corroborate whether these interactions are indeed happening is to look at structural data. For example, within a complex we would like to know whether the subunits are actually physically close enough to each other to interact. If we look here at a certain protein complex, this CORUM complex, we can see that this particular subunit may have a hard time interacting with this other subunit, as they are simply too far apart in 3D space. To enable this kind of analysis, we are importing structural data from the Protein Data Bank. We don't have much support for that in Bioconductor, but there are some really awesome packages on CRAN such as bio3d or r3dmol, which I highly recommend.
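The domain-pair enrichment test described above (a two-by-two table evaluated with the hypergeometric distribution) can be written out as a short sketch. It assumes a one-sided test for enrichment, which the talk does not specify, and the function name `fisher_exact_greater` and all counts are hypothetical placeholders rather than BioPlex results.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    a: PPIs linking domain A on one partner to domain B on the other
    b: PPIs with domain A but no domain B partner
    c: PPIs with domain B but no domain A partner
    d: PPIs involving neither domain
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n = a + b + c + d
    row1 = a + b   # all PPIs involving domain A
    col1 = a + c   # all PPIs involving domain B
    denom = comb(n, col1)
    k_max = min(row1, col1)
    numer = sum(comb(row1, k) * comb(n - row1, col1 - k)
                for k in range(a, k_max + 1))
    return numer / denom


# Made-up counts, for illustration only.
p_small = fisher_exact_greater(3, 1, 1, 3)
p_strong = fisher_exact_greater(30, 70, 120, 9780)
```

Because `math.comb` works on exact integers, the sum stays exact until the final division, which keeps the sketch usable for network-sized counts.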
Then you can calculate distances between atoms of different chains of such a protein complex. We're applying a threshold-based approach: if there are atoms from two different chains that are closer than a certain threshold (we're taking six angstroms as a default, but that is a parameter of the function), we infer a direct interaction within that complex; otherwise these are indirect interactions. Then you can go back to what we did for the protein complexes before and check whether these structurally inferred interactions match the ones you found when looking at the complex by itself.

The last piece that we looked at was integration of the data with omics data. In the beginning we focused on transcriptome and proteome data, but we have lately also started to bring in copy number variation data and alternative splicing data. On the transcriptome side, we pulled RNA-seq data specifically for these two cell lines, the HEK 293 cells and the human colon cancer cells, and for the proteome we imported a data set accompanying the BioPlex 3.0 publication, which compares both cell lines with mass spectrometry. We represent these data sets with well-known data structures in Bioconductor or in Python, which are also interchangeable, such as SummarizedExperiment and AnnData. You can then do a bunch of things with this data, and I'm just pointing out two.

One goes a little into the quality control side of things. You can start assessing variability in your AP-MS experiments as a result of bait and prey expression. Of course, the more a prey is expressed in the cell, the higher the chance that it shows up as an interactor. And the same is to a certain extent true for baits.
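Returning briefly to the structure-based step described above: the distance-threshold rule for calling direct interactions between chains can be sketched in a few lines. This is an illustration only. It assumes atom coordinates have already been parsed from a PDB entry into a mapping from chain ID to coordinates in angstroms; the function name `chain_contacts` and the toy coordinates are hypothetical, and only the 6 angstrom default comes from the talk.

```python
from itertools import combinations
from math import dist


def chain_contacts(atoms_by_chain, cutoff=6.0):
    """Return chain pairs with at least one atom pair closer than `cutoff`.

    atoms_by_chain: dict mapping chain ID -> list of (x, y, z) tuples.
    Chain pairs that are returned are treated as direct interactions;
    all other pairs within the complex count as indirect.
    """
    contacts = set()
    for c1, c2 in combinations(sorted(atoms_by_chain), 2):
        close = any(dist(a, b) < cutoff
                    for a in atoms_by_chain[c1]
                    for b in atoms_by_chain[c2])
        if close:
            contacts.add((c1, c2))
    return contacts


# Toy coordinates (assumption): chains A and B touch, chain C is far away.
coords = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(4.0, 0.0, 0.0)],
    "C": [(30.0, 0.0, 0.0)],
}
direct = chain_contacts(coords)
```

The all-against-all atom comparison is quadratic in the number of atoms; for large structures a spatial index would be the usual replacement, but the thresholding logic stays the same.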
What I'm showing here is a basic observation: we're doing a differential expression analysis between the two cell lines on the transcriptome level but also on the proteome level, and you see that the two are quite reasonably correlated. Then you can ask: when you now go to your networks and take, for example, such a fold change as a score, are there certain parts of the network that aggregate these scores and indicate that there is a lot of difference in this part of the network between the two cell lines? Often enough, after this maximum-scoring subnetwork analysis, you still end up with quite a big and bulky high-scoring subnetwork, so you can apply something like gene set enrichment analysis to identify themes within these modules that aggregate a lot of score. We have also put together a little R Shiny graph viewer so that, once you have identified such gene sets or high-scoring subnetworks, you can overlay different metadata on the nodes or on the edges in order to explore them further.

With this I'm pretty much at the end of my talk. Thank you so much for your attention. I will be happy to answer any questions, and I just wanted to point out that all of this is available: the BioPlex package, which is available on Bioconductor, the BioPlexPy package, which is available on PyPI, and then the various analyses that I showed for protein complexes, protein domains, and protein structures, but also integration with omics data, are in this GitHub-only repo here, the BioPlexAnalysis repo. Thank you very much.

Thank you. Thank you to all the speakers. Does anyone have any questions? We can start with one from the online chat. This is for Ludwig. Ryan Thompson asked: given the asymmetric bait-prey relationship, is the resulting PPI graph a directed graph?

This is correct.
And you indeed also store for each node, in the node metadata, whether it's a bait or a prey, in order to enable the kind of analysis that I mentioned in the very last part, where you may want to check whether there is a relationship between expression and showing up in interactions.

Thanks for the great presentations, all three of you. Ludwig, if I had a new isoform that is not part of the annotation, could I use BioPlex to tell me what the structure will be and how that will change which proteins it could interact with?

Well, if you have a new isoform and you want to know the structure, then I'd recommend going to AlphaFold, right? This is the new tool that came out of Google DeepMind, where they used deep learning to kind of solve this protein folding problem, and this will really be the tool here. When it comes to finding the interactions, you could... so you're saying this is an isoform of a protein that already has isoforms in BioPlex for which we know interactions. Is this what you were saying?

Yeah, let's say a completely uncharacterized protein. No, let's say it already exists, and with AlphaFold maybe you get the new structure.

Yeah, with AlphaFold you would get the structure, and with BioPlex, what I think you would do is check other isoforms of this protein and get their interactors. Then, if you would like to know something about the function of this isoform, you could do something like guilt by association, where you check which proteins these other isoforms interact with and what kind of functions those proteins have, if that makes sense.

Cool, thank you. And I have another one for Paola. I don't know if I'm pronouncing your name correctly. Sorry.

Yes.

So my understanding is that decoupleR is based on this database of transcription factor perturbation experiments.
So how generalizable is it if what I'm studying was not part of those perturbation experiments? In my case, let's say I'm working with human brain data. How much can I trust the results, and how will I know if it's maybe guiding me down the wrong path? Is there any failsafe?

Yeah, good question. The perturbation experiments were the benchmark data set that we used, but those are not included in decoupleR. decoupleR is agnostic to the kind of data: you come with your data, and then you can fetch prior knowledge and use different methods. The methods, as I showed, are quite similar. Some of them seem to perform better or worse depending on this benchmark data, but what's really important is that the prior knowledge you use is correct. And this is not an easy question to answer, so it really depends on, for example, how you evaluate these gene regulatory networks. This is something that we have started looking at in the lab, but there is no clear answer to this.

So there's another online question for Ludwig. Does BioPlex allow us to study protein interactions in light of post-translational modifications?

This is a great question. Off the top of my head, I would say no, because the bait proteins that you express come from a library, so they are typically the canonical sequences without post-translational modifications. If you could, then it would most likely be on the prey side, but to be honest I haven't seen it in any of the papers, so I haven't embarked on that at all. It's a great question that I could bring to Edward Huttlin, the PI of the BioPlex project. Thanks for this question.

Any other questions? All right, well, let's thank all three speakers one last time.