Welcome, all, to the virtual SIB Computational Biology Seminar Series. Today we have the pleasure of hosting Lars Malmström, a group leader and life science specialist at the Service and Support for Science IT, called S3IT, of the University of Zurich. I will briefly go through your bio. A trained chemical engineer from Lund University, Lars obtained his PhD in 2005 in electrical engineering at the University of Washington in the US. From 2006 to 2008, he was a postdoctoral research assistant in David Goodlett's lab, at the University of Washington as well, and in 2008 he became a group leader at the Institute for Molecular Systems Biology at ETH Zurich, until he moved in 2014 to the Service and Support for Science IT. Briefly, Lars' areas of expertise are systems biology, mass spectrometry, protein structure prediction and data management. He also provides services such as consulting, proposal writing and application support. Today, Lars is going to explain to us how one can study host-pathogen interactions using protein structure modeling and chemical cross-linking mass spectrometry. Lars, thank you for coming, and the floor is yours.
Thank you to Yammer for the opportunity, and to the organizers for organizing the seminar. Of course, I'd like to thank everyone who's here and also everyone joining us online. So today's talk is really applied. Most of the methods in here are actually published, so if you're interested in the more technological aspects of what we do, those will be in paper form and not in the presentation today, unfortunately. What I would like to tell you about today, essentially, is the interplay between data collection and data modeling to drive hypothesis generation, and that these hypotheses are then testable.
So for that, we need a system. We're working on a human pathogen. It's called Streptococcus pyogenes. It lives in the oral cavity of about 30% of humans, so it's very, very, very common.
About 700 million of us get mild infections every year, so that's strep throat, and of course, this disease is not very deadly. And luckily, both for humans and for the scientists, this thing is very susceptible to penicillin, and hence doesn't really constitute a threat, which means that it's very cheap to experiment on, and so it suits bioinformaticians in that sense. It is, however, also interesting because at some frequency, it actually gets into the blood. That actually happens very often: pretty much every time you brush your teeth, you'll have bacteria entering your blood. Of course, in general, it's not a problem. In this case, since the bacterium is so common, this probably happens a number of times, and in very few of these cases an infection is established. So we have about 1.2 million very severe infections per year, and these are mostly in the third world, but the Western world is not completely excluded. And once it's established in your blood, the mortality rate is very high, between 25% and 40%, depending on the study.
All right, so the surface of the bacteria is where almost all the action happens. So here is a mock-up of what this looks like. You have a cell membrane, here's the cell wall, and then you have bacterial proteins that are either attached to the cell wall or secreted, and they interact with proteins from the host. So here are two very common blood proteins, fibrinogen and albumin, for example. And of course, if you image this thing with electron microscopy, it looks this way, where you have the surface, and, I don't know if you can see it up here, there is a sort of faint gray layer. That's these very long M proteins. So the streptococcus has this layer. Once you add plasma to this, and you take the same micrograph, you see there's a huge, thick, sort of darker gray layer here, and that's actually the plasma proteins.
So this thing is completely covered by the host proteins, and this is very likely a strategy to avoid detection. And so we want to understand the function of this layer, the interaction layer: how the bacteria interact with the host, but also how the host responds to the bacteria.
And so this is my little systems biology slide. I'm sure you all know it, but I'll go through it real quick. So you have some large system. These are complicated, and you can't really measure them directly, at least not with the techniques I prefer. So we fragment it, either with proteases or by shearing DNA, for example. And you can do some things to it: you can perturb the living system, or you can fractionate the system once it's lysed, and so on and so forth. And you can measure that by mass spectrometry, and of course, use genome sequencing to get to the underlying DNA code that drives the virulence of these things. So then you digitize them, and you end up having a lot of data, which will be stored and condensed into some simple statistics. And then you try to create some models, and there is an infinite number of models, and ideally, these models will generate hypotheses that you then go back to the lab and test. And so you have this sort of iterative approach where your hypotheses get better and better, and the differences that you discover get better and better, more interesting.
So more specifically, mass spectrometry is one of the techniques we use. This is just an example of the techniques we're using. And the sample can be a global cell lysate. We digitize that simply by cutting the proteins into peptides, separating the peptides with liquid chromatography, and then measuring them in the mass spectrometer. And essentially, you can fragment the peptides and record all the fragments. And from that, you can puzzle together essentially two things.
One is the identity of the peptides, and since most peptides are unique in the genome, you can also work out which protein that peptide came from. And the other thing you can read out from mass spectrometry data is a quantitative value. Now, it's not an absolute value, it's a relative value, so you can only use it to compare two samples with each other. And of course, you can generate heat maps, and you can do clustering, and you can see that certain proteins are more closely related than others, and certain samples are more closely related.
And so you can now do this many, many times. Here is one example I'll go through really briefly, and I'll take the opportunity to advertise a talk I'm going to have at BC2, where I will talk a lot more about these particular projects. But in this case, we've taken 34 strains. Some of them were invasive, so isolated from blood, and others were isolated from the oral cavity, in the same region during the same time period. And then we use genome sequencing, and we create a composite genome, simply by aligning them; they're all very, very similar. And of course, we can now mark out the regions where these strains differ. And so then we can go and measure those with SWATH-MS. And what you can see now is that you have samples, or strains, that are related, and then proteins that are related. In this particular case, the blue ones are the non-invasive ones, and the red ones are the invasive ones. And so already here, you can sort of see two things. First of all, there seem to be two major groups, and those two major groups, this group over here and this group over here, are mixed: they both contain invasive and non-invasive bacteria. But within those groups, you have subgroups, and with a little bit of imagination, maybe you can see that there is some information in this protein complement to actually identify virulent strains. So we're still working on sorting out all the details.
But this gives you an opportunity now: since you have information about virulence in these particular data sets, you can start to look at the proteins, and what you can see, if you squint, is that there are some proteins that seem to be higher in the non-invasive strains and lower in the invasive strains, and vice versa. And so this will give you some information about the mechanism, perhaps.
So if you then scale this to a much larger system where you have more experiments and more samples (these are slightly dated numbers, we probably have 5,000 genes by now, and we've probably done something like 20,000 mass spec injections), then you have some computational tools, and then you have lots of online databases, and you would want to integrate these and make some sort of simple model. And the model that I'm building here is an association model, and it really comes down to taking clusters: any two proteins that belong to the same cluster get an association score. And then you do this over tens of thousands of data sets, databases, or computational results from the normal bioinformatics tool sets, and then you compute the association between every single pair of proteins, throw it into a sort of force-directed graph model, and you constrain it by protein localization. With mass spectrometry we are actually able to isolate proteins that are in the cell wall or the membrane, we can isolate whatever is exposed on the outside by shaving the bacteria with proteases, and of course everything else will be counted as intracellular. And so we end up with a little model like this, where we have the cell wall and exposed area up here, and then for clarity we put the DNA-associated proteins down at the bottom here. And then you can use this thing to figure out the relative distance between proteins.
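The co-clustering idea behind the association model can be illustrated with a minimal sketch. It assumes the simplest possible scoring, one point per data set in which two proteins share a cluster; the actual weighting used in the work is not given in the talk, and the protein names and the `association_scores` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_scores(clusterings):
    """Count, for every protein pair, how many clusterings put both
    proteins in the same cluster (assumed unweighted scoring)."""
    scores = Counter()
    for clusters in clusterings:
        for cluster in clusters:
            # Sort so each pair is always stored under the same key.
            for a, b in combinations(sorted(set(cluster)), 2):
                scores[(a, b)] += 1
    return scores


# Hypothetical toy input: three data sets, each already clustered.
clusterings = [
    [["protA", "protB", "protC"], ["protD", "protE"]],
    [["protA", "protB"], ["protC", "protD", "protE"]],
    [["protA", "protC"], ["protB"], ["protD", "protE"]],
]

scores = association_scores(clusterings)
for pair, score in scores.most_common():
    print(pair, score)
```

The resulting pair scores could serve as edge weights for a force-directed layout; the localization constraint mentioned in the talk is not modeled here.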
And so to make it a little bit more concrete, this is now the M protein, and it's associated with proteins that are largely on the surface. The M protein is known to interact with several ligands in human blood. The peptidases here are used to cut certain molecules that the host is throwing at the bacteria, and you have something that also binds macroglobulins, for example, or fibronectin. So these are all proteins that carry out their function in the blood, when the bacteria are in blood. And luckily for us, there was a hypothetical protein that came up in this particular model. Now, these associations shouldn't be confused with protein-protein interactions. They're simply the result of this concept of dividing proteins into many, many groups, and whenever two proteins show up in the same group, they are counted as associated. So for example, some of the things that were particularly useful in this model were that some of these proteins are more abundant in virulent strains; some of them, or most of them, are regulated upon exposure to human plasma, which is a known assay for virulence; and then of course they're primarily found at the cell surface or in the secreted pool.
So we make this model a little bit more interesting by using structural modeling. In this case, it's almost exclusively homology modeling, but it's the same concept as de novo modeling. And the way that we do it is to estimate the local conformation of every single segment of the protein using fragments. That's represented here. So for example, this is amino acid 1 to 9, this would be 2 to 10, and so on and so forth. Now you can rephrase protein structure modeling as a Monte Carlo search. So essentially what you do is you randomly replace one part of the molecule with one of these local conformations, and then you estimate the energy with a statistical potential.
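The fragment-replacement search described above can be sketched in a few lines. This is a toy under stated assumptions, not the modeling software used in the work: the conformation is reduced to one angle per residue, `toy_energy` is a made-up stand-in for a statistical potential, the fragment library is random, and the Metropolis acceptance rule is an assumption, since the talk only says that the energy is estimated after each replacement.

```python
import math
import random

FRAG_LEN = 9  # nine-residue windows, as in the talk (1 to 9, 2 to 10, ...)


def toy_energy(angles):
    """Stand-in for a statistical potential (assumption): mean squared
    deviation of each angle from -60 degrees."""
    return sum((a + 60.0) ** 2 for a in angles) / len(angles)


def fragment_monte_carlo(n_res, fragments, energy, n_steps=2000, kT=50.0, seed=0):
    """Replace a random window with a library fragment and accept or
    reject the move with a Metropolis criterion (assumed here)."""
    rng = random.Random(seed)
    angles = [180.0] * n_res  # start from an extended chain
    current = energy(angles)
    best, best_e = list(angles), current
    for _ in range(n_steps):
        start = rng.randrange(n_res - FRAG_LEN + 1)
        frag = rng.choice(fragments[start])
        trial = angles[:start] + list(frag) + angles[start + FRAG_LEN:]
        e = energy(trial)
        if e <= current or rng.random() < math.exp(-(e - current) / kT):
            angles, current = trial, e
            if current < best_e:
                best, best_e = list(angles), current
    return best, best_e


# Hypothetical fragment library: five candidate fragments per window.
lib_rng = random.Random(1)
n_res = 20
fragments = {
    start: [[lib_rng.gauss(-60.0, 40.0) for _ in range(FRAG_LEN)] for _ in range(5)]
    for start in range(n_res - FRAG_LEN + 1)
}

best, best_e = fragment_monte_carlo(n_res, fragments, toy_energy)
print(f"start energy {toy_energy([180.0] * n_res):.0f}, best energy {best_e:.0f}")
```

Running the search many times with different seeds would give the kind of model cloud that is then plotted as energy against RMSD.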
And since it's such a fast algorithm, you can do it tens or hundreds of thousands of times, so you end up with large clouds of models. Of course, this is a protein of known structure, because the axis here is RMSD, so you can see that yes, indeed, in a few cases perhaps, you have some good correlation between your statistical potential and the quality of the model, but you also have all the local minima. And it does indeed work. It's getting really old now, but in a double-blind prediction, back in Baker's lab, we were able to get a protein that was below two angstroms, though it's a short protein. And since then we've been able to model bigger proteins with accuracy, even though modeling at this resolution still remains pretty uncommon.
And so now you can build a slightly more explicit model. This is really a mock-up, so we haven't done any minimization on it, for the simple reason that we don't have the compute power to do it. We might get there one day, but for now it simply remains assembled by hand. And so this thing here represents this particular interaction, or association, network.
So this is the hypothetical protein, and it turned out to be pretty interesting. First of all, it is highly expressed in virulent strains. It is mostly found on the outside of the bacteria, in the secreted pool. And there was a transposon study where they knocked out the majority of the non-essential genes and then tested for growth rates in blood, and this particular protein was important for survival. Whereas in most biological studies, if a protein is hypothetical, it's grossly overlooked and nobody ever cares about it. So the hypothesis is that this protein is interacting with human plasma in some way to aid survival in blood.
And so on to the next mass spectrometry technology. In this particular case, it's called affinity purification mass spectrometry. And the concept is that you take a protein and you express it. In this case, we use the E.
coli expression system, and you put a tag on it, and then we simply dip this thing into human blood and pull down that protein with its interactors. And then we use some statistics to figure out which proteins are significantly interacting with your protein of interest. And in this particular case, we had eight or nine interactors that are from blood. And as you can see, they vary: some of them are complement proteins, some of them are protease inhibitors, some are involved in transport or in coagulation. And then we have this HRG, which turned out to be really interesting because it's a known protein that is indeed involved in defense against pathogens, and it has been shown to kill Streptococcus pyogenes. And this particular interaction we did confirm with Biacore.
And it does indeed do the job. So in this particular case, you have bacterial strains. This is obviously the reference here, at 0 percent HRG, where you have by definition 100 percent survival. And as you add more HRG, you can see that survival goes to zero or close to zero. And then if you grow the bacteria at a non-permissive concentration of HRG and you add sHIP back, at some point you recover some of the survival. And so indeed sHIP is inhibiting the main function of HRG. And another important aspect of this is that we did detect antibodies against this sHIP protein in patients: in three out of six who had sepsis or an invasive infection, and in zero out of eight patients with non-invasive infections. So we can conclude that it is indeed involved in some type of survival-aiding function.
So of course, what's next? We'd like to know all the details of how sHIP neutralizes HRG. And this might allow us to design a drug that inhibits sHIP and by that means kills the bacteria. And of course, we want to have an atomic-resolution model of how these two proteins are interacting. This is still work in progress.
So instead, I will share with you a new technology that we are developing right now, actually, that will allow you to do this without actually doing crystallization. And to no one's surprise, it's mass spectrometry based, and it's also based on targeted proteomics.
So this is the M protein and fibrinogen, and their interaction is actually known. And if you know the structure, you can compute the distance between any two amino acids very trivially. And you combine that with cross-linking mass spectrometry, which I'll explain really quickly. Essentially, you use a molecule, in this case DSS, that reacts with lysines. And so if you allow it to react with your system, in some cases it reacts with lysines from two different protein chains. And so now you can go and identify that in the mass spectrometer. And with that ID, you can define the maximum distance between these two lysines.
So in this case, I'm just going to make an assumption here that you have two models. These are not proper models, these are just made up. One is, of course, the real conformation and the other is a fake one. You can now compute all the lysine-lysine distances. So in this particular case, the green one here, which is the correct conformation, has 39 potential lysine-lysine cross-links. And in the incorrect conformation, you have 18 potential cross-links. And so now you can use a mass spectrometer to go and actually target these particular cross-links.
Now, I should probably mention that these are isotope-coded. So we have a light and a heavy form. And what you would expect, of course, is that they behave very similarly, both in the separation steps and in the mass spectrometer. And so what I'm trying to show here are dose-response experiments where we go from 0 micromolar cross-linker down here up to 2,000 micromolar. These are six conditions, and they're represented here as 12 ion chromatograms.
An ion chromatogram is really just a chromatogram over time where the y-axis is intensity. And of course, they're paired, so the lights are red and the heavies are blue. And you can see that the vast majority of the signal is actually light. As expected, it is more common to see more signal in the light channels than in the heavy channels. But you have these doublets that obviously are not present at 0 micromolar but are coming up strongly at some retention time point. And if you look at these profiles, I'll cut this little part out here. And I'm showing it from the side now, and trying to be a little bit smart with the colors, with light blue for the light form and dark blue for the heavy form, where you can see that they are indeed reasonable profiles. That's what you want. And also, they're slightly shifted in time, which is also what you want. And with a little bit of imagination, you might perhaps see that most of these are actually looking reasonably correlated, or looking similar. And if you simply calculate the areas under these curves, you get that they're really similar in intensity.
So now you have all the pieces you need to write a score. And so we wrote a very, very naive one. It needs some improvement. But this particular peptide is actually the cross-link between these lysines here and these lysines here, which are within cross-linkable distance. In general, we measure these as C-beta to C-beta distances, and that includes the DSS, which is about 11 angstroms, but also the lysine side chains, which are about 7 angstroms. That is the cross-linkable distance. So now you go and do the experiment, and you actually target all of these cross-links. In this case there are a total of 60 or so, which means you can target all of them in a single injection. The score, as I said, is very naive, so we ended up with quite a few positives. Both the blue and the red ones were scored positive in this case.
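The light/heavy doublet comparison described above, similar areas under the curves and correlated profiles, can be sketched as follows. This is not the score used in the work, which the talk only calls naive; the `doublet_summary` helper, the trapezoid integration, the use of Pearson correlation, and the intensity values are all assumptions for illustration.

```python
import math


def trapezoid_area(times, intensities):
    """Area under an ion chromatogram by the trapezoid rule."""
    return sum(
        (t1 - t0) * (y0 + y1) / 2.0
        for t0, t1, y0, y1 in zip(times, times[1:], intensities, intensities[1:])
    )


def pearson(x, y):
    """Pearson correlation of two equally sampled profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def doublet_summary(times, light, heavy):
    """Compare the light and heavy traces of one candidate cross-link."""
    area_light = trapezoid_area(times, light)
    area_heavy = trapezoid_area(times, heavy)
    return {
        "area_light": area_light,
        "area_heavy": area_heavy,
        "area_ratio": area_light / area_heavy if area_heavy else float("inf"),
        "correlation": pearson(light, heavy),
    }


# Hypothetical traces for one light/heavy pair (arbitrary units).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [0.0, 10.0, 40.0, 10.0, 0.0]
heavy = [0.0, 8.0, 38.0, 12.0, 0.0]

summary = doublet_summary(times, light, heavy)
print(summary)
```

A plain Pearson correlation does not account for the slight shift in retention time between the light and heavy forms mentioned in the talk; a more careful score would have to allow for it.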
And by hand, and this is something no one in this room will ever buy if they get to review this paper, we did some manual validation, and we could validate five out of the eight that scored positively in the correct model and zero in the incorrect one. No one would believe that. So that's still work in progress. We do have a new score that works a little bit better, but we're still adding some features to it. So this is the danger there, having things pass through the score filters. In the case where all of these are models, which I'll show you in a bit, you wouldn't really know which one to choose.
So what we're doing here, and this is the key: if you've ever looked at a normal cross-linking data set, it is slightly disappointing, because you run the cross-linking experiment in low stoichiometric settings, for the simple reason that if you over-cross-link it, you create a ball that cannot be digested, and so you can't measure it. So in general, people aim to get one cross-link per protein pair. And so with the combinatorics, you are literally creating the needle in the haystack before you even start. If you look at these data sets, in general, there are two to ten cross-linked peptides out of 30,000 sequences. So you have a really bad situation where it becomes really tricky to score this data because you have so much stuff that isn't cross-linked. And that's one of the reasons why all cross-linking mass spectrometry experiments are done on highly, highly purified systems. People spend about three years to get the system and the conditions right, then do the mass spectrometry, and then it's hard to measure and hard to analyze.
And so of course, here, the idea is that you make models. In the first case, you assume that you know nothing about the system. In reality, we know a lot. Very often, especially if you work on interactions between pathogenic bacteria and blood proteins, you have a lot of known structures.
So the fact that you have all this information is completely lost in the normal way of doing cross-linking experiments. And here, we're trying to find a way to use this information to our advantage. So now, of course, we measure every single potential cross-link, whether or not it's there. And so instead of doing discovery, you're doing model selection, essentially. That's the idea.
And as I said, these were all faked models, made by moving some chains around, so it's pretty ugly. But of course, you can do protein-protein docking these days. It actually works reasonably well. If you have a crystal structure, where the packing is all correct, then you have a high chance. In the other cases, you have very, very high noise levels. So again, the idea is exactly the same as for normal modeling. You perturb the conformation by rotation and translation, you optimize the side chains, and then you determine the quality of the model with a statistics-based energy function. And then you get these rugged, sort of low-energy conformations. And in general, in CAPRI, which is the blind docking experiment, we have 20 or so models if the system is easy. But at least there is some hope that we get close.
And that's really it, a short seminar today. On to the conclusions. Large amounts of diverse data can be used to create usable models of these systems. Of course, in this case, we have a really large amount of data on a very, very simple system. I haven't tried this on anything more complex than Streptococcus pyogenes, with about 2,000 open reading frames, so it's still a small system. I don't know how much data you would need for something more complex. Targets that are identified using this model can be expressed, and we can define their binding partners using affinity purification mass spectrometry. And then, perhaps the more interesting part, these modeled interactions can be used to guide cross-linking mass spectrometry data acquisition and data interpretation.
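The model-selection idea, enumerating the lysine pairs within cross-linkable distance in each candidate model and checking which model the validated cross-links support, can be sketched as below. The cutoff of 25 angstroms simply adds the rough numbers from the talk (about 11 for DSS plus about 7 for each lysine side chain) and is an assumption, as are the coordinates, the residue labels, and the `rank_models` helper.

```python
import math

# Assumed C-beta to C-beta cutoff in angstroms: DSS (~11) plus two
# lysine side chains (~7 each), using the rough figures from the talk.
MAX_CB_CB = 11.0 + 2 * 7.0


def potential_cross_links(model, cutoff=MAX_CB_CB):
    """Return the inter-chain lysine pairs within cross-linkable distance.

    `model` maps chain "A" and chain "B" to {residue: (x, y, z)}
    dictionaries of lysine C-beta coordinates.
    """
    return {
        (res_a, res_b)
        for res_a, xyz_a in model["A"].items()
        for res_b, xyz_b in model["B"].items()
        if math.dist(xyz_a, xyz_b) <= cutoff
    }


def rank_models(models, validated_links):
    """Rank models by how many validated cross-links they can explain."""
    ranking = [
        (len(potential_cross_links(model) & validated_links), name)
        for name, model in models.items()
    ]
    return sorted(ranking, reverse=True)


# Hypothetical lysine C-beta coordinates for two candidate models.
chain_a = {"K12": (0.0, 0.0, 0.0), "K40": (30.0, 0.0, 0.0)}
models = {
    "model_1": {"A": chain_a, "B": {"K5": (10.0, 0.0, 0.0), "K77": (45.0, 0.0, 0.0)}},
    "model_2": {"A": chain_a, "B": {"K5": (10.0, 60.0, 0.0), "K77": (30.0, 20.0, 0.0)}},
}

# Hypothetical cross-links that survived scoring and manual validation.
validated_links = {("K12", "K5"), ("K40", "K77")}

ranking = rank_models(models, validated_links)
print(ranking)
```

In this toy case the first model explains both validated links and the second only one. A realistic ranking would also have to penalize targeted cross-links that a model predicts but that were not observed.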
So of course, there are lots and lots of co-workers to mention. Rudy, for the opportunity of doing this work, I do a lot of for doing some of the modeling, we've seen on Halloween for most of the mass spec work, but everyone is pretty much involved in either workflows, data management, data acquisition, or all aspects.