So, good morning, everyone. It's my great pleasure to introduce Alexandre Bonvin, who's going to be giving us instruction on HADDOCK and its use. Alexandre is from the French-speaking part of Switzerland. He did undergraduate work in Lausanne, and his PhD at Utrecht University was with Rob Kaptein. He then did a postdoc in the US with Axel Brünger at Yale, and now he's a professor of computational structural biology at Utrecht University in the Netherlands. So welcome.

Thank you very much for the kind introduction. I guess the mic is working. It's a pleasure to be here. We have a full-day program, and I will lecture this morning. I don't know the background of everyone, so I will first introduce a bit of the basics of docking in general, and then we'll go more into the specifics of HADDOCK. I have a very large lecture here on my computer, so I will go through a first part, and then we will have multiple choices where you can choose, basically, the topics I should be speaking about. I can fill a long time with that, so we'll see how far we get during the lecture. And this afternoon we will do some practicals. I saw that some of you were already registering to some of our portals. If you have not obtained your credentials yet, but you did register, do check your spam mailbox, because you need to confirm your email before we enable your account. But we can deal with that this afternoon.

So let's start. This is the campus where I'm working at Utrecht University, in winter. What you see up here is the famous tower in Utrecht, the Dom Tower. This was celebrating the 375th anniversary of our university, and the colors that you see here are the seven colors of the seven faculties, which were projected from a physics lab on campus onto the Dom Tower. This is the tallest building in Utrecht, and there is a law that forbids building anything taller than that one. So Utrecht is a very old city.
It's very nice to visit if you get the chance. Within Utrecht, I'm working in this building, which houses our NMR facility. It has been a European facility since approximately 1996. This is the academy building in the center of the city, next to the Dom Tower actually, where all the official events take place. And what you see here is a somewhat vague picture of one of our NMR halls. This is the infrastructure that we have for solution NMR: 950, 900, 750, 600 and 500 MHz machines. We ordered, a couple of years ago, a 1.2 gigahertz machine; this should arrive probably around 2020 or 2021, as we are number five on the delivery list. There is also a strong infrastructure in solid-state NMR. And we have a computational infrastructure as well; this is more my business. We have a number of clusters in-house, but we are also connected to a European grid, and actually also to the Open Science Grid in the US, so we have access to more than 100,000 CPU cores. These activities are supported by a number of European projects; the European access infrastructure for NMR is currently coordinated by the iNEXT project.

So enough about Utrecht and the NMR facility in Utrecht. This is the program for this morning. I want to give you a general introduction. Since HADDOCK is an information-driven modeling approach, I want to discuss first the information sources that can provide some bits of the puzzle when it comes to modeling biomolecular complexes. I will not discuss the classical structural methods like NMR, X-ray and, these days, cryo-EM. Then I want to talk about general aspects of docking, just to introduce the docking field and the different approaches that you can find there, for those of you who might not be familiar with that. And then we will move to the specifics of HADDOCK: how are we doing the modeling process in HADDOCK? I will give you an application example of that.
I will show you how we can use cryo-electron microscopy data to guide the modeling process. And then at the end, I want to speak about the concept of the interaction space. This is related to the DisVis software that we developed. The question here is not so much: can we model the structure of a complex? But rather: can we assess the information content of the data that we're going to put into the modeling? And then, depending on time, you are going to choose a topic which I will cover, and if we have enough time, we can do a number of those topics. I think I have on the order of 20 different topics you can select there, so I can go on for a long time if you don't stop me.

So what are we speaking about? These days we have thousands of genomes that have been solved. The genome, of course, encodes the information, but the action takes place at the protein level. So we have the genome, you have the proteome, and the next level of organization is the interactome. What you see here are a lot of dots connected by lines. The dots represent proteins; the lines represent interactions between proteins. This is a highly dynamic system, in the sense that the network might change as a function of where you are in the cell cycle, as a function of where you are inside the cell, or as a function of post-translational modifications to your proteins. And if you want to understand how this network works, you need to add the structural dimension not only to the dots, which means solving the structures of the individual protein components, but also to the connections between the dots, to the network. Because not only can protein misfolding lead to disease; miscommunication between biomolecules in general can cause trouble and can be at the origin of disease. So this is what I like to call the Facebook of life, or the Instagram of life, basically.
So the concept of the 3D interactome was introduced by Patrick Aloy a couple of years ago. This is a picture from a review they published, where you get the spectrum of experimental and computational methods that you can use to study biomolecular complexes. It starts on the left side with the experimental methods. You find here the experimental structures: X-ray crystallography, where you can look at very large complexes; there is no size limit there. Then, as you move to slightly smaller systems, you might study domain-domain interactions, so not the full proteins, but just the domains that are directly involved in the binding. That's a field where NMR can also contribute, because here we are looking at smaller systems. And then you move into the peptide-mediated interactions. This is typically a field where NMR can contribute a lot, because peptides have a lot of flexibility and possibly disorder. They might be more difficult to study by classical crystallography methods, but NMR is very good at that. This is a highly relevant field, because a lot of proteins contain partially disordered regions; some proteins are completely disordered and only adopt a conformation when they bind to their targets.

So these are the experimental methods. Then you come into the modeling fields, and here in the middle you find homology modeling. You often associate homology modeling with modeling of a protein structure based on a known structure in the database, in the PDB, provided there is some sequence similarity between the two. But you can, of course, do the same at the level of complexes. If you find a homologous complex that has a structure in the PDB, you can simply model the complex directly. This is what is called template-based modeling. And then, when you move more to the right, you enter the field of docking.
Here we will typically have the structures of the components of the complex, but we don't have the structure of the complex itself. So you cannot do template-based modeling of the full assembly; you can model the components, but then you will have to put them together. This is where we will concentrate today. At the end of this picture, they put hybrid modeling. This is related, for example, to the work of Andrej Sali: the combination of a lot of different types of experimental and modeling techniques to solve problems. The nuclear pore complex model is such an example. You will have a bit of mass spectrometry data, a bit of NMR, crystallography, maybe low-resolution data, and you put everything together. And hybrid modeling, strangely to me, is put completely on the right side in this review, as a computational method. But really it belongs in the middle, because that's the intersection of experiment and modeling. This is something that you see happening more and more these days, because as we look at more complex systems, there will not be a single structure or structural method that's going to give you all the answers. We will have to combine different experimental techniques with computations to get a look at these complex assemblies.

Now we can go into the protein database and try to see how much information we have about complexes. This is also taken from the same work, Interactome3D. These are already old statistics. What you see here is a small number, 45,000 interactions for human. We have about 20,000 proteins in our human proteome, and they form several hundred thousand interactions. So the size of the interactome is more than one order of magnitude larger than the number of players, the proteins that you have. So 45,000 is not representing the full interactome; these 45,000 are interactions that have been documented.
There are biochemical experiments that prove that those proteins do interact; these are the validated interactions. So if you take those 45,000 and look into the protein database, this is what you see. We have structural information, so we can homology-model the complexes, for less than 10%. This is the white and pink regions here; this is the structural information that we have on the complexes. Then you have this large blue fraction, which represents about 50% of those interactions, where we do have the structures of the components, but we don't have the structure of the complex. This is the area where docking as a computational technique can contribute, and that's 50% of the interactions. And then you see this gray region, about one-third of the complexes, where we have nothing: no structural information about the components and, of course, nothing about the complex. That's a challenge to the field, both to the experimental people and the modelers. What do you find in there? You will find, again, the disordered proteins; the human proteome is expected to be about 40% disordered. You will also find all the membrane and membrane-associated systems, which are more challenging to solve. The situation is slightly better in bacteria, but you see that in terms of structural information, it's still about 10%. The blue region is larger, but you are left with about 25% where there is nothing. The blue region, though, is good news for the docking field.

So when I'm speaking about docking, what is docking in a nutshell? Given the structures of the components of a complex, can we find solutions that predict those interactions? What you will have to do is generate all possible combinations of those two proteins in 3D. We are trying to solve a 3D puzzle. You will have to translate the proteins, and you will have to rotate the proteins.
And every time, you will have to measure the quality of the fit between the two. Several components come into that measure. Shape complementarity will play a role: it should fit. Next to that, you typically add physics, energetics. You want complementarity of charge, which will be represented by an electrostatic energy function; you see here the Coulomb potential. And you want to measure the complementarity of the more hydrophobic surfaces, which will be represented by the van der Waals interactions.

So in docking we have two main aspects. If on the x-axis here you have the conformational landscape of the interactions between the proteins, then on the y-axis you have some measure of the quality of the models that you generate. Sampling means that you have to sample all possible configurations of those two proteins. And then you will have to score them, meaning you want to associate some kind of score or energy value with each of the models you have generated, to try to locate the global minimum of the system. That's the scoring part. You have different methods for the sampling and different methods for the scoring, and sometimes sampling and scoring are tightly associated, so there are all kinds of variations.

Now, if you have experimental data, you can think of using them in two ways. You might use the data to bias the sampling; this will be information-driven docking. Or you might use the data to score the models that you generate, to help in the identification of the best models. Or you might actually use them for both. So this is the difference between the two. This will be a global search: we don't use the data during the sampling, we just do a global search of all possible configurations of the two molecules, and then you have to score all of those. If you use data, you might bias the search.
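The kind of scoring function described above, a Coulomb term for charge complementarity plus a Lennard-Jones term for the van der Waals interactions, can be sketched in a few lines. This is a toy illustration, not any real docking force field: the atom representation, parameter values and function name are all made up for the example.

```python
import math

def pairwise_score(atoms_a, atoms_b, eps_lj=0.2, sigma=3.4):
    """Toy intermolecular score: Coulomb electrostatics + Lennard-Jones (vdW).
    Each atom is a (x, y, z, charge) tuple. Parameters are illustrative only,
    not taken from a real force field."""
    coulomb_k = 332.0636  # kcal*Angstrom/(mol*e^2), converts q*q/r to kcal/mol
    e_elec = e_vdw = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            e_elec += coulomb_k * qa * qb / r          # Coulomb potential
            sr6 = (sigma / r) ** 6                     # (sigma/r)^6
            e_vdw += 4.0 * eps_lj * (sr6 * sr6 - sr6)  # 12-6 Lennard-Jones
    return e_elec + e_vdw
```

A lower (more negative) score means a more favorable fit: two opposite charges a few angstroms apart score negative, two like charges positive.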
So you use the data to drive your search, or to limit your search problem. And hopefully, if the data are good, you can spend more time sampling the relevant part of the interaction space. This is nice, but of course it depends on the quality of the data. If your data are good, you are going to sample the right region of the interaction space. But if your data are wrong, you are never going to sample that region; you are going to look in the wrong regions. So it has advantages, but it also has dangers. You have to trust your data if you are going to use them in the sampling, because garbage in is going to give you garbage out. That's the GIGO principle, which you can even find in Wikipedia.

These days we speak a lot about integrative modeling. Integrative modeling means we want to combine a large number of different sources of information to bias our sampling and generate the model of the complex. If you are an experimentalist using modeling, you are going to generate models, and the model should not be the end of the path. The model is only the starting point for new experiments, basically. You want to use the model to generate hypotheses, and then you can go back to the lab and test those hypotheses. It might help in speeding up structure determination, and of course, with models, you hope to increase your understanding of function. If you are more on the modeling side, by using the data we hope to decrease our false-positive rate. We generate a lot of models that are complete crap, irrelevant for the complex you want to study, and by using the data in the modeling we are going to limit this false-positive rate. And of course, you also need data to measure the accuracy of your modeling process.

There are a number of reviews in the docking field. If you want to learn about docking and you are not familiar with it, there is an old one, from 2002.
It's kind of an old review, but it's still very relevant; it really introduces all the principles of docking. If you want to see what the field is doing in terms of docking, you can look at the special issues of Proteins, which appear about every two years, about CAPRI. CAPRI is the Critical Assessment of PRedicted Interactions. It's the equivalent of CASP for structure prediction, but CAPRI is for the prediction of complexes. In those special issues, you can see the performance of the different groups and the different approaches that are used to do this kind of modeling. We wrote a number of reviews about integrative modeling; this is the most recent one here.

So let's move now to discuss a bit the information sources, the non-classical structural information sources. And we start simply in the wet lab, with mutagenesis to probe residues on the surface of your proteins. What you are doing is introducing mutations, and you need some kind of binding assay to measure whether the complex is formed or not. If you see that, after mutating a residue on the surface, you don't see complex formation anymore, you will interpret that as: this residue must be important for the interaction. So what you get out of that is residue-level information about possible binding sites. You have, of course, to make sure that your protein is still properly expressed, because sometimes a mutation means the protein does not express properly anymore, and that the protein is still well-folded. If your protein does not fold properly anymore, of course it's not going to interact, but not because that residue is important for the interaction; it's because that residue is important for the structure. Still, this is information. And there are techniques that systematically mutate all the residues on the surface of proteins, like alanine-scanning mutagenesis. So this is residue-level information.
You don't know exactly what that residue is doing, but you know that it's important, and that's usable. What you see a lot these days is the use of cross-linkers to study complexes, in combination with mass spectrometry. Here you use small molecules that typically have a flexible linker and two warheads that attach to the proteins. The classical ones, some of the first ones, were targeted to the lysine side chains on the surface of proteins. You do the reaction in solution with your complexes. Of course, those chemicals are highly reactive, so they are going to react with whatever they find out there that is suitable for the reaction. From time to time you are going to create intermolecular cross-links, but you are also going to get intramolecular cross-links. And then what you do in MS is typically to digest those proteins with proteases and detect the fragments, the peptides. From time to time you will detect two peptides that come from two different proteins and that are cross-linked by the small molecule you used; those fragments will have the additional mass of the cross-linker. So this is telling you that the residues that have been cross-linked must be within a given distance of each other in the complex. Because those linkers are typically flexible, this information is not very precise: you might get a distance range between, say, 0 and 20 or 0 and 30 angstroms. So it's not very precise, but it's still information, and if you get enough of those, you should be able to model the complex. These days, people are even using similar approaches with different chemistries to start solving structures of proteins. I've heard talks where people are able to collect almost NMR-like numbers of NOEs, but from cross-links, and then you can start trying to fold your protein based on this cross-link information. One big advantage here is that you need very little sample to do this kind of reaction.
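The distance ranges just described translate naturally into restraints that a model either satisfies or violates. A minimal sketch of such a check, assuming hypothetical residue coordinates and a single generic upper bound for the linker, might look like this:

```python
import math

def crosslink_satisfied(coords_a, coords_b, crosslinks, max_dist=30.0):
    """Flag which detected cross-links a docking model satisfies.
    coords_a / coords_b: dicts mapping residue number -> (x, y, z) of, say,
    the lysine side-chain nitrogen; crosslinks: list of (res_in_a, res_in_b)
    pairs identified by MS. max_dist is the upper bound implied by the
    flexible linker, roughly 20-30 angstroms for common reagents."""
    return {
        (ra, rb): math.dist(coords_a[ra], coords_b[rb]) <= max_dist
        for ra, rb in crosslinks
    }
```

Counting the violated pairs over a set of models gives a simple way to rank them against the cross-linking data, which is in spirit how such restraints are used to drive or filter docking.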
And you can even do the cross-linking experiment these days in living cells and detect the complexes that exist during the cell cycle. Of course, after that you have to lyse your cells and digest your complexes to do the analysis, but it's quite amazing that you can do this kind of work in living cells. So here you are getting distance information, not very precise. It's not always easy to detect those cross-links, because their concentration might not be very high, but there has been a lot of progress in that, and MS is really the method of choice here.

Another way of mapping interactions between molecules is to do H/D exchange, which can be studied both by NMR and by mass spectrometry. If you have assigned your protein, you are going to follow the disappearance of signals. What you are doing here is taking your protein and dissolving it in D2O, so all the exchangeable protons in your protein are going to exchange for deuterons. You have to do these experiments with your free protein, repeat the experiment with the complex, and look at the differences: the regions that are buried upon formation of the complex are going to be protected from exchange. What you get here is, again, residue- to peptide-level information. If you detect by mass spectrometry, the resolution that you get is at the peptide level; again, you are going to digest your protein and analyze the peptides. You might suffer from indirect effects, meaning that you also detect allosteric processes here: something binds to your protein, and there is a conformational change in another region of the protein, and this is also going to affect those protection factors. If you do NMR, of course, you need to label your protein.

Now, we are at an NMR facility here, and NMR is very good at looking at very weak interactions, those interactions that you will have a very hard time crystallizing, for example.
The classical type of information will be chemical shift perturbation experiments, where you monitor one protein labeled with 15N or 13C, you titrate in the other component of your complex, and the signals that are affected by the binding are displaced during the titration, in the case of weak binding, fast exchange. You can map this information on the surface of your protein, and this defines the interface where things are binding. Same story as with H/D exchange: you might also monitor allosteric changes here, so not everything that you detect will be directly relevant for modeling the interaction. We will have to deal with that in some way.

NMR can also provide information about the relative orientations of your molecules; think of RDCs, think of relaxation anisotropy data. This is information that you can take into the modeling as well. It does require more work to collect, but it's bits of information. And a very clean experiment by NMR is to use these saturation transfer experiments. This does require that one of the proteins be deuterated. You are typically going to monitor the signals of the green protein in this example, which is deuterated except for the amide deuterons that have been exchanged back to protons. Then you saturate the signals of the red protein, and by relaxation processes, the interface of the complex is going to be affected. This is a clean experiment in the sense that you don't suffer from allosteric changes; you really only monitor the interface. But it does require the deuteration of your protein.

And then there are plenty of other experimental techniques that are going to give you some bits of information. Think of, in NMR, adding paramagnetic probes to your samples, which are going to induce relaxation enhancements, giving you distance but also interface information.
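Mapping chemical shift perturbations onto the surface, as described above, usually starts from a combined 1H/15N perturbation per residue and a cutoff to call a residue "interface". The following is a minimal sketch; the 0.14 nitrogen weighting is one commonly used convention, and the function names and the mean-plus-one-standard-deviation cutoff are just illustrative choices.

```python
import math

def combined_csp(d_h, d_n, n_weight=0.14):
    """Combined 1H/15N chemical shift perturbation for one residue.
    d_h, d_n: shift changes in ppm; n_weight scales 15N to the 1H range."""
    return math.sqrt(d_h ** 2 + (n_weight * d_n) ** 2)

def interface_residues(shifts, cutoff=None):
    """Flag residues whose CSP exceeds a cutoff.
    shifts: dict residue -> (d_h, d_n). Default cutoff: mean + 1 std of
    all CSPs, an arbitrary but common style of threshold."""
    csps = {res: combined_csp(dh, dn) for res, (dh, dn) in shifts.items()}
    if cutoff is None:
        vals = list(csps.values())
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        cutoff = mean + std
    return [res for res, v in csps.items() if v >= cutoff]
```

Residues passing the cutoff are the ones you would color on the structure as the putative binding interface, keeping in mind the allosteric caveat above.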
Cryo-EM used to be at lower resolution; these days it's very high resolution, but not everything is at high resolution, and if you do tomography, you are not going to reach atomic resolution. So you have low-resolution data, and the game will be to dock the complexes into the shapes that you extract from the cryo-EM. Or SAXS, for example, or all kinds of spectroscopic techniques, FRET being one of them. Again, the information is not going to be very precise in terms of the distances that you are measuring, but every little bit is potentially going to help you.

And if you don't have any experimental data, you might still go to bioinformatics and look at sequence information and evolution, because in sequences there is also usually information about binding sites in proteins. The typical example: if you detect, from your sequence alignment, conservation for residues that are on the surface of a molecule, it must be for a good reason; otherwise nature would not bother conserving those residues. So conservation on the surface is telling you something about possible binding sites, or you are detecting the active site of an enzyme. And what is also very popular and coming up very much these days is co-evolution. Now you are going to look at conservation between the two sequences of your proteins, and you are looking for correlated mutations. One residue in one protein mutates at some point in evolution, for example from a positive to a negative charge, and at the same point in evolution, in your other sequence, you see the reverse mutation. If you can detect these kinds of signals, you know that those residues must be somewhere close in space, because their mutations are correlated. This is not a new concept; it was there in the 90s already, but the analysis and the extraction of the signal were very difficult, and it's only in probably the last five years that the methodology has improved a lot and it's now possible to detect those signals.
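The correlated-mutation idea can be made concrete with the simplest possible statistic: mutual information between two columns of a sequence alignment. Real co-evolution methods go well beyond this (with corrections such as APC, or global models like direct coupling analysis), so treat this purely as an illustration of the signal being looked for.

```python
import math
from collections import Counter

def column_mi(col_a, col_b):
    """Mutual information between two alignment columns (lists of one-letter
    residues, one entry per sequence). High MI means the columns co-vary,
    the naive co-evolution signal; real methods add corrections on top."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)          # marginal counts
    pab = Counter(zip(col_a, col_b))                 # joint counts
    mi = 0.0
    for (a, b), c in pab.items():
        p_ab = c / n
        # p_ab * log( p_ab / (p_a * p_b) ), written with counts
        mi += p_ab * math.log(p_ab * n * n / (pa[a] * pb[b]))
    return mi
```

Two perfectly co-varying columns (every A pairs with D, every K with R) give MI = ln 2, while independent columns give 0; in a real analysis you would compute this for all residue pairs between the two proteins and look at the top-scoring pairs.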
You can also detect this information for intramolecular contacts, so you can use that for protein structure prediction. In CASP, the structure prediction competition, there is a category which is about predicting contacts in proteins, and this is based on co-evolution methods. And you can combine these predictions with, say, interface propensities: you can analyze in the PDB which are the typical amino acids that you find at the interfaces of complexes, and use that to improve your prediction. You expect that those residues should cluster on the surface, and you make a prediction. We published WHISCY in 2006 as one of those predictors, because we needed this information for CAPRI, the docking experiment. Since then, a lot of methods have been published; this list was taken from a review we wrote 10 years ago, so it's clearly not up to date. By now the list would be way too long and would completely fill the slide. But you find plenty of methods out there and plenty of web portals, web servers, where you can do this kind of prediction. So even if you have no data, sequence and structure are going to help you. We have also published a meta-server which combines six different online servers to do a prediction. These are the servers that are combined in our CPORT predictor, and CPORT was basically trained to make predictions for use in HADDOCK. And this is just a snapshot of the portal.

So what do you do with all those data? You can use them in two ways. When I was describing docking, I said we have a sampling part and a scoring part. You can use the data after your sampling, so this will be a posteriori: you generate a lot of solutions, and then you filter the solutions based on the information that you have. This is what I call data-filtered docking. Or you use the data a priori, meaning that you are going to bias your search, bias your sampling, using the data that you have at hand.
And this is what we are actually doing in HADDOCK. So let's move now to some general aspects of docking, before going more into the specifics of HADDOCK. The docking problem: there are a number of choices that you have to make if you are going to model interactions, and those choices are also reflected in the different docking software packages out there. You will have to think about how to represent your system. Do I need to represent all atoms? You might be docking by 3D-printing models of your proteins and then doing it with your hands; then the only relevant part is the surface, basically. You don't care about what is inside your model, and there are computational methods that do exactly the same. Then you have to worry about the sampling. If your proteins are rigid, and they are not, what you have to sample are three rotations and three translations. You can fix one molecule at the origin of your coordinate system, and then you have to sample all possible rotations of the second molecule, and for each rotation you have to sample all translations in three dimensions. So it's a six-dimensional search for a two-body problem when there is no flexibility. You will have to think about scoring: you are going to generate a lot of solutions, so how do you identify the good ones? And of course, we all know that proteins are not rigid, so we will have to worry about describing, to some extent, the flexibility of the system. This makes the search much more complex: you are not dealing anymore with a six-dimensional search problem, it explodes in terms of complexity. And if you have data, how are you going to use them?

The first docking software was from the early 80s. This was the work of Joël Janin and Shoshana Wodak, and it was called DOCK. Not the DOCK that you might know for small-molecule docking; this was protein-protein docking.
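The six-dimensional search just described (three Euler angles for the mobile molecule, three translations, with the other molecule fixed at the origin) can be sketched as a simple grid enumeration. The step sizes here are deliberately coarse and purely illustrative; real rigid-body docking uses much finer, often non-uniform, sampling.

```python
import itertools
import math

def six_dof_grid(rot_step_deg=30, trans_step=2.0, trans_range=10.0):
    """Enumerate a coarse 6D rigid-body search grid: Euler angles
    (alpha, beta, gamma) for the rotation of the mobile molecule, plus a
    translation (tx, ty, tz). The receptor is assumed fixed at the origin."""
    angles = [math.radians(a) for a in range(0, 360, rot_step_deg)]
    # beta only needs 0..180 degrees in the usual Euler convention
    betas = [math.radians(a) for a in range(0, 181, rot_step_deg)]
    steps = int(trans_range / trans_step)
    trans = [i * trans_step for i in range(-steps, steps + 1)]
    for alpha, beta, gamma in itertools.product(angles, betas, angles):
        for tx, ty, tz in itertools.product(trans, trans, trans):
            yield (alpha, beta, gamma, tx, ty, tz)
```

Even at this toy resolution the grid has thousands of poses, each of which would need to be scored, which is exactly why the FFT tricks discussed next matter.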
At that time, they were using an explicit representation of the proteins, describing the coordinates of all atoms, and the search was in real space, which might take more time: you sample your rotations, you sample your translations, and you hope to find the solution to your problem. Since then there have been a lot of new approaches, and a lot of the docking methodology that you will find out there is not using an explicit representation of the molecules anymore, but is discretizing the proteins onto a grid. The grid is nice because, once you have equally spaced grid points, you can start applying fast Fourier transform techniques to do your sampling. For the NMR people here, that should appeal, because you all know what the FFT is good at. So what do you do? You define a grid; the grid spacing defines the resolution at which you are going to do your modeling, and you can play with that spacing. You map your protein onto the grid and you assign properties to the grid points, the voxels basically. You want to map onto the grid the surface of your molecule and the inside; here you see two different colors. And then you work with grids only. What you want to do is maximize the overlap of the surfaces, this bluish color here, and avoid overlap with the gray color, which is the core of the protein. So it's a shape representation of the protein, in principle. Now, you can add much more information onto the grid: you can also put charges on your surface, different properties, but then you will need more grids to do your sampling. The resolution is defined by the grid spacing. There are different software packages doing that, and then the docking is a geometric docking where you match the shapes, basically. But you can also add energy terms to that. The search can be in real space or in Fourier space.
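The FFT trick mentioned above rests on the correlation theorem: the overlap score for every translation of one grid against the other can be computed in a single pass instead of one evaluation per translation. A minimal sketch with NumPy (function name and the single-property grids are illustrative; real FFT docking combines several property grids with different weights):

```python
import numpy as np

def fft_translation_scan(receptor, ligand):
    """Correlate two equally sized 3D grids over all translations at once.
    Returns corr with corr[t] = sum over x of receptor[x] * ligand[x + t]
    (with periodic wrap-around), via the FFT correlation theorem."""
    fr = np.fft.fftn(receptor)
    fl = np.fft.fftn(ligand)
    # taking the complex conjugate of one transform turns the product
    # into a correlation rather than a convolution
    corr = np.fft.ifftn(np.conj(fr) * fl).real
    return corr
```

The best translation for a given ligand orientation is simply the peak of `corr`; the rotational part of the search is then handled by repeating this scan for each sampled rotation of the ligand grid.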
BiGGER is one example of a docking program that uses grids but performs the search in real space. You also find hybrid representations now, where part of the system is mapped onto a grid and part of the system is explicitly described, so you have the atoms of that molecule. That's typically something you find more in the small-molecule docking field: in methods like AutoDock or ICM, the protein, or actually the energetic field of the protein, is mapped onto the grid, and your ligand is explicitly represented and moves in this energetic grid. That allows for quite efficient search. One thing that you have to realize about grids is that, in principle, a grid is a fixed object. Once you have a grid representation, you cannot describe flexibility, or you will have to play tricks to account for some level of flexibility; by definition, your grid is fixed.

And then there are even more methods. In principle, we might only be interested in the surface of the molecule, and one nice way of representing surfaces is to use spherical harmonics. The chemists among you will say: oh, this looks like orbitals, and spherical harmonics are indeed used to represent orbitals. The NMR people will also know everything about spherical harmonics, because they enter the relaxation equations. What is being done here, and this is implemented in the Hex software from Dave Ritchie, is to represent the surface of your molecules by a linear combination of spherical harmonics of different orders. In Hex they use expansions with 15 terms, for three properties, so 45 terms in total to represent your protein. The number of variables that you need to represent your protein surface will be 45.
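The idea of a surface as a linear combination of spherical harmonics can be shown with the lowest orders written out explicitly. This sketch hard-codes the real spherical harmonics up to l = 1 only; a shape-based program like Hex uses far higher orders, and the function name and coefficient format here are made up for the example.

```python
import math

def surface_radius(theta, phi, coeffs):
    """Radius of a star-shaped surface as a truncated real spherical-harmonic
    expansion r(theta, phi) = sum over (l, m) of c_lm * Y_lm(theta, phi).
    Only l = 0 and l = 1 are implemented here; more terms sharpen the
    resolution, as in shape-based docking. coeffs: dict (l, m) -> c_lm."""
    k1 = math.sqrt(3.0 / (4.0 * math.pi))  # common l = 1 normalization
    y = {
        (0, 0): 0.5 * math.sqrt(1.0 / math.pi),
        (1, -1): k1 * math.sin(theta) * math.sin(phi),
        (1, 0): k1 * math.cos(theta),
        (1, 1): k1 * math.sin(theta) * math.cos(phi),
    }
    return sum(c * y[lm] for lm, c in coeffs.items())
```

With only the (0, 0) coefficient set, the surface is a perfect sphere, which matches the point in the lecture that a single term cannot distinguish different shapes; adding higher-order coefficients deforms the sphere toward the actual molecular surface.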
And that's much less than representing all atoms in your system, because then the number of variables that you're dealing with is on the order of tens or hundreds of thousands, depending on the size of your system. Because of that you can do a very fast search. So Hex is a very fast docking software, but again, you cannot modify the shape. It's a rigid-body docking software, but you can do this search very efficiently. And spherical harmonics are also used in gaming, for example. If some of you are gamers, you have nice shadows in games these days, and the shadows are generated using spherical harmonics. So the graphics cards, the gaming cards and the professional cards, have very efficient libraries to deal with spherical harmonics. There are some docking software packages that have been built for gaming cards, making use of the gaming libraries, because everything is there to represent those molecules. So that's quite nice. And by varying the number of terms in the expansion that you are using, you can also change the resolution. If you use only a single term, you're representing your whole protein as a sphere, and of course there you're not going to be able to distinguish different solutions very well because everything will look the same. But you can see that different combinations are going to affect the resolution. And yet another way of representing your protein, which also considers only the surface, is basically to express your problem as a 3D puzzle problem. You might know 3D puzzles where you can build the Eiffel Tower in 3D. This is the same principle. You're going to decompose the surface of your molecule into puzzle pieces by doing a triangulation. And when you solve a puzzle, you're not taking all the pieces of the puzzle and trying to find a solution in one go; that's a very hard problem.
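As a toy version of the spherical-harmonics idea, the radial surface function r(theta, phi) of a star-shaped molecule can be least-squares fitted to a truncated basis of real spherical harmonics. This sketch only goes to order l = 1 (four coefficients) rather than the order-15 expansions mentioned above, and all function names are illustrative:

```python
import numpy as np

def real_sph_harm_basis(theta, phi):
    """Real spherical harmonics up to l = 1 (4 terms) at the sample
    directions.  Low-order only, for illustration; Hex-style methods
    use much longer expansions."""
    return np.stack([
        np.full_like(theta, 0.5 / np.sqrt(np.pi)),               # Y_0^0
        np.sqrt(3 / (4 * np.pi)) * np.sin(theta) * np.sin(phi),  # Y_1^-1
        np.sqrt(3 / (4 * np.pi)) * np.cos(theta),                # Y_1^0
        np.sqrt(3 / (4 * np.pi)) * np.sin(theta) * np.cos(phi),  # Y_1^1
    ], axis=1)

def fit_surface(radii, theta, phi):
    """Least-squares coefficients of r(theta, phi) in the truncated basis:
    a handful of numbers instead of thousands of atomic coordinates."""
    basis = real_sph_harm_basis(theta, phi)
    coef, *_ = np.linalg.lstsq(basis, radii, rcond=None)
    return coef
```

Fitting a perfect sphere gives a nonzero coefficient only on the constant term, which is the "single term, everything is a sphere" limit mentioned in the talk.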
What you do is take one piece of your puzzle, then search for another piece that will match, and that's how you start building your puzzle. These kinds of methods do exactly the same. Once you have decomposed your protein surface into a series of puzzle pieces, you just need to take one piece of the puzzle and look for a matching piece on the other molecule. And that's a very fast process. So PatchDock, which is based on this principle and comes from the Nussinov and Wolfson groups, is again a very fast docking software because it does this matching. It's the same principle that is used in image recognition. Face recognition on your phones uses a similar thing: you define the face of a person as a kind of puzzle piece and then you try to find matches in the libraries that you are looking at. So this was about how you represent the system. Then you have to worry, of course, about the search. In principle, we have a six-dimensional problem, and for each model that we generate, we're going to need to calculate a score, and this can be the time-consuming part. Now the translational search, if you're using grids, can be carried out in Fourier space. If you're using spherical harmonics, you're going to do the rotational search in Fourier space instead. Typical methods that use grids and FFTs are ZDOCK, GRAMM, FTDock, PIPER, ClusPro. A lot of those software packages are actually developed in the US. BiGGER I already mentioned; it also uses a grid, but it does a direct search. So what do you have to do? You take your protein, your receptor, which will typically be the larger of the two components. You discretize it on a grid. You represent the surface, which is kind of bluish, you might see it or not, and the inside, and you calculate the fast Fourier transform of this grid and take its complex conjugate.
For your ligand, which will be the smaller of the two molecules, you're going to sample rotations, and for each rotation you need to generate a grid, and the grid should have the same size as the grid of your receptor. You take its fast Fourier transform and then you calculate the correlation function in Fourier space. And when you do that, illustrated here as a two-dimensional correlation function, for each combination of X and Y translations you get a correlation score, and you're going to store the highest point in this correlation map. And for the best combination, if you take the inverse fast Fourier transform, what you are generating is the solution, your complex, basically. So this is how it's working. Now with a systematic search like this grid-based method, you can carry out the search at different resolutions by changing, for example, the spacing of your grid, so that you speed up the calculation. And you will need to refine your solutions at the end because, since you are working with grids, there is no intrinsic flexibility. Typically, those methods generate models that have a lot of bumps at the interface. So if you were to submit them to the PDB, if the PDB were accepting such models, you would probably get a very long report about all the clashes that you have at the interface. So you need to refine those solutions first. Now you also have energy-driven search methods. Here we are going to use molecular dynamics, energy minimization, maybe genetic algorithms, Monte Carlo, whatever method, with some kind of energy function, and you are trying to find the minimum of this energy function, which we assume will give you the right solution to the problem. And here, often, the search is combined with some kind of simulated annealing protocol, like you might see in NMR structure calculations.
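The translational part of this FFT trick can be demonstrated on a toy 2D grid. The real methods work in 3D and add an outer loop over rotations, and the grid contents here are invented: the correlation over all cyclic translations comes out of one forward and one inverse transform.

```python
import numpy as np

def fft_translation_scan(receptor, ligand):
    """Cross-correlation of two equally sized grids over all cyclic
    translations, computed in Fourier space (the trick used by
    ZDOCK-like methods).  The argmax of the map is the best shift."""
    R = np.fft.fftn(receptor)
    L = np.fft.fftn(ligand)
    # corr[t] = sum_x receptor(x) * ligand(x + t)
    return np.fft.ifftn(np.conj(R) * L).real

# toy example: slide a small square "ligand" over a receptor grid
receptor = np.zeros((16, 16))
receptor[4:8, 4:8] = 1.0       # favourable surface region
ligand = np.zeros((16, 16))
ligand[0:4, 0:4] = 1.0
corr = fft_translation_scan(receptor, ligand)
best = np.unravel_index(np.argmax(corr), corr.shape)
```

One pair of transforms scores every translation at once, which is why these methods can afford to loop only over rotations.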
So if you are doing this kind of energy-driven search method, you have to think about how to start your optimization process, and you're going to have to start from many different starting configurations. If you do the systematic search, then you search everything, so you have covered the entire interaction space. Here, you would have to repeat your calculation many, many times with different starting conditions and hope that you optimize your system and see some convergence. So here is one example, which is implemented in ICM from Abagyan's group, where they define anchor points. You put pins that are equally spaced on the surface of your protein, and your starting points will be all combinations of those pins for your two molecules. Or you might say, I separate the molecules, I randomly rotate them, and that's my starting point. That's, for example, what we are doing in HADDOCK. And if you do that enough, then you should have a proper sampling of starting configurations. So now flexibility, and that's very much a challenge still in the field. Just to give you an idea, there are benchmarks that we use in the docking field to measure how well a method is working. And in those benchmarks, the targets are classified as easy targets, where there are no conformational changes, medium targets, and challenging targets. And the challenging ones will be complexes where there is more than 2.5 angstroms of conformational change between the free form and the bound form. And 2.5 angstroms is not that much, okay? And that's already classified as a challenging case. So flexibility makes everything more problematic. You have an increased number of degrees of freedom, but it also makes your scoring more difficult because your energy landscape is going to be much noisier once you add flexibility. There are a lot of methods that you can use to study the dynamics of proteins.
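Generating one such random starting configuration, separating the molecules and giving the ligand a uniformly random orientation, might look like the sketch below. The 150 A separation is an arbitrary illustrative value, not an actual HADDOCK parameter.

```python
import numpy as np

def random_rotation(rng):
    """Uniform random rotation matrix via a random unit quaternion
    (Shoemake's method), so no orientation is favoured."""
    u1, u2, u3 = rng.random(3)
    x = np.sqrt(1 - u1) * np.sin(2 * np.pi * u2)
    y = np.sqrt(1 - u1) * np.cos(2 * np.pi * u2)
    z = np.sqrt(u1) * np.sin(2 * np.pi * u3)
    w = np.sqrt(u1) * np.cos(2 * np.pi * u3)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def random_start(ligand_coords, separation=150.0, rng=None):
    """One starting configuration: randomly rotate the ligand about its
    center and place it `separation` angstrom away along a random
    direction from the receptor (assumed centered at the origin)."""
    if rng is None:
        rng = np.random.default_rng()
    R = random_rotation(rng)
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    center = ligand_coords.mean(axis=0)
    return (ligand_coords - center) @ R.T + separation * direction
```

Repeating this for every docking trial gives the broad sampling of starting conditions the text calls for.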
You can use molecular dynamics, you can use elastic network models to try to predict if a protein is going to change its shape or not. The big problem is that we cannot really predict when conformational changes are needed. It's not because a protein is mobile or has some dynamics that this motion is relevant for the binding that you want to describe. And usually if you put a lot of flexibility into your modeling, you make everything more complicated, and it might hurt you when you don't need it. Okay, the example I like to give: so here is some flexibility, okay? This flexibility will be very relevant if I want to give someone a hug. But if the binding site is on my back, this is completely irrelevant, okay? And it might just complicate your modeling process. So don't start in general with the most fancy method. I think an important message here is: start with the simplest method first, and when it does not work, go to the next level of description of your system. So even if we can study the conformational dynamics of proteins, we don't know when we need conformational changes and when we don't. And I think that's a big challenge. So we are still limited in what we can describe. And how you describe flexibility also depends on how you represent the system. If you only use the surface of your protein, or if you use grids, grids are by definition rigid in this case. So the first way of describing flexibility in the docking field was what was called soft docking. So this would be hard-sphere interactions, where the spheres simply cannot overlap; that's what happens when you play pool billiards. And this would be soft interactions, where you allow for some overlap between your atoms, your particles. And this is one of the first ways that was used to describe an implicit level of flexibility in docking. But when you do that, you will have to remove your bumps before doing something with your models.
A simple way when you use grids: I said, okay, we represent the surface on the grid and we represent the core of the protein. So what you can do is empty the grid points that correspond to the surface side chains, so that if you have overlap in this region, it's not going to penalize you. If you have overlap of gray with gray, it's going to give you a penalty, a badly scoring solution. If you have an empty point, it's neutral. So you're going to allow for overlap at the surface of your molecule, typically. This is one example of dealing with such flexibility. You can also pre-sample conformations. You can do molecular dynamics again, or elastic network models. You can use an NMR ensemble of starting conformations if you have them, meaning that you will have possibly the backbone and side chains in different states. And then you're going to do the docking from all those states, or maybe use them all at the same time. So there are different ways to do that. And this you can use both in rigid and flexible docking. And then you can use explicit flexibility, in a system where you're going to allow the atoms to move during the modeling process, where you can say, well, I'm going to optimize side chains, or I'm going to optimize side chains and backbone. This, of course, makes the computation more expensive, and it's typically only introduced at later refinement stages. So HADDOCK is such an example of a flexible docking slash refinement software. Rosetta also has some kind of flexibility. The two are different approaches: in Rosetta, you first do flexibility along the backbone and then you build the side chains; in HADDOCK, we put flexibility first along the side chains and then deal with the backbone. Then comes the scoring. So how do you know what is a good solution and what is a bad solution?
That's really the holy grail in docking. We are typically looking for a needle in a haystack: you generate a lot of solutions and you want to fish out the good ones. So if you have the perfect scoring function, the remainder is just CPU time, okay? That's sampling, and that's something we can solve in principle. But the scoring also depends on how you represent your system, and it also depends on how you deal with flexibility. So there is really no universal scoring function. Some scoring functions have been optimized for use with a specific type of docking approach. If you do rigid-body, grid-based docking, there are scoring functions optimized for that. And if you do flexible docking with HADDOCK, for example, we have different scoring functions. And what you also see in the field is that people are tuning scoring functions that are specific to a particular type of complex. They might do very well for antibody-antigen scoring but poorly for another type of complex. So these have advantages, because if you have enough data to train the perfect function for the problem that you are interested in, it's going to help you. But they are not general scoring functions. Say you're going to dock a protein with a peptide, with a piece of DNA: how are you going to combine all these specific functions if you want to score the entire complex? In principle, the chemistry and the physics in every molecular recognition process are the same. But because we are still limited in our ability to really distinguish what is good and bad, we often end up optimizing scoring functions that are specific to different types of complexes. So what do you find in typical scoring functions? A combination of different terms like intermolecular van der Waals energy and electrostatics. You might account for hydrogen bonds across the interface explicitly.
The amount of surface which is buried when you form your complex is in many scoring functions. You might calculate some more empirical energies like desolvation. When you bury an interface in a protein-protein complex, you will have to remove water from that interface in most cases. And this might be a bonus: typically if you bury hydrophobic surfaces, you gain energy because you release water into the bulk, so it's an entropic term. If you bury charges, you might pay a price because you have to desolvate those charges first, and this costs energy. You might use statistics, amino acid interface propensities from analysis of the PDB, or statistical potentials like pairwise residue or atom contacts, again based on the knowledge that you extract from the Protein Data Bank. And if you have data, of course, you might want to use your data in your scoring function as well. In general, the more sophisticated the scoring function, the more computationally expensive it becomes. And at the end of the day, it's also a question of how much time you want to spend on the entire process. And at the end, often, if you do this kind of modeling, you might be generating thousands or possibly sampling millions of solutions, but you cannot handle millions of solutions; at least you cannot look at them all. So what is often done is to cluster the solutions, meaning that you want to put them in the same bag if they resemble each other. That's the clustering problem. The idea is, if you have some kind of energy landscape, you want to put all the solutions that resemble each other together, and then you might do your scoring on a cluster basis. There are different ways of doing this clustering. Often it's based on the positional root mean square deviation: you calculate some kind of RMSD between your models, you generate an RMSD matrix, and then you do your clustering based on that. But you can also use different methods.
So we introduced one which is called the fraction of common contacts, which has the advantage that you don't need to do any fitting. For example, here is the problem with RMSD: these are two models of the same complex. If you look at them, you will probably tell me, oh, these are the same solutions. But when you do your docking, you give names to the different chains, A, B, C. What you see here is ABC and ACB. And when you do your RMSD calculation, you might have to calculate both chain assignments if you want to find that the RMSD is low, because if you just do the straight calculation of ABC against ACB, this is going to give you 20 angstroms RMSD, while in reality it's the same model. This happens when you have symmetry in your system. So we have defined a method based on the fraction of common contacts. You don't need to do any fitting: you calculate the contacts at the interface and you cluster based on this information. So this is very fast. And it has advantages compared to RMSD, especially when you start looking at complexes that consist of more than two molecules: the RMSD becomes rather insensitive to what the other molecules are doing, while the fraction of common contacts might be better at clustering those solutions. Okay. So now let's move to our approach to the docking problem, which is HADDOCK. HADDOCK is an integrative modeling platform which allows you to incorporate all the data that we have been discussing today in the first part of my talk. HADDOCK has been developed now for more than 15 years. And the idea for HADDOCK came from an NMR problem. I'm part of a large NMR lab, doing computational work in the middle of experimentalists. And there was, probably now 16, 17 years ago, a PhD student who was studying a protein-protein complex and could not collect the data to solve the structure in the classical way, meaning intermolecular NOEs.
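A bare-bones version of the fraction-of-common-contacts idea can be written down directly. This sketch is simplified to one representative point per residue; the published method works on atom-level contacts, and the 5 A cutoff here is illustrative.

```python
def interface_contacts(coords_a, coords_b, cutoff=5.0):
    """Residue-residue contacts across the interface: pairs (i, j) whose
    representative coordinates lie within `cutoff` angstrom.  Toy version
    with one point per residue."""
    contacts = set()
    for i, a in enumerate(coords_a):
        for j, b in enumerate(coords_b):
            if sum((x - y) ** 2 for x, y in zip(a, b)) <= cutoff ** 2:
                contacts.add((i, j))
    return contacts

def fcc(contacts_1, contacts_2):
    """Fraction of the contacts of model 1 also made by model 2.
    No superposition is needed, so chain-relabelled but otherwise
    identical models still come out as similar."""
    if not contacts_1:
        return 0.0
    return len(contacts_1 & contacts_2) / len(contacts_1)
```

Clustering then groups models whose pairwise FCC exceeds some threshold, which avoids the fitting and chain-permutation problems of RMSD.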
But he had these titration experiments, which basically tell us where the binding occurs. And that's when the idea came: can we encode this information into some kind of ambiguous distance restraints that will bring the interfaces together without pre-defining what the orientation should be? And since then, we have been adding support for many, many different types of information. So the current version, which we are going to use in the tutorial this afternoon, can handle up to six molecules. So we can do docking of more than two molecules. You cannot do that if you use FFT-based grid methods; there you can only dock two molecules. So if you want to do a trimer, you would have to dock a dimer, and onto the dimer solution dock another molecule. We can also use symmetry. I didn't speak about symmetry, but actually if you know that you are dealing with symmetrical complexes, this is also information, and you can impose symmetry on your solutions. And we're using here tricks that were developed for NMR by Michael Nilges in the early 90s to deal with structure calculations of symmetrical homomers. And we can use the same concept for imposing various types of symmetry. So we have some flexibility; I'm going to come to that. We do refine our structures in explicit solvent. So actually the protocol that we are using in HADDOCK is pretty much an NMR-type structure calculation slash refinement protocol. So how do we encode this fuzzy information that we get from all kinds of experimental data, where you know where things are binding, but you don't know in which orientation they bind? We use the concept of ambiguous interaction restraints. The NMR people among you are probably familiar with ambiguous NOEs, where multiple assignments are still possible for an NOE, and we do the NMR structure calculations using some ambiguity.
And the ambiguity in NMR is typically of the order of, say, five to ten possibilities. Here the ambiguity level that we are dealing with will be more of the order of several thousand. So we have a list of residues for the first protein or the first molecule, and a list of residues for the second molecule. These are the residues that you predict or identify as being important for the binding. And we distinguish two types of residues. The experimental ones, the active residues, are the ones indicated in red in this simple plot here. And since experimentally you typically never detect the interface perfectly, we want to increase the fuzziness of our interface by selecting the surface neighbors. These are what we call passive residues in the context of HADDOCK. So what we're going to do is define a distance restraint between any active residue on one molecule and all active and passive residues on the other molecule. And we're going to calculate all distance combinations between all atoms of this residue here and all atoms of all the residues on the other side. To give you an example, if we assume that on average we have 10 atoms per amino acid, and I have 10 residues defined as interface on molecule B, then in this calculation I could calculate 1000 individual distances between all combinations of atoms. Those 1000 distances, I'm going to enter them in this sum here, one over the distance to the sixth power. So the NMR people will say, oh, that's the NOE averaging effect of the dipole-dipole interaction. The molecular dynamics people might tell you, oh, this is the attractive part of a Lennard-Jones function, one over r to the sixth, okay? So we take this sum. When you add distances to this sum, the sum grows and grows, so you get a larger and larger number. And then you take the inverse sixth root of the sum.
So you transform the sum back into a distance, and that gives you one number. This number is always going to be shorter than any distance that entered the sum; it's just a mathematical property. If you take, as an example, 1000 distances of five angstroms and you enter them in this summation, what you get back is about 1.58 angstroms. So the effective distance that you are working with will be shorter than the shortest distance. And it doesn't matter which contacts are made: as long as one contact is made, you're going to get a short distance out. And then we use energy functions of the kind implemented for NMR distance restraints, with a flat bottom, then a quadratic part, and then becoming linear so that at longer distances, for larger violations, the force becomes constant. So this is a classical NMR restraining function which you use for structure calculation, and we use exactly the same concept, but the level of ambiguity will be of the order of a thousand or several thousands. And you will have one distance restraint per amino acid that you define as being part of the interface. So you end up with a network of ambiguous interaction restraints that is going to pull your interfaces together, but which is not going to pre-define in which orientation the binding should take place. Okay, as long as one contact is made out of these 1000 that potentially enter the sum, you're going to satisfy your energy function. Since the data are never perfect, what we also typically do, a bit to deal with the GIGO (garbage in, garbage out) principle, is randomly delete a fraction of the data. For each docking trial that we do, by default we delete 50% of the data. If we use bioinformatic predictions, we like to overpredict the interface: for the HADDOCK approach, it's better to be too generous in your definition of the interface than too restrictive. So when you use bioinformatic data, we might delete up to 90% of the information for each model that we generate.
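The two ingredients just described, the inverse sixth-power summation and the flat-bottom restraining potential, are easy to write down. The force constant and switching distance below are illustrative placeholders, not HADDOCK's actual parameters:

```python
def effective_distance(distances):
    """(sum_i d_i^-6)^(-1/6): always shorter than the shortest d_i,
    and dominated by whichever contacts are actually made."""
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)

def soft_square_energy(d, d_upper, k=50.0, switch=1.0):
    """Flat-bottom restraint energy: zero below the upper bound,
    quadratic for small violations, then linear (constant force)
    beyond `switch` angstrom.  k and switch are made-up values."""
    violation = d - d_upper
    if violation <= 0:
        return 0.0
    if violation <= switch:
        return k * violation ** 2
    # continue with the slope of the quadratic at the switch point
    return k * switch ** 2 + 2.0 * k * switch * (violation - switch)

# 1000 individual distances of 5 A collapse to an effective ~1.58 A
d_eff = effective_distance([5.0] * 1000)
```

With identical distances the formula reduces to d * N^(-1/6), which is where the roughly 1.58 A for 1000 distances of 5 A comes from.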
Next to this experimental information that we put in, we have a classical energy function that describes the bonds, the angles between atoms, torsions, and the non-bonded interactions, van der Waals and electrostatics. And this makes it different from NMR: in NMR, when you do structure calculations, typically you only have a repulsive term for the van der Waals, because you don't want clashes between atoms, and you don't use electrostatics until the final refinement. Here electrostatics is going to play an important role in defining the orientation. So we use from the start a proper van der Waals energy function and proper electrostatics. And we add the experimental energy terms. This will be this distance term, but we can also use RDCs, pseudocontact shifts; we have support for a variety of terms there. And the search in HADDOCK is a combination of energy minimization and molecular dynamics. We're going to calculate the derivative of this function, which gives you the forces, and the forces tell us in which direction we should search to locate the minima. Okay. So if you are not in the modeling field: John introduced me as being born in Switzerland, and in Switzerland we have a lot of mountains. In the US also, but ours are more concentrated because it's a very small country. So the problem is to locate the lowest-altitude point in Switzerland, or the lowest-energy point in your energy landscape. If you only know the energy, you are moving in the mountains in Switzerland blind, and you have an altimeter that tells you: now you are at 2,000 meters. But you have no idea what your next step should be because you don't see the landscape. And the landscape is the forces in modeling, the derivative of the energy function. If you open your eyes in the mountains anywhere, Colorado for John, then you know in which direction you should walk if you want to go down.
And that's what is guiding our modeling process here. This is also what guides the modeling process when you do molecular dynamics simulations. If you do Monte Carlo simulations, typically you only use the energy and you do random moves, but you have no idea in which direction you need to move to locate the minimum. And then we're going to use molecular dynamics types of simulations, where we are solving Newton's second law of motion: force equals mass times acceleration, or acceleration equals force divided by mass, which is what is written here. And if you have defined your energy function, then by calculating minus the derivative of the function with respect to the positions of the atoms, you get the forces. And if you have the forces, you can integrate this equation twice as a function of time, and then you go from the forces to new positions. You repeat that process many, many times and you generate a molecular dynamics trajectory of your system as a function of time. In HADDOCK, we're also using torsion angle dynamics as a way of sampling more efficiently. This is something that was also developed mainly for NMR structure calculations, where we typically have a low ratio of experimental data with respect to the number of variables. So the degrees of freedom are no longer the motion of each atom in the x, y, z directions; the degrees of freedom are the rotations around the bonds. And that's very useful in HADDOCK because it makes it very easy to freeze some rotations or to release some rotations. So in our refinement process in HADDOCK, we first optimize the side chains and then side chains and backbone. And you do that simply by defining which rotations are allowed to move and which are frozen. HADDOCK actually uses CNS as its structure calculation engine, which is derived from X-PLOR. So we have three stages in our process. The initial stage is a rigid-body minimization, which we call it0.
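The integrate-twice step can be illustrated with a one-particle toy: a harmonic "bond" with energy E = 0.5 k x^2, whose force is F = -dE/dx = -k x, integrated with the velocity-Verlet scheme. This is a generic MD integrator sketch, not HADDOCK's actual code, and the constants are arbitrary:

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate Newton's second law (a = F/m): forces derived from the
    gradient of the energy carry the particle to new positions, step
    after step, producing a trajectory in time."""
    f = force(x)
    for _ in range(steps):
        x = x + v * dt + 0.5 * (f / mass) * dt * dt   # new position
        f_new = force(x)                               # force there
        v = v + 0.5 * (f + f_new) / mass * dt          # new velocity
        f = f_new
    return x, v

# harmonic "bond": E = 0.5 * k * x^2, so F = -dE/dx = -k * x
k, mass = 1.0, 1.0
x, v = velocity_verlet(1.0, 0.0, lambda x: -k * x, mass, dt=0.01, steps=1000)
```

A good integrator conserves the total energy 0.5 m v^2 + 0.5 k x^2, which is an easy sanity check on the trajectory.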
Here the molecules are treated as rigid bodies, rock solid. Then we perform a simulated annealing optimization of the model where we introduce flexibility. And then we refine the solutions in explicit solvent. This protocol is basically derived from an NMR structure calculation protocol. HADDOCK was originally derived from ARIA, and some of you might know ARIA as an automated NOE assignment and structure calculation method for NMR. So in the rigid-body stage, we sample on the order of 10,000 to 100,000 models, and we write only a fraction to disk. This is quite fast, on the order of maybe a few seconds for ten models typically. We do a first filtering: we take a fraction of the models that are generated here and give those to the flexible optimization stage. In this stage we now use torsion angle dynamics, so the rotations are the degrees of freedom that we optimize. This goes through simulated annealing with multiple steps, where we first introduce flexibility only along the side chains of your proteins, and then along side chains and backbone. Typically, which regions become flexible is automatically defined: only the residues that are at the interface will be treated as flexible. And then the models that come out of this stage are solvated in a layer of TIP3P water, about eight angstroms. It's not a proper molecular dynamics simulation; it's a very short refinement, again like what you might do in an NMR structure calculation protocol. It's a few picoseconds at most that we are simulating. So nothing spectacular happens here, but you do improve the contacts between your molecules and you do improve the energetics as well. During this stage, you might model small conformational changes. How much conformational change you can induce at this stage depends on the amount of data that you give to the system.
Typically one to two angstroms is doable. We have seen cases where we get up to five angstroms, but that was because there were a lot of data driving the system. Typically, don't expect miracles. It is not going to fold your protein. If you are docking a peptide onto a protein and it has to fold into an alpha helix, this is not going to happen. So this is basically the protocol in a nutshell. Our starting point for the docking process is the molecules separated in space, randomly rotated. You see here a system: this is the central protein here, and these are the starting positions of the second molecule in this docking process. Then we proceed with our rigid-body energy minimization, and this is how it looks. It goes very fast here; you see the energy being minimized. In this example, this is a complex where the residues that you see as spheres are based on NMR chemical shift perturbation data. Because we have this knowledge of the interface, you are basically only sampling that interface in your modeling. And the result of that is this ensemble of models here. We're looking at 200 models in this case, where you see that all the binding occurs on one side of the protein here, because we put NMR information into it. But there are multiple orientations of the second molecule. Okay, so this still looks like spaghetti. You take the best few hundred models of that and then you bring them into this flexible refinement stage. And this is how it looks. No, I'm in trouble. So we have first a rigid-body dynamics phase. Then you see the side chains become flexible, so now we are optimizing the interface. And in the second and last phase, the backbone becomes flexible as well. This loop just flipped over. So here we are able to model small conformational changes, induced fit effects. Now I need to find back the proper slide. Here we are. So the effect of that is visible here.
You see that you start seeing some clustering of solutions. This was looking like spaghetti; these are linguine now, okay? So you start identifying clusters in there. Those will be submitted to the water refinement. And at the end, what we are doing is clustering the solutions. We put the models that resemble each other into clusters, and we're going to score the clusters. We calculate the average score over the top four models of each cluster. The clusters might have different sizes, but we only want to compare the top four models of each cluster, because the energetics are quite noisy. If you average over the entire cluster, you start seeing effects of the cluster size, and that will bias the scoring that you get. There are docking methods that use the cluster size as a scoring function. Since we put data into the modeling process, this is not a good idea in the case of HADDOCK. So we don't consider the size of a cluster as an indicator of it being the correct solution. Of course it would be nice if the largest cluster were also the best one, but in many cases this is not the case. So in terms of flexibility treatment in HADDOCK, we have several levels. We have an implicit flexibility treatment, meaning that we can dock from an ensemble of structures. We don't do a different docking run for each component of the ensemble; we give the ensemble directly to HADDOCK, and the starting points will be combinations of different models out of the ensemble. And we do scale down the intermolecular interactions during the optimization process. And we have explicit flexibility: we do allow for side-chain rotations, and we do allow for side-chain and backbone motions during the refinement stage. I already explained that we do the scoring on a cluster basis. In terms of energetics, we use the OPLS non-bonded parameters. And this is our scoring function, which is actually very simple.
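Scoring on the top four models per cluster, regardless of cluster size, can be sketched as follows. The scores and cluster names are invented; lower HADDOCK scores are better:

```python
def rank_clusters(clusters, top_n=4):
    """Rank clusters by the mean score of their `top_n` best models, so
    large and small clusters are compared on equal footing (cluster
    size itself is deliberately not part of the score)."""
    ranked = []
    for name, scores in clusters.items():
        best = sorted(scores)[:top_n]          # lower score = better
        ranked.append((sum(best) / len(best), name))
    return [name for _, name in sorted(ranked)]

clusters = {
    "cluster_1": [-120.0, -118.5, -90.2, -85.0, -20.1],  # small but good
    "cluster_2": [-100.0] * 40,                          # large, mediocre
}
order = rank_clusters(clusters)
```

Here the small cluster outranks the much larger one, illustrating why cluster size alone is not used as an indicator of correctness.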
If some of you know the Rosetta energy function, it has many, many more terms, with weights that are specified to probably ten digits. So we have different scoring at different stages. At the rigid body stage, the initial stage, you see here 1% of the experimental information that we put in, 1% of the Van der Waals interactions, full electrostatics, the desolvation energy term, and the buried surface area with a negative weight, meaning we like to see large interfaces at this stage. At the final scoring stage, we have 10% of the experimental information, 100% of the intermolecular Van der Waals energy, 20% of the electrostatics, and the desolvation energy term. This function I actually optimized 15 years ago on a few complexes. You would say, well, the statistics are fantastic. But it has survived 15 years of optimization attempts. I have people with a computer science background in my group, all into machine learning and deep learning these days, and they are trying to beat this function all the time. But apparently it's hard to beat. Of course, you can come up with, instead of 1.0, 0.973315. But I think the accuracy of our scoring function does not justify all those digits. So I like simple terms. I like to keep things simple rather than go to a very complex function, because you optimize it for protein-protein, and then you use it for protein-nucleic acid and you will get different weights. These weights are what we use for pretty much every type of complex. We have some guidelines if you do small molecule docking. And again, if you optimize and you compare the performance over all the different types of complexes, it's still very difficult to beat. So it should really make sense before changing it. Simplicity is important, in my opinion. Okay, so most people are using our software through its web portal, and that's what we're going to do this afternoon.
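The stage-dependent weighting described above can be written down as a tiny weighted sum. The percentages follow the lecture; the buried-surface-area weight at the final stage is assumed zero here, since it is not mentioned for that stage, and the term values are purely illustrative:

```python
# Sketch of the stage-dependent, HADDOCK-style scoring described above.
WEIGHTS = {
    "rigid_body": {"air": 0.01, "vdw": 0.01, "elec": 1.0,
                   "desolv": 1.0, "bsa": -0.01},
    "final":      {"air": 0.1,  "vdw": 1.0,  "elec": 0.2,
                   "desolv": 1.0, "bsa": 0.0},   # assumed zero BSA weight
}

def score(terms, stage):
    """Weighted sum of energy terms; lower is better."""
    w = WEIGHTS[stage]
    return sum(w[k] * terms[k] for k in w)

# Illustrative term values (energies in kcal/mol-like units, BSA in A^2):
terms = {"air": 100.0, "vdw": -40.0, "elec": -200.0,
         "desolv": -10.0, "bsa": 1500.0}
print(score(terms, "final"))       # -> -80.0
print(score(terms, "rigid_body"))  # -> -224.4 (large interface rewarded)
```

Note how at the rigid-body stage the negative BSA weight rewards large interfaces, while at the final stage the Van der Waals term dominates.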
So we are able to run the portal because over the years we have had funding from a variety of European projects. We have access to this large grid infrastructure in Europe and Asia, but also in the US, the Open Science Grid. Currently we have more than 11,000 registered users that have been submitting jobs to the portal, and by now close to 40% of all the computing has run on these distributed grid resources. You can of course also run the software locally, but the local version does less for you than the portal. The portal is going to do more validation. The portal is going to automatically define the protonation states of histidines if you choose so. You have to worry about those small details when you do docking. A histidine can be neutral, and if it's neutral, there are two ways to position the proton on the side chain; or it can be positively charged. And this charge difference can be crucial for the docking success. When you do mutagenesis to test your complex, you might make one mutation that removes a charge or adds a charge, and you might kill your complex in doing that. So when you do your modeling, you have to worry about those small details. The server is going to do that automatically for you. The server also does all the clustering analysis for you automatically. If you run the local version, you will have to do a number of things manually. This afternoon we're only going to use the server in the tutorial. So these are statistics from 2016, and you can see where the software is used: pretty much everywhere. But what was more interesting to us at the time was to see what people are doing with the software. So the light blue fraction is protein-protein modeling. That's the most common use of the software. And then the orange one is protein-peptide, so more flexible peptides. That's also quite a large fraction. Almost two-thirds is in the protein world.
Then you get into nucleic acids, DNA and RNA, green and yellow, which is also a significant fraction. And at the time, what was surprising to us is that quite a number of people are using it for small molecule docking as well. This is the dark blue region here. At that time we had never really published any benchmarking of small molecule docking with HADDOCK. The software has been able to support small molecules since the start, because we wanted to be able to include cofactors in the modeling process. Cofactors, if they are close to the interface, might be important for proper binding. So the ability was there since the beginning, but we had never systematically looked at it. Since then we have been doing more on small molecules, but it's really able to do a combination of things. And probably the most surprising thing that I've seen recently was a group doing docking of proteins against a nanodisc. So that's a lipid disc with a protein belt around it to stabilize it, and they were docking a protein, based on NMR data, onto the nanodisc. They submitted that to the portal, and it accepted it and was able to handle it. I never thought that this would have been possible, but apparently the machinery in that case was robust enough. Not everything works, of course. There are things that are difficult to represent, like glycosylation of proteins. That's a bit of a tricky thing to deal with. It's possible, but you cannot get the covalent bonds of the sugar to the protein described properly. So there are still plenty of things to do. So this is where the idea of HADDOCK came from, 15 years ago pretty much, and this is the complex that we were looking at at the time. That's an E2-E3 complex in the ubiquitination pathway. We could not collect the NOE data to solve the complex in a classical way by NMR, but what we had were chemical shift perturbations. For the small one, this is the RING domain of the E3.
So you see again the typical pattern with defined regions on the surface that are affected. This is the E2 side, so we had mapping on both sides. We did the docking, and we ended up with two sets of solutions at the time. And we could not distinguish those solutions based on the score; they have similar scores. But what the models were good at was generating hypotheses, or candidates for mutagenesis. So here you see one important salt bridge, aspartate 48 and glutamate 49, binding in this example to lysine 63. And here you see the same glutamate-aspartate combination, but not binding to a lysine, on the other side. So those two solutions are 180-degree rotated solutions. And this is something that we observe quite often, actually: symmetrical solutions coming out of the docking which are often difficult to distinguish. Nature has solved this problem, because nature usually comes out with one solution. But in terms of modeling, it's still a challenge to distinguish these symmetrical solutions. So based on this information, our collaborators, the group of Marc Timmers in our medical center, went to do mutagenesis, and they had a yeast two-hybrid assay to measure binding. So this is the native complex here. Then you do mutations on one side, on the other side, and you start combining those mutants. And there is one particular combination here which gives strong binding: if you swap glutamate 49, so this one, for a lysine, and you change lysine 63 here for a glutamate. So this is basically a swap of amino acids across the interface. And this restores binding. So the data are now telling us that this is the right solution, and the other was a false positive of the modeling process. This is an example of how you can validate the model based on some mutagenesis data.
And actually you see quite a number of papers coming out, also in high-impact journals, where people have some NMR or other experimental data but cannot solve the complex, so they do some modeling. Based on the model, they do mutagenesis, they validate the model, and this gets published in top journals. So it's kind of accepted these days that you can use docking, or information-driven docking, to generate models, provided you take the steps to validate them. And actually, even when you do crystallography: in Utrecht I have a colleague, Piet Gros, who is quite a famous crystallographer, and he told me, even when we solve the crystal structure of a complex, I don't trust the crystal structure directly. So we do mutagenesis in the interface seen in the crystal structure to test that what we are observing is a biologically relevant interface. So even crystallographers actually do that. And last year there was a funny paper where they said: we validated our crystal structure by doing HADDOCK docking. So now people are using docking to validate their crystal structures as well. It's probably not mainstream research, but they really wrote it that way in the abstract, which was kind of funny. Okay, so for this field I need to introduce a little bit of metrics as well. How do we measure the quality of models? You can only measure that if you know the answer. This is what CAPRI is about, for example, or this is what you use when you're optimizing your methods. So in CAPRI we have an assessment terminology where we give stars. This is the Michelin guide of docking, and we have different metrics. One metric is: you fit your model against the known reference structure only on the interface, and this gives you the interface root mean square deviation. To generate an acceptable model in CAPRI, the interface RMSD should be below four angstroms. If you are below two, you have quite a good solution.
So you get two stars, and if you are below one, you have a high-accuracy solution. The other measure which is used is the ligand RMSD. Here you fit on the largest molecule, the receptor, and you calculate the RMSD on the smaller molecule. This typically gives you larger RMSDs, because you kind of maximize the differences on the second molecule. So here the cutoff is 10 angstroms for acceptable, five for good, one for excellent. And what we also measure is the fraction of native contacts at the interface that you reproduce. You need at least 10%, which seems like a very small number, but it just tells you the difficulty of the problem. 10% will be acceptable and more than 50% will be high quality. And we're going to see those metrics coming back during my talk. So now we can ask the question: what is the relationship between the accuracy of the models that we are generating and the data that we put in? Or we could say: where does the model stop and the structure start? You have to realize that everything which is in the PDB is a model to some extent. There is no single structure in the PDB into which no modeling went. Even when people solve crystal structures at high resolution, the bond lengths don't come from the diffraction data; they are put in when the model is built. The angles, the same thing. They may choose ideal rotamer states when they build a side chain. So these are all models; models are everywhere in the PDB. Maybe only the small molecules in the Cambridge small molecule database have structures that have been solved really from scratch, using only the data, because the resolution is really high. Actually, we use this database to derive the information about bonds and angles that we use as models when we build crystal structures. And if you go to the PDB and try to find out how much data you need to call your model a structure, there are no rules there.
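The cutoffs above can be turned into a small star-rating helper. Note this is a simplification: the official CAPRI criteria combine interface RMSD, ligand RMSD and the fraction of native contacts in a stricter way; the sketch below just applies the per-metric thresholds as quoted in the lecture:

```python
def capri_stars(i_rmsd, l_rmsd, fnat):
    """Simplified CAPRI-like star rating (0-3) from interface RMSD (A),
    ligand RMSD (A) and fraction of native contacts (0..1)."""
    if fnat < 0.10 or i_rmsd > 4.0 or l_rmsd > 10.0:
        return 0                          # incorrect model
    if fnat > 0.50 and (i_rmsd <= 1.0 or l_rmsd <= 1.0):
        return 3                          # high accuracy
    if i_rmsd <= 2.0 or l_rmsd <= 5.0:
        return 2                          # good
    return 1                              # acceptable

print(capri_stars(0.8, 0.9, 0.6))   # -> 3
print(capri_stars(3.5, 8.0, 0.2))   # -> 1
```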
For a long time, they have been accepting structures that are in between a model and a structure. Now they are becoming a bit more strict, because there is also a new archive which is ready to accept integrative models, where you combine modeling with some data. So just to give you an example, or a feeling, of the interplay between model quality and the amount of data that you put in, I'm showing you E2A-HPr. We're going to use this as one example this afternoon. That's an NMR complex solved in the group of Marius Clore. What you see in those plots on the X axis is the interface RMSD to the target, which in this case is the NMR structure that was solved using NOE data. On the Y axis, you have the HADDOCK score, which is in arbitrary units. We don't put units on the score because I don't want people to interpret the score as a binding affinity. I think there are a lot of people who think that the numbers you get are binding affinities. That's not the case. We have shown that there is, in general, no relationship between the scoring functions that we use in the field and binding affinity. So you should not make this mistake. Of course, it doesn't mean that scoring is not working. We use scoring because you are scoring one complex and trying to identify the best among different solutions of that same complex; but if you compare the scores of two different complexes, you cannot say that this one has a stronger binding affinity than that one. And if you're interested, I can talk about that in a multiple choice session. So this is the score against a measure of the quality. The color coding: black are the rigid body docking solutions, the first stage of HADDOCK; red is the flexible refinement stage, the second stage; and green is the water refinement stage. Remember that the scoring functions that we use at the three stages are different. So this one is rather flat. This is using only chemical shift perturbation data, so we have knowledge of the interface, and nothing further.
And you see that as we start optimizing, adding flexibility and water refinement, you start seeing energy funnels, or score funnels in this case. And you see here two sets of solutions. This one scores slightly better, and it's about 1.5 angstroms from the average NMR structure. But we still have this symmetrical solution here, which scores quite well in this case. If you use a bit of anisotropy data that defines the orientation of the molecules, so either RDCs or relaxation anisotropy data, in the modeling process, you see that we still get the symmetrical solution, because this is one set of RDCs, so that does not solve your symmetry problem fully. But now the correct solution is much more populated and better defined. And these are the top 10 models that are shown here, so you can see that. With chemical shift perturbations, this is the precision of the models that you get; if you can add orientation information, you get better defined models. And if you put in the NOEs that you have measured, then this is basically an NMR structure calculation: you get only one set of solutions, and it's very close to the average structure deposited in the PDB. And actually HADDOCK has been used a lot to do this kind of modeling as well, because there was no easy-to-use protocol to calculate complexes at the time when we developed the software. So this one I already showed; there seems to be a bit of duplication. So two slides and then we take the break. So, just to show you a bit of performance in CAPRI. These are slides I got from Shoshana Wodak. This was the CASP-CAPRI round of 2014. We participated in that round only with the web server, meaning that you have three days to do the homology modeling and the docking. So in CASP, and in CAPRI these days, you get a sequence and you have to predict the structure of the complex. So these are the results.
So you find the CASP people here and the CAPRI people here. There is the server category, and we only submitted as a server. CAPRI also has a scoring competition, where you are given models from all the predictors and you have to fish out the native-like models. And you can see here where we are standing. So in that round we did very well with HADDOCK. I must say that a number of those complexes were template-based modeling targets, so it was maybe not that difficult. Most of those were also symmetrical complexes, and we can impose symmetry in the modeling, which helps; but we did very well. And you also see that our very simple scoring function that I showed you did very well in identifying correct models. We are actually more successful in scoring than in prediction. And you find some of the Rosetta people in there in terms of scoring, but the Rosetta server is here. So, just to show you that simplicity sometimes also pays off: you don't need very fancy energetics to properly describe your system. HADDOCK is supported by a European project called BioExcel, which is a central hub for biomolecular modeling. So we are in there together with the GROMACS software and other software for quantum and molecular dynamics simulations. And we have a support forum for HADDOCK which is operated under BioExcel, called ask.bioexcel.eu. There you can probably find answers to many of the questions that you might have. It's freely searchable; if you want to post to it, you should register. And now it's time for the coffee break. Okay, let's keep going. By the way, someone asked about the slides: I will create a PDF of whatever I present today and share it with you. So let's move now to a number of application examples of the different types of data that we can use in the modeling process. I'm going to show you examples of using data as restraints to guide the docking process.
So we want to focus the search on, hopefully, the right part of the interaction space. But we can also use data a posteriori, as a filter, to basically eliminate bad solutions. So let's start with the restraints, and we'll start with a rather recent story. This was published two years ago now, and it's an iron piracy story. By the way, probably many of you don't know what a haddock is, so you think of a fish. I used to get this question quite often, often in the US, I must say: why did you name your software after a fish? Well, we didn't name our software after a fish. We named our software after a cartoon character. This guy here, that's Captain Haddock. Captain Haddock appears with Tintin; there are a lot of comics about that in Europe, so maybe not very well known here. He's a sea captain, swearing a lot, drinking a lot, which is also why we named our bioinformatic interface predictor WHISCY, because Captain Haddock likes to drink a lot of whisky. Okay, so there are all kinds of connections there. And this protein is pretty much a pirate as well, because it's hijacking iron from its host. This receptor sits in the membrane of a bacterium, and the bacterium needs iron for its survival, and it gets it by hijacking ferredoxin from its host. So they could crystallize the membrane protein. Here you see a team of different people in different locations in the world. They could get a crystal structure of the membrane protein, but they never managed to get the crystal structure of the complex. Among those names, some of you might recognize people doing NMR, for example Brian Smith. So they went to NMR to try to study the ferredoxin part. The membrane protein was too large for solution NMR, but they could get information about the ferredoxin part, and then they came to us to do the modeling of the complex. That's how the story came together. So what is the information that we have?
So what you see here is again NMR chemical shift titration data: residue numbers here, chemical shift displacements in an HSQC spectrum. If you map those displacements on the surface, you see that they define a well-defined region on the surface. You can probably not see it, but somewhere here is the iron-sulfur cluster. So this is the information that we can use in HADDOCK to define the residues which should be in the binding interface of the complex. Now, you cannot do this kind of experiment on the membrane protein side, or you would have to do solid-state NMR, but it doesn't mean that we have no information. We know that this protein is sitting in a membrane, and we also know which loops are the extracellular loops, and since the bacterium needs to hijack ferredoxin from the host, the binding site must be on the outside. So the ferredoxin must bind somewhere on the surface formed by the exposed loops of the membrane protein. That's information. That's defining a binding site which is, in principle, too large for the small ferredoxin. In the HADDOCK philosophy, we define those residues as passive residues, and we have this color coding: green is passive, red is active. So what is the distinction? If you remember, I explained to you that we define distance restraints between the active residues. An active residue, if it's not at the interface in the model, is going to generate an energy penalty, okay? A passive residue can be in the interface, but if it's not, you don't pay any energy price. So it means that the active residues should sample the green region, but if part of the green region is not covered by the interface, it doesn't hurt you. The distinction of where the protein wants to sit will have to come from the electrostatics and the Van der Waals interactions. So this is what we put into HADDOCK.
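The active/passive restraints described above are implemented in HADDOCK as ambiguous interaction restraints (AIRs), whose effective distance is a sum over atom pairs weighted with an inverse sixth power, so the restraint is satisfied as soon as any single contact is made. A minimal sketch of that effective-distance formula, treating atoms simply as coordinate tuples:

```python
import math

def air_effective_distance(active_atoms, partner_atoms):
    """d_eff = (sum over all atom pairs of 1/d^6) ** (-1/6).
    The -6 exponent makes the sum dominated by the shortest contact:
    one close pair is enough to bring d_eff below the restraint bound."""
    s = sum(math.dist(a, b) ** -6
            for a in active_atoms for b in partner_atoms)
    return s ** (-1.0 / 6.0)

# One active atom; one partner atom at 3 A and one far away at 10 A:
d = air_effective_distance([(0.0, 0.0, 0.0)],
                           [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)])
print(round(d, 2))  # close to 3.0: the far atom barely contributes
```

In the real software the sum runs over all atoms of an active residue and all atoms of the partner's active and passive residues, which is why a passive residue can satisfy a restraint without ever being penalized itself.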
So the passive residues on the external loops, the chemical shift perturbation data, and you get an example of a solution here. You see here two clusters; these are the top two clusters. In this case, the best cluster is also the largest one. In the way that the clustering works in HADDOCK, the cluster number is related to the size of the cluster. So cluster one will always be the largest cluster in terms of population, but the ranking of the clusters is based on the HADDOCK score, and you see the scores here and here. This one is slightly better, but within the standard deviations, you cannot say that it is significantly better than the other. Those two clusters are not so far from each other, so they are probably related, but they are slightly different solutions. If we would like to distinguish which one is the real solution, we would have to find mutations that are specific for one of the two solutions and then test those experimentally. This was not done in this story. But this is just one example of the kind of data that you can use. Another example is based on NMR data. We are now looking, in collaboration with Annalisa Pastore's group in the UK, at an enzyme which recycles ubiquitin before a protein is degraded. So polyubiquitination is a signal for degradation, but it's also a signaling process, and you can create different types of polyubiquitin chains. Here are two examples; there are more than two lysines on ubiquitin, by the way. Here you have the lysine 48 linkage and here you have the lysine 63 linkage. And the enzyme that we are looking at is called Josephin, and Josephin cleaves polyubiquitin, but it seems to have preferences for different branchings of the polyubiquitin chain. The question we wanted to answer was: can we explain the preference of this enzyme for 48 versus 63 linkages? So what is the information that we have here to guide our modeling process?
On Josephin, NMR identifies two binding sites that are on different faces of the protein, and there are also mutagenesis data in those regions. So this is the information that we have. Further, we know the catalytic triad of the enzyme; this is where the cleavage reaction should take place. We are not going to use this information to guide the modeling, but it is information that we can use to distinguish models. A proper model should bring the C terminus of your polyubiquitin chain close to the active site, because this is where the cleavage is going to take place. So how are we going to model this? We did a three-body docking: we take two ubiquitins plus Josephin. We put in this information here to define our binding sites, to define active and passive residues. And then we define one ambiguous distance restraint between the C terminus of the second ubiquitin and lysine 48 or lysine 63. So we basically do not pre-define which linkage should be formed, but we let that happen during the docking process. So now you have an ambiguous restraint to two sites, plus the data that define the binding sites. It's a three-body docking, and this is one set of solutions that you get. In this set, you see that the ubiquitins end up on both binding sites. Not surprising, because we put in data defining the binding sites. And this is the 48 linkage which is created in this case, and in this solution the C terminus gets close to the active site of the enzyme. So this information was not used, but we get, of course, in the same docking run, another set of solutions where the 63 linkage is formed. It's again using both binding sites. If you look at the second ubiquitin, it binds in pretty much exactly the same orientation here, but the first one has a different orientation if you compare those two, and in this orientation the C terminus is pointing to the back of the projection screen, so away from the catalytic site.
So based on our predictions, we would say that the enzyme has a preference for the 48 linkage, because the ubiquitin is positioned in a more favorable way on the surface for the catalysis. That would be the structural explanation for the preference. And if you do the experiment, so that's time on the X axis and the percentage of polyubiquitin reacted on the Y axis, you see that the enzyme is more efficient in cleaving the 48 linkage with respect to the 63. It's not that it's unable to cleave the 63 linkage, but it's very slow. So the biochemical experiment is consistent with the preferences that we see in the docking results. Now, another type of example. We're going to switch from using NMR data to using MS data. We spoke a little bit about mass spectrometry data at the beginning, and one of the types of experiments that I was describing was H/D exchange. This is going to be such an example. We are going to use the data in this case both as restraints and also as a filter, actually, because we have two types of MS data we want to use. So the system we are going to study here is the circadian clock machinery of a cyanobacterium, and the work was done by Adrien Melquiond, a postdoc in my group. So the circadian clock system is the reason why I have jet lag, because now I'm in the US and I have seven hours difference compared to Europe, where I'm coming from. You have an internal system that basically measures time in some way, and this bacterium has such a system, and this system consists of only three proteins: KaiA, KaiB and KaiC. If you express those proteins, purify them, put them in a test tube, add phosphate and add ATP, the clock starts ticking. That's all you need. What's happening is that there is a phosphorylation-dephosphorylation process in the system, and you can monitor the frequency of your clock by following phosphorylation events by mass spectrometry.
So you can really measure the frequency of the clock. It's quite amazing that you can reconstitute such a complex machinery in vitro with three proteins, ATP and phosphate, and that's it. At the time there were structures of the components, but there was no structure of the complex. So we wanted to model the structure of the complex. The first thing that you need to know, of course, is the stoichiometry of the interaction. We didn't speak about that in terms of information sources, but it's a crucial piece of information. And in this case you can do native MS. In native MS, people are able to monitor native protein complexes in the spectrometer. You can make a complex fly as a complex in the spectrometer, and out of that you can get the stoichiometry by analyzing the masses. Which is quite surprising: you have vacuum in your spectrometer, so you would say bad things are going to happen to your proteins, but it has actually been shown that you can recover the protein after the MS experiment, and enzymes that have been measured in this way were still active afterwards. So we know it's a six-to-one complex: six KaiB bind one KaiC. And then they also performed H/D exchange experiments that were analyzed by MS. This is the information that we have. And there was another bit of information that MS can provide. When you do these native experiments, where the complex is flying as a native complex in the spectrometer, you can also do ion mobility mass spectrometry and extract what is called the collision cross-section. This is related to the time it takes your complex to traverse the spectrometer. It is flying in the vacuum of the spectrometer against a gas flow, typically of small molecules. So the hydrodynamic properties of your complex are going to define the time that it takes to bridge a given distance.
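Getting the stoichiometry out of a native-MS mass can be sketched as a brute-force search over small integer combinations of the subunit masses. The masses below are illustrative placeholders, not the real KaiB/KaiC values:

```python
from itertools import product

def stoichiometry(mass_complex, subunit_masses, max_copies=8, tol=50.0):
    """Return all copy-number combinations whose summed mass matches the
    measured complex mass within a tolerance (all masses in Da)."""
    names = list(subunit_masses)
    hits = []
    for counts in product(range(max_copies + 1), repeat=len(names)):
        m = sum(n * subunit_masses[k] for n, k in zip(counts, names))
        if any(counts) and abs(m - mass_complex) <= tol:
            hits.append(dict(zip(names, counts)))
    return hits

masses = {"KaiB": 11_500.0, "KaiC": 58_000.0}   # illustrative values only
print(stoichiometry(127_000.0, masses))          # -> [{'KaiB': 6, 'KaiC': 1}]
```

In practice the tolerance would be set from the instrument's mass accuracy, and ambiguous combinations would need extra experiments to resolve.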
You can probably easily imagine that a donut is going to fly very differently from a cigar, based on their shapes. So MS allows you to measure this drift time, and this is related to the 3D shape. Of course, you measure one number and you extrapolate to a 3D property, so it's a long extrapolation. But it's a bit of information that you can get. As an example, this is used a lot to monitor virus maturation. These are theoretical calculations, where you see a theoretical assembly of the same molecule in different shapes and with different numbers of subunits. Here we are increasing the number of subunits, so this is coming from this story here. And here you see the calculated collision cross-section, in units of square angstroms. What you are measuring effectively is the average shadow of your molecule on the wall, averaged over all the orientations sampled during the experiment. That's why it's a square-angstrom measure. And you see that for the same number of molecules, so if you draw a vertical line, depending on the shape of your system, the collision cross-section that you measure will be different. So that's potentially interesting information to filter docking solutions. So here is KaiB, that's the smallest molecule, and you see, color-coded, the protection data from H/D exchange. The blue regions are protected from exchange in the complex. And this defines a well-defined face of the protein as the binding site. There were also a few mutations already known to affect complex formation. That's the information that we're going to give to HADDOCK. The same experiments were done on KaiC, and you see KaiC is a much larger molecule. It's actually a kind of double donut, two rings stacked on top of each other, and it's not a compact molecule: there's a hole in the center. And the blue regions are the ones that are protected, or that show changes in protection, in the H/D exchange experiments.
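The "average shadow" picture above is essentially the projection approximation used to back-calculate collision cross-sections from models: average the projected area of the structure over many orientations. A rough Monte-Carlo sketch, representing the molecule as equal-radius spheres (note that the two-angle rotation used here is not perfectly uniform over orientations, which is good enough for illustration):

```python
import math
import random

def projected_area(centers, radius, n_pts=20000):
    """Monte-Carlo area of the union of disks obtained by projecting
    equal-radius spheres onto the xy-plane."""
    xs = [c[0] for c in centers]
    ys = [c[1] for c in centers]
    x0, x1 = min(xs) - radius, max(xs) + radius
    y0, y1 = min(ys) - radius, max(ys) + radius
    hits = 0
    for _ in range(n_pts):
        px, py = random.uniform(x0, x1), random.uniform(y0, y1)
        if any((px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
               for cx, cy, _ in centers):
            hits += 1
    return hits / n_pts * (x1 - x0) * (y1 - y0)

def random_rotation(p):
    """Rotate a point by random angles about the z and y axes
    (not uniform on SO(3), acceptable for a sketch)."""
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    x, y, z = p
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    x, z = x * math.cos(b) - z * math.sin(b), x * math.sin(b) + z * math.cos(b)
    return (x, y, z)

def ccs_projection(centers, radius=1.7, n_orient=50):
    """Projection-approximation CCS: projected area averaged over
    random orientations (units: coordinate units squared)."""
    total = 0.0
    for _ in range(n_orient):
        total += projected_area([random_rotation(c) for c in centers], radius)
    return total / n_orient

# Sanity check: a single unit sphere has the same shadow, pi, in every
# orientation, so the estimate should land near 3.14.
print(round(ccs_projection([(0.0, 0.0, 0.0)], radius=1.0, n_orient=5), 1))
```

Production tools additionally account for the buffer-gas radius and scattering physics, which this projection-only sketch ignores, and that is exactly why non-globular shapes can be problematic, as the lecture notes later.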
And we see that we have a problem. First of all, you can see by looking at those donuts that there's a six-fold symmetry, so we have six binding sites per donut. But we have a potential binding site on the top and a potential binding site at the bottom. The stoichiometry is not 12 to 1, it's 6 to 1. So it must be binding either on the top or on the bottom. And when you open the donut, you see that there are also changes in protection at the inner interface. So there's an allosteric process going on here: something binds on top and there's a transmission of information, in some way, to the bottom. Which is probably part of the phosphorylation-dephosphorylation control in this system. So what we did here was two docking experiments: once targeting the top binding mode, which we call C2, and once targeting the bottom binding mode, which we call C1. And when you do that, this is what you end up with: a number of clusters for the top solution and a number of clusters for the bottom solution. We cannot really distinguish, based on the HADDOCK score, which one is better. So we used the collision cross-section to try to filter the solutions and make a distinction between the two. The experimental values are indicated by the dotted lines; this is the experimental range, between 133 and 140 square nanometers. And we back-calculated those collision cross-sections from our models. They are plotted here: in orange the top solutions, in green the bottom solutions. And you see that the bottom solutions are all giving larger values than what is observed experimentally, while of the orange ones, for sure the number one and the number three solutions are nicely in the middle of the experimental range. So based on this, we predicted that this is the correct model for this complex. All nice, all fine, published in PNAS in 2014. And then last year, they finally managed to solve the cryo-EM structure of this complex, actually of all three molecules, KaiA, KaiB and KaiC.
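The a posteriori filtering step just described boils down to a range check: keep only the clusters whose back-calculated collision cross-section falls inside the experimental window (133-140 nm² as quoted; the cluster names and CCS values below are invented for illustration):

```python
# A posteriori CCS filter on docking clusters.
EXP_MIN, EXP_MAX = 133.0, 140.0   # nm^2, experimental CCS window

# Back-calculated CCS per cluster (illustrative numbers only):
models = {
    "top_cluster1": 136.5,
    "top_cluster3": 138.9,
    "bottom_cluster1": 149.2,
    "bottom_cluster2": 153.7,
}

kept = {name: ccs for name, ccs in models.items()
        if EXP_MIN <= ccs <= EXP_MAX}
print(sorted(kept))  # -> ['top_cluster1', 'top_cluster3']
```

Only the top-binding-mode clusters survive the filter, mirroring the reasoning in the lecture (and, as the next slides show, why trusting a single filtered observable can still mislead you).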
And the cryo-EM structure reveals that this is the right solution. Okay, so I said, oh boy, wrong modeling. We bet on the wrong complex. Usually, you would just sweep that under the carpet, forget about it and never speak about it. But I think it's always a good lesson: if you do modeling, things are going to go wrong from time to time, so you must be prepared for that. And it's fine to recognize that you make mistakes. I think it's important also to recognize that there are limitations in what we can do. So what went wrong? Several things. First of all, this collision cross-section analysis: speaking to MS people, it's maybe not a very good idea to trust the data when you have non-globular proteins. Okay, KaiC is this double donut with a hole inside. The molecule is flying in vacuum, so what you can get is a compaction of the structure. And if you have compaction of the structure, the area that you extract from your experiment will be an underestimate of what you have in real life in solution. And remember, our green ones here are all overestimating the experimental value. So this is one explanation: maybe we bet on the wrong type of data. But there's another explanation, and that's the main problem. And the main problem is nature playing tricks on us, big time. And that's something that nobody could have foreseen. At the time when they published the cryo-EM structure last year, there were actually several papers back to back. And in one of the papers, they describe a crystal structure of KaiB. When we did our modeling back in 2014-2015, we used a crystal structure of KaiB, which was a tetramer in the crystal. And it's this fold here. So what you see here, the blue fold, is the crystal structure that we used at the time. And in the 2017 Science issue, they report another crystal structure of the same protein, exactly the same construct, same sequence. And that's the one shown here in orange. So look at the secondary structure.
So the first domain is fine; they are identical. But look at what's happening here. In the crystal structure that we used, you have beta, alpha, alpha, beta in terms of secondary structure. In the structure that was published later, same protein, same sequence, you have alpha, beta, beta, alpha. So completely different folds. And this is a quote from the paper: this seems to be a rare case of a protein which can exist in two states. And depending on the conditions of your experiment, you have one or the other. So the cryo-EM captured the state which seems to be the binding-competent one. And the first crystal structure was done, apparently, I think, at higher concentration, and it crystallized in a different fold. So what could we have done about that? Well, nothing, okay. If we cannot trust the structures that are in the PDB, that's pretty much the end of all of structural biology. So hopefully this is not happening too often. But there have been very nice examples from NMR as well: Angela Gronenborn has done a lot of work on GB1, where she's shown that you can make a few mutations and all of a sudden you switch the fold of the protein. So it seems that minor changes can sometimes have huge effects in terms of protein fold. So, well, we didn't really do much wrong at the time in our modeling process, except for starting from the available information in the PDB at that time. If we now repeat the docking process using exactly the same data that we used back then, but using the crystal structure that was published last year, these are the docking results that we get. So the C2 model is the top one, the C1 is the bottom one. And you see now that in terms of score, they are very different. So now our score is discriminative of the right solution. And this is a superposition of the docking model on the cryo-EM structure. The cryo-EM structure, I think, is at around four or five angstrom resolution.
So it's not the highest resolution, but starting from the correct fold of the protein, we are now able to get the right answer. But in the first instance, a couple of years ago, this was all we had then. There's nothing much that you can do about that; I think it's bad luck. Okay. So we started with NMR, we went into MS. Now let me show you a little bit of cryo-EM as well. You are probably all aware of what cryo-EM is. These are examples of, say, low-resolution density maps that have been obtained by cryo-EM. These days, there has been a real revolution in the cryo-EM world, for several years now, mainly because of new detectors, where you can reach the highest-resolution structures at, I think, around two angstroms. What you get is not only a shape of your complex, if you are at low resolution; there is also information inside the density, inside those maps. So when the resolution was limited, what people were mainly doing is to take those density reconstructions and then dock proteins into the map, okay? And there is a number of software packages that can be used for this process. If you have a high-resolution map, these days, it's like solving a crystal structure, okay? You don't need to do the docking process. But if you are at low resolution, you need to rely on somehow docking your molecules into the map. And this is typically done one molecule at a time, meaning that when you dock molecules to generate a complex, you never consider the energetics of the interactions between the molecules, because you place one molecule at a time. So there are many software packages doing that. Some of them are like semi-manual assistants: in Chimera, which is very friendly for cryo-EM analysis, you can easily try to locate the position of your proteins in the map. There are also more systematic approaches that have been developed for that: Colores, for example, and PowerFit, which is our own rigid-body, single-molecule docking into cryo-EM maps.
But the problem with all of these is that you do a rigid-body fitting into the map, and you don't account for the energetics of the interaction. And you cannot use additional information. For example, you might have cross-linking data for your complex, which in principle you could use in the modeling process as well. So we wanted to see if we could actually use HADDOCK here: incorporate cryo-EM data into HADDOCK, so that we can use our energetic description of the system, and so that we can use any other type of data that we might have at hand to guide the modeling process. IMP from Andrej Sali's group is also doing this kind of integrative modeling. So this is the work of Gydo in my group, who basically implemented cryo-EM data into HADDOCK. For our computations we are using CNS, which is related to Xplor-NIH, which is related to CHARMM, which is related to all the MD software, because they all originate from the same starting point if you go back in history. CNS was a software package developed for X-ray crystallography and NMR refinement; it stands for Crystallography and NMR System. So in CNS, in principle, you have all the machinery to handle density maps. And you can express a cryo-EM density in a format which is suitable to be handled by CNS. We didn't need to code any energy function to describe the map; everything was there. We just needed to transform the data into a format suitable for CNS, and then optimize our protocols to make use of that. So now, since you have your energy function describing the map, you could say, well, let's just turn on the energy function describing my map, minimize against it, and I solve my problem, I dock into the map. If you do that, the system is not converging. There are apparently a lot of local minima; the proteins tend to end up at the border of the map, for some reason. So it was not a very efficient way of getting the proteins docked into the map.
So instead, what we ended up doing is to identify centroids, the most likely positions of the centers of mass of the proteins. And you can do that by doing a systematic search against the map, using software like Chimera, Colores or PowerFit. Actually, we wrote PowerFit to get this information, because we want to store the coordinates of those centroids. And once you have identified those centroids, we define distance restraints, which are very natural to the way that HADDOCK is working, from the center of mass of each protein to those locations in the map. And if you don't know where your proteins go in the map, you can define ambiguous distance restraints to all the centroids that you have identified in your map. Then you use these distances to drive your docking process in HADDOCK, just as you would use other distances. So that's very simple. Once your proteins are inside the map, you can turn on your density. You can first optimize the rotation of your molecules in the map, because this was not defined by the distance information. And after that, you can turn on the energy term that represents the map and minimize your system. This can be done throughout the semi-flexible refinement stage and throughout the water refinement stage; the rest of the protocol is pretty much the same. And you have the advantage that you can combine this with NMR data if you were to have them, with MS data, with mutagenesis data. So you can combine everything, plus you have a proper energetic description of the interactions between the molecules. We are calculating the local cross-correlation between the model and the cryo-EM data, and we are adding this cross-correlation to our scoring function. So I explained to you that we needed to extract those centroid positions. And for this reason, we wrote our own software to do the systematic search. It's called PowerFit. You can run it from a website, as it's a web server.
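The centroid-restraint idea above can be sketched roughly as follows. This is not HADDOCK code: the coordinates, centroids and tolerance are toy values, and an ambiguous restraint is approximated here as "center of mass close to at least one centroid", whereas HADDOCK expresses ambiguity through an effective summed distance.

```python
import math

# Sketch of the centroid-restraint idea: each molecule's center of mass should
# end up near one of the candidate centroids extracted from the map (e.g. with
# PowerFit). All numbers below are invented toy values.

def center_of_mass(coords):
    """Unweighted center of mass of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def ambiguous_centroid_violation(coords, centroids, tol=2.0):
    """Ambiguous restraint: the center of mass must lie within `tol` angstrom
    of at least one centroid; return the violation (0.0 if satisfied)."""
    com = center_of_mass(coords)
    best = min(math.dist(com, c) for c in centroids)
    return max(0.0, best - tol)

# Toy protein of four atoms (COM = (1, 1, 0)), two candidate centroids in the map
protein = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
centroids = [(1.0, 1.0, 0.5), (30.0, 0.0, 0.0)]

print(ambiguous_centroid_violation(protein, centroids))  # 0.0: near first centroid
```

In the real protocol this violation would enter a restraint energy term that drives the rigid-body search, rather than just being reported.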
So you can only do one molecule at a time, but you can just upload your data. And it runs on GPU resources of the European grid, so it's quite fast. You can also download the software; it's on GitHub, so you can run your local version. And we've been doing extensive benchmarking. We developed some new cross-correlation scoring functions that are a bit more sensitive than the ones in the other software that is out there. So you see, as a function of resolution, we measure the success rate. This is docking on the order of 380 proteins from different ribosome structures that are out there, docking each protein in turn. And then we have different resolutions; we can actually down-sample the resolution. And you can see the success rate as a function of the different scoring functions. So this is the local cross-correlation function. This is a core-weighted one, where we give more weight to the center of the map. These are Laplace transformations, which enhance edges. And this is the newest correlation function that we introduced, and it's the one that gives you the best performance overall. So you get quite a high success rate up to about 13 angstrom resolution for all proteins. And since we have a large data set, we can also see what the success rate is as a function of the size of your protein. If you have a very large system, it's of course very easy to fit it in the map, because there is kind of a unique solution. But when the protein becomes smaller and smaller, your success rates go down. So you can see, as a function of size, everything has a very good success rate up to 10 or 12 angstroms. And where things start breaking down is if you are in the range of 50 to 100 amino acids, or less than 50 amino acids; then you don't get a very high success rate. At 50 and below, you get a lot of solutions that can fit anywhere in your map.
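At the core of all these scores is a normalized cross-correlation between a simulated model density and the experimental map. Here is a minimal one-dimensional sketch with invented toy densities; real maps are 3D grids, and the scores discussed above add core-weighting and Laplace filtering on top:

```python
import math

# Minimal normalized cross-correlation between two density profiles.
# Toy 1D arrays stand in for 3D density grids.

def cross_correlation(model, emmap):
    """Normalized cross-correlation of two equal-length density profiles."""
    n = len(model)
    mm = sum(model) / n
    em = sum(emmap) / n
    num = sum((a - mm) * (b - em) for a, b in zip(model, emmap))
    den = math.sqrt(sum((a - mm) ** 2 for a in model)
                    * sum((b - em) ** 2 for b in emmap))
    return num / den

density = [0, 0, 1, 1, 1, 1, 0, 0]       # toy 1D "map"
shifted = density[3:] + density[:3]      # a mis-placed model of the same object

print(round(cross_correlation(density, density), 3))  # 1.0: perfect overlap
print(round(cross_correlation(density, shifted), 3))  # negative: mis-placed
```

A fitting program evaluates this score for every candidate placement and keeps the top-scoring ones; doing that efficiently over all translations is what the FFT machinery is for.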
But these kinds of graphs also allow you to decide how far to down-sample your resolution, because if you do the modeling at lower resolution, it's going to be much faster. So there's a speed price to pay as a function of resolution. Anyway, we use PowerFit to identify the centroids, and this is the information that we give to HADDOCK in the initial stage of the docking. So now an example. We're going to model the binding of KsgA, which is a protein that binds to the 16S RNA of the ribosome. There is no crystal structure of this complex. There is a cryo-EM structure of the complex, which is available in the EM database. So what we have is a 13.5 angstrom cryo-EM map. We have the crystal structure of the ribosome. We have the crystal structure of the protein. On the ribosome side, we have a definition of the binding site from hydroxyl radical footprinting. I did not mention this type of experiment yet, but it basically allows you to measure which regions of the RNA are protected from hydroxyl radical cleavage when the protein is present. So this is kind of the chemical shift mapping version, but on the RNA side. We have mutagenesis data for the protein side, and we have the map. So we have three kinds of data. These are the mutations on the protein side: if you mutate one of those residues, the interaction does not take place anymore. Now, if you go into the EMDB, there is a very nice model of this complex. And all the EM models that have been obtained by rigid-body docking look very nice, as long as you only show the backbone as a ribbon. Because once you start turning on the side chains, this is what you see. These are all clashes; everything yellow here is a clash. So the interface is a mess, because you never took care of the interface in the way the molecules were fitted into the map. You fit one at a time, and that's it. So if you focus on the three amino acids that have been mutated, this one is not even contacting the RNA.
This one is clashing, and only this one makes sense. So the model does not really explain the biological data that we have. Can we do better? We go to HADDOCK to try to de-bump the system. We input everything, we do our cryo-EM-based docking, and we get only one set of solutions at the end. And now you can focus on the interface. You see that the amino acids that were important for the binding now make nice interactions, because we have added the energetics at the interface and we allow for flexibility. It doesn't mean that it's, per se, a correct model. But at least it now explains the biological data that you have. And when we analyzed the resulting interface further, we identified two other arginines that seem to be important for the binding. So now you could go back to the experimental bench, start mutating those, and measure whether it affects the binding or not. We did not do that; we used existing data to do this exercise. And this model has actually now been deposited in PDB-Dev, which is a new branch of the PDB that accepts integrative models. If you look at the conservation of amino acids on the surface of this protein, the two residues that we identified are also highly conserved. So it makes sense, basically. So we now have cryo-EM support in HADDOCK. It's not yet available in the web portal version of HADDOCK, but we have been working on a completely new version of the portal. We have a working version now, in a kind of alpha phase, and later this year it will become available. You don't need to account for the entire map; you can dock into a subset of the map. And I think what is interesting is that it's complementary to, and compatible with, all the other data sources that you can incorporate into your modeling process. And in principle, the same strategy could be used to describe SAXS-derived shapes. So now we're going to move to the last topic of... yes?
So if you are docking with a low-resolution structure, but without cryo-EM data? Yeah, well, I think the current version is perfectly fine for that. What we are working on now, and we have a local version doing this, is a coarse-grained version. We have been implementing the Martini force field, which some of you might know from molecular dynamics simulations, where there is a 4-to-1 mapping of atoms: four heavy atoms are mapped onto one particle. So it's a low-resolution model. We have a working version of that currently. At some point, it will also be part of the official release and the web portal; we are still benchmarking it. But that could be a way of modeling low-resolution structures. We want it mostly for speed purposes, so that we can move to larger complexes and not pay too much of a price in terms of modeling cost. But in principle, also, in all the work that we do in CASP and CAPRI, you have to make homology models all the time. So there as well, you can ask yourself: what is the resolution of the models that we are generating? So in principle, it's not a limitation; it's just that you have more uncertainty in the results. So now we're moving to the accessible interaction space story. We started doing this because in Utrecht, in the Bijvoet Center where I'm working, we have a lot of structural biology, and we also have a big proteomics mass spectrometry group. They are generating a lot of crosslinks, so we have been exposed to those crosslinks. And we realized that it's not so simple to use them, because there's quite a high false-positive rate in those crosslinks. These are reactive molecules: when you do your reaction, if you have, say, an encounter complex, you might already be crosslinking, and you're going to detect those. And if you then want to use those crosslinks in your modeling process, you want to make sure that you have a safe and consistent set of data before you put them into the modeling, ideally.
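To make the 4-to-1 mapping mentioned above concrete, here is a toy sketch that replaces each group of four "heavy atoms" by one bead at their geometric center. This is only the counting idea: real Martini mappings are chemistry-aware and defined per residue type, not a blind sequential grouping like this.

```python
# Toy illustration of a 4-to-1 coarse-grained mapping (Martini-style ratio).
# Coordinates are invented; this groups atoms blindly, which Martini does not.

def coarse_grain(atoms, ratio=4):
    """Replace each consecutive group of `ratio` atoms by one bead at its
    geometric center; a short final group still becomes one bead."""
    beads = []
    for i in range(0, len(atoms), ratio):
        group = atoms[i:i + ratio]
        n = len(group)
        beads.append(tuple(sum(a[k] for a in group) / n for k in range(3)))
    return beads

atoms = [(float(i), 0.0, 0.0) for i in range(8)]   # 8 toy heavy atoms on a line
beads = coarse_grain(atoms)
print(len(beads), beads[0])   # 2 beads; the first sits at x = 1.5
```

The point of the reduction is purely cost: four times fewer particles means far fewer pairwise interactions to evaluate during docking.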
So this is, again, the MS workflow for crosslinking. You have your complex, you add your reagent to the solution, you do your crosslinking, digest your proteins, try to enrich the crosslinked fragments, and then detect them by MS. This is also the work of Gydo. We wanted to address two questions: can we identify the reliable data out of a set of crosslinks? And can we define the information content of the data before taking the step to the modeling? So, given two structures and a set of distance restraints (and this is not limited to crosslinks; in principle, you can do this analysis with any type of distance restraints), are there any solutions that satisfy N restraints? N could be the total set of restraints that you put in, or a subset, okay? In this context, a solution is a complex that satisfies the restraints. And a complex is a conformation where the molecules are interacting, meaning they are touching each other and they are not clashing. We're not going to use any energetics in this. We're going to do geometry, mainly: just generating solutions where you require that the proteins interact, no energetics, and then we're just going to do counting, okay? How many solutions are compatible with the number of restraints that we put in? So the accessible interaction space is the set of all solutions that are compatible with N restraints. And we can visualize this space. How do we do that? We go back to rigid-body, fast Fourier transform docking techniques, actually. We take our receptor molecule and map it onto a grid; we define the surface of that molecule and the core of that molecule. Then, for our ligand, what we have to do is sample all rotations, and for each rotation that we sample, we generate a grid.
And we're going to do rigid-body docking, sampling all possible combinations of the two, and counting which combinations are compatible with the distances that we put in. When you do this counting, you can then visualize the space around your receptor where you can position the second molecule while satisfying a given number of distance restraints. So what you see here, this density map, is where you can put the center of mass of the second molecule. It's not the full molecule; it's the location of the center of mass, consistent, in this example, with five restraints. The orange sphere is the position of the center of mass in the crystal structure of that complex. So with five restraints, there is a very large space. And what you don't see in this representation are all the possible rotations. But there is a very large space where you can put the second molecule. Remember that our crosslinks are not very precise in terms of distances; this will be maybe an upper limit of 26 angstroms, which is quite common for the crosslinkers that are used. If you increase the number of distances that you measure, you see that the space shrinks. And if you don't find any solution consistent with all the distances that you put in, then you know that you have a problem: maybe there are false-positive data in your data set, or maybe your protein is changing its conformation, and then you're in trouble. So let's take an example. This is RNA polymerase II; this is the complex, and we have the crystal structure. There are six experimental crosslinks that have been detected for this complex. These are BS3-type crosslinks, so they are crosslinking lysines, and the distance that we use is a 30 angstrom maximum distance between lysine beta-carbons. So we apply the distance restraints to the beta-carbons. We added two false positives to this set: one is at 35 angstroms, so slightly above this 30 angstrom limit.
And the second one is more seriously wrong: it's at 42 angstroms. So we have this set of six plus two, and now we're going to do this analysis. This is the number of complexes that you can generate as a function of the number of distance restraints that they satisfy. At zero distance restraints, this number is the number of solutions that we generate in which the molecules are touching each other. In terms of sampling, we are using a one angstrom grid size and about 4.3 degree rotation steps. So this is a very large number, 190 billion solutions. These are all the solutions in which the two proteins are touching each other; it's a huge number. And then you start adding distance restraints and you see how many solutions remain. If I add one restraint, how does the space shrink? With one restraint, we still have 23 billion possible solutions, and you add more and more and more. In principle, if we had very accurate distances, three distances should be enough to uniquely define the position of your two molecules. Okay, three points define a plane, and if you want to define the orientation of two planes, three distances between those planes should be enough. At three, we still have, what is this, 300 million solutions. At five, 17 million; at six, 5 million. At seven, now we start hitting a false positive in the data, and we are at 10,000. Is there a question? No. So we're not using the restraints to guide the modeling. We're just generating all possible solutions, and we count: does it satisfy them, yes or no? And you can put the restraints in a different order in your list; it's not going to change the results. It's a systematic sampling of all possible solutions at that resolution, in terms of search. You can scramble them in any order, you get the same data out. So, seven restraints, 10,000 solutions; eight restraints, zero.
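This count-and-filter idea can be sketched on a toy grid. This is not the real implementation, which also samples rotations and uses FFTs; it only enumerates integer translations of a ligand carrying invented restraint points, and tallies how many restraints each placement satisfies:

```python
import math
from itertools import product

# Toy accessible-interaction-space counting: enumerate rigid translations of a
# ligand on an integer grid and count, per placement, how many distance
# restraints are satisfied. Restraint points and the cutoff are invented.

def consistent_counts(restraints, grid, max_dist=3.0):
    """restraints: list of (receptor_point, ligand_point) pairs; a translation t
    satisfies a restraint if |r - (l + t)| <= max_dist. Returns a histogram
    mapping 'number of restraints satisfied' -> number of placements."""
    counts = {n: 0 for n in range(len(restraints) + 1)}
    for t in grid:
        n_sat = sum(
            1 for r, l in restraints
            if math.dist(r, tuple(li + ti for li, ti in zip(l, t))) <= max_dist
        )
        counts[n_sat] += 1
    return counts

restraints = [((0, 0, 0), (0, 0, 0)), ((5, 0, 0), (1, 0, 0))]
grid = list(product(range(-5, 6), repeat=3))      # all integer translations
counts = consistent_counts(restraints, grid)
print(counts[len(restraints)])  # placements satisfying *all* restraints
```

As in the lecture, the interesting signal is how this histogram collapses as restraints are added: if the bin for "all restraints" drops to zero, the set contains a false positive or the structures have changed conformation.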
Okay, so now we know that we have a problem, because if we put in all the data, we cannot generate a single solution which is compatible with all of it. So can we identify which restraint is the problematic one? What we do for that: we have eight restraints giving zero solutions, so let's go to seven, and we analyze the 10,000 models that we have at seven restraints. In those 10,000 models, we are going to count how often a specific distance restraint is violated. This is what you see here: for each distance, we are counting how often this distance is violated. So if you look at seven, here are the 10,000 models, and you see that only restraint number eight is violated, and it's always violated. So we have identified our first false positive. Then you go back, you look at six distance restraints, and we now analyze the five million solutions and do the same trick. At five million solutions, you see that the second false positive that we put in, number seven, shows up as violated almost all the time. And there is a second one, number five, which now starts popping up. If you go down to five distance restraints, you see that you start spreading the violations over the other restraints, and this becomes more and more spread out. But now we are basically able to identify our false positives. While doing this systematic search, what we can also do, for all the consistent solutions (so the 10,000 solutions consistent with seven restraints in this case), is count which amino acids are actually making contacts across all those 10,000 solutions. You're basically extracting a kind of interface information out of the distance restraints and the sampling that you are doing. And you can map onto the surface the likely regions that are able to make contacts. So out of a bit of distance information, we map out an interface.
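The false-positive analysis boils down to a per-restraint violation frequency over the set of consistent solutions. A minimal sketch on an invented violation matrix:

```python
# Sketch of the false-positive analysis: among the solutions consistent with
# N-1 restraints, count how often each individual restraint is violated.
# The 5x4 violation matrix below is invented for illustration.

def violation_fractions(violation_matrix):
    """violation_matrix[i][j] is 1 if solution i violates restraint j;
    return, per restraint, the fraction of solutions violating it."""
    n_sol = len(violation_matrix)
    n_res = len(violation_matrix[0])
    return [sum(row[j] for row in violation_matrix) / n_sol
            for j in range(n_res)]

# 5 toy solutions x 4 restraints: restraint index 3 is violated in every solution
matrix = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
fracs = violation_fractions(matrix)
print(fracs.index(max(fracs)))  # restraint 3 flagged as the likely false positive
```

In the real analysis these fractions are turned into Z-scores against the average violation level, which is what drives the red color coding mentioned below.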
And this is the color coding of which regions of my receptor are most often contacted in the solutions that are consistent, in this example, with seven restraints. So you can do this analysis using our web portal called DisVis. The code is also on GitHub, so you can download it and run it locally. If you want to make use of GPGPUs, you will have to install quite a number of libraries, so that's not always trivial. The computations here are running on GPU resources of the European grid, which makes them faster. You can also download the Docker container, which has all the libraries in there already, so that makes your life easier. And the server is going to present you with these kinds of pre-calculated images, where you can change the view by scrolling here. So it's not an interactive view, it's just pre-calculated images from different angles, but you can quickly visualize: is it compatible, yes or no? And it guides you in the interpretation of those violations, flagging in different shades of red the potential false positives. But remember, it's not because you see that there are no solutions satisfying all the restraints that a restraint is per se false. It could be that you have conformational changes in your system; we are doing a rigid-body search here, and that's the limitation of what we are doing. And this color coding is based on the Z-score that you calculate from the average violations in the matrix of violations. Now, you are not limited to MS data. I'm showing you here the same analysis using the E2A-HPr complex that was solved by NMR in Marius Clore's lab; that's the one I also used to show you where the modeling starts and ends. So we are plotting how the space shrinks as a function of the number of intermolecular NOEs that you put into the system. And you can see this number, so it goes from one to 52. So, to me... the NOEs are much more accurate than the crosslinks, okay?
We get upper limits of five or six angstroms (here, since we apply them to the carbons, we have been correcting them for that), but still, if you look halfway through this movie, you have about 25 restraints. So with 20-25 NOEs, there is still a huge space where you can put the center of mass of your second molecule, if you exclude any energetics. So to me, this was very surprising to see. Only when you reach the total number of NOEs that were measured in this case do you get a unique solution, which basically corresponds to the structure that has been solved by NMR. I would have expected that with 10 or 20 NOEs you should have a unique solution, but if you just do geometry and no energetics, it seems that there are many ways of positioning the molecules while satisfying the distance restraints. So how can we use this information if we want to use crosslinking data in docking? You can think of different scenarios. We have the distance restraints, but they are not very precise; and if we do this analysis, we can also extract information about the putative interface. So you can think of three scenarios for modeling. You can say: I'm going to dock using only the distance restraints; or I'm going to dock using only the derived interfaces, and give that to HADDOCK as if it were chemical shift perturbation data, for example; or you can use the combination of the two. And I'm not going to give you the answer to that question, because that's something that we are most likely going to look at this afternoon in the tutorial. So now, what about ambiguous restraints? Can we do this kind of analysis when we have ambiguous data? Up to now we have been analyzing distances between two points. But what if you have, for example, chemical shift perturbations? Could you use DisVis to map the interaction regions defined by the chemical shift perturbations?
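A minimal sketch of how an ambiguous restraint can be checked geometrically: it counts as satisfied when any of the candidate atom pairs falls within the distance bound, which is the same as asking whether the ligand overlaps the union of spheres drawn around the candidate atoms. Coordinates are toy values, and the simple min() here is a simplification of the effective summed distance that HADDOCK itself uses for ambiguous restraints.

```python
import math

# Sketch of an ambiguous restraint check: satisfied if the minimum distance
# over all candidate (receptor atom, ligand atom) pairs is within the bound.
# All coordinates and the 5 A bound are invented toy values.

def ambiguous_satisfied(receptor_atoms, ligand_atoms, upper=5.0):
    dmin = min(math.dist(r, l) for r in receptor_atoms for l in ligand_atoms)
    return dmin <= upper

receptor = [(0, 0, 0), (20, 0, 0)]     # two candidate interface atoms
ligand_near = [(22, 0, 0)]             # within 5 A of the second candidate
ligand_far = [(10, 10, 10)]            # far from both

print(ambiguous_satisfied(receptor, ligand_near))  # True
print(ambiguous_satisfied(receptor, ligand_far))   # False
```

Encoding chemical shift perturbation data this way just means listing all perturbed residues' atoms as the candidates on each side.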
So now you have to do this counting accounting for the ambiguity, and the way we do that is the following. For each atom on which you define a distance restraint, you can put a sphere around that atom, and this is the space which will be consistent with that distance restraint. If the second molecule overlaps with this space, you say, okay, that satisfies the distance. In the case of ambiguous restraints, you're going to put spheres on multiple atoms, because there are multiple possible solutions, and if there is an overlap with the union of all those spheres, you say, okay, that's a consistent solution for the ambiguous restraint. In that way, you can encode chemical shift perturbation data. So this is the same E2A-HPr complex; we have 21 ambiguous restraints, defined on both sides of the proteins, and you can see how the space changes as a function of the number of chemical shift perturbation residues that you define. And you see, at the end, this is the location of the NMR structure. So again, when you put in all the data, there is a unique location coming out, in this particular example. And the latest version of this can actually output all the solutions that are consistent with a number of restraints. So this is the ensemble that you get. You can refine those solutions: give them to HADDOCK and refine them there, so you don't do docking, you just do refinement. And you get two sets of solutions (again, symmetrical solutions are popping up here), and the one that we find the best in terms of score is the one which corresponds to the average NMR structure. This ambiguous distance restraint analysis is not available in the server yet; that's something that we will implement later on. So, in conclusion, we have a way of visualizing the information content of distance restraints from any kind of source, in principle. It's important to realize
that it is only based on geometric considerations; there are no energetics in it. It allows you to identify false positives, and it also allows you to extract additional information, by mapping the potential interfaces on your molecules, which can be valuable to guide the docking process. But of course, it does not account for any conformational changes, so it's not because a restraint is never satisfied in this kind of analysis that it is per se wrong. The working version that we have now can handle ambiguous restraints as well. And I didn't speak about this, but we're also working on the following: it could be that in your data set you have two binding modes, and the question is, could we distinguish those two binding modes? Then you need not only to calculate whether a location in space is consistent with a number of restraints, but you also need to keep track of all the combinations of restraints that are possible and monitor that. So that becomes a combinatorial problem. You can do that with up to about 20 restraints, but after that, the memory starts exploding. But that's something that hopefully we will also offer in the future, in a new version of the web portal. So, we have 15 minutes left; make your choice. Some of these we already talked about, so no need to go back to cryo-EM or MS data, because we already covered those. And then we vote, in a very democratic process. Protein-ligand modeling: who is for protein-ligand? So, one, two, three... well, okay. Anyone else want something else? Tax? Tax?
Not very popular, sorry. The specificity riddle? That's also an interesting one — that's nature playing tricks on us again. But yes, I think we should go to the ligand part, because there is a majority, and who knows, maybe this afternoon, depending on how far we get in the tutorial and how much we do, we might come back to some of it if there is interest. Or maybe you will be so fed up with HADDOCK by the end of the afternoon that you don't want to see me again. OK, so the ligand one — let's do that. Where are the ligands... here. OK, so here we go.

I mentioned that quite early on in the HADDOCK developments we were able to handle ligands, because we built the ability to deal with ligands into the server early on. This is probably one of the first examples of using HADDOCK to model a protein–ligand complex. It was published in 2007, actually before we had the web server, and it is also an example of multi-body docking: this protein, the chicken liver bile-acid-binding protein, actually binds two ligands at the same time. If you want to get proper results in terms of docking here and you dock only one ligand at a time, I think you will never get a solution where you can fit both, because the first ligand will most likely end up somewhere in the middle of your binding pocket, and then it will be very hard to get the second one inside.

So what was the data that we had at the time? This was a collaboration with Lucia Zetta and Henriette Molinari — NMR people. They had chemical shift perturbation data on the protein side describing the binding site. They also had relaxation data, basically telling us that some loops in this protein rigidify when the ligand binds, so that is also information in principle. There was proteolysis data as well, comparing the free protein and the ligand-bound protein in terms of stability. On the ligand side there were some STD data and also a few NOEs. This is basically all the information that we put in, doing a three-
body docking, docking the two ligands at the same time to get them both into the binding site. This is probably the first such example, and you see the best cluster — it is not an extremely precise solution, there is still some variability in the solutions — but you can dock three bodies at the same time.

Here is another example, and then I will come to some more systematic benchmarking. We are looking now at a protein which has antibiotic properties because it binds lipid II, this molecule here, a critical component of the cell wall of bacteria. It is a complex molecule: it has sugars, it has amino acids here, a pyrophosphate group, and then an isoprenyl tail which is inserted in the membrane. We did the modeling based on NMR data. The NMR tells you that this protein binds to the membrane — that's one piece of information — and there are also NMR data pinpointing the binding site of lipid II on the protein. From this we know that when lipid II binds, this isoprenyl tail should be pointing down and not up. That is also a bit of information you can use to filter solutions, because when you do your docking, in principle the molecule could end up in two orientations; so you filter your solutions based on this information. Using all of that we ended up with a model, which was published in a nice journal.

Another early example: sugars. This is a defensin binding to sugars, and you see a different kind of docking here — the protein was flying in and just exploring the surface. Again there were NMR data pinpointing the binding site on the protein side, but because of the flexibility of the sugar, in this case we decided to do the docking during the flexible stage of HADDOCK. So we did not do the rigid-body minimization first, but gave the molecule more time to sample the surface of the receptor. That is a flexible docking example. And here is another example, again an
oligosaccharide binding to a protein, this one involved in blood-group-specific recognition. To guide the docking in this case, there was a crystal structure with only one saccharide bound in the binding pocket, and the question that was asked was: can we explain the specificity of this protein? Here we are looking at this trisaccharide, and they wanted to pinpoint which amino acids — the candidates are mainly histidines and tryptophans — are responsible for the specific recognition of an N-acetyl moiety on the sugar. So we did the docking restraining the sugar into the binding site defined by the crystal structure, but giving full flexibility to the rest; again the docking was not done at the rigid-body stage but during the flexible refinement stage. You see the N-acetyl group is here, and this solution pinpoints that this histidine should be important for the specific recognition. This allows you to generate a hypothesis: now you can try to mutate this histidine and see if you lose the specific recognition. There were different candidate recognition sites here, so our prediction would be that this is the right one — which needs to be tested. This was 2006, so we have been doing small molecules for a long time.

Another example of such a case: a crystal structure solved in the group of Piet Gros in Utrecht, the crystallographer I mentioned before. They managed to solve the crystal structure of a membrane protein, PagL, a deacylase that acts on a component of the bacterial cell wall, LPS, but they could not get the crystal structure with the substrate. So they knew the crystal structure, they knew the substrate, and the information that we had to guide the modeling was actually the chemistry of the system, because we know what this enzyme does: it cleaves a bond, and the mechanism of this cleavage reaction is also described, so
that's information. We have a catalytic triad — actually it was not clear whether it was a glutamate or an aspartate, but we have the histidine–serine pair — and basically the serine oxygen attacks this carbon of LPS to cleave the bond. So we put this information in as distance restraints in HADDOCK to guide the docking process. This is the molecule that we are looking at, and what we actually dock is only lipid X, which is a subset of this molecule. So we used the cleavage mechanism as restraints to guide the modeling, and again we know which parts of the protein are in the membrane, so these lipid tails should not stick out into the solution: we filter for solutions that are compatible with the membrane orientation of the protein.

This is the solution that comes out of the modeling, and when you analyze it you get some nice insight. For example, surprisingly, this protein has an aspartate in the middle of the membrane — you don't expect a negative charge there — but now the model explains why that aspartate is there: it is specifically recognizing LPS. LPS has a hydroxyl group here in this chain, and the aspartate is basically helping to position LPS in the binding site. We can also answer the question of whether it is the aspartate or the glutamate that is involved in the catalytic triad: based on the model, we predict that the glutamate is the correct one.

We have also been doing some recent benchmarking, to be a bit more systematic about the docking capabilities of HADDOCK. We took a number of complexes from the Astex Diverse Set, which is a well-recognized benchmark for small-molecule docking; it is used by all the small-molecule docking software. If you do small-molecule docking with the typical small-molecule docking software — AutoDock, AutoDock Vina, Glide, all of those — they are not docking against the entire
surface of your protein: you typically define a box to which the docking is directed. So you do put information in — it is not in the form of restraints, but you limit your docking to a region of your protein. In HADDOCK we do kind of the same, but we define the binding region in terms of what we call active and passive residues; that is what biases the docking in HADDOCK. We define the binding site on the protein as active only for the rigid-body docking part, and during the refinement part we just require that the ligand is contacting the binding site, but it is free to explore it. That is the protocol we are using.

What is also different compared to protein–protein docking is that we give more weight to the van der Waals interactions in the rigid-body docking stage: the weight is 1, whereas when we do proteins it is 0.01. This gives better results. Also, depending on your binding site — if you have a buried binding site it might be hard to get in, because in the way we dock, the molecules start separated in space, and during the rigid-body minimization you might have to go through the protein to get in, and the van der Waals energies might prevent that. So for such a buried site we can change the settings a little to allow this interpenetration of the molecules.

This is the docking performance for what we call bound docking: you take the crystal structure of the complex, take it apart, and just try to re-dock, so there are no conformational changes. This is the cluster-based result: we have 78 cases and basically a 77% success rate of getting an acceptable solution in the top cluster; if you look at the top three clusters, we get an 87% success rate. That makes HADDOCK quite competitive with other docking software. It is not as fast as the dedicated small-molecule programs — you don't want to use HADDOCK to screen one million compounds, that would probably be a waste of computing time — but if you have data, you can consider it. Now, if you start from the unbound
docking set — protein conformations that do not correspond to the crystal structure of the complex — you see that you take a hit in the success rate, but in the top three you are still at 77%, which is quite competitive.

Do conformational changes affect us a lot? That is what you see here: the interface RMSD of the best-scoring cluster — there might be other clusters that are better, but we are just taking the top cluster in this example — versus the RMSD of the receptor that you are docking against, so conformational change versus accuracy of your docking results. You can see it is not really going up, so we seem to be quite robust with respect to conformational changes, probably because we are guiding the docking with our restraints and allowing for a bit of flexibility. So that is good news.

And this is an example of using NMR data to guide the docking process. It is a phosphatase, and we dock against the apo form of the enzyme. The red colors here are the NMR titration data for this system; this is what we give to HADDOCK. We also mimicked our binding-site definitions: this is 5 Å, and this, in orange, is 10 Å — the kind of definition that we used in our benchmarking with the Astex set — but these are real data. If you do the docking with those data, this is the result that you get: in green you have the crystal structure, superimposed on the protein and showing just the ligands, and in gray our docking models — it gets very close. And this is the docking that you get with the 10 Å binding site, which defines quite a large region around the binding pocket. Our clustering could probably be better here, because we have two sets of solutions, so we should probably reduce our clustering cutoff, but again you get very nice docking results using these data.

We also participated in two rounds of D3R — I don't know if
you have ever heard of D3R. It is the Drug Design Data Resource; it is a kind of CASP or CAPRI experiment, but for small-molecule docking, and it is run in California. What you get there are data sets, typically from industry, for a large set of ligands against one protein, so it is a nice, consistent data set. We have been participating for two years, in two rounds, and this is mainly the work of Zeynep and Panos.

The first target that we did was the farnesoid X receptor, a pharmaceutical target, and the data at the time were coming from Roche, I think. What are the challenges? Depending on the ligand there are large conformational changes on the protein side; the site that we are targeting is quite buried inside the protein; and the ligands that we are docking are quite branched — these are not small fragments, they are quite large molecules. For this one we basically followed the protocol I described, in a naive way: defining the binding site, generating ligand conformations using OpenEye tools, sampling different conformations, taking representatives of clusters, and doing ensemble docking in HADDOCK. For the receptors we took different conformations of the receptor from the PDB, which gave an ensemble of receptors for the docking, and that was it — I'm speeding it up a little bit.

So this is the ensemble of receptors, and then we do the docking, and you get some very nice results. This was the first one that we looked at, so we said, OK, it's working fine, docking works well. But if you look over all the ligands that we had to dock, this is where we ended up — OK, so that was not so very nice, but we learned a lot. The best ones are at about one ångström on average over all the models that had to be docked; we got some good ones and some bad ones, but we learned a lot. So what limits our performance in terms of docking? You can do experiments: is it the ligand conformation that
limits the performance? So: the dark green bars are our predictions. If we use the bound ligand conformation, we improve a little; if we use the bound receptor, you see that we improve quite a lot at the top. It seems we have to be smarter in choosing the conformations that we use for docking — that is one lesson here. The receptor is a limiting factor: we need to select receptor conformations based on similarity to the target ligand that needs to be docked, which is what you realize people are doing in the small-molecule field. And for the ligands, we need to up-sample the major clusters if possible.

If you do that — green is this new strategy — you see that we get better results in general. This is the RMSD, and these are all the targets that we had to dock, so this improves our results. This is the kind of sampling that you get in the first stage: you have again the ligands, and this is the top 100 models; a gray or dark bar means that we have an acceptable ligand pose. So we generated good poses, but they should have ended up at the top if our scoring function were good enough, and you see that they are all over the place. If we make this smarter choice of receptor and smarter choice of ligand conformation, this is the picture that you get: you increase the sampling a lot, and you have a higher success rate.

And then came Grand Challenge 3. Having learned from our failures in Grand Challenge 2, we went in a bit smarter. These are the kinds of ligands involved. We made a smart choice of the receptor by comparing the similarity of the ligand that we had to dock against the ligands that are already in the PDB — and the similarity is sometimes quite high. For the ligand we also selected conformations that resemble ligands already known from the PDB: we try to find a ligand in the PDB which most resembles the ligand that we have to dock. We measure the shape similarity
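(As an aside, to give a feel for what such a similarity number is: chemical similarity is usually expressed as a Tanimoto coefficient over molecular fingerprints. Here is a minimal, purely illustrative Python sketch — fingerprints are represented as plain sets of made-up feature identifiers, which is not what OpenEye or any real fingerprinting software produces:)

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two fingerprint sets:
    number of shared features divided by the size of their union."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def best_template(query_fp, library):
    """Return the (name, fingerprint) library entry most similar to the
    query ligand -- the idea behind picking a template from the PDB."""
    return max(library, key=lambda entry: tanimoto(query_fp, entry[1]))
```

So a query ligand sharing three out of four total features with a PDB ligand scores 0.75, and the highest-scoring PDB entry would serve as the template for receptor and conformer selection.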
and we also measure the chemical similarity — we use OpenEye software for that, which is free for academics. So we make a smart selection of the ligand conformations and a smart selection of the receptor, and instead of starting the docking from a random conformation, we actually pre-position the ligand in the receptor and only run the refinement stage of HADDOCK. This is what you get for those ligands — this line is 2.5 Å — and you see that we are doing much better than in round 2 of the Grand Challenge. We have learned from our mistakes. So you should not be afraid to go blindly into those competitions, because this is really a learning phase; otherwise you keep optimizing your software thinking that you are doing well, and if you never try a blind test, you don't learn much. Let's skip that one. And this is now where we are in the last round — those results are not yet published, the publication has been submitted. This is 2 Å, and you see that the top five groups show pretty much the same performance, so now we are doing very well at this kind of docking.

Then binding affinity — that was another part of D3R: we had to rank the ligands, because there are experimental data for that. We had a very simple model for binding affinity prediction, and it again did quite well. Just to show you where we are: this was Grand Challenge 2, using basically a simple structure-based measure of binding affinity based on contacts, and here you have all the fancy free-energy calculation methods — so we have a very simple model that does quite well. Actually, in Grand Challenge 3 we did not even use the protein structure to predict the affinity, but just trained a machine-learning model on similarity: there were a lot of kinases in Grand Challenge 3, and there is a lot of data on kinases. By simply measuring the similarity of the ligand whose affinity you have to predict
with the database that you have out there, and training a machine-learning model on that, we do much better than all the other, energy-based models.

So, in terms of conclusions for small molecules: yes, we can handle small molecules, and it is especially interesting, I think, when you have experimental data. There was a very nice paper in December from the Novartis group, which has been using HADDOCK to do small-molecule docking because they had measured NOE data, and apparently — this is what they told me — there is no single small-molecule docking program in which you can define a distance restraint. That is why they resorted to HADDOCK for their docking: in all the popular small-molecule packages you simply cannot define a distance restraint. You would think it is very simple, but it is not there. Without distance restraints, they tried to generate docking poses and then filter them with the NMR data they had, and that did not work; the NOEs they had were critical to generating good models. That paper was published in JACS, and it is worth looking up, because they show that they can use a limited number of NMR data to generate docking models. They could not co-crystallize those complexes because the affinity was too low, but based on the docking models they could do structure-based drug design and generate a new generation of ligands, which did crystallize because the binding affinity had improved, and they could show that those crystal structures match the docking models they were getting. So it is a nice story of doing structure-based drug design using docking models obtained with distance restraints.

And maybe a perspective, since some of you are doing fragment-based drug design: we also looked at whether we can do something like that — and then I am stopping. This was actually a project of high-school students, who used the web portal because it is simple to use, to do this kind of fragment-based drug design. We did the docking
without any information, just throwing in the small molecules using center-of-mass restraints — eighteen different fragments [unclear] — and then you look at where the fragments end up. These are the centers of mass of the fragments, and here is an example of different fragments that end up in the binding site of this protein. What is interesting — this is the docked model, and in yellow the crystal structure — is that here we are opening up a kind of pocket which is not accessible in the crystal structure, so you would not find this with rigid-body docking without any flexibility. So there is an interesting potential there, maybe, to sample for small molecules. OK, time to stop, if I can find my mouse... so we go here.

To finish: I have shown you that information-driven docking is useful to generate models of biomolecular complexes, even when you might think you don't have that much data at hand — data are always useful. The models that you are getting are models, so they have their limitations; even if they are not the most accurate, they are useful to generate hypotheses. The model, again, is not the end of the road; the model is only the starting point for a new experiment. In that sense I think modeling — information-driven docking, integrative modeling — is very complementary to the classical structural biology techniques like NMR, X-ray crystallography and cryo-EM.

I am not doing all this work by myself. This is an almost two-year-old picture by now of the group in Utrecht; we have funding from different European projects, and over the years different people have contributed to the continuous development of HADDOCK. This was a retreat last year — some of them are still in the group, some are spread around the world, but they are still in touch. So I am also thankful to the team of people shown here, and with that I want to finish. Thank you very much for your attention.