Hello everyone, welcome back for part two of my lecture on integrative modeling of biomolecular complexes, still with a nice background from Pula in Sardinia. Having given you all this introduction about docking in general, HADDOCK, and an example of antibody-antigen complexes, we will start directly with an example of how you can use data from mass spectrometry experiments to model complexes.

This is the work of Adrien Melquiond, a former postdoc in my lab. Adrien has been working in collaboration with people from the Heck lab in Utrecht, the proteomics and MS lab in Utrecht, on the modeling of a bacterial circadian clock machinery. The circadian machinery of these cyanobacteria is quite intriguing because you just need three proteins: if you overexpress those three proteins, put them in a test tube, add phosphate, add ATP, the clock starts ticking. Circadian clocks are basically the reason why you will be jet-lagged if you are in Sardinia and coming from India or the US, for example. So bacteria also have such systems. In this case the clock is regulated at the molecular level by the interaction of those three proteins. What is happening is that there is a phosphorylation and dephosphorylation process taking place, and you can monitor this process by mass spectrometry, which really allows you to measure the frequency of the clock.

Now, at the time when the work was done there was no structure of the full complex known, and this is the reason why we tried to model the complex, and in particular the KaiB-KaiC interaction. So what were the data available at the time to model this complex? First of all the stoichiometry of the complex, and this was given to us by native mass spectrometry: we knew that KaiB binds to KaiC in a 6 to 1 stoichiometry. It means that you need six KaiB to bind to one KaiC. Further, there was information about the binding interfaces between those molecules, which was coming from H/D exchange experiments.
I introduced those briefly in my general introduction in the first part of the lecture. So let's have a look at those data.

What you are seeing here is the crystal structure of KaiB, the smaller of the two components, color coded based on the protection factor. Regions in blue are protected from H/D exchange when the complex is formed, and you see two rotated views of the same molecule. This indicates that there is a clear face of KaiB that seems to be involved in the interaction. So that's information which we can give to HADDOCK. The other piece of information that we have, next to the H/D exchange data, is a number of mutations: you see two arginines, 74 and 22, and lysine 66, which if mutated abolish the interaction. So this is also information which in principle we can take into account in our modeling.

Now this is KaiC. KaiC is the larger of the two components. You see it's a very large molecule. It has a C6 symmetry: if you look from the top you can clearly see there are six domains. It consists of a kind of double donut, so you see one ring and a second ring. The bottom ring is called C1 and the top one is called C2. The blue regions are protected from exchange when the complex is formed, and you see that we have a problem here, because we detect six binding sites at the top and six binding sites at the bottom. That would indicate a 12 to 1 binding possibly, but we know it's 6 to 1, and we know experimentally it's possible to distinguish. So we know that the binding should be either at the top or at the bottom, but not on both sides at the same time. What you also see, if you open the complex, and this is basically the interface between the two donuts, is that there is some protection there as well. So what's happening is that there is some allosteric communication between the top of the molecule and the bottom, and this is what is also controlling this whole phosphorylation and dephosphorylation process.
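The kind of data just described (regions protected from exchange, plus mutations that abolish binding) is usually reduced to a list of residues before it is given to a docking program. The following is only a minimal sketch of that bookkeeping: the function name, the protection scale, the threshold, and all protection values are hypothetical and are not measured data; only the residue numbers 22, 66 and 74 are taken from the lecture.

```python
def select_interface_residues(protection, abolishing_mutations, threshold):
    """Return sorted residue numbers to treat as likely interface residues.

    protection: dict mapping residue number -> protection value upon complex
                formation (hypothetical scale, larger means more protected)
    abolishing_mutations: residue numbers whose mutation abolishes binding
    threshold: hypothetical cutoff above which a residue counts as protected
    """
    protected = {res for res, value in protection.items() if value >= threshold}
    # A residue qualifies if it is protected OR known from mutagenesis.
    return sorted(protected | set(abolishing_mutations))


# Hypothetical protection values, for illustration only (not measured data).
example_protection = {20: 0.1, 21: 0.6, 22: 0.8, 23: 0.7, 60: 0.05, 66: 0.3, 74: 0.9}
# Residue numbers mentioned in the lecture as abolishing the interaction.
example_mutations = [22, 66, 74]

residues = select_interface_residues(example_protection, example_mutations, threshold=0.5)
print(residues)
```

Taking the union rather than the intersection is a design choice made for this sketch: either source of evidence alone is enough to flag a residue, which suits restraints that are allowed to be ambiguous.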
So we have the information on KaiB and we have the information on KaiC, so we can define those data as ambiguous interaction restraints in HADDOCK to do our modeling. We did two rounds of docking, one targeting the top domain, C2, and one targeting the bottom domain, C1. What you see here are the top four clusters of those two docking runs, ranked, and we could not really distinguish whether the top solutions were better than the bottom solutions. We had a problem here to make a choice: which one is the correct solution?

So we went to another type of data, which is also coming from mass spectrometry, and these are collision cross-section data. Just to explain in a nutshell how this works: the collision cross-section is basically the rotationally averaged shape adopted by a given molecular ion flying under particular gas-phase conditions. In the mass spectrometer the molecules are flying through the vacuum against a gas flow of other small molecules, and you can imagine that the shape of the molecule is going to affect the way and the speed at which this molecule travels through space. So by measuring the time it takes for a given molecule to bridge a given distance in the spectrometer, you can derive information about its shape. It's a long extrapolation, because you measure one point, which is a time, and you extrapolate to a 3D shape. But there is information in there, and you can do these measurements on native molecules with very little sample required. These figures are taken from this review, Nature Protocols 2008, and if you look at the right graph, you see that these are theoretical assemblies.
They consist of the same number of subunits but in different arrangements, and you can back-calculate what the collision cross-section should be from a 3D model. You see that even if the number of subunits is the same in all of those, the shape gives rise to a different collision cross-section. So if we have a 3D model, we can back-calculate what the collision cross-section is, compare that to an experimental measurement, and try to use this information in our scoring process.

This is what we did for this system. You see here the experimental data: there is a range indicated by the dashed lines, so the experimental range is between 133 and 140 square nanometers; it's a kind of surface area that you are measuring experimentally. We back-calculated those values for these eight complexes here, those eight models, with the orange corresponding to the C2 domain and the green corresponding to the C1 domain. You see that the green ones are all larger than the experimental range, while for the orange ones at least the top cluster, which is the top-scoring cluster, the best cluster, fits nicely in the range. So based on this additional piece of information we selected our top cluster from the C2 domain docking as representing this complex.

This was all nicely published in 2014 together with the experimental data. Then, a bit more than two years ago, came the cryo-EM structure of the full complex of KaiA, KaiB and KaiC, and basically what the cryo-EM structure shows is that the bottom model, C1, this one down there, is the correct model. So basically we predicted the wrong model; we screwed up somewhere.
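To make the back-calculation step described above a bit more concrete, here is a minimal sketch of one simple way to estimate a collision cross-section from a 3D model: the projection approximation, which averages the shadow area of the model over random orientations. The lecture does not say which method was used for this work, so this is an assumption chosen for illustration. All function names are hypothetical, every bead gets the same radius, and real tools use element-specific radii and account for the buffer gas. Only the 133 to 140 square nanometer range comes from the lecture.

```python
import math
import random


def random_unit_vector(rng):
    """Draw a viewing direction uniformly on the unit sphere."""
    while True:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:
            return (x / norm, y / norm, z / norm)


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def projection_basis(u):
    """Return two orthonormal vectors spanning the plane perpendicular to u."""
    helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = cross(u, helper)
    norm = math.sqrt(sum(c * c for c in e1))
    e1 = tuple(c / norm for c in e1)
    return e1, cross(u, e1)


def projected_area(centers, radius, u, rng, n_samples):
    """Monte Carlo estimate of the shadow area of equal spheres seen along u."""
    e1, e2 = projection_basis(u)
    pts = [(sum(c * e for c, e in zip(p, e1)), sum(c * e for c, e in zip(p, e2)))
           for p in centers]
    xmin = min(x for x, _ in pts) - radius
    xmax = max(x for x, _ in pts) + radius
    ymin = min(y for _, y in pts) - radius
    ymax = max(y for _, y in pts) + radius
    r2 = radius * radius
    hits = 0
    for _ in range(n_samples):
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        # A sample point is in the shadow if it falls inside any projected disc.
        if any((x - px) ** 2 + (y - py) ** 2 <= r2 for px, py in pts):
            hits += 1
    return (xmax - xmin) * (ymax - ymin) * hits / n_samples


def projection_ccs(centers, radius, n_orientations=20, n_samples=2000, seed=0):
    """Rotationally averaged shadow area, in the squared units of the input."""
    rng = random.Random(seed)
    areas = [projected_area(centers, radius, random_unit_vector(rng), rng, n_samples)
             for _ in range(n_orientations)]
    return sum(areas) / len(areas)


def within_range(value, low, high):
    """True if a back-calculated value falls inside the experimental range."""
    return low <= value <= high


# Experimental range quoted in the lecture, in square nanometers.
EXPERIMENTAL_RANGE_NM2 = (133.0, 140.0)

# Sanity check on a single bead of radius 1: the shadow is a disc of area pi.
single_bead = projection_ccs([(0.0, 0.0, 0.0)], radius=1.0)
print(single_bead, math.pi)
```

If the model coordinates are in angstroms, the result is in square angstroms and has to be divided by 100 before being compared with `EXPERIMENTAL_RANGE_NM2` through `within_range`.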
Well, these kinds of things happen. In science you are all going to run into things that don't work; we are all making mistakes. Usually you tend to forget about these and not speak about them anymore, but I think it's very instructive and educational to show that things can go wrong, and also to try to understand what goes wrong. In all the experiments that you do, and especially when you do simulations, you learn most from your mistakes. When everything goes right you don't think too much, you keep doing the same. When you hit the wall, that's when you stop and start thinking about what went wrong.

Well, here there are actually several things that went wrong. First of all, just let me go back, sorry. First of all, it's probably not a very good idea to do collision cross-section measurements on a non-globular protein. If you remember KaiC, it is a protein which has a hole in the middle, so it's not globular. It's going to fly in the spectrometer in a vacuum, and what can happen is that you get a compaction of the system, so what you are seeing in the spectrometer will be smaller than what you really have, because it's a non-globular protein. It means that when you look at the results, these dashed lines here are an underestimation, so if you were to move those up, then you would select the other model. That's always the way, as a modeler, to blame the experiment for our failure.

But there's another explanation, and this one is a bit more worrisome: it's nature playing tricks on us. The crystal structure that we used at the time for KaiB was a perfectly fine crystal structure with a good resolution. At the time when the cryo-EM structure came out, a second structure was published for KaiB, and it had a different fold. It was exactly the same sequence; you see the sequence here, and you see these two crystal structures. The one which is called gs is what they call the inactive, tetrameric ground-state fold; this is the fold that we
had at our disposal at the time, and the other one, which corresponds to the active state, was a monomeric crystal structure. The first part of the structure is the same, but if you look at the secondary structure, the second part is completely different: what was alpha becomes beta, and what was beta becomes alpha. So it's a completely different fold. Nature is fooling us: it's the same sequence, which can exist in two different folds. That's a bit worrisome. Hopefully this is not happening too often, otherwise we cannot trust structural biology and the data in the PDB, but from time to time you will have this kind of weird thing happening, nature reminding us that we should be humble in our work.

So now an intermezzo. In the meantime we have been improving HADDOCK and introducing all kinds of new features, one of which was to introduce the MARTINI force field into HADDOCK to coarse-grain molecules. The idea of coarse-graining: MARTINI represents four real atoms by one particle, or one bead, so it's a simplification of the system in terms of the number of atoms, which allows us to go to larger systems and model larger dimensions. There are other, additional benefits to using coarse-graining, so if you want to read more I just refer you to this paper; this was the work of Jorge, a PhD student in my group. Basically we go from an all-atom model to coarse-grained particles, we do the docking at the coarse-grained level, and we transform the model back into an atomistic model. This is now supported in version 2.4 of HADDOCK and also on the web portal.

Using this approach we went on to model the full KaiB-KaiC complex, basically the seven-mer complex: six molecules of KaiB and one molecule of KaiC. Previously we were only docking one KaiB to one of the domains of KaiC. So now we dock six KaiB onto KaiC at the same time, also using symmetry as information, because we know that we have this C6 symmetry, using the same data that we used more than five
years ago, and targeting basically the same C2 and C1 domains, so the top and bottom domains. Now, if you look at the HADDOCK score, in this case we clearly see that the HADDOCK score points to the bottom model as being the best solution. We didn't have this before. By doing coarse-grained docking we also gain in time.

So we took this model, and the model is now consistent with the cryo-EM structure. What you see here is our model in one color and the cryo-EM model in the other one, and you see the center of mass of the two KaiB on the cryo-EM model. We didn't use the cryo-EM data to do the modeling, but we used the data to validate the model. If we fit our model into the EM density, and we did that in Chimera, we get a correlation of 0.82, and the model that was directly built and refined from the cryo-EM data, published in a Science paper, has a correlation of 0.84. So considering we did not use the cryo-EM data in the modeling, we get quite a good correlation, close to that of the cryo-EM structure, which is nice.

Since I was speaking a little bit about cryo-EM, we now have a good transition to the next topic in this lecture, which is the use of cryo-EM data directly to guide the modeling process in HADDOCK. Cryo-EM, I think everyone is well aware of it, is really the new star in structural biology. If you go back 10 years, we were speaking of cryo-EM as a blobology technique, and these days cryo-EM is giving us high-resolution structures of proteins.

So what are you doing? Well, it's transmission electron microscopy. You vitrify your sample in liquid ethane and then you take images, 2D images, of your sample. What you see in those images is that your proteins show up only faintly: the signal-to-noise of a single image is not fantastic, and what you have are images of the protein in all kinds of orientations. So then you have to classify those images depending on the orientation, and once you know in which orientation you're
looking at your object, then you can reconstruct a 3D object, which gives you a density. This used to be rather low resolution; now it's reaching very high resolution. You can also use a different technique, which is called tomography, where you are tilting the sample to generate different projections, and from the tilt series you reconstruct a 3D object. So again, before 2013 this was the resolution, and currently we have atomistic resolution. It's really impressive, and this was made possible by developments both in software to interpret the data and in the electron detection, in particular direct electron detectors.

If you go into the PDB and the EMDB you will find, and this was the distribution of resolution about one year ago, that there are now quite a lot of maps that are below five angstrom, and there are maps that are even below two angstrom these days. So it's really impressive. It's not like everything is going to be at very high resolution: even with the current technology you will still have regions in your map which might have low resolution, and in the past you might have had limited-resolution maps. Resolution is not a constant in cryo-electron microscopy; that's also different from X-ray crystallography. So you have regions that might reach very high resolution, but you might have regions where the resolution is still limited, not allowing you to build a model ab initio into the data.

That's where you go back to the way the cryo-EM data were used several years ago, which was basically to use existing structures of the components of the complex and fit those structures into the density. This can be done manually using tools like UCSF Chimera. There is also software that can do an automatic fitting into the map, in different flavors, PowerFit being our own. When you are doing this fitting, typically what was happening is that you fit one component at a time into the map, but you don't consider the interaction, the
energetics of the interactions between those fitted models, meaning that interfaces are never really optimized. The models look nice when you represent them, but you should not turn on the side chains, otherwise you're going to realize that there are a lot of clashes. Experimentalists might tell you that you should not look at the side chains anyway if the resolution is too low. So that's a debate. I think if the model is to be useful, it should also make sense in terms of quality, of stereochemical quality.

So this is something we wanted to look at: to try to use this map information, the cryo-EM density, as restraints in HADDOCK. This is the work of Gydo, a former PhD student in the group. We basically implemented EM restraints into HADDOCK to guide the docking process. HADDOCK is using CNS as its computational engine; it stands for Crystallography and NMR System. CNS was originally designed to do crystallographic refinement and it has a lot of tools for NMR as well. It's scriptable, and all of HADDOCK is written at the script level in CNS. So CNS has from the start all the energy functions to handle X-ray density maps. What we can do is transform our cryo-EM density into a format that CNS understands and then use the restraints in CNS.

Now, if you try to turn on the EM restraints directly from the beginning of your docking and try to dock your molecules into the map, what we quickly realized is that the solutions are not converging. So instead of using the map from the beginning to drive the docking, what we are doing is to identify centroids, that is, the most likely locations of the molecules in the map. We define those centroids and then we use distance restraints to guide the molecules toward those centroids. If you don't know where the molecules are going, you can define those restraints as ambiguous, which is very natural to what HADDOCK is doing. You do your energy minimization using the distance restraints, and once the molecules are in the map you turn on the
density and optimize against the density. Once this density is there we can take it through our refinement, the flexible refinement stage, and it's also part of the scoring function. So first centroids to pull the molecules into the density; once they are in the density, we turn on the density and refine against it.

Now, how do we get those centroid positions? You could get them from UCSF Chimera, but because we needed them we also started looking into our own fitting software, to systematically fit molecules for all possible translations and rotations into the map. This is what gave rise to PowerFit, and PowerFit has been adapted so that it gives us the positions of those centroids, which you can give to HADDOCK.

Now let's look at a real case. This is the integrative modeling of the 16S ribosome and KsgA, a protein that binds to the 16S ribosome. What were the data available at the time? There was a 13 angstrom resolution cryo-EM map available. A crystal structure of the 16S ribosome was known, and there was a crystal structure of the protein. The binding site on the ribosome was mapped by hydroxyl radical footprinting, so we know from this information where the protein is binding on the ribosome. There were also some mutagenesis data on the protein side, like those three arginines that you see here, which if mutated basically prevent the binding. So this is all information that we can now use in HADDOCK to refine this complex.

There is of course a model of the complex that was deposited at the time; it's entry 4ADV. It looks very nice as long as you only show the backbone, but as soon as you turn on the side chains you see what's happening here: the interface between the protein and the RNA is full of clashes. And if you concentrate on what those three critical arginines, or rather residues, are doing (one lysine, two arginines): this one is not doing anything, it's not contacting the RNA; this one is clashing; and there's only this lysine that kind of makes sense. So the
model which is there is basically not explaining the data that we have. So the question is: can we do better? We're going to use HADDOCK to try to debump this model and improve its quality, and this is the result of the protocol that I just showed you. We put all the data together, we do our energy minimization with centroids, we turn on the density, and we refine the complex. You see here the RMSD from the cryo-EM model versus the score, and we get one set of solutions. Then you can zoom in again at the interface, and now you see that all those amino acids which were known from mutations to be critical make interactions that make sense. It doesn't mean per se that they are correct, but at least it makes sense.

What is also interesting is that the model reveals additional residues that could possibly be critical for the interaction. For example, 147 and 248 are interesting arginines. They have not been mutated, but this would be a validation: we should mutate those, test the binding, and see if the model is indeed predictive or not. If we also look at the sequence conservation, we see that all those arginines that we find as being important for the interaction are indeed also highly conserved. So there is some backup from evolution that there is indeed something interesting happening with those residues.

So, in conclusion for this part, the use of cryo-electron microscopy data in HADDOCK: they are now fully supported. It's part of HADDOCK version 2.4, and it's also supported in the new 2.4 web portal of HADDOCK. We use the distance restraint concept to drive the docking in the initial, first stage, and then we turn on the density to refine and optimize the model. It's map-size independent: you don't need to use all the density, you can basically take the subset of the density map that you are interested in. And it's fully compatible with all the other types of restraints that are supported by HADDOCK, so it can be used in an integrative way
to do integrative modeling. This work is also published in Structure, so you can look up the paper if you want to know all the details.

Now the next topic. We've been speaking about very large complexes, the 16S ribosome and this bacterial circadian clock machinery, so now we're going to zoom in on smaller molecules, and we're going to look at drug-like molecules. HADDOCK has been supporting small molecules pretty much from the early days. We first implemented support for cofactors, because there are many proteins that have cofactors and those might be important for the interaction as well. So the HADDOCK server, which has existed since 2008, has supported small molecules from the start, and we have also been doing some work in the past on docking small molecules. But we have no ambition to compete with all the very nice small molecule docking software out there, like Glide, AutoDock Vina and this kind of software.

One feature of HADDOCK which makes it interesting, however, is that we can define specific distance restraints between two points, between two atoms, and a lot of the small molecule docking software could not do that until quite recently. This is the reason why, for example, companies like ZoBio, which you see in this publication, have been using HADDOCK for quite some time to drive the modeling of protein-ligand complexes based on NMR data that they could measure. They have a technology with which they can measure a few distances between the ligand and the protein, and they use that for guiding the docking. At the time, the small molecule docking software could not do that; it would only be able to filter afterwards.

Another, more recent example is coming from Novartis, a JACS paper from 2017. For the same reason, using the classical small molecule docking software they were not getting complexes that made sense of their data, but using HADDOCK they could define specific distance restraints, generate models for systems which would not crystallize, and based on
the models do structure-based drug design, improve the models, and improve their design, and then those started crystallizing. When they compared the crystal structure with the prediction from HADDOCK, they agreed nicely. That's a nice story. We were not involved in this work, but it's nice publicity for HADDOCK and small molecules.

Now, a few years back we discovered the Grand Challenges of the Drug Design Data Resource, D3R. D3R has been running blind experiments that test the performance of small molecule docking software and push the boundaries of what is doable in drug development. We participated in three such Grand Challenges: two, three and four. They are organized in two stages. In the first stage you have to predict the poses, so predict the structure of the complexes, and rank them. In the second stage the crystal structures are released, but then you have to predict the affinity of the molecules. One good thing here is that for each challenge, for one set of complexes, the data are all coming from one specific company. Which company is providing the data changes, but they have all been measured in a consistent way, so they might have IC50s or binding affinity data which have been measured with exactly the same settings, which is very important. Our participation in D3R has mainly been the work of Panos, a former PhD student and then a former postdoc in my group.

The first target that we encountered in D3R was the farnesoid X receptor, FXR, also known as the nuclear bile acid receptor. It's a pharma target involved in diabetes, among others. The challenges in this system are twofold. First of all, the binding site is quite buried, so it's a hydrophobic pocket, and the ligands that are known are branched ligands. So you have a buried binding site, and for us, for HADDOCK, this is going to pose problems, because when you do small molecule docking, for example with
AutoDock, you build a box around your binding site and this is the only region where you search. In HADDOCK we start with the molecules separated in space, and the small ligand will have to penetrate through the protein to reach the binding site. The other challenge is that there are quite some conformational changes in the protein depending on the ligand to which it's binding. You see here a post-D3R analysis of the different conformations of the receptor bound to the different ligands that have been crystallized, and you see that indeed these helices on top of the binding site do show quite some conformational changes depending on the ligand bound to it. So you need to be able to describe this kind of change to some extent to get reasonable predictions.

So what did we do? You are given SMILES strings, a text string that describes your ligand. First we generated 3D conformations of the ligands using the OpenEye OMEGA toolkit, which is freely available for non-profit use. We clustered those solutions and selected a representative from the different clusters, and then we give an ensemble of conformations of the ligand to HADDOCK. This is the implicit flexibility treatment I have been speaking about in my introduction.

That was the ligand side. For the protein side, we searched the PDB and found quite a number of structures that had already been crystallized with ligands. We clustered all those structures based on the similarity of the binding pockets and then selected cluster representatives, so this also gives us an ensemble of structures for the protein. So we have an ensemble of structures for the ligand and an ensemble of structures for the protein. We defined the binding pocket, as active residues, as all the residues within five angstrom of any of the ligands that were available in the PDB, and we did our docking. When the results became available, these were the first two ligands I started
looking at. If you look at the left side, it looks like the prediction was not so bad: we were 1.2 angstrom away from the crystal structure. I said, oh, that's quite nice, so probably the protocol is not doing so badly. Here you see that we are a bit further away, this is two angstrom. The limit in small molecule docking for an acceptable solution is around two angstrom, and you see that we have flipped a group here, which is why the RMSD is much higher. So we thought, okay, we probably didn't do that badly in this prediction.

When the results came out, the green bar is where we are standing. We can submit five predictions per ligand, and you see that the average of the best prediction in the top five for us was above five angstrom. We ran the protocol completely blindly; in most cases we didn't even look at what we submitted, so some ligands probably didn't even penetrate the binding site properly. This is one of the limitations. Okay, so that's another example where basically you are hitting the wall. So yes, we didn't do that well.

So what went wrong? This is when you start learning your lessons. The first lesson, which also came from participating in the D3R meeting and talking to people, is that we have to make a smarter choice of the receptor conformation. The second lesson: we also need to select the ligand conformations much better for the docking process, and not use the large ensemble that we used before. And the third lesson was that the starting point, which is randomly separated molecules in space, is not ideal for small molecules, because we have this problem of reaching the binding pocket. These are all good reasons; these are our main limitations, and you can read all the details in that paper.

Then came Grand Challenge three, which was on cathepsin S, with 141 small molecules. For 24 of those there were crystal structures, so we had to predict those 24 structures. So, to select the receptor
conformation, we now compare the ligand that we have to dock with the ligands that are present in the crystal structures available in the PDB. So we need a template in the PDB in order to be able to do this. We select the receptor which has the ligand most similar to the ligand that we have to dock. For calculating the similarity of the ligands we use the Tanimoto coefficient, which basically compares the similarity of substructures in the ligands. So we have a smart choice of the receptor.

Now, for the ligand we still generate a lot of conformations, up to 500, using the OpenEye OMEGA software. But now we're not going to do clustering and select from all those conformations; instead we are selecting conformations that have a similar shape, and also similar chemical properties, to the ligand found in the receptor that we selected in the previous step. We can do such a shape comparison of the ligands; this is implemented in the ROCS software from OpenEye. You see here that these are different ligands, but that they have rather similar shapes. With this approach we select the ten ligand conformations that are the most similar in this comparison to the template ligand from the receptor we selected, and this is what we are now going to give to HADDOCK.

The third lesson from GC2 was that our starting point was not ideal. So instead of doing the full docking process in HADDOCK, we superimpose the selected conformations of the ligand that we have to dock onto the crystallographic ligand in the template that we selected, and for this we again use OpenEye software, called Shape TK. So basically now we're only doing the refinement, the final refinement stage of HADDOCK in explicit solvent.

Now this is the set of solutions. You see here the 24 ligands, and this is the RMSD from the crystal structure, so these are already the results. You have 2.5 angstrom here, and you see that in most cases we have
predictions that are below 2.5 angstrom. These are some of those: this one is below one angstrom, a very nice prediction; here is another one, 1.75 angstrom. These are quite large ligands. You also see that these are the top five poses, and that the scoring function is doing quite well. This is one cluster of solutions and this is another cluster, so we still have this cluster-based approach, and you see that the top cluster is actually the correct one. Same story here, same story here. So in most cases the top cluster is the correct solution, and there are only a few cases where the order is actually reversed. So the scoring function is doing quite okay.

Where does that put us in overall performance? You see that we are now here, at the top of the predictors. The top is quite flat actually, so this is position six, and we have about two angstrom median over all poses. So that's a very nice result. We learned our lessons from Grand Challenge two and applied those in Grand Challenge three, still without knowing what the outcome would be, but it clearly worked nicely.

Grand Challenge four came in 2018, with yet another protein, beta-secretase, and 20 small molecules for which there were crystal structures. We followed exactly the same template-based strategy as in GC3. These are some of our predictions superimposed onto the crystal structures. One of the challenges here was that the ligand was cyclic, and you see again that we are doing a very good job at predicting the right conformation: the median for the best model is 1.2, for the top one model 1.5, and for the top five 1.7, using exactly the same template-based approach as in GC3. This is all described in this paper published in 2019.

So where does that put us against the entire list of participants? In this list these are not all different participants; they are also the same participants trying different approaches. You see that we have done three submissions: this was the template-based approach, and here we have been
trying some other things. But again you see that we are doing very well, around one, and there are groups that are going as low as 0.5. So I think this template-based approach clearly makes sense.
So in conclusion for this D3R challenge: these kinds of challenges, and also CAPRI for protein-protein complexes, are very much catalysts for software and methodology development, because you learn from your failures. You usually don't learn from your successes; you learn from your failures. The success factor for small-molecule docking for us was really to follow a template-based approach, which consists of a smart selection of the receptor conformation and the ligand conformations for the docking, and also a smart positioning of the ligand in the binding pocket, and then just refining the poses.
Now this brings me to one of the last topics, but looking at the time for this lecture, I think I will skip this topic. I will record it as a bonus session, which you can look at at a later stage if you want. So I'm going to go directly to the conclusions and perspectives part.
In conclusion, I hope that I've given you an overview of integrative modeling, what it means, and in particular how HADDOCK works under the hood and what it can do. I gave you several examples of using different types of data to guide the modeling process. What we always have to realize is that what we are doing when we are doing modeling is generating models. They might not represent the truth, they might not be perfect, but they provide working hypotheses, and in that way they are still useful. So if you take the model and generate hypotheses, go back to the lab or talk to your collaborators, and if you can do experiments, mutagenesis or other experiments, to validate the model, then they are very useful because they are driving the experimental work. So modeling is not the end of the road; modeling is just the start
for your experiments. You really have to see modeling and experimental work as intertwined, and you go in circles. Using data when you cannot solve the structure in a classical way is really very complementary to classical structural methods, and you will see more and more integrative modeling coming out this year and in the coming years. There is even now a database which is accepting integrative models, called PDB-Dev.
Now, do you want to learn more? There will be a practical tutorial in this online course, but we also have many more tutorials online. If you visit bonvinlab.org/education you will find tutorials for version 2.2, which is the old version of HADDOCK, but also for the 2.4 version. Some of those are quite complex tutorials. The tutorial that we're going to run in this summer school is not yet in this list, so you will have a premiere in terms of tutorials.
Now to finish, I want of course to thank the people who are doing all the work. I have the pleasure to give these lectures and go to nice places like Pula, virtually this year. So this is all my group over the years. This is a picture taken about a year ago; not everyone currently in the group is in the picture, but it's a good representation of the current group members, who are also all listed on the bonvinlab website. And we could not do all this work if it were not for the funding of several national and European projects, with BioExcel being very important to catalyze and support all the software developments around HADDOCK.
With this I'm finished. I want to thank you for taking the time to listen to me online. You can find much more information about what we are doing, our research and our software, at the bonvinlab.org site. If you have any questions about HADDOCK, we have this Ask BioExcel forum. And with that I thank you again for your attention. Bye everyone, and hopefully we'll meet sometime, someplace in real life.
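As an addendum to the template selection step described earlier in this lecture, here is a minimal sketch of how a Tanimoto coefficient can be used to pick the template whose ligand is the most similar to the ligand to dock. This is not the code used for the D3R submissions. It assumes that each ligand fingerprint is given as a plain Python set of "on" bit indices; the template names and bit values are made up for illustration, and in practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    # shared bits divided by the number of bits set in either fingerprint
    return shared / (len(fp_a) + len(fp_b) - shared)


def pick_template(target_fp: set, templates: dict) -> tuple:
    """Return (name, score) of the template ligand most similar to the target."""
    scored = [(name, tanimoto(target_fp, fp)) for name, fp in templates.items()]
    return max(scored, key=lambda item: item[1])


# Hypothetical toy fingerprints: names and bit indices are illustrative only.
target = {1, 4, 7, 9, 12}
templates = {
    "template_A": {1, 4, 7, 20},
    "template_B": {1, 4, 7, 9, 15},
    "template_C": {2, 3, 5},
}

best_name, best_score = pick_template(target, templates)
print(best_name, round(best_score, 2))
```

With these toy values the second template shares four of the six bits set in either fingerprint, so it would be the receptor selected for the next step. The same ranking idea applies regardless of which fingerprint type is used to describe the substructures.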