Okay. Well, welcome everyone. It's a pleasure to be here today online to speak to you about the integrative modeling of biomolecular complexes. I'm Alexandre Bonvin, based at Utrecht University in the Netherlands, and my group is developing, among other things, the HADDOCK software. So this is the topic for the next hour, and then we'll move into a tutorial slash demonstration of a small aspect of what you can do using HADDOCK. I'm going to give you first a general introduction on the topic of biomolecular interactions. Then I'm going to explain what HADDOCK is and what distinguishes it from other software in the field, which is the use of information to really drive the modeling process. I will illustrate that with a number of application examples, ranging from macromolecular complexes, protein-protein, to small-molecule complexes, and finish with some conclusions. So, biomolecular interactions are crucial to life. They are everywhere. We're all aware of the genomic information which encodes the proteins, which are the main actors of life. But proteins perform their function by interacting with other molecules. So here we have the genome; the expressed proteins give you the proteome; and if you go to the next level of organization, you have the interactome, which is the network of all interactions in a given cell. Proteins are the main players in this interactome. So if you want to understand how things work at the molecular level, not only is structure required, but you also need to shed light on the interactions that those structures make. That means studying complexes between biomolecules, because changes in structure, but also changes in this network of interactions, can be at the origin of disease; simple mutations can distort this network.
The same applies if you want to engineer molecules to create better materials or for food applications, and of course drug design, which is one of the topics of today: to do structure-based drug design, you need access to structure if you want to modulate interactions with drugs. So this brings me to the structural biology of interactions, where you see here on the right side more the experimental side of things, and on the left side the computational part. We have here the different experimental techniques, with X-ray crystallography, NMR and cryo-electron microscopy being the three key methods to get access to structural information on biomolecules: X-ray being the oldest one; NMR working mainly in solution or in the solid state, but also adding a lot of information about the dynamics of molecules; and cryo-EM really the star these days in structural biology, producing amazing structures of complexes. These three will give you access to, say, the full structure of a complex. But next to those key methods, you also have a lot of experimental methods, and this is by no means an exhaustive list, that give you pieces of the puzzle. Mass spectrometry these days has moved into structural biology: you can detect crosslinks between molecules, and those crosslinks provide you distances between molecules. They might not be sufficient to solve the structure of the complex fully from the data, but they provide you pieces of information. The same applies, for example, to scattering methods like small-angle X-ray scattering or neutron scattering, which provide you information about the shapes of molecules, but not the atomistic details.
So if you can combine this type of information, when you cannot solve the structure from scratch, together with computational methods, you can characterize interactions as well. So we are moving from the experimental to the computational, and here there are different ways of looking at the structure of complexes. You can do it by homology modeling if there is a homologous structure of the complex in the Protein Data Bank. But if you look in the PDB, you will realize that there are far fewer structures of complexes than of their components. You can try to predict interactions by molecular dynamics, but predicting the association of macromolecular assemblies is very challenging, requires a lot of computing, and not so much has been done there. And of course, in the context of today's lecture, docking is the method of choice. I should also add here AlphaFold 2 from Google DeepMind, which has brought a revolution in the field of computational structural biology. It was developed to predict structures of single proteins, but you now see a lot of developments where AlphaFold 2 is also being looked into to predict structures of complexes. You will find preprints on bioRxiv talking about protein-peptide modeling using AlphaFold, and you also find papers combining classical docking methods with AlphaFold 2. So a lot of things are going to happen in the very near future. But for today, we're going to concentrate on docking as a method to generate 3D models of biomolecular complexes. So, molecular docking in a nutshell: given the structures of the components of the complex, and in this case we see a binary complex of two proteins, can we predict how those associate? This requires searching a six-dimensional space if the molecules are considered rigid, because you can fix one molecule at the origin of your coordinate system, and what you have to do with the second molecule is to sample all possible rotations.
So in 3D that would be three axes, and all possible translations around the first molecule, again three dimensions. So it's a six-dimensional problem for two molecules. Of course, many complexes consist of more than two molecules, so you also need to be able to model larger assemblies, and not all docking software out there can handle more than two molecules. So this is not as simple as it looks. There are, of course, other challenges that come into play, because you have to add flexibility to those systems. Flexibility is an intrinsic property of biomolecules and is often associated with their function. So the dimensionality of the search problem is much larger than six dimensions. Now, what do those docking methods usually consider in making those predictions? You're going to sample a lot of possible solutions, and then you have to decide which ones are the good ones and which ones are the bad ones. The oldest way of playing this discrimination game was to use the shape complementarity of the molecules: you want to measure how well those two molecules fit on top of each other. This is more a geometric consideration for scoring. But you can also use classical energy terms like electrostatics, of course an important part of interactions and biomolecular recognition, and van der Waals energies. And if you have data, you might also use the data as a way of scoring those models. So docking consists basically of two parts. The first part is the sampling: you need a strategy to generate a lot of different models of your complexes. I just told you you have a six-dimensional search problem, so you have some kind of interaction landscape. You have to sample this landscape, generating a lot of different models, ideally sampling the entire landscape if you do, say, ab initio modeling.
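To make the six rigid-body degrees of freedom concrete, here is a minimal Python sketch (not HADDOCK code) of generating one random pose of a second molecule around a fixed first one: three rotational plus three translational degrees of freedom. The 30 Å translation range and the toy five-atom "ligand" are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_rotation_matrix(rng):
    # Random rotation from the QR decomposition of a Gaussian matrix
    m = rng.normal(size=(3, 3))
    q, r = np.linalg.qr(m)
    q = q * np.sign(np.diag(r))   # fix the sign convention of the QR
    if np.linalg.det(q) < 0:      # force a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def sample_pose(ligand_coords, max_translation=30.0):
    """One random rigid-body pose: 3 rotational + 3 translational DOF."""
    centered = ligand_coords - ligand_coords.mean(axis=0)
    rotation = random_rotation_matrix(rng)
    translation = rng.uniform(-max_translation, max_translation, size=3)
    return centered @ rotation.T + translation

# toy "second molecule" of five atoms placed around a fixed partner
ligand = rng.normal(size=(5, 3)) * 5.0
pose = sample_pose(ligand)
```

Because the transform is rigid, all internal distances of the molecule are preserved; a real docking engine would repeat this thousands of times to cover the landscape.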
Associated with this landscape is some kind of energy function that is hopefully going to distinguish a good prediction, which should be here at the bottom of this energy well, from a bad prediction. So this is the sampling. The second phase is the scoring: associating a score, some energetic value, with each of those models, and selecting what should be the correct solution. The two are often tightly connected: the choices you make for sampling have implications for scoring, and vice versa. And these days you also see a lot of machine learning and deep learning models coming out to help with this scoring problem. Not so much the sampling yet, but that's bound to come as well. Now, you may have data, and this can be external experimental data, like some of the methods I just mentioned, but it could also be bioinformatic predictions, or co-evolution information, which is also used by methods like AlphaFold 2 and many other prediction methods. If you can predict contacts between molecules, this is very valuable information that you can use to bias the sampling, possibly, or to bias the scoring. So you can use the data in two ways. And this brings us basically to integrative modeling, namely the use of data, the integration of data, during the sampling phase. If you don't use the data for the sampling, you're going to do a global search: you have to sample the entire space, generating lots of models. If you use the data to drive the search, you're going to focus your sampling on a given region of space defined by the data, and hopefully you can locate the global minimum of this landscape more easily. But this also comes with some dangers, because if your data are wrong, if you have bad data or false-positive information, you might search in the wrong region of space and never find the right answer.
So you have to be able to deal with errors in the data, and you have to be able to assess the reliability of your data. This has advantages, but it also comes with dangers. There's no free lunch. When we speak of integrative modeling in the field, we generally mean the combination of different sources of data, and you see here a number of experimental techniques, or actually information sources, for modeling complexes. I mentioned mass spectrometry and crosslinking. You can do hydrogen-deuterium exchange experiments that allow you to identify binding interfaces. You can measure some kinds of distances between molecules by all kinds of different methods, crosslinking MS being one of them. You can measure orientations, which come more from NMR, and shape. You can simply do biochemical experiments, or detect mutations or SNPs, genetic variations that might affect interactions. And if you have nothing else, you still have sequences and co-evolution these days, from which you can make predictions about where the binding sites are, and possibly about which residues are interacting, because co-evolution is basically the revolution that has also led to the success of AlphaFold 2. If you want to read a bit more about that, there are a number of reviews; these are reviews that we have been writing over the years. If you want to know the state of the art in the docking field, you can look at the special issues of Proteins that appear every few years based on the results of the CAPRI blind experiments, where different groups put their software to the test. Okay, so let's move now to the specifics of HADDOCK. How do we do the modeling in HADDOCK, using information to guide the process? You see here the original publication, which dates from 2003, so it has been almost 20 years since we started developing the concept of information-driven docking, or high-ambiguity-driven docking, as you see here.
It started from NMR data initially, where we had information about binding sites, but we didn't know how the molecules were binding. Since those original days we have been extending the capabilities of HADDOCK to deal with a large variety of information. I think one of the strengths of HADDOCK is that if you have information about residues important for the binding, surface information, you can encode this information in some kind of energy function, I'm going to come back to that in a bit, and use this energy function to bias both the sampling and the scoring. Next to encoding any kind of surface-based information, you can also specify very specific distance restraints between specific parts of the structures. This is nothing new; it is what is done classically in NMR structure determination. But HADDOCK was one of the first docking programs allowing you to specify distances between atoms, which also has its advantages for small-molecule docking, as you will see. Now, since the early days of HADDOCK things have been evolving a lot. Currently we can handle up to 20 different molecules, so we can build large macromolecular assemblies. Symmetry is another sort of information that you can leverage in the docking process: if you are modeling symmetrical assemblies, homomers, you can impose the symmetry as another restraint, another energy function that you incorporate into your sampling and scoring. We do allow for flexibility at the interfaces, so HADDOCK has a flexible refinement stage, as you will see, and we also do some final refinement adding solvent, although in the latest version of HADDOCK we don't do that by default anymore. And we have constantly been putting our software to the test by participating in CAPRI over the years, and have obtained very reasonable, consistent results. Now, how do we encode information?
I told you one of the key sources of information, what we often have at hand, is knowing that some residues are important for the interaction. This could come from a mutation: if you mutate residue X on the surface, the complex is not formed anymore. So we know that this residue is important for the binding, but we don't know what contacts this residue is making. We want to define some kind of energy function that will force this residue to be at the interface, and the way to do that is to use the concept of ambiguous interaction restraints, or ambiguous distances. This is not a novel concept; it was introduced for NMR structure calculations by Michael Nilges, where we typically have ambiguity in the assignment of NMR signals and have to deal with that ambiguity. But in the context of docking, we push this concept of ambiguous restraints to a much higher level. So you see an example of a complex: you have protein A and protein B. We have a number of amino acids that we have detected as being important for the interaction from some kind of experiment, which will be the red ones here, and a number of amino acids that we consider as surface neighbors. We don't usually detect the perfect interface: we might be missing information, or we might have too much information, so we have to deal with that. So we broaden the definition of the interface a bit by typically also considering the surface neighbors. What we want to do is define some kind of energy function that will force these residues to make contact with one of the residues on the other side, without knowing which one should be the correct solution. And we do that by defining a distance restraint between this residue here and all residues on the other side. This distance restraint is basically the kind of energy function that you see here: it has a harmonic part.
It is a flat-bottom potential with a harmonic part, and then it becomes linear. This is a classical potential that we use in NMR structure calculations; again, this is the work of Michael Nilges. You see here the whole function: it is zero if you are between the lower and upper limits; it is harmonic if you are above the upper limit; and then it goes over into a linear function above a given value s here. Now, for each amino acid that you have identified or predicted to be important for the interaction, we are going to define one such energy function, one such distance restraint. The distance that we input here, because we don't know which distance it should be, since there are all kinds of possibilities, is an effective distance: a sum over all possible atom-atom combinations of distances between this residue here and all the residues on the other side. We do this summation as one over the distance to the sixth power. This again comes from NMR; it is the form of dipole-dipole interactions, but it is also the attractive part of a Lennard-Jones potential representing van der Waals interactions. So you sum all those distances as one over r to the sixth, then take the inverse of the sum and its sixth root, and this gives you one distance, which is what enters this restraint. Now, what limits do we define? We use rather short upper limits of only two angstroms by default in HADDOCK. So we say: this residue should be within two angstroms, when I calculate this effective distance, of the interface of the other molecule. Two angstroms is very short, shorter than the shortest distance between two carbon atoms, for example. But the property of this summation is that the distance that comes out at the end is shorter than any distance that enters the summation, and by quite a bit. That's the reason why we can afford to use an upper limit which is very short, two angstroms.
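The effective distance and the flat-bottom restraint described above can be sketched in a few lines of Python. The force constant `k` and the harmonic-to-linear switch point `s` are illustrative placeholders; only the 2 Å default upper limit is from the talk.

```python
import numpy as np

def effective_distance(coords_a, coords_b):
    """Sum-over-pairs r^-6 'effective' distance (in Å) between two atom
    groups; by construction it is shorter than the shortest atom-atom
    distance entering the sum, which is why a 2 Å upper limit works."""
    diff = coords_a[:, None, :] - coords_b[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    return (d ** -6).sum() ** (-1.0 / 6.0)

def air_energy(d_eff, lower=0.0, upper=2.0, k=50.0, s=2.0):
    """Flat-bottom restraint energy: zero inside [lower, upper],
    harmonic just outside, switching to a slope-matched linear
    function once the violation exceeds s."""
    if d_eff < lower:
        return k * (lower - d_eff) ** 2
    if d_eff <= upper:
        return 0.0
    violation = d_eff - upper
    if violation <= s:
        return k * violation ** 2
    # linear part, continuous in value and slope at violation = s
    return k * s ** 2 + 2 * k * s * (violation - s)
```

For two atom groups 4 to 6 Å apart, the effective distance already drops below 4 Å, illustrating why the r^-6 summation tolerates such a tight upper bound.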
Now, I also mentioned that the data are often not perfect, so we are dealing with false-positive predictions or experimental data. Our way of dealing with that is to randomly delete part of the information for each docking trial that we do, and by default the server will delete 50% of the information. It means that each model we generate will be based on a random subset of the input data that we give to the software. So that's really the key behind the use of information in HADDOCK. Of course, if you know exactly which pairs of atoms should be in contact, because you have crosslinking data from mass spectrometry, you don't need these ambiguous restraints and you can define a very specific restraint in that case. So we have all the flexibility, from highly ambiguous to very specific. Now, how do we search the space? We have to search this, say, six-dimensional space if the molecules were rigid. For that we use classical search techniques based on a combination of energy minimization and molecular dynamics simulations. So it's gradient-driven: we have an energy function, we calculate the forces, and the forces tell us where the system wants to move. Next to the experimental or bioinformatic information that we put into the modeling process, we have our classical force field terms for molecular dynamics: we describe bonds between atoms, angles, rotations around bonds, and the non-bonded interactions. Now, the current docking protocol in HADDOCK consists of three steps. In the initial step, the molecules are treated as rigid; this is an initial docking by energy minimization. Then we heat up the system and introduce flexibility, but at the interface only, to optimize the interface of the complex. And at the end, the old protocol was solvating those molecules in water and performing a very, very, very short molecular dynamics simulation.
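The random-deletion idea mentioned above is simple to sketch. The function name and list representation here are hypothetical; only the 50% default deletion fraction comes from the talk.

```python
import random

def restraint_subset(restraints, keep_fraction=0.5, seed=None):
    """Keep a random subset of the restraints for one docking trial,
    so each generated model is driven by a different slice of the
    (possibly noisy) input data. keep_fraction=0.5 mirrors the
    server's default 50% deletion."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(restraints) * keep_fraction))
    return rng.sample(restraints, n_keep)

restraints = [f"restraint_{i}" for i in range(10)]
trial_subset = restraint_subset(restraints, seed=1)
```

Running this once per model means a false-positive restraint is absent from roughly half of the trials, so it cannot systematically steer every model into the wrong region of space.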
So nothing compared to, say, nanosecond or microsecond molecular dynamics simulations. But we have to sample a large number of models, so it's also a question of time. This is an illustration of what happens in the rigid-body minimization: the molecules are separated in space, randomly rotated separately, and then we turn on the minimization, and the minimization in this particular example is guided by residues that were detected by NMR as being part of the interface. This brings the interfaces together without pre-defining what the orientation between the molecules should be. In this phase we sample typically on the order of 10,000 to 100,000 different models. Then we apply a scoring function, I'm going to come back to that, select a fraction of those models, a few hundred, and then start using molecular dynamics to optimize the interface. You see here this simulated annealing protocol: flexibility is introduced first along the side chains, so you see side-chain motions, and then in the second and last phase we also allow the backbone to move. You saw that this loop here just flipped over. This is done in torsion angle space, not Cartesian space, where the degrees of freedom are rotations around bonds. This allows us to quite efficiently freeze parts of the molecule without applying positional restraints: the molecule can move freely in space, but only the degrees of freedom that we want to sample are free to move, and these are rotations, first of the side chains and then of the backbone. This takes a bit more time than the initial stage. And the final stage is to solvate this in a thin layer of water and do a very short refinement. By very short, I mean the default protocol was a few tens of picoseconds. Okay, this is nothing.
It just changes things a little bit, optimizing mainly the energetics of the interface, but it doesn't do much in terms of structure. So this is really peanuts, nothing like full-blown molecular dynamics. But if you have to do that for hundreds of models, you don't want to have to run nanoseconds or microseconds of simulation. So, flexibility in this modeling field is a challenge, because if proteins or molecules change their conformation a lot upon binding to their partner, it's very hard to predict such changes. In HADDOCK, we have different levels of flexibility. You can start the docking from ensembles of structures, so you can provide multiple conformations to the software, and that's usually the best way: if you expect large conformational changes, it might be easier to pre-sample, to generate ensembles of conformations prior to docking, rather than expect that the flexibility offered by the docking itself, in HADDOCK but also in other docking software, can model very large conformational changes. We are still very much limited in what we can do in terms of large conformational changes. And in the explicit flexible stage we allow side chains and backbone, but only at the interface, to move during the refinement process. So again, the three stages, and what appears here now are the energy functions that we use not to calculate the models, but to score them. Because again, we might have 10,000 to 100,000 models here and a few hundred models at those later stages, and at the end you need to rank those models; that's the scoring part, and the final scoring function is shown here. We have two terms based on the intermolecular energies, the electrostatic and van der Waals energies, so this is the energy between the molecules, where we use 20% of the intermolecular electrostatic energy and the full van der Waals energy.
We have an empirical desolvation energy term, which basically measures the bonus you get or the price you have to pay when removing water from the interface: if you bury hydrophobic surfaces, it's usually a bonus; if you bury charges, you have to pay a price to desolvate those charges. So this is this term, and the last term here is the experimental information that we put in: does the model fulfill the data that we had at hand? So that's how HADDOCK works. Now, you can get the software and run it locally, but it's a lot of computation, so you need quite some resources, which is why we have developed, since 2008 now, a web portal where users can submit their data. What you see here is the latest version of the portal, based on HADDOCK version 2.4. Users submit their data, and the computations run on our side, either on clusters that we have in the lab or, in most cases, sent to grid resources distributed mostly around Europe but also around the world. On paper, we have access to about 100,000 CPU cores. So this is high-throughput computing, not high-performance computing, which is maybe more the topic of this PRACE school. Of course, if you have access to a supercomputer, like the ones that PRACE offers, you can install a local version of the software and run things locally; I'm going to show you some data on that. The server, however, does many more things for you than the local version: there's a lot of pre- and post-processing done by the server which you would have to do yourself manually if you ran a local version. Now, we have a large user base, slowly reaching 25,000 registered users from all over the world, and we have served more than 370,000 docking runs since the opening of the server in 2008. More than 50% of those have run on these HTC resources.
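The scoring described above is a weighted sum of energy terms. As a sketch: the 1.0 van der Waals and 0.2 electrostatics weights are stated in the talk, while the desolvation and restraint weights below (and all the energy values) are illustrative placeholders, not official HADDOCK defaults.

```python
def haddock_like_score(e_vdw, e_elec, e_desolv, e_air,
                       w_vdw=1.0, w_elec=0.2, w_desolv=1.0, w_air=0.1):
    """Weighted sum of the terms described in the talk: full van der
    Waals, 20% of the intermolecular electrostatics, an empirical
    desolvation term, and a restraint-violation term. Lower is better."""
    return (w_vdw * e_vdw + w_elec * e_elec
            + w_desolv * e_desolv + w_air * e_air)

# rank three hypothetical models (all energies in arbitrary units)
models = {
    "model_1": haddock_like_score(-40.0, -200.0, -5.0, 10.0),
    "model_2": haddock_like_score(-25.0, -150.0, -2.0, 80.0),
    "model_3": haddock_like_score(-55.0, -300.0, -8.0, 0.0),
}
best = min(models, key=models.get)
```

Note how the restraint term rewards models that fulfill the input data: model_3, with zero restraint violation and favorable energies, ranks first.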
And if we look at the 2.4 server, probably more than 80% of all the computations have been done on high-throughput computing resources. If you are interested in getting the software, visit bonvinlab.org/software, where you will find all the information about HADDOCK. This is just an overview of the distribution of our users: we have more than 120 countries represented, with large communities in India, Europe, the US and also China. Now, we saw last year, when the lockdowns started, that a lot of people could not do experimental work, labs were closed, but they realized that you can still do computing. If you have access to computing resources, or to services like the HADDOCK server, you can still do research. What you see here is the number of docking runs that the server processes per month. Before the pandemic, we were at somewhere around 2,500 docking runs per month. Here you see the start of the lockdowns, the start of the pandemic, and a huge increase in the number of processed jobs: we almost tripled in a few months. This reflects the waves of the virus to some extent: we had a summer dip, and here was the second wave coming at the end of last year. We also started monitoring the submissions, asking users to flag their submissions as COVID-related or not, and since April 2020 we see that about one third of all submissions on average are related to COVID. So people are studying interactions of viral proteins with receptors, but also doing drug design and drug repurposing using HADDOCK. We had to scale up the backend of the portal to accommodate this increased demand: more resources were allocated at the start of the pandemic, with more sites giving us access to their resources to distribute those jobs. This is thanks to resources provided by EGI and the European Open Science Cloud, but also the Open Science Grid in the US.
So jobs are also crossing the Atlantic, and they're also running in Asia at some sites. Getting access from the server to, say, HPC resources like PRACE is much more complicated: there's quite a simple mechanism for us to distribute the jobs worldwide to HTC resources, but it would be much more complex to funnel those to HPC resources. That's something that might happen in the future, though. Some of the highlights: I already mentioned that we can go up to 20 molecules, which is not trivial; there are few software packages that can model very large assemblies. You see here one example, and we are not limited to proteins or small molecules, we can mix and match: we can have DNA, RNA, proteins, small molecules. This was work published already in 2017, but the server that goes with it only came live later. I did mention cryo-electron microscopy as this fantastic method for generating structures these days, but not all cryo-EM data reach atomistic resolution. We still have plenty of data at medium resolution, where you cannot build the structure from scratch into the density, but where you have to model the components into the density. Think also of cryo-electron tomography, when you are doing electron microscopy of whole cells: there you often don't have the samples to generate the high-resolution densities that you need to build a model from scratch. So when the resolution is too limited to build the complex from scratch, what you can do is use the information to guide the docking process, and then we are docking the components of a complex into the density. This is something that we now support in HADDOCK as well; it was published quite some time ago, but it is not part of the HADDOCK 2.4 portal.
In our efforts to move further toward larger and larger assemblies, and basically to speed up the computations, we have also been building coarse-graining into HADDOCK, implementing the Martini force field, which is a one-to-four mapping: one bead represents four heavy atoms. This is available from the server: you can transform your molecules on the server from atomistic to coarse-grained, do the modeling at the coarse-grained level, and the server will transform the complexes back to the all-atom level at the end. This is supported both for proteins and nucleic acids, but it will not work for small ligands, for example. Now, I mentioned the computational challenge: the HADDOCK server is running in high-throughput mode, not high-performance mode. But within the context of the BioExcel project, we are working on making HADDOCK efficient on HPC resources. Just to give you an idea, the server has sent more than 20 million jobs to those distributed HTC resources. This is not the number you would like to see at an HPC center: if you were to send 20 million jobs to the batch system of an HPC center, the administrators would not be very amused. That is why we are also revisiting the way we run the computations, to use HPC systems in a much more efficient way. The challenge is not only computing time, but also the amount of data and the number of files that we generate. When you run molecular dynamics, you might generate a very long trajectory file of your system, but it's one large file. When we do this modeling, we generate hundreds of thousands of small files representing all the models that we have, and this also stresses the file system. So we have to change the way we do things, and we have developed a pilot mechanism to run on HPC: instead of HADDOCK sending jobs, we send HADDOCK as one job to a full node.
Each node then basically calculates one complex. You see here a scaling plot, where this is the wall-clock time in minutes to model 100 protein complexes. If you do it on a single node, it takes on the order of 20 hours, and as you increase the number of cores, the number of nodes that you have access to, you see that this is pretty much linear scaling. There's no communication between nodes, because each node is handling one specific complex. And this is what you will need if you want to go to interactome predictions, to model all possible complexes in a given organism. Now, HADDOCK is one of the core software packages in the BioExcel Centre of Excellence, which you have probably heard about already today. As part of that, we have a forum at ask.bioexcel.eu where you can find a lot of information. If you have questions about HADDOCK or you run into problems, very likely your question has already been answered in the forum, so you can search the forum for those answers, and if you cannot find the answer, you can of course post new questions there. You can see that this is actually used a lot: quite a lot of protein-small-molecule docking, we're going to come to that today, dealing with dimers, and all kinds of other questions. Now, HADDOCK is not the only thing we are doing in my group. We are developing all kinds of software, all centered around the topic of biomolecular interactions. We have software to analyze data, like crosslinking data; software to predict interactions and to predict interfaces on protein surfaces; software and services, these are all web portals, by the way, related to predicting the affinity between molecules, because scoring in docking is not equal to predicting binding affinity, that's a different problem; and we have developed other tools, for example for fitting into a map, which again provides input for HADDOCK.
So visit wenmr.science.uu.nl if you want to learn more. Now let's move to some application examples illustrating what you can do with these kinds of techniques. We will start with the modeling of protein-protein complexes, and the first example uses NMR data to model a membrane complex. By the way, if you think that a haddock is a fish: yes, it is, but when we chose a name for our software we were thinking more of this character here, Captain Haddock, the great friend of Tintin in the European cartoon. If you know Tintin, you should know Captain Haddock. In other parts of the world people might not be aware of this, so they think of the fish, but no: when we think of HADDOCK, we think of the Captain. And in this particular work we are actually talking about iron piracy, so this representation of Haddock as a pirate is very timely, because we are trying to understand and model a complex between a bacterial receptor, which sits in the outer membrane of the bacterium, and a soluble protein of the host, which carries an iron-sulfur cluster. The bacterium, for its survival, needs to hijack iron from the host, and it does that by binding ferredoxin. Now, there is a large group of researchers and different labs represented in this list of authors: crystallographers, biochemists, membrane biochemists, and we are in there more as the modelers. The problem was that they managed to get a beautiful crystal structure of this membrane protein, which is a tour de force, in which they could actually see all the loops of the system, but they never managed to co-crystallize the complex. So there was no structure of the complex. Well, the crystal structure itself does give us information, because we know that this protein sits in a membrane and we know which part of the structure is extracellular, so the binding must occur in this region. This is also information we can use in HADDOCK.
Now, ferredoxin is a small protein (you see the iron-sulfur cluster here in yellow), so you can look at it by NMR and characterize its binding to the receptor. It is a weak binding, meaning the binding affinity is rather low, but in those cases NMR lets you map the surface, the region of the protein where the binding takes place, by measuring changes in the NMR signals: chemical shift perturbation experiments, basically. If you measure the changes in the positions of the signals and plot them, basically by how much each signal shifts in the experiment as a function of the amino acid sequence of your protein, you see that those changes are localized in specific regions of the protein. And if you map those changes onto the free structure of your protein, you see that they define a well-defined interface; if you were to rotate it, there is nothing on the backside. So this is where the action takes place, and this is typical information that HADDOCK can use to guide the modeling process. We are going to define those residues, in HADDOCK terms, as active, meaning that they should make contact with the receptor in the final models. On the receptor side, we cannot do the NMR; it is a membrane protein, it is more complicated. But we do know which loops are the extracellular loops, so in HADDOCK terms we define those as passive residues, meaning that they should ideally be at the interface, but there is no penalty if they are not. For the active residues, if those are not at the interface in the final models, an energy term is generated that penalizes the model. So this red region basically has to sample somewhere on this green region, and that is the information we give to HADDOCK. When you do that, you typically get multiple solutions, unless your data are really, really good.
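As an aside, the active/passive scheme just described boils down to ambiguous interaction restraints (AIRs): each active residue must end up near at least one residue from the other molecule's active-plus-passive selection. The sketch below emits the CNS-style restraint syntax that HADDOCK uses, but it is a simplified illustration with made-up residue numbers, not the official restraint generator:

```python
# Minimal AIR generator (illustrative): every *active* residue on molecule A
# gets one ambiguous restraint to the set of target residues on molecule B,
# with an effective upper distance bound of 2 Angstrom.

def make_airs(active_a, targets_b, segid_a="A", segid_b="B"):
    sel_b = " or ".join(f"(resid {r} and segid {segid_b})" for r in targets_b)
    lines = [
        f"assign (resid {r} and segid {segid_a}) ({sel_b}) 2.0 2.0 0.0"
        for r in active_a
    ]
    return "\n".join(lines)

# Hypothetical active residues on ferredoxin vs. extracellular-loop residues:
print(make_airs([10, 14], [101, 102, 150]))
```

Because the selection on molecule B is an "or" over many residues, the restraint is satisfied as soon as any one contact forms, which is what makes the restraint ambiguous rather than a specific pairwise distance.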
Maybe the system is so asymmetric that there is only a unique solution, but in general you get multiple solutions, and then you have to assess them. In HADDOCK we use a scoring function to rank the solutions, and you see here two clusters. The first cluster has a score of minus 138 in this case, in arbitrary units; we don't put kilocalories per mole or anything, because it is a score and not a binding affinity. It is a large, well-populated cluster, 150 members. The second cluster has only seven members, but you see that its score is very close. Actually, if you consider the standard deviations (the scores are calculated on the best four models of each cluster, so we compute the score over the same number of models for each cluster), you cannot say that one solution is significantly better than the other. So we will need to validate those solutions in some way, and this is an important part of all the modeling we do in integrative modeling: the model is not the end of the road, the model is the start of new experiments. If you were to pick residues that would allow you to distinguish those two solutions and make mutations, you might be able to answer the question of which of the two clusters is the correct solution. Now, let's move to another class of protein-protein complexes: antibody-antigen complexes. Antibodies, as probably all of you know, consist of two chains, a light chain and a heavy chain, and each chain has three loops. Those six loops combined, from the heavy and light chains, create the region where binding to the antigen takes place. Those loops are the ones that change during the maturation of antibodies in our immune system; this is where a lot of mutations take place, to adapt the antibodies to recognize a specific epitope on the antigen. So this is actually information: we know where antibodies bind their targets.
So this is information we can use in HADDOCK to guide the modeling process. This is the work of Francesco, a former PhD student in the group. We wanted to test whether we can use this kind of information to guide the modeling process, and how well that performs. And we wanted to compare the performance of HADDOCK to other software packages, all of which allow, in some way or other, information about the binding loops of antibodies to be used to guide the modeling. These are ClusPro, available as a web server, an excellent piece of software that performs very well and consistently in CAPRI; ZDOCK, also a well-known docking program; and LightDock, a more recent integrative modeling package, to which we added the capability to use information to bias both the sampling and the scoring. In ClusPro and ZDOCK, the information you have is used mainly for the scoring part, while in LightDock and HADDOCK you can also bias the sampling with it. Now, we used a set of 16 antibody-antigen complexes to optimize the protocol and compare the software. There are larger sets of antibody-antigen complexes, and recently a new set was released, but these 16 complexes have not been used for the training of any of the software we are comparing. So what are we going to test? We have three scenarios. The first scenario is the best case: we have a perfect definition of the residues that are part of the interface, on both the antigen and the antibody. This is not a real-life scenario, but it is the best information you can get short of specific contacts, just knowing which residues form the interface. The second scenario is one where we have the knowledge of the hypervariable loops on the antibody, and on the antigen we have a loose definition of the epitope.
This is, for example, what H/D exchange experiments or NMR experiments could give you. And in the third scenario, we only have the information on the loops on the antibody side and nothing on the antigen: we don't know where the binding site is on the antigen, and for that we sample the entire surface of the antigen during the docking. So, three scenarios: nothing on the antigen, a loose definition of the binding site on the antigen, and a perfect definition. And here are the results of all the docking we have done. Each column is a software package (ClusPro, HADDOCK, LightDock and ZDOCK), and each row is a scenario: no information on the binding site on the antigen, loose information, perfect information. What you can already see is that the more information we have, and this applies to all the software, the better the docking results are. If you have no information and you only know the binding loops on the antibody, you see that docking is not a simple problem; the success rate is not fantastic. So what do we see here? On the x-axis you see the success rate, defined over those 16 complexes, considering the best-scoring model (that would be Top 1), then the best 5, best 10, best 20, best 50 and best 100. The color coding tells you the quality of the models: dark green is high quality, say within one angstrom of the crystal structure; light green is medium quality, within two angstroms; and blue is acceptable quality, within four angstroms. These are RMSD measures calculated over the interfaces of the complexes. So you see that, first of all, scoring is not so simple, but neither is sampling. If you look at the Top 1 performance of all the software, with no information on the antigen side but knowing the binding loops on the antibody, the success rates are pretty poor.
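The color coding just described follows CAPRI-style interface-RMSD cut-offs. A minimal classifier under those thresholds (simplified: the full CAPRI criteria also take the fraction of native contacts and ligand RMSD into account, which are omitted here):

```python
# Classify a docking model by interface RMSD to the reference structure,
# using the cut-offs quoted in the talk (1 / 2 / 4 Angstrom).

def capri_quality(irmsd_angstrom):
    if irmsd_angstrom <= 1.0:
        return "high"        # dark green in the plots
    if irmsd_angstrom <= 2.0:
        return "medium"      # light green
    if irmsd_angstrom <= 4.0:
        return "acceptable"  # blue
    return "incorrect"

print([capri_quality(x) for x in (0.8, 1.5, 3.2, 6.0)])
# ['high', 'medium', 'acceptable', 'incorrect']
```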
6%, 0%, 6%; HADDOCK does better, with a 25% success rate, but that is still only one out of four. If you look at the top 100 models generated, which is a lot of models to consider, you see that we now reach above 50%, maybe 55%, with HADDOCK. ClusPro and ZDOCK remain around 30 to 40%, and LightDock actually has quite nice sampling, reaching almost 70% success rate, but that is one correct model somewhere among the 100 you generate. So clearly there are still challenges to be solved when you have very little information. But as soon as you have information, you see that all the software benefits from it, with HADDOCK reaching the best performance, because high-quality models start popping up. Together with ClusPro, we reach about 45% success rate for the top one model, and if you consider the top five, we are at 75%, which is the highest of all. For the top 10 models, which one can still inspect, it is 75%, and if you look at the top 100 we reach 100%. But you don't know which one of the 100 will be the correct one, so it is not that simple. Now, if the quality of your information increases further, you see that all the software does very well, although scoring still remains an issue, with HADDOCK now reaching a 100% success rate in the top one. But this is an artificial case where we have perfect information about the interface. I hope this convinces you that using information really pays off: it increases not only the number of correct models, but also their quality, and it improves the performance of the scoring, the detection of the correct model. This was all published last year in Structure, so you can look it up if you want all the details. Now to protein-small molecule docking for drug design. We have also been working for quite a few years on predicting protein-small molecule complexes using HADDOCK.
When you use the server, it will automatically generate topologies and parameters for the small molecules, so it is simple to use in that sense. We have developed various ways of doing the docking over the years, and I am going to show you the most recent, best-performing way of doing protein-small molecule docking. Some of this work was catalyzed by participating in the D3R Grand Challenge, which is a blind competition to predict protein-bound small-molecule ligand conformations. The initial way we did the docking assumed some knowledge of the binding site. If you think of small-molecule docking software like AutoDock or AutoDock Vina, you have to define a box in which the sampling takes place; you basically put a box on the binding site. It rarely happens that you do fully blind docking against a protein, and all the dedicated small-molecule docking packages ask you to define where the binding site is. In HADDOCK terms, the region of the binding site is defined as active residues, and we then define restraints between the ligand and this binding site. So this was one of the first ways of doing the docking; you can read the details. The performance in the end was not fantastic, because we realized that there are many very specific aspects to small-molecule docking. Selecting the template is very important, mainly for the receptor, the protein you are targeting, but also for the conformations of the ligand. Having learned from the problems we encountered in our initial participation, we started doing things in a much smarter way. First of all, to deal with conformational changes: if you look at your protein target, you will often find in the PDB identical proteins bound to some other ligand, so smartly selecting the protein conformation you are going to use for your screening is important.
Then we have to generate conformations of the ligand, because in all this we often start from a SMILES string, which is basically a one-dimensional string describing the chemistry of your ligand. In that particular work we used OpenEye OMEGA to generate 3D conformations of the ligand, up to 500 of them. Then we compare all of those to the ligand found in the template receptor, both in terms of chemical similarity and shape similarity, using the OpenEye shape toolkit. We want to select the ligand conformations that most resemble the conformation of the template ligand. It is not the same ligand, but a similar one that we identified in the PDB. What we do then is not docking: we superimpose the ligand onto the receptor. So it is template-based modeling of the protein-ligand interaction, and we use HADDOCK only as a refinement tool, to remove the clashes. This approach was very successful in the third and fourth rounds of D3R. Basically, of the three stages of HADDOCK, we only run the final refinement. What you see here (it is probably difficult to see) are our predictions on top of the crystal structures, which are in yellow. There are still cases, you see one here, with conformational changes; we did not capture that one, because the refinement will not introduce large conformational changes. But we did very well on this particular set of structures, which was D3R round four in this case. The most recent protocol, which has just been published in JCIM, is to use shape-restrained modeling to guide the docking process. What we do here is again identify a template structure that has a related ligand, and we have to select that smartly: we want to find a ligand that most resembles the ligand we have to dock. We then transform this template ligand into shape information.
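The conformer-selection step above can be illustrated with a toy similarity ranking. Here a plain Tanimoto coefficient on feature sets stands in for the OpenEye chemistry and shape toolkits used in the actual protocol; the conformer names and feature sets are entirely made up:

```python
# Rank candidate ligand conformers by similarity to the template ligand.
# Tanimoto on sets: |intersection| / |union|, 1.0 = identical feature sets.

def tanimoto(features_a, features_b):
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pick_most_similar(template_features, candidates):
    """candidates: dict name -> feature set; returns the best-matching name."""
    return max(candidates, key=lambda n: tanimoto(template_features, candidates[n]))

template = {"aromatic_ring", "carboxylate", "amide"}
candidates = {
    "conf_A": {"aromatic_ring", "carboxylate", "amide", "methyl"},
    "conf_B": {"hydroxyl", "methyl"},
}
print(pick_most_similar(template, candidates))  # conf_A
```

In the real protocol the same idea is applied twice, once on chemical fingerprints and once on 3D shape overlap, and the top-ranked conformers are carried forward.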
So you see here a shape: these are dummy atoms sitting in the receptor. They do not interact; they make no energetic contribution with the other molecules. We then use distance restraints in HADDOCK between the ligand we have to dock and the shape. What we effectively want is for the ligand to overlap with this shape, and this is what drives the docking. We do not have to superimpose the ligand prior to the docking, and we do not need a smart pre-selection of conformations: we can feed all the conformations to HADDOCK, and the distance restraints will guide the docking and also induce conformational changes. We can do that using just the shape information, or we can use a pharmacophore model, where we associate properties with the shape beads, such as positively charged, negatively charged, or aromatic, and use that information to guide the modeling process. If you want to know all the details, you can look at the paper. This is actually a protocol we used last year, at the start of the lockdown, to do some drug-repurposing work against coronavirus proteins. What we did was screen about 2000 approved small molecules, after some filtering on size and number of atoms, against three targets: the main protease of the virus, RdRp, which is another target, and also the host receptor ACE2, in order to possibly prevent the binding of the spike protein. I am going to show you some results for the main protease, Mpro. We used this shape-based docking protocol to do the entire screen. We could do the docking on high-throughput computing resources at the time, again using this grid computing infrastructure. We ran those 2000 docking runs in three and a half days, running all over the place. You see here a number of sites, a lot of them in Europe.
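The shape-restraint idea itself can be sketched in a few lines: the template ligand becomes a cloud of non-interacting dummy beads, and each bead is restrained to lie near the docked ligand so that the ligand ends up overlapping the shape. This is a simplified geometric illustration, not HADDOCK's actual restraint machinery:

```python
# For each shape bead, restrain it to its nearest ligand atom within
# `upper` Angstrom. Coordinates below are made-up toy values.
import math

def nearest_atom(bead, ligand_atoms):
    return min(range(len(ligand_atoms)),
               key=lambda i: math.dist(bead, ligand_atoms[i]))

def shape_restraints(beads, ligand_atoms, upper=1.0):
    """Returns (bead_index, ligand_atom_index, upper_bound) triples.
    In practice the pairing is ambiguous and resolved during docking."""
    return [(b, nearest_atom(bead, ligand_atoms), upper)
            for b, bead in enumerate(beads)]

beads = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(0.2, 0.1, 0.0), (2.8, 0.0, 0.1)]
print(shape_restraints(beads, ligand))  # [(0, 0, 1.0), (1, 1, 1.0)]
```

Because the beads carry no nonbonded energy terms, only these distance restraints pull the ligand into the shape, which is what allows conformational changes to be induced during the docking.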
But you also see the Open Science Grid here, and this site is Beijing, in China. Some of the sites gave us additional resources specifically for this kind of work. And during that particular week, the week when we did this screening, about a quarter of all the submissions to these high-throughput resources, which are on the order of 60,000 to 70,000 submissions, were COVID related. This gave us a ranking of the small molecules, and a number of those have been experimentally tested. You see here that our top two and top five molecules were confirmed to be micromolar binders of the main protease. But the problem is that once you start doing cellular assays, those compounds also turn out to be cytotoxic to cells. In the context of a large European project, a lot of this drug repurposing has been done, and actually none of the approved drugs is still in the pipeline, because of various problems. So we are able to identify compounds that bind, as demonstrated here, but generating a drug you can really use is a different challenge. With that, I want to finish. I hope to have given you an overview of what information-driven, integrative modeling can do in this field. I think a very important message, which I give every time, is that when you generate models, it is not the end of the road; the model is just the start of new hypotheses and new experiments. It is not a linear process but a circular one: you do some modeling, you do some experiments, you improve your model, you improve your experiments, and that makes you converge to the solution you want to find. In that way, integrative, information-driven modeling is very much complementary to classical structural methods. And if you look a bit into the future, with the Protein Data Bank turning 50 years old this year, I think we are going to see more and more integrative models.
You see a whole range of different experimental methods being combined with simulations and integrative modeling, not only to try to capture a single state of a protein, but to look at what I call the integrative structural biology of dynamic landscapes, because the state of a molecule will change depending on its function, where it is in the cell, and where it is in the cell cycle. So it is a dynamic landscape we are looking at. And if you build these kinds of integrative models, not so much for small-molecule docking but for, say, macromolecular assemblies, you cannot deposit them in the PDB; they are not accepted in the Protein Data Bank, but there is a repository, PDB-Dev, where you can deposit them. With that, I want to finish by thanking the group members, and this is only a subset of the members, who over the years have contributed to all the developments around HADDOCK. And thank you very much for your attention. Of course, we could not do all this work without funding from national and European agencies, with BioExcel making a huge contribution to all our developments around HADDOCK. Thank you very much. Thank you, Alexandre, this was a really nice talk. There are a couple of questions at the moment in the chat; I encourage everybody to write your questions there if you have more. The first one is: have you done a comparison between HADDOCK, GOLD and AutoDock? Yes. In the shape-based protocol that we have just published, there is a comparison. A lot of the software uses the DUD-E dataset, which is a benchmark for small-molecule docking, and until now this has been a bound docking benchmark, meaning you take the structure of a complex, take the molecules apart, and just try to dock them together again. That is, of course, a much easier problem, because you don't have conformational changes. GOLD and AutoDock and a number of others have been tested on this DUD-E benchmark, so we can compare against those.
And what we have shown in this paper (I don't have the figure right at hand, but again, look up the publication) is that our unbound performance, where the protocol starts not from the bound form of the receptor but from a template that resembles the bound form, and we start from SMILES strings, so we have to generate the conformations of the ligand from scratch, is competitive with the bound performance of those dedicated docking programs, commercial ones or free ones like GOLD and AutoDock. We don't know what the unbound performance of those programs is, and when it comes to benchmarking, I think the groups developing a piece of software should do the benchmarking themselves, rather than groups that are not expert in its use. Yes. There is a question regarding my definition of "close" when talking about ranking clusters. Okay, close. In the example I was showing, well, let's put up that slide. If you compare the scores, this one is minus 138, this one is minus 131, and the standard deviation is plus or minus 20. So this cluster must contain a member that scores better than this one. Basically, what we would call close: if the clusters overlap within their standard deviations, then you cannot state that one is significantly better than the other. In any case, when you look at docking models, you should never look only at the first solution the software gives you, and this is true for any software. You have to look at the other solutions: are they very different? And in the ideal case, you also have some data that you did not use in the modeling, which you can then use to discriminate between the clusters. There is another question: how do you deal with small molecules with missing stereochemistry data when using the database or OpenEye OMEGA?
For the benchmark we did, I think all the compounds in the DUD-E set we used have their stereochemistry defined. Of course, if you don't have that, you will have to generate multiple conformations. We never tested this, but you would have to generate the different stereoisomers, and that might become a combinatorial problem if there are multiple stereocenters; then you are in trouble. You would also have to generate topologies for each stereoisomer, because the improper dihedral definitions might differ depending on the stereochemistry. So you would have to repeat the docking for each stereoisomer. But that is not something we have ever dealt with or seen. Protonation states, I see that appearing in the chat. The protonation state is an important aspect in docking because, think about it, a single protonation change shifts the charge by a full unit, just by removing or adding one proton, and then the complex may no longer form. In most cases, the residues you have to worry about are the histidines. You can assign the protonation state in different ways. You can do it manually; on the server you can specify it explicitly. If you don't, the server makes an educated guess: we use MolProbity and Reduce to add the hydrogens and then select the state that comes out of that. This is much the same as what GROMACS does when trying to guess the protonation state of histidines. You can also run servers like PROPKA, and there are other tools to predict pKa values; if you know the pH at which the experiments or the binding take place, you can then figure out whether your histidine should be protonated or not. It is rather rare that other residues, like glutamate or aspartate, will be protonated.
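The pH-versus-pKa reasoning at the end of that answer can be written as a simple rule of thumb (illustrative only; the pKa values below are textbook model values for free side chains, and in practice tools like PROPKA predict residue-specific pKas that can shift substantially inside a binding site):

```python
# Rule of thumb: a titratable side chain is protonated when pH < pKa.
# Model-compound pKa values; real values depend on the local environment.
DEFAULT_PKA = {"ASP": 3.7, "GLU": 4.2, "HIS": 6.0, "LYS": 10.5, "ARG": 12.5}

def is_protonated(residue, ph, pka_table=DEFAULT_PKA):
    """True when the extra proton is expected to be present at this pH."""
    return ph < pka_table[residue]

# At physiological pH 7.4: histidine typically neutral, aspartate charged.
print(is_protonated("HIS", 7.4), is_protonated("ASP", 7.4))  # False False
```

This also shows why histidine is the problem case: with a pKa near 6, it sits close to physiological pH, so small environmental shifts flip its state, while aspartate and glutamate are almost always deprotonated.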
But yes, this is something you need to think about, especially if there are charged residues in your binding site. And non-natural amino acids: we do have support for those, and I can show you that in the demo slash tutorial. We have a library of modified amino acids that we support: phosphorylated ones, acetylated ones, methylated ones; you can find the list on the server online. Those supported on the server are also supported by the local version of the software. To use them in HADDOCK, you don't have to generate the missing atoms yourself, because HADDOCK will generate them for you, but you should change the residue names to the nomenclature used by HADDOCK, so that HADDOCK recognizes them as modified amino acids. Sometimes users come with questions: can you implement this particular modification? If it is simple to do, we do it; if it is complicated, then it is usually not possible. The thing with non-natural amino acids is that you could also define them as heteroatoms for the modeling, but then they would not be connected to the rest of the peptide, which is not a very good way of doing things. There are no more questions at the moment, so I suggest we take a break now until 3:30, when we will have the demonstration session. If you have other questions, we can take them during the demonstration. I have just one comment before you all run to grab coffee or whatever you want to drink. We have one hour for the demonstration, during which I am going to guide you through the web portal. In case you want to follow along yourself, it might be hard given the limited time, but if you want to play with the portal on your own, you should try to register for access during the coffee break. I am pasting the link in the chat; if you register, you will get Easy-level access to the server.
For some of what we are going to do today, you would need Guru-level access; you can request Guru access once you are registered. But again, it is not really necessary for the time we have today; you can also simply follow what I am demonstrating, and we have a well-documented tutorial online that you can follow on your own later. Okay. Thanks. Okay, so we will meet again at 3:30. Very good.