Hello everyone, my name is Alexandre Bonvin, I'm a professor of computational structural biology at Utrecht University, at the Bijvoet Centre for Biomolecular Research. My group is developing the HADDOCK software. We are partners in BioExcel and I'm one of the organizers of this BioExcel summer school. What you are seeing in the background of my first slide is actually a view from Pula in Sardinia, where this summer school was supposed to run this year, but because of the current pandemic we have moved to an online school. So you will be missing the wonderful beaches of Pula and the wonderful landscape of Sardinia, but you will not be missing fantastic science, of course, and all the explanations that you're going to get.

This first lecture on my side is about integrative modeling of biomolecular complexes. This is the topic for my lecture, which will consist of two parts, so this is the topic for the two parts together. I will give you a general introduction on the topic of docking and integrative modeling. Then I will move into the specifics of how we are doing integrative modeling in HADDOCK: I will describe how HADDOCK works and what kind of information we can use. Then I will move on with a number of examples of complexes. The first one will be about modeling antibody-antigen complexes. Then I will move into using mass spectrometry data to guide the modeling process, and cryo-electron microscopy data. Those all deal with rather large macromolecular complexes, but HADDOCK can also handle small-molecule docking, and I'm going to show you some of our recent results and adventures in protein-ligand, small-molecule docking. And I will finish with a topic which is relevant for assessing the information content of all kinds of distance information that you might have: information which is valuable to guide the docking, but where it is also important to try to realize how much information is really in those distances and what possible false positive data they may contain.

So we're speaking about proteins, we're speaking about interactions between proteins, and this first picture shows you interactions at the macroscopic level, at the human level, in the center of Utrecht in summer, where we have a very nice canal, the old Rhine, with all kinds of terraces and restaurants, and you see a lot of people interacting there. It's clearly not the current situation. All these interactions between humans are of course very important, but everything in life is regulated by interactions between molecules at the cellular, microscopic level. So in order to understand, say, the social network of proteins, we need to take the step towards modeling the structures of those interactions, which brings me to the concept of the interactome. What you are seeing here is an interactome: you see a lot of dots and a lot of lines, where the dots represent proteins and the lines represent interactions between them. These kinds of pictures are typically obtained by high-throughput experimental methods. They suffer from a lot of false positive, but also a lot of false negative data. And these networks are not static: they are going to change depending on where you are in the cell cycle, where you are in the cell (localization is another factor), time is a factor, and post-translational modifications are going to change this network.
And if you want to understand how this network works, you need not only to add a structural dimension to the dots in the network, but you also need to be able to model the interactions in the network, basically the lines, which means the complexes in this protein interaction cosmos. Now, if we consider the structural biology of interactions, of complexes, you see here a picture where we have the experimental methods on the right side and the more computational methods on the left side. So you have the classical structural biology methods like cryo-electron microscopy, very popular these days and very powerful, X-ray crystallography, the classical structural biology technique for many years, and nuclear magnetic resonance. All three methods can deliver a structure at atomic resolution, but they don't always do that. There are always limitations to what you can achieve, which is why we also need modeling in this kind of context. Now, next to the classical structural biology methods, there is a lot of other experimental information available which gives you some kind of information about the complexes, but not the full structure. So they are basically providing you pieces of the puzzle. And those pieces are very useful because, together with some kind of modeling software, they can allow you to still generate valuable models of those interactions, which brings me to the left part of the slide, where you find a number of modeling techniques like molecular dynamics (you're going to hear a lot about molecular dynamics in this course), homology modeling (you can try to model complexes by homology, provided there is a homologous complex already solved in the PDB), and of course docking, which is at the core of my lecture.

So what is docking in a nutshell? Given the structures of the components of a complex, in this case two proteins, can we try to sample the space between those? This means rotating the molecules, translating them, and figuring out what is the best assembly mode of those two molecules. So what is important in defining the best assembly? They should fit into each other, so there is a shape complementarity which is important here. And next to that, you can also add all kinds of physico-chemical properties of the system, like the electrostatic energy between the molecules or the van der Waals interactions between the molecules. So you have to sample space and you have to measure the quality of the fit. The first docking programs were only using geometric complementarity as a way of measuring the quality of the fit, and the first docking programs were actually designed by Joël Janin and Shoshana Wodak in the late 70s, early 80s. So the field is about 40 years old, if you think of docking in this context.

Now, I already mentioned that we need to sample some kind of interaction space. This is what you see here on the x-axis, the conformational landscape of the system. On the y-axis, you have the interaction energy, some kind of measure of the quality of the interaction between the molecules. The sampling process means populating this large space. And then you need a scoring stage where you are going to measure the quality of all the models that you have generated, with the hope that you identify the one which is in the global minimum. Now, scoring functions are not perfect, so this is not always so straightforward. Now, if you have external information, data, which could be experimental data or bioinformatic data, you can decide to use this data in two ways.
To bias the sampling, and/or to bias the scoring to distinguish good models from bad models. What you see here on the left side is the case where you don't have data and you have to search the entire interaction space between the molecules, so you're going to generate many more models to try to identify the global minimum. But if you have data, you can concentrate the search, bias the search towards some regions of space, which hopefully are pointing you to the right solution much more quickly. So this has advantages, since you don't need to sample the entire space, but it also has dangers, because if the information that you are using is bad or has wrong data in it, you might search in a completely wrong region of space. That's what is called the GIGO principle: garbage in, garbage out. It's even documented in Wikipedia. So you have to trust the information that you have if you're going to use an information-driven search approach.

Now, when we speak of integrative modeling in the field, we are referring to the use of not only one source of data, but multiple sources of information, which used together allow you to solve large, complex assembly problems. And this is just an illustration of some of the data that are useful in this context. So when you cannot solve the structure of your protein by classical structural biology methods, you might get information, for example, from FRET or EPR, where you have to attach probes to your protein, but then you can measure some kind of distances. You can do experiments like H/D exchange, where you dissolve your protein in D2O and look at which exchangeable protons are replaced by deuterons. And if you do that for the free protein and in the complex, you can see which regions are protected, and this is going to give you information about the interface. In a similar way, NMR can give you information about the binding regions through titration experiments, where you record a spectrum of a protein which has been labeled, which you see by NMR, you titrate in a second molecule which you don't see, and you look at which signals are affected by the binding. In this example, signal B is affected, which we interpret as meaning that it must be involved in the interaction. Now, you can also do cross-linking experiments these days, detect the cross-links by mass spectrometry, and get information about possible distances between specific residues in your system. Those are not very accurate, though. You also have shape information, which could be high resolution from cryo-EM, or low resolution, coming for example from small-angle X-ray scattering. But even simple mutagenesis, mutations in the sequence which, together with a functional assay, tell you that the complex is no longer formed when the mutation is present, is information that in principle you can use in the modeling as well. And if you don't have any experimental data, you might still have sequence data and structural data, and from those you can actually predict binding interfaces; these days you can also use co-evolution as a way of even predicting contacts between amino acids.

So docking in general, and integrative modeling in particular, are going to give you 3D models of complexes. They have their limitations, but models are already useful because they might provide you insight into the function and mechanism of action of those complexes.
And if you have this insight, you can try to design experiments and test whether the hypotheses that you have generated from the model are valid or not. So modeling on its own, without any kind of experimental validation, is nice, but I think if you can do experiments based on the models that you generate, you have a much better story to tell. Models can also help you to understand the effect of disease-related mutations in the context of interactions. And they are also the starting point for drug design, for example to prevent binding, or to restore binding when things do not work properly. So here are a number of reviews that we have been writing on the topic, with the most recent one from this year about integrative modeling of complexes. We have also focused on membrane protein modeling in one case, and a recent one is about using coarse-graining and hybrid approaches to model large complexes. I also want to point you to the special issues of Proteins dedicated to the CAPRI experiment, which stands for Critical Assessment of PRedicted Interactions. This is published every two to three years and the most recent issue was earlier this year; there you can basically see how the field has been evolving over the years, which methods are working well or less well, and what the innovative approaches are. So it gives you a good representation of the state of the field, and this is the result of a completely blind prediction experiment when it comes to complexes.

After this general introduction on the topic, I want to move more into the specifics of HADDOCK and how we are going to use information in HADDOCK to do the modeling of complexes. So HADDOCK is an integrative modeling platform. It can incorporate all kinds of experimental data as restraints, which are going to be used to really guide the modeling process. Everything started in HADDOCK about 20 years ago with NMR titration data, but we can now incorporate a lot of different kinds of information. The most recent addition was actually the use of cryo-electron microscopy data. Pretty much from the start of HADDOCK, we were able to dock more than two molecules at the same time; this was because of CAPRI, actually. For a long time we had a limit of up to six molecules, and in the latest version of HADDOCK, which is version 2.4, we can handle up to 20 molecules. Of course, you have to realize that it only makes sense to model a large number of molecules simultaneously if you have good data to drive the modeling process. Otherwise, the complexity of the search that you have to do and the complexity of the scoring are too high, so it becomes a waste of time to try to model, say, a complex of 10 components if you don't have enough information for it. One type of information which I did not mention yet is symmetry. If you know that you are dealing with a symmetrical system, this is also information that we can leverage in HADDOCK: we can define restraints to impose symmetry, and this very much helps the convergence of the computations. HADDOCK also, from the start, allowed for flexibility at the interface. So we're not just going to do a rigid-body docking, but we're going to refine the solutions that we generate by introducing some flexibility at the interface between the molecules. And it has been consistently performing well in CAPRI over the years. You can find more information about HADDOCK on our group's software page.
So the core of the information which is given to HADDOCK, the way of transforming all kinds of rather fuzzy information into restraints for HADDOCK, is to define those as ambiguous interaction restraints. This is the key concept in what we are doing, so I'm going to spend a little bit of time explaining it. Assume you have two molecules, A and B, and for those molecules you have identified, from some kind of experimental data, a number of residues that are important for the interaction. These will be the red residues here. So this is defining an interface on both molecules, and you know that those interfaces have to come together, but what you don't know from this kind of information is in which orientation. In principle, it could be any orientation. So we want to define some kind of distance restraints that will force the interfaces to come together in space without pre-defining in which orientation this is going to happen. And the way to do that is to use this concept of ambiguous interaction restraints.

So if you look at the residues here, the red ones are the experimental ones. The green ones are the surface neighbors of the experimental ones, because experimentally you will never detect or map exactly the full interface. You're always missing a bit of information, and we find it better to be too permissive in the definition of the interface than too strict, which is why we often automatically select all the surface neighbors of the residues that we have experimentally identified as being part of the interface as well. The surface neighbors are defined as passive residues, and the ones that you know should really be at the interface are defined as active. Active residues must be at the interface, otherwise an energy penalty will be generated. The passive residues, the green ones here, can be at the interface, but if they are not, there is no penalty.

Now, how do we define a distance restraint between multiple points, because this is effectively what we have here? We're going to calculate all pairwise distances between all atoms of one residue, this one residue on protein A, and all atoms of all the active and passive residues on the other side. Let's assume that an amino acid has on average 10 atoms. If you have one residue here and you have 10 residues on the other side, you're going to calculate 1,000 individual distances, all atom-atom combinations. What do you do with these 1,000 distances? You sum them as one over the distance to the sixth power, and then you take the inverse sixth root of this sum. This gives you an effective distance. This distance has the property that it will always be shorter than the shortest distance that enters the sum, and this distance is what we use in a restraining energy function to bring the interfaces together. We use for that a potential which is not purely harmonic: it is only harmonic for a small region, say up to one or two angstroms of violation of your upper limit. So you have an upper limit and a lower limit, the energy is zero between those, then it goes up harmonically, there is a transition region, and then it becomes linear. This function was designed for NMR structure calculation in the early 90s by Michael Nilges and Axel Brünger. It's very robust, because while the energy still goes up here, the force has become constant, and this avoids your simulations blowing up because of too high forces. And this soft-square restraining function is basically what we are using.
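To make this concrete, here is a minimal Python sketch of the two ingredients just described: the 1/r^6-summed effective distance over all atom-atom pairs, and a flat-bottom, soft-square restraining potential that becomes linear beyond a switch point so the force stays constant. This is only an illustration of the idea, not HADDOCK's actual implementation; the force constant, bounds and switch width are placeholder assumptions.

```python
import numpy as np

def effective_distance(coords_a, coords_b):
    """Effective distance over all atom-atom pairs between two selections,
    computed as the inverse sixth root of the sum of d**-6; it is always
    shorter than the shortest individual distance entering the sum."""
    diff = coords_a[:, None, :] - coords_b[None, :, :]   # all pairwise vectors
    d = np.linalg.norm(diff, axis=-1)                    # all pairwise distances
    return np.sum(d ** -6.0) ** (-1.0 / 6.0)

def soft_square_energy(d_eff, lower=0.0, upper=2.0, switch=1.0, k=50.0):
    """Flat-bottom restraint energy: zero between the bounds, harmonic up to
    `switch` angstroms of violation, then linear so the force stays constant."""
    if lower <= d_eff <= upper:
        return 0.0
    viol = lower - d_eff if d_eff < lower else d_eff - upper
    if viol <= switch:
        return k * viol ** 2
    # continuous energy and constant force beyond the switch point
    return k * switch ** 2 + 2.0 * k * switch * (viol - switch)
```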
So we have a number of those distance restraints: each active residue will have such a distance restraint with this distance-restraining function, which brings the interfaces together without pre-defining how they are arranged. Since there is typically also, say, bad data (you never have perfect data unless you are doing some artificial test), by default what we do in HADDOCK is to randomly remove 50% of the information for each docking trial that we are doing. So each model will be generated from a different random 50% of the data. This is a way of removing bad data: by removing the bad data, you hope to get correct models out, but from time to time you are also going to remove good data and you are going to get a wrong model. So you also need a scoring function which is robust enough in this case.

So how do we do the interaction search in HADDOCK, how do we do the docking effectively? We have this experimental information, which we encode in these ambiguous interaction restraints, and we might have cryo-EM restraints, dihedral angle restraints, residual dipolar couplings; we have this experimental energy term that you see at the bottom, and we combine that with a classical force field as is typically used in molecular dynamics, where we describe bonds between atoms, angles, rotations around bonds, and the non-bonded interactions, the van der Waals and electrostatic interactions. And these, of course, are very important to guide the interactions between the molecules. Now, the search is not a systematic search of all possible solutions for assembling the two molecules together; we do an energy-driven search, where we use the forces, the derivatives of the energy, to guide the search. We start with energy minimization and then we go through refinement stages using molecular dynamics simulation.

The entire docking protocol of HADDOCK consists of three stages. In the first stage, we consider the molecules as completely rigid and we do an energy minimization driven by the data that we put into the system. We take the models out of this first stage, or rather a fraction of the models out of this first stage, and we refine them using a simulated annealing protocol based on molecular dynamics, where we optimize the interface. Those models are then passed into the final refinement stage of HADDOCK, which might use an explicit solvent shell or not, depending on the protocol. So let's have a bit more of a detailed look. The starting point for the rigid-body energy minimization is to take the two molecules, if you have a dimeric system; if you have more than two molecules, you have to spread the molecules in space (the current maximum is 20). So we separate the molecules in space, we randomly rotate them so that there is no specific bias in the starting orientation, and then we start a rigid-body energy minimization. By rigid-body energy minimization, I mean that there are basically six degrees of freedom per molecule: they can rotate and they can translate. And the minimization is guided by the restraints that we put in. But if you have no information at all, HADDOCK also has an ab initio mode, and then we define distance restraints between the centers of mass of the proteins. Now, this is a rather fast step, so we sample here on the order of 10,000 to 100,000 different models, and we only write a fraction of those to disk.
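As a small illustration of the random restraint removal mentioned above, here is a toy sketch (not HADDOCK code) of drawing a different random 50% subset of the ambiguous restraints for every docking trial.

```python
import random

def restraint_subset(restraints, keep_fraction=0.5, seed=None):
    """Return a random subset of the ambiguous interaction restraints.
    With the default keep_fraction of 0.5, half of the restraints are
    discarded for this trial, so every model is driven by different data."""
    rng = random.Random(seed)
    n_keep = max(1, round(keep_fraction * len(restraints)))
    return rng.sample(restraints, n_keep)

# illustrative usage: a different random subset for each of 1000 trials
# subsets = [restraint_subset(airs, seed=trial) for trial in range(1000)]
```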
And then we take the best models; actually you see here the docking, this rigid-body minimization, proceeding. You see the molecules rotating at the beginning, then you have the translation, and you see that the complex is formed. The regions here that you see represented as spheres are the ones that were identified by NMR, in this particular example, as being part of the interface. Now we take a fraction of the models that have been generated. So we have a first scoring at the rigid-body stage, we take 10%, 20% of those models, and now we refine those interfaces by doing a semi-flexible simulated annealing protocol. In the first stage, the molecules are now moving using molecular dynamics simulations; actually it is molecular dynamics in torsion angle space. So first they move as rigid bodies, then we allow for flexibility of the side chains at the interface to optimize the interactions, and then we add flexibility of both the side chains and the backbone at the interface. By using torsion angle dynamics, it's very easy to freeze regions of your protein while leaving other regions flexible, and the molecules can still freely move in space. So it's different from using positional restraints, where the molecules are fixed in space; here you want the molecules to still be able to move with respect to each other. Typically, at this stage we refine a few hundred models.

Now, the flexibility that we introduce here is not going to do miracles. We are rather limited in terms of conformational changes, and how much conformational change you can induce very much depends on the amount of information that you have at the start to guide the modeling process. Typically, with, say, only interface information, you can expect maybe one to two angstroms; if you have much more, say high-quality, data, we have seen cases where we can model conformational changes of up to five angstroms, but that is rather rare. So we are limited in terms of the conformational changes that we can model here. Now, this is how it looks. First you see the rigid-body stage, then you see that the side chains are optimized, so you see side chains moving while the backbone is still rigid, and in the final stage of the refinement the backbone becomes flexible and you see that there are indeed small conformational changes. So here is the movie once more. If you focus on this particular loop, if you follow my mouse, you will see that it flips over at some point. So this refinement induces small, basically induced-fit, conformational changes.

After this stage, we typically take the models that we have and we refine them in explicit solvent by building a solvent shell around them. There are no periodic boundary conditions; it's simple, and it's a very short molecular dynamics run, so nothing spectacular happens here, but it does improve the energetics. Actually, in the latest version, HADDOCK 2.4, we don't use explicit solvent anymore, because benchmarking showed that it is basically not adding much, if anything, to the quality of the models, only CPU time. So the default now is to not even use water explicitly, and we just end with a final minimization, but you still have the option to explicitly add water to the system. And this is how this refinement looks: you have a shell of water, and in terms of length this is only a few picoseconds to a few tens of picoseconds at most, so really nothing spectacular.
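Putting the three stages together, here is a compact outline of the protocol as described above, written as a small Python data structure purely for reference; the stage descriptions and model counts are the rough figures quoted in the lecture, not fixed defaults.

```python
# Rough outline of the HADDOCK docking protocol as described in this lecture
# (illustrative summary only; numbers are approximate).
HADDOCK_STAGES = [
    {"stage": "rigid-body docking",
     "moves": "random rotations/translations + energy minimization (6 dof per molecule)",
     "driven_by": "ambiguous interaction restraints (or center-of-mass restraints ab initio)",
     "models": "many thousands sampled, only a fraction written to disk"},
    {"stage": "semi-flexible refinement",
     "moves": "simulated annealing MD in torsion angle space: "
              "rigid -> flexible interface side chains -> flexible side chains + backbone",
     "models": "a few hundred of the best rigid-body models"},
    {"stage": "final refinement",
     "moves": "short explicit-solvent MD, or (HADDOCK 2.4 default) a final energy minimization",
     "models": "same models, re-scored"},
]
```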
So, in terms of flexibility, because flexibility is one of the challenges in this field: how can we handle conformational changes? There are different levels. First of all, we can deal implicitly with flexibility by starting the docking not from a single conformation but from an ensemble of conformations that have been pre-sampled. For example, you could take snapshots from a molecular dynamics simulation, you could take different conformations from an NMR ensemble of structures, or you could use normal-mode types of analysis to generate different conformations. Further, we also scale down the interactions between the molecules during the simulated annealing to allow some kind of overlap. And of course, we also have the explicit level in HADDOCK, where side chains become flexible, the backbone becomes flexible, and you have this real induced-fit refinement of the interface.

Now, how do we score all those models? Because scoring is the other stage that you have to consider in this kind of modeling. We have a scoring function which is very simple and which has survived almost 20 years of optimization. It might not be perfect, but it does the job and it is robust, because it can handle different types of molecules, and that's one of the important things. I think an important message in science in general is: keep things simple, and only go to the next level of complexity if you need it. In terms of non-bonded parameters of the force field, we use OPLS, a united-atom force field. We use a rather short cutoff to limit the computational time, so that's not what you would use if you were to do molecular dynamics. We typically scale down the electrostatic component, the Coulomb potential, by a factor of 10. And at the end, we cluster the solutions; that's an important part. We basically group together the solutions that are similar to each other, and our scoring is on a cluster basis: we score the top four members of each cluster. You see the scoring function here at the different stages. This is basically the final scoring function at the end of the entire protocol. We have experimental terms, which will be the ambiguous interaction restraints in this case, but you might have cryo-EM terms or other terms; the van der Waals intermolecular energy between the molecules; 20% of the electrostatic energy between the molecules; and an empirical desolvation energy term, which is derived from the parameters of Fernández-Recio. This is basically telling you that if you bury hydrophobic surfaces at the interface, it's good, so this is a bonus; if you start burying charges, usually you pay a price for it. And that's the function that we are using in most cases, with small modifications sometimes for small-molecule docking. In the initial stage, you see that the weights are different, so we have full electrostatics here; but remember that at that stage epsilon is 10, while in the water refinement epsilon is 1. And we also use the buried surface area, the amount of surface buried between the molecules, as part of the scoring function.
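As a rough illustration of this weighted scoring, here is a small Python sketch. Only the 0.2 weight on the electrostatics corresponds to a number quoted above; all other weights are placeholder assumptions (in practice they differ per stage), and the cluster scoring simply averages the top four members as described.

```python
def haddock_like_score(e_air, e_vdw, e_elec, e_desolv, bsa=0.0,
                       w_air=1.0, w_vdw=1.0, w_elec=0.2,
                       w_desolv=1.0, w_bsa=0.0):
    """Weighted sum of the energy terms discussed in the lecture.
    Lower is better; the buried surface area enters with a negative sign
    (a bonus) when its weight is non-zero, e.g. at the rigid-body stage."""
    return (w_air * e_air + w_vdw * e_vdw + w_elec * e_elec
            + w_desolv * e_desolv - w_bsa * bsa)

def cluster_score(member_scores, top_n=4):
    """Cluster-based ranking: average the scores of the best top_n
    members of each cluster (the lecture mentions the top four)."""
    best = sorted(member_scores)[:top_n]
    return sum(best) / len(best)
```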
So here is a first example of how you can use HADDOCK with some kind of information to guide the modeling: NMR-based modeling, in this case, of the FusA-ferredoxin complex. It has to do with iron piracy, because FusA is a protein in the membrane of a bacterium, and the bacterium needs iron for its survival; it gets it by hijacking the iron from its host. And why piracy? Well, piracy because of Haddock. And if you were thinking of the fish: no, here we refer to Haddock as the cartoon figure that you see here, Captain Haddock, the big friend of Tintin in all his adventures. So you see Haddock here as a pirate, going after iron.

So what is the information that we have at hand in this particular example, just to illustrate how we can use HADDOCK and the information for this? We have here a protein for which we have the crystal structure. Actually, there is a large group of people who contributed to this work; we were basically only involved in the modeling. For a long time they had been trying to crystallize the complex, but they never managed to. They only managed to get the crystal structure of the membrane protein, which is already quite an accomplishment, because membrane protein structures are not easy to get. So there was no information here about the binding site in principle, but there is still information, because we know how this protein sits in the membrane and we know which part is in the extracellular environment. So if the bacterium is going to hijack iron from its host by binding to ferredoxin, a small protein that contains iron, it must happen in the extracellular region. So we know that the binding site must be in this region of the structure and not here at the bottom, and also obviously not in the membrane. That's information that in principle we can use in HADDOCK.

Now for the ferredoxin part, which is the soluble part of this complex: ferredoxin is a small protein which has an iron-sulfur cluster. And because it was small, it was studied by NMR, and by NMR they could map the binding site. What you see here is the amino acid sequence of this protein, and on the y-axis you are basically measuring the displacement of the signals when you titrate FusA into the solution. You see that there are regions along the sequence that are affected by the binding, and if you map those regions onto the 3D structure of the protein, they define a well-defined binding patch, shown here in red on the surface. So that's the typical information that HADDOCK can use, and we're going to use this red region here to define active residues. As I was explaining before, those active residues have to be at the interface in the models that we are going to generate. Now, for FusA we have the knowledge of the membrane, we have the knowledge of the extracellular loops, so we know where the binding can take place, and for this reason we define these extracellular loops as passive residues in HADDOCK, meaning that they can be at the interface, but if they are not, you're not paying a price, so it's not going to penalize you basically.
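Just to illustrate how such fuzzy data might be turned into residue selections, here is a toy Python sketch; the 0.1 ppm cutoff and the function name are illustrative assumptions, not the procedure used in the actual study.

```python
def define_active_passive(csp_per_residue, surface_residues,
                          accessible_loop_residues, csp_cutoff=0.1):
    """Toy selection of HADDOCK-style restraint residues:
    - active: surface residues whose NMR chemical shift perturbation (CSP)
      exceeds a cutoff (the ferredoxin side in this example);
    - passive: residues merely allowed at the interface, here the
      solvent-accessible extracellular loop residues of the partner."""
    active = sorted(res for res, csp in csp_per_residue.items()
                    if csp >= csp_cutoff and res in surface_residues)
    passive = sorted(accessible_loop_residues)
    return active, passive
```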
So we have active residues for ferredoxin and passive residues for FusA; you give that to HADDOCK, and this is an example of what you get out. In this case I'm showing you two clusters. Cluster number one: the numbering of the clusters in HADDOCK relates only to their size, so that's the most populated cluster, but it doesn't mean that cluster one is always the top one in terms of scoring; in this case it is. If you compare the scores, this one is minus 137. We don't put units on our scores, so these are arbitrary numbers; we don't want people to think that this is a binding affinity, for example, because we have demonstrated that there is no correlation between docking scores and binding affinities. If you look at the second cluster, cluster four, it's minus 130, so it's scoring a bit worse, but if you look at the standard deviations, they are clearly overlapping. So in a real-case scenario here, you could not really say that cluster one is better than cluster four. You would need to do experiments to actually validate which one of those two solutions is the correct one, and based on the models you can try to propose mutations. The mutagenesis had not been done when the publication came out, but this is something the experimental people were trying to get going, so that we could actually decide which one of those solutions was the best one.

Now, most of our users are actually using HADDOCK through the web portal, and what you are seeing here is the new 2.4 version of our portal, which you're also going to use in the tutorial part. We have more than 17,000 users that have registered since the portal went online in 2008. We are able to provide this kind of web portal and the computational power behind it because we have access to European Open Science Cloud high-throughput computing resources, and 50% of all the computations that we have done since 2008 have been running on those distributed resources, which are mainly distributed around Europe but also in Asia and in the US. HADDOCK is one of the core software packages in the BioExcel Centre of Excellence, and this entire summer school is organized by BioExcel. Within BioExcel we have a HADDOCK forum where you can find a lot of information about different topics in HADDOCK, so if you have questions about something that you would like to know how to do, or something is not working, you can first of all search the forum, because there is a search engine here, to see if your question has already been answered. If it has not been answered, you can ask new questions here and you will get answers. Next to HADDOCK, which is our core software, we also operate a number of other services. Some of them have to do with the analysis of complexes, like binding affinity prediction, hotspot prediction and bioinformatic interface predictions, and there is a new one to manipulate PDB files. You can find all those portals at wenmr.science.uu.nl.

And now comes the last topic for this first part of the lecture, where I'm going to show you how we can use HADDOCK to model antibody-antigen complexes. As you probably all know, an antibody structure consists of two chains, a heavy chain and a light chain, and the binding site of those antibodies is in these regions here, where you have the hypervariable loops, the VL and VH regions. This region consists of six loops: three loops come from the light chain and three loops come from the heavy chain; they are numbered L1 to L3 and H1 to H3, and this is where antibodies bind their targets. The loops themselves might change, so the sequence might change, but this is knowledge: we don't need to do experiments to know that the binding is going to take place in this region, and that's information that HADDOCK can use in this case. So, in order to benchmark this kind of modeling, you always need to define a set of complexes for which you know the answer: you have the crystal structure of the complex, and you also have the starting points for the computation, which should be basically the free forms of those proteins. We used for that 16 complexes from docking benchmark 5, which is a docking benchmark typically used in the field. Why 16?
Because these were the new entries in version 5 of the benchmark, and those complexes had not been seen by the other docking programs. We wanted to compare different approaches to model those complexes, and we selected software that allows, in some way, the use of information to guide or filter the docking solutions. The programs in this case are ClusPro, an excellent docking server which has been doing very well in CAPRI for many years; ZDOCK, also a famous docking program (both ClusPro and ZDOCK are rigid-body docking programs); and then two programs, LightDock and HADDOCK, that allow some kind of flexibility. All have options to either bias the sampling or filter the docking solutions. By the way, this work is from Francesco Ambrosetti, whom you see here on the top left, a former PhD student in my group.

So how are we going to use the information? On the antibody side, it's pretty clear that we have these hypervariable loops, which we can define in HADDOCK as being active, basically. For the antigen, we tested three different scenarios. In the first one, we assume we have no knowledge of the binding site and we target the entire surface, so we define the solvent-accessible surface residues as passive in the context of HADDOCK. In the second scenario, we assume that we have some kind of loose definition of the epitope on the antigen side, by selecting all the residues that are within nine angstroms of the antibody, based on the crystal structure of the complex. And the last scenario is basically a perfect-case scenario, where we know the true binding site, defined at 4.5 angstroms.

Now, how do we measure the quality of the docking results in general? A little bit of introduction: these are the criteria that CAPRI is using. The first one is the fraction of native contacts that we reproduce: you look at the contacts in the crystal structure and ask which fraction of those contacts is present in your model. That's the fraction of native contacts. And then we have two measures based more on fitting the structures. One is the ligand RMSD, where you fit on the largest molecule, the receptor, the antibody in this case, and you measure the differences on the ligand; that's the ligand RMSD. For the other measure, you fit on the interface of both molecules, and that's the interface RMSD. To get an acceptable model, you need at least 10% of the contacts to be correct and an interface RMSD of 4 angstroms or better. You have a medium-quality model if you are below 2 angstroms interface RMSD with 30% of the contacts, and you have a high-quality structure if you are below 1 angstrom with more than 50% of the contacts.
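Here is a small Python sketch of those quality classes, using only the fraction-of-native-contacts and interface-RMSD thresholds quoted above; note that the official CAPRI criteria also take the ligand RMSD into account, so this is a simplification.

```python
def capri_quality(fnat, irmsd):
    """Classify a docking model from the fraction of native contacts (fnat)
    and the interface RMSD in angstroms, using the thresholds from the lecture."""
    if fnat >= 0.5 and irmsd <= 1.0:
        return "high"
    if fnat >= 0.3 and irmsd <= 2.0:
        return "medium"
    if fnat >= 0.1 and irmsd <= 4.0:
        return "acceptable"
    return "incorrect"
```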
So this is already the result of the docking, and you see a lot of data summarized in one figure. Each column corresponds to a docking program, and each row corresponds to a scenario. The top row is hypervariable loops against all surface residues, the middle row is hypervariable loops with the epitope defined at 9 angstroms, so a loose definition, and the last one is the true interface, basically the perfect scenario. You can see directly that in all scenarios HADDOCK is doing as one of the best. You can also see that in all scenarios HADDOCK is generating more medium- to high-quality structures than the other programs. ClusPro, for example, is doing quite well in the ab initio mode, when you have little information, but HADDOCK is winning here. Once you have data, you can see that ClusPro is also doing very well in this case. You see that this is the top 10. What you see on the x-axis is the success rate if you consider the top 1 model, the top 5, the top 10, the top 20, the top 50 or the top 100 models. So once you have data, clearly ClusPro and ZDOCK are also doing very well. LightDock is also doing very well; in the ab initio mode it has the highest performance for the top 100, but it is not able to score those models very well, because it is not doing very well in the top 1 to top 10 compared to the others. Now, the more information you have, the better all the programs are doing, but you see that HADDOCK starts generating more and more high-quality models, so you get higher-resolution models out of the modeling the more data you have. And this comes in part from the flexible refinement which happens in HADDOCK and which is not present in the other programs, so we are doing very well here.

Now let's look at the cluster-based ranking (those were single-structure-based rankings). If you have little information, the cluster-based performance is not very good: we reach 19% success for the top 1 and 25% for the top 5. If you compare that to the single-model performance, the top 5 to top 10 is at a 31% success rate, so 31% of the complexes have a model of acceptable or better quality in the top 10, while with cluster-based ranking this is only around 19 to 25%. If you have good data, you see that the scoring performance goes up if you do cluster-based scoring: we have 100% success for the top 1 if you have the perfect data, and we reach over 50% for the top 1 with a loose definition of the interface, and this is more than when looking at single-structure scoring, which is below 50%. So it makes sense to score on a cluster basis when you have reasonable data to guide the modeling process; if you don't have any data, better stick to single-structure scoring.
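For reference, the success rates quoted here can be computed as in this small sketch (illustrative code, assuming each complex comes with a ranked list of model qualities such as those returned by the capri_quality function above).

```python
def success_rate(ranked_qualities_per_complex, top_n=10):
    """Fraction of complexes that have at least one model of acceptable or
    better quality among their top_n ranked models or clusters."""
    good = {"acceptable", "medium", "high"}
    hits = sum(1 for qualities in ranked_qualities_per_complex
               if any(q in good for q in qualities[:top_n]))
    return hits / len(ranked_qualities_per_complex)
```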
Now, one of the most challenging loops to model in antibodies is the H3 loop, because it is also the longest. So the question we want to ask here is whether our flexible refinement protocol improves the quality of this loop and the quality of the models. What you are seeing here is an analysis of the fraction of native contacts that are made by the H3 loop, for all the docking models and the three scenarios: only the surface known, the loose epitope, and the perfect interface. What you see on the x-axis are the rigid-body models generated by HADDOCK in the initial stage, and what you see on the y-axis are the refined models at the end of the HADDOCK protocol; the values are the fraction of native contacts. All the points above the diagonal indicate that something has improved; all the points below the diagonal indicate that something went in the wrong direction. You see in general that the more data you have, the more models are acceptable (the color coding: green are accurate models, red are wrong models). So the more data we put in, the better the improvement in the quality of the contacts that are made; we go here, for example, from 50% in some cases up to 100% of the contacts being reproduced, and this is also visible here. There are also models that go in the wrong direction; you can never avoid that. And you see that when you have little data, in general things are not improving that much, but we still see some spectacular improvements of up to 40%.

Now, if you do the same analysis based on RMSDs of the loop, you see that the conformation of the loop is not changing that much, or at least the backbone conformation is not changing that much, but the contacts it makes are improving. So, in conclusion for this antibody modeling part: using information to drive the modeling process really improves the quality of the models that you are getting. And of the programs that we compared, which have different options for using the data, HADDOCK comes out as performing the best, because we use the data directly to guide the modeling. In terms of conformational changes, the modeling of this H3 loop remains a challenge, but the contacts can be predicted more accurately. Actually, antibody modeling and antibody-antigen modeling is also one of the main use cases in the BioExcel project, and we are trying to use molecular dynamics simulations, among other approaches, to try to improve both the structure of this H3 loop and the complexes. If you want to read the full story, I refer you to this Structure paper, which was published earlier this year.

With that, I want to close this first part of my lecture; we will continue later in the second part, where we will move into mass spectrometry, cryo-EM data and the small-molecule docking world. Thank you very much for your attention so far, and see you in a bit.