Hi everyone! My name is Zuzana Jandova and I'm from the group of Alexandre Bonvin in Utrecht. Today I would like to talk a bit about what happens when HADDOCK meets GROMACS.

So I would like to start with a short recap of what HADDOCK does. I'm sure by now you are all experts at this. However, it never hurts to hear it once again. The first step is the rigid-body energy minimization. This is the step where you have two proteins that are rigid: they are separated in space, rotated and translated, and then pulled together by the restraints you define, ideally derived from the experimental information you have. In the second step you give the interface a little bit of a push. So you have a tiny bit of dynamics on the interfacial residues, where they have the chance to rearrange, and so you can avoid some steric clashes. The last docking step was previously a refinement in explicit solvent. This can still be set in the new HADDOCK version, however by default it has now been replaced by just an energy minimization.

The last step, which is technically not a part of docking but is still quite important, is the analysis of your results. In this step you rank all the models and cluster them, and then you can compare the energetic terms. It's much easier to look at individual clusters than at 400 models one by one.

All of these phases, especially the ones in the beginning here, can be called sampling, right, because you sample different conformations and then you refine them a little bit. But a very important step as well is scoring. Scoring is basically how you move between those stages. It is the criterion by which you pick the best models from each stage to go to the next one. And this criterion is called the HADDOCK score. As you can see, it changes slightly between the steps. However, you can also modify it manually.
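To make the role of the score concrete, here is a minimal sketch of stage-dependent scoring as a weighted sum of energy terms. The talk only says that the score changes between stages and can be modified, so the term names (`vdw`, `elec`, `desolv`, `air`, `bsa`), the stage names and the weights below are illustrative assumptions, not the values used in the talk.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """One docked model with hypothetical per-term values."""
    name: str
    vdw: float     # van der Waals energy
    elec: float    # electrostatic energy
    desolv: float  # desolvation energy
    air: float     # restraint energy
    bsa: float     # buried surface area


# ILLUSTRATIVE weights per stage (assumption, not taken from the talk).
STAGE_WEIGHTS = {
    "rigid_body": {"vdw": 0.01, "elec": 1.0, "desolv": 1.0, "air": 0.01, "bsa": -0.01},
    "final": {"vdw": 1.0, "elec": 0.2, "desolv": 1.0, "air": 0.1, "bsa": 0.0},
}


def score(model: Model, weights: dict) -> float:
    """Weighted sum of the model's terms; lower is better."""
    return sum(w * getattr(model, term) for term, w in weights.items())


def select_top(models: list, weights: dict, n: int) -> list:
    """Rank models by score and keep the n best for the next stage."""
    return sorted(models, key=lambda m: score(m, weights))[:n]


models = [
    Model("a", vdw=-50.0, elec=-100.0, desolv=5.0, air=20.0, bsa=1500.0),
    Model("b", vdw=-20.0, elec=-200.0, desolv=10.0, air=50.0, bsa=1200.0),
]
for stage, weights in STAGE_WEIGHTS.items():
    best = select_top(models, weights, 1)[0]
    print(stage, best.name)
```

With these made-up numbers the best model differs between the two weight sets. Changing the weights only reorders the same models; it does not change how they were generated.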
So depending on the scenario you are working with, you can slightly change the HADDOCK score. But keep in mind that you don't really change the docking procedure; you just change the ranking of the models that will be taken to the next phase.

Right, so now that we know more about what sampling and scoring are, I would like to show you how we can incorporate MD in both of these stages. As I said in the beginning, you dock two rigid proteins. But in nature proteins are often not rigid, and sometimes you need a larger conformational change before binding, so you need a little bit of rearrangement of the proteins or residues. Sometimes this change is larger than what even the flexible phases of HADDOCK can take care of. This is the first part of my talk, where I will be talking about using MD before docking, and this will be on the antibody use case. In the second part I will address scoring, and more specifically the case where we have similarly ranked clusters, or different clusters, and you don't know which one to pick, and whether MD can help us with this.

So let's go to the first part. This is also a BioExcel project, called use case one: antibody design, and as you can guess we work with antibodies. I guess all of you know what an antibody looks like. Basically it's this Y-shaped protein with a constant region. For this constant region I have written here "low immunogenicity". It means that if a human antibody were injected into your body, there should be a relatively low chance that your own body would develop antibodies against the injected antibody protein. What is very powerful about antibodies is the variable region, and more specifically the hypervariable loops at the very end of the variable region. This is the tool with which an antibody can change the sequence and conformation of the loops and target antigens of different nature. Right, so this is roughly the workflow we have in BioExcel.
What would HADDOCK do? You have the grey antibody here and the yellow antigen here, and by default you dock these two rigid proteins. However, as I said before, sometimes you need a conformational rearrangement before binding. This was also the case for this antibody here, where you see that the unbound hypervariable loops in magenta look very different from the ones in the bound state, the blue ones. HADDOCK could not really capture this change. However, we simulated it with GROMACS, and now I would like to show you some first results.

Right, so these are the two workflows we were working with. This is the default HADDOCK scenario, where we took two unbound structures. What we tried instead is that we took the unbound antibody crystal structure and simulated it in GROMACS. The trajectory was then clustered based on the conformation of the loops, and the resulting clusters were fed into HADDOCK. So we used the ensemble docking option. I would like to show you the results now; this was done for seven antibody-antigen complexes.

Right, so the first result you can see here. These kinds of plots show you basically the model quality versus the model ranking. In the ideal scenario you would like to see a correlation like this, so something where the highest-quality model gets the lowest HADDOCK score. However, as you can see, this is not always the case. You can see that it's more spread out. You have a lot of acceptable-quality models, but they are ranked very differently. This was the scenario without MD. If we look at the scenario after MD, we see that we have more models in the medium-quality field. This is quite an improvement, because without MD you don't have as many medium-quality models. However, you see that the score still remains problematic, because if you have a HADDOCK score of, let's say, minus 80, you don't really know if your model is good or bad.
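The quality classes in these plots (acceptable, medium, high) are the usual CAPRI categories. As a minimal sketch, the function below applies the commonly cited CAPRI thresholds on the fraction of native contacts, the ligand RMSD and the interface RMSD. The thresholds are not stated in the talk and are given here from general knowledge. The function names and example numbers are made up for illustration.

```python
def capri_quality(fnat: float, lrmsd: float, irmsd: float) -> str:
    """Classify a model with the commonly cited CAPRI criteria.

    fnat  : fraction of native contacts (0..1)
    lrmsd : ligand RMSD in angstrom
    irmsd : interface RMSD in angstrom
    """
    if fnat >= 0.5 and (lrmsd <= 1.0 or irmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (lrmsd <= 5.0 or irmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (lrmsd <= 10.0 or irmsd <= 4.0):
        return "acceptable"
    return "incorrect"


def quality_by_rank(models: list) -> list:
    """Sort (score, fnat, lrmsd, irmsd) tuples by score, return the qualities."""
    ranked = sorted(models, key=lambda m: m[0])
    return [capri_quality(fnat, lrmsd, irmsd) for _, fnat, lrmsd, irmsd in ranked]


# Hypothetical models: two have nearly the same score but different quality.
example = [
    (-80.0, 0.05, 20.0, 9.0),
    (-95.0, 0.40, 4.0, 1.8),
    (-80.5, 0.20, 8.0, 3.5),
]
print(quality_by_rank(example))
```

In this toy example the two models scoring around minus 80 fall into different classes, which is the scoring problem described above.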
And in these two plots we look at the results in a slightly different way. We look at in which of the top X clusters, or top X models, ranked by HADDOCK you have at least one acceptable-, medium- or high-quality model. And you see that after MD there is a slight improvement, because also in the top 10 and top 20 HADDOCK models we have more medium-quality models compared to the scenario without MD.

However, this doesn't always work very well. For the antibody I showed you before, the results even after MD are rather bad, right? They are even worse than before. So what we did is apply the accelerated weight histogram method, an enhanced sampling method. This is the case where the antibody is pulled apart from the antigen and then samples the interface loops, so you may get even better sampling. This was the workflow we used, and you see that the results are significantly improved. Now we are more in the medium-quality field, and this was rather encouraging.

So with this I would like to conclude the first part of my talk. We see that MD before the docking does improve the sampling: we have more of the better models. However, we also get a higher number of worse models, and we still might have some trouble with scoring them. However, if we use the enhanced sampling method, we also get better results for the very problematic antibodies. And here you can see a short movie made by Alessandra Villa at KTH, where you see the accelerated weight histogram simulation of the antibody I showed you before, and you see how the antigen is being pulled away along the reaction coordinate in red.

Right, so let's come to the second part of my talk, which is called "native or non-native". We want to identify which cluster is native. What does that mean? As you can see here, we have the reference in the middle.
So this is the crystal structure you would wish to have; that's the correct one. And then you see that if you automatically look at the best-ranked cluster by HADDOCK score, it looks nothing like the reference, right? It's very different. At first this may not be very good news. However, if you dig further through the HADDOCK results, you see that the right solution is actually there, but it may be a bit further down. It may be the second- or third-ranked cluster, which you would not automatically look at. And our question was: can MD help us with this problem?

So what did we do? We simulated 25 complexes which were docked by HADDOCK. The first part was just the standard HADDOCK procedure, and then the complexes were divided into models of high quality and models of low quality. The high-quality models were called the native ones, and the low-quality ones are the non-native ones, but they are still relatively highly ranked clusters. We also simulated the reference structures, just for comparison, and we analyzed the simulations. We looked at a number of properties: the crucial CAPRI properties like the ligand RMSD, the interface RMSD and the fraction of native, or original, contacts, as well as the buried surface area, the distance between the proteins, the non-bonded energies and the number of hydrogen bonds. And then from these analyses we fed these properties into a machine learning classifier.

The first question we had was whether the non-native models can improve over time. Maybe they are not good after docking, but if we simulate them for 100 nanoseconds they come closer to the reference. So in this step we compare the course of the simulation to the original crystal structure, the reference that you would wish to achieve. And this is what we got. In these graphs you can see trajectory stretches, so 0 to 5 nanoseconds, 5 to 10 nanoseconds, etc., and these are box plots of all 20 complexes.
So these are 20 complexes, all put together. And then we see the interface RMSD and the ligand RMSD. In green there are the reference simulations, which remain the lowest, so the most similar to the original complex crystal structure. This makes sense, because they basically come from the crystal structure, so they don't deviate too much. The acceptable ones are close and also don't deviate too much over time. However, the non-native ones were high up from the beginning and they don't seem to come closer. So for both native and non-native ones we don't see any significant improvement over time, or any coming closer to the crystal structure, which could also be expected in only 100 nanoseconds.

This is another plot, where I'm showing you the fraction of original contacts. You can see that even the reference simulations lose a lot of the original contacts that they have in the crystal structure. On average they lose maybe up to 70% of the original contacts, because of the minimization and equilibration, and also because these intermolecular contacts are defined rather strictly. The cutoff is only five angstroms, and that can be violated rather easily.

For the next question we considered a more realistic scenario, one where you don't have the reference structure, because usually when you want to dock something you don't know what the result should be. So we compare all of these properties to the start of the trajectory. And here you can see them. Obviously they are lowest in the beginning, because that's the start of the trajectory, and then these properties slowly increase. However, what you can already see is that for the incorrect complexes here, the RMSD averaged over all 20 complexes goes much higher than for the reference and native complexes. And you see the same trend with the fraction of native contacts.
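The fraction of original contacts with a 5 angstrom cutoff can be sketched as follows. This is a simplified, standard-library illustration and not the analysis code behind the plots: it assumes each chain is given as a list of (residue id, xyz) atom entries, and it counts a residue pair as a contact if any two of their atoms are within the cutoff. The function names and toy coordinates are made up.

```python
from math import dist


def residue_contacts(chain_a: list, chain_b: list, cutoff: float = 5.0) -> set:
    """Return the set of (residue_a, residue_b) pairs with any atom pair within cutoff.

    Each chain is a list of (residue_id, (x, y, z)) atom entries.
    """
    contacts = set()
    for res_a, xyz_a in chain_a:
        for res_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= cutoff:
                contacts.add((res_a, res_b))
    return contacts


def fraction_retained(reference: set, frame: set) -> float:
    """Fraction of the reference contacts that are still present in a frame."""
    if not reference:
        raise ValueError("reference has no intermolecular contacts")
    return len(reference & frame) / len(reference)


# Toy example: one of the two starting contacts is lost in the later frame.
chain_a = [(1, (0.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0))]
chain_b_start = [(101, (3.0, 0.0, 0.0)), (102, (12.0, 0.0, 0.0))]
chain_b_later = [(101, (3.0, 0.0, 0.0)), (102, (20.0, 0.0, 0.0))]

start = residue_contacts(chain_a, chain_b_start)
later = residue_contacts(chain_a, chain_b_later)
print(fraction_retained(start, later))
```

The reference set can come either from the crystal structure or from the first frame of the trajectory, which corresponds to the two comparisons described above. Because a contact switches off as soon as the closest atoms drift past the cutoff, small motions are enough to lower this fraction.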
So even though you compare all of these to the point zero here, you see that by the end of the 100 nanoseconds these non-native complexes lose up to almost 100% of the original intermolecular contacts. So they are really much more unstable than the native ones, which is quite interesting. We didn't really expect this difference, so we thought: okay, let's take these properties and put them into a machine learning classifier.

How we did this is we divided the trajectory into shorter snippets, or stretches, and then we had all the properties I was talking about before. They were regarded as features, and they were labeled as native or non-native according to the complex they came from. This was then fed into a random forest classifier, which was optimized beforehand and then trained. This random forest classifier consists of many decision trees, with additional levels of randomness. You can read about it, or I can tell you later.

So now we have trained the model; let's look at whether it worked. First we tried the model on the initial training set. We had a training set of 20 complexes, and we tried to see how well it works at predicting another set within the training set. We did this cross-validation 100 times, where we would always divide the training set into one subset the model was trained on and one subset we would predict for. The accuracy was on average 0.86, which is quite nice. The ROC curve also looks rather promising, so we were quite happy about this. But one could still say that the comparison is not very fair, since we don't have any other complexes. So what we did is we looked at two independent sets of five complexes each, which the model had never seen. The model was trained on the training set, which was independent of those two test sets, and then you see that we have accuracies of 0.6 and 0.75, which is really nice. But why are they so different?
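The data preparation described here can be sketched with the standard library alone: cut each trajectory into windows, turn each window into a feature row, and repeatedly split the complexes into a training and a held-out subset. The per-window features (mean and change), the window length and splitting by whole complex are assumptions made for this illustration; the talk does not specify them. The classifier itself is left out.

```python
import random
from statistics import mean


def window_features(series: dict, window: int) -> list:
    """Cut per-frame property series into windows and summarize each window.

    series maps a property name to a list of per-frame values.
    Returns one feature dict per complete window.
    """
    n_frames = min(len(values) for values in series.values())
    rows = []
    for start in range(0, n_frames - window + 1, window):
        row = {}
        for name, values in series.items():
            chunk = values[start:start + window]
            row[f"{name}_mean"] = mean(chunk)
            row[f"{name}_change"] = chunk[-1] - chunk[0]
        rows.append(row)
    return rows


def repeated_complex_splits(complex_ids, n_repeats: int, test_fraction: float, seed: int = 0):
    """Yield (train_ids, test_ids) pairs, holding out whole complexes each time."""
    rng = random.Random(seed)
    ids = sorted(set(complex_ids))
    n_test = max(1, round(len(ids) * test_fraction))
    for _ in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# Hypothetical interface-RMSD series for one simulated model.
rows = window_features({"irmsd": [0, 1, 2, 3, 4, 5]}, window=3)
print(rows)

splits = list(repeated_complex_splits(range(20), n_repeats=100, test_fraction=0.2))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each row would then get the label native or non-native from the model it came from, and a random forest (for example scikit-learn's `RandomForestClassifier`) could be trained on the training rows of each split and scored on the held-out rows. Holding out whole complexes avoids having windows of the same trajectory on both sides of the split.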
So we also looked at the individual properties of both test sets, as I showed you before, and we see that in the test set with the lower accuracy the properties are not really very different. The native and non-native complexes don't really behave too differently there. However, we can still distinguish between them with an accuracy of 0.6, which is still quite nice.

The random forest classifier can also tell you which features are the most important ones, which is really nice. We see that all features are similarly important. However, the interface RMSD, the change in the native contacts and the change in the buried surface area are the most important features. This is quite logical if you think about it: if the binding of two proteins is not very stable, of course the buried surface area will decrease more than for a stable complex.

With this I would like to conclude the second part of my talk. MD does not improve the quality of the low-quality complexes. However, it can show you nicely how native and non-native ones differ in behavior, and so we were able to build a machine learning classifier on top of this, and we get quite a nice result also for the external validation set. These little movies are just an example of the behavior of those two kinds of models. We have the native one and the non-native one, and here you can see how the non-native one really unbinds already after maybe 30 nanoseconds, and you see how it crosses the periodic boundary and finds the periodic copy of its partner on the other side of the box.

So with this I would like to thank you a lot for your attention. Thanks to all my BioExcel partners, mostly KTH in Stockholm, and thanks to my Computational Structural Biology lab. Thank you.