And today is our last session, and we're going to talk about reproducibility in molecular simulations. I first planned to keep it really simple and in very abstract terms, but due to our discussions over the last day and a half, I decided to change it a little bit and make it a bit more personal, a bit more real. I'll start with a few words about myself. I'm currently unemployed and unaffiliated, so you must wonder what I'm actually doing here. The reason is that before I became unemployed and unaffiliated, I started working on MDVox, which is a cloud-based repository for molecular analysis and simulations. What I love most about it is the idea we talked about yesterday: bring the compute to the data, so you don't have to move it from one computer to the next; you can analyze it in the cloud. Hopefully, one day, it will also run the simulations, and automate as much of the workflow as possible for reproducibility, so you can always check all the input parameters and so on and so forth.

I stole this slide from my previous presentation about MDVox. The idea was to provide researchers with storage and also a toolbox, so you can do all sorts of things. The idea behind the storage was that as we all move between institutions, we also lose access to the data we archived there. Accessing data at some other institution is a nightmare. I have a huge stack of hard drives which are failing, and I don't even know which data is on which drive. So I thought it would be really nice if you could just search for it on some kind of reliable service. Of course, there are many repositories, as we mentioned yesterday: Zenodo, Figshare, OSF. The reason I thought we would really benefit from a specialized repository is that we could customize the detailed metadata, which would improve searchability and findability. Maybe we can curate this a little better, maybe create better documentation this way, maybe also think about new ways to analyze and visualize things. Maybe we can get some new insights. Maybe we can even develop new methods, because we'll have more data. Essentially, it's all about better and more accessible science, so other people can look at your data and we can learn. If we make mistakes, we can learn from them. But if we don't share them, if we don't report how we do things, then we don't actually know where we are making those mistakes. And we are all making mistakes all the time.

That brings me to the next part of my talk. Essentially, I was interested in how I can actually turn this data into knowledge, because knowledge is more important than the data needed to get there, I guess. And now I would like to give you a quick overview of my career path and the methods I used along the way. I started with QM on some small systems from an enzyme. I did QM/MM on these small systems to validate whether QM/MM works, whether the parameters hold up, compared to benchmark coupled-cluster level of theory. Then I moved to MD, again because I was working on an enzyme, so you need this multi-scale approach to characterize all the important aspects of enzyme function. Of course, I started immediately with free energy calculations, because that's obviously the best way to enter molecular dynamics simulations.
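To make the metadata idea concrete, here is a minimal sketch of the kind of searchable record such a repository could store per simulation. Every field name and value is an illustrative guess on my part, not an actual MDVox schema.

```python
# A minimal, illustrative metadata record for one MD dataset; the fields are
# guesses at what a specialized repository could index, not the MDVox schema.
import json

record = {
    "system": "membrane protein in a lipid bilayer",   # hypothetical entry
    "engine": "GROMACS",
    "engine_version": "3.3.3",
    "force_field": "GROMOS",
    "water_model": "SPC",
    "temperature_K": 310,
    "length_ns": 200,
    "files": ["topol.top", "md.mdp", "start.gro", "traj.xtc"],
}

# A search index could ingest records like this to make datasets findable.
print(json.dumps(record, indent=2))
```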
We got very frustrated, but I did some free energy perturbation and thermodynamic integration to check the difference in binding energy between the substrate and the inhibitor. And then I got this little side project. It was supposed to be just a little thing: we would help these experimentalists in Munich, and I would do a quick QM/MM study of the mechanism of this photoenzyme that does DNA repair. There are excited states, all sorts of fun stuff. And I thought, okay, let's give it a try, because there were several proposed mechanisms and they weren't really sure what was happening. I never got to the QM/MM part. I mean, I did, but not mechanistically. Essentially, I got stuck at the first step, because there were two histidines in the active site and we didn't know their protonation states. So there were essentially nine possible combinations, even just to run MD, because you need to assign protonation states in advance; there are no protons jumping around. So I spent a year doing this. I did MD, I did quite a lot of QM/MM to calculate some EPR parameters to prepare the experiment. I did Poisson-Boltzmann to calculate pKa's; I used all the available servers to calculate and estimate pKa's. And they give you very, very good results. Essentially, with Poisson-Boltzmann, if you know how to choose your parameters, you can get whatever answer you want out of it.

And then, finally, I went back to my original project. I did more molecular dynamics, but this time umbrella sampling and steered molecular dynamics, because non-equilibrium free energy methods are amazing but so hard to work with, and really impossible for the system I was working on: I was trying to get the free energy of coenzyme A, which is itself a very long molecule with so many degrees of freedom. I was destined to fail, but I didn't really accept it. And then finally, I went to Poznań and started to work on membrane proteins, which are completely different and very complicated, just because of their size. I'll also explain why on the next slide. You can see that I hit rock bottom in satisfaction with the membrane proteins. Yes, I agree, because none of my projects really worked. Every now and then you need some kind of boost to keep you going, but with everything I tried there were always just more problems and more problems. Maybe it's just my perspective: I see a problem and I think, oh yeah, what an interesting problem, I have to do this. I thought it would be easier.

But this is also the point where, having started to work on membrane proteins, I really started to feel the need to see other people's data, because I was working on a protein where there was a lot of contradictory information even in the experiments, not to mention the simulations. So I didn't really know whom to trust, and I didn't know which data was actually reliable. I mean, it all seemed okay, but it was all slightly different; you could basically read whatever you wanted into the data. And that's how I started to think about the repository, that's how MDVox was born, and I really started to enjoy that. I had so much fun building MDVox. Again, it's just a prototype, but I learned a lot, and I got really interested in the infrastructure and everything surrounding it, including the money.
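As an aside, the combinatorics of those protonation states are easy to write down. A small sketch, assuming the usual three protonation states per histidine (written here with the Amber-style HID/HIE/HIP naming):

```python
# Sketch of the combinatorics above: two active-site histidines, each either
# delta-protonated (HID), epsilon-protonated (HIE), or doubly protonated
# (HIP), giving 3 x 3 = 9 systems that each need their own MD setup.
from itertools import product

HIS_STATES = ["HID", "HIE", "HIP"]
combos = list(product(HIS_STATES, repeat=2))
for his_a, his_b in combos:
    print(f"His-A: {his_a}   His-B: {his_b}")
print(f"{len(combos)} independent setups")
```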
And then that's how I started looking for decentralized solutions, which I also really, really enjoyed. I think that area is really interesting and exciting; there are some new developments there and it's really worth keeping an eye on. So my satisfaction definitely rises. I guess being unaffiliated and unemployed is not that bad, although the words carry a lot of problematic connotations.

In the meantime, these are the main packages I used. I used Gaussian, and a bunch of others, but these are the ones I used on a daily basis. I used Amber with the Amber force field, and mostly APBS for Poisson-Boltzmann. But when I went to Poznań, I switched to GROMACS and the GROMOS force field. I then tried to use Amber again after a three or four years' break, and I had to completely relearn it because it had changed so much in the meantime. What I actually used were many different analysis tools, and I'll explain why. Again, that's probably the biggest gain of my position: trying all these different protein analyses. I think I've tried every possible protein analysis there is, to make sense of the data.

So, meet my protein: P-glycoprotein, an ABC transporter. It's a multidrug transporter; it has more than 100 substrates and it's very slow. So you might wonder, why would you even model this? And it's very large: about 1,300 residues. This is the structure, but there is also a 60-residue linker which is missing from the crystal structure. The crystal structures have resolutions between 3.8 and 4.2 angstroms, and there are more than 15 different crystal structures of mouse P-glycoprotein currently in the PDB. And this is roughly the mechanism: there are really big conformational changes. ATP binds to the two nucleotide-binding domains, which then come together, and this is somehow supposed to drive the conformational change; the drug sits here in this cavity and is then expelled from the cell. It's a drug pump, essentially: the drug gets kicked out of the cell and you repeat the cycle. So, think about the timescales.

When I arrived and started to work on this protein, I browsed the literature, and every time I read another paper I got more confused, so at some point I just decided I would look at all the structural data and stick with that. The interesting thing with P-gp is that at that time there were two structures derived from identical crystallographic data: two different structures that were two different solutions of the same diffraction data. For simplicity I'll call them blue, orange, and green; I think that's easier than using PDB codes. So essentially, blue and orange are derived from the same crystallographic data but were solved in slightly different ways; I won't go into details, but there were register shifts in multiple helices, which actually changes which residues are water-exposed and which are membrane-exposed. Both had the same resolution, so the question is: which one will you use for your simulations? And of course, at that time most of the simulations in the literature used the original orange structure. The green structure came out at roughly the same time, again at exactly the same resolution, but it had a wider separation between the nucleotide-binding domains.
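For anyone who wants to make such a structure comparison concrete: a minimal sketch, using MDAnalysis with hypothetical file names, of superposing two solutions of the same diffraction data and flagging residues that moved. Register-shifted helices would show up as large local deviations. It assumes both files resolve the same Cα atoms in the same order.

```python
# Sketch: superpose two alternative solutions of the same diffraction data
# and list residues that differ most. File names are hypothetical; this
# assumes both structures contain the same C-alpha atoms in the same order.
import MDAnalysis as mda
from MDAnalysis.analysis import align
import numpy as np

blue = mda.Universe("blue.pdb")
orange = mda.Universe("orange.pdb")

align.alignto(blue, orange, select="name CA")  # least-squares superposition

ca_b = blue.select_atoms("name CA")
ca_o = orange.select_atoms("name CA")
dev = np.linalg.norm(ca_b.positions - ca_o.positions, axis=1)  # per residue, in A

for res, d in zip(ca_b.residues, dev):
    if d > 5.0:  # flag residues deviating by more than 5 angstroms
        print(res.resid, res.resname, f"{d:.1f} A")
```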
So is this another conformation? Older experimental data did report that you can have really large motions between these domains. So I used GROMOS, I mean GROMACS with GROMOS; specifically GROMACS 3.3.3, which was so slow I was getting 2 nanoseconds per day. But I persisted, so I got to 200 nanoseconds at some point. Of course, I also had to learn how to set up a membrane system, so everything went a bit slower. Initially I wanted to compare this blue structure to the orange one, to my supervisor's data, but then I set it up myself and reran everything to make things more comparable.

These are just pairwise RMSDs: this plot is for the entire protein, this one just for the transmembrane domain, and it goes up to 1.8 nanometers, which is really large. So essentially, the differences between structures went up to 20 angstroms. The differences were really big: the simulations all started from the same point, but they all diverged. Except for the transmembrane domain; this is a PCA analysis, and you can see they are all scattered in PC space, but this one here, the blue structure, is probably the one that changes the least in this region. So the blue structure is actually, let's call it, stable. These are snapshots, representative clusters from the simulations, and this is the helical content. You can see that the protein is actually unfolding as you simulate it. And this is just one set of variables you can analyze; none of it is converged, everything changes, and obviously I'm undersampling, because this is 200 nanoseconds and the entire protein works on a timescale of seconds. But these were the longest simulations at the time; everything else was between 50 and 100 nanoseconds, and people were reporting all sorts of findings and claiming all sorts of mechanistic insights, but the problem is you can't do that at all. And this unfolding was interpreted as dynamics and flexibility, but we don't really know: is it because the crystal structures were bad, or is the protein genuinely dynamic? What is dynamics and what is actually unfolding, we can't really say.

So I worked on this, trying to analyze it, and when you only get noise and things falling apart, what kind of story is that? Nobody is really interested in hearing, oh, your simulation is falling apart, hence it's not working. So when we first submitted this, I got a review saying, your simulations have diverged, and I thought, yes, that's exactly what I'm trying to say. But that was the feedback, along with the verdict that the work is not novel. I think that is really the wrong way of looking at this type of work. These are just vanilla simulations, nothing fancy; this is just the most basic thing you can do with a membrane protein.

And then last year we had a student, and I asked her to do something I was really interested in: looking at this again, after enough time had passed for the frustration to fade. I asked her to use all the force fields available in GROMACS to see what we get out of them. Do we get unfolding of the helices if we use a different force field? Do we get this deformation of the transmembrane domains? So these are the RMSDs for each of the nucleotide-binding domains and the transmembrane domains.
And you can see that it varies; again, it's not really converged in any way. This is a PCA analysis, in dihedral and in Cartesian space, and again all the simulations are pretty much all over the place. And here are the distances between the nucleotide-binding domains. We tried to pull out some experimental data; there are some DEER and FRET experiments, which also have to be taken with care. But again, you can see that the distance distributions from all these different force fields are different. So if you choose OPLS, you would get... I don't remember which color is OPLS. It's purple. So, for example, you would be here; but if I wanted to use GROMOS, which I did, then I would be here. And the green one is Martini, which actually gave the best result, but only after the simulations were extended to a microsecond timescale. And Amber is here. Essentially, you don't know what you're getting, because we always choose just one force field, and if you chose another one, you would get a bunch of different results. I apologize for the quality of the pictures; I took them from a PDF because I didn't plan to talk about this, so I had to scavenge them from a thesis.

So yeah, this is essentially what I wanted to say. It's really hard to couple MD with experiment: we often don't have experimental data that is comparable to MD. We are still mostly undersampling, definitely for transmembrane proteins, because they are so large and usually quite slow. So I think we probably also need to set up some standards: how much can we say about these simulations without over-interpreting? If we have 100 nanoseconds, can we really make bold claims about mechanisms? Maybe sometimes, but I think those would be rare cases.

And again, I made all of my data available when I published this paper. I put it on Figshare. I made the inputs available, but I had to make a selection of what to put out there, and to be honest I also cut some corners because it was easier. I was in a hurry, I had lots of things to do, so I didn't put out everything that I think should be there. But I tried to provide all the input files, the initial and final coordinates, and the trajectories. I stripped the waters because the files would have been too big to upload. I put in the parameter files and the TPR files, things that could be useful, and some analysis material as well. And then I did a drag-and-drop on Figshare. The problem is I did it several times, as I was changing the manuscript and remembering, oh, maybe I should put this in as well, maybe I should add that; you know how the process of writing a paper goes. But some of my files were actually missing. I went back recently just to check something and realized, wait, where is this file? I thought I had put it out there. And the reason is what I did: I did a drag-and-drop and left it to upload, because the upload was really, really slow. These weren't even such huge files, I think a few gigabytes, so nothing impossible, but it was slow to upload. And probably at some point, because I had left the window uploading and gone to do something else, something failed. I did check, but there were something like 60 files, so it's easy to miss one if you don't check carefully.
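Coming back to the force-field comparison for a moment: here is a rough sketch, using MDAnalysis with hypothetical file names, force-field labels, and residue ranges, of the kind of analysis behind those distance distributions: the distance between the centers of mass of the two nucleotide-binding domains, collected per force field.

```python
# Rough sketch of the nucleotide-binding-domain distance analysis discussed
# above. File names, force-field labels, and residue ranges are hypothetical;
# the real domain boundaries come from the structure.
import MDAnalysis as mda
import numpy as np

runs = {
    "GROMOS": ("gromos.gro", "gromos.xtc"),
    "OPLS":   ("opls.gro", "opls.xtc"),
    "Amber":  ("amber.gro", "amber.xtc"),
}
NBD1, NBD2 = "resid 380:630", "resid 1020:1270"  # hypothetical domain ranges

for ff, (top, traj) in runs.items():
    u = mda.Universe(top, traj)
    a, b = u.select_atoms(NBD1), u.select_atoms(NBD2)
    d = np.array([np.linalg.norm(a.center_of_mass() - b.center_of_mass())
                  for _ in u.trajectory])  # one value per frame, in angstroms
    print(f"{ff}: mean {d.mean():.1f} A, std {d.std():.1f} A")
```

Histogramming the per-frame distances for each force field gives distributions like the ones in the figure, which is where the disagreement between force fields becomes obvious.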
So, coming back to the upload problem: I want to say we definitely need some kind of checklist. If we decide what the standard or really required files are, then you shouldn't be able to finish an upload until you have confirmed: you need to do this, this, this, and this, because we can't rely on noticing everything ourselves. I also simply didn't include some other things that might have been useful, because I was busy with other things, I wanted to get it done, I wanted to get rid of this paper and move on with my life. And I also want to say: my simulations were probably a year on Figshare before I published the paper, and I did not get scooped. So I don't think there is a reason to be afraid of that. And these were really basic simulations, nothing fancy about them at all, except that there is a membrane in them.

But just in setting up the system, in how we do things and how we run simulations, I decided to roughly cluster where we introduce errors, or not errors exactly, but different specifications, or whatever you want to call it, which might affect reproducibility. Every step is a decision we need to make, and this is my rough clustering of the different sources of divergence. So first there is the theoretical one: we need to choose a force field, that is, a physical model to describe the system. You need to choose a sampling method: will it be a plain simulation or enhanced sampling; if enhanced sampling, which enhanced sampling method; and if you do, you have to choose a reaction coordinate, which reaction coordinate, how you will choose it, and so on and so forth. Then we analyze the data: you have to unbias your simulation, you have to do all sorts of steps; again you need to choose what to do, whether you use the available tools or write your own, whether you use MDAnalysis or MDTraj or the GROMACS tools; it doesn't really matter, the point is it's another decision.
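On that required-files idea, a minimal sketch of a pre-upload completeness check; the list of required patterns is my assumption of what "standard files" could look like, not an agreed community standard.

```python
# Minimal pre-upload completeness check: refuse to proceed until every
# required file pattern matches something in the dataset directory.
# The REQUIRED list is an assumed example, not an agreed standard.
from pathlib import Path

REQUIRED = ["*.mdp", "*.top", "start*.gro", "final*.gro", "*.xtc", "README*"]

def dataset_complete(directory: str) -> bool:
    root = Path(directory)
    missing = [pat for pat in REQUIRED if not list(root.glob(pat))]
    for pat in missing:
        print(f"MISSING: nothing matches {pat}")
    return not missing

if dataset_complete("pgp_dataset"):  # hypothetical dataset folder
    print("All required files present; safe to upload.")
```

Something like this, run by the repository itself, would have caught my missing files. But back to the decision clusters.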
And then there is the computational side. I don't even want to go deep there, but there is hardware: CPUs, GPUs, architectures, all sorts of things that affect precision. We know about this, and there is not really much we can do, so I won't dwell on it. But there is also software: which package will you use, which version of that package, are you going to write your own code, will you share that code? All sorts of things that, again, affect the results. And then there is the practical side, where we are actually setting up the system: topology building, which protonation states, which ions do you want to put in, which water model do you want to use; initial coordinates, which PDB entry will you choose, will you try to build something yourself; if you are building some kind of polymer, again, what will you choose as a starting point, because starting points are really important. Then relaxation; I call this relaxation, though other people refer to it as equilibration: how do you minimize, do you heat the system up, do you apply position restraints? All sorts of decisions go in here. And then the production parameters: which thermostat, which electrostatics method, how many simulations do you want to run, and for how long. All of this affects reproducibility. And essentially, all of these decisions strung together, I think you would call that a workflow, right?

Now, before we continue and break out into discussion, maybe we should talk about what reproducibility really means, and what it means for us, because I think that's probably very different for different people, and there is a lot of discussion about reproducibility. People distinguish between three terms that are often used interchangeably in everyday language: repeatability, replicability, and reproducibility. Repeatability means the same team and the same experimental setup: essentially replicas, where you start from the same point on the same machine, set up in exactly the same way. Replicability means a different team but the same experimental setup: essentially, someone else is using your machines, your devices, but someone else is doing it. Reproducibility means a different team and a completely different experimental setup: they can follow your protocol, but everything else is different, including the machines, the equipment, and the people working on the experiment. Additionally, some people have introduced finer definitions: methods reproducibility means providing sufficient detail about procedures and data so that the same procedure can be exactly repeated; results reproducibility means obtaining the same results from an independent study with procedures matched as closely as possible to the original; and inferential reproducibility means drawing the same conclusions from either an independent replication of a study or a reanalysis of the original study. So which definition do we actually choose when we discuss these things? I think that's really important, so that we are all on the same page and all trying to solve the same problem, rather than each of us thinking about different steps; and we have seen that there really are many steps involved.
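Coming back to those decision clusters: one pragmatic step toward reproducibility is to record every one of those choices in a single machine-readable file shipped with the data. A sketch, with illustrative values only:

```python
# Sketch: capture the decisions from each cluster (theoretical, practical,
# computational, analysis) in one record stored next to the trajectories.
# All values below are illustrative.
import json
import platform

workflow = {
    "theoretical": {"force_field": "GROMOS", "sampling": "plain MD"},
    "practical": {
        "starting_structure": "blue",          # which crystallographic solution
        "protonation": {"His-A": "HIE", "His-B": "HID"},
        "water_model": "SPC",
        "relaxation": ["minimization", "heating", "position restraints"],
        "production": {"thermostat": "Nose-Hoover",
                       "electrostatics": "PME",
                       "replicas": 3,
                       "length_ns": 200},
    },
    "computational": {"engine": "GROMACS 3.3.3",
                      "host": platform.platform()},
    "analysis": {"tools": ["MDAnalysis"], "scripts": "analysis/"},
}

with open("workflow.json", "w") as fh:
    json.dump(workflow, fh, indent=2)  # ship this file alongside the data
```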
So I guess the question is: what do we really want? Is reproducibility really important for MD? How would we achieve it, and what do we mean by it? Yesterday we discussed the quality of simulations; what do we mean by quality? Am I making better simulations than someone else? Are my simulations of better quality? How do we even define quality: how long do you run, which force field do you use? I think it's a really tough question to answer, maybe even one without an answer. But what is worthwhile, what is really important to push forward, what do we really care about, and what can really make this method more usable in the end?

So I think, rather than reproducibility, I'll focus on another word: reliability. We have really good best practices; if you follow them, if you want to calculate binding free energies for example, the results should probably be good. Not necessarily perfect, but maybe this is the best way to achieve the highest accuracy, and so forth. We can give each other recipes and best practices, and ways to check that things are actually good. We should also maybe focus on accuracy: make sure that our force fields are performing well, which means we should maybe do more benchmarking studies, which maybe means we should team up with experimentalists who could derive data sets that are really comparable to MD. Maybe we can really work on having these benchmark sets, because we don't have many; we have some, but I don't think enough. And we can also do relative comparisons sometimes. But these are studies that burn computing time and don't bring the kind of novel results that get you published in Nature. And finally: which metrics do we want to use, and how do we validate what is actually useful? For this, I think the nicest thing I have seen recently is the Living Journal of Computational Molecular Science and its best-practice articles, for example on error bars and error estimation. I had to learn exactly the same things, but it took me a very long time, because I was searching for this literature all over the place. So I think we really need to start consolidating the things we have learned, maybe over the past ten years, and make them more available and accessible to the people who are coming, so they don't also have to spend months trying to understand how to calculate error bars. And I think Eric's message from breakfast is: aim well, don't aim too high; let's get at least something done.
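On the error-bar point: a self-contained block-averaging sketch for a correlated time series, in the spirit of those best-practice guides. The data here is synthetic, for illustration only; with a real trajectory you would pass in, say, a domain-domain distance per frame.

```python
# Block averaging for the standard error of a correlated observable.
# The AR(1) series below is synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
x = np.empty(10000)
x[0] = 40.0
for i in range(1, x.size):  # correlated fluctuations around 40 A
    x[i] = 40.0 + 0.95 * (x[i - 1] - 40.0) + rng.normal(scale=0.5)

def block_error(data, n_blocks):
    """Standard error of the mean estimated from n_blocks block averages."""
    means = np.array([b.mean() for b in np.array_split(data, n_blocks)])
    return means.std(ddof=1) / np.sqrt(n_blocks)

for n in (5, 10, 20, 50):
    print(f"{n:3d} blocks: {x.mean():.2f} +/- {block_error(x, n):.2f} A")
```

The usual reading: as the blocks get longer than the correlation time, the error estimate stops growing, and that plateau is the honest error bar; naive standard errors over correlated frames underestimate it.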