Giovanni Bussi, the floor is yours. Thank you. You're muted, Giovanni. Can you hear me? Okay, so hi everyone, and thank you for joining this event. I'm very happy that we managed to organize it and to see a lot of people virtually here. I will not introduce myself because I already introduced myself yesterday. But before going to the topic of my talk, I want to quickly remind you that we have these Q&A sessions in the afternoon. Unfortunately, I will have to really rush off at 12:25 at the latest, because this is my daughter's first day at kindergarten, so I cannot be late picking her up. But I will certainly be happy to answer your questions during the talk, so please interact with me, or this afternoon. Another thing I wanted to remind you of: it's very useful to upload your posters, even with preliminary work, at these kinds of events, because this is a very nice way to discuss what you're doing with colleagues and other students, and also to get help from other students or from professors. So even if you have something very preliminary, we would be very happy if you uploaded it so that you can discuss it with colleagues. Do it before the poster session, which is on Thursday.

Okay, good. So first of all, why is my group so focused on RNA dynamics that the word RNA appears in the title of each of my talks? If you think of the central dogma of molecular biology, RNA just stores the genetic information for a transient time: it's copied from DNA and then used to code for proteins. That's the role of the messenger RNAs that are so popularly used in vaccines now. But if you look at the numbers, you will see that only a really small minority of your genome follows this path. For humans, around 2% of the genome codes for RNAs that then code for proteins, whereas the majority of the genome codes for RNAs that do not code for proteins but do something else. So there are a lot of RNA molecules that are functional, and in this sense there are many more non-coding RNAs than coding RNAs. If you look at the numbers in different organisms, you can see that the more complex the organism, the more this is true. Maybe oversimplifying a bit, you could think that all organisms share basically the same, or a very large, pool of proteins, but the RNA-only part of the genome is the driving force that organizes things: it is what basically controls gene expression, and it becomes predominant as you go to complex organisms like vertebrates or humans. That's to say that there are a lot of RNA molecules with relevant properties and interesting features.

What we would like to do, in principle, is connect RNA structure with RNA function. RNA structure is typically classified at three levels. You have the primary structure, which is the sequence: just a four-letter alphabet. This is not completely true, since there are actually a number of chemically modified nucleotides, but these four are the most common ones. Then you have the secondary structure, also called the 2D structure because it can be represented on a piece of paper in two dimensions, which is basically the list of all the helices, all the stems, that your RNA molecule will form.
Or, in other words, it's the list of all the Watson-Crick pairs that your molecule will form. And then, since we live in a three-dimensional world, you have the tertiary structure, which is the way these helices assemble in 3D and the way the molecule arranges itself in three dimensions. How are these three levels of structure connected with function? For coding RNAs, which are basically there to store a message, what matters is the primary structure, the message itself. But as you go to non-coding RNAs, the function could depend on the primary structure, the sequence; or on the secondary structure, which basically tells you which parts of the sequence are accessible; or on the tertiary structure, which basically gives the three-dimensional shape. Whenever the role of an RNA molecule is to bind another molecule, since binding at the molecular level often happens through some sort of complementarity in the shapes of the two molecules, the three-dimensional structure clearly becomes very relevant. In addition, you have to keep in mind that these molecules are not frozen: they fluctuate and have internal dynamics, so in principle they can slightly change their shape or structure when they bind. So clearly dynamics can be important. In practice, what is known is that RNA molecules are typically highly flexible, and so their three-dimensional dynamics is crucial for understanding how they bind and interact with other molecules such as proteins, ligands, ions, and so on.

Okay, after this motivational introduction, this is my agenda. Since we want to study RNA dynamics, the natural tool from the computational perspective is molecular dynamics simulation, and I will spend a few words on it. You already heard about MD simulations yesterday, so I will not go into too much detail, but I want to give you some flavor. Then, if I have time, I will try to present two projects. One is the idea of improving the accuracy of MD simulations by fitting force fields on experimental data. The second is to directly restrain MD simulations to match experiments. Let's see whether I can go through both or not.

So, some introduction to molecular dynamics simulations. The type of MD that we run is atomistic MD. Even though I often show movies without water, water is always included in the simulation. It's not a coarse-grained model: water is there, ions are there, and all the atoms are explicitly modeled. The functional form of the energy model we use is like the one represented here: you have bonded terms representing the chemical topology of your molecule, then torsional angles, van der Waals attractive and repulsive interactions, and electrostatic interactions. Basically, the functional form is motivated by the chemistry of the system, and it's important to recall that we are using non-polarizable and non-reactive force fields. Charges are frozen: you cannot have displacement of charges, and you cannot have transfer of electrons, so you basically have to think of a fixed chemical topology. With these approximations, we can run simulations on the order of hundreds of nanoseconds per day, depending on the machine we have available and the size of the system we want to simulate. The software we use for MD is GROMACS, and we also use PLUMED a lot, which is a plug-in for enhanced sampling. Yes, Alessandro was talking about enhanced sampling yesterday.
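For reference, the energy model just mentioned has, in generic terms, the following form. This is a sketch of a typical fixed-charge force field; the exact terms and conventions differ between force-field families:

```latex
U(\mathbf{r}) =
    \sum_{\text{bonds}} k_b\,(b-b_0)^2
  + \sum_{\text{angles}} k_\theta\,(\theta-\theta_0)^2
  + \sum_{\text{torsions}} \sum_n V_n \left[ 1+\cos(n\phi-\delta_n) \right]
  + \sum_{i<j} \left\{ 4\epsilon_{ij}\left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
      - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right]
      + \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right\}
```

The fixed partial charges q_i are what "charges are frozen" refers to: no electronic polarization and no charge transfer.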
I will give you some more information about enhanced sampling during my talk. Okay, good. The typical flowchart of an MD simulation is: you need initial coordinates, which you often take from some experiment, and then you usually have to complete your system, because the set of coordinates you can get from experiment is not sufficient. Usually you have to add water molecules, which are difficult to resolve experimentally; sometimes you have to model parts of your molecule because they were not resolved in the experiment; and so on. So you do some work, partly by hand and partly automated, to obtain an initial set containing all the coordinates. Then you run your simulation, which basically means you integrate the equations of motion: nothing other than Newton's law. There are all the associated technicalities; I will not go through the details, but of course there could be a lot of discussion about this. You need proper algorithms to integrate the equations of motion, proper thermostats and barostats, and so on. Once you have done that, you collect a trajectory that you can use to show a movie or analyze quantitatively: you can compute the populations of different substates, compute autocorrelation functions, and so on. The important thing is that what you obtain as output has four dimensions, space plus time, your trajectory, so you can really study what the system does over time.

Okay, to give you an idea of where the field is now, I also wanted to show you some very recent applications by other groups that I think can give you an idea of the forefront of the field. This is a paper from 2019 published in Cell, with the biggest simulation of a biological system that I am aware of: a simulation of a system with 136 million atoms. Here the authors managed to simulate a full organelle, which is something amazing. The timescales are probably not long enough to really see something very interesting happening (you see here it's half a microsecond), but still, it's amazing that it's possible to model at the atomistic level something whose complexity is definitely sufficient to be interesting from the biological point of view.

Then another type of application I wanted to show you is this very nice paper. This is a joint computational and experimental work, where the head of the computational group is Gerhard Hummer, who is very well known in our community. Here the use of MD is not so much to be predictive but rather to complement and explain experiments. The experimentalists had very useful cryo-electron tomograms (you can see the images here) of the spike proteins from SARS-CoV-2, but you can see that these images have rather low resolution. In order to really explain what's happening, you need to do some modeling at a finer level, and that's where MD simulations were used: to understand at the microscopic level what the difference was between the different conformations. It's a very interesting paper.
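As a brief aside on the "integrate the equations of motion" step in the flowchart above, here is a toy velocity-Verlet integrator. This is a minimal sketch only: production engines such as GROMACS add constraints, thermostats, barostats, and neighbor lists, and `force_fn` is a placeholder standing in for the force-field evaluation.

```python
import numpy as np

def velocity_verlet(pos, vel, force_fn, masses, dt, n_steps):
    """Toy velocity-Verlet integrator: F = m a, nothing more.

    pos, vel: (N, 3) arrays; force_fn(pos) -> (N, 3) forces;
    masses: (N,) array; dt: timestep (~1-2 fs in atomistic MD).
    Returns an (n_steps, N, 3) trajectory array.
    """
    traj = np.empty((n_steps,) + pos.shape)
    forces = force_fn(pos)
    for step in range(n_steps):
        vel = vel + 0.5 * dt * forces / masses[:, None]  # half kick
        pos = pos + dt * vel                             # drift
        forces = force_fn(pos)
        vel = vel + 0.5 * dt * forces / masses[:, None]  # half kick
        traj[step] = pos
    return traj

# Example: one particle in a harmonic well
# traj = velocity_verlet(np.ones((1, 3)), np.zeros((1, 3)),
#                        lambda r: -r, np.ones(1), dt=0.01, n_steps=1000)
```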
Okay, good. So why do I like molecular dynamics simulations? They give you access to very high resolution information, both in time and in space: sub-angstrom spatial resolution and roughly femtosecond time resolution. They give access to dynamics, so you can compute a trajectory and the populations of different states, and they're relatively cheap. Of course, that depends on what you want to do; if you want to simulate 136 million atoms, it's not going to be cheap. There are also drawbacks. First, the accessible timescales are rather short, depending on what you want to obtain, and in order to tackle this you need to be ready to use enhanced sampling techniques like those Alessandro was introducing yesterday. Second, MD is based on force fields, so if the force fields are inaccurate, you have to be ready to work on improving them. And finally, you always have to remember that you're just describing a model of the system, so you always have to validate against real experimental data. In this sense, I'd like to quote this sentence from Vijay Pande: in science, as in life, it is very dangerous to fall in love with useful models. And since there are many Italians in the audience, I will also quote this from one of my favorite books: "Ma guai a chi cede alla tentazione di scambiare un'ipotesi elegante per una certezza" (but woe to those who yield to the temptation of mistaking an elegant hypothesis for a certainty). Okay, good. So remember that you always have to validate your results before trusting them blindly.

Then let's go to the first problem I mentioned: the short timescales. You can find many pictures like this one in the literature, with examples for different types of systems; this one is for RNA. You can see that, depending on what you want to probe, the timescale you need to access can vary by several orders of magnitude. If you want to see something like changes in the base-pairing patterns, you already have to run for something on the order of a microsecond, which is basically what is accessible now with standard hardware and software; a microsecond or tens of microseconds, let's say. If you want to study something more complicated, like binding and unbinding of different cations, that's totally out of reach, and conformational switching is completely impossible. So, as I said, the only solution here is to combine plain molecular dynamics with enhanced sampling methods, which allow you to basically cheat in your simulation, make things happen faster, and then a posteriori discount for what you did.

The talk by Alessandro was mostly focused on this first problem, so I will go directly to the second problem, which is the issues with force fields. These are just a few pictures from the literature. On the left you see some results obtained on tetramers: short oligomers that are disordered, without a clear structure. There are very accurate NMR data for these systems from the group of Doug Turner, and very extensive MD simulations from the group of Tom Cheatham, showing that basically MD and NMR do not agree. This is a very, very important issue: it means that when you want to study disordered RNAs, you should be very careful, because there are at least some sequences for which it's known that you will not reproduce the correct conformational variability that you see in the experiment. On the right, instead, you see some tests on structured motifs. These are tetraloops, which are hairpin loops with four nucleotides in the apical loop. Here we'd expect the native structure to be the most stable, but if you run the simulation carefully, you will obtain a free-energy profile like this one, where this is the native structure.
And these are the non-native structures. So the native structure is a local minimum: good. That means that if you start your simulation from the native structure, you will be able to preserve it for at least some time. But it's a metastable state: if you run your simulation long enough, or if you run it with a sufficiently good enhanced sampling method, your system will explore this region here, which is not compatible with the experiment. Very bad. So basically, also for a structured motif that is known to be tricky, current force fields do not work. What is the role of the experimental data here? You can think of it in these terms: you run a simulation, then you try to predict some experimental data, and then you compare with the actual experiment. If there is no agreement, what you should do is accept it, go back, try to improve your force field, and run your simulation again.

Okay, so ideally we would like to be able to refine our force field parameters so as to reach agreement with experiment. Let's have a look at how force field parameters are traditionally derived. Here you have a table, from a recent review, that tries to be comprehensive about the most used force fields for biomolecular systems. Without going into the details, the important point is that some of these parameters come from experiments, some come from ab initio calculations, typically quantum chemistry calculations, and all of them come from information obtained on very small fragments. For instance, for Lennard-Jones interactions, people sometimes use experiments done on, say, methane in order to infer the Lennard-Jones parameters that will then be used for the methyl groups that are part of larger molecules. That's maybe an extreme example, but that's the idea. Typically, experiments are done on very small molecular fragments, and the same is true for quantum chemistry calculations. Here you see two structures that were used to fit torsional parameters in a work from 2012: it's basically a single nucleotide, and there are also cases where even smaller fragments are used. Clearly this is risky, because you fit on very small systems and then try to transfer the parameters to larger systems, where it's not guaranteed that what you obtained on the small system will still work. So the point is: can we find a way to improve force fields systematically using experiments performed on larger, macromolecular systems? For instance, going back to our difficult UUCG tetraloop: this free-energy minimum is a local minimum. Can I predict how I should modify my force field so that this minimum becomes the global minimum? I want to pull it down in the free-energy landscape so that this becomes the stable structure. How should I add a correction to my force field?

It turns out that there is a way to do this. The idea was pioneered in a 2008 work from the group of Kresten Lindorff-Larsen, and we used something along these lines, with some important extensions, a couple of years ago with Andrea Cesari. You will actually see a talk by Andrea on the last day of this conference, even though he has now left academia, so he will also talk about his job at Alliance. The idea is to define a function that tells you how happy you are with the results of your force field: typically, how far they are from the experiments.
Then, let's say you modify the parameters. Instead of running a new simulation every time you modify the force field parameters, you just reweight your previous simulations so as to estimate what would happen if you were using the modified parameters. You basically assign to each frame of your trajectory a weight like this one, which depends on the basis set you are using for your force field modification; for instance, you could say "I change all the dihedral-angle terms", and these lambdas are the parameters you are tuning. And since this can be written in analytical form, you can compute the gradient of your cost function (your happiness function, say) with respect to these parameters, and maximize or minimize it, depending on how you define it. In this case we tried to refit all the torsional parameters, so our reweighted ensemble is reweighted with an exponential of a linear combination of sines and cosines of the torsional angles: basically all the torsions in the backbone, which for RNA means alpha, beta, gamma, delta, epsilon, and zeta, plus the glycosidic angle, with separate parameters for purines and pyrimidines. Then you have to define your cost function and your training set, the set of systems on which you want to train your force field. Here we used two tetraloops, one kind of easy, GAGA, and one kind of difficult, UUCG, and for these two systems we added the population of the native structure to our cost function. In addition, we have the four tetramers for which accurate NMR data are available, and there we added the agreement with the raw NMR data. Clearly, when we combine data from different systems, we have to decide how much we care about one system with respect to the others, so we can tune these prefactors to set the relative importance of one system with respect to another. Okay, good. Then we can fit our parameters, but we have to be very careful, because it's very easy to overfit when you use this type of approach.
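A minimal sketch of the reweighting step just described, assuming per-frame torsion angles have already been extracted from the trajectory; the names and interface here are illustrative, not the actual code used in the paper:

```python
import numpy as np

KBT = 2.494  # kT in kJ/mol at ~300 K

def reweighted_average(obs, torsions, lam):
    """Estimate <obs> under a torsionally corrected force field by
    reweighting existing frames instead of rerunning the simulation.

    obs:      (n_frames,) observable per frame (e.g., native-state indicator)
    torsions: (n_frames, n_torsions) dihedral angles in radians
    lam:      (2 * n_torsions,) coefficients of the sin/cos basis,
              i.e., the correction parameters being tuned
    """
    # Correction energy per frame: linear combination of sin/cos terms
    basis = np.concatenate([np.sin(torsions), np.cos(torsions)], axis=1)
    delta_u = basis @ lam
    # Boltzmann reweighting; subtract the max log-weight for stability
    logw = -delta_u / KBT
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return np.sum(w * obs)
```

Because the weights are an explicit analytic function of the lambdas, the gradient of any cost function built on such averages is available in closed form (or via automatic differentiation), which is what makes the minimization practical.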
Ideally, you would like your force field modifications to be transferable to other systems, so the best approach is to use a cross-validation procedure. Here we used a three-fold cross-validation: we split our training set into three portions. Let's say set three is the information that we want the tetraloops to be as stable as possible, while sets one and two are two portions of the NMR data that we are using for the tetramers. In the first bar, we are training while ignoring the 3J couplings on the tetramers, and you see that the error on those data points is very large when they are left out, whereas when they are included it becomes significantly lower. Then, when you leave out set number two, again the error on set number two goes up. When you leave out set number three, the folded fraction of the tetraloops (here, the higher the better) basically goes to zero. In other words, this means that if you fit your force field only on the tetramers and apply it to the tetraloops, you will make them even worse, because your force field will be overfitted on the tetramers. This was to highlight the points that show overfitting; it means that you have to do something to take care of it. The standard way is to introduce a regularization term: you add something to your cost function that is multiplied by a parameter called a hyperparameter, and that basically measures how large your fitting parameters are. Clearly, if you set this alpha very large, the alpha times lambda squared term will dominate and all your lambdas will go to zero, so you will have no correction; whereas if alpha goes to zero, you will have no penalty, so you will do the full fitting, which will be overfitting. So you have to find a compromise, and that's a very standard procedure in the machine learning community: you do cross-validation, you compute the cross-validation error, which is the error on the system that was left out of the training, and you find the hyperparameter that minimizes this cross-validation error; you can find this value, whatever it turns out to be. Once you have tuned this hyperparameter, you can use it to fit again over the entire dataset, and you can see that if you do that, you will not overfit anymore. Here you see that even though this dataset was left out, its error is comparable to when it was included; the same happens here. And here you see that, clearly, if you include the tetramers in the training set, as in the first two bars, the folded fraction of the tetraloops will be high; if you don't include them, it will be lower, but still not zero, and better than what you get without any correction. So you are in a regime where the different portions of your dataset cooperate: if you fit on one, you improve the agreement with experiment of the other, instead of making it worse. Okay, great. This is good: it means we can hope our fitting procedure is transferable. We see that it gives parameters that transfer to the dataset that was left out, so we can hope it will transfer to a new dataset that we had not yet thought about when we ran our optimization.
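Schematically, the hyperparameter selection just described could look like the following. This is illustrative only; `fit_fn` and `error_fn` are placeholders for the regularized reweighting fit and for the deviation-from-experiment measure:

```python
import numpy as np

def cv_error(fit_fn, error_fn, datasets, alpha):
    """Leave-one-portion-out cross-validation error for one value of
    the regularization hyperparameter alpha.

    fit_fn(train_sets, alpha) -> lam, minimizing cost + alpha * |lam|^2
    error_fn(lam, test_set)   -> deviation from experiment on test_set
    """
    errors = []
    for i, test in enumerate(datasets):
        train = [d for j, d in enumerate(datasets) if j != i]
        lam = fit_fn(train, alpha)          # train with one portion left out
        errors.append(error_fn(lam, test))  # evaluate on the left-out portion
    return np.mean(errors)

# Scan alpha, keep the value minimizing the CV error, then refit on the
# full training set with that alpha:
# alphas = np.logspace(-3, 2, 20)
# best_alpha = min(alphas, key=lambda a: cv_error(fit_fn, error_fn, sets, a))
```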
Okay, putting this whole procedure together: we first run a number of simulations. Here you see some of the details of the setup; it's very technical, but if you're interested you can have a look. Then we take this big database of simulations and fit the corrections with cross-validation: we fit on a portion of the data points, validate on the remaining part, and use this procedure to choose our hyperparameter. Once we have fixed the hyperparameter, we optimize the parameters again, run new simulations, combine the new simulations with the old ones, and fit our parameters again; ideally, this should be iterated until it converges. Here you see the parameters fitted at the first iteration plotted against the parameters fitted at the second iteration: they are basically converged, so one iteration is sufficient in this specific case. Probably the reason is that, in this specific case, the correction we are applying is very small; it's quantitatively very small. If you look at the correction per dihedral angle, the average correction is a fraction of a kcal/mol. That's very small: typically people fit these torsional terms on QM calculations, and the accuracy of the QM calculations themselves is worse than this. So, let's say, we are making a change that is within the error of the QM calculations that were used to tune these parameters. And even though it's so small, this change has a significant impact, as you can see here. These are the results with the refined force field: you have the original force field in green and the refined one in orange. In this plot, where we would like this to be the global minimum, it's not yet the global minimum, but at least it's going down, so we are going in the right direction. We are able to increase the population of the native structure of UUCG from a very small number to a number that is, honestly, still small, but larger. And for GAGA things are a bit better: we can reach a population of around seven percent, which is not so bad. So clearly we are going in the right direction when correcting the force field in this way, even though we are not yet at the point where the force field would be considered to work for these systems.

Okay, let me summarize this part. The idea is to fit force fields, or corrections to existing force fields, based on experimental data and on simulations done on macromolecules, not on small fragments. To avoid side effects, you have to fit all the parameters simultaneously, and you also need to fit them simultaneously on multiple systems and to take care of overfitting, because overfitting is a real risk; here it's very useful to borrow ideas from machine learning, like cross-validation and regularization techniques. Clearly, there is still some work to do to get these tetraloops right, but if you're interested in this, I invite you to watch the talk by Thorben Fröhlking tomorrow, where we will show new results on tetraloops. And if you want to see another example of this type of idea, you can look at the poster by Valerio Piomponi, who shows how it is possible to fit charges for chemically modified nucleotides using a very similar procedure.
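The iterate-until-converged protocol described above, as a schematic sketch; all the callables are placeholders standing in for the actual machinery:

```python
import numpy as np

def iterative_refinement(run_md, fit_with_cv, systems, n_params,
                         max_iter=5, tol=1e-3):
    """Schematic simulate -> reweight-fit -> resimulate loop.

    run_md(lam, system) -> list of frames sampled with correction lam
    fit_with_cv(frames) -> refined correction parameters (np.ndarray),
                           with the hyperparameter chosen by cross-validation
    """
    lam = np.zeros(n_params)  # start from the uncorrected force field
    frames = {s: run_md(lam, s) for s in systems}
    for _ in range(max_iter):
        new_lam = fit_with_cv(frames)
        for s in systems:
            # rerun with the refined parameters, merge with older trajectories
            frames[s] = frames[s] + run_md(new_lam, s)
        if np.max(np.abs(new_lam - lam)) < tol:
            break  # parameters no longer change: converged
        lam = new_lam
    return lam
```

The rerun step matters because reweighting is only reliable while the corrected ensemble overlaps the one actually sampled; in the case discussed above, one iteration was already enough.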
Okay, good. So I'm close to the end; I still have maybe five or six minutes, and I want to give you a flavor of the next thing. I will not show you the results, but I want to tell you that there is an alternative procedure one can use. For this second part I will just give you the theory and skip the example, because I will not have time for it. If you have the experimental data before you run the simulation, one could say: why should I try to predict experiments that I already know? Instead of predicting them, I could use them as restraints and try to force my simulation to agree with them. Then I could try to predict something else and compare with another experiment (not with the same experiment that I enforced during my simulation, obviously), use this as a kind of validation, and, if they agree, interpret my results. The idea of this alternative approach is not to obtain corrections that are transferable to a new system for which you don't have the experiment, but rather to focus on the system for which you do have the experiment, and to get the most information out of the experiment by enforcing agreement with it during the simulation.

Here, what we want to do is enforce agreement with experiments that report averages over an ensemble. Typically, an experiment could report the average distance between two protons, and there are many possible distributions that have the same average distance between those two protons. The guiding principle you can use here is the maximum entropy principle, which tells you: if you know the average distance between the two protons, and you have an initial idea of what the distribution of your conformations would be, just make the minimal change so that the average distance matches the experiment. The minimal change can be quantified as the one that maximizes the relative entropy between the posterior distribution and the prior distribution. This is a very standard approach, and you can show that, by maximizing this functional, the new energy function you should use is equal to the original one plus the quantity that you are measuring (in my example, the distance between the two protons) multiplied by a factor that you have to determine iteratively. If you compare this conceptually with what you saw before: in the other example, we were choosing the functional form of the correction ourselves. We said: we are unhappy with the parametrization of the dihedral angles, so we decide to change the dihedral-angle parameters until we match experiments. Here, instead, we say: we don't care which functional form we add to our force field; we just want to make the minimal modification, and this prescription tells us that the minimal modification is obtained by biasing exactly the same quantity that is being measured. Clearly, this modification to the potential energy function will not be transferable, because unless you measure the distance between the same two protons in another system, you will not be able to apply the correction to the new system. Okay, good. I wanted to show you an example of this idea as well, but I will not do it because there is no time, so I will completely skip this second part.
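In formulas, the maximum entropy prescription just described reduces to the following standard statement, with f the measured quantity (for instance an interproton distance) and f_exp its experimental average:

```latex
P(\mathbf{x}) \;\propto\; P_0(\mathbf{x})\, e^{-\lambda f(\mathbf{x})}
\qquad\Longleftrightarrow\qquad
U'(\mathbf{x}) \;=\; U(\mathbf{x}) \;+\; \lambda\, k_B T\, f(\mathbf{x})
```

The single Lagrange multiplier lambda is adjusted iteratively until the simulated average of f matches f_exp; this is the factor "that you have to determine iteratively" mentioned above.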
Let me just go to the take-home messages of this talk. The first thing: if your MD simulation does not agree with experiment, hide the mistake and try to publish your paper anyway. No! Try to understand where the problem is. At times the problem is in the experiment; more often, the problem is in the simulation. Try to understand the origin of the discrepancy and try to find a solution. Then, what I tried to discuss (mostly the first one) were two possible types of solution. One is: remember that the accuracy of your simulation depends a lot on the accuracy of the force field you are using, so you can try to improve that accuracy. That's very useful, because you will have a better force field that might also be used on a new system; it might be transferable. If you want the refinements to a force field to be transferable, you have to fit them using a lot of data on multiple systems, so that the result will be transferable. I think this is a very promising avenue, and we are really making a lot of effort in this direction, also in collaboration with other groups. The second type of approach, which I only quickly mentioned, is, if you want, a bit more pragmatic: instead of making a general force field that will work for any other system, try to just improve the results for the specific system you are interested in. So try to force your simulation to agree with experiment; and again, there are theoretical prescriptions that tell you the minimal modification you should make in order to achieve agreement with experiment.

All these simulations were done with enhanced sampling, even though I didn't go into the details. To do enhanced sampling, and to do these maximum entropy simulations, we use a plug-in called PLUMED a lot, which you might be interested in. If you want to learn how to use PLUMED, you can watch our masterclasses: a series of lectures, with tutorials, that we gave last year. They are all available on YouTube, so just have a look and you should be able to find them. And stay tuned, because we plan a new series on more advanced topics for 2022. If you are already at a higher level and you want to really understand the details of one of the works we published, I want to mention this initiative that we launched a couple of years ago, PLUMED-NEST, which is basically a repository of protocols. You can contribute the protocol you used for a given paper, or you can browse and find, as in this example, exactly the protocol we used for the paper by Andrea Cesari that I discussed; you can really browse the input files and understand how to do it.

Okay, let me close by acknowledging my collaborators. I want to acknowledge all the members of my group, including the guests that we have now; I also have a couple of virtual guests who are just visiting me on Slack. I want to mention Mattia, who will talk tomorrow; Thorben, who will also talk tomorrow; Valerio, who is giving a poster; and then Vittorio, who should be attending, and Isabel, who is also attending. The work that I presented today was actually done by previous members of the group; in particular, I want to thank Andrea Cesari, whom you will see on Friday, for his work on these force-field fitting strategies. And thank you all for your attention.

Thank you, Giovanni. Okay, do you have time for some questions now, Giovanni? Yeah? Okay, so if you have a question, put it in the chat or unmute yourself and go ahead. Don't worry, my alarm will ring before I have to leave to pick my kid up from the kindergarten, so please ask. Hi, hello, thanks for the talk. First of all, I have no experience in nucleic acid simulations, so this might be a dumb question, but about this decision of parametrizing on big chunks of biomolecules instead of on single nucleotides, which is again the opposite of what we do in proteins, right? I was wondering about the whys of this decision, and whether this is something people do in this field.
I see. Yeah, that's a very good point. It's actually something that people have been doing in this field for many years, even though not in an explicit way. If you look at many protein force fields as well, you can see that, for instance, the torsional angles were derived by doing QM scans, but they were reparametrized when it was observed through simulations that the population of alpha helices was too large or too low. This observation, that the population of alpha helices is too large or too low, is exactly the type of information that we are using here. I would prefer to have a systematic approach that does this in an automatic way, because it allows me to remove biases. For instance, if I say I'm parametrizing my force field to make the population of alpha helices higher, clearly I cannot validate it on that same information; I should validate it on something else. So doing an explicit training of the force field on these populations is, I think, more rigorous; but if you look at the literature, the truth is that people have been doing this for twenty years. And indeed, the first-generation force fields, which, let's say, apparently worked, were derived mostly from fragments and QM calculations, because simulations were short enough that nothing was happening. I see this very clearly in the nucleic acid community, which I know better: you regularly have people getting faster machines, identifying an issue in the force field, fixing it, and then, with even faster machines, finding a new issue. That's the way it works. I think the key is to combine enhanced sampling simulations with automatic fitting on these quantities instead of manual fitting. Thank you. Okay, good. Sorry, I think I have to go now; anyway, I will be available this afternoon on gather.town if you want to discuss these topics further.