So welcome everybody to BioExcel webinar number 64. Today we have a BioExcel use-case webinar, in particular on BioExcel HPC workflows, their predictive power and application in pharmacology. Today's speakers are three, and they come from the Institute for Research in Biomedicine Barcelona (IRB Barcelona): Adam Hospital, Miłosz Wieczór and Federica Battistini. I host this webinar; I'm Alessandra Villa and I come from the KTH Royal Institute of Technology, and with me there is Stefan Falk from the University of Edinburgh, today's co-host. The first presenter is Adam Hospital. Adam is both a researcher and a software engineer, and a postdoc in molecular modeling and bioinformatics; his key work is on workflows and on structural flexibility web servers and databases. Miłosz is a Marie Skłodowska-Curie postdoctoral fellow; his research is based on DNA damage and epigenetics, but he is also very interested in visualization and in force fields. Federica is a postdoc in structural and computational biology; her research interests are nucleic acids and sequence dependence, connected to epigenetics as well, and she has worked on a DNA-protein database. More on their profiles you can find on the BioExcel webpage where you found this webinar. So now I will give the word to Adam to start. Thanks, Alessandra, and welcome again to this edition of the BioExcel webinars. In this case, we are going to present a set of BioExcel workflows focused on HPC resources, and we are going to try to demonstrate the predictive power that they have, with three different applications in pharmacology that we will present today. Let me start with a very brief introduction to BioExcel for those who still don't know about it. It is the Centre of Excellence for Computational Biomolecular Research, a project that started seven years ago already as a European Commission Horizon 2020 funded centre.
It has become a central hub for biomolecular modeling and simulation, and as you can see on the screen, many of the important research groups, universities and also private companies in Europe are involved in the project. Our area of expertise is working with quantum mechanics and molecular mechanics, with small molecules and macromolecules. One of our main objectives is enabling better science, and we do that first by improving the performance and functionality of key applications. Our key applications in BioExcel are GROMACS for MD, HADDOCK for protein docking, pmx for free energies, and CP2K and CPMD for QM and QM/MM calculations. We are also developing user-friendly computational workflows; basically, these workflows join together our key applications and try to tackle very important and useful scientific cases. We actually defined the set of use cases in the project three years ago, and today we have chosen three of them to demonstrate the predictive power of our workflows. The projects that we are going to present today can be classified as data-driven science, in the sense that we take information from descriptive analytics that explains what happened, we try to predict what will happen using this information, and we generate information that can hopefully be used in prescriptive analysis. How we do that is basically by running macromolecular simulations. These simulations generate lots of data that we store and then analyze, and with these analyses we try to predict, and we then compare, or try to correlate, these predictions with information taken from experiments. Sometimes data from the experiments are also added to the data that we are generating, and we then use these data, for example, to train machine-learning algorithms.
But what makes our approach a little different from the common data-driven science pipeline is that we are using supercomputers, or HPC facilities, in a very efficient way (and I will try to convince you of that in the next couple of slides) to generate this kind of data, to run these simulations, and to reduce the time to result. So I will give you a couple of examples of the HPC workflows we are developing in BioExcel. The first one is the typical case of the preparation of a molecular dynamics simulation: the MD setup pipeline together with the production MD simulation, but with the automatic modeling of protein residue mutations included at the beginning. This is the workflow you see here. We built it with our BioExcel building blocks (BioBB) library, wrapping the GROMACS MD package, and we execute, launch and control the workflow using the PyCOMPSs workflow manager developed at the Barcelona Supercomputing Center. Just as an example, we can launch this particular workflow to generate 12 different mutations automatically and produce 10 nanoseconds of MD simulation for each, using four nodes per simulation in MPI regime on the MareNostrum supercomputer, more than 2,000 cores in one single job, with a final time to result of eight hours. And, very importantly for the HPC facilities we are using, we can demonstrate 100% CPU usage on all of the nodes in the calculation. This is, if you want, a fairly easy example, but we can go to more complex workflows like this one: a non-equilibrium free energy calculation workflow. Basically, it gives us this ΔG number here, which tells us how much a protein residue mutation is affecting the binding of the protein with a ligand or a drug molecule.
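To make the fan-out pattern concrete, here is a minimal, purely illustrative Python sketch of the idea: each mutant simulation is independent, so the whole set can be dispatched in parallel in one job. The mutation names and the `run_mutant_md` stub are hypothetical stand-ins, and a thread pool takes the place of PyCOMPSs and the real BioBB/GROMACS calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative EGFR mutation labels; the real workflow takes any list.
MUTATIONS = ["L858R", "T790M", "G719S"]

def run_mutant_md(mutation):
    # Stand-in for the real pipeline: BioBB building blocks would model the
    # mutation, set up the system, and run a GROMACS production MD here.
    return f"{mutation}: 10 ns trajectory completed"

def run_workflow(mutations):
    # PyCOMPSs would schedule these independent tasks across HPC nodes;
    # a thread pool illustrates the same embarrassingly parallel fan-out.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_mutant_md, mutations))
```

The point of the sketch is only the structure: because no mutant depends on another, the time to result is set by one simulation, not by twelve in sequence.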
This is a complex workflow that involves running many different non-equilibrium MD simulations: short non-equilibrium simulations, but a big number of them, in the forward and also in the reverse direction. This is again built using the BioExcel building blocks library, in this case wrapping GROMACS and pmx, two of our main key applications, and it is launched and controlled by the PyCOMPSs workflow manager. And again, as an example, you see how we are using 100% of all the nodes, more than 1,500 cores in this case, with a time to result of five hours, something that was completely impossible before: it was weeks of calculation on a small local cluster. So we are using this kind of workflow in the applications that we are going to present today. Those are the projects, and I will start with this one: in silico prediction of the impact of genetic variability on drug sensitivity and resistance patterns for clinically relevant EGFR mutations, all of that from atomistic simulations. For the sake of time I will be very quick on this project, but the story started four years ago as a collaboration between Nostrum Biodiscovery, a small pharmaceutical company here in Barcelona, and BioExcel. We were very interested in the EGFR protein because, as you all know, this protein is involved in different types of cancer, like carcinoma, glioblastoma and lung cancer too. It is integrated in a cell cycle progression pathway, so it is a key component of tumor cell proliferation and growth. And we know it is a complex molecule that has three different domains: the extracellular domain that binds the epidermal growth factor, here in red; the transmembrane domain; and the intracellular domain, which is the tyrosine kinase.
We knew that two different therapeutic approaches existed at the moment: monoclonal antibodies, which try to block the binding of the epidermal growth factor to the extracellular domain, and the ATP blockers here in the intracellular domain. We were very interested in the latter, in the tyrosine kinase domain, and when we started to look at the literature we knew that many different mutations existed, some of them conferring resistance and some of them enhancing the binding with the different existing FDA-approved drugs. So we wondered if we could try to predict the effect of these mutations a priori, thinking about personalized medicine, about giving the right treatment to the right person just by looking at the sequence and the mutation. I will try to make a long story short, but we started with the sequence, trying to understand if all the information we needed to predict the effect of these mutations was already encoded there. So we started with a multiple sequence alignment of different human tyrosine kinases. Basically, zero Shannon entropy here means that a position is very well conserved, while a high Shannon entropy means that the position is variable. The red circles you see here are our mutations, 26 different mutations that we wanted to study, and in yellow, the mutations that were giving resistance to the drug molecule. You see that some of them are in conserved regions and some of them are in variable regions, so there is no information here for us to predict the effect of these mutations. If we analyze the BLOSUM62 substitution matrix and look at the most disruptive kinds of mutations, like this one here, it was not correlated with any kind of resistance; it was a sensitive one, one that gives sensitivity to the drug.
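The conservation analysis described above can be reproduced generically in a few lines: per-column Shannon entropy of a multiple sequence alignment, where 0 bits means a fully conserved position. This is a minimal sketch, not the exact pipeline used in the talk; the toy alignment is invented for illustration.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 = fully conserved."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy alignment of four sequences, three columns (single-letter amino acids)
alignment = ["KAD", "KAE", "KVD", "KLE"]
columns = list(zip(*alignment))          # transpose: sequences -> columns
entropies = [column_entropy(col) for col in columns]
# Column 0 (KKKK) is conserved; columns 1 and 2 are variable.
```

Mutations falling in high-entropy columns are the "variable region" cases where, as noted above, conservation alone carries no predictive signal.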
And, for example, this one with a mild substitution penalty was giving resistance to all the different drugs. So no information here either. We then went to the prediction of pathogenicity. In this case, in red you see the probability of these mutations being pathological, and in green the probability of being neutral. If you look at the whole of EGFR and then at the particular tyrosine kinase domain, in red, you see that most of the mutations falling in the tyrosine kinase domain are pathological, or have a high probability of being pathological. If you highlight this region here, you see that most of our mutations do of course have a high probability of being pathological, but almost all of the residues in their neighborhood, and many across the kinase domain, also have a high probability. So again, no information with predictive power for the effect of these mutations. So we moved from sequence to structure. The first study we did was to take a look at the molecular interaction potentials between the drug molecules and the binding site of the tyrosine kinase. At the top you have examples where we do have predictive power. In red circles, the mutations giving resistance to the drug; in green, the mutations giving sensitivity. These plots show the difference in the interaction patterns between the wild type and the mutated protein. Here, for example, you see that if the mutated residue generates a worse interaction pattern, it gives resistance; if the mutated residue gives a better interaction pattern, it gives sensitivity. So in these cases we have predictive power. In these other cases we cannot say anything, because the mutation is not affecting the interaction pattern; all of them were experimentally known to be sensitive, so we are okay there.
But in these cases, for example in this one, the interaction pattern was much better for the mutant, whereas experimentally we know it gives resistance. So again, we don't have predictive power using this approximation. We thought that this might be due to water molecules, to solvent bridges, or maybe to how the side chains reorganize, so we went to a more sophisticated method: we tried a docking protocol. Again, we had positive results here. This is the mutant, this is the wild type, and you see that, for example, the ligand goes to the binding site correctly in the wild type, while with the mutant it does not, so the resistance is predicted. But in this other case the ligand goes to the binding site even better than in the wild type, even though we know experimentally that it is resistant. So again, not enough predictive power for us. Finally, we went to the most sophisticated and computationally expensive method, of course: one of the HPC workflows I presented at the beginning, the non-equilibrium free energy calculation. We ran 26 mutations multiplied by two directions, so a lot of different calculations, for all the different mutations and all the different ligands. In the final results of the prediction, you see in gray the cases that were not correctly predicted with the interaction patterns or with the docking protocol; with the final free-energy numbers, we have 100% accuracy. Not without some controversy, of course, but we are very excited about this; we think the method is working here. We have prepared a draft that is now submitted, and of course the next question is whether we could apply this method to different systems. With that, I'm going to hand the presentation over to Miłosz for the second project.
Okay, so I'm going to start with the zoonotic transition, which we sampled with the same approach of computing alchemical free energies of mutations, trying to trace the evolutionary pathway that took the SARS-CoV-2 virus from the bat variant to the human variant. Then I'm going to explore a little the idea of humanized bat polymorphisms, a related idea that we formulated as a hypothesis to maybe explain part of this zoonotic transition. And then I'm going to focus on the so-called Spanish mutant: a single mutation that surged in Spain in mid-2020 that we worked a lot on, and we're right now publishing a final report on this. To start with some background on what everybody knows about: the pandemic. Because of different interactions between humans and animals, due to deforestation or extensive agriculture, we come into contact with many pathogens, and many pathogens that we're dealing with, including SARS-CoV-2, came from animals. For SARS-CoV-2 there is of course still some controversy with respect to what the exact zoonotic pathway was, but the consensus view a while ago was that the closest known relative was RaTG13, a virus isolated from the bat species Rhinolophus affinis. Very recently, through massive screening, a new bat virus was identified that is even closer to SARS-CoV-2 and could be another intermediate; we're looking into that, but there's still some work to be done there. And to the point: in the receptor binding domain of the virus, there are about 20 mutations that were required to go from the bat virus identified in 2013 to the human virus identified in 2019. On the other hand, the receptor, the ACE2 receptor, has about 10 differences between the bat species that was infected with the bat virus and the human counterpart.
So the challenge we wanted to take on, to trace this zoonotic transition, was: which were the most important mutations that enabled the change of host, that is, infecting the human host? We know now from experimental investigations that there is a preference of about 3 kcal/mol for the human receptor; the human virus prefers the human receptor over the bat receptor. What we did was follow the protocol that Adam already introduced. The first idea was to use the non-equilibrium protocol. Here, horizontally, we have the binding free energy, going from, let's say, an unbound molecule to a bound molecule (in this case this would be the virus, but here it is just simplified), and the modified binding free energy, the binding free energy of one moiety with the other. For the relative free energy we use the thermodynamic cycle and calculate the alchemical, vertical values, changing alchemically, so to say, the chemical constituents of the molecule. And of course, because free energy is a state function, the cycle has to close with a total value of zero, so we can derive the horizontal values from the vertical values. As Adam showed, we have this nice theorem that says that the intersection of the work distributions gives us the free energy; these are the famous results first from Jarzynski and then from Crooks. But sometimes we have cases which are kind of ugly, because it's very hard to pinpoint the intersection: this is 320 kJ/mol, this is 280, and we know the intersection is somewhere in between, but it's very hard to tell precisely. So we're dealing with those two classes of problems.
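The intersection argument can be sketched numerically. Assuming, as is common in practice, that the forward and the negated-reverse work distributions are roughly Gaussian, the crossing point (which by the Crooks fluctuation theorem sits exactly at ΔG) can be located by bisection between the two means. This is an illustrative estimator under that Gaussian assumption, not the exact pmx/BioExcel analysis code; the work values below are invented.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def crooks_gaussian_intersection(work_forward, work_reverse, tol=1e-9):
    """Crooks: P_f(W) / P_r(-W) = exp[(W - dG)/kT], so the two distributions
    cross at W = dG.  Fit Gaussians to the forward works and the negated
    reverse works, then bisect between the means to find the crossing."""
    mu_f, s_f = statistics.mean(work_forward), statistics.stdev(work_forward)
    neg_rev = [-w for w in work_reverse]
    mu_r, s_r = statistics.mean(neg_rev), statistics.stdev(neg_rev)
    lo, hi = sorted((mu_r, mu_f))
    f = lambda x: gaussian_pdf(x, mu_f, s_f) - gaussian_pdf(x, mu_r, s_r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:  # sign change: crossing lies in [lo, mid]
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

When the two distributions barely overlap (the "ugly" 320 vs 280 kJ/mol case in the talk), any such estimator becomes unreliable, which is exactly why the speakers fall back to equilibrium protocols.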
And so the idea we had was to start with the cheaper non-equilibrium protocol; as Adam mentioned already, this can be done in hours, in a massive way, and we can go both ways, starting from both the bat setup and the human setup, the bat virus and the human virus, accumulating those mutations in both directions. And when the numbers from the two directions do not match up, we can go to an equilibrium protocol, which is more customized and more robust, but also much more expensive, I would say at least an order of magnitude more expensive. When we did this, we took all the mutations, calculated the numbers and added them up, ending with something like minus 10 kcal/mol. Here are the numbers where we did the repetitions with the equilibrium protocol, the orange bars; the blue bars are the ones from the cheaper non-equilibrium protocol. As you can see, sometimes they match up very nicely and sometimes they're completely opposite, but we're inclined to trust the equilibrium one more, because it's just much more robust and much more formally correct when the non-equilibrium protocol runs into trouble. But again, we still had the problem of having the wrong total value. What we did was notice that this whole difference we were seeing, this whole problem, boils down to a single mutation, which was problematic. When we looked at the literature, it turned out it should be much closer to 2 kcal/mol, 2.5 maybe, because people had actually done the experiments and seen what the preference was. So we went back to the simple, let's say conceptual chemical side and looked at pKa's, because if you look back at the mutation, this is a mutation of asparagine to lysine and asparagine to aspartate.
So we looked at the more recent structures, and it turned out that if you try to predict the pKa of the residues, this aspartate should actually be an aspartic acid, because it has a predicted pKa of 7.5. Of course this prediction is never perfect, but it already tells you that there is a huge tendency for this single residue to become protonated. Then, doing a custom mutational analysis going from aspartate to aspartic acid, we recalculated the numbers, correcting for the changes in pKa, and found that the two mutations combined should be around minus 3.4, and the whole data set that we had arrives at minus 3.6 in that case. So we're quite close to the experimental data for those two mutations, and we're also very close to the experimental numbers for the whole set of mutations, so for the full transition between the two species. And so we're kind of happy with that. We cannot say that we have perfect precision for every single mutation, because there is some fortuitous error cancellation: some of the errors will go up and some will go down with respect to experiment, they will mostly cancel each other, and we end up with something that is surprisingly close to the experimental number. Then we can of course analyze individual values and say: okay, this one was important; this one was probably a random mutation that didn't contribute any advantage to the virus. Then there was a curiosity that we saw. Actually, a friend of ours did a bioinformatic analysis of the metagenomic data set that was published for the bat, and it turned out that there is a subspecies of this R. affinis bat, with an interesting diet, that carries humanized residues at two positions: where the typical bat has arginine and glutamate, and humans have histidine and aspartate, this subspecies has the human-like residues.
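The pKa-based correction follows from textbook acid-base thermodynamics. As a hedged sketch (the exact correction scheme used in the study may differ), the free-energy cost of deprotonating an acidic residue at a given pH is RT·ln(10)·(pKa − pH); a positive value means the protonated, neutral form is favored, as for an aspartate with a predicted pKa of ~7.5 at physiological pH.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def deprotonation_dg(pka, ph, temperature=298.15):
    """Free-energy cost (kcal/mol) of deprotonating an acid at a given pH:
    dG = RT * ln(10) * (pKa - pH).
    Positive -> the protonated (aspartic acid) form dominates;
    negative -> the deprotonated (aspartate) form dominates."""
    return R_KCAL * temperature * math.log(10) * (pka - ph)

# A normal surface aspartate (pKa ~ 4) is deprotonated at pH 7.4,
# while the anomalous residue (predicted pKa 7.5) prefers the neutral form.
```

Corrections of this size per protonation state are exactly what shifted the problematic mutation toward the experimental ~2 kcal/mol value described above.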
So we wanted to see whether that could already have created some sort of evolutionary pressure to adapt to this local change. So again we did a mutational free energy analysis on these double mutants to see what happens, what numbers we get. And it turns out that as we go from the old bat virus to the human virus, the preference for these two residues goes from unfavorable, about plus 1.4, to favorable in the case of the human virus, minus 0.7. So it already suggests that there might have been changes in the virus that adapted to those two residues being present in the bat subpopulation. Of course, this is very speculative, and it's hard to say this was the exact evolutionary path, but it gives food for thought for these ideas of different subpopulations of bats harboring different species or variants of viruses that at some point become just ready to infect humans. In the last part, I want to talk a little about the Spanish mutant. This is another case where we combined many computational and experimental techniques to trace the impact of a single mutation. This is a mutant that first appeared in Spain, I think here in Catalonia, in summer 2020, and it reappeared recently, in late 2021, on top of Delta. So this was very interesting for us. We were kind of expecting this to be just a random fluke in the data; there's always some sort of population drift that will cause random mutations to gain popularity for no good reason. But when a mutant comes up twice on different genetic backgrounds, maybe it actually has some sort of evolutionary advantage. So we started looking at it. The problem is that the mutation is in the NTD, the N-terminal domain, the part that is not really involved in binding the receptor.
So we tried to experimentally trace all the interesting things: glycosylation, binding of antibodies, binding of the receptor. And there are very few hints; there is a very small advantage in receptor binding, but it's almost negligible. So we turned to simulations. The same alchemical simulations that we did before don't seem to favor any conformational state, open versus closed, of the spike, so again it doesn't really seem to favor easier binding of the receptor. But when the cryo-EM people obtained the structures using cryo-electron microscopy, there was a hint that the B-factors, the estimates of flexibility, were significantly higher for the mutant, the A222V mutant. In parallel, we were doing simulations, and what happened there was that we started seeing this sort of bimodal behavior. When you compare the open ensembles between red and blue, there is always a component of the blue that is to the right of the red part. So the double mutant, A222V with D614G (the latter being the oldest mutation that spread throughout the world), is always a little more open in the open state than the base variant. That means the additional mutation confers a small structural preference for more extreme opening: it doesn't really favor open versus closed, it favors more open versus less open. We tried to explain that in structural terms, and we saw a visible, quite notable difference in the contact networks between the RBD, the receptor binding domain, and the NTD, the N-terminal domain where the mutation is. We could see very easily that, across different sets of simulations, the connectivity between those two domains was much weaker in the Spanish variant. So that suggested that with this modification, the RBD has much more flexibility, much more liberty, so to say, to move around.
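The contact-network comparison above boils down to counting inter-domain residue contacts within a distance cutoff: fewer RBD-NTD contacts means weaker coupling and a more mobile RBD. Here is a generic sketch; the coordinates and the 8 Å cutoff are illustrative choices, not the study's exact contact definition.

```python
import math

def contact_count(domain_a, domain_b, cutoff=8.0):
    """Count pairs of representative atoms (e.g. C-alpha (x, y, z) tuples),
    one from each domain, closer than `cutoff` Angstroms.  A lower count
    between two domains indicates weaker inter-domain coupling."""
    return sum(
        1
        for a in domain_a
        for b in domain_b
        if math.dist(a, b) <= cutoff
    )

# Toy coordinates standing in for RBD and NTD C-alpha positions
rbd = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ntd = [(0.0, 6.0, 0.0), (20.0, 0.0, 0.0)]
```

Comparing such counts between the base variant and the A222V/D614G ensembles, averaged over trajectory frames, is one simple way to quantify the "weaker connectivity" observation.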
And we don't really have a good biological theory of what happens, but this is a structural and dynamic observation that we could verify from several angles and explain, and if there is an explanation one day, this can be a piece of the puzzle of how the virus adapts to different conditions in the human body. So, to wrap up my part: the workflows that Adam introduced are often very useful in answering biological questions. Sometimes we need to put additional layers of complexity on top of that, so developing more robust strategies, like combining the non-equilibrium and equilibrium protocols, is going to be crucial in generating (maybe "black box" is a big word) the sort of strategies that will work for any set of biological problems. And we need to remember that there's always some fundamental chemistry that we have to keep in mind, as in the pH and pKa case. Then, we can easily generate new hypotheses if we combine bioinformatic insights with those alchemical free energies; this can be quite inspiring for the community when looking for new ideas. And by combining these methods, alchemical free energies with a lot of theory and with multiple equilibrium simulations, we can get this nice multi-angle characterization of, in this case, very big biological systems, which can easily be combined with experimental insights to tackle those generic big problems in biology. And now I'm passing the voice to Federica, so just a minute. Hi, everyone. In this last part of the webinar, I'm going to talk about a project that we're developing using workflows: a DNA-affinity and machine-learning approach to predict the binding affinities of DNA to transcription factors. So what's the aim of this project?
What we wanted was to tackle a very big question: how, for example (in this picture you can see the DNA packed inside the nucleus, and here the DNA that is unwrapped), transcription factor proteins can recognize specific tracts of sequence and place themselves on what are called transcription factor binding sites, the most preferred sites. It is very well known that the DNA sequence is directly related to function: protein recognition, protein-DNA binding, genome organization. In this project we added another layer to this problem: the idea that the DNA sequence has a physical code, encoded in the structure, the conformation, the flexibility and the properties of the DNA, and we went through this layer to understand how it can be connected directly to function. We did this using machine-learning workflows, and I will show you the scheme of how we built these workflows and this machine-learning algorithm. We used experimental data (I will go through the in vitro data we used) and then computationally derived structural DNA properties as features to train and test our machine learning. I added here on the right of the screen some structures of transcription factors, because I wanted to show people who are not so familiar with transcription factor structures (you can see the DNA in pink) that every transcription factor can have a different effect on the DNA. At the top, you can see a transcription factor that doesn't affect or deform the structure of the DNA, and here you can see one that completely bends and disrupts the DNA. So we have to consider in this project that we are taking into account many transcription factors that have different binding to the DNA, different recognition and different effects on the DNA.
So here I go through the details of our machine-learning scheme. As input we have a DNA sequence, so coded as ACGT, normal nucleotides, and we train and test on labels taken from in vitro experiments; in a bit I will give a detailed description of the in vitro experiments we used. The features we used to train the method involve DNA properties, always at the tetramer level: we divide the DNA into tetramers and consider the conformation, the sequence presence and the electrostatics of the DNA, and now I will explain the features and the labels in detail. We trained the method on 80% of the in vitro experimental data using random forest regression, and then tested on the remaining 20%. For the sake of time I'm not going into details, but the reliability of the data has been tested using bootstrapping and cross-validation, to be sure that we were not biasing the results with the test set we were using. Going into the details of the features: here you can see the DNA and what we decided to use as parameters. Mainly, these are what are called shape-based algorithms: the conformation of the base pair in a tetramer environment. Here you can see the base-pair parameters, six movements, three translational and three rotational. We calculated the average conformation for each base pair through molecular dynamics simulation, and we also calculated the force constants, so the flexibility of each base pair, its ability to move in these six directions. We also used the sequence pattern, so the presence probability of each tetramer, and we added a feature that is a direct readout: for each base pair we described, from the major and the minor groove, how many hydrogen-bond donors and acceptors there were on each side, so that we could also describe a bit of the chemistry of the base pair.
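The tetramer featurization step described above can be sketched as a sliding 4-base window over the sequence, with each tetramer looked up in a table of MD-derived average conformation and stiffness values. The table below is hypothetical (only illustrative numbers); the real values would come from the parmbsc1 MD averages the speaker describes.

```python
# Hypothetical per-tetramer property table (illustrative numbers only);
# the study derives such values from parmbsc1 MD simulations.
TETRAMER_PROPS = {
    "AATT": {"twist": 32.1, "k_twist": 0.045},
    "ATTC": {"twist": 34.0, "k_twist": 0.038},
}
DEFAULT = {"twist": 34.3, "k_twist": 0.040}  # fallback for unlisted tetramers

def tetramer_features(sequence):
    """Slide a 4-base window over the sequence and collect, for each step,
    the average conformation (here: twist) and its force constant.  These
    per-tetramer values are the structural features fed to the regressor."""
    feats = []
    for i in range(len(sequence) - 3):
        tetra = sequence[i:i + 4]
        props = TETRAMER_PROPS.get(tetra, DEFAULT)
        feats.append((tetra, props["twist"], props["k_twist"]))
    return feats
```

In the real workflow all six helical parameters plus stiffness, sequence presence and groove donor/acceptor counts would be concatenated per tetramer into the feature vector for the random forest.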
I want to point out that for the analysis of the dynamics, so the average conformational parameters and the force constants, we developed another workflow alongside, and you can see it in a Jupyter notebook, where it's possible to run the MD simulations and, on top of that, calculate the parameters and the force constants for each DNA and each MD simulation. In this project we did it using parmbsc1, the newly developed force field, for each tetramer, but in any case, if someone wants to apply these parameters to any other case, they can compute them using this other workflow that we developed. Now I will describe a bit the labels that we used, for those who are not familiar with the experiments. We used high-throughput SELEX (HT-SELEX) experiments and protein-binding microarrays (PBM). I'm going very fast on this, just to show you, first of all, how using different experiments we get different data sets and different results, in different formats. For the HT-SELEX experiment, this is how it works: usually short DNA oligomers are incubated with the transcription factor of interest; the transcription factor binds to the most preferred sequences. Then there is a filtering step, the removal of the free DNA, so that only the DNA bound to the protein is kept. Then there is elution, the separation of the DNA from the protein, and amplification of the sequences that were bound. The protocol of an HT-SELEX experiment usually runs between four and six cycles, and usually the penultimate cycle is the one used for describing the affinity and the most preferred sequence, because along each cycle there is a selection of the sequences that most prefer to bind a specific transcription factor. And then there are data from protein-binding microarrays.
You can see a picture of the DNA probes, double-stranded DNA fixed to a microarray slide, and the tagged transcription factor is put into solution, so it will bind only to the most preferred sequences, which are then detected by fluorescence. As you can see just from these two pictures, without going into more detail, in HT-SELEX we will have a big database, almost every possible 12-mer or 10-mer, for each transcription factor, while for protein binding microarrays we will have very long sequences bound to the plate, around 36-mers, and there is a bit less control over how many transcription factors can bind to each sequence. For this reason, one of the methods that we had to implement in our workflow was the preprocessing of the data, to have a sort of database that was unified independently of the data that we were using as input. So, going very briefly: uPBM, universal PBM, as I said, are 36-mers, so what we needed to do was to align them, usually with position weight matrices, take the highest-affinity sequences and trim them, so as not to have long sequences where multiple transcription factors could bind. We also saw that for uPBM there was an over-representation, in some cases more than 99%, of low-affinity binding sites. So, to have a more balanced data set to train with, we had to remove some low-affinity binding sites, so we did an undersampling, mainly to remove the noise. Then we used as data gcPBM, which are protein binding microarrays on genomic sequences. These sequences are already centered, the transcription factor binding site is already in the middle; the only thing that we had to do for these data was to remove the possible multiple binding sites, because, having low variability in the sequences, sometimes we had sequences for which we didn't know if the high-affinity signal was given by a very good sequence or by the binding of multiple proteins.
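The undersampling step for the uPBM data can be sketched as follows: keep every high-affinity probe and only a random subset of the low-affinity ones, so the training set is better balanced. The threshold and ratio here are arbitrary illustrative choices, not the values used in the published workflow.

```python
# Sketch of undersampling for uPBM data: when >99% of probes are
# low-affinity, keep all high-affinity probes and a random subset of the
# low-affinity ones. Threshold and ratio are illustrative, not the
# published settings.
import random

def undersample(probes, threshold, ratio, seed=0):
    """probes: list of (sequence, signal) tuples. Keep every high-affinity
    probe and about `ratio` low-affinity probes per high-affinity one."""
    high = [p for p in probes if p[1] >= threshold]
    low = [p for p in probes if p[1] < threshold]
    rng = random.Random(seed)  # fixed seed for reproducibility
    keep = rng.sample(low, min(len(low), ratio * len(high)))
    return high + keep

data = [("AAAA", 0.9), ("CCCC", 0.1), ("GGGG", 0.2), ("TTTT", 0.05)]
balanced = undersample(data, threshold=0.5, ratio=2)
```

Any balancing scheme of this kind trades a little signal for less noise, which is what the speaker describes as "removing the noise".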
So we just removed the sequences that had repetitions of the pattern. For HT-SELEX we did a statistical quality assessment to remove data with low p-values, and we filtered cases using the correlation between cycles, when we saw cases in which, passing for example from cycle four to five or six, the counts of the most preferred sequence found in the sixth cycle were dropping in the fifth. So we had to check the data, and in the results I will call these cases HT-SELEX filtered: basically, we filtered out data that were not consistent, where there may have been some problems in the experiments. So I'm showing you the results now. We have results for the three different data sets that we used. For the gcPBM sequences, as I said, the variability of the sequences is quite low and the sequences are very well centered, so we had very good R-squared values between the experimental and the predicted values of affinity, and as you can see here, we could mostly match the experimental data. A bit more difficult was to use the uPBM data, which, as I said, are longer sequences with more variability among different families. Here you can see, for each family that we considered (in total we used around 60 cases for which we had protein binding microarray data on mouse), that we could get an R-squared of around 0.70. And to check whether our method was good compared to others that have already been published, we compared our R-squared, the last one, the affinity over all the cases that we used, with theirs; first of all, I want to say that we compared on the same data sets and on the same transcription factors, to be consistent. So we compared with methods that have already been developed based on shape, like ours, so machine learning using shape conformation, and we also checked against a deep learning method and a neural network method.
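The "HT-SELEX filtered" consistency check described above can be sketched as a monotonicity test on the counts of a top sequence across cycles: since each cycle should enrich the preferred sequences, a drop between consecutive cycles flags a possible experimental problem. The counts and names below are invented toy numbers, and the real filter may use correlations between cycles rather than this strict test.

```python
# Sketch of a cycle-consistency filter for HT-SELEX: counts of a top
# sequence should be enriched from cycle to cycle, so experiments where
# the count drops between consecutive cycles are flagged and removed.
# Counts here are invented toy numbers.

def consistent(cycle_counts):
    """True if counts never drop between consecutive SELEX cycles."""
    return all(b >= a for a, b in zip(cycle_counts, cycle_counts[1:]))

experiments = {
    "TF_ok":  [10, 40, 160, 500],   # monotonic enrichment -> keep
    "TF_odd": [10, 80, 30, 400],    # drop at cycle 3 -> filter out
}
kept = [name for name, counts in experiments.items() if consistent(counts)]
```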
And as you can see here, we saw an improvement using our workflow and our algorithm, so we were quite happy. We also checked the HT-SELEX data set. Here I show a few cases, and with the arrow I show you that we are taking the values from the fifth cycle of the HT-SELEX selection, because in the literature it has been shown that this is the one with the best variability between high-affinity and lower-affinity sequences. And here you can see that we have an average R-squared of around 0.7, so also for this data set we could get a good correlation between the predicted and experimental values. As I said before, we also added the filter for when the data were not consistent among the different cycles, and we saw that in this way we could remove some outliers that were probably caused not by our method but by the data we were using. We compared our HT-SELEX results, always using the same cases and the same proteins, with an algorithm that is shape-based, and we saw that we had an improvement using our method. As a last point, we also tried to tackle something that is quite difficult, one of the problems that has been tackled lately with deep learning and neural networks: to train on one data set, so HT-SELEX, and then to use this knowledge on another data set, like uPBM. It is a very difficult challenge because, as I said before, these are different experiments, different conditions, and the lengths of the sequences are very different. We did it for all the cases in which we had both HT-SELEX and uPBM data, and we saw that, compared to other methods, on average we had better results, although we will probably have to train on more cases. So, to conclude this part of the webinar, we saw that using our machine learning we were able to predict experimental transcription factor binding affinity, usually with a correlation of around 0.70, and with gcPBM, as I showed, a higher score.
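The R-squared values quoted throughout compare predicted and experimental affinities. For completeness, here is a self-contained version of that metric (the coefficient of determination) on toy numbers; this is the standard definition, not code from the workflow itself.

```python
# Coefficient of determination (R-squared) between experimental and
# predicted affinities, as used to score the models above.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy experimental vs predicted affinities.
exp = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
```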
Our method can be applied to different experimental techniques, and now we are also improving on transferring from one data set to the other using the same training. We also had one example of using a model trained on in vitro data and applying it to in vivo predictions. We saw good results, but we want to extend them to the whole genome, to be able to predict in vivo transcription factor binding sites, and for the few cases that we are studying right now we saw a good outcome, so we are quite confident in the results this workflow is giving us. So I would like to thank, first of all, Modesto Orozco for the three projects. I'm not going through each name, but here are the three boxes for the three parts of the webinar that we presented, so we want to thank the people from our lab and all the collaborators in these projects. So here are the people from the lab, and thank you for listening. Now it's time for questions, I will join Adam and Miłosz, and we are ready for any question. Thanks to all three of you, it was really interesting. Yes, and now we have the Q&A. If you have any questions, you can type them into the Q&A box. So now I'll just go through the questions we currently have. The first one is for Miłosz: did you take into account post-translational modifications in any of these structures? I think this was referring to the SARS-CoV-2 structures. That's a good question. I think for the small systems, which are concerned with the binding of the receptor binding domain to the receptor itself, we don't, because it's an isolated system where we just look at single domains. Whereas for the big system, in which we studied the impact of the single mutation, we had both a structure with the glycans and without the glycans, and that gave us confidence that the effect that we saw was observed in both.
It was an interesting experiment also to compare the glycosylated and non-glycosylated spike. So if that answers the question, I hope it does. We didn't explore other ones, except for, obviously, the disulfide bridges, which are a sort of pain in the neck for COVID researchers, because there are just so many in the spike and many of them are in regions that are disordered and do not appear in the crystal or cryo-EM structures. So if you're simulating a spike, always check your disulfide bridges, that's my suggestion. There's a second question for you, which is a two-part one. Is the obtained value of three kcal/mol the difference from the sum of all of the RaTG13 to SARS-CoV-2 mutations? Yes. And the second part is: to what mutation does the minus 9.5 kcal/mol correspond? So yes, for the first part, I don't remember if that was BLI or just good old, what was the name of the technique, I forgot, SPR, where they basically got the affinity of many variants, from many animals and many viruses, and that was the experimental value. And of course, this is the value that we reproduced with the sum of all mutations, where the mutations were accumulating; it's not that we just did all the single-point mutations, we accumulated the mutations one by one. And for the last question, to which mutation the minus 9.5 kcal/mol value corresponds: the problem is that we have an internal numbering that doesn't match the original numbering, but I think it was the 501, maybe Adam can confirm this because we were looking at this. The 501, which was kind of interesting; I think it was this one, I'm almost certain, I can double-check that later. But the 501 was found in many variants, going back to, I'm not sure if it was going back or just changing to other amino acids, but it was a very interesting position in general, also for the further evolution of SARS-CoV-2 in humans.
So there was actually a follow-up question to the earlier one, from YASA: did you see a difference between the system with and without the glycans? There's even a paper, I think it's still a preprint, that actually looks at the opening of the RBD with and without the glycans, and we kind of saw the same. I'm trying to remember now exactly what the direction of the effect was, but I think the glycans were stabilizing the distinct states: with the glycans, the open state is more stably open and the closed one is more stably closed, whereas if you get rid of the glycans there's more of a continuum. Of course, not a perfect continuum, but I think the barrier just goes down. Thanks. So, a question for Adam from Prasant: are there any BioBB workflows available for protein-ligand alchemical non-equilibrium simulations? Can you suggest the best way to find pairs of ligands from a group of ligands that can be taken for the non-equilibrium simulations? That's a very good question. The answer is: not yet. We know that there are a lot of people interested in that, especially pharmaceutical companies. One of our partners in BioExcel is the developer of the PMX software; they are helping us, and together with Nostrum Biodiscovery, again, we are trying to build this workflow of protein-ligand alchemical non-equilibrium simulations with the building blocks. It will be there soon, but it's not ready yet. I cannot comment on the second part, because I'm still trying to familiarize myself with this particular way of running the non-equilibrium simulations, but we are working on that. Well, thank you. So that's all the questions from everyone so far. I have a question for Federica about the feature set: you mentioned you use DNA flexibility parameters, and those come from MD simulations.
Is that something you calculate during the training protocol, when you take your DNA sequence, do you run the MD simulations, or do you look these up from a database of existing previous calculations? We already have a database for each tetramer. For each base pair in a tetrameric environment, and for each possible tetramer, from a project that was done in Modesto's group, we have the average values and the six-by-six matrices of all the possible force constants. So it's a database that is stored, is available, and can be used for these analyses. Cool, thanks. So I think that's everything, so I will hand back to Alessandra and she can introduce our next webinar. Yeah, I also have a follow-up question for Federica, and then I will introduce the next webinar; maybe in the meantime I'll share my screen. My question is: you chose specific parameters to describe your structure, and these are local parameters, geometrical local parameters. Did you also look at more global properties, or was that not possible since you are looking at tetramers? So, how is your database organized? I imagine the tetramers are inside a longer DNA. Yeah, so first of all, when we study this problem, when we want to study transcription factor binding sites, we have to take into consideration that usually the binding site occupied by a transcription factor is between 8 and 12 base pairs, so we cannot extend to bigger stretches; we have to go very local for this. We added all the electrostatics and the hydrogen bonding, because we had to study a very short stretch specifically. Then, obviously, we use 10-mers and we use the overlapping tetramers, so we study locally, but we can study a big stretch of DNA, as we are implementing on the full human genome. But we were not looking at global properties, like curvature, DNA curvature.
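The tetramer database just mentioned can be queried as sketched below. Since a tetramer and its reverse complement describe the same duplex, such a table only needs one canonical key per pair (136 unique tetramers out of 256); the canonicalisation shown is a standard way to index this kind of table, but the storage format and entry here are invented placeholders, not the actual database layout.

```python
# Sketch of querying a precomputed tetramer database. A tetramer and its
# reverse complement describe the same duplex, so the table needs only
# one canonical key per pair (136 unique keys out of 256 tetramers).
# The stored entry would hold the average helical parameters and the
# 6x6 stiffness matrix; here it is a placeholder string.

COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def canonical(tetramer):
    """Deterministic representative of {tetramer, reverse complement}."""
    rc = "".join(COMP[b] for b in reversed(tetramer))
    return min(tetramer, rc)

db = {"AAGT": "avg params + 6x6 stiffness for AAGT/ACTT"}

def lookup(tetramer):
    return db[canonical(tetramer)]
```

With this indexing, `lookup("ACTT")` and `lookup("AAGT")` return the same entry, which is what allows a single MD campaign over the unique tetramers to cover every sequence.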
You didn't consider those, I mean, I was just curious. No, not for now. We have other projects running, for example, involving the nucleosome, where we have to consider longer-range properties like the curvature, but for transcription factors, also because each transcription factor binds in a very different way, we went for very local properties. Yeah. Did you see some issues? I think sometimes transcription factors have a cooperative effect in their binding. Did you face this in your approach? How does this go together with your approach? We saw that there were some cases in which we had dimers binding, which obviously was reflected in higher fluorescence and higher affinity, but was not given by the binding of one transcription factor, but maybe by multiple transcription factors. So we looked for patterns in the sequences and we removed those cases. Okay, so you had a screening for those? Yes, we had the screening. There is one other thing that we wanted to develop: for example, we have one particular case in which we don't know why it has very high affinity for sequences that, for us, are not very good. What we think is that in solution there is some cooperative effect, but that is something that we can implement and add further on. For now we check only mono binding, so one transcription factor binding site, because there is a big variety of transcription factors and we wanted, for now, to have the most general approach, also with the preprocessing of the experimental data. That we will take into account for sure. Yeah, thank you very much. Thank you. I thank everybody, Federica, Miłosz and Adam, thank you for the nice talks. And I just want to mention the upcoming webinars. So next week, this is a period with a lot of webinars.
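The screening for potential dimer or multiple binding that Federica describes can be sketched as a motif-count filter: probes where the binding motif occurs more than once are dropped, because their high signal could come from multiple bound proteins rather than one good site. The motif and probes below are invented toy examples, and overlapping matches are counted too.

```python
# Sketch of screening out probes with multiple binding sites: count
# (possibly overlapping) occurrences of the motif and keep only probes
# with at most one occurrence. Motif and probes are toy examples.

def count_occurrences(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def screen(probes, motif):
    """Keep only probes where the motif occurs at most once."""
    return [s for s in probes if count_occurrences(s, motif) <= 1]

probes = ["TTGACGTCAT", "GACGTCGACGTC", "AAAAAAAAAA"]
single_site = screen(probes, "GACGTC")
```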
Next week we will have the student webinar, a special webinar where three students who participated in the remote BioExcel school will present their research. These students were selected for the best posters, so please come and listen to their research results. And then, the week after, on the 10th of May, we will have a webinar that is also connected to a BioExcel use case. In this case, Dmitry Morozov and Mirko Paulikat will speak about QM/MM simulations of fluorescent proteins and proton dynamics, covering GROMACS, CP2K and CPMD. And I thank again the speakers, and I thank the attendees for their attendance, and see you next time.