I will try to give you an overview of the current state of structural bioinformatics, and specifically of what we are doing right now in my team. As most of you may know, we have just had a revolution in structural bioinformatics. It all started around CASP13 with a few models, including AlphaFold, and around CASP14 it became very clear that the problem of protein structure prediction is essentially solved. What you see on this slide is the accuracy of structure prediction with respect to the difficulty of particular targets, where difficulty reflects whether we have structural templates in the Protein Data Bank or not. You may see that even without templates, the accuracy of protein structure prediction became near-experimental. The accuracy is measured in somewhat unusual units, the global distance test (GDT), where a score of about 90 corresponds to roughly 1.5 Angstroms of RMSD from the crystallographic structure. These are the well-known examples from DeepMind of targets predicted with near-experimental accuracy. Why did it happen? Because of multiple developments, mostly in the data-science area. If you would like to learn more about the progress that led to these discoveries, we co-authored a review for the special CASP14 issue of the Proteins journal.

So this was during CASP14, which was already two years ago, and we have just had the CASP15 conference. You may see the differences on this slide from Dan Rigden, who was an evaluator of protein structure prediction during CASP15. Most of the top-performing methods used AlphaFold 2 in one way or another, so it is still AlphaFold 2 that dominates protein structure prediction. The top-ranked non-AlphaFold 2 method was the one from David Baker's team, here somewhere in the middle. Very interestingly, we had multiple models that did not use multiple sequence alignments at all: they were based on protein language models operating on a single sequence. Unfortunately, these methods, marked in violet, are not yet performing as well.

Now, a bit of history. Why do we care about protein structure prediction? Because structural databases contain experimental structures for only about 0.1% of the proteins discovered across life forms. The protein structure prediction problem became known and popular quite a while ago, and the first attempts to predict protein structure date back to the early 1970s. There were multiple attempts using molecular dynamics or Monte Carlo simulations, but we had to wait until 1998 to see the first truly long, nearly one-microsecond molecular dynamics simulation of a short peptide. It was conducted by Peter Kollman's group, and it took them two months on fully occupied Cray supercomputers. And it was not a true protein, it was a 36-residue peptide. Shortly after, many more folding molecular dynamics simulations followed, of all different types, for example GPU-accelerated simulations and also simulations on very specialized hardware. But very interestingly, the success in protein structure prediction did not come from molecular dynamics.

So let us look at the problem of protein structure prediction from a slightly different angle. We may notice that the shape of a protein is preserved much more than its sequence. For example, on this slide you may see that different but homologous sequences result in nearly the same three-dimensional structure.
This motivated researchers to use so-called homology modeling. We may model new proteins by comparing individual sequences, by comparing multiple sequence alignments, also called profiles, or by substituting three-dimensional fragments from homologous sequences. For example, if we look at a sequence profile a bit closer, we may notice some conserved regions, marked here, and also regions that co-evolve together. It turned out that the couplings between the co-evolving positions serve as a very good estimate of 3D protein contacts. The first attempts to estimate these couplings were made in the early 1990s, followed by routine estimation of protein contacts with the development of the so-called Potts, or direct-coupling analysis (DCA), models. In these models, the probability of observing a certain sequence depends on conservation coefficients and coupling coefficients, and these coefficients can be estimated from the sequence profiles. Later on, around CASP13, these models were additionally supervised with deep learning on the observed protein contacts, and in the latest generation of structure prediction models there is no statistical model at all: they use the multiple sequence alignment directly and estimate protein contacts purely with deep learning.
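To make this concrete, below is a minimal numpy sketch of such a Potts model. The field and coupling arrays are random placeholders; in real DCA they would be inferred from the multiple sequence alignment, for example by pseudo-likelihood maximization.

```python
import numpy as np

# Minimal sketch of the Potts model used in direct-coupling analysis (DCA):
#   P(s) ~ exp(-E(s)),  E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)
# The fields h (conservation) and couplings J are random placeholders here;
# in real DCA they are inferred from a multiple sequence alignment.

L, q = 50, 21                          # alignment length, alphabet size (20 aa + gap)
rng = np.random.default_rng(0)
h = rng.normal(size=(L, q))            # conservation (field) coefficients
J = rng.normal(size=(L, L, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce symmetry J_ij(a,b) = J_ji(b,a)

def potts_energy(seq):
    """Negative log-probability (up to a constant) of one aligned sequence."""
    e = -h[np.arange(L), seq].sum()
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

seq = rng.integers(0, q, size=L)       # a sequence encoded as integers 0..20
print(potts_energy(seq))

# Position pairs with large coupling norms are the predicted 3D contacts:
contact_score = np.linalg.norm(J.reshape(L, L, q * q), axis=2)
```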
So we believe that the classical protein structure prediction problem is essentially solved. But there are still many developments around protein representations. There are many more exciting developments in using protein language models, and also in applying protein language models and co-evolution models to other types of molecules, for example to RNAs and DNAs. We also believe that molecular dynamics will still be used in some areas, less so for protein structure prediction, but for the so-called protein folding problem, which also involves folding kinetics: the structure prediction problem is not the folding problem, and folding kinetics has yet to be predicted. There has been some progress, specifically around CASP15, on protein docking, with many attempts to predict complexes at a much higher accuracy. We still need to predict multiple protein states and protein flexibility. Then, we need to predict or simulate protein behavior under physiological conditions. And finally, there are many exciting developments around the inverse problem, protein design: if we know a certain shape that serves a certain protein function, we can design a specific sequence that will fold into this three-dimensional shape. These are the current exciting developments in this field.

So, I will start by introducing the algorithms developed in our team around protein interactions in solution. We have been interested in protein-protein and protein-ligand interactions for quite some years, and we have been specifically interested in symmetrical assemblies. We care about protein symmetries because most of the higher-order assemblies are symmetrical, and if we look into the Protein Data Bank, we will see that about 30% of the assemblies satisfy some type of point-group symmetry. So we introduced a specific formalism based on rigid-body operators that include rotation, translation, and symmetry operators.

We then designed an exhaustive search algorithm based on these operators: we project all of them into the Fourier space, and we demonstrated that the shape complementarity, or the docking score, can be formulated in the Fourier space with all the symmetry constraints satisfied implicitly. This allowed us to exhaustively sample all possible symmetries in just a few seconds on a personal laptop. We can also formulate the dual problem: if the protein assembly is known, for example if it was solved by X-ray crystallography or cryo-electron microscopy, we can look at the same problem as the estimation of the correspondence of one subunit with respect to another. Here, however, we know the rotation and translation operators in advance, and we can estimate the symmetry operator. This problem can be solved analytically with respect to the best matching, meaning that we estimate the best rotation and the best translation for each of the symmetry axes, and the estimation gives us a list of all possible symmetry axes with their accuracies.

We have also developed what I believe is the fastest calculation of small-angle scattering profiles, both for X-rays and for neutrons, which is based on a multipole expansion and scales linearly with the size of the system. In this example, I simulated the scattering profile of the largest particle I could find, which consists of 160,000 scattering centers, and the whole simulation of the scattering profile takes only about three seconds on a personal laptop. The speed and also the accuracy of the model allow us to interactively refine molecular shapes, and here you can see a couple of examples of the shape optimization, which was also a part of the CASP13 exercise, where our model performed the best.

Finally, the same rigid-body formalism can be combined with the small-angle scattering calculator. We can analytically apply rigid-body rotation and translation operators to one of the partners, for example to simulate a scattering profile from an assembly: we apply the rotation and translation operators to one of the partners, project them into the Fourier space, and apply some additional averaging to obtain the scattering intensities. This, for example, allows us to scan many thousands of docking solutions very rapidly: it takes about one minute to scan 100,000 docking solutions. And if we look at the performance on a standard docking benchmark, we see that docking against the scattering profile provides a much higher success rate compared to energy-based docking alone. Here is one particular example where the energy-based docking does not really find the right solution: if we plot the RMSD to the solution against the classical ZDOCK energy, there is no funnel, and if we zoom in, we see that there is no correlation between the predicted docking energies and the distance to the crystallographic solution. But if we do the same experiment and measure the accuracy of the scattering profile, that is, the chi-square to the experimentally measured profile, then we see that the correlation of the chi-square with the distance to the crystallographic solution is much, much better. So, specifically for some elongated shapes, SAXS or SANS profiles can be very good proxies for the right docking positions.
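For reference, here is a small numpy sketch of the textbook Debye formula and of a chi-square profile fit, assuming unit form factors and a toy random particle. This brute-force version scales quadratically with the number of scattering centers, whereas the multipole-based calculator described above obtains the same orientationally averaged intensity with linear scaling.

```python
import numpy as np

# Reference (brute-force) Debye formula for the orientationally averaged
# small-angle scattering intensity, with unit form factors:
#   I(q) = sum_ij f_i f_j sin(q r_ij) / (q r_ij)
# This scales as O(N^2); the calculator described in the talk reaches the
# same quantity with a multipole expansion that scales linearly in N.

def debye_intensity(coords, q_values):
    """coords: (N, 3) scattering-center positions; returns I(q) per q."""
    r = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    out = []
    for q in q_values:
        x = q * r
        sinc = np.where(x > 1e-12, np.sin(x) / np.maximum(x, 1e-12), 1.0)
        out.append(sinc.sum())
    return np.array(out)

def chi2(i_model, i_exp, sigma):
    """Chi-square between an optimally scaled model profile and experiment."""
    scale = np.sum(i_model * i_exp / sigma**2) / np.sum(i_model**2 / sigma**2)
    return np.mean(((scale * i_model - i_exp) / sigma) ** 2)

rng = np.random.default_rng(1)
coords = rng.normal(scale=20.0, size=(500, 3))  # toy particle, Angstrom-like units
q = np.linspace(0.01, 0.5, 50)                  # momentum-transfer grid
print(debye_intensity(coords, q)[:3])
```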
And exactly the same formalism, applying rigid-body transformations to the scattering particles, can be used to analytically compute protein crowding or aggregation effects. The formalism is absolutely the same, but we apply some additional averaging in the angular space, which allows us to analytically treat structure factors for complex particles and thus to analytically compute crowding effects. Here you may see the scattering profile for a single particle in black and for a particle with crowders in purple.

And finally, very recently, we have extended the same formalism to simulate the crowding particles altogether; this was our first attempt towards cell simulation. What did we do? We pre-computed the pairwise interactions of crowders with respect to each other using the fast Fourier transform, and we used these pre-computed energies in a Monte Carlo simulation (a toy sketch of this lookup-table approach follows below). This allowed us to perform very fast simulations on a personal computer, on time scales from milliseconds to seconds. And there are many more possible extensions of these simulations: for example, we are planning to also run Brownian dynamics simulations, and we can also do sequential or replica-exchange simulations. In a proof-of-concept experiment, we simulated several systems, the largest composed of five different proteins, and this was our simulation box. As you can watch, the simulations allowed us to measure, for example, the diffusion constant of one particle with respect to the concentration of the particles, and the simulated values fit the analytical predictions very well. We also measured the diffusion constants with respect to the volume of the particles; these simulations did not follow the analytics that well, and we are currently trying to extend the model with more physics. The method also allows us to run many more simulations at different temperatures and concentrations. Here you may see the evolution of the energy of the system with respect to the temperature at different concentrations, where you may clearly see the melting temperature; in the same way, we can measure the diffusion constant with respect to the temperature, and again you may see the melting regime. So this was a proof-of-concept study, and currently we are extending it by adding hydrodynamic interactions and also flexibility, because so far all our particles were rigid.
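Here is a toy sketch of that lookup-table idea, assuming a simple placeholder pair potential: the pair energies are tabulated once on a distance grid (in the actual implementation they are pre-computed with FFTs over rigid-body placements) and only looked up inside the Metropolis loop.

```python
import numpy as np

# Toy version of the crowding Monte Carlo: pairwise interaction energies
# are pre-computed once on a distance grid and only looked up during the
# run. The Lennard-Jones-like pair potential below is a simple placeholder.

rng = np.random.default_rng(2)
n, box, kT = 64, 200.0, 1.0                 # particles, box size, temperature
pos = rng.uniform(0.0, box, size=(n, 3))

r_grid = np.linspace(0.1, 60.0, 600)        # distance grid for the lookup table
e_grid = 4.0 * ((10.0 / r_grid) ** 12 - (10.0 / r_grid) ** 6)
e_grid[r_grid > 50.0] = 0.0                 # truncate the long-range tail

def energy_of(i, p):
    """Lookup-table energy of particle i placed at trial position p."""
    d = np.linalg.norm((pos - p + box / 2) % box - box / 2, axis=1)  # min image
    d[i] = np.inf                           # exclude self-interaction
    idx = np.clip(np.searchsorted(r_grid, d), 0, len(r_grid) - 1)
    return e_grid[idx].sum()

for _ in range(100 * n):                    # Metropolis sweeps
    i = rng.integers(n)
    trial = (pos[i] + rng.normal(scale=2.0, size=3)) % box
    dE = energy_of(i, trial) - energy_of(i, pos[i])
    if dE <= 0 or rng.random() < np.exp(-dE / kT):
        pos[i] = trial                      # accept the move
```

In a dynamic variant of such a simulation, diffusion constants follow from the mean-squared displacement of the particles over time.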
So let me switch to protein flexibility. Why flexibility? Because proteins are flexible, and many of them perform their function by interacting with other partners. Flexibility can be very efficiently simulated by adding just a few additional collective coordinates, or degrees of freedom, which can be computed using molecular dynamics or the so-called normal mode analysis. Why is it interesting? Because we can explain experimental observations. For example, my colleagues have measured by X-ray crystallography two structures of a membrane complex, one in the active and the other in the inactive form, and the normal mode analysis allowed us to simulate the rotation from the inactive to the active state using just a few collective coordinates. Here you can see that these collective motions involve a rotation in the membrane patch and a slight motion in the extracellular patch.

So what is normal mode analysis, and how is it linked to molecular dynamics? Both techniques aim to solve Newton's equations of motion. Normal mode analysis assumes that our system is not very far from the equilibrium position, so the potential energy can be expanded into a Taylor series. The first derivative vanishes, and we can very efficiently represent the potential energy using just the second-order term, which is also known as the Hessian matrix. If we plug this back in, it turns out that we can solve Newton's equations of motion analytically, and we obtain the transition from the so-called normal space into the Cartesian space as a linear transformation, given by the diagonalization of the Hessian matrix. This has links to a currently very popular technique in cryo-electron microscopy, which estimates latent variables for reconstructing conformational heterogeneity. If we look into cryoSPARC or cryoDRGN, they both estimate an encoder part and latent variables; the latent variables would correspond to the normal space, and the encoder would correspond to the linear transformation from the Cartesian to the normal variables.

Very frequently, normal mode analysis is not performed with a very sophisticated force field; in many practical applications we can use a much simpler model, which very often is the elastic network model. The idea is that we link all our atoms, or residues, or beads, by harmonic springs, and in such a model we have only a single adjustable parameter: the cutoff distance at which the particles are linked together (a compact sketch of such a model follows at the end of this part). Historically, there was a bottleneck: how to diagonalize large matrices, that is, how to arrive at the linear transformation from the initial space to the normal space. Many different approaches have been proposed, and one of the most successful was a coarse-graining technique where we split the initial system into a set of rigid blocks, such that each rigid block contains only six independent variables: three corresponding to rotations and another three corresponding to translations. Effectively, the diagonalization problem is simplified, because we multiply by the transformation from the initial space into the rigid-block space, which produces a much smaller matrix that is much faster to diagonalize. Our contribution to this field was to notice that diagonalization in this space also yields the angular and linear velocities applied to each of the rigid blocks, which allows us to effectively extend and extrapolate the motions to very large amplitudes. Here on top, you may see the classical normal mode analysis, where at large amplitudes the initial protein structure, well, actually a complex here, is very much distorted; if we apply the non-linear approach, as we do, then the motion can be extrapolated to very large amplitudes without the distortion. Also, our method is very CPU- and memory-efficient, and we can simulate very large structures at nearly interactive rates. For instance, we simulated the largest ribosome complex that I could find in the Protein Data Bank, and this simulation took me about 10 minutes on a personal laptop.
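As a minimal sketch of the elastic network model just mentioned, the following builds the anisotropic-network Hessian from placeholder C-alpha coordinates, diagonalizes it, and extrapolates linearly along the softest internal mode; a real application would read the coordinates from a PDB file.

```python
import numpy as np

# Anisotropic elastic network model (ANM): every pair of C-alpha atoms
# closer than a cutoff is linked by an identical harmonic spring; expanding
# the potential to second order gives a Hessian whose eigenvectors are the
# normal modes. The cutoff is the single adjustable parameter of the model.

def anm_modes(coords, cutoff=15.0, gamma=1.0):
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff**2:
                continue
            block = -gamma * np.outer(d, d) / r2   # 3x3 spring super-element
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    evals, evecs = np.linalg.eigh(hess)
    # For a connected network, the six zero eigenvalues are the rigid-body
    # rotations and translations; the internal motions start at index 6.
    return evals[6:], evecs[:, 6:]

rng = np.random.default_rng(3)
coords = rng.normal(scale=8.0, size=(100, 3))      # placeholder C-alpha trace
evals, modes = anm_modes(coords)
print(evals[:3])                                   # softest internal modes
deformed = coords + 5.0 * modes[:, 0].reshape(-1, 3)  # linear extrapolation
```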
So, how can we use this technique? There are many possible applications. For example, we can interactively flex the initial systems, and we can open binding pockets: this is an interactive simulation that allows the user to open a pocket of a protein. What do we do? We compute some of the lowest normal modes of the protein alone, then a possible binding pocket is specified, automatically or by the user. We then project the pre-computed normal modes onto the pocket, and the algorithm estimates the combination of normal modes that deforms the pocket the most. We can also computationally simulate large transitions, for example from the unbound to the bound state. This is a classical protein-protein benchmark, where the goal is to predict the bound state given only the knowledge of the unbound protein. Here we show the success rate with respect to the number of normal modes involved, and we may see that if we use only 10 normal modes, then we can simulate about 35% of the transitions. In some cases, just a single mode already provides half of the required transition: in the particular case shown on the next slide, a single mode can very accurately simulate the transition from the unbound (red) to the bound (blue) structure. On the right, you may see another example where, again, just a few normal modes can predict the transition.

Another application of the normal mode analysis is the estimation of rigid protein domains. What do we do? We compare the flexibility of different parts of the protein and then link together those parts that behave similarly. This allows us to automatically identify rigid protein domains, and we can also compute normal modes in the rigid-domain approximation. Here you can see the estimation of mode one, mode two, and mode three for a membrane protein in the approximation of seven rigid domains. I must say that by default we are using the elastic network model, which has specific limitations: for example, if the elastic network is too dense, we cannot easily simulate large structural transitions. We are trying to find workarounds for this problem. For example, we have recently introduced an automatic removal of some of the artificial links in the elastic network model. These links are detected from the contact map as unconnected patches, and if we remove them, the success in generating large motions is much higher compared to the original elastic network model.

And finally, there is a very nice analytical connection between molecular dynamics and the normal mode analysis, known as the quasi-harmonic approximation. For example, suppose we have a molecular dynamics trajectory of a protein; here it was a coiled-coil system simulated with a very sophisticated force field with explicit water for about one microsecond. Having this simulated trajectory, we can apply principal component analysis to it. What you see on the right are principal components one and three estimated from the MD trajectory; the MD trajectory took us about 60,000 CPU-hours. Here you may also see the normal mode analysis from a static structure using the elastic network model. Essentially, the motions are the same: you may see normal mode one and normal mode three, where the first mode is a bending motion and the third mode is a twisting motion. But this estimation is much, much faster: it took me less than one second, and the speedup is about 10^9. So one of the take-home messages is that if you are only interested in near-equilibrium dynamics, there is no need to run expensive molecular dynamics simulations: the normal mode analysis can do it much faster with the same accuracy.
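Below is a minimal sketch of the quasi-harmonic analysis just described: principal component analysis applied to trajectory coordinates. The trajectory here is a random placeholder, and the superposition of frames onto a reference that a real workflow performs first is omitted.

```python
import numpy as np

# Quasi-harmonic approximation: PCA of an MD trajectory recovers the
# dominant collective motions, which can then be compared with the
# elastic-network normal modes of a single static structure. (A real
# workflow would first superpose all frames onto a reference structure
# to remove rigid-body motion; that step is omitted here.)

def trajectory_pca(traj):
    """traj: (n_frames, n_atoms, 3) -> eigenvalues, eigenvectors (3N)."""
    flat = traj.reshape(len(traj), -1)
    flat = flat - flat.mean(axis=0)            # fluctuations about the mean
    cov = flat.T @ flat / (len(flat) - 1)      # 3N x 3N covariance matrix
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]            # sort by decreasing variance
    return evals[order], evecs[:, order]

def mode_overlap(pc, nm):
    """Cosine overlap between a principal component and a normal mode."""
    return abs(pc @ nm) / (np.linalg.norm(pc) * np.linalg.norm(nm))

rng = np.random.default_rng(4)
traj = rng.normal(size=(200, 100, 3))          # placeholder trajectory
evals, pcs = trajectory_pca(traj)
print(evals[:3] / evals.sum())                 # variance fraction per component

anm_like = rng.normal(size=300)                # stand-in for an ENM mode vector
print(mode_overlap(pcs[:, 0], anm_like))       # PC1 vs mode overlap
```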
Another message is that there is this nice connection between molecular dynamics and the normal mode analysis: if we know an ensemble of conformations, we can also compute the normal-mode subspace, which brings me to the last part of my talk. Thanks to a European ELIXIR project with many European partners, we aim to chart and analyze the structural heterogeneity in the Protein Data Bank. The project consists of three work packages. In the first package, the goal is to cluster similar structures by sequence and by structure. In the second package, we aim to characterize all types of flexibility. And in the final package, we aim to provide biophysical and functional characterizations of all the observed flexibilities.

We have already compiled a benchmark from what is observed in the Protein Data Bank, starting from very simple cases where all the protein chains can be clustered into just two stable clusters, such that the structural heterogeneity can be effectively described with a single motion: one principal component or one normal mode is sufficient to cover 95% of the observed structural variance. We have more difficult examples where three modes are sufficient, and even more sophisticated cases: for example, in the case of calmodulin, we have about 500 different protein chains in the PDB, and we need about 15 principal components or modes to explain 95% of the variance. And we may have even more difficult cases where many more components, many more modes, are needed to describe the structural heterogeneity. Finally, we are using this benchmark to learn the linear and non-linear motions observed in the PDB.

Another idea is that all the observed protein conformations can be mapped to linear or non-linear, but low-dimensional, manifolds, and having several states, we can compute linear or non-linear transitions from one state to another. Here is an example of the ITPS cluster: if we compute linear transitions from the start to the end, all the atoms, all the residues, move along straight lines; but if we estimate non-linear transitions, non-linear manifolds, the atoms move along curves. Why do we aim to estimate non-linear transitions? Because, as our experiments demonstrate, linear transitions are often not sufficient to interpolate or extrapolate the motions. For example, here the goal was to estimate the motion manifold based on two unconnected clusters and then to predict a structure that lies somewhere in between. If we use just the linear interpolation from one cluster to the other, then the linear PCA reconstruction is situated rather far from the intermediate structure observed in the Protein Data Bank. However, if we assume that the manifold is non-linear, and we use only the two distinct clusters for the estimation of this non-linear manifold, the estimation of the intermediate is much, much better. This is still work in progress, and at the next stage we will do the same experiments on the whole Protein Data Bank.
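As a sketch of the linear baseline in these experiments, the following interpolates between two placeholder end states along a straight line in Cartesian space and counts the principal components needed to cover 95% of the variance; the non-linear manifolds we actually learn replace this straight-line morph with a curve.

```python
import numpy as np

# Linear baseline for the transition experiments: morph between two cluster
# centroids along a straight line in Cartesian space, then count how many
# principal components are needed to explain 95% of the observed variance.
# The non-linear manifolds discussed in the talk replace the straight line
# with a learned curve; this sketch only shows the linear reference.

def linear_morph(state_a, state_b, n_steps=10):
    """Straight-line interpolation between two (n_atoms, 3) conformations."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.array([(1 - t) * state_a + t * state_b for t in ts])

def n_components_for_variance(conformations, target=0.95):
    """Number of principal components covering `target` of the variance."""
    flat = conformations.reshape(len(conformations), -1)
    flat = flat - flat.mean(axis=0)
    evals = np.linalg.eigvalsh(flat.T @ flat)[::-1]   # descending order
    frac = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(frac, target) + 1)

rng = np.random.default_rng(5)
a = rng.normal(size=(100, 3))                  # placeholder end-state conformation
b = a + rng.normal(scale=0.5, size=(100, 3))   # second, displaced end state
path = linear_morph(a, b, n_steps=20)
print(n_components_for_variance(path))         # a pure linear morph needs 1
```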
It is now time to draw the conclusions. The classical protein structure prediction problem is essentially solved, but this is far from the end of structural bioinformatics. There are many new challenges, which include the prediction and simulation of proteins in solution, their interactions with other proteins, DNAs, RNAs, and small molecules, and, similarly, simulations at the whole-cell level. And from our experiments on the structural heterogeneity in the Protein Data Bank, we may conclude that learning motions from experimental structures is definitely possible, but tricky, as it is very often prone to overfitting. At the next step, we aim to learn the motions on the whole PDB at once.

If you are interested in our software, we have many packages on our team's website, as well as our server that simulates scattering profiles from X-rays and from neutrons, and also allows you to do flexible fitting of initial models into the scattering profiles. And finally, I would like to shamelessly advertise the upcoming workshop on the interplay between AI and mathematical modeling in the post-structural-genomics era. We still have a few slots available, so if you are interested, please contact me. I would like to thank my collaborators and my students. And thank you very much for your attention.