One is clustering, which, at least in my approach, doesn't mean recognizing structures of the entire molecule; it's more about recognizing atomic-scale patterns that are the building blocks of complex mesoscale systems. And then I will discuss mapping high-dimensional data into low dimensions. [inaudible] We have so many atomistic details that we don't know what to do with them. So the first solution to this problem is making pretty pictures. Everyone loves pretty pictures. I am quite fond of pretty pictures, because I think they are a very powerful way of conveying an intuition about a system in a way that is really memorable. But as pretty as they can be, they don't tell you which of these structures is stable, which is not, and how you can transition from this structure to that one. So this is just a very superficial form of representation of this high-dimensional data. So, what can we do when we have high-dimensional data? What we can do is abstract the problem: every configuration of a molecule, of a phase, of a system is a vector in a high-dimensional space. So if we take a polypeptide, say Alanine 12, instead of a fully comprehensive description we can describe each configuration by its 24 backbone dihedrals. So each configuration becomes a point in a 24-dimensional space, neglecting the bond lengths and the topology and keeping only the dihedrals. And then, what can you do with this set of configurations that are represented as high-dimensional vectors? The problem is that we cannot visualize anything beyond three dimensions, so we need some low-dimensional representation of the system that we want to map. And one possibility is just to partition it in disjoint regions. This is basically what clustering is about. You take your 24-dimensional space, and you say, OK, this region corresponds to state A, and this region corresponds to state B. And this basically gives you a list of the possible configurations that are homogeneous according to some criterion.
[inaudible] So, let's start from the first problem, the problem of clustering. How do we cluster a set of data points? How do we take a set of data points and subdivide it into groups? The most obvious way of doing so, and it's actually also the basis of many clustering algorithms, is to say: okay, I have some feature space that describes all my configurations; each one is, say, a 24-dimensional vector. Now I look at the probability distribution of the conformations in this 24-dimensional space, and I say that each of its modes, each region where configurations accumulate, characterizes a cluster. If you want to do something which is a little bit more sophisticated, you can also represent each cluster with a Gaussian; this is a Gaussian mixture model. And when you do this it's very nice, because you have a probabilistic approach: for each configuration you can ask what is the probability that it belongs to one cluster or to another. And if you think of how chemistry works, you recognize chemical patterns all the time without any deep theory: you don't need to know about electronic structure to recognize the fact that within organic chemistry you find, all the time, tetrahedral environments or planar environments or linear environments. Electronic structure theory came to explain these observations, but the patterns themselves can be read off the structures. Now, the nice thing about atomistics is that we have a lot of data about what are the possible arrangements of atoms. And since the simulations have become relatively reliable these days, we can really use these to build a definition of chemical patterns which is not based on our whim, on our saying, oh yeah, this is a hydrogen bond or this is an sp3 carbon, but which comes out automatically from the simulation. So how can we do this? The idea is really to follow through with this concept of constructing a probability distribution in feature space and subdividing it into its modes. So let's say that we have a system, and that we have a chemical environment that we can describe based on two distances. We run a long simulation, and we accumulate data points that say, for this environment, when you have a certain value of the two distances. And then we build a probability distribution. We can do this on a sparse grid using a kernel density estimation. Basically, the point is that we build a probability distribution of how often we observe every pair of distances. And then we can cluster these with an algorithm that is called minmax. The idea is that you start from each point of the distribution and you jump onto the nearest point with higher probability. So you always walk uphill, and you stop when the jump would bring you further away than a certain threshold.
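To make this uphill-jump procedure concrete, here is a minimal sketch in Python. This is my paraphrase of the algorithm as just described, not the reference implementation; the grid points, their kernel-density probabilities and the jump threshold are assumed to come from the steps above.

```python
import numpy as np

def uphill_cluster(points, prob, threshold):
    """From each (grid) point, jump to the nearest point with higher
    probability; stop when that jump would be longer than `threshold`.
    Stopping points are local maxima, and every point inherits the
    label of the maximum it climbs to."""
    n = len(points)
    parent = np.arange(n)                     # default: point is its own root
    for i in range(n):
        higher = np.where(prob > prob[i])[0]  # candidates uphill of i
        if len(higher) == 0:
            continue                          # global maximum of the density
        d = np.linalg.norm(points[higher] - points[i], axis=1)
        if d.min() <= threshold:
            parent[i] = higher[np.argmin(d)]  # keep climbing through there
    labels = np.empty(n, dtype=int)
    for i in range(n):                        # follow parents up to the root
        j = i
        while parent[j] != j:
            j = parent[j]
        labels[i] = j                         # root index labels the cluster
    return labels
```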
You see, if I didn't put a threshold, no matter from which point I start, I would end up on the point of maximum probability, because I always walk uphill, and I walk uphill onto the closest point. So if I didn't stop, I would always end up at the point with maximum probability in my distribution. But since I introduce this threshold, I do stop, and depending on the point where I start, I will stop in a different position. So each of these stopping points is a local maximum in the distribution, and they partition the configuration space into disjoint clusters. So once I have done this, I have recognized that in this system I have three different patterns, and in order to recognize them in new simulations, what I can do is build a Gaussian mixture model. So I associate a Gaussian to each of these clusters, and then I can use the posterior probabilities of this Gaussian mixture model. This is basically just: I have several Gaussians; I sit here and I take the value of this Gaussian divided by the sum of all the other Gaussians. So I get something that transitions smoothly from one, where I am part of this cluster, to zero, when I move on to another cluster. So how can we use these? Let's consider a hydrogen bond, for instance. [inaudible question] Yes. No, that's exactly the point. So generally in a Gaussian mixture model you need to guess the number of clusters, and also you don't know where you end up: you might start from three clusters, and since it's an iterative optimization, you might end up with two Gaussians associated with this cluster and one Gaussian here. You don't really know. So the idea is that here we do a two-step procedure. First we do this non-parametric clustering procedure: from each point on the grid you start climbing, and at the end it tells you how many local maxima you have. And afterwards we do this Gaussian mixture model, but it's not really a Gaussian mixture model in the proper sense: we fix the centers to be on the maximum of the distribution of each cluster, and we only optimize the covariance. So the number of clusters is given by the non-parametric clustering. Basically, we use the Gaussian mixture model because it gives you these probabilistic fingerprints, which are very nice, but we do a first non-parametric clustering step so that we don't have these kinds of problems of how many clusters do I have and how do I initialize the Gaussians.
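Written out, in my own notation (the slide itself isn't reproduced here), the posterior probability that a configuration $\mathbf{x}$ belongs to cluster $k$ is

$$ P(k \mid \mathbf{x}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_j \pi_j\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}, $$

where the means are fixed on the local maxima found by the non-parametric step, and the covariances (and, in a full mixture model, the weights $\pi_j$) are what get optimized. This ratio goes smoothly from one deep inside cluster $k$ to zero as you move into another cluster, which is exactly the smooth transition described above.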
So let's speak about the hydrogen bond. The hydrogen bond: there is really no way to overstate how important it is. Water, yes, we spoke a lot about water, but also polypeptides, DNA. It's probably the single most important interaction for life as we know it. And chemists have really gone crazy with the hydrogen bond. In 2005 there was a conference in Italy, the hydrogen bond task force 2005, and they discussed for three weeks how you define the hydrogen bond. And in fact, even from a computational point of view, there are dozens of ways of recognizing whether you have a hydrogen bond or not, based on different thresholds on distances, angles, energy estimations. But if you stick with a purely structural definition, all the information you need is contained in three distances, because a hydrogen bond is made up of a donor, an acceptor and a hydrogen: with three distances you fully determine the geometry. And the problem is how to determine the region, in the space defined by these distances, that you can take to be a hydrogen bond. For historical reasons, and also because it looks nicer, we actually use a combination of distances, but the algorithm works just the same if you work directly with the distances. So we have decided, basically, what is the abstract feature space, how we want to describe something that could be a hydrogen bond; but we still don't know what a hydrogen bond is, and we learn this by analyzing a simulation. We take snapshots of a simulation, say of ice, or liquid water, or whatever. And then I am really masochistic, and I forget the fact that I have water molecules, because we might have defects, we might have charge fluctuations; so I just have hydrogens and oxygens. And then I look at all the possible groups of oxygen-oxygen-hydrogen, and the task is finding out how to recognize that this one is a hydrogen bond and this one is not. So what I do is build a probability distribution for these groups of three distances, and it looks like this. Here you have two symmetric lobes because, since I am not saying what a water molecule is, if I sit on an oxygen I could regard it either as a donor or as an acceptor; so I have two symmetric lobes. And then you can cluster these, and you find that there is a cutoff on the distance here, but within this cutoff there are four clusters: one, two, three, and a very weak one behind here that you probably don't even see. But the point is that this cluster is precisely in the range of distances that everyone always recognizes as the hydrogen bond. And so this is the cluster that we can call hydrogen bond; and the nice thing is that with this procedure we now have a probabilistic way to assign whether any point in configuration space is a hydrogen bond or not. And you can compare it with... so, you can build a three-dimensional version: yesterday I discussed radial distribution functions, and this is the same thing but in 3D. Here I am sitting on a water molecule and asking myself where the oxygen atoms are relative to me. Since I am donating hydrogen bonds I will probably have one there and one there, and since I am also receiving hydrogen bonds I will also have two oxygens at my back. And then, basically, what you want is that this region in feature space is recognized as a hydrogen bond. Here you can see a cut along this plane; this green wedge would be a traditional definition of a hydrogen bond, and of course people have not been stupid for the past 20 years: this is a very reasonable definition, it encircles the region of maximum probability. But you see, this is the isocontour of probability 0.5 based on the automatic definition, and it follows very nicely the region in feature space that corresponds to what one would call a hydrogen bond.
But the nice thing here is that if you change the boundary conditions, if you change your model of water, if you change the thermodynamic state point or you switch on quantum effects, the distribution of distances changes; and if you trust your simulation, you should also change your definition of a hydrogen bond. And this happens automatically. For instance, this is the probability distribution that you get when you switch on quantum effects: you see that, compared to before, everything is fatter, because I have quantum fluctuations in all directions, as we have seen yesterday; and now you have your new, updated definition of a hydrogen bond. And then you can use this to do whatever you like: you can look at hydrogen bonding in different systems. If you look at a small polypeptide in water, and you look at the hydrogen bond which is formed between the water and the oxygen of the protein, it's not symmetric anymore, because this oxygen cannot donate; so the distribution is asymmetrical, and your definition of the hydrogen bond adapts naturally to this new situation. Or if you look at the nitrogen donating to water: different distribution, different definition. And it all falls out automatically from your simulation itself. You can also do this with dihedrals: if you work with polypeptides you very often work with dihedrals, and in a polypeptide you are not only interested in hydrogen bonds per se, but in how the hydrogen bonds play around with the covalent backbone of the polypeptide. I suppose that you have heard at least once in your life about alpha helices and beta sheets. Basically, when you have a polypeptide, it forms a three-dimensional structure, and every protein has a different three-dimensional structure, but you can recognize patterns that appear over and over. For instance you have alpha helices, in which you have a characteristic hydrogen bond pattern between residues that are, I think, four apart (but I'm not sure that it's four), and beta sheets, in which you have these flat layers of stretched-out strands of polypeptide. Now, in order to form these structures the backbone has to take a particular shape, so there is a correlation between the secondary structure and the dihedrals of the backbone. And this was recognized a long, long time ago; this is called the Ramachandran plot. Basically, someone looked at experimental crystallography data on which angles were found in polypeptides (you have these backbone angles) and noticed that only a few values seem to be allowed, and also that you seem to have some clusters that can be assigned to different secondary structure patterns. So what happens if we take the whole of the Protein Data Bank, look at the dihedrals, and feed them to this probabilistic analysis approach? Well, it recognizes precisely all of these patterns. And the nice aspect is that, since this analysis is automatic, you can extend it, for instance, to consider more dihedrals, to get a more precise description. For instance, here these two states are not very well separated from each other; but if you include a third dihedral in the description, these two states become perfectly well separated. So this is still work in progress, but the idea is that using this automatic approach you can really recognize even complex patterns in the condensed phase. So let's try now to move on, and move up. So far I was discussing how we can recognize automatically the building blocks of complex structures.
But what about the complex structure as a whole? If I'm not interested in recognizing the structure of a small segment of a polypeptide, but I want to tell you whether, as a whole, the polypeptide is folded or denatured, how do I do that? All of these clustering techniques lose steam very quickly when you move up to very large dimension, so we need to do something else. If you have worked on simulations of complex systems, you often end up with this concept of free energy and of collective variables. For those of you who have never heard of this before, the idea is that you want to describe your big protein with just one or two numbers that tell you whether it's folded, it's not folded, or it's in a pathologically misfolded state, for instance. Now, these variables have to be functions, somehow, of the large number of coordinates that compose the polypeptide, and finding the relation between the high-dimensional description and the low-dimensional description is far from trivial. Typically you go by a trial and error procedure: if you have been working on this for a long time, you have sort of an intuitive feeling of what will work and what will not, but it's sort of a black magic procedure. So it would be much nicer if you could do this in a fully automated manner, and I will show you how you can. This is not only good for gaining an understanding of your system: as we will see, it's also important because it allows you to reach longer time scales in your simulation, but I will get back to that later on. So let's start from that aspect, from the problem that you have a complex system that can exist in multiple metastable states. I mean, this molecule is not a complex system, but it can exist in a trans or a cis form, and the problem is that there is a free energy barrier for going from one state to the other. So if you are doing your simulation at a reasonable temperature, where both structures are stable, what will happen is that for a very long time your system will just be oscillating back and forth, doing nothing basically, and then after a long waiting time... [audience question] Yes, you mean whether it depends on the topology, on the fact that it's like a rotation of the dihedral? Well, it depends on the system you have. In general there can be some regularities, but depending on the details, even on, I don't know, the pH of your buffer solution, the height of the barrier can be different, and therefore the time it takes to make the transition will change; there are no general rules. The more stable the two states, very often the harder it is to go from one to the other, but there is really no general rule. The general problem is that you are in a situation in which you need to use a small time step to integrate your equations of motion, partially because you have to follow the motion of individual atoms, but also because once the transition starts it's typically very fast; and then you have a very long waiting time in which nothing interesting happens, and it basically becomes impossible to run your simulation. So how can we attack this problem? Well, the first concept that I need to introduce is that of free energy. Free energy, in the language of atomistic simulations, is not precisely (even though it is clearly related to it) a macroscopic thermodynamic quantity that tells you in which direction a macroscopic reaction will go, or how much work you can extract by pulling a piston.
What we really talk about when we speak of free energies is a quantity that is related to the probability of observing one state over another. Basically, given a variable that can tag the different states of the system, you build a histogram of how often you observe each state, and then you take minus kBT times the logarithm of it. If you want to write it in a formal manner: you have your Boltzmann probability distribution; you have a delta function that selects tagged values of this order parameter, which could be, let's say, the dihedral angle in this case, but could be, I don't know, the distance between these two atoms, or whatever; and you integrate over the whole of phase space. Ah, here there is a logarithm missing on the slide, sorry: it should be minus kBT times the log of the integral, apologies. Why do we introduce it in this way? Well, because it's basically the equivalent of a potential in collective variable space: just as in coordinate space the probability distribution is given by e to the minus beta V, in collective variable space the probability distribution is e to the minus beta F of s. Having introduced this, the next step is recognizing that what is hindering the transition between the two states is the presence of a free energy barrier: if I have a free energy barrier, it means that it's very unlikely that I end up on the top of the mountain, and so I will have to wait a long time before I actually get there. Very hand-wavingly: if we could remove the mountain, if we could make it flat, I could go back and forth very easily, and I could explore the transition very, very effectively. So the idea is that if you add an additional bias to your simulation, if you modify the physical potential of your system by adding a bias that depends on the collective variable, you are just shifting the free energy. And you can see that this goes through: if you add a bias that is a function of s, since the delta function selects just that value of s, the bias actually falls out of the integral, and so the free energy in the presence of the bias is just the free energy without the bias, plus the bias. This means that you can actually control and modify the probability with which you observe transitions, just by adding a bias potential that is minus the barrier. Fantastic, right? So we have a way to remove mountains, just by adding a bias potential.
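In formulas, this is my reconstruction of the slide with the missing logarithm restored: with $U(\mathbf{q})$ the physical potential and $\beta = 1/k_\mathrm{B}T$,

$$ F(s) = -k_\mathrm{B}T \ln \int \mathrm{d}\mathbf{q}\; e^{-\beta U(\mathbf{q})}\,\delta\big(s - s(\mathbf{q})\big), $$

and if you add a bias $V(s(\mathbf{q}))$ to the potential, the delta function pins the value of $s$, so the factor $e^{-\beta V(s)}$ comes out of the integral and

$$ F_V(s) = F(s) + V(s). $$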
And the beautiful thing is that you can still obtain any average in the absence of the bias, just by computing your average including a weighting factor. This factor is, you see, e to the plus beta times the bias: where the bias is high this factor is large, and where the bias is low this factor is small. So if we have made this region flat by adding a large bias here, then when we are in this region the factor will be very large, and we will give much more statistical weight to these points than to the points at the barrier, which have been made very likely by the addition of the bias potential, but which should be very unlikely in the real simulation, in the absence of the bias. So with this factor you can reconstruct the probability distribution that you would have had without the bias, even when you have the bias. Now, the problem is that, of course, if you knew the height of the mountain, your problem would already be solved. So the question is: how do you get the shape of the bias, and how do you add it? The idea here is to do this on the fly, to build the bias as the simulation goes, with the idea that if you stay long in a certain region of your configuration space, it means that you have a low free energy there, and you should be pushed away. So let's say that you have a system that can exist in these two states, with a barrier in between. If you run your simulation without any bias, it would be very boring: it would be going back and forth forever. So after a while you say, well, okay, I've had enough of this, let's add a repulsive bias: I add a Gaussian hill centered at the position where I am, and now that this bias potential is there, I am pushed away. And then I keep adding hills, adding hills, adding hills; in the end I will fill up the valley, reach the top of the mountain, and fall down on the other side. And if I keep going, I will basically fill up everything, and then I can go back and forth easily, without any barrier.
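Here is a toy sketch of this hill-deposition idea on a one-dimensional double well; the potential, the parameters and the overdamped dynamics are my own choices for illustration, not the setup used in the lecture.

```python
import numpy as np

# Overdamped Langevin dynamics on a double well, with a repulsive
# Gaussian hill deposited at the current position every `stride` steps.
rng = np.random.default_rng(0)
dUds = lambda s: 4.0 * s * (s**2 - 1.0)   # gradient of U(s) = (s^2 - 1)^2
kT, dt, gamma = 0.2, 1e-3, 1.0            # barrier height 1 >> kT: rare hops
w, sigma, stride = 0.05, 0.1, 500         # hill height, width, deposition pace
centers = []                              # positions where hills were added

def bias_force(s):
    # minus the derivative of the sum of deposited Gaussian hills
    if not centers:
        return 0.0
    c = np.asarray(centers)
    return np.sum(w * (s - c) / sigma**2 * np.exp(-(s - c)**2 / (2 * sigma**2)))

s = -1.0
for step in range(200_000):
    if step % stride == 0:
        centers.append(s)                 # the bias grows where we linger
    force = -dUds(s) + bias_force(s)
    s += force * dt / gamma + np.sqrt(2 * kT * dt / gamma) * rng.normal()

# When the valleys are filled, minus the sum of hills approximates F(s)
# up to a constant, and the reweighting factor exp(+beta*V(s)) recovers
# unbiased averages as described above.
```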
Now, let's go back. This was sort of an aside, to explain how you need the collective variables also to accelerate sampling, go over free energy barriers and observe activated events. But let's go back to the problem of dimensionality reduction, because in order to find this low-dimensional collective variable, you need to do a dimensionality reduction. You could do this manually, based on your intuition and your experience, but it would be really, really nice to do it automatically. So how can we do this? As I said, we can abstract the problem by saying that each configuration is associated with one vector in a high-dimensional space. Now please stretch your imagination and think that each of these is a point in a 24-dimensional space. What we can do is compute the similarity, or dissimilarity, between these configurations; this could be just the Euclidean distance between the vectors. Then we have a matrix that tells us how similar each of these reference states is to each of the others. And then we can play this game: you all come here, and we start moving around little cocktail flags on the desk, trying to make sure that the distances between the cocktail flags match the similarities computed in the high-dimensional space. If you succeed, you now have a two-dimensional representation of your problem that preserves the distance relations between the high-dimensional configurations. And then, once you have done this discrete map, once you have a set of reference states and a low-dimensional representation that preserves the important properties between them, you can take any other configuration and, using these points as a reference, find where you should represent it in 2D. This is a little bit like GPS navigation: your GPS works by measuring the distance to the satellites; the GPS system knows where the satellites are relative to the surface of the earth, and so it can find your position. In a similar way, once you have found the position of your satellite configurations on the desk, you can find the position of any other point. Now, clearly this is a very nice idea, but (I guess that many of you are pretty mathematically oriented) this only really works if your high-dimensional points lie in a low-dimensional linear subspace of the high-dimensional space; otherwise you can't match the Euclidean distances, at least not exactly. So this will have to be approximated, and we will see how you can go beyond this, how you can relax these constraints. But first let me introduce perhaps the simplest approach to performing this operation in practice. This is called principal component analysis, and the idea is that you have your data points, vectors in d dimensions. Now, in principal component analysis we are really assuming that our vectors are part of a Hilbert space, because you basically need a scalar product to do principal component analysis; but these can just be vectors in R^n. You define the matrix X where, basically, the components of the vectors are the columns and the vectors themselves are the rows. And then, leaving aside the details of how this is done, you build the covariance matrix of these vectors. The covariance matrix of the data set tells you what the important directions are: the important direction is the one along which you have the largest spread of your points. After you have found the covariance matrix, you can do a singular value decomposition (you basically find the largest eigenvalues of the covariance matrix), and you can select a linear subspace that contains as much as possible of the variance of your data set. In this case, for instance, I could say I want to project this down to 1D: after I do the singular value decomposition, I pick the eigenvector with the largest variance, I project all the points onto this direction, and I find a one-dimensional representation of this two-dimensional data set.
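As a concrete illustration of this recipe, a bare-bones principal component analysis can be written in a few lines of NumPy (my sketch; the random matrix at the end is just a placeholder for real configurations):

```python
import numpy as np

def pca(X, n_components=2):
    """Rows of X are configurations, columns are features
    (e.g. the 24 dihedrals); returns the linear projection
    onto the directions of largest variance."""
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / (len(Xc) - 1)        # covariance matrix
    evals, evecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(evals)[::-1]      # largest variance first
    return Xc @ evecs[:, order[:n_components]]

Y = pca(np.random.randn(150, 24), n_components=2)   # 2D representation
```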
So this is how, in practice, you can find a linear projection that contains as much as possible of the information on the variability within your data set. And then you can actually generalize this idea. I don't want to get too much into the details, but you can draw a connection between principal component analysis, which as you see is an operation based on the possibility of having a scalar product (you really must be working in a Euclidean space to be able to do principal component analysis), and multidimensional scaling, which instead is a problem in which you have a matrix of distances, you compute distances between your low-dimensional representations, and you try to minimize an objective function. So you see, what I am trying to do here is take the matrix of distances in high dimension and minimize the discrepancy between the low-dimensional distances and the high-dimensional distances. You can show that if this is a Euclidean distance, and you assume that the low-dimensional points are linear projections of the high-dimensional points, this problem is equivalent to principal component analysis. But the advantage is that you can formally do this set of operations even if the distances are not Euclidean distances, and this will become important later. So let me give you an example; this is actually a classic example from the pattern recognition community. Some botanist went around and measured some morphological characteristics of flowers of three species of iris: basically the length and the width of the petals (the colorful things) and of the sepals (the leaves that hold the flower together). Botanically, they know that there are three species of iris, so they have this four-dimensional data set of something like 150 flowers, and they want to see if they can recognize the presence of three species based on an analysis of this four-dimensional data set. If you run a principal component analysis, you do find that there are three groups of points that correspond to the three species. Here, I don't know if it's clear to you, but if you run principal component analysis you get this representation, and if you run multidimensional scaling you get this one, which is actually just the mirror image. The reason you get the mirror image is that in MDS you work based on distances, so your result is invariant under rotations; it doesn't make any difference if you get this or that. I ran the algorithm and I got this, but it's really the same picture. And you can do this iterative minimization, which doesn't change things much. So what you can see is that there is some clear clustering, but these two species are very close to each other. And if you go back and look at the flowers, well, this one is really weird and different, while these two really do look alike; I wouldn't be able to distinguish them personally. So it's actually working pretty well. The problem comes when you have something which is highly nonlinear. For instance, this is another typical data set used in this community; it's called a swiss roll because apparently (I live in Switzerland and I never stumbled upon a swiss cake with this shape, but apparently it exists) there is a typical swiss cake which is like a roll. So you have this data set, and you could imagine unfolding it and finding a two-dimensional representation in which you use this as one of the coordinates, and the position along the spiral as the second coordinate.
So in a certain sense the local dimensionality of this data set is two, but it is embedded in a three-dimensional space. And if you try to run principal component analysis, which is a linear projection, you basically just see the roll from the side; you are not able to unroll it, because this is just a linear projection. So what we need is something that works not only when the points are perhaps a little bit noisy but really lie in a two-dimensional plane; we need something that works when the points lie on a curved manifold, which is only locally linear. We need something that, when the points are locally part of a 2D subspace, can open this up in 2D. A lot of techniques have been developed for doing this, and they come mostly from the image recognition and handwriting recognition communities. One of these techniques is called isomap, and it really relies on the fact that in multidimensional scaling you can use any distance that you like; for this technique it's really important that you can generalize principal component analysis to work even with a distance which is not induced by a scalar product. So you have your swiss roll data set, and what you can try to do is measure distances between points not using Euclidean distances but using geodesic distances. What you do is connect each point to its neighbors; then, if I need to go from this point to that point, I don't take the Euclidean path but I follow the manifold from neighbor to neighbor, and I estimate the distance between the two points as the geodesic distance, the sum of the distances between the neighbors along the way. If you use this matrix to run your multidimensional scaling algorithm, everything opens up nicely. The main problem, as we will see, is that this is extremely sensitive to noise: if by chance you mistakenly detect a point on the next fold of the roll as one of your neighbors, you can take shortcuts along the geodesic path that completely mess up your distance matrix, and this really destroys the projection completely. So if you take the swiss roll data set and you run isomap: perfect. This is an automatic representation of the data obtained by using geodesic distances. However, if you start adding some noise to the points, at a certain point it fails catastrophically. You see, for moderate levels of noise nothing happens; then, as soon as there is just one point which is off, it has already sort of folded back an entire edge, and as soon as you have a couple of points that cut through, it's a total mess. This is a problem, because when you run atomistic simulations, and this is really a key point, thermal noise is something you can't live without: it is a physical feature of your system and you can't ignore it. You can't say, I'll filter out the noise a little bit, I'll smooth things up. Thermal fluctuations are physics, and you have to cope with them at the design stage of your algorithm.
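Here is a sketch of the geodesic-distance construction, using standard library routines (this is the textbook recipe, not the lecture's own code; the number of neighbors k is an assumption you would tune):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=8):
    """Connect each point to its k nearest neighbours, then measure
    distances along the resulting graph instead of through space."""
    g = kneighbors_graph(X, n_neighbors=k, mode="distance")
    g = g.maximum(g.T)                      # symmetrize: undirected graph
    return shortest_path(g, method="D", directed=False)

# Feed the returned matrix to multidimensional scaling to "unroll" the
# data. Note the failure mode discussed above: one noisy shortcut edge
# between folds corrupts a whole block of entries of this matrix.
```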
Since I want to give you a sort of general introduction, I also want to discuss another standard algorithm for non-linear dimensionality reduction. This is called locally linear embedding, and it really relies on the concept that your data set is locally linear and only globally curved. The way it works, again (and this is actually its weakness), is based on a notion of locality: you need to be able to define the environment of a point. For instance, you take this point and the few points that are closest to it, and then, in high dimension, you try to find the weights that, given the first nearest neighbors, represent the point as a weighted sum of its neighbors. So you see, this is the real point, and I try to write it as a weighted sum of the coordinates of its neighbors, and I optimize the weights until the weighted sum is as close as possible to the real point. This is stage one, and you do it for each point: you write each point as a weighted sum of its neighbors. Then you go into low dimension, you keep the weights fixed, and you optimize the positions in such a way that each point is, self-consistently, as close as possible to the weighted sum of its neighbors. So basically, in high dimension you determine the weights with the points known; in low dimension you determine the points with the weights known. And you can map this onto an eigenvalue problem, but the problem is that it's not an eigenvalue problem in which it's very clear which are the good eigenvalues. For instance, if you do locally linear embedding on the swiss roll and you take the two largest eigenvalues, it doesn't look very nice; however, if you take eigenvalues 3 and 4, it works very well. And if I increase the noise: total mess; but if I go to eigenvalues 7 and 8, it looks fine. So the problem with this algorithm is really that you don't know which are the good eigenvalues, and I find this a little bit displeasing; and it is also very sensitive to noise. So the problem really is that we had started out thinking we could just borrow ideas from a different community, but we have perhaps to go back and look at the problem we are actually dealing with, because we are not dealing with data sets that look like flat, mathematically precise surfaces; we deal with data sets in which noise is one of the main ingredients. Our data set really looks more like this: we have locally stable structures, thermal fluctuations in all directions, and then transitions between these locally stable structures. So how can we work out an algorithm that can be resistant to a system like this? First of all, let me convince you that this is really the case, in case you are not convinced. If you look at something like alanine dipeptide, how does the free energy landscape look? As a function of these two dihedrals, I have these locally stable states; I have these distributions, which are like two-dimensional Gaussians around the states; and then I have a network of transition states to go between them. And if I look at Alanine 12 and any pair of dihedrals, it always looks like that. So this is really the kind of data set that we have to work with, and we need a technique that is designed to cope with noise. This is actually particularly problematic, because when you go up in dimensionality, noise starts to behave in a very non-intuitive way. This is something that is sometimes referred to as the curse of dimensionality; in simple words, it tells you that distances become very uninformative when you go to high dimension. And you can see why by doing something very simple, something that you can do on a piece of paper. Consider a Gaussian distribution in one dimension: you take two random samples from this Gaussian, and you ask yourself, what is the probability distribution of their distance?
So let's say that I have a Gaussian with variance 1, and I take two points at random: what would be the most probable distance between them? Any guess? Yeah, it's 0. Fine. Now I go to 3D: now the probability of having a distance of 0 has itself become 0, and this is just because of the volume element when I integrate out everything but the distance. The problem is that as you go up in dimensionality, you get a distribution in which, even if two points are part of the same multidimensional Gaussian, I have no guarantee that they will be close to each other; actually, it becomes extremely unlikely that two points taken from the same Gaussian will be close to each other. So the notion of distance becomes very shaky when you go to high dimension, and using distances straight away to determine proximity is something very delicate in high dimensionality. And if you try to apply one of these distance-matching techniques, it is a disaster: say that you want to represent 24 dimensions in three dimensions and you want to reproduce the distances; you have your 24-dimensional Gaussian, you would probably like it to be represented by a three-dimensional Gaussian, and there is no way you will ever manage to match the distances between 24 and 3 dimensions. So distances are not really something we should work with.
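You can check this numerically in a couple of lines (my illustration; the dimensions echo the examples in the text):

```python
import numpy as np

# Typical distance between two samples of a unit Gaussian in d dimensions.
rng = np.random.default_rng(0)
for d in (1, 3, 24):
    a = rng.normal(size=(100_000, d))
    b = rng.normal(size=(100_000, d))
    print(d, np.median(np.linalg.norm(a - b, axis=1)))
    # medians ~0.95, ~2.2, ~6.8: the typical separation grows like
    # sqrt(2d), even though all points come from the same Gaussian
```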
But the idea is that we don't actually need to preserve the distances. It's not like we need a very precise satellite picture with all the distances nicely preserved, the green grass fields, the lake; and we don't even need something like this, where we sketch a little bit but still try to match the distances. This is all you need if you come visit me at EPFL: where you can get a beer, where you can get a coffee, where you can swim in the lake (which is not as nice as swimming in the sea, but it's decent), and how you go to town if you want to have a little bit more fun. So the point is: how would we encode on a computer this concept that we don't need to represent distances, but can just live with a hand-sketched map of the feature space of my protein? Let me give you a very simple example, just going from 2D to 1D. Let's say that these points represent a chemical reaction: here I have thermal fluctuations in the reactants, a transition, and thermal fluctuations in the products. Now, I think you will all agree that this is a fantastic one-dimensional representation of that data set. However, if I look at distances between points, I do have points that are pretty far away from each other in the reactant basin, and they end up projected on top of each other. So if I were trying to do something like principal component analysis or multidimensional scaling, there would be a tendency to blow up the basins, because there are very large distances within them, and this would risk losing the resolution that allows me to distinguish reactants and products in the low-dimensional space. So what can we do? Well, what we can do is relax our requirements, because we really don't care about preserving distances within the reactant or within the product basin. All that we care about is that if something is close in high dimension it should stay close in the projection, and if something is far, relative to a threshold distance, it should stay far. So the idea is that we move from matching distances to matching proximity. We define a characteristic threshold, which is associated with the scale of thermal fluctuations: everything that is closer than thermal fluctuations, I really don't care about, as long as it stays close together; and everything which is further apart, I really don't care about, as long as it stays far apart. What I want to avoid is having configurations that belong to different states projected on top of each other. So how do we do this in practice? Well, we start from this expression, which hopefully you recognize from the slides on multidimensional scaling: I am trying to optimize the low-dimensional points in such a way that the distances between the low-dimensional points match the distances between the high-dimensional points. That would be multidimensional scaling. What I do instead is a very simple thing: I apply a switching function to the distances, so that I define a characteristic threshold where I switch from close to far. And now you see that if I have two points in high dimension that end up here, and in the low-dimensional projection they end up here, there is no problem: after the application of the switching function, both distances are mapped to zero, and they contribute nothing to the objective function. So I don't waste effort, I don't waste resolution, trying to represent distances that I can never succeed in representing, and I focus on representing the important distances, which are basically the distances just outside the range of thermal fluctuations, because those are the distances that correspond to going from state A to state B.
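Schematically, the modified objective might look like this. This is a minimal sketch: the particular sigmoid and the parameters sigma, a, b are assumptions to be tuned to the scale of thermal fluctuations, and real implementations can use different switching parameters in high and low dimension.

```python
import numpy as np

def sigmoid(r, sigma, a, b):
    # switches smoothly from 0 ("close") to 1 ("far"); equals 1/2 at r = sigma
    return 1.0 - (1.0 + (2.0**(a / b) - 1.0) * (r / sigma)**a) ** (-b / a)

def stress(D_high, Y, sigma=1.0, a=2, b=6):
    """Objective to minimize over the low-dimensional coordinates Y:
    switched high-D distances should match switched low-D distances,
    so only the 'close versus far' information is retained."""
    i, j = np.triu_indices(len(Y), k=1)
    d_low = np.linalg.norm(Y[i] - Y[j], axis=1)
    return np.sum((sigmoid(D_high[i, j], sigma, a, b)
                   - sigmoid(d_low, sigma, a, b))**2)
```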
So how does this work in practice? Well, first let's look at how a free energy landscape looks when you use conventional collective variables. This is the Alanine 12 polypeptide, and this is a representation of its free energy: low values mean high probability, high values mean low probability. Each state of the polypeptide is represented with two order parameters, which are the gyration radius (basically the end-to-end distance) and the mean square displacement relative to an ideal alpha helix. So, unsurprisingly, the alpha helix configurations are close to zero, because they are very similar to an ideal helix, and, as often happens in polypeptides, particularly because these are all alanine residues, the helix is very stable. But then I have this blob, in which I have a lot of configurations that look fairly reasonable (there is a lot of hydrogen bonding in each of these states), but there is no clear feature associated with them. If I looked at this map, I would get the impression that going from here to here is barrierless. However, if I run a simulation of this system starting here, it can run for 100 nanoseconds before it goes away. So it is not barrierless, as the free energy surface seems to suggest; it is really that these intuitive order parameters don't have the resolution necessary to tell apart the thermal fluctuations around this structure from the thermal fluctuations around that structure. So how does it look when you instead let the simulation tell you what are the best order parameters to describe it? It looks like this. And you see, the physics is the same: the alpha helix is still the most stable structure by far, but now each of these reasonable-looking states is associated with a local minimum in the free energy surface. And this is important for a gazillion reasons. It's important because if you want to do accelerated sampling, as hopefully we have seen, this is a much, much better way: if you have grasped this idea of metadynamics, of building a repulsive bias on the fly, then if I am not able to tell that this is different from that, I will not be able to push myself out of one valley into the next, because I don't even see that there is a valley here. With these variables, instead, it will work much, much better. And then, knowing that this is a metastable state is important for so many other reasons. First of all, particularly for this kind of system, the force fields are very, very far from perfect. So if your simulation only tells you, oh, the alpha helix is stable, and the experiment tells you, well, hold on a second, I think I'm seeing a beta sheet, you can only say, well buddy, I see an alpha helix. But if you see that the beta sheet is actually locally stable, then you can start thinking: hold on a second, perhaps there is something wrong with my force field; what is it that is stabilizing the alpha helix so much relative to this beta sheet, which is stable, only not as much? So it can give you some kind of qualitative information that is really precious when your force field is less than ideal. It can also tell you: hey, look, this polypeptide can also be stable like this, and the presence of some solute or some mutation can actually stabilize it. And you know, a lot of degenerative diseases are due to the accumulation of amyloid fibrils, and having such a clear representation, showing that your protein, which is normal in this state, can actually also take that form, can be really useful.
But let me also look back at how this correlates very nicely with secondary structure. The only point I wanted to make is that I've built these maps using the dihedral angles, but there is nothing special about the dihedrals: you could use a different way to represent distances in high dimension, and you would get a picture which is different (I think the dihedrals work better), but still the same qualitative picture of stable basins joined together by a network of high-energy transition states. Now, there is one aspect that I haven't discussed yet. So far I have told you how to run sketch-map, and this sketch-map technique builds a network of reference states: given, say, a couple of hundred high-dimensional reference configurations for your polypeptide, it tells you where those configurations should go on the map. However, if you have 100 microseconds of simulation time for your polypeptide, chances are that you have many more configurations on your hands, and so the question is: how do you find the low-dimensional representation of a new point? People have been doing something like this: you take the low-dimensional representation of a new high-dimensional point to be a weighted sum; you look at how far you are from all of the reference states, you build a weight which is high when you are close to one of the reference states and low when you are far away, and you build your low-dimensional representation as a weighted sum of the low-dimensional representations of the reference states. This works sort of fine, but it's a convex embedding: it means that when you move on a line in high dimension, you will always be projected onto a line in low dimension, and the problem is what happens if you end up at a point which is far from everything you have seen so far. Let's say that these are your reference points in 3D, and you have this point: surely you would like it to be projected here, relative to the low-dimensional representation of these points, but by construction you will actually be projected into the middle of the data set. This is because, since your projection is a weighted average of the reference positions, you can only land within the convex hull of the low-dimensional representation. And this is problematic, because again, you don't want a technique that requires you to have explored all of configuration space before it can give you something predictive. You want to be able to sample just a subset of configuration space and still be able to say something meaningful about new configurations that you might encounter. So the way we solve this problem is by doing an out-of-sample embedding, a procedure for finding a new point that takes very, very literally the GPS analogy I was making before. What we do is build an objective function which is very similar to the one before; the only difference is that now I have the new high-dimensional configuration, I look at how far it is from all the references, and then I optimize the low-dimensional projection in such a way that I minimize the discrepancy between the distances of the new point to the high-dimensional references and the distances of the new projection to the low-dimensional references. So basically I define the projection of a new point as the argmin of this function.
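In code, this GPS fix for a new configuration might look like the following sketch (my paraphrase of the argmin just described; the same switching function as in the sketch-map objective could be applied to both sets of distances, which I omit here for clarity):

```python
import numpy as np
from scipy.optimize import minimize

def embed_new_point(x, X_ref, Y_ref):
    """Place a new high-D configuration x on the low-D map: find the
    low-D position y whose distances to the landmark projections Y_ref
    best reproduce the high-D distances of x to the landmarks X_ref."""
    d_high = np.linalg.norm(X_ref - x, axis=1)

    def discrepancy(y):
        d_low = np.linalg.norm(Y_ref - y, axis=1)
        return np.sum((d_high - d_low)**2)

    y0 = Y_ref[np.argmin(d_high)]       # start from the nearest landmark
    return minimize(discrepancy, y0).x  # argmin of the mismatch
```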
And this is really nice because, as I was anticipating, it gives you a way to deal with regions you have never explored. They say that in ancient maps, when they couldn't explore a region, they would just put a lion there: there are lions there, you don't really want to go there. But it's actually useful: if you want a predictive method to understand what are the possible configurations of your complex system, you also want to be able to know something about regions far away from what you have observed. Here I have an audience which comes from all over the world; but let's say that instead one of us has been a little bit lazy and has only gone around visiting Europe, and has made a nice map of Europe, but doesn't really have any idea about Greenland, Africa or India. If this person has a GPS that can only work in Europe, that's pretty boring. So the question is: can we still say something about where we are, even if we are far away from Europe? And we can. If we are in Greenland, we can still say that we are closer to Sweden and to London than to Sicily; and if we are in Africa, we can say that we are closer to Spain and to Italy than to Sweden. So we can find some indication of where we should be on the low-dimensional map even when we are venturing out of the known region. For instance, we applied these ideas to the representation of a cluster of Lennard-Jones atoms; this is actually the system for which you would build a map if you tried the hands-on tutorial. The idea here is that this is Lennard-Jones 38, which is used like crazy because it's a model of a solid-solid phase transition: it has a very low-enthalpy state, which is a truncated octahedron, and then it has another state, a bit higher in enthalpy but entropically stabilized, which is basically part of an icosahedron; and then you have melting. This is the free energy map built using this sketch-map technique, and the point is that it works across the phase diagram. Even though you have multiple transitions and multiple different states, you just build a map at one of the temperatures, and this can be descriptive also for states at temperatures very far from there. So here I am basically showing you the free energy of the system as a function of temperature: at low temperature this is clearly the most stable state; when you go up in temperature these states become entropically stabilized; and then you reach melting and everything goes into the molten state. And the point is that all of these free energies have been obtained using as a reference the same set of structures, which was taken from just one of the temperatures. Also an interesting thing: these states are very, very well known and discussed like crazy in the literature, but what you see here, which is quite interesting, is that this system also has sort of a liquid-liquid phase transition; there are two very different regions in the disordered state, and this is something we didn't tell sketch-map anything about, but it came out very naturally. You can actually do something really crazy, and I think this is the most compelling proof of how predictive sketch-map can be. You could say: if I change the temperature (it's a small system, so fluctuations at an intermediate temperature will give me configurations from the solid to the molten state), the map covers all of that. But you can actually use the map for something else: here I am taking the map that I built based on this small gas-phase cluster, and I am using it to project the structures for the bulk of the same system. So even if I am using as a reference something that has been trained on a system which is basically 100% surface, it still works.
First of all, all the solid configurations are projected very far away, which is as it should be; if we had used that weighted-sum approach, we would have ended up pretty much in the center of the map, which is clearly nonsense. Here I am projected far away from the map because it's a different part of configuration space, and there is an amazing resolving power: we can distinguish structures that come from a liquid from structures that come from a solid, and we can identify a vacancy, an interstitial, a Frenkel pair, even a partial dislocation. So even though I am using as the reference a map that has been trained on a cluster, I can also say something about the condensed phase. So, to summarize, and to leave a little bit of time for questions: if you move beyond a Hubbard model, and you start working with something in which you have continuous variables and thermal fluctuations and multiple metastable states, it becomes a problem not only to run an accurate simulation but also to convey intuitive information about what we are seeing in that simulation. And this is not a problem to underestimate, also because it will only get worse as more complex systems become amenable to modeling. So I think it's really crucial that at this time we also start thinking about how we can use the computer not only to run the simulation but also to help us understand it. And I think that from this point of view we have a lot to learn from computer science and the machine learning community, who have been doing these kinds of things, in completely different contexts, for quite a long time now. At the same time, it's important that we recognize that the systems we are working on are different from recognizing your face when you are at the airport, and so we have to encode this physical understanding into our algorithms at the very beginning of the design phase. So, I have presented you sort of an overview of the techniques that you have available for doing these things, mostly clustering and nonlinear dimensionality reduction. If you are interested, this is a very lengthy book about multidimensional scaling. Interestingly, to give you an idea of the context in which these techniques arise: the first application of multidimensional scaling was for market analysis. They were asking people whether they liked a product or not while changing different features of the product, and then they wanted to understand how to make sense of the responses when they had a multivariate parameter space of product features. So this is really a very general idea, and I think it can be applied very fruitfully to simulations. Then you can have a look at some of the standard techniques, and you can download some code; there are tutorials, you can just look up this website. And with this I'm done, so thank you for your attention.